DiffPool

Motivation

Current GNN methods are inherently flat and do not learn hierarchical representations of graphs - a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph.

Contribution

The authors propose DIFFPOOL, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DIFFPOOL learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer.

Idea

Given a graph $G=(V,E)$, denote its adjacency matrix by $A\in\{0,1\}^{|V|\times|V|}$ and its initial node feature matrix by $X\in\mathbb{R}^{|V|\times d}$. We want to learn a mapping $f:\mathcal{G}\rightarrow\mathcal{Y}$ from a set of labeled graphs $\mathcal{D}=\{(G_1,y_1),\cdots\}$, where $\mathcal{G}$ is the set of all possible graphs and $\mathcal{Y}$ is the set of all possible graph labels.

Starting with $A^{(0)}=A$ and $H^{(0)}=X$, we repeatedly apply the following update rules for $L$ rounds ($l=0,\dots,L-1$):

$$
\begin{aligned}
Z^{(l)} &= \text{GNN}_{l,\text{embed}}\big(A^{(l)}, H^{(l)}\big),\\
S^{(l)} &= \text{softmax}\Big(\text{GNN}_{l,\text{pool}}\big(A^{(l)}, H^{(l)}\big)\Big),\\
H^{(l+1)} &= {S^{(l)}}^{T} Z^{(l)},\\
A^{(l+1)} &= {S^{(l)}}^{T} A^{(l)} S^{(l)},
\end{aligned}
$$

where:

  • $A^{(l)}\in\mathbb{R}^{n_{l}\times n_{l}}$ is the coarsened adjacency matrix at layer $l$ (with $n_0=|V|$).

  • $S^{(l)}\in\mathbb{R}^{n_{l}\times n_{l+1}}$ is a soft assignment matrix assigning nodes at layer $l$ to clusters at layer $l+1$. Note that the $n_l$'s for $1\leq l<L$ are hyperparameters.

  • $\text{GNN}_{l,\text{embed}}$ and $\text{GNN}_{l,\text{pool}}$ are two separate GNNs.

  • $S^{(L-1)}$ is a vector of $1$'s, i.e., all nodes at the final layer are assigned to a single cluster.

  • $H^{(L)}$ is the final embedding vector corresponding to the entire graph and can be passed to a differentiable classifier.

The authors consider the last two equations as a DIFFPOOL layer: $(A^{(l)}, Z^{(l)})\mapsto (A^{(l+1)}, H^{(l+1)})$.

Each round of applying the $4$ equations can be viewed as one round of graph coarsening, repeated until $n_{L}=1$. The paper includes a nice visualization of $3$ rounds of coarsening.

Elementwise, the last two equations above correspond to

$$
\begin{aligned}
H_{ij}^{(l+1)} &= \sum_{k=1}^{n_{l}} s_{k,i}^{(l)}\, z_{k,j}^{(l)},\\
A_{ij}^{(l+1)} &= \sum_{p=1}^{n_{l}}\sum_{k=1}^{n_{l}} s_{k,i}^{(l)}\, a_{k,p}^{(l)}\, s_{p,j}^{(l)},
\end{aligned}
$$

where lowercase letters denote matrix entries.

Permutation Invariance: The authors present a nice result that any deep GNN model based on DIFFPOOL is permutation invariant, as long as the component GNNs are permutation invariant.

Proposition 1. Let $P\in\{0,1\}^{n\times n}$ be any permutation matrix; then $\text{DIFFPOOL}(A, Z)=\text{DIFFPOOL}(PAP^{T}, PZ)$ as long as $\text{GNN}(A, X)=\text{GNN}(PAP^{T}, PX)$.

I actually did not understand why this claim is true, and I was not persuaded by the authors' argument.
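
To make the mechanics concrete, below is a minimal PyTorch sketch of a single DIFFPOOL layer, plus a quick numerical check of the permutation-invariance property. The class name `SimpleDiffPool` and the dense GCN-style propagation $\text{relu}(AHW)$ standing in for both $\text{GNN}_{\text{embed}}$ and $\text{GNN}_{\text{pool}}$ are my own simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleDiffPool(nn.Module):
    """Sketch of one DIFFPOOL layer: (A^(l), H^(l)) -> (A^(l+1), H^(l+1), S^(l))."""

    def __init__(self, in_dim, embed_dim, n_clusters):
        super().__init__()
        self.w_embed = nn.Linear(in_dim, embed_dim, bias=False)   # weights of GNN_embed
        self.w_pool = nn.Linear(in_dim, n_clusters, bias=False)   # weights of GNN_pool

    def forward(self, A, H):
        Z = torch.relu(A @ self.w_embed(H))             # Z^(l) = GNN_embed(A^(l), H^(l))
        S = torch.softmax(A @ self.w_pool(H), dim=-1)   # S^(l) = softmax(GNN_pool(A^(l), H^(l)))
        H_next = S.t() @ Z                              # H^(l+1) = S^T Z
        A_next = S.t() @ A @ S                          # A^(l+1) = S^T A S
        return A_next, H_next, S

# Quick check of the permutation-invariance claim: permuting the input nodes
# should leave the coarsened adjacency and embeddings unchanged.
torch.manual_seed(0)
n, d = 6, 4
A = torch.randint(0, 2, (n, n)).float()
A = ((A + A.t()) > 0).float()                 # symmetric 0/1 adjacency
X = torch.randn(n, d)
P = torch.eye(n)[torch.randperm(n)]           # random permutation matrix

layer = SimpleDiffPool(in_dim=d, embed_dim=8, n_clusters=3)
A1, H1, _ = layer(A, X)
A2, H2, _ = layer(P @ A @ P.t(), P @ X)
print(torch.allclose(A1, A2, atol=1e-5), torch.allclose(H1, H2, atol=1e-5))  # True True
```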

Auxiliary Link Prediction Objective and Entropy Regularization

In practice, it can be difficult to train the pooling GNN using only the gradient signal from the graph classification task. To alleviate this issue, the authors train the pooling GNN with an auxiliary link prediction objective, which encodes the intuition that nearby nodes should be pooled together. In particular, at each layer $l$, the authors minimize $L_{\text{LP}}=\big\|A^{(l)}-S^{(l)}{S^{(l)}}^{T}\big\|_{F}$, where $\|\cdot\|_{F}$ denotes the Frobenius norm. Note that $\big(S^{(l)}{S^{(l)}}^{T}\big)_{ij}=\sum_{k=1}^{n_{l+1}}S_{ik}^{(l)}S_{jk}^{(l)}$ is the inner product between the cluster assignments of nodes $v_i$ and $v_j$. Intuitively, the larger $(A^{(l)})_{ij}$ is, the stronger the connectivity between the two nodes, and the more similar their cluster assignments should be.

The authors also believe that the cluster assignment for each node should generally be close to a one-hot vector, so that the membership of each cluster or subgraph is clearly defined. They therefore regularize the entropy of the cluster assignments by minimizing $L_{E}=\frac{1}{n}\sum_{i=1}^{n}H(S_i)$, where $H$ denotes the entropy function and $S_i$ is the $i$-th row of $S$.

During training, $L_{\text{LP}}$ and $L_{E}$ from each layer are added to the classification loss. This makes training slower to converge, but yields better performance and more interpretable cluster assignments.
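
As a rough sketch of these two side objectives (the function name is mine; `A` is the layer-$l$ adjacency and `S` the soft assignment produced by that layer), both terms can be computed as follows and added to the classification loss:

```python
import torch

def diffpool_aux_losses(A, S, eps=1e-12):
    """Sketch of the auxiliary objectives.

    A: (n_l, n_l) adjacency at layer l
    S: (n_l, n_{l+1}) soft cluster assignment (rows sum to 1)
    """
    # Link prediction loss: || A - S S^T ||_F
    link_loss = torch.norm(A - S @ S.t(), p="fro")

    # Entropy regularization: mean row-wise entropy of S,
    # pushing each node's assignment towards a one-hot vector.
    entropy_loss = -(S * torch.log(S + eps)).sum(dim=-1).mean()

    return link_loss, entropy_loss
```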

Experiments

The authors evaluate the benefits of DIFFPOOL against a number of state-of-the-art graph classification approaches, with the goal of answering the following questions:

  1. How does DIFFPOOL compare to other pooling methods proposed for GNNs (e.g., sort pooling or the SET2SET method)?

  2. How does DIFFPOOL combined with GNNs compare to the state of the art for the graph classification task, including both GNNs and kernel-based methods?

  3. Does DIFFPOOL compute meaningful and interpretable clusters on the input graphs?

Datasets

  • ENZYMES, PROTEINS, D&D, the social network dataset REDDIT-MULTI-12K, and the scientific collaboration dataset COLLAB.

10% of the graphs are sampled to use as a validation set, with the remaining graphs used to perform 10-fold cross-validation to evaluate model performance.

Model configurations

  • GNN model: "mean" variant of GRAPHSAGE, which gives better performance than GCN

  • Batch Normalization is applied after every layer of GRAPHSAGE.

  • All models are trained for 3000 epochs, with early stopping applied when the validation loss stops improving.

  • Two model variants are considered

    • DIFFPOOL-DET, a DIFFPOOL model where assignment matrices are generated using a deterministic graph clustering algorithm

    • DIFFPOOL-NOLP, a variant where the link prediction side objective is turned off

  • A DIFFPOOL layer is applied after every two GRAPHSAGE layers. A total of 2 DIFFPOOL layers are used for the larger datasets; the number of clusters is set to 25% of the number of nodes before applying DIFFPOOL.

  • For the small datasets, one DIFFPOOL layer is enough; 3 layers of graph convolutions are performed after the DIFFPOOL layer, and the number of clusters is set to 10% of the number of nodes before applying DIFFPOOL.

  • Adding an $\ell_2$ normalization to the node embeddings at each layer made training more stable.
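
A rough sketch of how this stacking could look in code, reusing the hypothetical `SimpleDiffPool` class from the Idea section and a dense GCN-style layer as a stand-in for GRAPHSAGE (again my own simplification, not the authors' configuration code):

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """Stand-in for a GRAPHSAGE layer: dense propagation relu(A @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A, H):
        return torch.relu(A @ self.lin(H))

n, d, hidden = 100, 16, 32
A = (torch.rand(n, n) < 0.05).float()
A = ((A + A.t()) > 0).float()                      # random symmetric adjacency
X = torch.randn(n, d)

# Two GNN layers, then a DIFFPOOL layer whose cluster count is 25% of the
# current number of nodes, repeated twice (the large-dataset configuration).
gnn1a, gnn1b = DenseGCNLayer(d, hidden), DenseGCNLayer(hidden, hidden)
pool1 = SimpleDiffPool(hidden, hidden, n_clusters=n // 4)
gnn2a, gnn2b = DenseGCNLayer(hidden, hidden), DenseGCNLayer(hidden, hidden)
pool2 = SimpleDiffPool(hidden, hidden, n_clusters=n // 16)

H = gnn1b(A, gnn1a(A, X))
A, H, S1 = pool1(A, H)
H = gnn2b(A, gnn2a(A, H))
A, H, S2 = pool2(A, H)
print(H.shape)   # coarsened graph representation: (n // 16, hidden)
```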

Performance Statistics

The other GNN-based models used for performance comparison include:

  • GRAPHSAGE with global mean-pooling

  • Edge-conditioned filters in CNNs for graphs (ECC) incorporates edge information into the GCN model and performs pooling using a graph coarsening algorithm

  • PATCHYSAN defines a receptive field for each node using a canonical node ordering, and applies convolutions on linear sequences of node embeddings

  • SET2SET replaces global mean-pooling by the aggregation used in SET2SET. GRAPHSAGE is used as the base GNN model.

  • SORTPOOL applies a GNN architecture and performs a single layer of soft pooling followed by 1D convolution on sorted node embeddings

Differentiable Pooling on STRUCTURE2VEC

The authors also tested applying DIFFPOOL on Structure2Vec (S2V), a state-of-the-art graph representation learning algorithm that combines a latent variable model with GNNs and uses global mean pooling.

Analysis of Cluster Assignment in DIFFPOOL

Dense vs. sparse subgraph structure. The authors observe that DIFFPOOL tends to collapse densely-connected subgraphs into clusters.

Assignment for nodes with similar representations. This is expected based on the update rule. The authors also observe that the auxiliary link prediction objective helps discourage nodes that are far apart from being pooled together.

Hierarchical cluster structure. Figure 2 of the paper shows a visualization of cluster assignments. A node's cluster membership is determined by taking the $\text{argmax}$ of its cluster assignment probabilities. The authors observe that even when learning cluster assignments based solely on the graph classification objective, DIFFPOOL can still capture the hierarchical community structure. They also observe a significant improvement in membership assignment quality with the link prediction auxiliary objective.
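
As a tiny illustration of how such a visualization is derived (using a toy assignment matrix of my own, not the paper's data), the hard memberships come from a per-node argmax:

```python
import torch

# Toy soft assignment matrix S: 6 nodes, 3 clusters (rows sum to 1).
torch.manual_seed(0)
S = torch.softmax(torch.randn(6, 3), dim=-1)

membership = S.argmax(dim=-1)               # hard cluster index per node
for c in range(S.shape[1]):
    nodes = torch.nonzero(membership == c).flatten().tolist()
    print(f"cluster {c}: nodes {nodes}")    # nodes that would share a color in the figure
```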

Sensitivity of the pre-defined maximum number of clusters. The assignment varies with the depth of the network and the maximum number of clusters $C$. A larger $C$ helps in modeling more complex structure, but it also brings noise and is less efficient. When there are too many clusters, it is possible that all nodes have low values for some of the clusters in the assignment matrix, i.e., those clusters end up essentially unused.

Link to paper