DiffPool

Motivation

Current GNN methods are inherently flat and do not learn hierarchical representations of graphs - a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph.

Contribution

The authors propose DIFFPOOL, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DIFFPOOL learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer.

Idea

Given a graph $G=(V,E)$, denote its adjacency matrix by $A\in\{0,1\}^{|V|\times|V|}$ and its initial node feature matrix by $X\in\mathbb{R}^{|V|\times d}$. We want to learn a mapping $f:\mathcal{G}\rightarrow\mathcal{Y}$ from a set of labeled graphs $\mathcal{D}=\{(G_1,y_1),\cdots\}$, where $\mathcal{G}$ is the set of all possible graphs and $\mathcal{Y}$ is the set of all possible graph labels.

Starting with $A^{(0)}=A$ and $H^{(0)}=X$, we repeatedly apply the following update rules for $L$ rounds ($l=0,\dots,L-1$):

$$
\begin{aligned}
Z^{(l)} &= \text{GNN}_{l,\text{embed}}\big(A^{(l)}, H^{(l)}\big),\\
S^{(l)} &= \text{softmax}\Big(\text{GNN}_{l,\text{pool}}\big(A^{(l)}, H^{(l)}\big)\Big),\\
H^{(l+1)} &= {S^{(l)}}^{T} Z^{(l)},\\
A^{(l+1)} &= {S^{(l)}}^{T} A^{(l)} S^{(l)},
\end{aligned}
$$

where:

  • $A^{(l)}\in\mathbb{R}^{n_{l}\times n_{l}}$ is the coarsened adjacency matrix at layer $l$ (with $n_0=|V|$).

  • $S^{(l)}\in\mathbb{R}^{n_{l}\times n_{l+1}}$ is a soft assignment matrix assigning nodes at layer $l$ to clusters at layer $l+1$. Note that the $n_l$'s for $1\leq l<L$ are hyperparameters.

  • $\text{GNN}_{l,\text{embed}}$ and $\text{GNN}_{l,\text{pool}}$ are two separate GNNs.

  • $S^{(L-1)}$ is a vector of $1$'s, i.e., all nodes at the final layer are assigned to a single cluster.

  • $H^{(L)}$ is the final embedding vector corresponding to the entire graph and can be passed to a differentiable classifier.

The authors consider the last two equations as a DIFFPOOL layer: $(A^{(l)}, Z^{(l)})\mapsto (A^{(l+1)}, H^{(l+1)})$.

Each round of applying the $4$ equations can be viewed as one round of graph coarsening, repeated until $n_{L}=1$. The paper includes a nice visualization of $3$ rounds of coarsening.

Elementwise, the last two equations above correspond to

$$
\begin{aligned}
H_{ij}^{(l+1)} &= \sum_{k=1}^{n_{l}} s_{k,i}^{(l)}\, z_{k,j}^{(l)},\\
A_{ij}^{(l+1)} &= \sum_{p=1}^{n_{l}}\sum_{k=1}^{n_{l}} s_{k,i}^{(l)}\, a_{k,p}^{(l)}\, s_{p,j}^{(l)},
\end{aligned}
$$

where lowercase letters denote matrix entries.

Permutation Invariance: The authors present a nice result that any deep GNN model based on DIFFPOOL is permutation invariant, as long as the component GNNs are permutation invariant.

Proposition 1. Let $P\in\{0,1\}^{n\times n}$ be any permutation matrix; then $\text{DIFFPOOL}(A, Z)=\text{DIFFPOOL}(PAP^{T}, PZ)$ as long as $\text{GNN}(A, X)=\text{GNN}(PAP^{T}, PX)$.

I actually did not understand why this claim is true, and I was not persuaded by the authors' argument.
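
To make the mechanics concrete, below is a minimal PyTorch sketch of a single DIFFPOOL layer, plus a quick numerical check of the permutation-invariance property. The class name `SimpleDiffPool` and the dense GCN-style propagation $\text{relu}(AHW)$ standing in for both $\text{GNN}_{\text{embed}}$ and $\text{GNN}_{\text{pool}}$ are my own simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleDiffPool(nn.Module):
    """Sketch of one DIFFPOOL layer: (A^(l), H^(l)) -> (A^(l+1), H^(l+1), S^(l))."""

    def __init__(self, in_dim, embed_dim, n_clusters):
        super().__init__()
        self.w_embed = nn.Linear(in_dim, embed_dim, bias=False)   # weights of GNN_embed
        self.w_pool = nn.Linear(in_dim, n_clusters, bias=False)   # weights of GNN_pool

    def forward(self, A, H):
        Z = torch.relu(A @ self.w_embed(H))             # Z^(l) = GNN_embed(A^(l), H^(l))
        S = torch.softmax(A @ self.w_pool(H), dim=-1)   # S^(l) = softmax(GNN_pool(A^(l), H^(l)))
        H_next = S.t() @ Z                              # H^(l+1) = S^T Z
        A_next = S.t() @ A @ S                          # A^(l+1) = S^T A S
        return A_next, H_next, S

# Quick check of the permutation-invariance claim: permuting the input nodes
# should leave the coarsened adjacency and embeddings unchanged.
torch.manual_seed(0)
n, d = 6, 4
A = torch.randint(0, 2, (n, n)).float()
A = ((A + A.t()) > 0).float()                 # symmetric 0/1 adjacency
X = torch.randn(n, d)
P = torch.eye(n)[torch.randperm(n)]           # random permutation matrix

layer = SimpleDiffPool(in_dim=d, embed_dim=8, n_clusters=3)
A1, H1, _ = layer(A, X)
A2, H2, _ = layer(P @ A @ P.t(), P @ X)
print(torch.allclose(A1, A2, atol=1e-5), torch.allclose(H1, H2, atol=1e-5))  # True True
```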

Auxiliary Link Prediction Objective and Entropy Regularization

In practice, it can be difficult to train the pooling GNN using only the gradient signal from the graph classification task. To alleviate this issue, the authors train the pooling GNN with an auxiliary link prediction objective, which encodes the intuition that nearby nodes should be pooled together. In particular, at each layer $l$, the authors minimize $L_{\text{LP}}=\big\|A^{(l)}-S^{(l)}{S^{(l)}}^{T}\big\|_{F}$, where $\|\cdot\|_{F}$ denotes the Frobenius norm. Note that $\big(S^{(l)}{S^{(l)}}^{T}\big)_{ij}=\sum_{k=1}^{n_{l+1}}S_{ik}^{(l)}S_{jk}^{(l)}$ is the inner product between the cluster assignments of nodes $v_i$ and $v_j$. Intuitively, the larger $(A^{(l)})_{ij}$ is, the stronger the connectivity between the two nodes, and the more similar their cluster assignments should be.

The authors also believe that the cluster assignment for each node should generally be close to a one-hot vector, so that the membership of each cluster or subgraph is clearly defined. They therefore regularize the entropy of the cluster assignments by minimizing $L_{E}=\frac{1}{n}\sum_{i=1}^{n}H(S_i)$, where $H$ denotes the entropy function and $S_i$ is the $i$-th row of $S$.

During training, $L_{\text{LP}}$ and $L_{E}$ from each layer are added to the classification loss. This makes training slower to converge, but yields better performance and more interpretable cluster assignments.
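
As a rough sketch of these two side objectives (the function name is mine; `A` is the layer-$l$ adjacency and `S` the soft assignment produced by that layer), both terms can be computed as follows and added to the classification loss:

```python
import torch

def diffpool_aux_losses(A, S, eps=1e-12):
    """Sketch of the auxiliary objectives.

    A: (n_l, n_l) adjacency at layer l
    S: (n_l, n_{l+1}) soft cluster assignment (rows sum to 1)
    """
    # Link prediction loss: || A - S S^T ||_F
    link_loss = torch.norm(A - S @ S.t(), p="fro")

    # Entropy regularization: mean row-wise entropy of S,
    # pushing each node's assignment towards a one-hot vector.
    entropy_loss = -(S * torch.log(S + eps)).sum(dim=-1).mean()

    return link_loss, entropy_loss
```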

Experiments

The authors evaluate the benefits of DIFFPOOL against a number of state-of-the-art graph classification approaches, with the goal of answering the following questions:

  1. How does DIFFPOOL compare to other pooling methods proposed for GNNs (e.g., sort pooling or the SET2SET method)?

  2. How does DIFFPOOL combined with GNNs compare to the state of the art for the graph classification task, including both GNNs and kernel-based methods?

  3. Does DIFFPOOL compute meaningful and interpretable clusters on the input graphs?

Datasets

  • ENZYMES, PROTEINS, D&D, the social network dataset REDDIT-MULTI-12K, and the scientific collaboration dataset COLLAB.

10% of the graphs are sampled to use as a validation set, with the remaining graphs used to perform 10-fold cross-validation to evaluate model performance.

Model configurations

  • GNN model: "mean" variant of GRAPHSAGE, which gives better performance than GCN

  • Batch Normalization is applied after every layer of GRAPHSAGE.

  • All models are trained for 3000 epochs, with early stopping applied when the validation loss stops improving.

  • Two model variants are considered

    • DIFFPOOL-DET, a DIFFPOOL model where assignment matrices are generated using a deterministic graph clustering algorithm

    • DIFFPOOL-NOLP, a variant where the link prediction side objective is turned off

  • A DIFFPOOL layer is applied after every two GRAPHSAGE layers. A total of 2 DIFFPOOL layers are used for the larger datasets; the number of clusters is set to 25% of the number of nodes before applying DIFFPOOL.

  • For the small datasets, one DIFFPOOL layer is enough; 3 layers of graph convolutions are performed after the DIFFPOOL layer, and the number of clusters is set to 10% of the number of nodes before applying DIFFPOOL.

  • Adding an $\ell_2$ normalization to the node embeddings at each layer made training more stable.
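
A rough sketch of how this stacking could look in code, reusing the hypothetical `SimpleDiffPool` class from the Idea section and a dense GCN-style layer as a stand-in for GRAPHSAGE (again my own simplification, not the authors' configuration code):

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """Stand-in for a GRAPHSAGE layer: dense propagation relu(A @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A, H):
        return torch.relu(A @ self.lin(H))

n, d, hidden = 100, 16, 32
A = (torch.rand(n, n) < 0.05).float()
A = ((A + A.t()) > 0).float()                      # random symmetric adjacency
X = torch.randn(n, d)

# Two GNN layers, then a DIFFPOOL layer whose cluster count is 25% of the
# current number of nodes, repeated twice (the large-dataset configuration).
gnn1a, gnn1b = DenseGCNLayer(d, hidden), DenseGCNLayer(hidden, hidden)
pool1 = SimpleDiffPool(hidden, hidden, n_clusters=n // 4)
gnn2a, gnn2b = DenseGCNLayer(hidden, hidden), DenseGCNLayer(hidden, hidden)
pool2 = SimpleDiffPool(hidden, hidden, n_clusters=n // 16)

H = gnn1b(A, gnn1a(A, X))
A, H, S1 = pool1(A, H)
H = gnn2b(A, gnn2a(A, H))
A, H, S2 = pool2(A, H)
print(H.shape)   # coarsened graph representation: (n // 16, hidden)
```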

Performance Statistics

The other GNN-based models used for performance comparison include:

  • GRAPHSAGE with global mean-pooling

  • Edge-conditioned filters in CNNs for graphs (ECC) incorporates edge information into the GCN model and performs pooling using a graph coarsening algorithm

  • PATCHYSAN defines a receptive field for each node using a canonical node ordering, and applies convolutions on linear sequences of node embeddings

  • SET2SET replaces global mean-pooling by the aggregation used in SET2SET. GRAPHSAGE is used as the base GNN model.

  • SORTPOOL applies a GNN architecture and performs a single layer of soft pooling followed by 1D convolution on sorted node embeddings

Differentiable Pooling on STRUCTURE2VEC

The authors also tested applying DIFFPOOL on Structure2Vec (S2V), a state-of-the-art graph representation learning algorithm that combines a latent variable model with GNNs and uses global mean pooling.

Analysis of Cluster Assignment in DIFFPOOL

Dense vs. sparse subgraph structure. The authors observe that DIFFPOOL tends to collapse densely-connected subgraphs into clusters.

Assignment for nodes with similar representations. This is expected based on the update rule. The authors also observe that the auxiliary link prediction objective helps discourage nodes that are far apart from being pooled together.

Hierarchical cluster structure. Figure 2 of the paper shows a visualization of cluster assignments. A node's cluster membership is determined by taking the $\text{argmax}$ of its cluster assignment probabilities. The authors observe that even when learning cluster assignments based solely on the graph classification objective, DIFFPOOL can still capture the hierarchical community structure. They also observe a significant improvement in membership assignment quality with the link prediction auxiliary objective.
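
As a tiny illustration of how such a visualization is derived (using a toy assignment matrix of my own, not the paper's data), the hard memberships come from a per-node argmax:

```python
import torch

# Toy soft assignment matrix S: 6 nodes, 3 clusters (rows sum to 1).
torch.manual_seed(0)
S = torch.softmax(torch.randn(6, 3), dim=-1)

membership = S.argmax(dim=-1)               # hard cluster index per node
for c in range(S.shape[1]):
    nodes = torch.nonzero(membership == c).flatten().tolist()
    print(f"cluster {c}: nodes {nodes}")    # nodes that would share a color in the figure
```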

Sensitivity of the pre-defined maximum number of clusters. The assignment varies with the depth of the network and the maximum number of clusters $C$. A larger $C$ helps in modeling more complex structure, but it also brings noise and is less efficient. When there are too many clusters, it is possible that all nodes have low values for some of the clusters in the assignment matrix, i.e., those clusters end up essentially unused.

Link to paper