GAT

Summary

GAT incorporates multi-head attention into GCN, which clearly strengthens its model capacity. The authors claim that this is helpful for inductive learning, where we want to generalize a graph neural network to unseen nodes and graphs.

Attention

The key difference between GAT and GCN is how they aggregate information from the one-hop neighbourhood.

For GCN, a convolution operation produces the normalized sum of the node features of neighbours.

$$h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)}h^{(l)}_j\right),$$

where $\mathcal{N}(i)$ is the union of $v_i$ and all its one-hop neighbours, $c_{ij}=\sqrt{|\mathcal{N}(i)|}\,\sqrt{|\mathcal{N}(j)|}$ is a normalization constant based on the graph structure, $\sigma$ is an activation function such as ReLU, and $W^{(l)}$ is a shared weight matrix for node-wise feature transformation.
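
For concreteness, below is a minimal dense-matrix sketch of this propagation rule in PyTorch (a sketch only, assuming an unweighted graph and ReLU as $\sigma$; the function name is illustrative, and real implementations use sparse operations):

```python
import torch

def gcn_layer(A, H, W):
    # A: (N, N) float adjacency matrix without self-loops, H: (N, F) node
    # features, W: (F, F') shared weight matrix. Dense math for clarity only.
    A_hat = A + torch.eye(A.size(0))              # so that N(i) includes v_i itself
    deg = A_hat.sum(dim=1)                        # |N(i)| for every node
    c = deg.sqrt().unsqueeze(1) * deg.sqrt().unsqueeze(0)  # c_ij = sqrt(|N(i)|) sqrt(|N(j)|)
    return torch.relu((A_hat / c) @ H @ W)        # sigma( sum_j (1/c_ij) W h_j )

A = torch.tensor([[0., 1.], [1., 0.]])            # toy 2-node graph
H, W = torch.randn(2, 5), torch.randn(5, 3)
out = gcn_layer(A, H, W)                          # shape (2, 3)
```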

GAT introduces the attention mechanism as a substitute for the statically normalized convolution operation, removing the dependence of the aggregation weights on the graph structure. It aggregates the neighbourhood features by assigning different weights to the features of neighbouring nodes, depending on the features of the current node:

$$\begin{aligned}
e_{ij}&=\mathrm{LeakyReLU}\left(\vec a^{T}\left[Wh_{i}^{(l)} \,\Vert\, Wh_{j}^{(l)}\right]\right),\\
\alpha_{ij}&=\frac{\exp(e_{ij})}{\sum_{k\in \mathcal{N}(i)}\exp(e_{ik})},\\
h_i^{(l+1)}&=\sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij} W^{(l)} h^{(l)}_j \right),
\end{aligned}$$

where $\Vert$ is the concatenation operation, $\sigma(\cdot)$ is some activation function, $h_{i}^{(l)}\in\mathbb{R}^{F}$, $W\in\mathbb{R}^{F'\times F}$, and $\vec a^{T}\in\mathbb{R}^{1\times 2F'}$. Both $W$ and $\vec a$ are learnable.

Note that employing the attention mechanism also allows GAT to deal with directed graphs.
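
As a rough illustration of these equations, here is a minimal single-head GAT layer in PyTorch operating on a dense adjacency matrix (a sketch under simplifying assumptions: the class name is illustrative, $\sigma$ is taken to be ELU, and attention scores are computed for all pairs and then masked to the neighbourhood, whereas efficient implementations only score existing edges):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT layer on a dense adjacency matrix (illustrative sketch)."""

    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.W = nn.Linear(in_feats, out_feats, bias=False)   # shared transform W
        self.a = nn.Linear(2 * out_feats, 1, bias=False)      # attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, A, h):
        # A: (N, N) adjacency with self-loops, h: (N, F)
        z = self.W(h)                                          # (N, F')
        N = z.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j)
        z_i = z.unsqueeze(1).expand(N, N, -1)
        z_j = z.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([z_i, z_j], dim=-1))).squeeze(-1)
        # softmax over the neighbourhood only: mask out non-edges
        e = e.masked_fill(A == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)                        # alpha_ij
        return F.elu(alpha @ z)                                # sigma( sum_j alpha_ij W h_j )
```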

Multi-head Attention

Analogous to multiple channels in a ConvNet, GAT introduces multi-head attention to enrich the model capacity. The authors claim that this is beneficial for stabilizing the learning process of self-attention. Basically, we can have multiple sets of parameters for computing $\alpha_{ij}$ and $W^{(l)}$ in $h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij} W^{(l)} h^{(l)}_j \right)$.

Formally, this is done by first performing the computation for each head and then aggregating the per-head results by

$$\text{concatenation:}\quad h^{(l+1)}_{i} =\Big\Vert_{k=1}^{K}\,\sigma\left(\sum_{j\in \mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)$$

or

$$\text{average:}\quad h_{i}^{(l+1)}=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right),$$

where $K$ is the number of heads. The authors suggest using concatenation for intermediate layers and averaging for the final layer.
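
A sketch of how the two merging strategies could be wired up, reusing the single-head `GATLayer` from the previous sketch (illustrative only; note that each head here already applies ELU, whereas the paper applies $\sigma$ after averaging on the final layer):

```python
import torch
import torch.nn as nn

class MultiHeadGATLayer(nn.Module):
    """K independent attention heads, merged by concatenation or averaging."""

    def __init__(self, in_feats, out_feats, num_heads, merge='cat'):
        super().__init__()
        # K independent heads, each with its own W^k and a^k
        self.heads = nn.ModuleList(
            [GATLayer(in_feats, out_feats) for _ in range(num_heads)])
        self.merge = merge

    def forward(self, A, h):
        head_outs = [head(A, h) for head in self.heads]
        if self.merge == 'cat':
            return torch.cat(head_outs, dim=1)     # intermediate layers: concatenate
        return torch.stack(head_outs).mean(dim=0)  # final layer: average
```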

Experiments

Datasets

Transductive Learning: They use 3 standard citation benchmark datasets: Cora, Citeseer, and Pubmed. The node features are elements of a bag-of-words representation of a document, and each node has a class label. Only 20 nodes per class are used for training, but the training algorithm has access to all of the nodes' feature vectors. The test set consists of 1000 nodes and the validation set of 500 nodes.

Inductive Learning: They make use of a protein-protein interaction (PPI) dataset that consists of graphs corresponding to different human tissues. For inductive learning, the testing graphs remain completely unobserved during training. The node features are composed of positional gene sets, motif gene sets, and immunological signatures. Each node has 121 labels drawn from gene ontology, and a node can possess several labels simultaneously.

A table summarizing the dataset statistics is given in the paper.

State-of-the-art Methods

Transductive Learning:

  • label propagation

  • semi-supervised embedding

  • manifold regularization

  • skip-gram based graph embeddings

  • iterative classification algorithm

  • Planetoid

  • GCN

  • ChebyNet

  • MoNet

  • MLP (no graph structure incorporated)

Inductive Learning:

  • GraphSAGE-GCN (extends graph convolution to inductive learning): $h_i^{(l+1)}=\sigma\left(\frac{1}{|\mathcal{N}(i)|}\sum_{j\in \mathcal{N}(i)} W^{(l)} h^{(l)}_j \right)$

  • GraphSAGE-mean (mean pooling)

  • GraphSAGE-LSTM

  • GraphSAGE-pool (max pooling)

  • MLP (no graph structure incorporated)

Experimental Setup

Transductive Learning:

  • Model: two-layer GAT

  • Hyperparameters: optimized based on Cora

  • Model flow:

    • 8 attention heads with $F'=8$ features each

    • An exponential linear unit (ELU)

    • Classification layer: a linear layer with $C$ output features corresponding to the $C$ classes, followed by a softmax activation

  • Regularization: $L_2$ with $\lambda=0.0005$

  • Dropout: $p=0.6$ applied to both layers' inputs as well as to the normalized attention coefficients (at each training iteration, each node is exposed to a stochastically sampled neighbourhood).

  • For Pubmed, 8 output attention heads are applied (rather than 1), with $\lambda=0.001$ for regularization
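
Putting the transductive setup together, the following is a hypothetical sketch of the two-layer model using PyTorch Geometric's `GATConv` (the notes do not say which framework was used, so the library choice, class name, and wiring are assumptions; `GATConv`'s `dropout` argument drops attention coefficients, and input dropout is applied explicitly):

```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv  # assumption: PyTorch Geometric is available

class TransductiveGAT(nn.Module):
    def __init__(self, in_feats, num_classes, hidden=8, heads=8, p_drop=0.6):
        super().__init__()
        # Layer 1: 8 heads with F' = 8 features each, outputs concatenated
        self.gat1 = GATConv(in_feats, hidden, heads=heads, dropout=p_drop)
        # Layer 2: one head producing C class scores (8 averaged heads for Pubmed)
        self.gat2 = GATConv(hidden * heads, num_classes, heads=1,
                            concat=False, dropout=p_drop)
        self.p_drop = p_drop

    def forward(self, x, edge_index):
        x = F.dropout(x, p=self.p_drop, training=self.training)  # input dropout
        x = F.elu(self.gat1(x, edge_index))                      # ELU nonlinearity
        x = F.dropout(x, p=self.p_drop, training=self.training)
        return self.gat2(x, edge_index)  # logits; softmax is folded into the loss
```

Under this mapping, the $L_2$ penalty with $\lambda=0.0005$ would correspond to `weight_decay=5e-4` in the optimizer.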

Inductive Learning:

  • Model: three-layer GAT

  • Model flow:

    • For the first two layers, 4 attention heads are used with $F'=256$ features each.

    • An exponential linear unit (ELU)

    • (Multi-label) classification layer: 6 attention heads computing 121 features each, which are averaged and followed by a logistic sigmoid activation

  • The dataset is large enough that no dropout or regularization is required.

  • Skip connections across the intermediate attention layer

  • Batch size of 2 during training
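
Similarly, a hypothetical sketch of the three-layer inductive model with the same assumed `GATConv` (the library choice and the exact skip-connection placement are assumptions based on the description above):

```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv  # assumption: PyTorch Geometric is available

class InductiveGAT(nn.Module):
    def __init__(self, in_feats, num_labels=121, hidden=256, heads=4, out_heads=6):
        super().__init__()
        self.gat1 = GATConv(in_feats, hidden, heads=heads)        # 4 x 256, concatenated
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads)  # 4 x 256, concatenated
        # Output layer: 6 heads of 121 features each, averaged (concat=False)
        self.gat3 = GATConv(hidden * heads, num_labels,
                            heads=out_heads, concat=False)

    def forward(self, x, edge_index):
        h1 = F.elu(self.gat1(x, edge_index))
        h2 = F.elu(self.gat2(h1, edge_index)) + h1  # skip connection across the middle layer
        return self.gat3(h2, edge_index)            # per-label logits for a sigmoid/BCE loss
```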

Both models use:

  • Glorot (Xavier) initialization

  • cross-entropy objective

  • Adam with an initial learning rate of 0.01 for Pubmed and 0.005 for all other datasets

  • Early stopping on both the cross-entropy loss and the accuracy (transductive) or micro-$F_1$ (inductive) score on the validation nodes, with a patience of 100 epochs
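
A hypothetical sketch of this shared training setup (`model`, `train_epoch`, and `evaluate` are placeholder helpers, not from the paper; the dual-criterion early stopping is one possible reading of the description above):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.005,            # 0.01 for Pubmed
                             weight_decay=5e-4)   # L2 penalty (transductive setup only)

best_loss, best_metric, patience, wait = float('inf'), 0.0, 100, 0
for epoch in range(100_000):                      # train until early stopping triggers
    train_epoch(model, optimizer)                 # one pass minimizing cross-entropy
    val_loss, val_metric = evaluate(model)        # accuracy (transductive) or micro-F1 (inductive)
    if val_loss < best_loss or val_metric > best_metric:
        best_loss = min(best_loss, val_loss)      # either criterion improving resets patience
        best_metric = max(best_metric, val_metric)
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break
```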

For the transductive tasks, they report the mean accuracy (with standard deviation) on the test nodes after 100 runs. For the Chebyshev filter-based approach, they provide the maximum reported performance for filters of orders $K=2$ and $K=3$. For a fair comparison, they further evaluate a GCN model that computes 64 hidden features, attempting both the ReLU and ELU activations and reporting the better result after 100 runs.

For the inductive task, they report the micro-averaged $F_1$ score on the nodes of the two unseen test graphs, averaged over 10 runs. When comparing against GraphSAGE, they retune its hyperparameters and denote that variant by $\text{GraphSAGE}^{\ast}$. They also try a GAT variant with constant attention weights (1 for each neighbour), denoted $\text{Const-GAT}$.

The authors also visualize with t-SNE the representations extracted by the first layer of a GAT pre-trained on Cora.