GAT

Summary

GAT incorporates multi-head attention into GCN-style neighbourhood aggregation, which strengthens the model's capacity. The authors argue that this is especially helpful for inductive learning, where we want a graph neural network to generalize to unseen nodes and graphs.

Attention

The key difference between GAT and GCN lies in how information from the one-hop neighbourhood is aggregated.

For GCN, a graph convolution produces the normalized sum of the neighbours' node features:

$$h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)}h^{(l)}_j\right),$$

where $\mathcal{N}(i)$ is the set consisting of $v_i$ and all of its one-hop neighbours, $c_{ij}=\sqrt{|\mathcal{N}(i)|}\,\sqrt{|\mathcal{N}(j)|}$ is a normalization constant determined by the graph structure, $\sigma$ is an activation function such as ReLU, and $W^{(l)}$ is a shared weight matrix for node-wise feature transformation.
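To make the normalization concrete, here is a minimal NumPy sketch of this aggregation on a dense adjacency matrix; the function name `gcn_layer` and its argument conventions are illustrative, not taken from any particular library.

```python
import numpy as np

def gcn_layer(H, A, W, activation=lambda x: np.maximum(x, 0.0)):
    """One GCN layer.

    H: (N, F)  node features, A: (N, N) adjacency matrix,
    W: (F', F) shared weight matrix; default activation is ReLU.
    """
    A_hat = A + np.eye(A.shape[0])               # N(i) includes v_i itself (self-loop)
    deg = A_hat.sum(axis=1)                      # |N(i)| for every node
    norm = 1.0 / np.sqrt(np.outer(deg, deg))     # 1 / c_ij = 1 / (sqrt|N(i)| sqrt|N(j)|)
    return activation((A_hat * norm) @ H @ W.T)  # normalized sum of W h_j over j in N(i)
```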

GAT replaces the statically normalized convolution with an attention mechanism, removing the aggregation's dependence on the graph structure. It aggregates the neighbourhood features by assigning each neighbour a weight that depends on the features of both the neighbour and the current node:

$$\begin{align}
e_{ij}&=\text{LeakyReLU}\left(\vec a^{T}\left[Wh_{i}^{(l)} \,\|\, Wh_{j}^{(l)}\right]\right),\\
\alpha_{ij}&=\frac{\exp(e_{ij})}{\sum_{k\in \mathcal{N}(i)}\exp(e_{ik})},\\
h_i^{(l+1)}&=\sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij} W^{(l)} h^{(l)}_j\right),
\end{align}$$

where $\|$ denotes concatenation, $\sigma(\cdot)$ is an activation function, $h_{i}^{(l)}\in\mathbb{R}^{F}$, $W\in\mathbb{R}^{F'\times F}$, and $\vec a^{T}\in\mathbb{R}^{1\times 2F'}$. Both $W$ and $\vec a$ are learnable.

Note that, because the attention coefficients are computed from node features rather than from a symmetric degree normalization, the attention mechanism also allows GAT to handle directed graphs.
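The following NumPy sketch implements a single attention head over a dense adjacency matrix, following the three equations above; the function name `gat_head` and the `-inf` masking trick are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gat_head(H, A, W, a, negative_slope=0.2):
    """One GAT attention head (pre-activation output).

    H: (N, F)   input node features
    A: (N, N)   adjacency with self-loops, A[i, j] = 1 iff j is in N(i)
    W: (F', F)  shared linear transformation
    a: (2F',)   attention vector
    Returns the (N, F') aggregation sum_j alpha_ij * W h_j.
    """
    Z = H @ W.T                                        # W h_j for every node j
    Fp = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for all pairs, via broadcasting
    e = (Z @ a[:Fp])[:, None] + (Z @ a[Fp:])[None, :]  # (N, N)
    e = np.where(e > 0, e, negative_slope * e)         # LeakyReLU
    e = np.where(A > 0, e, -np.inf)                    # keep only j in N(i)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))   # softmax over the neighbourhood
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Z                                   # sum_j alpha_ij W h_j
```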

Multi-head Attention

Analogous to multiple channels in a ConvNet, GAT introduces multi-head attention to enrich the model capacity. The authors also report that it helps stabilize the learning process of self-attention. Essentially, we keep several sets of parameters for computing $\alpha_{ij}$ and $W^{(l)}$ in $h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij} W^{(l)} h^{(l)}_j\right)$.

Formally, each head performs its own computation, and the results of all heads are then aggregated by either

$$\text{concatenation}: \quad h^{(l+1)}_{i} =\|_{k=1}^{K}\,\sigma\left(\sum_{j\in \mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)$$

or

$$\text{average}: \quad h_{i}^{(l+1)}=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right),$$

where $K$ is the number of heads. The authors suggest using concatenation for intermediate layers and averaging for the final layer.
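Building on the hypothetical `gat_head` helper above, here is a sketch of the two aggregation strategies; `elu` and `multi_head_gat_layer` are illustrative names, and the ELU default simply mirrors the activation used in the paper's experiments.

```python
def elu(x):
    """Exponential linear unit, the nonlinearity used in the paper's GAT layers."""
    return np.where(x > 0, x, np.expm1(x))

def multi_head_gat_layer(H, A, Ws, As, mode="concat", activation=elu):
    """Run K attention heads and combine them.

    Ws, As: length-K lists of per-head parameters (W^k, a^k).
    mode="concat":  apply sigma per head, then concatenate (intermediate layers).
    mode="average": average the heads, then apply a single sigma (final layer).
    """
    heads = [gat_head(H, A, W, a) for W, a in zip(Ws, As)]
    if mode == "concat":
        return np.concatenate([activation(h) for h in heads], axis=1)
    return activation(np.mean(heads, axis=0))
```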

Experiments

Datasets

Transductive Learning: They use 3 standard citation benchmark datasets: Cora, Citeseer and Pubmed. The node features correspond to elements of a bag-of-words representation of a document, and each node has a class label. They allow 20 nodes per class to be used for training, but the training algorithm has access to all of the nodes' feature vectors. The test set consists of 1000 nodes and the validation set of 500 nodes.

Inductive Learning: They use a protein-protein interaction (PPI) dataset consisting of graphs that correspond to different human tissues. In this inductive setting, the test graphs remain completely unobserved during training. The node features are composed of positional gene sets, motif gene sets and immunological signatures. Each node has 121 labels, set from gene ontology, and a node can possess several labels simultaneously.

(Summary table of dataset statistics omitted; see Table 1 of the paper.)

State-of-the-art Methods

Transductive Learning:

  • label propagation

  • semi-supervised embedding

  • manifold regularization

  • skip-gram based graph embeddings

  • iterative classification algorithm

  • planetoid

  • GCN

  • ChebyNet

  • MoNet

  • MLP (no graph structure incorporated)

Inductive Learning:

  • GraphSAGE-GCN (extends graph convolution to the inductive setting; see the sketch after this list): $h_i^{(l+1)}=\sigma\left(\frac{1}{|\mathcal{N}(i)|}\sum_{j\in \mathcal{N}(i)} W^{(l)} h^{(l)}_j\right)$

  • GraphSAGE-mean (mean pooling)

  • GraphSAGE-LSTM

  • GraphSAGE-pool (max pooling)

  • MLP (no graph structure incorporated)
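For comparison with the GCN normalization shown earlier, here is a minimal sketch of the GraphSAGE-GCN aggregation referenced in the list above (reusing the illustrative `elu` helper; the function name is hypothetical):

```python
def graphsage_gcn_layer(H, A, W, activation=elu):
    """GraphSAGE-GCN aggregation: a plain neighbourhood mean (1 / |N(i)|)
    instead of GCN's symmetric 1 / (sqrt|N(i)| sqrt|N(j)|) normalization."""
    A_hat = A + np.eye(A.shape[0])                             # include the node itself
    mean_neigh = A_hat @ H / A_hat.sum(axis=1, keepdims=True)  # (1/|N(i)|) sum_j h_j
    return activation(mean_neigh @ W.T)                        # apply shared W, then sigma
```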

Experimental Setup

Transductive Learning:

  • Model: two-layer GAT

  • Hyperparameters: optimized based on Cora

  • Model flow:

    • 8 attention heads with $F'=8$ features for each head

    • An exponential linear unit (ELU)

    • Classification layer: a single attention head computing $C$ output features corresponding to the $C$ classes, followed by a softmax activation

  • Regularization: $L_2$ with $\lambda=0.0005$

  • Dropout: $p=0.6$ applied to both layers' inputs as well as to the normalized attention coefficients (at each training iteration, each node is exposed to a stochastically sampled neighbourhood).

  • For Pubmed, 8 output attention heads are applied (rather than 1), with $\lambda=0.001$ for regularization
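Putting the pieces together, here is a rough sketch of the two-layer transductive architecture described above, reusing the hypothetical `gat_head` and `multi_head_gat_layer` helpers (dropout, weight initialization, and the training loop are omitted):

```python
def softmax(x):
    """Row-wise softmax."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def transductive_gat_forward(X, A, W1s, a1s, W2, a2):
    """Forward pass of the two-layer citation-network GAT (sketch).

    Layer 1: 8 heads with F' = 8 features each, ELU, concatenated -> 64 features.
    Layer 2: a single attention head computing C class scores, then softmax.
    """
    H = multi_head_gat_layer(X, A, W1s, a1s, mode="concat", activation=elu)  # (N, 64)
    return softmax(gat_head(H, A, W2, a2))                                   # (N, C)
```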

Inductive Learning:

  • Model: three-layer GAT

  • Model flow:

    • For the first two layers, 4 attention heads are used, each computing $F'=256$ features.

    • An exponential linear unit (ELU)

    • (Multi-label) classification layer: 6 attention heads computing 121 features each, which are averaged and followed by a logistic sigmoid activation

  • The dataset is large enough that no dropout or regularization is required

  • A skip connection is used across the intermediate attention layer

  • Batch size of 2 during training
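Analogously, a rough sketch of the three-layer inductive architecture, again reusing the illustrative helpers from above (a hypothetical `sigmoid` is defined for the multi-label output; training details are omitted):

```python
def sigmoid(x):
    """Element-wise logistic sigmoid for multi-label outputs."""
    return 1.0 / (1.0 + np.exp(-x))

def inductive_gat_forward(X, A, p):
    """Forward pass of the three-layer PPI GAT (sketch); p holds per-layer parameter lists.

    Layers 1-2: 4 heads with F' = 256 each, ELU, concatenated -> 1024 features,
    with a skip connection across the intermediate attention layer.
    Layer 3: 6 heads computing 121 features each, averaged, logistic sigmoid.
    """
    H1 = multi_head_gat_layer(X, A, p["W1"], p["a1"], mode="concat", activation=elu)
    H2 = multi_head_gat_layer(H1, A, p["W2"], p["a2"], mode="concat", activation=elu)
    H2 = H2 + H1                                        # skip connection
    return multi_head_gat_layer(H2, A, p["W3"], p["a3"], mode="average", activation=sigmoid)
```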

Both models use:

  • Glorot (Xavier) initialization

  • cross-entropy objective

  • Adam with an initial learning rate of 0.01 for Pubmed and 0.005 for all other datasets

  • Early stopping on both the cross-entropy loss and the accuracy (transductive) or micro-$F_1$ (inductive) score on the validation nodes, with a patience of 100 epochs

For the transductive tasks, they report the mean accuracy (with standard deviation) on the test nodes after 100 runs. For the Chebyshev filter-based approach, they provide the maximum reported performance for filters of orders $K=2$ and $K=3$. For a fair comparison, they further evaluate a GCN model that computes 64 hidden features, attempting both the ReLU and ELU activation and reporting the better result after 100 runs.

For the inductive task, they report the micro-averaged $F_1$ score on the nodes of the two unseen test graphs, averaged over 10 runs. When comparing against GraphSAGE, they retune its hyperparameters and denote that variant by $\text{GraphSAGE}^{\ast}$. They also evaluate a GAT variant with constant attention coefficients (a weight of 1 for every neighbour), denoted $\text{Const-GAT}$.

The authors also visualize, with t-SNE, the representations extracted by the first layer of a GAT model pre-trained on Cora.
