GAT
GAT incorporates multi-head attention into GCN, which strengthens the model's capacity. The authors claim that this is particularly helpful for inductive learning, where we want to generalize a graph neural network to unseen nodes and graphs.
The key difference between GAT and GCN is how information from the one-hop neighbourhood is aggregated.
For GCN, a graph convolution operation produces the normalized sum of the node features of the neighbours:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\Big)$$

where $\mathcal{N}(i)$ is the union of node $i$ and all its one-hop neighbours, $c_{ij}$ is a normalization constant based on the graph structure (e.g. the symmetric normalization $c_{ij} = \sqrt{|\mathcal{N}(i)|}\,\sqrt{|\mathcal{N}(j)|}$), $\sigma$ is an activation function such as ReLU, and $W^{(l)}$ is a shared weight matrix for node-wise feature transformation.
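To make the aggregation concrete, here is a minimal sketch of this normalized-sum update in PyTorch (my own illustration, not the authors' code), using a dense adjacency matrix and assuming the symmetric normalization mentioned above:

```python
import torch

def gcn_layer(h, adj, W, activation=torch.relu):
    """One GCN update: h_i' = act( sum_{j in N(i)} (1/c_ij) * W h_j ).

    h:   (N, F) node features
    adj: (N, N) dense 0/1 adjacency matrix, without self-loops
    W:   (F, F_out) shared weight matrix
    """
    N = adj.size(0)
    a_hat = adj + torch.eye(N)              # add self-loops so that N(i) includes i
    deg = a_hat.sum(dim=1)                  # |N(i)| for each node
    d_inv_sqrt = deg.pow(-0.5)
    norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)  # entries 1/c_ij
    return activation(norm @ h @ W)         # normalized sum, then shared transform and nonlinearity

# toy usage: 4 nodes on a path graph, 3 input features, 2 output features
adj = torch.tensor([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)
h = torch.randn(4, 3)
W = torch.randn(3, 2)
print(gcn_layer(h, adj, W).shape)  # torch.Size([4, 2])
```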
GAT introduces the attention mechanism as a substitute for the statically normalized convolution operation, removing the dependency of the weighting on the graph structure. It aggregates the neighbourhood features by assigning different weights to the features of neighbouring nodes, depending on the features of the current node.
Note that employing the attention mechanism also allows GAT to handle directed graphs.
Formally, the attention-based aggregation is computed independently for each of the $K$ attention heads, and the per-head results are then combined, either by concatenation

$$h_i^{(l+1)} = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j^{(l)}\Big)$$

or by averaging

$$h_i^{(l+1)} = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j^{(l)}\Big)$$

where $\alpha_{ij}^{k}$ are the normalized attention coefficients computed by the $k$-th head (defined in detail below) and $W^{k}$ is the corresponding weight matrix.
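As a small illustration of the two combination rules (a sketch of my own, assuming each head has already produced its activated output as a tensor of shape $(N, F')$), the two options differ only in how the $K$ tensors are merged:

```python
import torch

def combine_heads(head_outputs, mode="concat"):
    """Merge per-head GAT outputs.

    head_outputs: list of K tensors, each of shape (N, F')
    mode: "concat"  (intermediate layers) -> (N, K * F')
          "average" (final layer)         -> (N, F')
    """
    stacked = torch.stack(head_outputs, dim=0)          # (K, N, F')
    if mode == "concat":
        K, N, Fp = stacked.shape
        return stacked.permute(1, 0, 2).reshape(N, K * Fp)
    return stacked.mean(dim=0)

# toy usage: K = 3 heads, N = 5 nodes, F' = 4 features per head
heads = [torch.randn(5, 4) for _ in range(3)]
print(combine_heads(heads, "concat").shape)   # torch.Size([5, 12])
print(combine_heads(heads, "average").shape)  # torch.Size([5, 4])
```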
The baseline methods compared against are summarized below (the datasets themselves are described in more detail further down):
Transductive Learning:
label propagation
semi-supervised embedding
manifold regularization
skip-gram based graph embeddings
iterative classification algorithm
Planetoid
GCN
ChebyNet
MoNet
MLP (no graph structure incorporated)
Inductive Learning:
GraphSAGE-mean (mean pooling)
GraphSAGE-LSTM
GraphSAGE-pool (max pooling)
MLP (no graph structure incorporated)
Transductive Learning:
Model: two-layer GAT
Hyperparameters: optimized on Cora (and reused for Citeseer)
Model flow:
An attention layer
An exponential linear unit (ELU)
A classification output layer (details below)
Inductive Learning:
Model: three-layer GAT
Model flow:
Two attention layers, each followed by an exponential linear unit (ELU)
A multi-label classification output layer (details below)
The dataset is large enough that no dropout or regularization is required
Skip connections across the intermediate attention layer
Both models use:
Glorot (Xavier) initialization
cross-entropy objective
The authors also visualized the representations extracted by the first layer of a GAT model pre-trained on the Cora dataset using t-SNE.
In more detail, a single attention head at layer $l$ computes its output as follows:

$$z_i^{(l)} = W^{(l)} h_i^{(l)}$$
$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\Big(\vec{a}^{(l)\,T} \big(z_i^{(l)} \,\Vert\, z_j^{(l)}\big)\Big)$$
$$\alpha_{ij}^{(l)} = \frac{\exp\big(e_{ij}^{(l)}\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(e_{ik}^{(l)}\big)}$$
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} z_j^{(l)}\Big)$$

where $\Vert$ is the concatenation operation, $\sigma$ is some activation function, $h_i^{(l)} \in \mathbb{R}^{F}$ is the input feature of node $i$, $W^{(l)} \in \mathbb{R}^{F' \times F}$ is a shared weight matrix, and $\vec{a}^{(l)} \in \mathbb{R}^{2F'}$ is the attention vector. Both $W^{(l)}$ and $\vec{a}^{(l)}$ are learnable.
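A minimal sketch of these four steps (my own illustration on a dense adjacency mask, not the authors' implementation; the LeakyReLU negative slope of 0.2 follows the paper):

```python
import torch
import torch.nn.functional as F

def gat_head(h, adj, W, a, activation=F.elu):
    """One GAT attention head on a small dense graph.

    h:   (N, F_in) input node features
    adj: (N, N) adjacency with self-loops (1 where j is in N(i), else 0)
    W:   (F_in, F_out) shared linear transform
    a:   (2 * F_out,) attention vector
    """
    z = h @ W                                               # z_i = W h_i
    f_out = z.size(1)
    # e_ij = LeakyReLU(a^T [z_i || z_j]), split into the z_i and z_j contributions
    e = F.leaky_relu(z @ a[:f_out].unsqueeze(1)             # term from z_i, shape (N, 1)
                     + (z @ a[f_out:].unsqueeze(1)).T,      # term from z_j, shape (1, N)
                     negative_slope=0.2)
    e = e.masked_fill(adj == 0, float("-inf"))              # restrict attention to the neighbourhood
    alpha = torch.softmax(e, dim=1)                         # alpha_ij = softmax_j(e_ij)
    return activation(alpha @ z)                            # h_i' = sigma(sum_j alpha_ij z_j)

# toy usage: 4 nodes on a path graph, 3 -> 2 features
adj = torch.eye(4) + torch.tensor([[0., 1, 0, 0],
                                   [1, 0, 1, 0],
                                   [0, 1, 0, 1],
                                   [0, 0, 1, 0]])
out = gat_head(torch.randn(4, 3), adj, torch.randn(3, 2), torch.randn(4))
print(out.shape)  # torch.Size([4, 2])
```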
Analogous to multiple channels in a ConvNet, GAT uses multi-head attention to enrich the model capacity; the authors claim this is also beneficial for stabilizing the learning process of self-attention. Basically, each head $k$ has its own set of learnable parameters $W^{k}$ and $\vec{a}^{k}$ for computing $z_i$ and $\alpha_{ij}$. As shown above, the outputs of the $K$ heads are combined either by concatenation or by averaging; the authors suggest concatenation for intermediary layers and averaging for the final (prediction) layer.
Transductive Learning: They used standard citation benchmark datasets: Cora, Citeseer and Pubmed. The node features correspond to elements of a bag-of-words representation of a document, and each node has a class label. Only 20 nodes per class are used for training, but the training algorithm has access to all of the nodes' feature vectors. The test set consists of 1000 nodes and the validation set of 500 nodes.
Inductive Learning: They make use of a protein-protein interaction (PPI) dataset that consists of 24 graphs corresponding to different human tissues (20 graphs for training, 2 for validation and 2 for testing), so the testing graphs remain completely unobserved during training. The node features are composed of positional gene sets, motif gene sets and immunological signatures. There are 121 labels per node, taken from gene ontology, and a node can possess several labels simultaneously.
A further inductive baseline is GraphSAGE-GCN, which extends graph convolution to the inductive setting.

Transductive model details:
First layer: $K = 8$ attention heads computing $F' = 8$ features each
Classification layer: a single attention head computing $C$ output features (one per class), followed by a softmax activation
Regularization: L2 regularization with $\lambda = 0.0005$
Dropout: $p = 0.6$, applied to both layers' inputs as well as to the normalized attention coefficients (at each training iteration, each node is exposed to a stochastically sampled neighbourhood)
For Pubmed, $K = 8$ output attention heads are applied (rather than 1), with $\lambda = 0.001$ for L2 regularization

Inductive model details:
For the first two layers, $K = 4$ attention heads are used, each computing $F' = 256$ features
(Multi-label) classification layer: $K = 6$ attention heads computing 121 features each, which are averaged and followed by a logistic sigmoid activation
A batch size of 2 graphs is used during training

In addition, both models use:
Adam with an initial learning rate of 0.01 for Pubmed and 0.005 for all other datasets
Early stopping on both the cross-entropy loss and accuracy (transductive) or micro-F1 (inductive) score on the validation nodes, with a patience of 100 epochs
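Putting the shared training choices together, here is a rough sketch of a transductive training loop under my own assumptions (the `model(features)` call and the boolean masks are placeholders rather than the authors' code, and `weight_decay` stands in for the L2 term):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train(model, features, labels, train_mask, val_mask,
          lr=0.005, weight_decay=5e-4, patience=100, max_epochs=1000):
    """Sketch of the shared recipe: Glorot init, cross-entropy,
    Adam, and early stopping on validation loss and accuracy."""
    # Glorot (Xavier) initialization of every weight matrix
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_loss, best_acc, wait = float("inf"), 0.0, 0

    for epoch in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        logits = model(features)          # assumed signature; a real GAT forward also needs the graph
        F.cross_entropy(logits[train_mask], labels[train_mask]).backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            logits = model(features)
            val_loss = F.cross_entropy(logits[val_mask], labels[val_mask]).item()
            val_acc = (logits[val_mask].argmax(dim=1) == labels[val_mask]).float().mean().item()

        # early stopping with patience, tracking both validation metrics
        if val_loss < best_loss or val_acc > best_acc:
            best_loss, best_acc, wait = min(val_loss, best_loss), max(val_acc, best_acc), 0
        else:
            wait += 1
            if wait >= patience:
                break
    return model
```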
For the transductive tasks, they report the mean classification accuracy (with standard deviation) on the test nodes after 100 runs. For the Chebyshev filter-based approach, they provide the maximum reported performance for filters of orders $K = 2$ and $K = 3$. For a fair comparison, they further evaluate a GCN model that computes 64 hidden features, attempting both the ReLU and ELU activations and reporting the better result after 100 runs.
For the inductive task, they report the micro-averaged F1 score on the nodes of the two unseen test graphs, averaged over 10 runs. When comparing against GraphSAGE, they retune the hyperparameters and denote that variant by GraphSAGE*. They also tried a GAT variant with constant attention ($a(x, y) = 1$ for every pair of nodes), denoted Const-GAT.
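For reference, the two evaluation metrics can be computed as follows (a small sketch with hypothetical values, using scikit-learn rather than the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score

# transductive: mean accuracy with standard deviation over repeated runs
accuracies = np.array([0.81, 0.82, 0.83])      # hypothetical per-run test accuracies
print(f"accuracy: {accuracies.mean():.3f} +/- {accuracies.std():.3f}")

# inductive: micro-averaged F1 over multi-label predictions on the test graphs
y_true = np.array([[1, 0, 1], [0, 1, 1]])      # hypothetical multi-label targets
y_pred = np.array([[1, 0, 0], [0, 1, 1]])      # hypothetical binarized predictions
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
```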