Pitfalls of Graph Neural Network Evaluation

Motivation

The authors perform a fair evaluation of 4 prominent GNN models (GCN, MoNet, GraphSAGE, and GAT) on node classification by:

  • using 100 random train/validation/test splits rather than a single fixed train/validation/test split.

  • using a standardized training and hyperparameter tuning procedure for all models (a sketch of this protocol follows below).
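A minimal sketch of this evaluation protocol, assuming the model construction, split sampling, and training routines are supplied as callables (`build_model`, `sample_split`, and `train_and_evaluate` are hypothetical placeholders, not functions from the paper's code):

```python
import numpy as np

def evaluate(build_model, sample_split, train_and_evaluate, data,
             n_splits=100, n_inits=20, seed=0):
    """Mean/std of test accuracy over random splits and random weight initializations."""
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_splits):
        split = sample_split(data, rng)                  # fresh random train/val/test split
        for _ in range(n_inits):
            model = build_model()                        # re-initialize model weights
            accs.append(train_and_evaluate(model, data, split))  # same training routine for every model
    return float(np.mean(accs)), float(np.std(accs))
```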

Experiment

The authors used 8 datasets:

  • PubMed

  • CiteSeer

  • CORA

  • CORA-Full

  • Coauthor CS: co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge, where nodes are authors, edges indicate coauthorship, node features represent paper keywords for each author's papers, and class labels indicate the most active fields of study for each author

  • Coauthor Physics: same as Coauthor CS

  • Amazon Computers: segments of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category

  • Amazon Photo: same as Amazon Computers

The authors treat all datasets as undirected graphs and only consider the largest connected component of each (a preprocessing sketch follows below). Detailed dataset statistics are given in the paper.
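A minimal sketch of this preprocessing step, assuming the graph is available as a networkx object (variable and function names are illustrative):

```python
import networkx as nx

def largest_connected_component(graph: nx.Graph) -> nx.Graph:
    """Treat the graph as undirected and keep only its largest connected component."""
    g = graph.to_undirected() if graph.is_directed() else graph
    nodes = max(nx.connected_components(g), key=len)  # node set of the largest component
    return g.subgraph(nodes).copy()
```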

Meanwhile, they keep the model architectures exactly as in the original papers/reference implementations, including:

  • the type and sequence of layers

  • choice of activation functions

  • placement of dropout

The remaining training choices (optimizer, parameter initialization, learning rate decay, maximum number of training epochs, early stopping criterion, patience, and validation frequency) are kept the same for all models.
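A sketch of what such a shared training routine could look like in PyTorch, with early stopping on validation accuracy; the loss/accuracy callables, patience, and epoch limit are illustrative assumptions, not the paper's exact values:

```python
import copy
import torch

def train(model, optimizer, compute_loss, compute_val_acc,
          max_epochs=1000, patience=50, val_every=1):
    """Train with early stopping; `compute_loss`/`compute_val_acc` are caller-supplied callables."""
    best_acc, best_state, waited = 0.0, copy.deepcopy(model.state_dict()), 0
    for epoch in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = compute_loss(model)          # e.g. cross-entropy on the training nodes
        loss.backward()
        optimizer.step()
        if epoch % val_every == 0:
            model.eval()
            with torch.no_grad():
                acc = compute_val_acc(model)
            if acc > best_acc:
                best_acc, best_state, waited = acc, copy.deepcopy(model.state_dict()), 0
            else:
                waited += 1
                if waited >= patience:      # early stopping criterion
                    break
    model.load_state_dict(best_state)       # restore the best validation checkpoint
    return model
```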

They also consider four baseline models:

  • Logistic regression

  • Multilayer perceptron

  • Label propagation

  • Normalized Laplacian label propagation

The former two ignore the graph structure, while the latter two use only the graph structure and ignore the node attributes.
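For intuition, here is a minimal dense-numpy sketch of label propagation with a symmetrically normalized adjacency matrix (one common "normalized Laplacian" variant; the exact formulation and hyperparameters used in the paper may differ):

```python
import numpy as np

def label_propagation(adj, labels, train_mask, alpha=0.9, n_iters=50):
    """adj: (N, N) adjacency; labels: (N,) class ids; train_mask: (N,) bool for labeled nodes."""
    n, c = adj.shape[0], int(labels.max()) + 1
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))           # degrees are > 0 on the largest connected component
    s = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    y = np.zeros((n, c))
    y[np.where(train_mask)[0], labels[train_mask]] = 1.0  # one-hot seed labels
    f = y.copy()
    for _ in range(n_iters):
        f = alpha * (s @ f) + (1 - alpha) * y              # propagate, then re-inject seed labels
    return f.argmax(axis=1)                                # predicted class per node
```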

See the paper for the full table of experimental results.

The authors performed an extensive grid search over the learning rate, the size of the hidden layer, the strength of the $L_2$ regularization, and the dropout probability. They restricted the search space to ensure that every model has at most the same given number of trainable parameters. For every model, they picked the hyperparameter configuration that achieved the best average accuracy on the Cora and CiteSeer datasets (averaged over 100 random splits and 20 random initializations per split). In all cases, they used 20 labeled nodes per class as the training set, 30 nodes per class as the validation set, and the rest as the test set.
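A sketch of how such per-class splits can be drawn (numpy; `labels` is an (N,) array of integer class ids, and the function name is illustrative):

```python
import numpy as np

def sample_split(labels, n_train_per_class=20, n_val_per_class=30, rng=None):
    """Return (train_idx, val_idx, test_idx) index arrays for one random split."""
    rng = rng if rng is not None else np.random.RandomState()
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        nodes = rng.permutation(np.where(labels == c)[0])   # shuffle the nodes of class c
        train_idx.extend(nodes[:n_train_per_class])
        val_idx.extend(nodes[n_train_per_class:n_train_per_class + n_val_per_class])
    train_idx, val_idx = np.array(train_idx), np.array(val_idx)
    test_idx = np.setdiff1d(np.arange(len(labels)), np.concatenate([train_idx, val_idx]))
    return train_idx, val_idx, test_idx
```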

In addition, the following choices are fixed:

  • choices as to where to apply $L_2$ regularization

  • the number of attention heads for GAT is fixed to 8

  • the number of Gaussian kernels for MoNet is fixed to 2

  • all models have 2 layers (input features → hidden layer → output layer); a minimal 2-layer sketch follows below
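As an illustration of this shared depth constraint, a minimal 2-layer GCN-style model in PyTorch (hidden size and dropout rate here are placeholders, since those values come from the grid search above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """input features -> hidden layer -> output layer."""
    def __init__(self, in_dim, hidden_dim, n_classes, dropout=0.5):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, n_classes)
        self.dropout = dropout

    def forward(self, x, adj_norm):
        # adj_norm: normalized adjacency matrix (dense torch tensor for simplicity)
        h = F.relu(adj_norm @ self.lin1(x))                       # layer 1: aggregate neighbors, then activate
        h = F.dropout(h, p=self.dropout, training=self.training)
        return adj_norm @ self.lin2(h)                            # layer 2: per-node class logits
```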

Link to paper
Link to code