Layerwise Adaptive Sampling


tags: GCN, sampling, NeurIPS 2018

Adaptive Sampling Towards Fast Graph Representation Learning

In node-wise neighbor sampling, the sampled neighborhood expands recursively from layer to layer, and since multiple vertices share common neighbors, the same nodes may be sampled repeatedly. We can avoid this over-expansion and accelerate the training of GCN by controlling the size of the sampled neighborhood in each layer.

The core of the method is to define an appropriate sampler for the layer-wise sampling. A common objective when designing the sampler is to minimize the resulting variance. Unfortunately, the variance-minimizing optimal sampler is uncomputable due to the inconsistency between the top-down sampling and the bottom-up propagation in the network. To tackle this issue, a parametrized sampler is used, and the resulting variance is added to the loss so that it can be minimized directly by backpropagation.

The authors also propose to include skip connections for message passing so that 2nd-order proximity is preserved.

To sum up, the contributions include:

  • Over-expansion of the neighborhood ⇒ layer-wise neighbor sampling

  • Uncomputable optimal sampler ⇒ feature-based parametrized sampler + directly optimizing the variance in the objective

  • Preserving 2nd-order proximity ⇒ skip connections for message passing

Related Work

  • Spectral: defines the convolution operation in the Fourier domain

    • Spectral networks and locally connected networks on graphs: defines the convolution operation in the Fourier domain

    • Deep convolutional networks on graph-structured data: enables localized filtering by applying efficient spectral filters

    • Convolutional neural networks on graphs with fast localized spectral filtering: employs a Chebyshev expansion of the graph Laplacian to avoid the eigendecomposition

    • GCN: simplifies previous methods with a first-order expansion and a re-parameterization trick

  • Non-spectral: defines convolution on the graph by using the spatial connections directly

    • Convolutional neural networks on graphs for learning molecular fingerprints: learns a weight matrix for each node degree

    • Diffusion-convolutional neural networks: defines multiple-hop neighborhoods by using the power series of a transition matrix

    • Learning convolutional neural networks for graphs: extracts normalized neighborhoods that contain a fixed number of nodes

  • Leap of model capacity: implicitly weights node importance within a neighborhood

    • Geometric deep learning on graphs and manifolds using mixture model CNNs: builds ConvNets on graphs using the patch operation

    • GAT: computes the hidden representation of each node by attending over its neighbors following a self-attention strategy

Method Formulation

Monte Carlo GCN

In GCN, the update rule for node features is

$$h_{v}^{(l+1)} = \sigma\left(\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u,v)\,h_{u}^{(l)}W^{(l)}\right),$$

where

$$\hat{A}(u, v)=\left(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}\right)_{u, v}.$$
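As a quick reference, here is a minimal NumPy sketch of this normalization, assuming the usual GCN convention that $\hat{A}$ contains self-loops ($A+I$) and $\hat{D}$ is its degree matrix (function name and dense layout are my own):

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, the standard GCN normalization."""
    A_hat = A + np.eye(A.shape[0])   # add self-loops (assumed convention)
    deg = A_hat.sum(axis=1)          # degrees of A + I, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```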

We can rewrite the rule above as an expectation:

$$h_{v}^{(l+1)} = \sigma\left(N(v)\,\mathbb{E}_{u\sim p(u|v)}\left[h_u^{(l)}\right]W^{(l)}\right),$$

where

$$N(v)=\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u, v),\qquad p(u|v)=\frac{\hat{A}(u,v)}{N(v)}.$$

The term $\mathbb{E}_{u\sim p(u|v)}[h_u^{(l)}]$ may be approximated with Monte Carlo sampling:

$$\frac{1}{n}\sum_{i=1}^{n}h_{u_i}^{(l)},$$

with $u_1,\cdots,u_n$ sampled from $p(u|v)$.
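A minimal sketch of this node-wise Monte Carlo estimate for a single node $v$ (plain NumPy; the function name, the dense layout, and the choice of ReLU for $\sigma$ are my own assumptions):

```python
import numpy as np

def mc_gcn_update(v, A_hat, H, W, n, rng):
    """Monte Carlo estimate of sigma(N(v) * E_{u~p(u|v)}[h_u] W) for one node v.

    A_hat : (N, N) normalized adjacency (self-loops included)
    H     : (N, D) hidden features of layer l
    W     : (D, D_out) layer weights
    n     : number of neighbor samples
    rng   : np.random.Generator, e.g. np.random.default_rng(0)
    """
    weights = A_hat[:, v]                       # hat{A}(u, v) for every u
    N_v = weights.sum()                         # N(v)
    p = weights / N_v                           # p(u | v)
    u = rng.choice(len(p), size=n, p=p)         # u_1, ..., u_n ~ p(u | v)
    mean_h = H[u].mean(axis=0)                  # (1/n) sum_i h_{u_i}
    return np.maximum(0.0, N_v * mean_h @ W)    # sigma = ReLU (assumption)
```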

Layer-wise sampling

The layer-wise sampling is combined with importance sampling. Assume $q(u|v_1,\cdots, v_m)$ is a conditional layer-wise sampling distribution, which will be specified below. The original update rule is equivalent to

$$h_v^{(l+1)}=\sigma\left(N(v)\,\mathbb{E}_{u\sim q(u|v_1,\cdots,v_m)}\left[\frac{p(u|v)}{q(u|v_1,\cdots,v_m)}h_u^{(l)}\right]W^{(l)}\right),$$

where $v\in\{v_1,\cdots,v_m\}$. We can again perform Monte Carlo sampling as above.

In contrast to the node-wise Monte Carlo method, where neighbors are sampled independently for each $v_i$, layer-wise sampling is performed only once for all of $v_1,\cdots,v_m$. As a result, the total number of sampled nodes grows only linearly with the network depth if we fix the sampling size $n$.
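A sketch of the layer-wise importance-sampled update for a batch of parent nodes sharing one sample set drawn from a proposal $q$ (NumPy; names, shapes, and the ReLU nonlinearity are my assumptions, not the paper's code):

```python
import numpy as np

def layerwise_update(parents, A_hat, H, W, q, n, rng):
    """Layer-wise importance-sampled GCN update for parent nodes v_1..v_m.

    All parents share the same n nodes u_1..u_n drawn from q, and each term
    is reweighted by p(u | v) / q(u).
    parents : (m,) indices of upper-layer nodes
    q       : (N,) proposal distribution over all candidate nodes (sums to 1)
    """
    u = rng.choice(len(q), size=n, p=q)               # one sample set for the whole layer
    out = []
    for v in parents:
        N_v = A_hat[:, v].sum()                       # N(v)
        p_uv = A_hat[u, v] / N_v                      # p(u_i | v)
        ratio = p_uv / q[u]                           # importance weights
        mu = (ratio[:, None] * H[u]).mean(axis=0)     # (1/n) sum_i (p/q) h_{u_i}
        out.append(np.maximum(0.0, N_v * mu @ W))     # sigma = ReLU (assumption)
    return np.stack(out)
```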

Variance reduction

For simplicity we abbreviate $q(u|v_1,\cdots,v_m)$ as $q(u)$. For a good choice of $q(u)$, we seek to reduce the variance of $\hat{\mu}_{q}(v_j)=\frac{1}{n}\sum_{i=1}^{n}\frac{p(u_i|v_j)}{q(u_i)}h_{u_i}^{(l)}$ for $v_j\in\{v_1,\cdots,v_m\}$, where the $u_i$'s are sampled from $q(u)$.

Note that there is a small problem with the original argument: the authors treat $h_{u_i}^{(l)}$ as a scalar rather than a vector, so the derivation should be read as motivation. If $h_{u_i}^{(l)}$ is a scalar, then, as mentioned in Chapter 9 (Importance Sampling) of *Monte Carlo theory, methods and examples*, the variance of $\hat{\mu}_q(v_j)$ is

$$\frac{1}{n}\,\mathbb{E}_{q}\left[\frac{\left(h_{u_i}^{(l)}p(u_i|v_j)-q(u_i)\,\mathbb{E}_p\left[h_{u_i}^{(l)}\right]\right)^2}{q(u_i)^2}\right],$$

and the optimal $q$ in terms of variance is given by

$$\frac{p(u_i|v_j)\left|h_{u_i}^{(l)}\right|}{\mathbb{E}_{p}\left[\left|h_{u_i}^{(l)}\right|\right]}.$$

This optimal sampler cannot be used directly, for two reasons: (1) $h_{u_i}^{(l)}$ is a vector while we need a scalar, and (2) even in the scalar case, $\mathbb{E}_{p}\left[\left|h_{u_i}^{(l)}\right|\right]$ cannot be evaluated efficiently, since the hidden features are only computed bottom-up after the top-down sampling.

The authors therefore use a linear layer $g(x(u_i))=W_g x(u_i)$ in place of $h_{u_i}^{(l)}$ in the expression above, where $W_g\in\mathbb{R}^{1\times D}$ and $x(u_i)$ is the node feature. For a given $v_j$, this yields the self-normalized sampler

$$\frac{p(u_i|v_j)\,|g(x(u_i))|}{\sum_{i=1}^{N}p(u_i|v_j)\,|g(x(u_i))|}.$$

To make the sampler independent of any particular $v_j$, as required for layer-wise sampling, we define

$$q(u_i)=\frac{\sum_{j=1}^{m}p(u_i|v_j)\,|g(x(u_i))|}{\sum_{i=1}^{N}\sum_{j=1}^{m}p(u_i|v_j)\,|g(x(u_i))|}.$$
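A sketch of how this sampler could be evaluated over all $N$ candidate nodes, with $W_g$ the $1\times D$ weights of $g$ (NumPy; names are mine):

```python
import numpy as np

def layerwise_sampler(parents, A_hat, X, W_g):
    """q(u) proportional to sum_j p(u | v_j) |g(x(u))|, normalized over all candidates.

    parents : (m,) indices of the upper-layer nodes v_1..v_m
    X       : (N, D) node features
    W_g     : (1, D) weights of the linear sampler g(x) = W_g x
    """
    g = np.abs(X @ W_g.T).ravel()                  # |g(x(u))| for every node, shape (N,)
    N_v = A_hat[:, parents].sum(axis=0)            # N(v_j) for each parent, shape (m,)
    p = A_hat[:, parents] / N_v[None, :]           # p(u | v_j), shape (N, m)
    scores = (p * g[:, None]).sum(axis=1)          # sum_j p(u | v_j) |g(x(u))|
    return scores / scores.sum()                   # normalize over candidates
```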

To make the variance reduction process adaptive, we can directly add the estimated variance to the objective function:

$$\frac{1}{n^2}\sum_{i=1}^{n}\frac{\left(p(u_i|v_j)\,g(x(u_i))-\hat{\mu}_q(v_j)\,q(u_i)\right)^2}{q^2(u_i)},$$

where $u_1,\cdots,u_n$ are sampled from $q$.
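A NumPy sketch of this estimated-variance term for a single parent $v_j$; in practice it would be written in an autodiff framework so that it can be minimized together with the classification loss (variable names are mine):

```python
import numpy as np

def variance_term(v, u, A_hat, X, W_g, q):
    """Estimated variance of mu_hat_q(v_j) for one parent node v_j.

    u : (n,) indices of the nodes sampled from q for this layer
    q : (N,) layer-wise proposal distribution
    Uses g(x(u_i)) in place of h_{u_i}^{(l)}, as in the text above.
    """
    n = len(u)
    g = (X[u] @ W_g.T).ravel()                 # g(x(u_i))
    p_uv = A_hat[u, v] / A_hat[:, v].sum()     # p(u_i | v_j)
    mu_hat = np.mean(p_uv / q[u] * g)          # importance-weighted mean estimate
    resid = p_uv * g - mu_hat * q[u]           # p(u_i|v_j) g(x(u_i)) - mu_hat q(u_i)
    return np.sum(resid ** 2 / q[u] ** 2) / n ** 2
```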

Skip Connections

For the nodes of the $(l+1)$-th layer, we can add direct connections between them and the nodes in the $(l-1)$-th layer for the feature update:

$$h_{skip}^{(l+1)}(v_i) = \sum_{j=1}^{n}\hat{a}_{skip}(v_i, s_j)\,h_{s_j}^{(l-1)}W_{skip}^{(l-1)},$$

where:

  • $\{s_j\}_{j=1}^{n}$ are the nodes sampled in the $(l-1)$-th layer.

  • $\hat{a}_{skip}(v_i,s_j)$ is approximated by $\sum_{k=1}^{n}\hat{a}(v_i, u_k)\,\hat{a}(u_k, s_j)$, where $u_1,\cdots,u_n$ are the nodes sampled in the $l$-th layer.

  • $W_{skip}^{(l-1)}=W^{(l-1)}W^{(l)}$, where $W^{(l)}$ and $W^{(l-1)}$ are the filters of the $l$-th and $(l-1)$-th layers.

The final update is then

$$h_{v}^{(l+1)} = \sigma\left(h_{skip}^{(l+1)}(v)+\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u,v)\,h_{u}^{(l)}W^{(l)}\right).$$
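A sketch of the skip term $h_{skip}^{(l+1)}$ under the sampling notation above (index layout and shapes are my own assumptions):

```python
import numpy as np

def skip_term(parents, mid, lower, A_hat, H_lower, W_prev, W_cur):
    """h_skip^{(l+1)}(v_i) for each parent, from the sampled middle/lower nodes.

    parents : (m,) nodes of layer l+1      mid   : (n,) nodes sampled at layer l
    lower   : (n,) nodes sampled at layer l-1
    H_lower : (N, D) hidden features of layer l-1
    W_prev, W_cur : filters of layers l-1 and l, so W_skip = W_prev @ W_cur
    """
    # a_skip(v_i, s_j) ~= sum_k a(v_i, u_k) a(u_k, s_j) over the sampled u_k
    a_skip = A_hat[np.ix_(parents, mid)] @ A_hat[np.ix_(mid, lower)]   # (m, n)
    W_skip = W_prev @ W_cur                                            # (D, D_out)
    return a_skip @ (H_lower[lower] @ W_skip)                          # (m, D_out)
```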

Attention

In GAT, $\hat{A}(u,v)$ is replaced by $\text{SoftMax}(\text{LeakyReLU}(W_{1}h^{(l)}(v_i), W_{2}h^{(l)}(u_j)))$. The dependence on $h^{(l)}(v_i)$ and $h^{(l)}(u_j)$ makes this intractable for the proposed method, since the sampling of nodes in the $l$-th layer depends on the nodes in the $(l+1)$-th layer (top-down), while $h^{(l)}$ only becomes available after bottom-up propagation. The authors propose to use

$$\frac{1}{n}\,\text{ReLU}\left(W_{1}\,g(x(v_i)) + W_{2}\,g(x(u_j))\right)$$

instead, where $W_1$ and $W_2$ are learnable weights and $x(v_i), x(u_j)$ are the node features.
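A sketch of these feature-based attention weights; since $g(x)$ is a scalar here, I treat $W_1$ and $W_2$ as scalars, which is an assumption on my part (the paper's exact parameterization may differ):

```python
import numpy as np

def feature_attention(parents, sampled, X, W_g, W1, W2):
    """a(v_i, u_j) = (1/n) ReLU(W1 g(x(v_i)) + W2 g(x(u_j))) for all pairs.

    parents : (m,) upper-layer nodes      sampled : (n,) layer-wise sampled nodes
    W_g     : (1, D) sampler weights      W1, W2  : scalar weights (assumption)
    """
    n = len(sampled)
    g_v = (X[parents] @ W_g.T).ravel()            # g(x(v_i)), shape (m,)
    g_u = (X[sampled] @ W_g.T).ravel()            # g(x(u_j)), shape (n,)
    scores = W1 * g_v[:, None] + W2 * g_u[None, :]
    return np.maximum(0.0, scores) / n            # (m, n) attention weights
```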

Experiments

  • Categorizing academic papers in the citation network datasets Cora ($O(10^3)$ nodes), Citeseer ($O(10^3)$ nodes), and Pubmed ($O(10^4)$ nodes)

  • Predicting which community different posts belong to on Reddit ($O(10^5)$ nodes)

The sampling framework is inductive (test data is held out from training) rather than transductive (all vertices are included). At test time, the authors do not use sampling.

The experiments are conducted with random seeds over 20 trials, with means and standard deviations reported.

From the learning curves, the performance of layer-wise sampling is comparable to full training and better than Node-Wise (a re-implementation of GraphSAGE) and IID (a re-implementation of FastGCN). However, the variance reduction does not seem to help much here.

Note that the learning curves above are based on re-implementations of FastGCN and GraphSAGE for a fair comparison. The comparison with the official implementations is presented below:

For skip connections, the authors also add a skip connection between the first and the last layer. In my view, the skip connection does not help much on the datasets considered.


Spectral networks and locally connected networks on graphs
Deep convolutional networks on graph-structured data
Convolutional neural networks on graphs with fast localized spectral filtering
Convolutional neural networks on graphs for learning molecular fingerprints
Diffusion-convolutional neural networks
Learning convolutional neural networks for graphs
Geometric deep learning on graphs and manifolds using mixture model cnns
page 6, Chapter 9 Importance Sampling, Monte Carlo theory, methods and examples