Layerwise Adaptive Sampling

tags: GCN, sampling, NeurIPS2018

Adaptive Sampling Towards Fast Graph Representation Learning

Multiple vertices often share common neighbors, so node-wise neighbor sampling can sample the same nodes repeatedly and the receptive field over-expands as the network gets deeper. We can avoid this over-expansion and accelerate GCN training by controlling the size of the sampled set in each layer.

The core of the method is to define an appropriate sampler for layer-wise sampling. A common objective when designing a sampler is to minimize the resulting variance. Unfortunately, the variance-optimal sampler is uncomputable here, due to the inconsistency between the top-down sampling and the bottom-up propagation in the network. To tackle this issue, a parameterized sampler is used, and the resulting variance, added as a loss term, can be optimized directly with back-propagation.

The authors also propose to include skip connections for message passing so that second-order proximity is preserved.

To sum up, the contributions include:

  • Over-expansion of the neighborhood $\Rightarrow$ layer-wise neighbor sampling

  • Uncomputable optimal sampler $\Rightarrow$ feature-based parameterized sampler + directly optimizing the variance in the objective

  • Preserving second-order proximity $\Rightarrow$ skip connections for message passing

Method Formulation

Monte Carlo GCN

In GCN, the update rule for node features is

$$h_{v}^{(l+1)} = \sigma\left(\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u,v)\,h_{u}^{(l)}W^{(l)}\right),$$

where

$$\hat{A}(u, v)=\left(\hat{D}^{-\frac{1}{2}}\tilde{A}\hat{D}^{-\frac{1}{2}}\right)_{u, v},\qquad \tilde{A}=A+I,$$

and $\hat{D}$ is the degree matrix of $\tilde{A}$.
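
The full (non-sampled) propagation rule is easy to write down directly; below is a minimal NumPy sketch on a toy random graph, with ReLU standing in for $\sigma$ (all variable names and sizes are illustrative, not from the paper).

```python
import numpy as np

# Toy setup: small random symmetric graph, ReLU as sigma.
N, D, D_out = 5, 4, 3
rng = np.random.default_rng(0)

A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)              # symmetrize
np.fill_diagonal(A, 0.0)

X = rng.normal(size=(N, D))         # node features, used as h^(0)
W = rng.normal(size=(D, D_out))     # layer weights W^(l)

# Re-normalized adjacency: \hat{A} = \hat{D}^{-1/2} (A + I) \hat{D}^{-1/2}
A_tilde = A + np.eye(N)
deg = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(deg, deg))

# One full (non-sampled) GCN layer: h^(l+1) = sigma(\hat{A} h^(l) W^(l))
H_next = np.maximum(A_hat @ X @ W, 0.0)
print(H_next.shape)                 # (5, 3)
```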

We can reformulate the rule above as an expectation:

$$h_{v}^{(l+1)} = \sigma\left(N(v)\,\mathbb{E}_{u\sim p(u|v)}\left[h_u^{(l)}\right]W^{(l)}\right),$$

where

$$N(v)=\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u, v), \qquad p(u|v)=\frac{\hat{A}(u,v)}{N(v)}.$$

The term $\mathbb{E}_{u\sim p(u|v)}[h_u^{(l)}]$ may be approximated with Monte Carlo sampling:

$$\frac{1}{n}\sum_{i=1}^{n}h_{u_i}^{(l)},$$

with $u_1,\cdots,u_n$ sampled from $p(u|v)$.
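
As a concrete illustration, here is a hedged NumPy sketch of this node-wise Monte Carlo estimate for a single node $v$ on a tiny hand-written graph; the exact update is computed alongside for comparison (the setup is illustrative, not the authors' code).

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny 4-node path graph with D = 2 features, targeting a single node v.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
A_tilde = A + np.eye(4)
deg = A_tilde.sum(1)
A_hat = A_tilde / np.sqrt(np.outer(deg, deg))       # \hat{A}

H = rng.normal(size=(4, 2))                         # h^(l) for all nodes
W = rng.normal(size=(2, 2))                         # W^(l)

v, n = 1, 2
neigh = np.flatnonzero(A_hat[v] > 0)                # N(v) ∪ {v}
N_v = A_hat[v, neigh].sum()                         # N(v)
p = A_hat[v, neigh] / N_v                           # p(u|v)

samples = rng.choice(neigh, size=n, p=p)            # u_1, ..., u_n ~ p(u|v)
mc_mean = H[samples].mean(axis=0)                   # (1/n) sum_i h_{u_i}^(l)

h_v_sampled = np.maximum(N_v * (mc_mean @ W), 0.0)  # sigma(N(v) * MC mean * W)
h_v_exact   = np.maximum(A_hat[v] @ H @ W, 0.0)     # exact rule, for reference
```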

Layer-wise sampling

Layer-wise sampling can be combined with importance sampling. Assume $q(u|v_1,\cdots, v_m)$ is a conditional layer-wise sampling distribution, which we will introduce later. The original update rule is equivalent to

$$h_v^{(l+1)}=\sigma\left(N(v)\,\mathbb{E}_{u\sim q(u|v_1,\cdots,v_m)}\left[\frac{p(u|v)}{q(u|v_1,\cdots,v_m)}h_u^{(l)}\right]W^{(l)}\right),$$

where $v\in\{v_1,\cdots,v_m\}$. We can again apply Monte Carlo sampling as above.

As opposed to the node-wise Monte Carlo method, where neighbors are sampled independently for each $v_i$, layer-wise sampling is performed only once for all of $v_1,\cdots,v_m$. As a result, the total number of sampled nodes grows only linearly with the network depth if we fix the per-layer sample size $n$.
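
The following sketch illustrates the key mechanism under toy assumptions: the $n$ candidate nodes are drawn once per layer from a shared proposal $q$ (here simply the average of the parents' $p(u|v_j)$, as a placeholder), and the same samples are reused for every parent via importance weights. All quantities are random placeholders, not the authors' sampler.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quantities: N candidate nodes in layer l, m parent nodes in layer l+1.
N, m, n, D = 8, 3, 4, 5
P = rng.random((m, N))                      # P[j, u] stands in for p(u | v_j)
P /= P.sum(axis=1, keepdims=True)
H = rng.normal(size=(N, D))                 # h^(l) for the candidate nodes

# A layer-wise proposal q(u | v_1, ..., v_m) shared by all parents;
# here just the average of the parents' p(u | v_j), as a placeholder.
q = P.mean(axis=0)

# Sample once for the whole layer ...
u = rng.choice(N, size=n, p=q)

# ... and reuse the same n samples for every parent v_j via importance weights.
w = P[:, u] / q[u]                              # shape (m, n): p(u_i|v_j) / q(u_i)
mu_hat = (w[:, :, None] * H[u]).mean(axis=1)    # \hat{mu}_q(v_j), shape (m, D)

mu_exact = P @ H                            # E_{u~p(u|v_j)}[h_u^(l)], for reference
```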

Variance reduction

For simplicity we abbreviate $q(u|v_1,\cdots,v_m)$ as $q(u)$. For a good choice of $q(u)$, we seek to reduce the variance of $\hat{\mu}_{q}(v_j)=\frac{1}{n}\sum_{i=1}^{n}\frac{p(u_i|v_j)}{q(u_i)}h_{u_i}^{(l)}$ for $v_j\in\{v_1,\cdots,v_m\}$, where the $u_i$'s are sampled from $q(u)$.

Note that there is a small issue with the authors' original argument: they treat $h_{u_i}^{(l)}$ as a scalar rather than a vector, so the derivation below should be read as motivation rather than a fully rigorous result.

If $h_{u_i}^{(l)}$ were a scalar, then, as shown in Chapter 9 (Importance Sampling) of *Monte Carlo theory, methods and examples*, the variance of the importance-sampling estimate is

$$\frac{1}{n}\,\mathbb{E}_{q}\left[\frac{\left(h_{u_i}^{(l)}\,p(u_i|v_j)-q(u_i)\,\mathbb{E}_p[h_{u_i}^{(l)}]\right)^2}{q(u_i)^2}\right]$$

and the variance-minimizing $q$ is given by

$$q^{*}(u_i) = \frac{p(u_i|v_j)\,\big|h_{u_i}^{(l)}\big|}{\mathbb{E}_{p}\big[|h_{u_i}^{(l)}|\big]}.$$

This optimal sampler cannot be used directly, for two reasons: 1. $h_{u_i}^{(l)}$ is a vector while the sampler needs a scalar; 2. even if it were a scalar, $\mathbb{E}_{p}[|h_{u_i}^{(l)}|]$ cannot be evaluated efficiently, because the hidden features of layer $l$ are not available while sampling top-down.
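
For intuition, here is a small numeric check of the scalar statement (not from the paper): among a uniform proposal, sampling from $p$ itself, and the proposal $q^*\propto p\,|h|$, the latter attains the smallest one-sample variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Scalar toy example over K discrete nodes: estimate E_p[h] with a single
# importance sample u ~ q and the estimator p(u) h(u) / q(u).
K = 6
p = rng.random(K); p /= p.sum()
h = rng.normal(size=K)

def is_variance(q):
    """Exact variance of p(u) h(u) / q(u) with u ~ q."""
    ratio = p * h / q
    return np.sum(q * ratio ** 2) - (p @ h) ** 2

q_uniform = np.full(K, 1.0 / K)
q_p = p.copy()                                  # sample directly from p(u|v)
q_opt = p * np.abs(h); q_opt /= q_opt.sum()     # q* proportional to p(u) |h(u)|

for name, q in [("uniform", q_uniform), ("p itself", q_p), ("optimal", q_opt)]:
    print(f"{name:>9s}: variance = {is_variance(q):.4f}")
```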

The authors instead use a linear layer $g(x(u_i))=W_g x(u_i)$ in place of $|h_{u_i}^{(l)}|$ in the equation above, where $W_g\in\mathbb{R}^{1\times D}$ and $x(u_i)$ is the input feature of node $u_i$. Substituting this into the optimal sampler and normalizing over all $N$ candidate nodes gives

$$\frac{p(u_i|v_j)\,|g(x(u_i))|}{\sum_{k=1}^{N}p(u_k|v_j)\,|g(x(u_k))|}.$$

To make the sampler independent of any particular $v_j$, as required for layer-wise sampling, we sum over the parent nodes:

$$q(u_i)=\frac{\sum_{j=1}^{m}p(u_i|v_j)\,|g(x(u_i))|}{\sum_{k=1}^{N}\sum_{j=1}^{m}p(u_k|v_j)\,|g(x(u_k))|}.$$
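
A minimal sketch of this feature-based sampler under toy assumptions ($P$, $X$ and $W_g$ are random placeholders): compute $|g(x(u))|$ for every candidate node, weight it by the summed $p(u|v_j)$ over the parents, normalize, and draw once for the whole layer.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy layer: N candidate nodes with raw features x(u), m parent nodes v_1..v_m.
N, m, D = 8, 3, 5
X = rng.normal(size=(N, D))                 # node features x(u)
P = rng.random((m, N))
P /= P.sum(axis=1, keepdims=True)           # p(u | v_j)

# Linear self-dependent function g(x(u)) = W_g x(u), with W_g in R^{1 x D}.
W_g = rng.normal(size=(1, D))
g = np.abs(X @ W_g.T).ravel()               # |g(x(u))| for every candidate node

# Layer-wise sampler q(u) proportional to sum_j p(u|v_j) |g(x(u))|.
scores = (P * g[None, :]).sum(axis=0)
q = scores / scores.sum()

n = 4
u = rng.choice(N, size=n, p=q)              # one draw shared by all parents
```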

To make the variance reduction process adaptive, we can directly add the estimated variance to the objective function:

$$\frac{1}{n^2}\sum_{i=1}^{n}\frac{\left(p(u_i|v_j)\,g(x(u_i))-\hat{\mu}_q(v_j)\,q(u_i)\right)^2}{q^2(u_i)},$$

where $u_1,\cdots,u_n$ are sampled from $q$.
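
Below is a literal NumPy transcription of this loss for a single parent $v_j$, using the scalar surrogate $g(x(u))$ in place of the hidden feature (so $\hat{\mu}_q(v_j)$ here is the matching importance-weighted mean). In practice this term would be written in an autodiff framework so gradients flow back into $W_g$; all quantities below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy quantities for one parent node v_j and n layer-wise samples u_1..u_n.
N, n, D = 8, 4, 5
X = rng.normal(size=(N, D))
W_g = rng.normal(size=(1, D))
g = (X @ W_g.T).ravel()                             # g(x(u)), kept signed here

p_vj = rng.random(N); p_vj /= p_vj.sum()            # p(u | v_j)
q = rng.random(N); q /= q.sum()                     # sampler q(u)
u = rng.choice(N, size=n, p=q)                      # u_1, ..., u_n ~ q

# Importance-weighted mean \hat{mu}_q(v_j), built from the scalar surrogate g(x(u)).
mu_hat = np.mean(p_vj[u] * g[u] / q[u])

# Estimated variance term added to the training objective (to be minimized).
var_loss = np.sum((p_vj[u] * g[u] - mu_hat * q[u]) ** 2 / q[u] ** 2) / n ** 2
print(var_loss)
```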

Skip Connections

For the nodes of the $(l+1)$-th layer, we can add direct connections to the nodes in the $(l-1)$-th layer for the feature update:

$$h_{\text{skip}}^{(l+1)}(v_i) = \sum_{j=1}^{n}\hat{a}_{\text{skip}}(v_i, s_j)\,h_{s_j}^{(l-1)}W_{\text{skip}}^{(l-1)},$$

where:

  • $\{s_j\}_{j=1}^{n}$ are the nodes sampled in the $(l-1)$-th layer.

  • $\hat{a}_{\text{skip}}(v_i,s_j)$ is approximated by $\sum_{k=1}^{n}\hat{a}(v_i, u_k)\,\hat{a}(u_k, s_j)$, where $u_1,\cdots,u_n$ are the nodes sampled in the $l$-th layer.

  • $W_{\text{skip}}^{(l-1)}=W^{(l-1)}W^{(l)}$, where $W^{(l)}$ and $W^{(l-1)}$ are the filters of the $l$-th and $(l-1)$-th layers.

The final update is then

$$h_{v}^{(l+1)} = \sigma\left(h_{\text{skip}}^{(l+1)}(v)+\sum_{u\in\mathcal{N}(v)\cup\{v\}}\hat{A}(u,v)\,h_{u}^{(l)}W^{(l)}\right).$$
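
A small sketch of the skip term under toy assumptions (all matrices are random placeholders restricted to the sampled nodes): $\hat{a}_{\text{skip}}$ is approximated by a product over the layer-$l$ samples, and $W_{\text{skip}}^{(l-1)}$ is the product of the two layer filters, so no extra parameters are introduced.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy sizes: n sampled nodes per layer, feature widths D_prev -> D_mid -> D_out.
n, D_prev, D_mid, D_out = 4, 5, 6, 3
H_prev = rng.normal(size=(n, D_prev))        # h^(l-1) of the sampled nodes s_1..s_n
W_prev = rng.normal(size=(D_prev, D_mid))    # W^(l-1)
W_cur  = rng.normal(size=(D_mid, D_out))     # W^(l)

# \hat{a}(v_i, u_k) and \hat{a}(u_k, s_j), restricted to the sampled nodes.
A_vu = rng.random((1, n))                    # one target node v_i vs layer-l samples u_k
A_us = rng.random((n, n))                    # layer-l samples u_k vs layer-(l-1) samples s_j

# \hat{a}_skip(v_i, s_j) ~ sum_k \hat{a}(v_i, u_k) \hat{a}(u_k, s_j)
A_skip = A_vu @ A_us                         # shape (1, n)

# W_skip^(l-1) = W^(l-1) W^(l); skip message from layer l-1 directly to layer l+1.
W_skip = W_prev @ W_cur
h_skip = A_skip @ H_prev @ W_skip            # h_skip^(l+1)(v_i), shape (1, D_out)
```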

Attention

In GAT, $\hat{A}(u,v)$ is replaced by an attention weight of the form $\text{SoftMax}\left(\text{LeakyReLU}\left(W_{1}h^{(l)}(v_i), W_{2}h^{(l)}(u_j)\right)\right)$. The dependence on $h^{(l)}(v_i)$ and $h^{(l)}(u_j)$ makes this intractable for the proposed method, because the nodes in the $l$-th layer are sampled conditioned only on the nodes in the $(l+1)$-th layer, before the hidden features of layer $l$ exist. The authors propose to use

$$\frac{1}{n}\,\text{ReLU}\left(W_{1}\,g(x(v_i)) + W_{2}\,g(x(u_j))\right)$$

instead, where $W_{1}$ and $W_{2}$ are learnable weights and $x(v_i), x(u_j)$ are the node features.
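
Here is a sketch of this attention under the additional assumption that $W_1$ and $W_2$ act as scalar weights (since $g(\cdot)$ maps features to one dimension); all tensors are toy placeholders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy attention between one parent v_i and n candidate nodes u_j, built from
# raw features only, so it can be evaluated before any hidden features exist.
n, D = 4, 5
x_v = rng.normal(size=(D,))                 # x(v_i)
X_u = rng.normal(size=(n, D))               # x(u_j) for the candidates

W_g = rng.normal(size=(1, D))               # the sampler's linear layer g
W1, W2 = rng.normal(size=2)                 # scalar weights (assumption of this sketch)

g_v = (W_g @ x_v).item()                    # g(x(v_i))
g_u = (X_u @ W_g.T).ravel()                 # g(x(u_j)) for all candidates

# (1/n) ReLU(W1 g(x(v_i)) + W2 g(x(u_j))), one weight per candidate node
att = np.maximum(W1 * g_v + W2 * g_u, 0.0) / n
```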

Experiments

  • Categorizing academic papers in the citation network datasets: Cora ($O(10^3)$ nodes), Citeseer ($O(10^3)$ nodes) and Pubmed ($O(10^4)$ nodes)

  • Predicting which community different posts belong to on Reddit ($O(10^5)$ nodes)

The sampling framework is inductive (test data is held out from training) rather than transductive (all vertices are visible during training). At test time, the authors do not use sampling.

The experiments are conducted with random seeds over 20 trials, with the mean and standard deviation recorded.

From the learning curves, the performance of layer-wise sampling is comparable to full training and better than Node-Wise (a re-implementation of GraphSAGE) and IID (a re-implementation of FastGCN). However, the variance reduction does not seem to help much here.

Note that the learning curves above are based on re-implementations of FastGCN and GraphSAGE for a fair comparison. The comparison with the official implementations is presented below.

For skip connections, the authors also add a skip connection between the first layer and the last layer. In my view, the skip connection does not help much on the datasets considered.
