Deep Convolutional Networks on Graph-Structured Data

Motivation

To generalize ConvNets to high-dimensional, general datasets where the graph structure is not known a priori. In this setting, learning the graph structure amounts to estimating the similarity matrix, which has complexity $O(N^2)$. The authors also attempt to answer experimentally whether graph estimation followed by graph convolutions offers any advantage over learning directly from the data with fully connected layers.

Generalizing Convolutions to Graphs

Spectral Networks

Let $W$ be an $N\times N$ similarity matrix representing an undirected graph $G$, and let $L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$ be its graph Laplacian, where $D = \text{diag}(W\cdot\mathbf{1})$ is the degree matrix. Let $U = (u_1, \cdots, u_N)$ be the eigenvectors of $L$. Then a graph convolution of input signals $x\in\mathbb{R}^{N}$ with filters $g$ on $G$ is defined by

$$x \ast_{G} g = U^{T}(Ux \odot Ug),$$

where $\odot$ denotes a point-wise product. As for the theory behind the construction, see Bruna et al., 2014.

With such a construction, learning filters on a graph amounts to learning spectral multipliers $w_g = (w_1, \cdots, w_N)$ in

$$x \ast_{G} g := U^{T}\,\text{diag}(w_g)\,Ux.$$
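
As a concrete illustration, here is a minimal numpy sketch of this convolution on a toy graph. None of this is code from the paper: the random similarity matrix and the function names are my own assumptions, and the rows of $U$ are taken to be the Laplacian eigenvectors so that $Ux$ is the projection onto the spectral basis.

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric, non-negative similarity matrix W."""
    d = W.sum(axis=1)                                   # degrees, D = diag(d)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def spectral_conv(x, w_g, U):
    """x *_G g = U^T diag(w_g) U x, where the rows of U are the Laplacian eigenvectors."""
    return U.T @ (w_g * (U @ x))

# Toy graph: 5 nodes with a random symmetric similarity matrix (hypothetical data).
rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

L = normalized_laplacian(W)
_, eigvecs = np.linalg.eigh(L)      # columns of eigvecs are orthonormal eigenvectors
U = eigvecs.T                       # rows of U form the graph Fourier basis

x = rng.standard_normal(5)          # one input channel on the 5 nodes
w_g = rng.standard_normal(5)        # learnable spectral multipliers, O(N) of them
print(spectral_conv(x, w_g, U))
```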

Extending the convolution to inputs $x$ with multiple channels can be done by applying $U$ to each channel and then using multipliers $w_g = (w_{i,j} : i \leq N,\ j \leq M)$.

Note that one could simply replace $U$ by an arbitrary unitary matrix, which is then optimized directly by backpropagation. In this case the number of parameters is $O(N^2)$.

One problem with this formulation is that the multipliers have $O(N)$ parameters, while a filter in a standard ConvNet has $O(1)$ parameters. To address this issue, it is necessary to restrict the class of spectral multipliers to those corresponding to localized filters. The theoretical explanation in the paper was not clear to me (Mufei), but their proposal is that we can use

$$w_g = \mathcal{K}\tilde{w}_g,$$

where $\mathcal{K}\in\mathbb{R}^{N\times N_0}$ is a smoothing kernel, such as splines, and $\tilde{w}_g\in\mathbb{R}^{N_0\times M}$.

According to Gunn (1998), the spline kernel is a popular choice for interpolation modeling. A finite spline of order $p$, with knots located at $\tau_s$, is given by

$$K(x, y) = \sum_{r=0}^{p} x^{r} y^{r} + \sum_{s=1}^{N}(x-\tau_s)_{+}^{p}(y-\tau_s)_{+}^{p},$$

where $(a)_{+} = \max\{a, 0\}$.
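
A minimal sketch of this parameterization for a single channel, using the finite spline kernel above to build $\mathcal{K}$. The placement of the evaluation points and knots and the spline order are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def finite_spline_kernel(x, y, knots, p=1):
    """K(x, y) = sum_{r<=p} x^r y^r + sum_s (x - tau_s)_+^p (y - tau_s)_+^p."""
    poly = sum((x ** r) * (y ** r) for r in range(p + 1))
    hinge = sum(np.maximum(x - t, 0.0) ** p * np.maximum(y - t, 0.0) ** p for t in knots)
    return poly + hinge

N, N0, p = 64, 8, 1                 # N multipliers parameterized by N0 free weights
t = np.linspace(0.0, 1.0, N)        # positions at which the N multipliers are evaluated
c = np.linspace(0.0, 1.0, N0)       # positions attached to the N0 free parameters
knots = np.linspace(0.0, 1.0, N0)   # knot locations tau_s (an arbitrary choice here)

# Smoothing matrix K in R^{N x N0}: K[i, j] = K(t_i, c_j).
K = finite_spline_kernel(t[:, None], c[None, :], knots, p)

w_tilde = np.random.default_rng(0).standard_normal(N0)  # only N0 parameters are learned
w_g = K @ w_tilde                                        # smooth multipliers of length N
print(w_g.shape)                                         # (64,)
```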

Pooling with Hierarchical Graph Clustering

The authors construct neighborhoods at different scales using multi-resolution spectral clustering and consider both average and max-pooling as in standard ConvNets.
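
A minimal sketch of such pooling for a single-channel signal, assuming the cluster assignments have already been produced by some multi-resolution clustering; they are hard-coded here for illustration.

```python
import numpy as np

def graph_pool(x, clusters, mode="max"):
    """Pool a node signal x (length N) to one value per cluster, by max or average."""
    n_clusters = clusters.max() + 1
    pooled = np.empty(n_clusters)
    for c in range(n_clusters):
        members = x[clusters == c]
        pooled[c] = members.max() if mode == "max" else members.mean()
    return pooled

# Toy example: 8 nodes grouped into 3 clusters (assumed assignments).
x = np.arange(8, dtype=float)
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2])
print(graph_pool(x, clusters, mode="max"))      # [1. 4. 7.]
print(graph_pool(x, clusters, mode="average"))  # [0.5 3.  6.5]
```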

Graph Construction

When our non-Euclidean dataset does not have a prior graph structure, we need to estimate the similarity matrix $W$ before constructing the spectral network. The authors presented two approaches for graph construction: one unsupervised, measuring joint feature statistics, and one supervised, using an initial network as a proxy for the estimation.

Unsupervised Graph Estimation

Given $X\in\mathbb{R}^{N\times M}$, where $N$ is the number of samples and $M$ is the number of features, a distance can be defined between a pair of features $i$, $j$ by

$$d(i, j) = \|X_i - X_j\|^2,$$

where $X_i$ is the $i$-th column of $X$. This definition can be problematic as it fails to capture higher-order statistics and might be fragile when the features are not normalized. To address these issues, we can replace the Euclidean distance by the $Z$-score (thus renormalizing each feature by its standard deviation), the "square-correlation" (computing the correlation of squares of previously whitened features), or the mutual information.

A Gaussian diffusion kernel can be built from the chosen distance:

$$w(i, j) = \exp\left(-\frac{d(i, j)}{\sigma^2}\right).$$

In experiments, the authors consider the self-tuning variant of the diffusion kernel

$$w(i, j) = \exp\left(-\frac{d(i, j)}{\sigma_i\sigma_j}\right),$$

where $\sigma_i$ is the distance $d(i, i_k)$ to the $k$-th nearest neighbor $i_k$ of feature $i$. This defines a kernel whose variance is locally adapted around each feature point, as opposed to the previous kernel, where the variance is shared.
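
A minimal numpy sketch of the unsupervised construction on toy data: squared distances between feature columns, the shared-bandwidth Gaussian kernel, and the self-tuning variant. The sizes, the choice of $k$, and the shared bandwidth are illustrative assumptions.

```python
import numpy as np

def feature_distances(X):
    """Squared distances d(i, j) = ||X_i - X_j||^2 between the feature columns of X."""
    sq = (X ** 2).sum(axis=0)
    d = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.maximum(d, 0.0)

def gaussian_kernel(d, sigma):
    """Shared-bandwidth kernel w(i, j) = exp(-d(i, j) / sigma^2)."""
    return np.exp(-d / sigma ** 2)

def self_tuning_kernel(d, k=4):
    """Locally adapted kernel w(i, j) = exp(-d(i, j) / (sigma_i sigma_j))."""
    sigma = np.sort(d, axis=1)[:, k]   # sigma_i: distance to the k-th nearest feature
    return np.exp(-d / np.outer(sigma, sigma))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))     # N = 100 samples, M = 16 features (toy data)
X = X / X.std(axis=0)                  # renormalize each feature by its standard deviation
d = feature_distances(X)

W_shared = gaussian_kernel(d, sigma=np.sqrt(d.mean()))
W_local = self_tuning_kernel(d, k=4)
np.fill_diagonal(W_local, 0.0)         # drop self-loops before building the spectral network
print(W_shared.shape, W_local.shape)   # (16, 16) (16, 16)
```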

Supervised Graph Estimation

Given a training set with normalized features (the authors normalize each feature by its standard deviation and suggest that the data could also be fully whitened) $X\in\mathbb{R}^{N\times M}$ and labels $y\in\{1,\cdots, C\}^{N}$, they initialize a fully connected network $\phi$ with $K$ layers of weights $W_1,\cdots, W_K$, using $\text{ReLU}$ and $\text{dropout}$. They then extract the first-layer weights $W_1\in\mathbb{R}^{M\times M_1}$, where $M_1$ is the number of first-layer hidden features, and consider the distance

$$d_{sup}(i, j) = \|W_{1, i} - W_{1, j}\|^2,$$

where $W_{1,i}$ is the $i$-th row of $W_1$. The rest can be performed as in the unsupervised construction.
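
A minimal sketch of the supervised distance, with a random stand-in for the trained first-layer weights since the training of $\phi$ is omitted; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, M1 = 16, 32   # M input features, M1 first-layer hidden units (illustrative sizes)

# Stand-in for the first-layer weights W1 of the network phi. In practice W1 would be
# taken from a fully connected network with ReLU and dropout trained on (X, y);
# the training itself is omitted in this sketch.
W1 = rng.standard_normal((M, M1))

# d_sup(i, j) = ||W_{1,i} - W_{1,j}||^2 between the rows of W1 (one row per feature).
sq = (W1 ** 2).sum(axis=1)
d_sup = np.maximum(sq[:, None] + sq[None, :] - 2.0 * W1 @ W1.T, 0.0)

# The rest proceeds as in the unsupervised construction, e.g. with a self-tuning kernel.
k = 4
sigma = np.sort(d_sup, axis=1)[:, k]
W = np.exp(-d_sup / np.outer(sigma, sigma))
np.fill_diagonal(W, 0.0)
print(W.shape)   # (16, 16): similarity graph over the features
```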
