Deep Convolutional Networks on Graph-Structured Data

Motivation

The goal is to generalize ConvNets to high-dimensional, general datasets where the graph structure is not known a priori. In this context, learning the graph structure amounts to estimating the similarity matrix, which has complexity $O(N^2)$. The authors also attempt to answer experimentally whether graph estimation followed by graph convolutions offers advantages over learning directly from the data with fully connected layers.

Generalizing Convolutions to Graphs

Spectral Networks

Let $W$ be an $N\times N$ similarity matrix representing an undirected graph $G$, and let $L=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ be its graph Laplacian, where $D=\text{diag}(W\mathbf{1})$ is the diagonal degree matrix. Let $U=(u_1,\cdots, u_N)$ be the matrix of eigenvectors of $L$. Then a graph convolution of an input signal $x\in\mathbb{R}^{N}$ with a filter $g$ on $G$ is defined by $x\ast_{G} g=U^{T}(Ux\odot Ug)$, where $\odot$ denotes the point-wise product. For the theory behind this construction, see Bruna et al., 2014.
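
For concreteness, here is a minimal NumPy sketch of this construction (dense matrices, no attempt at efficiency); the function name and the use of `numpy.linalg.eigh` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def graph_convolution(x, g, W):
    """Convolve a signal x (length N) with a filter g (length N) on the graph
    whose similarity matrix is W (N x N), via the normalized graph Laplacian."""
    d = W.sum(axis=1)                                       # node degrees W·1
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt    # L = I - D^{-1/2} W D^{-1/2}
    _, V = np.linalg.eigh(L)                                # eigenvectors as columns of V
    U = V.T                                                 # rows of U are the eigenvectors u_1, ..., u_N
    return U.T @ ((U @ x) * (U @ g))                        # x *_G g = U^T (Ux ⊙ Ug)
```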

With such a construction, learning filters on a graph thus amounts to learning spectral multipliers $w_g=(w_1,\cdots, w_N)$ and setting

$$x\ast_{G}g := U^{T}\text{diag}(w_g)Ux.$$

Extending the convolution to inputs $x$ with multiple channels can be done by applying $U$ to each channel and then using multipliers $w_g=(w_{i,j}: i\leq N,\ j\leq M)$.
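
Below is a sketch of the corresponding forward pass with learnable multipliers, reusing the eigenvector matrix $U$ from the sketch above; the number of output channels `F_out` and the `einsum` formulation are assumptions made for illustration.

```python
import numpy as np

def spectral_layer_forward(X, U, w_g):
    """X: (N, M) input with M channels; U: (N, N), rows are Laplacian eigenvectors;
    w_g: (N, M, F_out) learnable spectral multipliers, one per frequency/channel pair."""
    X_hat = U @ X                                  # graph Fourier transform of each channel
    Y_hat = np.einsum('nm,nmf->nf', X_hat, w_g)    # apply diag(w_g) and sum over input channels
    return U.T @ Y_hat                             # back to the vertex domain: (N, F_out)
```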

Note that one could simply replace $U$ with an arbitrary unitary matrix, which is then optimized directly by backpropagation. In this case the number of parameters is $O(N^2)$.

One problem with this formulation is that the multipliers have $O(N)$ parameters, whereas a filter in a standard ConvNet has $O(1)$ parameters. To address this issue, it is necessary to restrict the class of spectral multipliers to those corresponding to localized filters. The theoretical explanation in the paper was not clear to me (Mufei), but their proposal is that we can use

$$w_g=\mathcal{K}\tilde{w}_g,$$

where $\mathcal{K}\in\mathbb{R}^{N\times N_0}$ is a smoothing kernel (e.g., splines) and $\tilde{w}_g\in\mathbb{R}^{N_0\times M}$ holds the learnable parameters.

According to Gunn (1998), the spline kernel is a popular choice for interpolation modeling. A finite spline of order $p$, with knots located at $\tau_s$, is given by

$$K(x, y) = \sum_{r=0}^{p}x^{r}y^{r}+\sum_{s=1}^{N}(x-\tau_s)_{+}^{p}(y-\tau_s)_{+}^{p},$$

where $(a)_{+}=\max\{a, 0\}$.
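
Here is a sketch of how a smoothing matrix $\mathcal{K}$ could be assembled from this spline kernel and used to expand a small set of weights into full-length multipliers; the evaluation points, knot locations, and sizes are illustrative assumptions, since the paper does not spell out this exact recipe.

```python
import numpy as np

def spline_kernel(x, y, knots, p=3):
    """Finite spline kernel of order p with knots tau_s (Gunn, 1998)."""
    poly = sum((x ** r) * (y ** r) for r in range(p + 1))
    hinge = sum(max(x - t, 0.0) ** p * max(y - t, 0.0) ** p for t in knots)
    return poly + hinge

N, N0 = 64, 8                                   # multiplier length and number of knots (assumed)
freqs = np.linspace(0.0, 1.0, N)                # one evaluation point per eigenvalue index (assumed)
knots = np.linspace(0.0, 1.0, N0)
K = np.array([[spline_kernel(f, t, knots) for t in knots] for f in freqs])   # (N, N0)

w_tilde = np.random.randn(N0)                   # N0 learnable parameters per filter
w_g = K @ w_tilde                               # smooth spectral multipliers of length N
```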

Pooling with Hierarchical Graph Clustering

The authors construct neighborhoods at different scales using multi-resolution spectral clustering and consider both average and max-pooling as in standard ConvNets.
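
A rough sketch of one pooling level built this way: cluster the graph with spectral clustering on the similarity matrix, then max-pool node features within each cluster. The cluster count and the use of scikit-learn are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def graph_max_pool(X, W, n_clusters):
    """X: (N, F) node features; W: (N, N) similarity matrix.
    Returns pooled features (n_clusters, F) and the cluster assignment of each node."""
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(W)
    pooled = np.stack([X[labels == c].max(axis=0) for c in range(n_clusters)])
    return pooled, labels
```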

Graph Construction

When a non-Euclidean dataset does not have a prior graph structure, we need to estimate the similarity matrix $W$ before constructing the spectral network. The authors present two approaches to graph construction: an unsupervised one that measures joint feature statistics, and a supervised one that uses an initial network as a proxy for the estimation.

Unsupervised Graph Estimation

Given $X\in\mathbb{R}^{N\times M}$, where $N$ is the number of samples and $M$ is the number of features, a distance between a pair of features $i$ and $j$ can be defined by

$$d(i, j) = \|X_i-X_j\|^2,$$

where $X_i$ is the $i$-th column of $X$. This definition can be problematic, as it fails to capture higher-order statistics and may be fragile when the features are not normalized. To address these issues, we can replace the Euclidean distance with the $Z$-score (thus renormalizing each feature by its standard deviation), the "square correlation" (the correlation between squares of previously whitened features), or the mutual information.
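
A sketch of these feature-to-feature measures follows; the square-correlation variant is one interpretation of "correlation of squares of previously whitened features" (per-feature standardization rather than full whitening), so treat it as an assumption.

```python
import numpy as np

def feature_distances(X):
    """Pairwise squared Euclidean distances between the columns (features) of X (N x M)."""
    sq = (X ** 2).sum(axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * X.T @ X

def zscore_distances(X):
    """Same distance after renormalizing each feature by its mean and standard deviation."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return feature_distances(Z)

def square_correlation(X):
    """Correlation between squares of standardized features (an (M, M) similarity, not a distance)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.corrcoef(Z ** 2, rowvar=False)
```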

A Gaussian diffusion kernel can be built from the chosen distance:

$$w(i, j)=\exp\left(-\frac{d(i, j)}{\sigma^2}\right).$$

In the experiments, the authors consider a self-tuning variant of the diffusion kernel,

$$w(i, j)=\exp\left(-\frac{d(i, j)}{\sigma_i\sigma_j}\right),$$

where $\sigma_i$ is computed as the distance $d(i, i_k)$ to the $k$-th nearest neighbor $i_k$ of feature $i$. This defines a kernel whose variance is locally adapted around each feature point, as opposed to the previous kernel, in which the variance is shared across all pairs.
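
Both kernels are easy to write down given a precomputed pairwise distance matrix `dist` (M x M); the default neighbor index `k` below is an assumption, not a value taken from the paper.

```python
import numpy as np

def gaussian_kernel(dist, sigma):
    """Shared-bandwidth diffusion kernel w(i, j) = exp(-d(i, j) / sigma^2)."""
    return np.exp(-dist / sigma ** 2)

def self_tuning_kernel(dist, k=10):
    """Locally adapted kernel: sigma_i is the distance from feature i to its k-th nearest neighbor."""
    sigma = np.sort(dist, axis=1)[:, k]          # column 0 is the self-distance d(i, i) = 0
    return np.exp(-dist / (sigma[:, None] * sigma[None, :]))
```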

Supervised Graph Estimation

Given a training set with normalized features $X\in\mathbb{R}^{N\times M}$ (the authors normalize each feature by its standard deviation and suggest the possibility of fully whitening the data) and labels $y\in\{1,\cdots, C\}^{N}$, they initialize a fully connected network $\phi$ with $K$ layers of weights $W_1,\cdots, W_K$, using $\text{ReLU}$ activations and $\text{dropout}$. They then extract the first-layer weights $W_1\in\mathbb{R}^{M\times M_1}$, where $M_1$ is the number of first-layer hidden features, and consider the distance

$$d_{sup}(i, j) = \|W_{1, i}-W_{1, j}\|^2,$$

where $W_{1, i}$ denotes the row of $W_1$ associated with feature $i$.

The rest can be performed as in the unsupervised construction.
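
A hedged PyTorch sketch of this construction: train a small fully connected network on the normalized data, take its first-layer weight matrix, and feed the resulting distances into the same diffusion-kernel pipeline as above. The architecture, sizes, and use of `torch.cdist` are illustrative assumptions.

```python
import torch
import torch.nn as nn

M, M1, C = 500, 128, 10                        # number of features, hidden units, classes (assumed)
net = nn.Sequential(nn.Linear(M, M1), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(M1, C))
# ... train `net` with cross-entropy on the normalized training set ...

W1 = net[0].weight.detach().T                  # (M, M1): one row per input feature
d_sup = torch.cdist(W1, W1) ** 2               # (M, M) pairwise squared distances between features
# d_sup now plays the role of d(i, j) in the Gaussian diffusion kernel above.
```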
