# Deep Convolutional Networks on Graph-Structured Data

## Motivation

The goal is to generalize ConvNets to high-dimensional, general datasets where the graph structure is not known a priori. In this setting, learning the graph structure amounts to estimating the similarity matrix, which has complexity $$O(N^2)$$. The authors also attempt to answer experimentally whether graph estimation followed by graph convolutions offers advantages over learning directly from the data with fully connected layers.

## Generalizing Convolutions to Graphs

### Spectral Networks

Let $$W$$ be an $$N\times N$$ similarity matrix representing an undirected graph $$G$$, and let $$L=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$$ be its graph Laplacian, where $$D=\text{diag}(W\cdot\textbf{1})$$ is the diagonal degree matrix. Let $$U=(u\_1,\cdots, u\_N)$$ be the eigenvectors of $$L$$. Then a graph convolution of an input signal $$x\in\mathbb{R}^{N}$$ with a filter $$g$$ on $$G$$ is defined by $$x\ast\_{G} g=U^{T}(Ux\odot Ug),$$ where $$\odot$$ denotes the point-wise product. For the theory behind this construction, see [Bruna et al., 2014](https://arxiv.org/pdf/1312.6203.pdf).

With such a construction, learning filters on a graph amounts to learning spectral multipliers $$w\_g=(w\_1,\cdots, w\_N)$$, so that

$$
x\ast\_{G}g := U^{T}\text{diag}(w\_g)Ux.
$$

Extending the convolution to inputs $$x$$ with $$M$$ channels can be done by applying $$U$$ to each channel and then using multipliers $$w\_g=(w\_{i,j}:i\leq N, j\leq M)$$.
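A sketch of the multi-channel case, assuming $$M\_{in}$$ input and $$M\_{out}$$ output channels (the output-channel dimension is my assumption, mirroring how spectral filter banks are stacked in Bruna et al.); a random orthonormal basis stands in for the Laplacian eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M_in, M_out = 8, 3, 4

# U would come from the Laplacian eigendecomposition; any orthonormal
# basis stands in for it here.
U, _ = np.linalg.qr(rng.standard_normal((N, N)))

x = rng.standard_normal((N, M_in))            # signal with M_in channels
w = rng.standard_normal((N, M_in, M_out))     # one multiplier per frequency

xh = U @ x                                    # transform each channel
yh = np.einsum('ni,nio->no', xh, w)           # mix channels per frequency
y = U.T @ yh                                  # back to the vertex domain
print(y.shape)                                # (8, 4)
```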

Note that one could simply replace $$U$$ by an arbitrary unitary matrix and optimize it directly by backpropagation. In that case, the number of parameters is $$O(N^2)$$.

One problem with this formulation is that the multipliers have $$O(N)$$ parameters, whereas a filter in a standard ConvNet has $$O(1)$$ parameters. To address this issue, it is necessary to restrict the class of spectral multipliers to those corresponding to localized filters. The theoretical explanation in the paper was not clear to me (Mufei), but the proposal is that we can use

$$
w\_g=\mathcal{K}\tilde{w}\_g,
$$

where $$\mathcal{K}\in\mathbb{R}^{N\times N\_0}$$ is a smoothing kernel such as splines and $$\tilde{w}\_g\in\mathbb{R}^{N\_0\times M}$$.
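The point of the parameterization $$w\_g=\mathcal{K}\tilde{w}\_g$$ is that only $$N\_0\ll N$$ parameters are free, smoothed across all $$N$$ frequencies. As a hedged sketch, a simple linear-interpolation (hat-function) matrix plays the role of the smoothing kernel below — an illustrative stand-in, not the paper's exact spline kernel:

```python
import numpy as np

N, N0 = 16, 4
knots = np.linspace(0, N - 1, N0)            # knot positions over frequencies
freqs = np.arange(N, dtype=float)

# K[i, j] = hat-function weight of knot j at frequency i
K = np.maximum(0.0, 1.0 - np.abs(freqs[:, None] - knots[None, :])
               / (knots[1] - knots[0]))
K /= K.sum(axis=1, keepdims=True)            # rows sum to 1

w_tilde = np.random.default_rng(2).standard_normal(N0)  # N_0 free parameters
w_g = K @ w_tilde                            # smooth multipliers over N freqs
print(w_g.shape)                             # (16,)
```

Smooth spectral multipliers correspond to spatially localized filters, which is what restores the $$O(1)$$-style parameter count of classical ConvNets.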

According to [Gunn 1998](http://www.svms.org/tutorials/Gunn1998.pdf), the spline kernel is a popular choice for interpolation modeling. A finite spline of order $$p$$, with knots located at $$\tau\_s$$, is given by

$$
K(x, y) = \sum\_{r=0}^{p}x^{r}y^{r}+\sum\_{s=1}^{N}(x-\tau\_s)\_{+}^{p}(y-\tau\_s)\_{+}^{p},
$$

where $$(a)\_{+}=\max(a, 0)$$.
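The finite spline kernel is straightforward to evaluate directly. A minimal sketch, where the default knot locations are my own illustrative choice:

```python
import numpy as np

def spline_kernel(x, y, p=1, knots=(0.25, 0.5, 0.75)):
    """Finite spline kernel of order p with given knots (Gunn 1998):
    K(x, y) = sum_{r=0..p} x^r y^r + sum_s (x - tau_s)_+^p (y - tau_s)_+^p.
    """
    poly = sum((x ** r) * (y ** r) for r in range(p + 1))
    hinge = sum(max(x - t, 0.0) ** p * max(y - t, 0.0) ** p for t in knots)
    return poly + hinge

# K(0.5, 0.5) with p=1: (1 + 0.25) + (0.25·0.25 + 0 + 0) = 1.3125
print(spline_kernel(0.5, 0.5))   # 1.3125
```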

![](https://i.imgur.com/QZHZRNu.png)

### Pooling with Hierarchical Graph Clustering

The authors construct neighborhoods at different scales using multi-resolution spectral clustering and consider both average and max-pooling as in standard ConvNets.
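Given a cluster assignment at some scale, pooling reduces to aggregating the signal within each cluster. A minimal sketch with a hard-coded assignment (the paper derives the clusters from multi-resolution spectral clustering):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])   # signal on 6 nodes
clusters = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

avg_pooled = np.array([x[c].mean() for c in clusters])  # average pooling
max_pooled = np.array([x[c].max() for c in clusters])   # max pooling
print(avg_pooled)   # [2.  5.  4.5]
print(max_pooled)   # [3. 8. 5.]
```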

## Graph Construction

When our non-Euclidean dataset does not have a prior graph structure, we need to estimate the similarity matrix $$W$$ before constructing the spectral network. The authors presented two approaches for graph construction, one **unsupervised** by measuring joint feature statistics, and another one **supervised** using an initial network as a proxy for the estimation.

### Unsupervised Graph Estimation

Given $$X\in\mathbb{R}^{N\times M}$$, where $$N$$ is the number of samples and $$M$$ is the number of features, a distance between a pair of features $$i$$ and $$j$$ can be defined by

$$
d(i, j) = ||X\_i-X\_j||^2,
$$

where $$X\_i$$ is the $$i$$-th column of $$X$$. This definition can be problematic, as it fails to capture higher-order statistics and may be fragile when the features are not normalized. To address these issues, we can replace the Euclidean distance with the $$Z$$-score (renormalizing each feature by its standard deviation), the "square-correlation" (the correlation of squares of previously whitened features), or the mutual information.
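The squared Euclidean distance between all feature columns, and its $$Z$$-scored variant, can be computed at once. A sketch on random data; the function name is illustrative:

```python
import numpy as np

def feature_distances(X):
    """d(i, j) = ||X_i - X_j||^2 between all pairs of feature columns."""
    sq = (X ** 2).sum(axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * X.T @ X

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))              # N=50 samples, M=5 features

d_euclid = feature_distances(X)
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score each feature
d_zscore = feature_distances(Z)
print(d_euclid.shape, d_zscore.shape)         # (5, 5) (5, 5)
```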

A Gaussian diffusion kernel can be built from the chosen distance:

$$
w(i, j)=\exp\left(-\frac{d(i, j)}{\sigma^2}\right).
$$

In their experiments, the authors consider a self-tuning variant of the diffusion kernel,

$$
w(i, j)=\exp\left(-\frac{d(i, j)}{\sigma\_i\sigma\_j}\right),
$$

where $$\sigma\_i$$ is computed as the distance $$d(i, i\_k)$$ corresponding to the $$k$$-th nearest neighbor $$i\_k$$ of feature $$i$$. This defines a kernel whose variance is locally adapted around each feature point, as opposed to the previous one where the variance is shared.
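The self-tuning kernel can be sketched as follows, assuming a precomputed distance matrix; $$\sigma\_i$$ is read off as the $$k$$-th smallest nonzero entry of row $$i$$:

```python
import numpy as np

def self_tuning_kernel(d, k=2):
    """w(i, j) = exp(-d(i, j) / (sigma_i * sigma_j)),
    with sigma_i the distance from feature i to its k-th nearest neighbor."""
    sigma = np.sort(d, axis=1)[:, k]   # index 0 is the self-distance d(i, i)=0
    return np.exp(-d / (sigma[:, None] * sigma[None, :]))

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 6))                   # 30 samples, 6 features
diff = X.T[:, None, :] - X.T[None, :, :]
d = (diff ** 2).sum(axis=-1)                       # pairwise feature distances

W = self_tuning_kernel(d, k=2)
print(W.shape)                                     # (6, 6)
```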

### Supervised Graph Estimation

Given a training set with normalized features $$X\in\mathbb{R}^{N\times M}$$ (each feature is normalized by its standard deviation; the authors also suggest the possibility of completely whitening the data) and labels $$y\in\{1,\cdots, C\}^{L}$$, they initialize a fully connected network $$\phi$$ with $$K$$ layers of weights $$W\_1,\cdots, W\_K$$, using $$\text{ReLU}$$ activations and $$\text{dropout}$$. They then extract the first-layer weights $$W\_1\in\mathbb{R}^{M\times M\_1}$$, where $$M\_1$$ is the number of first-layer hidden features, and consider the distance

$$
d\_{sup}(i, j) = ||W\_{1, i}-W\_{1, j}||^2.
$$

where $$W\_{1,i}$$ denotes the $$i$$-th row of $$W\_1$$. The rest can be performed as in the unsupervised construction.
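The supervised distance treats each row of the trained first-layer weight matrix as a learned embedding of the corresponding input feature. A sketch, where a random matrix stands in for the trained $$W\_1$$:

```python
import numpy as np

rng = np.random.default_rng(5)
M, M1 = 10, 4
W1 = rng.standard_normal((M, M1))   # first-layer weights, one row per feature

# d_sup(i, j) = ||W1_i - W1_j||^2 over all feature pairs at once
sq = (W1 ** 2).sum(axis=1)
d_sup = sq[:, None] + sq[None, :] - 2.0 * W1 @ W1.T
print(d_sup.shape)                  # (10, 10)
```

From here, plugging `d_sup` into the self-tuning Gaussian kernel above yields the supervised similarity matrix.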
