Deep Convolutional Networks on Graph-Structured Data
The goal is to generalize ConvNets to high-dimensional, general datasets where the graph structure is not known a priori. In this context, learning the graph structure amounts to estimating the similarity matrix $W \in \mathbb{R}^{N \times N}$, which has complexity $O(N^2)$. The authors also attempt to answer experimentally whether graph estimation followed by graph convolutions offers advantages over learning directly from the data with fully connected layers.
Let $W$ be a similarity matrix representing an undirected graph $G$, and let $L = D - W$ (or its normalized version $I - D^{-1/2} W D^{-1/2}$) be its graph Laplacian, where $D = \mathrm{diag}\big(\sum_j W_{ij}\big)$. Let $U = (u_1, \ldots, u_N)$ be the eigenvectors of $L$. Then a graph convolution of input signals $x$ with filters $g$ on $G$ is defined by

$$x \ast_G g = U \big( U^\top x \odot U^\top g \big),$$

where $\odot$ represents a point-wise product. As for the theory behind the construction, see Bruna et al., 2014.
With such a construction, learning filters on a graph thus amounts to learning spectral multipliers $w_g = (w_1, \ldots, w_N)$, with

$$x \ast_G g = U \, \mathrm{diag}(w_g) \, U^\top x.$$
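A minimal NumPy sketch of these two equivalent views on a toy path graph (the graph and the random signals are my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy similarity matrix: a path graph on 8 nodes
N = 8
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W                              # combinatorial graph Laplacian
_, U = np.linalg.eigh(L)               # columns of U are the eigenvectors of L

x = rng.standard_normal(N)             # input signal on the graph
g = rng.standard_normal(N)             # filter, also a signal on the graph

conv = U @ (U.T @ x * (U.T @ g))       # x *_G g = U (U^T x ⊙ U^T g)

# Same operation in multiplier form: w_g = U^T g gives U diag(w_g) U^T x
w_g = U.T @ g
assert np.allclose(conv, U @ np.diag(w_g) @ U.T @ x)
```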
Extending the convolution to inputs $x$ with multiple channels can be done by applying $U^\top$ on each channel and then using multipliers $w_{i,j}$, one per pair of input channel $i$ and output channel $j$.
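A sketch of the multi-channel case under the above definition (the array shapes and the summation over input channels are my reading of the construction):

```python
import numpy as np

def spectral_conv(x, U, w):
    """Multi-channel graph convolution in the spectral domain.

    x: [N, M] input with M channels, U: [N, N] Laplacian eigenvectors,
    w: [M, M_out, N] spectral multipliers w_{i,j}; returns [N, M_out].
    """
    xf = U.T @ x                       # transform each input channel
    M, M_out, N = w.shape
    out = np.zeros((N, M_out))
    for j in range(M_out):
        acc = np.zeros(N)
        for i in range(M):
            acc += w[i, j] * xf[:, i]  # apply diag(w_{i,j}), sum over inputs
        out[:, j] = U @ acc            # transform back to the graph domain
    return out

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((8, 8)))[0]  # stand-in orthonormal basis
x = rng.standard_normal((8, 3))                   # N=8 nodes, M=3 channels
w = rng.standard_normal((3, 4, 8))                # M=3 in, M_out=4 out
y = spectral_conv(x, U, w)                        # -> shape (8, 4)
```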
Note that one could simply replace $U$ by an arbitrary unitary matrix, which is then optimized by backpropagation directly. In this case the number of parameters will be $O(N^2)$.
One problem with this formulation is that the multipliers $w_g$ have $O(N)$ parameters, while a filter in a classical ConvNet has $O(1)$ parameters (the filter size does not grow with the input size). To address this issue, it is necessary to restrict the class of spectral multipliers to those corresponding to localized filters. The theoretical explanation in the paper was not clear to me (Mufei), but their proposal is that we can use

$$w_g = \mathcal{K} \tilde{w}_g,$$

where $\mathcal{K}$ is a fixed $N \times N_0$ smoothing kernel and $\tilde{w}_g$ is a vector of $N_0 \ll N$ parameters to learn.
For the smoothing kernel, splines are a natural choice. According to Gunn (1998), the spline kernel is a popular choice for interpolation modeling. A finite spline of order $\kappa$, with knots located at $\{\tau_s\}_{s=1}^{N_0}$, is given by

$$K(x, y) = \sum_{r=0}^{\kappa} x^r y^r + \sum_{s=1}^{N_0} (x - \tau_s)_+^{\kappa} (y - \tau_s)_+^{\kappa},$$

where $(x)_+ = \max(x, 0)$.
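To make the parameter saving concrete, here is a minimal sketch assuming a cubic-spline interpolation matrix (the cardinal-basis construction and the knot placement are my own choices, not details from the paper); the layer only ever learns the $N_0$ coefficients:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolation_kernel(n, n0):
    """Build an n x n0 smoothing kernel K whose columns are spline basis functions."""
    knots = np.linspace(0, n - 1, n0)      # knot locations along the spectrum
    grid = np.arange(n)
    K = np.zeros((n, n0))
    for s in range(n0):
        coeffs = np.zeros(n0)
        coeffs[s] = 1.0                    # cardinal basis: 1 at knot s, 0 elsewhere
        K[:, s] = CubicSpline(knots, coeffs)(grid)
    return K

N, N0 = 64, 8
K = interpolation_kernel(N, N0)
w_tilde = np.random.default_rng(0).standard_normal(N0)  # the only learned parameters
w_g = K @ w_tilde                          # smooth multipliers over all N frequencies
```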
For pooling, the authors construct neighborhoods at different scales using multi-resolution spectral clustering and consider both average- and max-pooling over the resulting clusters, as in standard ConvNets.
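A minimal sketch of one pooling stage, assuming cluster assignments from an off-the-shelf spectral clustering (scikit-learn's `SpectralClustering` here; the paper's exact multi-resolution procedure is not detailed in these notes):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def graph_max_pool(x, assignments, n_clusters):
    """Max-pool a node signal x [N] over clusters, giving a signal of length n_clusters."""
    pooled = np.full(n_clusters, -np.inf)
    for node, c in enumerate(assignments):
        pooled[c] = max(pooled[c], x[node])
    return pooled

rng = np.random.default_rng(0)
W = rng.random((64, 64))
W = (W + W.T) / 2                          # symmetric similarity matrix (illustration)
labels = SpectralClustering(n_clusters=16, affinity="precomputed",
                            random_state=0).fit_predict(W)
x = rng.standard_normal(64)                # signal on the 64 nodes
x_pooled = graph_max_pool(x, labels, 16)   # coarsened signal on 16 clusters
```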
When our non-Euclidean dataset does not have a prior graph structure, we need to estimate the similarity matrix before constructing the spectral network. The authors presented two approaches for graph construction: an unsupervised one that measures joint feature statistics, and a supervised one that uses an initial network as a proxy for the estimation.
Given data $X \in \mathbb{R}^{L \times N}$, where $L$ is the number of samples and $N$ is the number of features, a distance can be defined between a pair of features $i$, $j$ by

$$d(i, j) = \| X_i - X_j \|^2,$$

where $X_i$ is the $i$-th column of $X$. This definition can be problematic as it fails to capture higher-order statistics and might be fragile when the features are not normalized. To address these issues, we can replace the Euclidean distance by the $Z$-score (thus renormalizing each feature by its standard deviation), the "square-correlation" (computing the correlation of squares of previously whitened features), or the mutual information.
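A minimal sketch of this distance, with the $Z$-score variant folded in as an optional standardization step (the flag and the shapes are my own):

```python
import numpy as np

def feature_distances(X, standardize=True):
    """Pairwise squared distances d(i, j) between the columns (features) of X [L x N]."""
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)    # Z-score each feature
    sq = (X ** 2).sum(axis=0)                        # ||X_i||^2 for each feature i
    return sq[:, None] + sq[None, :] - 2.0 * X.T @ X

X = np.random.default_rng(0).standard_normal((1000, 32))  # L=1000 samples, N=32 features
D = feature_distances(X)                                   # N x N distance matrix
```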
A Gaussian diffusion kernel can then be built from the chosen distance:

$$\omega(i, j) = \exp\left(-\frac{d(i, j)}{\sigma^2}\right).$$

In experiments, the authors consider the variant of self-tuning diffusion kernel

$$\omega(i, j) = \exp\left(-\frac{d(i, j)}{\sigma_i \sigma_j}\right),$$

where $\sigma_i$ is computed as the distance $d(i, i_k)$ corresponding to the $k$-th nearest neighbor $i_k$ of feature $i$. This defines a kernel whose variance is locally adapted around each feature point, as opposed to the isotropic kernel, where the variance is shared.
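A sketch of the self-tuning kernel (the choice $k = 5$ is an arbitrary illustration; the isotropic version would just use a single shared $\sigma$):

```python
import numpy as np

def self_tuning_kernel(D, k=5):
    """Self-tuning Gaussian diffusion kernel from a pairwise distance matrix D [N x N]."""
    sigma = np.sort(D, axis=1)[:, k]   # sigma_i = d(i, i_k); column 0 is the self-distance
    return np.exp(-D / np.outer(sigma, sigma))

# Reusing the distance matrix D from the previous sketch:
W_est = self_tuning_kernel(D, k=5)
```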
Given a training set with normalized features $X$ (the authors normalized each feature by its standard deviation and suggested the possibility of whitening the data completely) and labels $y$, they initialized a fully connected network (FNN) $\phi$ with $K$ layers of weights $W_1, \ldots, W_K$ and trained it on the supervised task. They then extracted the first-layer features $W_1 \in \mathbb{R}^{N \times M_1}$, where $M_1$ is the number of first-layer hidden features, and considered the distance

$$d_{\sup}(i, j) = \| W_{1,i} - W_{1,j} \|^2,$$

where $W_{1,i}$ is the $i$-th row of $W_1$. The rest can then be performed as in the unsupervised construction.
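A sketch of the supervised distance, assuming the FNN has already been trained and reusing `self_tuning_kernel` from the sketch above (the shapes are illustrative):

```python
import numpy as np

def supervised_feature_distances(W1):
    """d_sup(i, j) = ||W1_i - W1_j||^2 over the rows of the first-layer weights W1 [N x M1]."""
    sq = (W1 ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * W1 @ W1.T

# W1 would come from the trained FNN; random values stand in for illustration.
W1 = np.random.default_rng(0).standard_normal((32, 128))  # N=32 features, M1=128 hidden units
D_sup = supervised_feature_distances(W1)
W_graph = self_tuning_kernel(D_sup, k=5)  # then proceed as in the unsupervised case
```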