Spectral Networks and Deep Locally Connected Networks on Graphs


Motivation

Generalize ConvNets to graphs.

Spatial Construction

Consider an undirected weighted graph $G=(V, W)$, where $V=\{1,\cdots,|V|\}$ and $W\in\mathbb{R}^{|V|\times|V|}$ is a symmetric nonnegative matrix.

Locality via W

The weights in a graph determine a notion of locality analogous to the one in images, audio, and text. We can restrict attention to sparse "filters" with receptive fields given by neighborhoods, thus reducing the number of parameters in a filter to $O(S\cdot n)$, where $S$ is the average neighborhood size and $n$ is the input feature size.

Multiresolution Analysis in Graphs

The authors mentioned that "[f]inding multiscale clusterings that are provably guaranteed to behave well w.r.t. Laplacian on the graph is still an open area of research" (the publication was in 2014).

Deep Locally Connected Networks

The spatial construction starts with a multiscale clustering of the graph. We consider $K$ scales. We set $V_0=V$, and for each $k=1,\cdots,K$, we define $V_k$, a partition of $V_{k-1}$ into $d_k$ clusters, and a collection of neighborhoods around each element of $V_{k-1}$:

$$\mathcal{N}_{k}=\{\mathcal{N}_{k,i};\ i=1,\cdots,d_{k-1}\}.$$

With these in hand, we can now define the $k$-th layer of the network as a mapping $\mathbb{R}^{d_{k-1}\times f_{k-1}}\rightarrow\mathbb{R}^{d_{k}\times f_{k}}$, $x_{k}\mapsto x_{k+1}$, defined as

$$x_{k+1,j}=L_{k}\,h\left(\sum_{i=1}^{f_{k-1}}F_{k,i,j}\,x_{k,i}\right),$$

where:

  • $x_{k+1,j}$ is the $j$-th column of $x_{k+1}$ and $x_{k,i}$ is the $i$-th column of $x_{k}$

  • $F_{k,i,j}$ is a $d_{k-1}\times d_{k-1}$ sparse matrix with nonzero entries in the locations given by $\mathcal{N}_{k}$

  • $L_{k}$ outputs the result of a pooling operation over each cluster in $V_{k}$.
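To make this concrete, here is a minimal NumPy sketch of one such layer; the dense shapes, the toy pooling matrix, and the function names are assumptions for illustration (in practice each $F_{k,i,j}$ is sparse with support $\mathcal{N}_{k}$):

```python
import numpy as np

def spatial_layer(x, F, L_pool, h=np.tanh):
    """One spatial layer: x_{k+1,j} = L_k h(sum_i F_{k,i,j} x_{k,i}).

    x      : (d_prev, f_prev) signal at scale k-1, one column per feature
    F      : (f_prev, f_next, d_prev, d_prev) filters; F[i, j] should be
             zero outside the neighborhoods N_k (dense here for readability)
    L_pool : (d_next, d_prev) pooling matrix, one row per cluster in V_k
    """
    f_prev, f_next = F.shape[:2]
    out = np.empty((L_pool.shape[0], f_next))
    for j in range(f_next):
        s = sum(F[i, j] @ x[:, i] for i in range(f_prev))
        out[:, j] = L_pool @ h(s)
    return out

# Toy usage: 6 nodes pooled into 3 clusters by averaging adjacent pairs.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 2))
F = rng.standard_normal((2, 4, 6, 6))
L_pool = np.repeat(np.eye(3), 2, axis=1) / 2
print(spatial_layer(x, F, L_pool).shape)   # (3, 4)
```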

This construction is illustrated with a figure in the paper (not reproduced here).

Starting with $W_0=W$, for all $k\leq K$, $V_k$ is constructed by merging all nodes whose pairwise distance (determined by the corresponding entry of $W_{k-1}$) is less than $\varepsilon$. $W_k$ and $\mathcal{N}_k$ are then obtained with

$$\begin{aligned} A_{k}(i,j)&=\sum_{s\in V_{k}(i)}\sum_{t\in V_{k}(j)}W_{k-1}(s,t),\\ W_{k}&=\operatorname{rownormalize}(A_{k}),\\ \mathcal{N}_{k}&=\operatorname{supp}(W_{k}), \end{aligned}$$

where $V_{k}(i)$ is the $i$-th node in $V_{k}$ and $\operatorname{supp}(W_{k})$ is the support of $W_{k}$, i.e. $\{(i,j):W_{k}(i,j)>0\}$.
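A sketch of one coarsening step, assuming the $\varepsilon$-merging has already produced a cluster assignment (the assignment vector and function name here are hypothetical):

```python
import numpy as np

def coarsen(W_prev, clusters):
    """A_k(i, j) = sum_{s in V_k(i)} sum_{t in V_k(j)} W_{k-1}(s, t);
    W_k = rownormalize(A_k); N_k = supp(W_k)."""
    m, d_k = len(W_prev), clusters.max() + 1
    P = np.zeros((m, d_k))
    P[np.arange(m), clusters] = 1.0        # P[s, i] = 1 iff node s is in cluster i
    A = P.T @ W_prev @ P                   # aggregate inter-cluster weights
    W = A / A.sum(axis=1, keepdims=True)   # row-normalize (assumes no empty rows)
    N = [np.flatnonzero(row > 0) for row in W]   # supp(W_k), row by row
    return W, N

# Toy usage: merge 4 nodes into 2 clusters.
W0 = np.array([[0., 2., 1., 0.],
               [2., 0., 0., 1.],
               [1., 0., 0., 3.],
               [0., 1., 3., 0.]])
W1, N1 = coarsen(W0, np.array([0, 0, 1, 1]))
```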

Spectral Construction

The global structure of the graph can be exploited through the spectrum of its graph Laplacian to generalize the convolution operator.

Harmonic Analysis on Weighted Graphs

The frequency and smoothness relative to $W$ are interrelated through the combinatorial Laplacian $L=D-W$ (also written $\Delta$ below) or the graph Laplacian $\mathcal{L}=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$. For simplicity, the authors use the combinatorial Laplacian.

If $x$ is an $m$-dimensional vector, a natural definition of the smoothness functional $\|\nabla x\|_{W}^{2}$ at a node $i$ is

$$\begin{aligned} \|\nabla x\|_{W}^{2}(i)&=\sum_{j}W_{ij}\left(x(i)-x(j)\right)^{2},\\ \|\nabla x\|_{W}^{2}&=\sum_{i}\sum_{j}W_{ij}\left(x(i)-x(j)\right)^{2}=2x^{T}\Delta x. \end{aligned}$$

This definition is also known as the Dirichlet energy. With this definition, the smoothest vector is a constant:

$$\phi_{0}=\operatorname{argmin}_{x\in\mathbb{R}^{m},\,\|x\|=1}\|\nabla x\|_{W}^{2}=\frac{1}{\sqrt{m}}\mathbf{1}_{m}.$$

Note that $\phi_{0}$ is also the eigenvector of $L$ corresponding to the eigenvalue $0$.
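A quick numerical check of the identity $\|\nabla x\|_{W}^{2}=2x^{T}\Delta x$ and of the constant eigenvector, on a small random graph (the graph itself is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
W = rng.random((m, m)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # symmetric, nonnegative
L = np.diag(W.sum(axis=1)) - W                                     # Delta = D - W

x = rng.standard_normal(m)
energy = sum(W[i, j] * (x[i] - x[j]) ** 2 for i in range(m) for j in range(m))
assert np.isclose(energy, 2 * x @ L @ x)   # Dirichlet energy = 2 x^T Delta x

vals, vecs = np.linalg.eigh(L)
assert np.isclose(vals[0], 0.0)            # smallest eigenvalue is 0 ...
print(vecs[:, 0] * np.sqrt(m))             # ... with a constant eigenvector (up to sign)
```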

Each successive

$$\phi_{i}=\operatorname{argmin}_{x\in\mathbb{R}^{m},\,\|x\|=1,\,x\perp\{\phi_{0},\cdots,\phi_{i-1}\}}\|\nabla x\|_{W}^{2}$$

is an eigenvector of $L$, and the eigenvalues $\lambda_{i}$ allow the smoothness of a vector $x$ to be read off from the coefficients of $x$ in the basis $[\phi_{0},\cdots,\phi_{m-1}]$, just as from the Fourier coefficients of a signal defined on a grid. Filters are therefore multipliers on the eigenvalues of $\Delta$, which reduces the number of parameters of a filter from $m^{2}$ to $m$.

Two related results are:

  1. Functions that are smooth relative to the grid metric have coefficients with quick decay in the basis of eigenvectors of $\Delta$.
  2. The eigenvectors of the subsampled Laplacian are the low-frequency eigenvectors of $\Delta$.
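Result 1 can be observed numerically on a path graph, a minimal stand-in for a grid (the two test signals are arbitrary choices):

```python
import numpy as np

m = 64
W = np.zeros((m, m))
idx = np.arange(m - 1)
W[idx, idx + 1] = W[idx + 1, idx] = 1.0   # unweighted path graph
L = np.diag(W.sum(axis=1)) - W
_, Phi = np.linalg.eigh(L)                # columns ordered from low to high frequency

smooth = np.sin(2 * np.pi * np.arange(m) / m)        # slowly varying signal
noisy = np.random.default_rng(0).standard_normal(m)  # no smoothness at all
print(np.abs(Phi.T @ smooth)[:8])   # energy concentrated in the first few coefficients
print(np.abs(Phi.T @ noisy)[:8])    # energy spread across the whole spectrum
```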

Extending Convolutions via the Laplacian Spectrum

Theoretical Formulation

Given two functions $f,g:\Omega\rightarrow\mathbb{R}$, with $\Omega$ a Euclidean domain, the convolution of $f$ and $g$ is defined to be

$$(f\star g)(t)=\int_{\Omega}f(s)g(t-s)\,ds,$$

provided that $\int_{\Omega}|f(s)g(t-s)|\,ds<\infty$.

A famous result related to convolution is the convolution theorem: $\widehat{f\star g}=\hat{f}\cdot\hat{g}$, where $\hat{\ }$ denotes the Fourier transform of a function and $\cdot$ denotes elementwise multiplication.

The Fourier transform of $f\in L^{1}(\mathbb{R})$ is defined by

$$\hat{f}(x)=\int_{\mathbb{R}}f(y)e^{-ixy}\,dy.$$
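In the discrete setting the theorem can be checked directly with the DFT, where the convolution is circular (a quick sanity check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f, g = rng.standard_normal(n), rng.standard_normal(n)

# Circular convolution: (f * g)(t) = sum_s f(s) g((t - s) mod n)
conv = np.array([sum(f[s] * g[(t - s) % n] for s in range(n)) for t in range(n)])
assert np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g))
```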

The description above is for the Euclidean case. To generalize the notion of convolution to the non-Euclidean case, one possibility is to use the convolution theorem as a definition.

Since $\Delta$ admits, on a compact domain, an eigendecomposition with a discrete set of orthonormal eigenfunctions $\phi_{0},\phi_{1},\cdots$, and since these are the smoothest functions in the sense of the Dirichlet energy, they can be interpreted as a generalization of the standard Fourier basis.

We can then define the convolution of fff and ggg on a graph as

$$(f\star g)(x)=\sum_{i\geq 0}\langle f,\phi_{i}\rangle_{L^{2}(\mathcal{X})}\,\langle g,\phi_{i}\rangle_{L^{2}(\mathcal{X})}\,\phi_{i}(x),$$

where $\mathcal{X}$ is a graph.

For two vectors $\mathbf{f}=(f_{1},\cdots,f_{n})^{T}$ and $\mathbf{g}=(g_{1},\cdots,g_{n})^{T}$, the convolution of $\mathbf{f}$ and $\mathbf{g}$ can be calculated by

$$\begin{aligned} f\star g&=\begin{bmatrix} g_{n} & g_{n-1} & \cdots & g_{1}\\ g_{1} & g_{n} & \cdots & g_{n-1}\\ \vdots & \vdots & \ddots & \vdots\\ g_{n-1} & g_{n-2} & \cdots & g_{n} \end{bmatrix}\begin{bmatrix} f_{1}\\ \vdots\\ f_{n}\end{bmatrix}\\ &=\Phi\left[(\Phi^{T}g)\circ(\Phi^{T}f)\right]\\ &=\Phi\operatorname{diag}(\hat{g}_{1},\cdots,\hat{g}_{n})\Phi^{T}f, \end{aligned}$$

where $\Phi=(\phi_{1},\cdots,\phi_{n})$. Note that this formulation depends on the choice of $\Phi$.
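A sketch of this definition in NumPy, checking the diagonal-multiplier form against the inner-product form; the random weighted graph is an arbitrary stand-in:

```python
import numpy as np

def graph_conv(f, g, Phi):
    """(f * g)(x) = sum_i <f, phi_i> <g, phi_i> phi_i(x) = Phi [(Phi^T g) o (Phi^T f)]."""
    return Phi @ ((Phi.T @ g) * (Phi.T @ f))

rng = np.random.default_rng(0)
m = 6
W = rng.random((m, m)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
_, Phi = np.linalg.eigh(L)       # orthonormal Laplacian eigenbasis

f, g = rng.standard_normal(m), rng.standard_normal(m)
g_hat = Phi.T @ g
assert np.allclose(graph_conv(f, g, Phi),
                   Phi @ np.diag(g_hat) @ Phi.T @ f)   # Phi diag(g_hat) Phi^T f
```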

Practice

Let $\Phi$ be the matrix of eigenvectors of $L$, ordered by eigenvalue. A convolutional layer transforming node features, $\mathbb{R}^{|\Omega|\times f_{k-1}}\rightarrow\mathbb{R}^{|\Omega|\times f_{k}}$, $x_{k}\mapsto x_{k+1}$, can be defined as

$$x_{k+1,j}=h\left(\sum_{i=1}^{f_{k-1}}\Phi F_{k,i,j}\Phi^{T}x_{k,i}\right)=h\left(\Phi\sum_{i=1}^{f_{k-1}}F_{k,i,j}\Phi^{T}x_{k,i}\right),\quad j=1,\cdots,f_{k},$$

where $F_{k,i,j}$ is a diagonal matrix and $h$ is a real-valued nonlinearity. Note that, by the preceding theory, this update rule corresponds to $f_{k}\times f_{k-1}$ filters and $f_{k}\times f_{k-1}\times|\Omega|$ parameters.

Often only the first $d$ eigenvectors of the Laplacian are useful, as they carry the smooth geometry of the graph. We can treat $d$ as a hyperparameter and replace $\Phi$ in the equation above by $\Phi_{d}$, which keeps only the first $d$ columns of $\Phi$.
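A minimal sketch of this layer with the truncated basis $\Phi_{d}$; the array shapes and the einsum layout are my assumptions, not the authors' code:

```python
import numpy as np

def spectral_layer(x, F_diag, Phi_d, h=np.tanh):
    """x_{k+1,j} = h(Phi_d sum_i diag(F_{k,i,j}) Phi_d^T x_{k,i}).

    x      : (m, f_prev) node features
    F_diag : (f_prev, f_next, d) learned diagonal spectral multipliers
    Phi_d  : (m, d) first d Laplacian eigenvectors
    """
    x_hat = Phi_d.T @ x                          # (d, f_prev) spectral coefficients
    s = np.einsum('ijd,di->dj', F_diag, x_hat)   # filter, then sum over input maps
    return h(Phi_d @ s)                          # back to the node domain

# Usage with Phi from np.linalg.eigh(L): Phi_d = Phi[:, :d]
```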

This construction can suffer when the individual high-frequency eigenvectors are not meaningful but a cohort of high-frequency eigenvectors carries meaningful information. It is also not obvious how to perform the forward and backward passes efficiently while applying the nonlinearity on the spatial side.

The construction above requires $O(|\Omega|)$ parameters per filter. One possibility to use $O(1)$ parameters instead is to employ

$$\operatorname{diag}(F_{k,i,j})=\mathcal{K}\alpha_{k,i,j},$$

where $\mathcal{K}$ is a fixed $d\times q_{k}$ cubic spline kernel and $\alpha_{k,i,j}$ are the $q_{k}$ spline coefficients. The theory for this part seems to be ambiguous.
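One hypothetical way to realize such a kernel: interpolate $q_{k}$ control values across the $d$ frequencies with a fixed cubic spline, so only the $\alpha$'s are learned. SciPy's CubicSpline is my choice here; the paper does not specify the exact construction:

```python
import numpy as np
from scipy.interpolate import CubicSpline

d, q = 32, 5
control_x = np.linspace(0, 1, q)   # fixed control locations on the frequency axis
freqs = np.linspace(0, 1, d)

# Fixed d x q kernel K: column j is the cubic spline through the j-th unit
# vector, so by linearity K @ alpha interpolates the control values alpha.
K = np.column_stack([CubicSpline(control_x, np.eye(q)[:, j])(freqs)
                     for j in range(q)])

alpha = np.random.default_rng(0).standard_normal(q)  # the only learned parameters
diag_F = K @ alpha   # diag(F_{k,i,j}) = K alpha_{k,i,j}: smooth spectral multipliers
```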

Link to paper: https://arxiv.org/abs/1312.6203