Neural ODE


Introduction

With motivations from deep learning, neuroscience, and mathematics, neural ODE is an attempt to replace layers of neural networks with a continuous-depth model enabled by ODE solvers. Gradients with respect to the model parameters are computed by solving a second, augmented ODE with the adjoint method, making training more efficient in both memory and computation. The model also offers a controllable time cost at test time and has potential applications in continuous time-series models.

Motivations

Deep Learning Motivation: ResNet

Given a hidden layer that maps $\textbf{x}$ to $\textbf{y}$, a ResNet block enforces $\textbf{y}=F(\textbf{x})+\textbf{x}$. The motivation is that if additional layers cannot do better, the block can simply learn $F(\textbf{x})=0$.
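As a minimal illustrative sketch (not code from the paper), a residual block of this form could be written in PyTorch as follows, where F is any small sub-network:

import torch.nn as nn

class ResidualBlock(nn.Module):
    # computes y = x + F(x); if extra depth does not help, F can learn to output 0
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.F(x)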

Mathematical Motivation: Ordinary Differential Equations (ODE)

Definition (ODE): Equations taking the form of

\begin{equation} G(\textbf{x},F(\textbf{x}),F'(\textbf{x}),\cdots,F^{(n)}(\textbf{x}))=0 \end{equation}

ODEs can be helpful for describing a continuous-time system. Consider $\textbf{z}:t\mapsto\textbf{z}(t),\ \mathbb{R}\rightarrow\mathbb{R}^{n}$:

\begin{equation} G(t,\textbf{z}(t), \textbf{z}'(t))=0\Longleftrightarrow \textbf{z}'(t)=f(\textbf{z}(t), t) \end{equation}

With ODEs, we are usually interested in solving an initial value problem:

Given

\begin{equation} \begin{cases} \textbf{z}(0)\\ \textbf{z}'(t)=f(\textbf{z}(t), t) \end{cases} \end{equation}

we want to know $\textbf{z}(t)=\textbf{z}(0)+\int_{0}^{t}\textbf{z}'(s)\,ds$.

When it is difficult or impossible to solve for $\textbf{z}(t)$ in closed form, we can numerically approximate it at the points of interest. This is what people working in numerical differential equations / computational mathematics do.

The most basic method is Euler's method: $\textbf{z}(t_{0}+h)\approx\textbf{z}(t_{0})+h\,\textbf{z}'(t_0)$.
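As a minimal sketch in plain Python (assuming some dynamics function f(z, t) is given), Euler's method just repeats this update with a fixed step size h:

def euler_solve(f, z0, t0, t1, h):
    # approximate z(t1) from z(t0) = z0 via z(t + h) ≈ z(t) + h * f(z, t)
    z, t = z0, t0
    while t < t1:
        z = z + h * f(z, t)
        t = t + h
    return z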

Comparing with ResNet, $\textbf{y}=\textbf{x}+F(\textbf{x})$, we can view each residual layer as performing one Euler step $\textbf{z}(t)\mapsto\textbf{z}(t+1)$ with step size $h=1$; the final output is just $\textbf{z}(T)$.

But it seems that $h$ can take any real value. Can we then have an infinite number of layers?

Yes!

Neuroscience Motivation

Real brains are continuous-time systems, hence early neural network research took continuous-time systems as a starting point.

Artificial neural networks today model the brain with discretization.

Proposal

We can modify ResNet to make it more ODE-like, turning it into a continuous-time system!

def f(z, t, theta):
    # per-layer dynamics: layer t has its own parameters theta[t]
    return nnet(z, theta[t])

def resnet(z, theta):
    for t in range(T):
        z = z + f(z, t, theta)
    return z
\begin{equation} \Downarrow \end{equation}
def f(z, t, theta):
    # shared dynamics: time t becomes an input and the parameters theta are shared
    return nnet([z, t], theta)

def resnet(z, theta):
    for t in range(T):
        z = z + f(z, t, theta)
    return z

We can then replace resnet with an ODE solver, which approximates $\textbf{z}(t_1)=\textbf{z}(t_0)+\int_{t_{0}}^{t_1}f(\textbf{z}(t), t, \theta)\,dt$.
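For concreteness, here is a hedged sketch of what this could look like with the authors' torchdiffeq library; the ODEFunc network below is a made-up example, not the architecture from the paper:

import torch
import torch.nn as nn
from torchdiffeq import odeint  # odeint_adjoint gives constant-memory gradients

class ODEFunc(nn.Module):
    # dynamics f([z, t], theta): time enters as an extra input, parameters are shared
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, z):
        t_col = torch.full((z.shape[0], 1), float(t))
        return self.net(torch.cat([z, t_col], dim=1))

func = ODEFunc(dim=64)
z0 = torch.randn(32, 64)            # batch of initial states z(t0)
t = torch.tensor([0.0, 1.0])        # integrate from t0 = 0 to t1 = 1
z1 = odeint(func, z0, t)[-1]        # continuous-depth "forward pass"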

BP through ODE Solvers

We want to optimize

\begin{equation} L(\text{ODESolve}(\textbf{z}(t_0), f, t_0, t_1, \theta)) \end{equation}

by changing its parameters $\textbf{z}(t_0), t_0, t_1, \theta$.

Direct backpropagation through the solver's operations is slow, stores every intermediate step (a large memory cost), and introduces additional numerical error.

The adjoint sensitivity method instead computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers.

The key results from the proof are as follows:

  • Let $\textbf{a}(t)=\frac{\partial L}{\partial \textbf{z}(t)}$ (the adjoint state).

  • $\frac{d\textbf{a}(t)}{dt}=-\textbf{a}(t)^{T}\frac{\partial f(\textbf{z}(t), t, \theta)}{\partial\textbf{z}}$

  • $\frac{dL}{d\theta}=-\int_{t_1}^{t_0}\textbf{a}(t)^{T}\frac{\partial f(\textbf{z}(t),t,\theta)}{\partial \theta}\,dt$

When the loss depends on intermediate states, the reverse-mode derivative must be broken into a sequence of separate solves.
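For a loss measured only at the final time $t_1$, these pieces can be combined into a single augmented ODE solved backwards from $t_1$ to $t_0$ (essentially a restatement of Algorithm 1 in the paper):

\begin{equation} \frac{d}{dt}\begin{bmatrix}\textbf{z}(t)\\ \textbf{a}(t)\\ \frac{\partial L}{\partial\theta}(t)\end{bmatrix} = \begin{bmatrix} f(\textbf{z}(t), t, \theta)\\ -\textbf{a}(t)^{T}\frac{\partial f(\textbf{z}(t), t, \theta)}{\partial\textbf{z}}\\ -\textbf{a}(t)^{T}\frac{\partial f(\textbf{z}(t), t, \theta)}{\partial\theta}\end{bmatrix}, \qquad \begin{bmatrix}\textbf{z}(t_1)\\ \textbf{a}(t_1)\\ \frac{\partial L}{\partial\theta}(t_1)\end{bmatrix} = \begin{bmatrix}\textbf{z}(t_1)\\ \frac{\partial L}{\partial\textbf{z}(t_1)}\\ \textbf{0}\end{bmatrix} \end{equation}

Integrating this system back to $t_0$ yields $\textbf{z}(t_0)$, $\frac{\partial L}{\partial\textbf{z}(t_0)}$, and $\frac{dL}{d\theta}$ in one pass.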

Experiments in Supervised Learning

The authors compared the performance of a small residual network, which downsamples the input twice and then applies 6 residual blocks; in the ODE-Net variant, the residual blocks are replaced by an ODESolve module. They also tested a network that backpropagates directly through a Runge-Kutta integrator (a numerical method for iteratively solving nonlinear ODEs), referred to as RK-Net.

In the figure above, $L$ denotes the number of layers in the ResNet and $\tilde{L}$ denotes the number of function evaluations that the ODE solver requests in a single forward pass, which can be interpreted as an implicit number of layers.

For reference, a neural net with a single hidden layer of 300 units has around the same number of parameters as the ODE-Net and RK-Net architectures.

One advantage of using ODE solvers is that many of them approximately ensure that the output is within a given tolerance of the true solution.

The authors verify in Figure 3a that the error can indeed be controlled. They also show in Figure 3b that tuning the tolerance gives a trade-off between accuracy and computational cost.
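As a small illustration of this trade-off (a sketch using scipy rather than the authors' code, with made-up dynamics), the tolerance is just an argument to the solver, and the solver reports how many function evaluations it needed:

import numpy as np
from scipy.integrate import solve_ivp

def f(t, z):
    # made-up dynamics for illustration only
    return np.tanh(z)

z0 = np.random.randn(64)

accurate = solve_ivp(f, (0.0, 1.0), z0, rtol=1e-8, atol=1e-10)  # tight tolerance
fast = solve_ivp(f, (0.0, 1.0), z0, rtol=1e-2, atol=1e-4)       # loose tolerance for cheap inference

print(accurate.nfev, fast.nfev)  # number of function evaluations: the implicit "depth"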

Figure 3c shows that the number of function evaluations in the backward pass is roughly half of that in the forward pass. This suggests that the adjoint sensitivity method is not only more memory-efficient, but also more computationally efficient than direct backpropagation.

Note that it is not clear how to define the "depth" of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. Figure 3d shows that the number of function evaluations increases throughout training, presumably adapting to the increasing complexity of the model. Note that Duvenaud has mentioned that, in the end, their model can be 2-4x slower than a ResNet.

Continuous Normalizing Flows

The discretized equation

\begin{equation} \textbf{h}_{t+1}=\textbf{h}_{t}+f(\textbf{h}_{t}, \theta_t) \end{equation}

also appears in normalizing flows and the NICE framework. These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function $f$:

\begin{equation} \textbf{z}_1=f(\textbf{z}_0) \Longrightarrow \log p(\textbf{z}_1)=\log p(\textbf{z}_0)-\log\left|\det\frac{\partial f}{\partial \textbf{z}_0}\right| \end{equation}

Generally, the main bottleneck to using the change of variables formula is computing the determinant of the Jacobian $\frac{\partial f}{\partial \textbf{z}}$, which has a cubic cost in either the dimension of $\textbf{z}$ or the number of hidden units.

Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant.

Theorem (Instantaneous Change of Variables). Let $\textbf{z}(t)$ be a finite continuous random variable with probability $p(\textbf{z}(t))$ dependent on time. Let $\frac{d\textbf{z}}{dt}=f(\textbf{z}(t), t)$ be a differential equation describing a continuous-in-time transformation of $\textbf{z}(t)$. Assuming that $f$ is uniformly Lipschitz continuous in $\textbf{z}$ (i.e. there exists a universal Lipschitz constant) and continuous in $t$, the change in log probability also follows a differential equation,

\begin{equation} \frac{\partial \log p(\textbf{z}(t))}{\partial t}=-\text{tr}\left(\frac{df}{d\textbf{z}(t)}\right). \end{equation}

Proof. Let $\textbf{z}(t+\epsilon)=T_{\epsilon}(\textbf{z}(t))$. We assume that $f$ is Lipschitz continuous in $\textbf{z}(t)$ and continuous in $t$, so every initial value problem has a unique solution by Picard's existence theorem. We also assume $\textbf{z}(t)$ is bounded. These conditions imply that $f$, $T_{\epsilon}$, and $\frac{\partial}{\partial \textbf{z}}T_{\epsilon}$ are all bounded.

\begin{align} \frac{\partial\log p(\textbf{z}(t))}{\partial t}&=\lim_{\epsilon\rightarrow 0^{+}}\frac{\log p(\textbf{z}(t))-\log\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|-\log p(\textbf{z}(t))}{\epsilon}\\ &=-\lim_{\epsilon\rightarrow 0^{+}}\frac{\log\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|}{\epsilon}\\ &=-\lim_{\epsilon\rightarrow 0^{+}}\frac{\frac{\partial}{\partial\epsilon}\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|}{\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|} \text{ (by L'Hopital's rule)}\\ &=-\left(\lim_{\epsilon\rightarrow 0^{+}}\frac{\partial}{\partial\epsilon}\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|\right)\left(\lim_{\epsilon\rightarrow 0^{+}}\frac{1}{\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right|}\right)\\ &=-\lim_{\epsilon\rightarrow 0^{+}}\frac{\partial}{\partial\epsilon}\left|\det\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right| \end{align}

The derivative of the determinant can be expressed using Jacobi's formula, which gives

\begin{align} \frac{\partial \log p(\textbf{z}(t))}{\partial t} &= -\lim_{\epsilon\rightarrow 0^{+}}\text{tr}\left(\text{adj}\left(\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right)\frac{\partial}{\partial\epsilon}\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right)\\ &=-\text{tr}\left(\left(\lim_{\epsilon\rightarrow 0^{+}}\text{adj}\left(\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right)\right)\left(\lim_{\epsilon\rightarrow 0^{+}}\frac{\partial}{\partial \epsilon}\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right)\right)\\ &=-\text{tr}\left(\lim_{\epsilon\rightarrow 0^{+}}\frac{\partial}{\partial \epsilon}\frac{\partial}{\partial\textbf{z}}T_{\epsilon}(\textbf{z}(t))\right) \end{align}

Substituting $T_{\epsilon}$ with its Taylor series expansion around $\textbf{z}(t)$, $T_{\epsilon}(\textbf{z}(t))=\textbf{z}(t)+\epsilon f(\textbf{z}(t), t)+O(\epsilon^{2})$, the expression above equals

\begin{equation} -\text{tr}\left(\frac{\partial}{\partial\textbf{z}}f(\textbf{z}(t), t)\right) \end{equation}

This completes the proof.

The trace is a much cheaper operation, with only a squared cost in either the dimension of $\textbf{z}$ or the number of hidden units.
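A small sketch of the cost difference (hypothetical dynamics, using PyTorch's autograd to build the Jacobian explicitly for a tiny dimension):

import torch

d = 4
W = torch.randn(d, d)

def f(z):
    # hypothetical dynamics f(z, t); the time argument is omitted for brevity
    return torch.tanh(z @ W)

z = torch.randn(d)
J = torch.autograd.functional.jacobian(f, z)       # full (d, d) Jacobian

# Discrete flow: needs log|det(dz1/dz0)|, an O(d^3) operation.
sign, logabsdet = torch.slogdet(torch.eye(d) + J)  # e.g. for a residual-style step z + f(z)
# Continuous flow: only the trace of df/dz is needed, an O(d^2) operation.
dlogp_dt = -torch.trace(J)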

Generative latent function time-series model

The authors present a continuous-time generative approach to modeling time series. Their model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state $\textbf{z}_{t_0}$ and a global set of latent dynamics shared across all time series. Given observation times $t_0, t_1, \cdots, t_N$ and an initial state $\textbf{z}_{t_0}$, an ODE solver produces $\textbf{z}_{t_1}, \cdots, \textbf{z}_{t_N}$, which describe the latent state at each observation. They define this generative model formally through a sampling procedure:

\begin{align} \textbf{z}_{t_0}&\sim p(\textbf{z}_{t_0})\\ \textbf{z}_{t_1},\textbf{z}_{t_2},\cdots,\textbf{z}_{t_N} &= \text{ODESolve}(\textbf{z}_{t_0},f,\theta_f,t_0,\cdots, t_N)\\ \textbf{x}_{t_i}&\sim p(\textbf{x}|\textbf{z}_{t_i},\theta_{\textbf{x}}) \end{align}

The function $f$ is a time-invariant function that takes the value $\textbf{z}$ at the current time step and outputs the gradient: $\frac{\partial\textbf{z}(t)}{\partial t}=f(\textbf{z}(t),\theta_f)$. The function is parametrized using a neural network. Because $f$ is time-invariant, given any latent state $\textbf{z}(t)$, the entire trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
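A hedged sketch of this sampling procedure (the latent dimension, network sizes, and decoder below are made up; torchdiffeq's odeint stands in for ODESolve):

import torch
import torch.nn as nn
from torchdiffeq import odeint

latent_dim, obs_dim = 8, 2

dynamics = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(), nn.Linear(32, latent_dim))
decoder = nn.Linear(latent_dim, obs_dim)        # parametrizes p(x | z, theta_x)

def f(t, z):
    # time-invariant dynamics: t is ignored, so any z(t) determines the whole trajectory
    return dynamics(z)

t_obs = torch.linspace(0.0, 5.0, steps=20)      # observation times t_0, ..., t_N
z_t0 = torch.randn(1, latent_dim)               # sample the initial latent state z_{t_0}
z_traj = odeint(f, z_t0, t_obs)                 # latent states at every observation time
x_mean = decoder(z_traj)                        # mean of p(x | z_{t_i}); add observation noise to sample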

Benefits

Benefits of differentiable ODE solvers include:

  1. Memory efficiency: no intermediate variables are stored for the chain rule, hence a constant memory cost.
  2. Adaptive computation: the choice of ODE solver is orthogonal to the model, so different ODE solvers can be used with the same model. ODE solver theory has more than 120 years of development, and modern solvers allow a controllable trade-off between time cost and precision at test time.
  3. Parameter sharing across layers.
  4. Easier computation for changing variables.
  5. Continuous time-series models.

The authors compute gradients using the adjoint method, which directly approximates the gradient, rather than differentiating the approximation. A proof can be found in the original paper.
