
arXiv:2012.05760v1 [cs.LG] 10 Dec 2020

Notes on Deep Learning Theory

Evgenii (Eugene) Golikov


Neural Networks and Deep Learning lab.
Moscow Institute of Physics and Technology
Moscow, Russia
golikov.ea@mipt.ru

December 11, 2020


Abstract

These are the notes for the lectures that I was giving during Fall 2020 at the Moscow Institute of Physics and Technology (MIPT) and at the Yandex School of Data Analysis (YSDA). The notes cover some aspects of initialization, loss landscape, generalization, and neural tangent kernel theory. While many other topics (e.g. expressivity, mean-field theory, the double descent phenomenon) are missing from the current version, we plan to add them in future revisions.
Contents

1 Introduction
  1.1 Generalization ability
  1.2 Global convergence
  1.3 From weight space to function space

2 Initialization
  2.1 Preserving the variance
    2.1.1 Linear case
    2.1.2 ReLU case
    2.1.3 Tanh case
  2.2 Dynamical stability
    2.2.1 Linear case
    2.2.2 ReLU case
  2.3 GD dynamics for orthogonal initialization

3 Loss landscape
  3.1 Wide non-linear nets
    3.1.1 Possible generalizations
  3.2 Linear nets
  3.3 Local convergence guarantees
    3.3.1 Limitations of the result

4 Generalization
  4.1 Uniform bounds
    4.1.1 Upper-bounding the supremum
    4.1.2 Upper-bounding the Rademacher complexity
    4.1.3 Failure of uniform bounds
  4.2 PAC-bayesian bounds
    4.2.1 At most countable case
    4.2.2 General case
    4.2.3 Applying PAC-bayesian bounds to deterministic algorithms

5 Neural tangent kernel
  5.1 Gradient descent training as a kernel method
    5.1.1 Exact solution for a square loss
    5.1.2 Convergence to a gaussian process
    5.1.3 The kernel diverges at initialization
    5.1.4 NTK parameterization
    5.1.5 GD training and posterior inference in gaussian processes
  5.2 Stationarity of the kernel
    5.2.1 Finite width corrections for the NTK
    5.2.2 Proof of Conjecture 1 for linear nets
  5.3 GD convergence via kernel stability
    5.3.1 Component-wise convergence guarantees and kernel alignment
Chapter 1

Introduction

Machine learning aims to solve the following problem:

R(f) → min_{f ∈ F}.  (1.1)

Here R(f) = E_{x,y∼D} r(y, f(x)) is the true risk of a model f from a class F, and D is a data distribution. However, we do not have access to the true data distribution; instead, we have a finite set of i.i.d. samples from it: S_n = {(x_i, y_i)}_{i=1}^n ∼ D^n. For this reason, instead of approaching (1.1), we substitute it with empirical risk minimization:

R̂_n(f) → min_{f ∈ F},  (1.2)

where R̂_n(f) = E_{x,y∈S_n} r(y, f(x)) is the empirical risk of a model f from the class F.

1.1 Generalization ability


How does the solution of (1.2) relate to (1.1)? In other words, we aim to upper-bound the difference between the two risks:

R(f̂_n) − R̂_n(f̂_n) ≤ bound(f̂_n, F, n, δ)  w.p. ≥ 1 − δ over S_n,  (1.3)

where f̂_n ∈ F is the result of training the model on the dataset S_n.
We call the bound (1.3) a-posteriori if it depends on the resulting model f̂_n, and a-priori if it does not. An a-priori bound allows one to estimate the risk difference before training, while an a-posteriori bound estimates the risk difference based on the final model.
Uniform bounds are instances of the a-priori class:

R(f̂_n) − R̂_n(f̂_n) ≤ sup_{f ∈ F} |R(f) − R̂_n(f)| ≤ ubound(F, n, δ)  w.p. ≥ 1 − δ over S_n.  (1.4)

A typical form of the uniform bound is the following:

ubound(F, n, δ) = O(√((C(F) + log(1/δ)) / n)),  (1.5)

where C(F) is a complexity of the class F.


The bound above suggests that the generalization ability, measured by the risk difference, decays as the model class becomes larger. This suggestion conforms to the classical bias-variance trade-off curve. The curve can be reproduced if we fit the Runge function with a polynomial using a train set of equidistant points; the same phenomenon can be observed for decision trees.
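
As a quick illustration (a sketch added to these notes, not an experiment from the cited papers), the following NumPy snippet fits the Runge function 1/(1 + 25x²) on equidistant points with polynomials of growing degree; the train error decreases monotonically, while the test error follows the U-shaped trade-off curve. The grid sizes and degrees are arbitrary choices:

    import numpy as np

    def runge(x):
        return 1.0 / (1.0 + 25.0 * x**2)

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1.0, 1.0, 15)      # equidistant train set
    x_test = rng.uniform(-1.0, 1.0, 1000)     # held-out points
    y_train, y_test = runge(x_train), runge(x_test)

    for degree in (1, 3, 5, 9, 14):
        coefs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")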
A typical notion of model class complexity is VC-dimension [Vapnik and Chervonenkis, 1971]. For neural networks, VC-dimension grows at least linearly with the number of parameters [Bartlett et al., 2019]. Hence the bound (1.5) becomes vacuous for large enough nets. However, as we observe, the empirical (train) risk R̂_n vanishes, while the true (test) risk saturates for large enough width (see Figure 1 of [Neyshabur et al., 2015]).

One might hypothesize that the problem is in VC-dimension, which overestimates the complexity of neural nets. However, the problem turns out to be in uniform bounds in general. Indeed, if the class F contains a bad network, i.e. a network that perfectly fits the train data but fails desperately on the true data distribution, the uniform bound (1.4) becomes at least nearly vacuous. In realistic scenarios, such a bad network can be found explicitly: [Zhang et al., 2016] demonstrated that practically-sized nets can fit data with random labels; similarly, these nets can fit the training data plus some additional data with random labels. Such nets fit the training data perfectly but generalize poorly.
Up to this point, we know that among the networks with zero training risk, some nets generalize well, while some generalize poorly. Suppose we managed to come up with some model complexity measure that is symptomatic of poor generalization: bad nets have higher complexity than good ones. If we did, we could come up with a better bound by prioritizing less complex models.
Such prioritization is naturally supported by the PAC-bayesian paradigm. First, we come up with a prior distribution P over models. This distribution should not depend on the train dataset S_n. Then we build a posterior distribution Q | S_n over models based on the observed data. For instance, if we fix random seeds, a usual network training procedure gives a posterior distribution concentrated in a single model f̂_n. The PAC-bayesian bound [McAllester, 1999b] takes the following form:

R(Q | S_n) − R̂_n(Q | S_n) ≤ O(√((KL(Q | S_n ‖ P) + log(1/δ)) / n))  w.p. ≥ 1 − δ over S_n,  (1.6)

where R(Q) is the expected risk for models sampled from Q, and similarly for R̂_n(Q). If more complex models are less likely to be found, then we can embed this information into the prior, thus making the KL-divergence typically smaller.
The PAC-bayesian bound (1.6) is an example of an a-posteriori bound, since the bound depends on Q. However, it is possible to obtain an a-priori bound using the same paradigm [Neyshabur et al., 2018].
The bound (1.6) becomes better when our training procedure tends to find models that are probable according to the prior. But what kind of models does gradient descent typically find? Does it implicitly minimize some complexity measure of the resulting model? Despite the existence of bad networks, minimizing the train loss using gradient descent typically reveals well-performing solutions. This phenomenon is referred to as the implicit bias of gradient descent.
Another problem with a-priori bounds is that they are all effectively two-sided: all of them bound the absolute value of the risk difference, rather than the risk difference itself. Two-sided bounds fail if there exist networks that generalize well while failing on a given train set. [Nagarajan and Kolter, 2019] have constructed a problem for which such networks are typically found by gradient descent.

1.2 Global convergence


We have introduced the empirical risk minimization problem (1.2) because we were not able to minimize the true risk directly: see (1.1). But are we able to minimize the empirical risk? Let f(x; θ) be a neural net evaluated at input x with parameters θ. Consider a loss function ℓ that is a convex surrogate of the risk r. Then minimizing the train loss will imply empirical risk minimization:

L̂_n(θ) = E_{x,y∈S_n} ℓ(y, f(x; θ)) → min_θ.  (1.7)

Neural nets are complex non-linear functions of both inputs and weights; we can hardly expect the loss landscape L̂_n(θ) induced by such functions to be simple. At the very least, for non-trivial neural nets L̂_n is a non-convex function of θ. Hence it can have local minima that are not global.
The most widely-used method of solving the problem (1.7) for deep learning is gradient descent (GD), or one of its variants. Since GD is a local method, it cannot have any global convergence guarantees in the general case. However, for practically-sized neural nets it always succeeds in finding a global minimum.
Given this observation, it is tempting to hypothesize that despite the non-convexity, all local minima of L̂_n(θ) are global. This turns out to be true for linear nets [Kawaguchi, 2016, Lu and Kawaguchi, 2017, Laurent and Brecht, 2018], and for non-linear nets if they are sufficiently wide [Nguyen, 2019].
While globality of local minima implies almost sure convergence of gradient descent [Lee et al., 2016, Panageas and Piliouras, 2017], there are no guarantees on convergence speed. Generally, convergence speed depends on initialization. For instance, initializing linear nets orthogonally makes the optimization speed independent of depth [Saxe et al., 2013]. An ill-chosen initialization may heavily slow down the optimization process. [Glorot and Bengio, 2010, He et al., 2015] proposed heuristics for preventing such situations.

1.3 From weight space to function space


Consider the optimization problem (1.7). The gradient descent dynamics for this problem looks as follows:

θ̇_t = −η E_{x,y∈S_n} [∂ℓ(y, z)/∂z |_{z=f(x;θ_t)} ∇_θ f(x; θ_t)].  (1.8)

This dynamics is very complex due to the non-linearity of f(x; θ) as a function of θ. Let us multiply both sides of (1.8) by ∇_θ^T f(x′; θ_t):

ḟ_t(x′) = −η E_{x,y∈S_n} [∂ℓ(y, z)/∂z |_{z=f_t(x)} K_t(x′, x)],  (1.9)

where f_t(x) = f(x; θ_t), and K_t(x′, x) is a tangent kernel:

K_t(x′, x) = ∇_θ^T f(x′; θ_t) ∇_θ f(x; θ_t).  (1.10)

Generally, the kernel is stochastic and evolves with time. For this reason, dynamics (1.9) is not completely defined. However, if the network is parameterized in a certain way, the kernel K_t converges to a stationary deterministic kernel K̄_0 as the number of hidden units (width) goes to infinity [Jacot et al., 2018].
If the kernel is stationary and deterministic, the dynamics (1.9) is much simpler than (1.8). Indeed, for square loss (1.9) is a linear ODE, which can be solved analytically [Lee et al., 2019], while (1.8) still remains non-linear. This allows us to prove several results on convergence and generalization for large enough nets [Du et al., 2019, Arora et al., 2019a]. Indeed, for a large enough network, its kernel is almost deterministic, and one has to prove that it is almost stationary. Given this, we can transfer results from stationary deterministic kernels of infinitely wide nets to sufficiently wide ones.
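
To make the object (1.10) concrete, here is a small self-contained sketch (an illustration added to these notes; all sizes are arbitrary) that computes the empirical tangent kernel of a two-layer tanh network with scalar output by assembling the parameter gradients explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 256
    W0 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_hid, n_in))
    w1 = rng.normal(0.0, 1.0 / np.sqrt(n_hid), n_hid)

    def grad_f(x):
        # gradient of f(x) = w1^T tanh(W0 x) w.r.t. all parameters
        a = np.tanh(W0 @ x)
        g_W0 = np.outer(w1 * (1.0 - a**2), x)   # df/dW0 by the chain rule
        g_w1 = a                                 # df/dw1
        return np.concatenate([g_W0.ravel(), g_w1])

    xs = [rng.normal(size=n_in) for _ in range(4)]
    grads = [grad_f(x) for x in xs]
    K = np.array([[g1 @ g2 for g2 in grads] for g1 in grads])
    print(np.round(K, 3))    # K_0(x', x) on four sample inputs

For wide networks this Gram matrix concentrates around a deterministic limit, which is the content of [Jacot et al., 2018].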
Kernels of realistic finite-sized networks turn out to be non-stationary. It is possible to take this effect into
account by introducing finite-width corrections [Huang and Yau, 2019, Dyer and Gur-Ari, 2020].

Chapter 2

Initialization

2.1 Preserving the variance


Consider a network with L hidden layers and no biases:

f(x) = W_L φ(W_{L−1} . . . φ(W_0 x)),  (2.1)

where W_l ∈ R^{n_{l+1}×n_l} and the non-linearity φ is applied element-wise. Note that x ∈ R^{n_0}; we denote by k = n_{L+1} the dimensionality of the output: f : R^{n_0} → R^k.
We shall refer to n_l as the width of the l-th hidden layer. Denote n = n_1, and take n_l = α_l n ∀l ∈ [L]. We shall refer to n as the width of the network, and keep the α-factors fixed.
Consider a loss function ℓ(y, z). We try to minimize the average loss of our model: L = E_{x,y} ℓ(y, f(x)).
Let us assume both x and y are fixed. Define:

h_1 = W_0 x ∈ R^{n_1},    x_l = φ(h_l) ∈ R^{n_l},    h_{l+1} = W_l x_l ∈ R^{n_{l+1}}  ∀l ∈ [L].  (2.2)

Hence, given x, f(x) = h_{L+1}.
This is the forward dynamics; we can express the backward dynamics similarly. Define the loss gradient with respect to the hidden representation:

g_l = ∂ℓ(y, h_{L+1}) / ∂h_l ∈ R^{n_l}  ∀l ∈ [L + 1].  (2.3)

We have then:

g_l = D_l W_l^T g_{l+1}  ∀l ∈ [L],    g_{L+1} = ∂ℓ(y, h) / ∂h |_{h=h_{L+1}},  (2.4)

where D_l = diag(φ′(h_l)).
Using the backward dynamics, we are able to compute gradients w.r.t. the weights:

∇_l = ∂ℓ(y, h_{L+1}) / ∂W_l ∈ R^{n_{l+1}×n_l}  ∀l ∈ [L]_0.  (2.5)

Then,

∇_l = g_{l+1} x_l^T  ∀l ∈ [L]_0.  (2.6)

Assume the weights are initialized with zero mean and layer-dependent variance v_l:

E W_{l,0}^{ij} = 0,    Var W_{l,0}^{ij} = v_l.  (2.7)

Note that ∀l ∈ [L + 1] all components of the vector h_l are i.i.d. with zero mean. Let q_l be their variance:

q_l = Var h_l^i = E (h_l^i)² = (1/n_l) E h_l^T h_l.  (2.8)

The same holds for g_l; we denote its variance by δ_l:

δ_l = Var g_l^i = E (g_l^i)² = (1/n_l) E g_l^T g_l.  (2.9)

2.1.1 Linear case
Consider first φ(h) = h.
Consider the following two properties of the initialization:
1. Normalized forward dynamics: q_l depends neither on n_{0:l−1} nor on l, ∀l ∈ [L + 1].
2. Normalized backward dynamics: δ_l depends neither on n_{l+1:L+1} nor on l, ∀l ∈ [L + 1].
The first property implies that the model stays finite at the initialization irrespective of width n and depth L.
The two properties combined imply finite weight gradients at the initialization: Var ∇_l^{ij} depends neither on n_{0:L+1} nor on l, ∀l ∈ [L]_0.
In turn, these two imply that weight increments stay finite at the initialization irrespective of width n and depth L if we consider training with SGD:

ΔW_l = −η E_{x,y} ∇_l(x, y).  (2.10)

Since the initial weights have zero mean, all hidden representations h_l have zero mean too, and due to initial weight independence:

q_{l+1} = (1/n_{l+1}) E h_l^T W_l^T W_l h_l = v_l E h_l^T h_l = n_l v_l q_l  ∀l ∈ [L],    q_1 = (1/n_1) E x^T W_0^T W_0 x = ‖x‖_2² v_0 ∝ n_0 v_0.  (2.11)

Hence the forward dynamics is normalized if v_l = n_l^{−1} ∀l ∈ [L]_0.
We can compute variances of the gradients w.r.t. hidden representations in a similar manner:

δ_l = (1/n_l) E g_{l+1}^T W_l W_l^T g_{l+1} = v_l E g_{l+1}^T g_{l+1} = n_{l+1} v_l δ_{l+1}  ∀l ∈ [L − 1],  (2.12)

δ_L = (1/n_L) E g_{L+1}^T W_L W_L^T g_{L+1} = ‖∂ℓ(y, h)/∂h‖_2² v_L ∝ n_{L+1} v_L.  (2.13)

As we see, the backward dynamics is normalized if v_l = n_{l+1}^{−1} ∀l ∈ [L]_0. This means that we cannot have both the forward and the backward dynamics normalized at the same time. [Glorot and Bengio, 2010] proposed taking a harmonic mean of the variances required by the two normalization requirements:

v_l = 2 / (n_l + n_{l+1})  ∀l ∈ [L]_0.  (2.14)
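
A quick numerical check of (2.11) (a sketch added here; widths and depth are arbitrary): with v_l = 1/n_l the forward variance q_l stays constant along the layers of a deep linear net, while any other scale makes it grow or decay geometrically:

    import numpy as np

    rng = np.random.default_rng(0)
    n, depth = 500, 30
    x = rng.normal(size=n)                     # input with E x_i^2 = 1

    for scale in (0.5, 1.0, 2.0):              # v_l = scale / n_l
        h = x.copy()
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(scale / n), (n, n))
            h = W @ h                          # linear layer: phi(h) = h
        print(f"v_l = {scale}/n: q_L ≈ {np.mean(h**2):.3e} (input scale ≈ {np.mean(x**2):.3f})")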

2.1.2 ReLU case


We start with the forward dynamics:

q_{l+1} = (1/n_{l+1}) E x_l^T W_l^T W_l x_l = v_l E x_l^T x_l  ∀l ∈ [L],    q_1 = (1/n_1) E x^T W_0^T W_0 x = ‖x‖_2² v_0 ∝ n_0 v_0.  (2.15)

E x_l^T x_l = E [h_l]_+^T [h_l]_+ = (1/2) E h_l^T h_l = (1/2) n_l q_l  ∀l ∈ [L].  (2.16)

Here the second equality holds due to the symmetry of the distribution of h_l; the latter in its turn holds by induction on l.
Hence for ReLU the forward dynamics is normalized if v_l = 2 n_l^{−1}, a result first presented in [He et al., 2015].
Let us consider the backward dynamics then:

δ_l = (1/n_l) E g_{l+1}^T W_l D_l² W_l^T g_{l+1} = (1/2) v_l E g_{l+1}^T g_{l+1} = (1/2) n_{l+1} v_l δ_{l+1}  ∀l ∈ [L − 1],  (2.17)

δ_L = (1/n_L) E g_{L+1}^T W_L D_L² W_L^T g_{L+1} = (1/2) v_L E g_{L+1}^T g_{L+1} = (1/2) ‖∂ℓ(y, h)/∂h‖_2² v_L ∝ (1/2) n_{L+1} v_L.  (2.18)

Similarly, we have to take v_l = 2 n_{l+1}^{−1} for this type of normalization. Note that here we have assumed that g_{l+1} does not depend on W_l and h_l, which is not true: g_{l+1} depends on h_{l+1} through D_{l+1}, which depends on both W_l and h_l.

Again, we have a contradiction between the two normalization requirements. However, in some practical cases satisfying only one of them is enough. For instance, if we consider minimizing the cross-entropy loss, the model diverging or vanishing at the initialization does not break the optimization process. Moreover, the magnitude of hidden representations does not matter, thanks to the homogeneity of ReLU. Hence in this case normalizing the forward dynamics is not a strict requirement. On the other hand, using an optimizer that normalizes the gradient, e.g. Adam, makes backward normalization unnecessary.
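
The computation (2.15)-(2.16) is easy to verify numerically (a sketch added here; sizes are arbitrary): with v_l = 2/n_l the pre-activation variance is roughly preserved under ReLU, while the "linear" choice v_l = 1/n_l halves it at every layer:

    import numpy as np

    rng = np.random.default_rng(0)
    n, depth = 1000, 20
    x = rng.normal(size=n)

    for scale in (2.0, 1.0):                   # v_l = scale / n_l
        W0 = rng.normal(0.0, np.sqrt(scale / n), (n, n))
        h = W0 @ x                             # h_1 = W_0 x
        q1 = np.mean(h**2)
        for _ in range(depth - 1):
            W = rng.normal(0.0, np.sqrt(scale / n), (n, n))
            h = W @ np.maximum(h, 0.0)         # h_{l+1} = W_l [h_l]_+
        print(f"v_l = {scale}/n: q_1 ≈ {q1:.2f}, q_{depth} ≈ {np.mean(h**2):.3e}")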

2.1.3 Tanh case


Assume φ ∈ C³(R), φ′(z) > 0, φ(0) = 0, φ′(0) = 1, φ′′(0) = 0, φ′′′(0) < 0, and φ is bounded. The guiding example is the hyperbolic tangent:

φ(z) = (e^z − e^{−z}) / (e^z + e^{−z}).  (2.19)

In this case taking v_l = n_l^{−1} ensures that the activations x_l are neither in a linear regime (Var h_l not too small) nor in a saturated regime (Var h_l not too large). However, if we take v_l = n_{l+1}^{−1}, Var g_l still vanishes for large l, since |φ′(h)| ≤ 1. Nevertheless, [Glorot and Bengio, 2010] suggests initializing with a harmonic mean of the variances for the class of non-linearities we consider. Rationale: in this case the network is almost linear at the initialization.
Let us assume that v_l = σ_w² / n_l. Consider the forward dynamics in detail:

q_{l+1} = (1/n_{l+1}) E_{h_l} E_{W_l} φ(h_l)^T W_l^T W_l φ(h_l) = (σ_w² / n_l) E_{h_l} φ(h_l)^T φ(h_l).  (2.20)

By the Central Limit Theorem, ∀i h_l^i converges to N(0, q_l) in distribution as n_{l−1} → ∞. Note that for a fixed x, h_1 is always normally distributed. Hence by taking subsequent limits n_1 → ∞, n_2 → ∞, and so on, we come up with the following recurrent relation (see [Poole et al., 2016]):

q_{l+1} = σ_w² E_{z∼N(0,1)} φ(√q_l z)² = V(q_l | σ_w²),    q_1 = σ_w² ‖x‖_2² / n_0.  (2.21)
Let us study properties of the length map V:

V′(q | σ_w²) = σ_w² E_{z∼N(0,1)} φ(√q z) φ′(√q z) z / √q > 0.  (2.22)

The last inequality holds since φ(√q z) z > 0 for z ≠ 0 due to monotonicity of φ and since φ(0) = 0. Hence V monotonically increases. Integrating by parts (see (2.38) below),

V′(q | σ_w²) = σ_w² E_{z∼N(0,1)} φ(√q z) φ′(√q z) z / √q = σ_w² E_{z∼N(0,1)} (φ′(√q z)² + φ(√q z) φ′′(√q z)).  (2.23)

In particular,

V′(0 | σ_w²) = σ_w² E_{z∼N(0,1)} (φ′(0))² = σ_w² > 0.  (2.24)

[Poole et al., 2016] claimed that the second derivative is always negative for φ being the hyperbolic tangent, which we were not able to show. We can check it for q = 0 though:

V′′(0 | σ_w²) = 4σ_w² E_{z∼N(0,1)} φ′(0) φ′′′(0) = 4σ_w² φ′′′(0) < 0.  (2.25)

Hence, at least, V is concave at zero.


Whenever σ_w ≤ 1, q = 0 is a stable fixed point of the length map. However, for σ_w > 1, q = 0 becomes unstable; since V(∞ | σ_w²) < ∞ due to the boundedness of φ, there should be at least one stable fixed point of the length map. If we believe that V is indeed concave everywhere, this stable fixed point is unique. We denote it q_∞.
This means that, assuming L = ∞, q_l = Var h_l^i has a finite non-zero limit as n → ∞ and l → ∞ whenever σ_w > 1. We shall refer to this property as asymptotically normalized forward dynamics. Note that asymptotic and non-asymptotic forward dynamics normalizations are equivalent for linear and ReLU nets, and they hold exactly for σ_w² = 1 and σ_w² = 2, respectively.
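
The fixed point q_∞ is easy to find numerically by iterating the length map (a Monte-Carlo sketch added here, for φ = tanh; sample sizes and σ_w² values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=200_000)               # z ~ N(0, 1)

    def V(q, sigma_w2):                        # length map (2.21) for phi = tanh
        return sigma_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2)

    for sigma_w2 in (0.8, 1.0, 1.5, 2.25):
        q = 1.0
        for _ in range(200):                   # iterate q_{l+1} = V(q_l | sigma_w^2)
            q = V(q, sigma_w2)
        print(f"sigma_w^2 = {sigma_w2}: q converges to ≈ {q:.4f}")

For σ_w < 1 the iterates collapse to q = 0 (slowly so at σ_w = 1), while for σ_w > 1 they settle at the non-trivial fixed point q_∞.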
Let us proceed with the backward dynamics. Similarly to the forward case, we have:

δ_l = (1/n_l) E g_{l+1}^T W_l diag(φ′(h_l))² W_l^T g_{l+1}.  (2.26)

We cannot factorize the expectation, since g_{l+1} depends on W_{0:l} unless φ′ is constant. Nevertheless, assume that g_{l+1} does not depend on W_{0:l}. Hence it does not depend on h_l, and we have the following:

δ_l ≈ (1/n_l) E_{g_{l+1}} (g_{l+1}^T E_{W_l}(W_l E_{h_l} diag(φ′(h_l))² W_l^T) g_{l+1}) = (1/n_l) E_{h∼N(0,q_l)} (φ′(h))² E_{g_{l+1}} (g_{l+1}^T E_{W_l}(W_l W_l^T) g_{l+1}) = (σ_w²/n_l) E_{z∼N(0,1)} (φ′(√q_l z))² E g_{l+1}^T g_{l+1} = σ_w² (α_{l+1}/α_l) δ_{l+1} E_{z∼N(0,1)} (φ′(√q_l z))².  (2.27)

We have already noted that, given concavity of V, the latter has a unique stable fixed point q_∞; this also implies q_l → q_∞ as l → ∞. [Poole et al., 2016] noted that convergence to q_∞ is fast; assume then that q_l = q_∞. This allows us to express the dynamics solely in terms of δ_l:

δ_l = σ_w² (α_{l+1}/α_l) δ_{l+1} E_{z∼N(0,1)} (φ′(√q_∞ z))².  (2.28)

For simplicity assume α_l = 1 ∀l ≥ 1 (all matrices W_{1:L+1} are square). Define:

χ_1 = σ_w² E_{z∼N(0,1)} (φ′(√q_∞ z))².  (2.29)

We get (see [Schoenholz et al., 2016]):

δ_l = χ_1 δ_{l+1}.  (2.30)

Obviously, χ_1 > 1 implies exploding gradients, while χ_1 < 1 causes gradients to vanish. We shall refer to the case χ_1 = 1 as asymptotically normalized backward dynamics. Note that for linear and ReLU nets χ_1 = 1 is equivalent to σ_w² = 1 and σ_w² = 2, respectively.
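
Given q_∞, the coefficient χ_1 from (2.29) can be estimated the same way (a sketch for φ = tanh, using tanh′(h) = 1 − tanh²(h); the σ_w² values are arbitrary). χ_1 > 1 signals exploding gradients, χ_1 < 1 vanishing ones:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=200_000)

    def q_inf(sigma_w2, iters=200):
        q = 1.0
        for _ in range(iters):                 # iterate the length map (2.21)
            q = sigma_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2)
        return q

    for sigma_w2 in (1.2, 1.5, 2.5):
        q = q_inf(sigma_w2)
        chi1 = sigma_w2 * np.mean((1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)
        print(f"sigma_w^2 = {sigma_w2}: q_inf ≈ {q:.3f}, chi_1 ≈ {chi1:.3f}")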

Correlation stability
The term χ_1 has a remarkable interpretation in terms of correlation stability (see [Poole et al., 2016]). Consider two inputs, x^1 and x^2, together with their hidden representations h_l^1 and h_l^2. Define the terms of the covariance matrix of the latter two:

Σ_l = [[q_l^{11}, q_l^{12}], [q_l^{12}, q_l^{22}]];    q_l^{ab} = (1/n_l) E h_l^{a,T} h_l^b,  a, b ∈ {1, 2}.  (2.31)

Consider the correlation factor c_l^{12} = q_l^{12} / √(q_l^{11} q_l^{22}). We have already derived the dynamics of the diagonal terms in the subsequent limits of infinite width:

q_{l+1}^{aa} = σ_w² E_{z∼N(0,1)} φ(√(q_l^{aa}) z)²,    q_1^{aa} = σ_w² ‖x^a‖_2² / n_0,  a ∈ {1, 2}.  (2.32)

Consider the off-diagonal term:

q_{l+1}^{12} = (1/n_{l+1}) E_{h_l^1,h_l^2} E_{W_l} φ(h_l^1)^T W_l^T W_l φ(h_l^2) = (σ_w²/n_l) E_{h_l^1,h_l^2} φ(h_l^1)^T φ(h_l^2).  (2.33)

Taking the same subsequent limits as before, we get:

q_{l+1}^{12} = σ_w² E_{(u^1,u^2)^T∼N(0,Σ_l)} φ(u^1) φ(u^2) = σ_w² E_{(z^1,z^2)^T∼N(0,I)} φ(u_l^1(z^1)) φ(u_l^2(z^1, z^2)) = C(c_l^{12}, q_l^{11}, q_l^{22} | σ_w²),  (2.34)

where u_l^1 = √(q_l^{11}) z^1, while u_l^2 = √(q_l^{22}) (c_l^{12} z^1 + √(1 − (c_l^{12})²) z^2). We shall refer to C as the correlation map.
As before, assume that q_l^{aa} = q_∞, a ∈ {1, 2}, ∀l. This assumption results in a self-consistent dynamics of the correlation factor:

c_{l+1}^{12} = q_∞^{−1} C(c_l^{12}, q_∞, q_∞ | σ_w²).  (2.35)

Note that c^{12} = 1 is a fixed point of the c-dynamics. Indeed:

c^{12} = q_∞^{−1} C(1, q_∞, q_∞ | σ_w²) = q_∞^{−1} σ_w² E_{z∼N(0,1)} φ(√q_∞ z)² = q_∞^{−1} V(q_∞ | σ_w²) = 1.  (2.36)

In order to study its stability, we have to consider the derivative of the C-map at c^{12} = 1. Let us compute the derivative for c^{12} = c < 1 first:

∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{(z^1,z^2)^T∼N(0,I)} φ(√q_∞ z^1) φ′(√q_∞ (c z^1 + √(1−c²) z^2)) (√q_∞ (z^1 − z^2 c/√(1−c²))).  (2.37)

We shall use the following equivalence (integration by parts):

E_{z∼N(0,1)} F(z) z = (2π)^{−1/2} ∫_{−∞}^{+∞} F(z) z e^{−z²/2} dz = (2π)^{−1/2} ∫_{−∞}^{+∞} (−F(z)) d(e^{−z²/2}) = (2π)^{−1/2} ∫_{−∞}^{+∞} F′(z) e^{−z²/2} dz = E_{z∼N(0,1)} F′(z).  (2.38)

We begin by analyzing one of the parts of (2.37):

E_{z∼N(0,1)} φ′(√q_∞ (c z^1 + √(1−c²) z)) √q_∞ z c/√(1−c²) = q_∞ c E_{z∼N(0,1)} φ′′(√q_∞ (c z^1 + √(1−c²) z)).  (2.39)

Henceforth,

∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{z^1∼N(0,1)} φ(u^1(z^1)) E_{z^2∼N(0,1)} (√q_∞ z^1 φ′(u^2(z^1, z^2)) − q_∞ c φ′′(u^2(z^1, z^2))),  (2.40)

where u^1 = √q_∞ z^1, while u^2 = √q_∞ (c z^1 + √(1−c²) z^2). Consider the limit c → 1:

lim_{c→1} ∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = q_∞^{−1} σ_w² E_{z∼N(0,1)} φ(√q_∞ z) (√q_∞ z φ′(√q_∞ z) − q_∞ φ′′(√q_∞ z)).  (2.41)

Let us compute the first term first:

E_{z∼N(0,1)} φ(√q_∞ z) √q_∞ z φ′(√q_∞ z) = q_∞ E_{z∼N(0,1)} ((φ′(√q_∞ z))² + φ(√q_∞ z) φ′′(√q_∞ z)).  (2.42)

This gives the final result:

lim_{c→1} ∂c_{l+1}^{12}/∂c_l^{12} |_{c_l^{12}=c} = σ_w² E_{z∼N(0,1)} (φ′(√q_∞ z))² = χ_1.  (2.43)

We see that χ_1 drives the stability of the correlation of strongly correlated hidden representations, or, equivalently, of nearby input points. For χ_1 < 1, nearby points with c^{12} ≈ 1 become more correlated as they propagate through the layers; hence initially different points become more and more similar. We refer to this regime as ordered. In contrast, for χ_1 > 1, nearby points separate as they propagate deeper into the network. We refer to this regime as chaotic. Hence the case χ_1 = 1 is the edge of chaos.

2.2 Dynamical stability


Following [Pennington et al., 2017], let us turn our attention to the input-output jacobian:

J = ∂h_{L+1}/∂h_1 = ∏_{l=1}^{L} W_l D_l ∈ R^{n_{L+1}×n_1}.  (2.44)

We now compute the mean squared Frobenius norm of J:

E ‖J‖_F² = E tr(J^T J) = tr E_{W_{0:L}} ((∏_{l=1}^{L} W_l D_l)^T (∏_{l=1}^{L} W_l D_l)) = tr E_{W_{0:L−1}} ((∏_{l=1}^{L−1} W_l D_l)^T D_L E_{W_L}(W_L^T W_L) D_L (∏_{l=1}^{L−1} W_l D_l)) = n_{L+1} v_L tr E_{W_{0:L−1}} ((∏_{l=1}^{L−1} W_l D_l)^T D_L² (∏_{l=1}^{L−1} W_l D_l)).  (2.45)

Assuming that tr(D_l²) does not depend on W_{0:l} ∀l ∈ [L] allows us to proceed with the calculation:

E ‖J‖_F² = n_{L+1} v_L E_{h_L} tr(D_L²) v_{L−1} tr E_{W_{0:L−2}} ((∏_{l=1}^{L−2} W_l D_l)^T D_{L−1}² (∏_{l=1}^{L−2} W_l D_l)) = . . . = n_{L+1} ∏_{l=1}^{L} v_l E_{h_l} tr(D_l²).  (2.46)

Suppose we aim to normalize the backward dynamics: v_l = σ_w²/n_{l+1} ∀l ∈ [L]. Assume then (see Section 2.1.3) h_l^i ∼ N(0, q_∞) ∀l ∈ [L]. Then the calculation above gives us the mean average eigenvalue of J^T J:

(1/n_1) Σ_{i=1}^{n_1} E λ_i = (1/n_1) E ‖J‖_F² = σ_w^{2L} (E_{z∼N(0,1)} φ′(√q_∞ z)²)^L = χ_1^L.  (2.47)

Hence χ_1^L is the mean average eigenvalue of J^T J, where J is the input-output jacobian of a network of depth L.
Let us assume that our non-linearity is homogeneous: φ(βz) = βφ(z) for β > 0. This property holds for leaky ReLU with arbitrary slope; in particular, it holds in the linear case. Then we have the following:

h_{L+1} = J h_1;    q_{L+1} = (1/n_{L+1}) E ‖J h_1‖_2² = (1/n_{L+1}) h_1^T (E J^T J) h_1 = (1/n_{L+1}) Σ_{i=1}^{n_1} E λ_i (v_i^T h_1)².  (2.48)

g_1 = J^T g_{L+1};    δ_1 = (1/n_1) E ‖J^T g_{L+1}‖_2² = (1/n_1) g_{L+1}^T (E J J^T) g_{L+1} = (1/n_1) Σ_{i=1}^{n_1} E λ_i (u_i^T g_{L+1})².  (2.49)

One can perceive q_{L+1} as a mean normalized squared length of the network output. We may want to study the distribution of normalized squared lengths instead.
In this case it suffices to study the distribution of the empirical spectral density:

ρ̂(x) = (1/n_1) Σ_{i=1}^{n_1} δ(x − λ_i).  (2.50)

Besides being random, it converges to a deterministic limiting spectral density ρ as n → ∞ if we assume n_l = α_l n ∀l ∈ [L + 1] with α_l being constant.
Assume all matrices W_l are square: n_1 = . . . = n_{L+1} = n. In this case the choice v_l = 1/n normalizes both the forward and the backward dynamics. On the other hand, in the linear case the limiting spectrum can be parameterized as (see [Pennington et al., 2017]):

λ(φ) = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)).  (2.51)

We shall prove this result in the upcoming section. Notice that lim_{φ→0} λ(φ) = (L+1)^{L+1}/L^L ∼ e(L+1) for large L. Hence in this case, although we preserve lengths of input vectors on average, some of the input vectors get expanded with positive probability during forward propagation, while some get contracted. The same holds for the backward dynamics.
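
These claims are easy to probe by direct simulation (a Monte-Carlo sketch added here; n, L and the number of trials are arbitrary): sample J as a product of i.i.d. Gaussian matrices with variance 1/n and compare the largest eigenvalue of J J^T with the edge (L+1)^{L+1}/L^L predicted by (2.51):

    import numpy as np

    rng = np.random.default_rng(0)
    n, L, trials = 300, 4, 10

    tops = []
    for _ in range(trials):
        J = np.eye(n)
        for _ in range(L):
            J = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)) @ J
        tops.append(np.linalg.eigvalsh(J @ J.T).max())

    edge = (L + 1) ** (L + 1) / L ** L
    print(f"top eigenvalue ≈ {np.mean(tops):.2f} ± {np.std(tops):.2f}, predicted edge = {edge:.2f}")

Finite-size fluctuations push the empirical maximum slightly above the limiting edge.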

2.2.1 Linear case


Our goal is to compute the limiting spectrum of the matrix J J^T ∈ R^{n×n} with J = ∏_{l=1}^{L} W_l, where all W_l are composed of i.i.d. gaussians with variance 1/n; this is referred to as the product Wishart ensemble. The case of L = 1, W W^T, is known as the Wishart ensemble. The limiting spectrum of the Wishart ensemble is known as the Marchenko-Pastur law [Marchenko and Pastur, 1967]:

ρ_{WW^T}(x) = (1/(2π)) √(4/x − 1) Ind_{[0,4]}(x).  (2.52)

It is possible to derive the limiting spectrum of J J^T by using the so-called S-transform, which we shall define later. A high-level algorithm is the following. First, we compute the S-transform of the Wishart ensemble:

S_{WW^T}(z) = 1/(1 + z).  (2.53)

The S-transform has the following fundamental property. Given two asymptotically free random matrices A and B, we have [Voiculescu, 1987]¹:

S_{AB} = S_A S_B  (2.54)

in the limit n → ∞. As we shall see later, the S-transform of J_L = ∏_{l=1}^{L} W_l depends only on traces of the form n^{−1} tr((J_L J_L^T)^k), which are invariant under cyclic permutations of the matrices W_l. This allows us to compute S_{JJ^T}:

S_{JJ^T} = S_{J_L J_L^T} = S_{W_L^T W_L J_{L−1} J_{L−1}^T} = S_{W_L^T W_L} S_{J_{L−1} J_{L−1}^T} = ∏_{l=1}^{L} S_{W_l^T W_l} = S_{W^T W}^L.  (2.55)

The last equation holds since all W_l are distributed identically. The final step is to recover the spectrum of J J^T from its S-transform.

Free independence. We say that A and B are freely independent, or just free, if

τ((P_1(A) − τ(P_1(A)))(Q_1(B) − τ(Q_1(B))) . . . (P_k(A) − τ(P_k(A)))(Q_k(B) − τ(Q_k(B)))) = 0,  (2.56)

where P_i and Q_i, ∀i ∈ [k], are polynomials, while τ(A) = n^{−1} E tr(A) is an analogue of the expectation for scalar random variables. Note that τ is a linear operator and τ(I) = 1. Compare the above with the definition of classical independence:

τ((P(A) − τ(P(A)))(Q(B) − τ(Q(B)))) = 0  (2.57)

for all polynomials P and Q.
Note that two scalar-valued random variables are free iff one of them is constant; indeed:

E((ξ − Eξ)(η − Eη)(ξ − Eξ)(η − Eη)) = E((ξ − Eξ)²(η − Eη)²) = (E(ξ − Eξ)²)(E(η − Eη)²) = Var ξ Var η.  (2.58)

Hence having Var ξ = 0 or Var η = 0 is necessary; this implies ξ = const or η = const, which gives free independence. This means that the notion of free independence is too strong for scalar random variables. The reason for this is their commutativity; only non-commutative objects can have a non-trivial notion of free independence. As for random matrices with classically independent entries, they have a remarkable property: they become free in the limit n → ∞:

lim_{n→∞} τ((P_1(A_n) − τ(P_1(A_n)))(Q_1(B_n) − τ(Q_1(B_n))) . . . (P_k(A_n) − τ(P_k(A_n)))(Q_k(B_n) − τ(Q_k(B_n)))) = 0  (2.59)

for A_n, B_n ∈ R^{n×n} such that the moments τ(A_n^k) and τ(B_n^k) are finite for large n for all k ∈ N. We shall say that the two sequences {A_n} and {B_n} are asymptotically free as n → ∞.

Asymptotic free independence for Wigner matrices. In order to illustrate the above property, consider X and Y being classically independent n × n Wigner matrices, i.e. X_{ij} = X_{ji} ∼ N(0, n^{−1}), and similarly for Y. Of course, τ(X) = τ(Y) = 0, while τ(X²Y²) = n^{−1} tr(E X² E Y²) = n^{−1} tr(I) = 1. Let us compute τ(XYXY):

τ(XYXY) = (1/n) E X_{ij} Y_{jk} X_{kl} Y_{li} = (1/n³) ((δ_{ik}δ_{jl} + δ_{il}δ_{jk})(δ_{jl}δ_{ki} + δ_{ji}δ_{kl}) − Cn) = (1/n³) (n² + (3 − C)n) = O_{n→∞}(n^{−1}),  (2.60)

where C is a constant accounting for coinciding indices. This means that X and Y are not freely independent; however, it suggests that they become free in the limit of large n.

¹ See also https://mast.queensu.ca/~speicher/survey.html

A sum of freely independent random matrices. Before moving to the definition of the S-transform used for finding the product, we discuss a simpler topic: finding the distribution of the sum of freely independent random matrices.
Let ξ and η be scalar-valued independent random variables. The density of their sum can be computed using the characteristic function:

F_{ξ+η}(t) = E e^{i(ξ+η)t} = E(e^{iξt} e^{iηt}) = E e^{iξt} · E e^{iηt} = F_ξ(t) · F_η(t).  (2.61)

The first equality is the definition of the characteristic function. The third equality holds due to independence of ξ and η. A (generalized) density of the sum can be computed by taking the inverse Fourier transform:

p_{ξ+η}(x) = (1/(2π)) ∫_R e^{−ixt} F_{ξ+η}(t) dt.  (2.62)
Let X and Y be random matrix ensembles of size n × n. We cannot apply the same technique to random matrices since they do not generally commute; for this reason, e^{i(X+Y)t} ≠ e^{iXt} e^{iYt} in general, and the second equality of (2.61) does not hold. However, there exists a related technique for freely independent random matrices.
Following [Tao, 2012], define the Stieltjes transform as:

G_X(z) = τ((z − X)^{−1}),  (2.63)

where τ(X) = n^{−1} E tr(X). This allows for formal Laurent series, which give the following:

G_X(z) = Σ_{k=0}^{∞} τ(X^k)/z^{k+1} = Σ_{k=0}^{∞} n^{−1} E_X Σ_{i=1}^{n} λ_i(X)^k / z^{k+1} = Σ_{k=0}^{∞} E_{λ∼ρ_X} λ^k / z^{k+1} = E_{λ∼ρ_X} (z − λ)^{−1},  (2.64)

where ρ_X(λ) denotes the expected spectral density:

ρ_X(λ) = E_X ρ̂_X(λ) = E_X (1/n) Σ_{i=1}^{n} δ(λ − λ_i(X)).  (2.65)

Let ζ = G_X(z) = τ((z − X)^{−1}). Here ζ is a function of z; let us assume that z is in turn a function of ζ: z = z_X(ζ). We then have the following:

(z_X(ζ) − X)^{−1} = ζ(1 − E_X),  (2.66)

where τ(E_X) = 0. Rearranging gives:

X = z_X(ζ) − ζ^{−1}(1 − E_X)^{−1},  (2.67)

while for Y we have the same:

Y = z_Y(ζ) − ζ^{−1}(1 − E_Y)^{−1},  (2.68)

and so:

X + Y = z_X(ζ) + z_Y(ζ) − ζ^{−1}((1 − E_X)^{−1} + (1 − E_Y)^{−1}).  (2.69)

We have:

(1 − E_X)^{−1} + (1 − E_Y)^{−1} = (1 − E_X)^{−1}((1 − E_Y) + (1 − E_X))(1 − E_Y)^{−1} = (1 − E_X)^{−1}((1 − E_X)(1 − E_Y) + 1 − E_X E_Y)(1 − E_Y)^{−1} = 1 + (1 − E_X)^{−1}(1 − E_X E_Y)(1 − E_Y)^{−1}.  (2.70)

Hence:

(z_X(ζ) + z_Y(ζ) − X − Y − ζ^{−1})^{−1} = ζ(1 − E_Y)(1 − E_X E_Y)^{−1}(1 − E_X).  (2.71)

We have:

(1 − E_Y)(1 − E_X E_Y)^{−1}(1 − E_X) = (1 − E_Y) Σ_{k=0}^{∞} (E_X E_Y)^k (1 − E_X).  (2.72)

The last expression is a sum of alternating products of E_X and E_Y. Since X and Y are freely independent, E_X and E_Y are freely independent too. Applying τ gives:

τ((z_X(ζ) + z_Y(ζ) − X − Y − ζ^{−1})^{−1}) = ζ.  (2.73)

At the same time:

τ((z_{X+Y}(ζ) − X − Y)^{−1}) = ζ.  (2.74)

Hence:

z_{X+Y}(ζ) = z_X(ζ) + z_Y(ζ) − ζ^{−1}.  (2.75)

Define R_X(ζ) = z_X(ζ) − ζ^{−1}. Hence:

R_{X+Y}(ζ) = R_X(ζ) + R_Y(ζ).  (2.76)

Alternatively, we can say that the R-transform is a solution of the following equation:

R_X(G_X(z)) + (G_X(z))^{−1} = z.  (2.77)

As a sanity check, consider the R-transform of a scalar constant x. In this case, G_x(z) = (z − x)^{−1}. This gives R_x((z − x)^{−1}) + z − x = z, hence R_x((z − x)^{−1}) = x. This means simply R_x ≡ x.
Rx ((z − x)−1 ) + z − x = z, hence Rx ((z − x)−1 ) = x. This means simply Rx ≡ x.

S-transform. Let us now define the S-transform. We start by defining the moment generating function M:

M(z) = z G(z) − 1 = Σ_{k=1}^{∞} τ(X^k)/z^k = Σ_{k=1}^{∞} m_k(X)/z^k,  (2.78)

where the k-th moment of ρ is defined as follows:

m_k(X) = E_{λ∼ρ_X} λ^k = τ(X^k).  (2.79)

The moment generating function M is a mapping from C \ {0} to C. Let M^{−1} be its functional inverse. We are now ready to define the S-transform:

S(z) = (1 + z) / (z M^{−1}(z)).  (2.80)

In order to get some intuition concerning the property (2.54), consider the case ρ(λ) = δ(λ − x). In this case M(z) = x/(z − x); hence z = x(1 + 1/M(z)). This gives M^{−1}(z) = x(1 + 1/z), and S(z) = 1/x, which obviously satisfies the property.

Recovering the limiting spectrum. We are not going to compute the S-transform (2.53) of the Wishart ensemble (2.52), but we aim to recover the spectrum of the product Wishart ensemble from its S-transform (2.55). We have:

S_{JJ^T}(z) = S_{W^T W}^L(z) = 1/(1 + z)^L,    M_{JJ^T}^{−1}(z) = (1 + z)^{L+1}/z.  (2.81)

First we need to recover the Stieltjes transform G. Recall M_{JJ^T}(z) = z G_{JJ^T}(z) − 1. This gives:

z = M_{JJ^T}^{−1}(M_{JJ^T}(z)) = (z G_{JJ^T}(z))^{L+1} / (z G_{JJ^T}(z) − 1),  (2.82)

or:

z G_{JJ^T}(z) − 1 = z^L G_{JJ^T}(z)^{L+1}.  (2.83)

This equation gives a principled way to recover G_{JJ^T}. However, our goal is the spectral density ρ_{JJ^T}. The density can be recovered from its Stieltjes transform using the inversion formula:

ρ(λ) = −(1/π) lim_{ε→0+} ℑ G(λ + iε).  (2.84)

Indeed:

lim_{ε→0+} ℑ G(λ + iε) = lim_{ε→0+} ℑ ∫ ρ(t)/(λ − t + iε) dt = lim_{ε→0+} ℑ ∫ ρ(t)(λ − t − iε)/((λ − t)² + ε²) dt = lim_{ε→0+} ∫ (−ε)ρ(t)/((λ − t)² + ε²) dt = lim_{ε→0+} ∫ (−ε)ρ(u + λ)/(u² + ε²) du = lim_{ε→0+} ∫ (−1)ρ(vε + λ)/(v² + 1) dv = −πρ(λ).  (2.85)

Hence we should consider z = λ + iε and take the limit ε → 0+. Assume also G_{JJ^T}(λ + iε) = r e^{iφ}. Substituting this into (2.83) gives:

(λ + iε) r e^{iφ} − 1 = (λ + iε)^L r^{L+1} e^{i(L+1)φ}.  (2.86)

Let us consider the real and imaginary parts of this equation separately:

r(λ cos φ + O(ε)) − 1 = λ^L r^{L+1} ((1 + O(ε²)) cos((L+1)φ) + O(ε));  (2.87)

r(λ sin φ + O(ε)) = λ^L r^{L+1} ((1 + O(ε²)) sin((L+1)φ) + O(ε)).  (2.88)

Taking the limit ε → 0+ gives:

rλ cos φ − 1 = λ^L r^{L+1} cos((L+1)φ),    rλ sin φ = λ^L r^{L+1} sin((L+1)φ).  (2.89)

Consequently:

rλ sin φ / (rλ cos φ − 1) = tan((L+1)φ),    r^L = λ^{1−L} sin φ / sin((L+1)φ).  (2.90)

From the first equality we get:

r = λ^{−1} · 1/(cos φ − sin φ/tan((L+1)φ)) = λ^{−1} sin((L+1)φ)/sin(Lφ).  (2.91)

This equality, together with the second one on the previous line, gives:

1 = λ sin φ sin^L(Lφ) / sin^{L+1}((L+1)φ).  (2.92)

Hence:

λ = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)).  (2.93)

We also get the density:

ρ(λ) = −(1/π) r sin φ = −(1/π) sin²φ sin^{L−1}(Lφ) / sin^L((L+1)φ).  (2.94)

For the sake of convenience, we substitute φ with −φ; this gives:

λ = sin^{L+1}((L+1)φ) / (sin φ sin^L(Lφ)),    ρ(λ(φ)) = (1/π) sin²φ sin^{L−1}(Lφ) / sin^L((L+1)φ).  (2.95)

All eigenvalues of J J^T are real and non-negative. This gives us a constraint: φ ∈ [0, π/(L+1)]. The left edge of this segment gives the maximal λ = (L+1)^{L+1}/L^L, while the right edge gives the minimum: λ = 0. Note that the same constraint results in a non-negative spectral density.
As a sanity check, take L = 1 and compare with (2.52):

λ = sin²(2φ)/sin²φ = 4 cos²φ,    ρ(λ(φ)) = (1/π) sin²φ/sin(2φ) = (1/(2π)) tan φ = (1/(2π)) √(1/cos²φ − 1) = (1/(2π)) √(4/λ(φ) − 1).  (2.96)

2.2.2 ReLU case


For gaussian initialization, we expect problems similar to those we had in the linear case. However, curing the expanding jacobian spectrum for a linear net with square layers is easy: one has to assume orthogonal initialization instead of i.i.d. gaussian:

W_l ∼ U(O_{n×n})  ∀l ∈ [L].  (2.97)

In this case ‖J h_1‖ = ‖h_1‖ a.s.; the same holds for ‖J^T g_{L+1}‖. The goal of the current section is to check whether orthogonal initialization helps in the ReLU case.

Similarly to the linear case, we have:

S_{JJ^T} = S_{J_L J_L^T} = S_{D_L W_L^T W_L D_L J_{L−1} J_{L−1}^T} = S_{D_L W_L^T W_L D_L} S_{J_{L−1} J_{L−1}^T} = S_{D_L² W_L W_L^T} S_{J_{L−1} J_{L−1}^T} = S_{D_L²} S_{W_L W_L^T} S_{J_{L−1} J_{L−1}^T} = ∏_{l=1}^{L} S_{D_l²} S_{W_l W_l^T} = S_{WW^T}^L ∏_{l=1}^{L} S_{D_l²}.  (2.98)

Consider orthogonal initialization. In order to normalize the forward and backward dynamics, we have to introduce a factor σ_w = √2:

W_l ∼ σ_w U(O_{n×n})  ∀l ∈ [L].  (2.99)

For a scaled orthogonal matrix W, S_{WW^T} = S_{σ_w² I} ≡ σ_w^{−2} = 1/2. We then have to compute S_{D_l²}.
Since we have assumed that ∀l ∈ [L] h_l ∼ N(0, q_∞), the spectrum of D_l² is given simply as:

ρ_{D_l²}(x) = (1/2) δ(x) + (1/2) δ(x − 1).  (2.100)

Taking the Stieltjes transform, we get:

G_{D_l²}(z) = (1/2) (1/z + 1/(z − 1)).  (2.101)

This gives the moment generating function and its inverse:

M_{D_l²}(z) = 1/(2(z − 1)),    M_{D_l²}^{−1}(z) = 1/(2z) + 1.  (2.102)

Finally, we get the S-transform:

S_{D_l²}(z) = (z + 1)/(z M_{D_l²}^{−1}(z)) = (z + 1)/(z + 1/2).  (2.103)

The S-transform of J J^T is then given as:

S_{JJ^T} = σ_w^{−2L} ((z + 1)/(z + 1/2))^L.  (2.104)

M_{JJ^T}^{−1} = σ_w^{2L} (z + 1/2)^L / (z (z + 1)^{L−1}).  (2.105)

Recall M_{JJ^T}(z) = z G_{JJ^T}(z) − 1. This gives:

z = M_{JJ^T}^{−1}(M_{JJ^T}(z)) = σ_w^{2L} (z G_{JJ^T}(z) − 1/2)^L / ((z G_{JJ^T}(z) − 1)(z G_{JJ^T}(z))^{L−1}),  (2.106)

or, using σ_w² = 2:

z (z G_{JJ^T}(z) − 1)(z G_{JJ^T}(z))^{L−1} = (2 z G_{JJ^T}(z) − 1)^L.  (2.107)
Taking its imaginary part (with z = λ + iε, G_{JJ^T} = r e^{iφ}, ε → 0+, keeping terms up to first order in sin φ) gives a sequence of transformations:

λ² r sin φ (λ^{L−1} r^{L−1} cos((L−1)φ)) + λ(λr cos φ − 1)(λ^{L−1} r^{L−1} sin((L−1)φ)) = L(2λr cos φ − 1)^{L−1} 2λr sin φ + O(sin²φ).  (2.108)

λ^{L+1} r^L sin(Lφ) − λ^L r^{L−1} sin((L−1)φ) = L(2λr cos φ − 1)^{L−1} 2λr sin φ + O(sin²φ).  (2.109)

λ^{L+1} r^L L sin φ − λ^L r^{L−1} (L−1) sin φ = L(2λr − 1)^{L−1} 2λr sin φ + O(sin²φ).  (2.110)

Hence for φ = 0 we have:

λ^{L+1} r^L L − λ^L r^{L−1} (L−1) = L(2λr − 1)^{L−1} 2λr = L(2λr − 1)^L + L(2λr − 1)^{L−1}.  (2.111)

The real part of (2.107) in its turn gives:

λ(λr cos φ − 1)(λ^{L−1} r^{L−1} cos((L−1)φ)) − λ² r sin φ (λ^{L−1} r^{L−1} sin((L−1)φ)) = (2λr cos φ − 1)^L + O(sin²φ).  (2.112)

λ^{L+1} r^L cos(Lφ) − λ^L r^{L−1} cos((L−1)φ) = (2λr cos φ − 1)^L + O(sin²φ).  (2.113)

λ^{L+1} r^L − λ^L r^{L−1} = (2λr − 1)^L + O(sin²φ).  (2.114)

Hence for φ = 0 we have:

λ^{L+1} r^L − λ^L r^{L−1} = (2λr − 1)^L.  (2.115)

Eq. (2.111) minus L times eq. (2.115) results in:

λ^L r^{L−1} = L(2λr − 1)^{L−1}.  (2.116)

Putting this into (2.115) gives:

L(2λr − 1)^{L−1}(λr − 1) = (2λr − 1)^L.  (2.117)

L(λr − 1) = 2λr − 1.  (2.118)

λr = (L − 1)/(L − 2).  (2.119)

Putting this back into (2.116) gives:

λ ((L−1)/(L−2))^{L−1} = L (L/(L−2))^{L−1}.  (2.120)

λ = L (L/(L−1))^{L−1} = L (1 + 1/(L−1))^{L−1}.  (2.121)

The last expression approaches eL for large L. Hence the spectral density ρ_{JJ^T} gets expanded with depth at least linearly in the ReLU case, even for orthogonal initialization.
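
A direct simulation illustrates this conclusion (a sketch added here; n and the depths are arbitrary): for a ReLU net with weights √2 times a Haar orthogonal matrix, the top eigenvalue of J J^T grows roughly linearly in L, of the same order as the computed value λ ≈ eL:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400

    def haar_orthogonal(n):
        q, r = np.linalg.qr(rng.normal(size=(n, n)))
        return q * np.sign(np.diag(r))         # Haar-distributed orthogonal matrix

    for L in (2, 8, 32):
        h = rng.normal(size=n)                 # h_1
        J = np.eye(n)
        for _ in range(L):
            W = np.sqrt(2.0) * haar_orthogonal(n)
            D = np.diag((h > 0).astype(float)) # D_l = diag(phi'(h_l)) for ReLU
            J = W @ D @ J                      # J = prod_l W_l D_l
            h = W @ np.maximum(h, 0.0)
        lam_max = np.linalg.eigvalsh(J @ J.T).max()
        print(f"L = {L:2d}: lambda_max ≈ {lam_max:6.1f}, e*L ≈ {np.e * L:.1f}")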

2.3 GD dynamics for orthogonal initialization


It seems natural that a well-conditioned jacobian is necessary for trainability. But does a well-conditioned jacobian ensure trainability? In fact, yes, in the linear case. Following [Saxe et al., 2013], we will show that for a linear net with L hidden layers initialized orthogonally and trained with square loss, the number of optimization steps required to reach the minimum does not depend on L for large L.

Shallow nets. In order to show this, we start with the case L = 1:

f(x) = W_1 W_0 x.  (2.122)

Consider the square loss:

ℓ(y, z) = (1/2) ‖y − z‖_2²,    L = E_{x,y} ℓ(y, f(x)).  (2.123)

The gradient descent dynamics reads:

Ẇ_0 = η E_{x,y} W_1^T (y x^T − W_1 W_0 x x^T),    Ẇ_1 = η E_{x,y} (y x^T − W_1 W_0 x x^T) W_0^T.  (2.124)

Define Σ^{xx} = E x x^T (the input correlation matrix) and Σ^{xy} = E y x^T (the input-output correlation matrix). Assume further that the data is whitened: Σ^{xx} = I. Consider an SVD decomposition of the input-output correlation:

Σ^{xy} = U_2 S_{2,0} V_0^T = Σ_{r=1}^{n} s_r u_r v_r^T.  (2.125)

Perform a change of basis:

W̄_1 = U_2^T W_1,    W̄_0 = W_0 V_0.  (2.126)

The gradient descent dynamics becomes:

W̄̇_0 = η W̄_1^T (S_{2,0} − W̄_1 W̄_0),    W̄̇_1 = η (S_{2,0} − W̄_1 W̄_0) W̄_0^T.  (2.127)

Note that while the matrix element W_{0,ij} connects a hidden neuron i to an input neuron j, the matrix element W̄_{0,iα} connects a hidden neuron i to an input mode α. Let W̄_0 = [a_1, . . . , a_n], while W̄_1 = [b_1, . . . , b_n]^T. Then we get:

(1/η) ȧ_α = s_α b_α − Σ_{γ=1}^{n} b_γ (b_γ^T a_α) = (s_α − b_α^T a_α) b_α − Σ_{γ≠α} (b_γ^T a_α) b_γ;  (2.128)

(1/η) ḃ_α = s_α a_α − Σ_{γ=1}^{n} a_γ (a_γ^T b_α) = (s_α − a_α^T b_α) a_α − Σ_{γ≠α} (a_γ^T b_α) a_γ.  (2.129)

This dynamics is a GD dynamics on the following energy function:

E = (1/2) Σ_{α=1}^{n} (s_α − a_α^T b_α)² + (1/2) Σ_{α≠γ} (a_α^T b_γ)².  (2.130)

Let us assume that there exists an orthogonal matrix R = [r_1, . . . , r_n] such that a_α ∝ r_α and b_α ∝ r_α. In other words, W̄_0 = R D_0 and W̄_1 = D_1 R^T, where D_0 and D_1 are diagonal matrices. Note that in this case W_1 = U_2 D_1 R^T, while W_0 = R D_0 V_0^T.
Given this, the dynamics above decomposes into a system of independent equations of the same form:

ȧ = η(s − ab)b,    ḃ = η(s − ab)a.  (2.131)

Note that a² − b² is a motion integral, while the energy function for each individual equation depends only on ab: E = (s − ab)²/2.
There exists a solution of these equations with a = b. In this case D_0 = D_1. Let u = ab. We have:

u̇ = 2η(s − u)u.  (2.132)

This ODE is integrable:

t = (1/η) ∫_{u_0}^{u_f} du/(2u(s − u)) = (1/(2sη)) ∫_{u_0}^{u_f} (du/u + du/(s − u)) = (1/(2sη)) (ln(u_f/u_0) − ln((u_f − s)/(u_0 − s))) = (1/(2sη)) ln(u_f(u_0 − s)/(u_0(u_f − s))).  (2.133)

Note that u = s is a global minimizer. Hence the time required to reach u_f = s − ε from u_0 = ε is:

t = (1/(2sη)) ln((s − ε)²/ε²) = (1/(sη)) ln(s/ε − 1) ∼ (1/(sη)) ln(s/ε)  for ε → 0.  (2.134)

This means that the larger the correlation s between the input and output modes, the faster the convergence.
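
The closed form (2.133)-(2.134) is easy to sanity-check by integrating (2.132) numerically (a sketch added here; the constants η, s, ε are arbitrary):

    import numpy as np

    eta, s, eps, dt = 0.01, 3.0, 1e-3, 1e-3
    u, t = eps, 0.0
    while u < s - eps:
        u += dt * 2.0 * eta * (s - u) * u      # Euler step of u' = 2*eta*(s - u)*u
        t += dt
    t_closed = np.log(s / eps - 1.0) / (s * eta)
    print(f"simulated t ≈ {t:.1f}, closed form (2.134): {t_closed:.1f}")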

Deep nets. Let us proceed to a linear network with L hidden layers in the same setup:

f(x) = (∏_{l=0}^{L} W_l) x.  (2.135)

The gradient descent dynamics reads:

Ẇ_l = η E_{x,y} (∏_{l′=l+1}^{L} W_{l′})^T (y x^T − (∏_{l′=0}^{L} W_{l′}) x x^T) (∏_{l′=0}^{l−1} W_{l′})^T  ∀l ∈ [L]_0.  (2.136)

Again, assume that Σ^{xx} = I and Σ^{xy} = U_{L+1} S_{L+1,0} V_0^T. Moreover, in analogy to the shallow case, suppose W_l = R_{l+1} D_l R_l^T for l ∈ [L]_0, where D_l is a diagonal matrix, while the R_l are orthogonal; R_0 = V_0, R_{L+1} = U_{L+1}. Note that if all W_l are themselves orthogonal and ∏_{l=0}^{L} W_l = U_{L+1} V_0^T, then the assumption above holds for D_l = I ∀l ∈ [L]_0, R_0 = V_0, R_{l+1} = W_l R_l. This gives:

Ḋ_l = η (∏_{l′=l+1}^{L} D_{l′})^T (S_{L+1,0} − ∏_{l′=0}^{L} D_{l′}) (∏_{l′=0}^{l−1} D_{l′})^T  ∀l ∈ [L]_0.  (2.137)
The latter decouples into independent modes:

ȧ_l = η (s − ∏_{l′=0}^{L} a_{l′}) ∏_{l′≠l} a_{l′},  (2.138)

which is a gradient descent on the following energy function:

E(a_{0:L}) = (1/2) (s − ∏_{l=0}^{L} a_l)².  (2.139)

Again, we are looking for solutions of the form a_0 = . . . = a_L. Define u = ∏_{l=0}^{L} a_l. This gives an ODE:

u̇ = η(L+1) u^{2L/(L+1)} (s − u).  (2.140)

For large L we can approximate this equation with u̇ = η(L+1) u²(s − u) (why?), which is easily integrable:

t = (1/((L+1)η)) ∫_{u_0}^{u_f} du/(u²(s − u)) = (1/((L+1)sη)) ∫_{u_0}^{u_f} (du/u² + du/(u(s − u))) = (1/((L+1)sη)) (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))).  (2.141)

We see that t ∼ L^{−1}: training time decreases as the number of layers grows. Note, however, that we cannot perform a gradient flow; we perform gradient descent with discrete steps instead. Hence we have to count the number of steps as a function of L.
The optimal learning rate is inversely proportional to the maximal eigenvalue of the Hessian of the energy function observed during training. Let us first compute the Hessian:

∇_i := ∂E/∂a_i = −(s − ∏_{l=0}^{L} a_l) ∏_{l≠i} a_l.  (2.142)

∇²_{ij} := ∂²E/(∂a_i ∂a_j) = (∏_{l≠i} a_l)(∏_{l≠j} a_l) − (s − ∏_{l=0}^{L} a_l) ∏_{l≠i,j} a_l  for i ≠ j.  (2.143)

∇²_{ii} := ∂²E/∂a_i² = (∏_{l≠i} a_l)².  (2.144)

Taking into account our assumption a_0 = . . . = a_L = a, we get:

∇_i = −(s − a^{L+1}) a^L,    ∇²_{ij} = 2a^{2L} − s a^{L−1},    ∇²_{ii} = a^{2L}.  (2.145)

There is an eigenvector v_1 = [1, . . . , 1]^T with eigenvalue λ_1 = ∇²_{ii} + L∇²_{ij} = (1 + 2L) a^{2L} − sL a^{L−1}. Also, there are L eigenvectors of the form v_i = [1, 0, . . . , 0, −1, 0, . . . , 0]^T with eigenvalue λ_i = ∇²_{ii} − ∇²_{ij} = s a^{L−1} − a^{2L}. Notice that for large L, λ_1 becomes the largest eigenvalue irrespective of a.
During the course of optimization, u travels inside the segment [0, s]; hence a lies inside [0, s^{1/(L+1)}]. Let us find the maximum of λ_1 on this segment:

dλ_1/da = 2L(1 + 2L) a^{2L−1} − sL(L−1) a^{L−2}.  (2.146)

Equating this derivative to zero yields:

a* = (s(L−1)/(2(1+2L)))^{1/(L+1)} = s^{1/(L+1)} ((L−1)/(2(1+2L)))^{1/(L+1)} < s^{1/(L+1)}.  (2.147)

The second solution is, of course, a = 0 if L > 2. Therefore we have three candidates for the maximum: a = 0, a = s^{1/(L+1)}, and a = a*. Let us check them:

λ_1(0) = 0,    λ_1(s^{1/(L+1)}) = (1 + L) s^{2L/(L+1)} ≥ 0.  (2.148)
For the third candidate, note that (2.147) gives (1 + 2L) a*^{L+1} = s(L−1)/2, hence:

λ_1(a*) = a*^{L−1} ((1 + 2L) a*^{L+1} − sL) = a*^{L−1} s ((L−1)/2 − L) = −a*^{L−1} s (L+1)/2 ≤ 0.  (2.149)

Hence the maximal λ_1 over the course of optimization is λ_1(s^{1/(L+1)}) = (1 + L) s^{2L/(L+1)}. Recall that the optimal learning rate is inversely proportional to the maximal eigenvalue of the Hessian:

η_opt ∝ 1/max_t λ_1 = (L+1)^{−1} s^{−2L/(L+1)}.  (2.150)

Substituting it into t yields:

t_opt = (1/((L+1)sη_opt)) (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))) = s^{(L−1)/(L+1)} (1/u_0 − 1/u_f + (1/s) ln(u_f(u_0 − s)/(u_0(u_f − s)))).  (2.151)

This expression is asymptotically independent of L. In other words, training time (in terms of the number of gradient steps) for very deep nets does not depend on depth.
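
This prediction can be checked with an explicit discrete-time experiment (a sketch added here; s, u_0 and the tolerance are arbitrary): run gradient descent on the energy (2.139) from a balanced initialization with the learning rate (2.150), and count the steps until convergence:

    import numpy as np

    s, u0, tol = 2.0, 1e-2, 1e-4

    for L in (1, 4, 16, 64):
        a = np.full(L + 1, u0 ** (1.0 / (L + 1)))        # balanced init, prod(a) = u0
        lr = 1.0 / ((L + 1) * s ** (2 * L / (L + 1)))    # eta_opt from (2.150)
        steps = 0
        while (s - a.prod()) ** 2 / 2 > tol:
            u = a.prod()
            a -= lr * (-(s - u) * u / a)       # dE/da_i = -(s - u) * prod_{j != i} a_j
            steps += 1
        print(f"L = {L:3d}: converged in {steps} steps")

The step counts should stay within a small constant factor as L grows, in contrast with the naive expectation that deeper chains train slower.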

Chapter 3

Loss landscape

The neural network training process can be viewed as an optimization problem:

L(θ) = E_{x,y∈Ŝ_m} ℓ(y, f(x; θ)) → min_θ,  (3.1)

where ℓ is a loss function assumed to be convex, f(·; θ) is a neural net with parameters θ, and Ŝ_m = {(x_i, y_i)}_{i=1}^m is a dataset of size m sampled from the data distribution D.
This problem is typically non-convex; hence we do not have any guarantees for gradient descent convergence in general. Nevertheless, in realistic setups we typically observe that gradient descent succeeds in finding the global minimum of L; moreover, this is done in reasonable time. This observation leads us to the following hypotheses:
1. If the neural network and the data satisfy certain properties, all local minima of L(θ) are global.
2. If the neural network and the data satisfy certain properties, gradient descent converges to a global minimum at a good rate with high probability.
Neither of the two hypotheses is stronger than the other. Indeed, having all local minima global does not tell us anything about the convergence rate, while having a convergence guarantee with high probability does not rule out the possibility of having (a few) local minima.
In the present chapter, we shall discuss the first hypothesis only, while the second one will be discussed later in the context of the Neural Tangent Kernel.

3.1 Wide non-linear nets


It turns out that if the training data is consistent, one can prove globality of local minima if the network is wide enough.
Following [Yu and Chen, 1995], we shall start with the simplest case: a two-layered net trained to minimize the square loss:

f(x; W_{0,1}) = W_1 φ(W_0 x);  (3.2)

L(W_{0,1}) = (1/2) Σ_{i=1}^{m} ‖y_i − f(x_i; W_{0,1})‖_2² = (1/2) ‖Y − W_1 φ(W_0 X)‖_F²,  (3.3)

where W_l ∈ R^{n_{l+1}×n_l}, x_i ∈ R^{n_0}, y_i ∈ R^{n_2}, X ∈ R^{n_0×m}, and Y ∈ R^{n_2×m}.
Let W*_{0,1} be a local minimum of L. Consider L with W_0 fixed to W_0*:

L_{W_0*}(W_1) = (1/2) ‖Y − W_1 φ(W_0* X)‖_F².  (3.4)

Since W*_{0,1} is a minimum of L, W_1* is a minimum of L_{W_0*}. Minimizing L_{W_0*}(W_1) is a convex problem. Hence W_1* is a global minimum of L_{W_0*}.
Denote H_1 = W_0 X and X_1 = φ(H_1); then L_{W_0*}(W_1) = (1/2) ‖Y − W_1 X_1*‖_F². Hence rk X_1* = m implies min L_{W_0*}(W_1) = 0. Since W_1* is a global minimum of L_{W_0*}, L(W*_{0,1}) = L_{W_0*}(W_1*) = 0; hence W*_{0,1} is a global minimum of L.
Suppose rk X_1* < m. If we still have min L_{W_0*}(W_1) = 0, we arrive at the same conclusion as previously. Suppose L(W*_{0,1}) = L_{W_0*}(W_1*) = min L_{W_0*}(W_1) > 0. We shall prove that W*_{0,1} cannot be a minimum of L in this case, as long as the conditions of the following lemma hold:

Lemma 1. Suppose φ is non-zero real analytic. If n_1 ≥ m and x_i ≠ x_j ∀i ≠ j, then µ({W_0 : rk X_1 < m}) = 0, where µ is the Lebesgue measure on R^{n_1×n_0}.

Since L(W*_{0,1}) > 0 and L is a continuous function of W_{0,1}, ∃ε > 0 : ∀W_{0,1} ∈ B_ε(W*_{0,1}) L(W_{0,1}) > 0. By virtue of the lemma, ∀δ > 0 ∃W_0′ ∈ B_δ(W_0*) : rk X_1′ = m.
Take δ ∈ (0, ε). In this case L(W_0′, W_1*) > 0, while rk X_1′ = m. Note that minimizing L_{W_0′}(W_1) is a convex problem and min L_{W_0′}(W_1) = 0 (since rk X_1′ = m). Hence a (continuous-time) gradient descent on L_{W_0′} started from W_1* converges to a point W_1*′ for which L_{W_0′}(W_1*′) = 0. Because of the latter, (W_0′, W_1*′) ∉ B_ε(W*_{0,1}).
Overall, we have the following: ∃ε > 0 : ∀δ ∈ (0, ε) ∃(W_0′, W_1*) ∈ B_δ(W*_{0,1}) such that a continuous-time gradient descent on L that starts from (W_0′, W_1*) and acts only on W_1 converges to a point (W_0′, W_1*′) ∉ B_ε(W*_{0,1}).
Obviously, we can replace "∀δ ∈ (0, ε)" with "∀δ > 0". Given this, the statement above means that the gradient flow dynamics that acts only on W_1 is unstable in the Lyapunov sense at W*_{0,1}. Hence W*_{0,1} cannot be a minimum, and hence a local minimum with L(W_{0,1}) > 0 is impossible as long as the conditions of Lemma 1 hold. This means that all local minima of L are global.
Let us prove Lemma 1. Let I_m ⊂ [n_1] with |I_m| = m. Consider X_{1,I_m} ∈ R^{m×m}, the subset of rows of X_1 indexed by I_m. Note that rk X_1 < m is equivalent to det X_{1,I_m} = 0 ∀I_m.
Since φ is analytic, det X_{1,I_m} is an analytic function of W_0 ∀I_m. We shall use the following lemma:

Lemma 2. Given the conditions of Lemma 1, ∃W_0 : rk X_1 = m.

Given this, ∃W_0 : ∃I_m : det X_{1,I_m} ≠ 0. Since the determinant is an analytic function of W_0, µ({W_0 : det X_{1,I_m} = 0}) = 0. This implies the statement of Lemma 1.
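
Lemma 1 is easy to illustrate numerically (a sketch added here; sizes are arbitrary): for distinct inputs and n_1 ≥ m, the feature matrix X_1 = φ(W_0 X) has full rank m for essentially every random draw of W_0:

    import numpy as np

    rng = np.random.default_rng(0)
    n0, n1, m = 5, 32, 20
    X = rng.normal(size=(n0, m))               # m distinct inputs (columns)

    full = sum(
        np.linalg.matrix_rank(np.tanh(rng.normal(size=(n1, n0)) @ X)) == m
        for _ in range(100)
    )
    print(f"rk X_1 = m in {full}/100 random draws of W_0")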

3.1.1 Possible generalizations


Let us list the properties we have used to prove the theorem of [Yu and Chen, 1995]:
1. The loss is square.
2. The number of hidden layers L is one.
3. n_L ≥ m.
4. φ is real analytic.
Can we relax any of them? First, note that it is enough to have "rk X_1* = m implies min L_{W_0*}(W_1) = 0". For this it is enough to have ℓ(y, z) convex with respect to z with min_z ℓ(y, z) = 0 ∀y (in particular, the minimum should exist). For this reason, the cross-entropy loss requires a more sophisticated analysis.
In order to relax the second property, it suffices to generalize Lemma 1:

Lemma 3. Suppose φ is non-zero real analytic and l ∈ [L]. If n_l ≥ m and x_i ≠ x_j ∀i ≠ j, then µ({W_{0:l−1} : rk X_l < m}) = 0.

The generalized version is proven in [Nguyen and Hein, 2017].
As for the third property, we may want to relax it in two directions. First, we may require not the last hidden layer but some hidden layer to be wide enough. Second, we may try to make the lower bound on the number of hidden units smaller. It seems that the second direction is not possible for a general dataset S_m: one has to assume some specific properties of the data in order to improve the lower bound.

Deep nets with analytic activations
Following [Nguyen and Hein, 2017], let us elaborate on the first direction. We start by defining the forward dynamics:

H_{l+1} = W_l X_l,    X_l = φ(H_l)  ∀l ∈ [L],    H_1 = W_0 X,  (3.5)

where X ∈ R^{n_0×m}, W_l ∈ R^{n_{l+1}×n_l}, H_l ∈ R^{n_l×m}. We also define the backward dynamics:

G_l = ∂L/∂H_l = φ′(H_l) ⊙ (W_l^T ∂L/∂H_{l+1}) = φ′(H_l) ⊙ (W_l^T G_{l+1}),  (3.6)

G_{L+1} = ∂L/∂H_{L+1} = (1/2) ∂‖Y − H_{L+1}‖_F² / ∂H_{L+1} = H_{L+1} − Y,  (3.7)

where G_l ∈ R^{n_l×m}. Then we have:

∇_l = ∂L/∂W_l = G_{l+1} X_l^T ∈ R^{n_{l+1}×n_l}.  (3.8)

Let W*_{0:L} be a local minimum and suppose n_l ≥ m for some l ∈ [L]. As previously, we divide our reasoning into two parts: in the first part we assume that rk X_l* = m, while in the second one we show that if rk X_l* < m, then W*_{0:L} cannot be a minimum.
Assume rk X_l* = m. We have:

0 = ∇_l* = G*_{l+1} X_l^{*,T},  (3.9)

or,

X_l* G*_{l+1}^T = 0 ∈ R^{n_l×n_{l+1}}.  (3.10)

Each column of the matrix on the left-hand side is a linear combination of the columns of X_l*. Since the columns of X_l* are linearly independent, G*_{l+1} = 0. By the recurrent relation,

0 = G*_{l+1} = φ′(H*_{l+1}) ⊙ (W*_{l+1}^T G*_{l+2}).  (3.11)

Assume that φ′ never vanishes. This gives:

W*_{l+1}^T G*_{l+2} = 0 ∈ R^{n_{l+1}×m}.  (3.12)

If we assume that the columns of W*_{l+1}^T (or, equivalently, the rows of W*_{l+1}) are linearly independent, we get G*_{l+2} = 0. Linearly independent rows of W*_{l+1} is equivalent to rk W*_{l+1} = n_{l+2}, which implies n_{l+1} ≥ n_{l+2}.
Suppose this assumption holds. If we moreover assume that rk W*_{l′} = n_{l′+1} ∀l′ ∈ {l+1, . . . , L}, we get G*_{L+1} = 0. This implies L(W*_{0:L}) = 0 (for the square loss, G*_{L+1} = H*_{L+1} − Y = 0). The assumption on the ranks of W*_{l′} requires n_{l′} ≥ n_{l′+1} ∀l′ ∈ {l+1, . . . , L}: the network does not expand after the l-th layer.
Now assume rk X_l* < m and L(W*_{0:L}) > 0, while still n_l ≥ m. In the shallow case we have shown, first, that an infinitesimal perturbation of W_{0:l−1} results in rk X_l = m, and second, that starting from this perturbed point, the gradient descent dynamics leaves a sufficiently large vicinity of W*_{0:L}. These two together imply that W*_{0:L} cannot be a minimum, which is a contradiction.
While both statements still hold if l = L, the second one does not hold for 0 < l < L since the problem of
minimizing LW0:l−1 ∗ (Wl:L ) is still non-convex, hence we have no guarantees on gradient descent convergence. Hence
for the case 0 < l < L we have to come up with another way of reasoning.
Define:

u = vec(W0:l−1),   v = vec(Wl:L),   ψ = ∂L(u, v)/∂v.   (3.13)

Since (u∗, v∗) is a minimum, we have ψ(u∗, v∗) = 0. Assume that the Jacobian of ψ with respect to v is non-singular at
(u∗, v∗):

det(Jv ψ(u∗, v∗)) ≠ 0.   (3.14)

Note that in the case l = L this property is equivalent to rk X∗L = nL:

ψ(u, v)inL+j = (WL,ik XL,kl − Yil) XL,jl;

(Jv ψ(u, v))inL+j, i′nL+j′ = δii′ XL,jl XL,j′l = δii′ (XL XLT)jj′.   (3.15)

We see that Jv ψ(u, v) is a block-diagonal matrix constructed with nL+1 identical blocks XL XLT. Its determinant
at W∗0:L is therefore (det(X∗L X∗,TL))nL+1, which is positive as long as rk X∗L = nL. Note that rk X∗L ≤ m, hence we
need nL ≤ m. Note that we need nL ≥ m in order to apply Lemma 1, hence Condition (3.14) actually requires a
stronger property, nL = m, instead of nL ≥ m used before.
Condition 3.14 allows us to apply the implicit function theorem:

∃δ1 > 0 : ∃ṽ ∈ C 1 (Bδ1 (u∗ )) : ṽ(u∗ ) = v ∗ and ∀u ∈ Bδ1 (u∗ ) ψ(u, ṽ(u)) = 0. (3.16)

Since all matrices W∗l+1:L are full-rank, and the set of full-rank matrices is open,

∃ǫ̃ > 0 : ∀v ∈ Bǫ̃(v∗) ∀l′ ∈ {l + 1, . . . , L} rk Wl′ = nl′+1.   (3.17)

Since ṽ ∈ C 1 (Bδ1 (u∗ )),


∃δ2 ∈ (0, δ1 ) : ∀u ∈ Bδ2 (u∗ ) ṽ(u) ∈ Bǫ̃ (v ∗ ). (3.18)
Consequently,
∀u ∈ Bδ2 (u∗ ) ∀l′ ∈ {l + 1, . . . , L} rk W̃l′ = nl′ +1 . (3.19)
Due to Lemma 3,
∀ǫ > 0 ∃ũ ∈ Bǫ (u∗ ) : rk X̃l = m. (3.20)
Hence

∀ǫ ∈ (0, δ2 ) ∃ũ ∈ Bǫ (u∗ ) : rk X̃l = m and ∀l′ ∈ {l + 1, . . . , L} rk W̃l′ = nl′ +1 and ψ(ũ, ṽ(ũ)) = 0. (3.21)

Note that in the first part of the proof we have only used that rk X∗l = m, rk W∗l′ = nl′+1 ∀l′ ∈ {l + 1, . . . , L}, and
∇∗l = 0. Hence we can conclude that L(ũ, ṽ(ũ)) = 0. Since this is true for all ǫ ∈ (0, δ2) and the loss is continuous
with respect to weights, this is also true for ǫ = 0: L(u∗, v∗) = L(W∗0:L) = 0.

Relaxing analyticity and other conditions


Overall, we have proven the following result first:

Proposition 1. Consider a point in the weight space W∗0:L. Suppose the following hold:
1. φ′ is not zero anywhere;
2. G∗L+1 = 0 implies L(W∗0:L) = min L;
3. rk X∗l = m;
4. rk W∗l′ = nl′+1 ∀l′ ∈ {l + 1, . . . , L};
5. ∇∗l = 0.

Then L(W∗0:L) = min L.
After that, we have relaxed the 3rd condition at the expense of a few others:

Proposition 2. Consider a point in the weight space W∗0:L. Suppose the following hold:
1. φ′ is not zero anywhere;
2. G∗L+1 = 0 implies L(W∗0:L) = min L;
3. φ is non-zero real analytic;
4. rk W∗l′ = nl′+1 ∀l′ ∈ {l + 1, . . . , L};
5. det(∇²Wl+1:L L(W∗0:L)) ≠ 0;
6. ∇∗l′ = 0 ∀l′ ∈ {l, . . . , L}.

Then L(W∗0:L) = min L.

However, besides the 3rd condition, Proposition 1 requires the 4th condition, which is hard to ensure. We can prove
the following lemma, which is due to [Nguyen, 2019]:
Lemma 4. Let θ = Wl+1:L . Suppose the following hold:
1. rk Xl = m;
2. nl′ > nl′ +1 ∀l′ ∈ {l + 1, . . . , L};
3. φ(R) = R and φ is strictly monotonic.
Then
1. ∃θ′ : ∀l′ ∈ {l + 1, . . . , L} rk Wl′ = nl′ +1 and L(θ′ ) = L(θ);
2. ∃ a continuous curve connecting θ and θ′ , and loss is constant on the curve.
Applying this lemma, we can drive W∗l+1:L to full-rank W∗,′l+1:L without altering the loss; however, Lemma 4 does
not guarantee that ∇∗,′l = 0. Hence by applying Lemma 4 we potentially violate the 5th condition of Proposition 1.
Moreover, as we have discussed before, loss convexity is not enough to ensure that minima exist. For example, for
cross-entropy loss there could be no critical points of L, hence we cannot satisfy the 5th condition at all. Hence
we have to formulate a different variant of Proposition 1.
Following [Nguyen, 2019], we define an α-level set as L−1(α) and an α-sublevel set as L−1((−∞, α)). We also call
a connected component of a sublevel set a "local valley", and we call a local valley global if its infimum coincides with
inf L. There is a theorem which is due to [Nguyen, 2019]:
Theorem 1. Suppose the following hold:
1. φ(R) = R and φ is strictly monotonic;
2. ℓ(y, z) is convex wrt z with inf z ℓ(y, z) = 0 ∀y;
3. rk Xl = m;
4. nl′ > nl′ +1 ∀l′ ∈ {l + 1, . . . , L}.
Then
1. Every sublevel set is connected;
2. ∀ǫ > 0 L can attain a value < ǫ.
Theorem 1 not only formulates a global minimality condition in a way suitable for cross-entropy (i.e. that
all local valleys are global), but also implies that all local valleys are connected. In the case when local minima
exist, the latter implies that all of them are connected: a phenomenon empirically observed in [Garipov et al., 2018,
Draxler et al., 2018].
Notice that it is enough to prove Theorem 1 for l = 0: otherwise we can just apply this result to a subnetwork
starting from the l-th layer. Let Ωl = Rnl+1 ×nl be a set of all nl+1 × nl matrices, while Ω∗l ⊂ Ωl be a subset of
full-rank matrices. We shall state the following result first:
Lemma 5. Suppose the following hold:
1. φ(R) = R and φ is strictly monotonic;
2. rk X = m;
3. nl > nl+1 ∀l ∈ [L].
Then there exists a map h : Ω∗1 × . . . × Ω∗L × RnL+1 ×m → Ω0 :
1. ∀H̃L+1 ∈ RnL+1 ×m for full-rank W1:L HL+1 (h(W1:L , H̃L+1 ), W1:L ) = H̃L+1 ;
2. ∀W0:L where all W1:L are full-rank, there is a continuous curve between W0:L and (h(W1:L , HL+1 (W0:L )), W1:L )
such that the loss is constant on the curve.

The first statement can be proven easily. Indeed, let X † be the left inverse of X, while Wl† be the right inverse
of Wl ; this means that X † X = Im , while Wl Wl† = Inl+1 ∀l ∈ [L]. These pseudo-inverses exist, because X has full
column rank, while all Wl have full row rank (since nl > nl+1 ). Define the following recursively:

W̃0 = H̃1 X † , H̃l = φ−1 (X̃l ), X̃l = Wl† H̃l+1 ∀l ∈ [L]. (3.22)

This gives the following:

W̃0 X = H̃1 X † X = H̃1 , Wl φ(H̃l ) = Wl Wl† H̃l+1 = H̃l+1 ∀l ∈ [L]. (3.23)

This simply means that HL+1(W̃0, W1:L) = H̃L+1. Hence defining h(W1:L, H̃L+1) := W̃0 gives the result.
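The construction above is easy to test numerically. Below is a minimal sketch, with a leaky ReLU playing the role of a strictly monotonic φ with φ(R) = R, and illustrative dimensions n0 > n1 > . . . satisfying the conditions of Lemma 5.

```python
# Numerical check of Lemma 5, first claim: build W~_0 via pseudo-inverses (3.22).
import numpy as np

phi     = lambda h: np.where(h > 0, h, 0.1 * h)    # strictly monotonic, phi(R) = R
phi_inv = lambda x: np.where(x > 0, x, 10.0 * x)

rng = np.random.default_rng(0)
n, m = [6, 5, 4, 3], 4                              # n_0 > n_1 > n_2 > n_3, rk X = m
X  = rng.standard_normal((n[0], m))                 # full column rank a.s.
Ws = [rng.standard_normal((n[l + 1], n[l])) for l in range(1, 3)]   # W_1, W_2
H_target = rng.standard_normal((n[-1], m))          # desired H_{L+1}

# Recursion (3.22): X~_l = W_l^+ H~_{l+1}, H~_l = phi^{-1}(X~_l), W~_0 = H~_1 X^+
H = H_target
for W in reversed(Ws):
    H = phi_inv(np.linalg.pinv(W) @ H)              # right inverse: W W^+ = I
W0 = H @ np.linalg.pinv(X)                          # left inverse:  X^+ X = I

# Forward pass (3.23) reproduces the target output exactly (up to round-off)
H_check = W0 @ X
for W in Ws:
    H_check = W @ phi(H_check)
print(np.allclose(H_check, H_target))               # True
```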
We shall omit the proof of the second statement. Then the proof of Theorem 1 proceeds by constructing paths
from two points θ = W0:L and θ′ to a common point such that the loss does not increase along both of these paths.
We then show that the common meeting point can attain loss < ǫ ∀ǫ > 0.
Denote losses at points θ and θ′ as L and L′ respectively. Let us start from the point θ. By the virtue of
Lemma 4 we can travel from θ to another point for which all matrices are full rank without altering the loss. Hence
without loss of generality assume that all matrices of θ are full rank, and for θ′ we can assume the same. This
allows us to use Lemma 5 and travel from θ and θ′ to the following points by curves of constant loss:

θ → (h(W1:L, HL+1(W0:L)), W1:L),   θ′ → (h(W′1:L, HL+1(W′0:L)), W′1:L).   (3.24)

Since the set of full-rank matrices is connected, ∀l ∈ [L] there is a continuous curve Wl (t) for which Wl (0) = Wl ,
Wl (1) = Wl′ , and Wl (t) is full-rank. Hence we can travel from θ to the following point in the weight space:
(h(W1:L(1), HL+1(W0:L)), W1:L(1)) = (h(W′1:L, HL+1(W0:L)), W′1:L).   (3.25)

Since we do not alter the model output while traveling along the curve, we do not alter the loss either.
Consider some H̃L+1 ∈ RnL+1×m such that the corresponding loss is less than min(ǫ, L, L′). Consider a curve
HL+1(t) = (1 − t)HL+1(W0:L) + tH̃L+1, and a corresponding curve in the weight space:

θ(t) = (h(W′1:L, HL+1(t)), W′1:L).   (3.26)

Note that

L(θ(t)) = L(HL+1 (θ(t))) = L((1 − t)HL+1 (W0:L ) + tH̃L+1 ) ≤ (1 − t)L(HL+1 (W0:L )) + tL(H̃L+1 ) ≤ L. (3.27)

Hence the curve θ(t) is fully contained in any sublevel set containing the initial θ. The same construction started
from θ′ arrives at the same endpoint. Recall that the endpoint has loss less than ǫ. Hence all sublevel sets are
connected and the loss can attain a value less than any positive ǫ.

3.2 Linear nets


The second case for which one can prove globality of local minima is the case of φ(z) = z. Consider:

f (x; W0:L ) = WL . . . W0 x, (3.28)

where Wl ∈ Rnl+1 ×nl . We are going to prove the following result which is due to [Laurent and Brecht, 2018]:
Theorem 2. Let ℓ be convex and differentiable, and there are no bottlenecks in the architecture: minl∈[L]0 nl =
min{n0 , nL+1 }. Then all local minima of L(W0:L ) = E x,y ℓ(y, f (x; W0:L )) are global.
This theorem follows from the result below:
Theorem 3. Assume L̃ is a scalar differentiable function of nL+1 × n0 matrices. Let L(W0:L ) = L̃(WL . . . W0 ) and
let minl∈[L]0 nl = min{n0 , nL+1 }. Then any local minimizer Ŵ0:L of L satisfies ∇L̃(Â) = 0 for  = ŴL . . . Ŵ0 .

Indeed, consider L̃(A) = E x,y ℓ(y, Ax). The corresponding L reads as follows: L(W0:L) = E x,y ℓ(y, f(x; W0:L));
hence we are in the scope of Theorem 3. Take a local minimizer Ŵ0:L of L. By Theorem 3, ŴL . . . Ŵ0 is a critical
point of L̃. It follows from convexity of ℓ that ŴL . . . Ŵ0 is a global minimum of L̃. Since L(Ŵ0:L) = L̃(Â), Ŵ0:L
is a global minimum of L.
Let us now prove Theorem 3. Define Wl,+ = WL . . . Wl, Wl,− = Wl . . . W0, and A = WL . . . W0. Note that

∇l L(W0:L) = WTl+1,+ ∇L̃(A) WTl−1,− ∀l ∈ [L]0.   (3.29)
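Formula (3.29) can be sanity-checked by finite differences; the quadratic L̃(A) = (1/2)‖A − T‖F² below is an illustrative choice, for which ∇L̃(A) = A − T.

```python
# Finite-difference check of eq. (3.29) for L~(A) = (1/2)||A - T||_F^2.
import numpy as np

rng = np.random.default_rng(0)
n = [4, 3, 5, 2]                                   # n_0, n_1, n_2, n_3
Ws = [rng.standard_normal((n[l + 1], n[l])) for l in range(3)]
T = rng.standard_normal((n[-1], n[0]))

def loss(Ws):
    A = Ws[2] @ Ws[1] @ Ws[0]
    return 0.5 * np.sum((A - T) ** 2)

# Analytic gradient for l = 1: W_{2,+}^T grad L~(A) W_{0,-}^T, with W_{2,+} = W_2,
# W_{0,-} = W_0
A = Ws[2] @ Ws[1] @ Ws[0]
grad1 = Ws[2].T @ (A - T) @ Ws[0].T

eps, num = 1e-6, np.zeros_like(Ws[1])
for i in range(num.shape[0]):
    for j in range(num.shape[1]):
        Wp = [W.copy() for W in Ws]
        Wp[1][i, j] += eps
        num[i, j] = (loss(Wp) - loss(Ws)) / eps
print(np.max(np.abs(grad1 - num)))                 # ~1e-5: the formula matches
```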

Since Ŵ0:L is a local minimum of L, we have:

0 = ∇L L(Ŵ0:L) = ∇L̃(Â) ŴTL−1,−.   (3.30)

If ker ŴL−1,− = {0} then ∇L̃(Â) = 0 as required. Consider the case when the kernel is non-trivial. We shall prove
that there exist perturbed matrices W̃0:L such that à =  and W̃0:L is a local minimizer of L, and for some
l ∈ [L − 1]0 kernels of both W̃l−1,− and W̃Tl+1,+ are trivial. This gives ∇L̃(Ã) = 0, which is equivalent to ∇L̃(Â) = 0.
By definition of a local minimizer, ∃ǫ > 0 : kWl − Ŵl kF ≤ ǫ ∀l ∈ [L]0 implies L(W0:L ) ≥ L(Ŵ0:L ).
Proposition 3. Let W̃0:L satisfy the following:
1. kW̃l − Ŵl kF ≤ ǫ/2 ∀l ∈ [L]0 ;
2. Ã = Â.
Then W̃0:L is a local minimizer of L.
Proof. Let kWl − W̃l kF ≤ ǫ/2 ∀l ∈ [L]0 . Then kWl − Ŵl kF ≤ kWl − W̃l kF + kW̃l − Ŵl kF ≤ ǫ ∀l ∈ [L]0 . Hence
L(W0:L ) ≥ L(Ŵ0:L ) = L(W̃0:L ).
Since Wl+1,− = Wl+1 Wl,− , we have ker(Wl+1,− ) ⊇ ker(Wl,− ). Hence there is a chain of inclusions:

ker(Ŵ0,− ) ⊆ . . . ⊆ ker(ŴL−1,− ). (3.31)

Since the (L − 1)-th kernel is non-trivial, there exists l∗ ∈ [L − 1]0 such that ker(Ŵl,− ) is non-trivial for any l ≥ l∗ ,
while for l < l∗ the l-th kernel is trivial. This gives the following:

0 = ∇l∗ L(Ŵ0:L ) = ŴlT∗ +1,+ ∇L̃(Â)ŴlT∗ −1,− implies 0 = ŴlT∗ +1,+ ∇L̃(Â). (3.32)

We cannot guarantee that ker(ŴlT∗ +1,+ ) is trivial. However, we can try to construct a perturbation that does not
alter the loss and such that the corresponding kernel is trivial.
First, without loss of generality assume that nL+1 ≥ n0. Indeed, if Theorem 3 is already proven for nL+1 ≥ n0, we
can get the same result for nL+1 < n0 by applying this theorem to L̃′(A) = L̃(AT). This gives that all local minima
of L′(WTL:0) = L̃′(WT0 . . . WTL) = L̃(WL . . . W0) correspond to critical points of L̃′(WT0 . . . WTL). This is equivalent to
saying that all local minima of L(W0:L) = L′(WTL:0) correspond to critical points of L̃(WL . . . W0) = L̃′(WT0 . . . WTL).
Combination of assumptions nL+1 ≥ n0 and minl nl = min{n0 , nL+1 } gives nl ≥ n0 ∀l ∈ [L + 1].
Note that Wl,− ∈ Rnl+1 ×n0 ∀l ∈ [L]0 . Since nl+1 ≥ n0 , it is a "column" matrix. Consider an SVD decomposition
of Ŵl,− :
Ŵl,− = Ûl Σ̂l V̂lT . (3.33)
Here Ûl is an orthogonal nl+1 × nl+1 matrix, V̂l is an orthogonal n0 × n0 matrix, and Σ̂l is a diagonal nl+1 × n0
matrix. Since for l ≥ l∗ Ŵl,− has a non-trivial kernel, its least singular value is zero: σ̂l,n0 = 0. Let ûl be the n0 -th
column of Ûl , which exists since n0 ≤ nl+1 . Let us now define a family of perturbations satisfying the conditions of
Proposition 3:
Proposition 4. Let wl∗+1, . . . , wL and δl∗+1, . . . , δL be any collections of vectors and scalars, respectively, satisfying:
1. wl ∈ Rnl+1 , kwl k2 = 1;

2. δl ∈ [0, ǫ/2].
Then the tuples W̃0:L defined by

W̃l = Ŵl + δl wl ûTl−1 for l > l∗ , and W̃l = Ŵl otherwise, (3.34)

satisfy the conditions of Proposition 3.


Proof. For l ≤ l∗ the first condition is trivial. In the opposite case we have:

kW̃l − Ŵl k2F = kδl wl ûTl−1 k2F = δl2 kwl k22 kûl−1 k22 ≤ ǫ2 /4, (3.35)

which gives the first condition of Proposition 3.


Let us now prove that W̃l,− = Ŵl,− ∀l ≥ l∗ (for l < l∗ the statement is trivial). For l = l∗ the statement goes
from the definition; this gives the induction base. The induction step is given as follows:

W̃l+1,− = W̃l+1 W̃l,− = W̃l+1 Ŵl,− = (Ŵl+1 + δl+1 wl+1 ûTl )Ŵl,− = Ŵl+1 Ŵl,− = Ŵl+1,− . (3.36)

Hence by Proposition 3 for any δl and wl satisfying the conditions of Proposition 4, W̃0:L is a local minimum of
L. Then we have an equation similar to (3.32):

0 = ∇l∗ L(W̃0:L ) = W̃lT∗ +1,+ ∇L̃(Ã)ŴlT∗ −1,− . (3.37)

As before, this implies:


0 = ∇L̃T (Ã)W̃l∗ +1,+ . (3.38)
For δl∗ +1 = 0 we have:
0 = ∇L̃T (Ã)W̃L . . . W̃l∗ +2 Ŵl∗ +1 . (3.39)
Subtracting the latter equation from the former one gives:

0 = ∇L̃T (Ã)W̃L . . . W̃l∗ +2 (W̃l∗ +1 − Ŵl∗ +1 ) = ∇L̃T (Ã)W̃L . . . W̃l∗ +2 (δl∗ +1 wl∗ +1 ûTl∗ ). (3.40)

Right-multiplying this equation by ûl∗ gives:

0 = δl∗ +1 ∇L̃T (Ã)W̃L . . . W̃l∗ +2 wl∗ +1 , (3.41)

which holds for any sufficiently small non-zero δl∗ +1 and any unit wl∗ +1 . Hence

0 = ∇L̃T (Ã)W̃L . . . W̃l∗ +2 . (3.42)

Proceeding in the same manner finally gives ∇L̃(Ã) = 0. The proof concludes by noting that ∇L̃(Â) = ∇L̃(Ã)
by construction of W̃0:L.

3.3 Local convergence guarantees


Let L ∈ C 2 (Rdim θ ) and L has L-Lipschitz gradient:

k∇L(θ1 ) − ∇L(θ2 )k2 ≤ Lkθ1 − θ2 k2 ∀θ1,2 . (3.43)

Consider a GD update rule:


θk+1 = θk − η∇L(θk ) = g(θk ). (3.44)
Let θ∗ be a strict saddle:
∇L(θ∗ ) = 0, λmin (∇2 L(θ∗ )) < 0. (3.45)
Let θ0 ∼ Pinit . We shall prove the following result which is due to [Lee et al., 2016]:

Theorem 4. Suppose Pinit is absolutely continuous with respect to the Lebesgue measure µ on Rdim θ . Then for
η ∈ (0, L−1 ), P({limk→∞ θk = θ∗ }) = 0.
Proof. The proof starts with the definition of global stable sets. Define a global stable set of a critical point as a
set of initial conditions that lead to convergence to this critical point:

Θs(θ∗) = {θ0 : lim_{k→∞} θk = θ∗}.   (3.46)

In order to prove the theorem it suffices to show that µ(Θs (θ∗ )) = 0.


The proof relies on the following result of the theory of dynamical systems:
Theorem 5. Let 0 be a stable point of a local diffeomorphism φ : U → E, where U is a vicinity of zero in a
Banach space E. Suppose that E = Es ⊕ Eu, where Es is a span of eigenvectors that correspond to eigenvalues of
Dφ(0) less or equal to one, while Eu is a span of eigenvectors that correspond to the eigenvalues greater than one.
Then there exists a disk Θsc_loc tangent to Es at 0, called the local stable center manifold. Moreover, there exists a
neighborhood B of 0 such that φ(Θsc_loc) ∩ B ⊂ Θsc_loc and ∩_{k=0}^{∞} φ−k(B) ⊂ Θsc_loc.

In order to apply this theorem, we have to prove that g is a diffeomorphism:


Proposition 5. For η ∈ (0, L−1 ) g is a diffeomorphism.
Given this, we apply the theorem above to φ(θ) = g(θ + θ∗): its differential at zero is Dφ(0) = I − η∇²L(θ∗).
Since λmin(∇²L(θ∗)) < 0, dim Eu > 0, hence dim Es < dim θ. This means that µ(Θsc_loc) = 0.
Let B be a vicinity of zero promised by Theorem 5. Given θ0 ∈ Θs(θ∗), ∃K ≥ 0 : ∀k ≥ K θk ∈ B. Equivalently,
∀l ≥ 0 gl(θK) ∈ B. Hence θK ∈ ∩_{l=0}^{∞} g−l(B) ⊂ Θsc_loc. This gives the following:

Θs(θ∗) ⊆ ∪_{K=0}^{∞} g−K(Θsc_loc).   (3.47)

The proof concludes by noting that µ(Θs(θ∗)) ≤ Σ_{K=0}^{∞} µ(g−K(Θsc_loc)) = 0, since g is a diffeomorphism and
diffeomorphisms map null sets to null sets.
Proof of Proposition 5. Being a diffeomorphism is equivalent to being injective, surjective, continuously differentiable,
and having a continuously differentiable inverse.
Suppose g(θ) = g(θ′ ). Then θ − θ′ = η(∇L(θ′ ) − ∇L(θ)). Hence:

kθ − θ′ k2 = ηk∇L(θ′ ) − ∇L(θ)k2 ≤ ηLkθ − θ′ k2 . (3.48)

Since η < 1/L, this implies θ = θ′ . Hence g is injective.


Given some point θ2, we shall construct θ1 such that θ2 = g(θ1). Consider:

h(θ1,2) = (1/2)‖θ1 − θ2‖2² − ηL(θ1).   (3.49)
Note that h(θ1,2 ) is strongly convex with respect to θ1 :

λmin (∇2θ1 h(θ1,2 )) ≥ 1 − ηL > 0. (3.50)

Hence it has a unique global minimizer which is a critical point:

0 = ∇θ1 h(θ1,2 ) = θ1 − θ2 − η∇L(θ1 ). (3.51)

Hence a unique element of arg minθ1 h(θ1,2 ) satisfies θ2 = g(θ1 ). Hence g is surjective.
The fact that g ∈ C¹(Rdim θ) follows from the fact that g(θ) = θ − η∇L(θ) and L ∈ C²(Rdim θ). By virtue of
the inverse function theorem, in order to prove that g has a C¹ inverse, it suffices to show that g is itself C¹ and its
Jacobian is non-singular everywhere. The Jacobian is given as Jg(θ) = I − η∇²L(θ); hence its minimal eigenvalue is
≥ 1 − ηL > 0, which means that the Jacobian is non-singular. This concludes the proof that g is a
diffeomorphism.
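The surjectivity argument above is constructive and can be traced numerically. Below is a sketch assuming a one-dimensional quartic loss (an illustrative choice; its gradient is Lipschitz only locally, so we keep the iterates bounded).

```python
# Inverting the GD map g(theta) = theta - eta*grad L(theta) by minimizing the
# strongly convex h of eq. (3.49), whose critical point satisfies g(theta1) = theta2.
import numpy as np

grad_L = lambda th: th ** 3 - th        # grad of L(theta) = theta^4/4 - theta^2/2
eta = 0.1                               # < 1/L on the region |theta| <= 1.5

def g(th):
    return th - eta * grad_L(th)

def g_inverse(th2, steps=500, lr=0.1):
    """GD on h(theta1) = 0.5*(theta1-theta2)^2 - eta*L(theta1); see eq. (3.51)."""
    th1 = th2.copy()
    for _ in range(steps):
        th1 -= lr * (th1 - th2 - eta * grad_L(th1))   # gradient of h
    return th1

th = np.array([0.7, -1.2, 0.3])
print(np.max(np.abs(g_inverse(g(th)) - th)))          # ~0: g is invertible here
```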

3.3.1 Limitations of the result
Note that Theorem 4 applies in the following assumptions:
1. L ∈ C 2 ;
2. ∇L is L-Lipschitz and η ∈ (0, L−1 );
3. No gradient noise;
4. The saddle point is strict;
5. The saddle point is isolated.
The first assumption is necessary to ensure that g is a diffeomorphism. ReLU nets violate this assumption, and
hence require a generalization of Theorem 5.
The second assumption is standard in the optimization literature. Note, however, that for, say,
a quadratic loss, a network with at least one hidden layer results in a loss surface whose gradient is not globally
Lipschitz. Fortunately, if we show that there exists a subset S ⊆ Rdim θ such that g(S) ⊆ S and restrict initializations
to this subset, one can substitute the global Lipschitzness requirement with local Lipschitzness in S; this is done
in [Panageas and Piliouras, 2017]. Note that ReLU nets break gradient Lipschitzness anyway.
A full-batch gradient descent is rare in practice; a typical procedure is a stochastic gradient descent which
introduces a zero-centered noise to gradient updates. Existence of such noise pulls us away from the scope of the
dynamical systems theory. Nevertheless, intuitively, this noise should help us to escape a stable manifold associated
with the saddle point at hand. It turns out that the presence of noise allows one to have guarantees not only for
convergence itself, but even for convergence rates: see e.g. [Jin et al., 2017].
Strictness of saddle points is necessary to ensure that the second-order information about the Hessian of the loss
is enough to identify Eu. We hypothesize that a generalization of Theorem 5 to higher-order saddles is still possible
(but out of the scope of the conventional dynamical systems theory).
Note that Theorem 4 says essentially that we cannot converge to any a-priori given saddle point. If the set of
all saddle points is at most countable, this will imply that we cannot converge to any saddle points. However, if
this set is uncountable, Theorem 4 does not guarantee that we do not converge to any of them. Moreover, e.g. for
ReLU nets, there is a continuous family of weight-space symmetries that keep criticality (and hence keep negativity
of the least eigenvalue of the Hessian). Indeed, substituting (Wl+1, Wl) with (α−1Wl+1, αWl) for any positive α
keeps f(x) unchanged. Moreover, if ∇l′ = 0 ∀l′ then

∇(α)l+1 = g(α)l+2 x(α),Tl+1 = α gl+2 xTl+1 = α∇l+1 = 0,   (3.52)

and all other ∇(α)l′ = 0 by a similar reasoning.
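A short numerical illustration of this symmetry (the sizes and seed are arbitrary):

```python
# Rescaling (W_{l+1}, W_l) -> (W_{l+1}/alpha, alpha*W_l) leaves a ReLU net unchanged.
import numpy as np

relu = lambda h: np.maximum(h, 0.0)
rng = np.random.default_rng(0)
W0, W1, W2 = (rng.standard_normal(s) for s in [(5, 3), (4, 5), (2, 4)])

def f(x, W0, W1, W2):
    return W2 @ relu(W1 @ relu(W0 @ x))

x, alpha = rng.standard_normal(3), 3.7
print(np.allclose(f(x, W0, W1, W2),
                  f(x, alpha * W0, W1 / alpha, W2)))   # True: positive homogeneity
```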
A generalization of Theorem 4 to non-isolated critical points is given in [Panageas and Piliouras, 2017]. Intu-
itively, if we have a manifold of strict saddle points, the global stable set associated with this manifold is still of
measure zero due to the existence of the unstable manifold. Nevertheless, one has to again generalize Theorem 5.

Chapter 4

Generalization

The goal of learning is to minimize a population risk R over some class of predictors F :
f∗ ∈ Arg min_{f∈F} R(f),   (4.1)

where R(f ) = E x,y∼D r(y, f (x)); here D is a data distribution and r(y, z) is a risk; we shall assume that r(y, z) ∈ [0, 1].
A typical notion of risk for binary classification problems is 0/1-risk: r0/1 (y, z) = [yz < 0], where the target
y ∈ {−1, 1} and the logit z ∈ R; in this case 1 − R(f ) is an accuracy of f . Since we do not have an access to the
true data distribution D, we cannot minimize the true risk. Instead, we can hope to minimize an empirical risk R̂m
over a set of m i.i.d. samples Sm from distribution D:

f̂m ∈ Arg min_{f∈F} R̂m(f),   (4.2)

where R̂m(f) = E x,y∼Sm r(y, f(x)). Since a risk function is typically non-convex and suffers from poor gradients, one
cannot solve problem (4.2) directly with gradient methods. A common solution is to consider a convex differentiable
surrogate ℓ for a risk r, and substitute problem (4.2) with a train loss minimization problem:

f̂m ∈ Arg min_{f∈F} L̂m(f),   (4.3)

where L̂m (f ) = E x,y∼Sm ℓ(y, f (x)); this problem can be attacked directly with gradient methods.
Unfortunately, it is hard to obtain any guarantees for finding solutions even for problem (4.3). Nevertheless,
suppose we have a learning algorithm A that takes a dataset Sm and outputs a model fˆm . This algorithm may
aim to solve problem (4.3) or to tackle problem (4.2) directly, but its purpose does not matter; what matters is the
fact that it conditions a model fˆm on a dataset Sm . Our goal is to upper-bound some divergence of R̂m (fˆm ) with
respect to R(fˆm ). Since the dataset Sm is random, fˆm is also random, and the bound should have some failure
probability δ with respect to Sm .

4.1 Uniform bounds


First of all, note that R(f ) = E Sm ∼Dm R̂m (f ). This fact suggests applying the Hoeffding’s inequality for upper-
bounding R̂m (f ) − R(f ):
Theorem 6 (Hoeffding’s inequality [Hoeffding, 1963]). Let X1:m be i.i.d. random variables supported on [0, 1].
Then, given ǫ > 0,

P( Σ_{i=1}^{m} Xi − E Σ_{i=1}^{m} Xi ≥ ǫ ) ≤ e^{−2ǫ²/m},   P( E Σ_{i=1}^{m} Xi − Σ_{i=1}^{m} Xi ≥ ǫ ) ≤ e^{−2ǫ²/m}.   (4.4)

This gives us the following:


P(R(f) − R̂m(f) ≥ ǫ) ≤ e^{−2mǫ²} ∀ǫ > 0 ∀f ∈ F.   (4.5)

Hence for any f ∈ F,

R(f) − R̂m(f) ≤ √( (1/2m) log(1/δ) ) w.p. ≥ 1 − δ over Sm.   (4.6)
However, the bound above does not serve our purpose, since f there is given a priori and does not depend on Sm. Our
goal is to bound the same difference but with f = f̂m. The simplest way to do this is to upper-bound this difference
uniformly over F:
R(fˆm ) − R̂m (fˆm ) ≤ sup (R(f ) − R̂m (f )). (4.7)
f ∈F

A note on the goodness of uniform bounds. One may worry about how large the supremum over F can be.
If the model class F includes a "bad model" which has low train error for a given Sm but large true error, the bound
becomes too pessimistic. Unfortunately, in the case of realistic neural nets, one can explicitly construct such a bad
model. For instance, for a given Sm consider fˆm,m′ = A(Sm ∪ S̄m′ ) with S̄m′ being a dataset with random labels —
it is independent of Sm and taken in advance. For m′ ≫ m, f̂m,m′ ≈ A(S̄m′) — a model learned on random labels;
see [Zhang et al., 2016]. Hence for binary classification with balanced classes R(fˆm,m′ ) ≈ 0.5, while R̂m (fˆm,m′ ) ≈ 0
whenever the algorithm is able to learn the data perfectly, which is empirically the case for gradient descent applied
to realistic neural nets.
Nevertheless, taking F to be a set of all models realizable with a given architecture is not necessary. Indeed,
assume the data lies on a certain manifold: supp D ⊆ M. Then for sure, fˆm ∈ A(Mm ). Taking F = A(Mm )
ensures that F contains only those models that are realizable by our algorithm on realistic data — this excludes
the situation discussed above. One can hope then that if our learning algorithm is good for any realistic data, the
bound will be also good. The problem then boils down to upper-bounding the supremum as well as possible.
Unfortunately, bounding the supremum for F = A(Mm ) is problematic since it requires taking the algorithm
dynamics into account, which is complicated for gradient descent applied to neural nets. As a trade-off, one can
consider some larger F ⊇ A(Mm ), for which the supremum can be upper-bounded analytically.

4.1.1 Upper-bounding the supremum


When F is finite, we can still apply our previous bound:

P( sup_{f∈F} (R(f) − R̂m(f)) ≥ ǫ ) = P(∃f ∈ F : (R(f) − R̂m(f)) ≥ ǫ) ≤ Σ_{f∈F} P(R(f) − R̂m(f) ≥ ǫ) ≤ |F| e^{−2mǫ²} ∀ǫ > 0.   (4.8)

Hence

sup_{f∈F} (R(f) − R̂m(f)) ≤ √( (1/2m)( log(1/δ) + log |F| ) ) w.p. ≥ 1 − δ over Sm.   (4.9)
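To get a sense of the numbers, here is the bound (4.9) evaluated for a hypothetical finite class; all constants are illustrative.

```python
# Evaluating the finite-class bound (4.9) for hypothetical m, |F|, delta.
import numpy as np

m, card_F, delta = 10_000, 10**6, 0.01
gap = np.sqrt((np.log(1 / delta) + np.log(card_F)) / (2 * m))
print(f"sup_f (R - R_hat) <= {gap:.3f} w.p. >= {1 - delta}")   # ~0.030
```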
In the case when F is infinite, we can rely on a certain generalization of Hoeffding’s inequality:
Theorem 7 (McDiarmid’s inequality [McDiarmid, 1989]). Let X1:m be i.i.d. random variables and g be a scalar
function of m arguments such that

sup_{x1:m, x̂i} |g(x1:m) − g(x1:i−1, x̂i, xi+1:m)| ≤ ci ∀i ∈ [m].   (4.10)

Then, given ǫ > 0,

P( g(X1:m) − E g(X1:m) ≥ ǫ ) ≤ exp( −2ǫ² / Σ_{i=1}^{m} ci² ).   (4.11)

Applying this inequality to g({(xi, yi)}_{i=1}^{m}) = sup_{f∈F} (R(f) − R̂m(f)) gives:

PSm( sup_{f∈F} (R(f) − R̂m(f)) − E_{S′m} sup_{f∈F} (R(f) − R̂′m(f)) ≥ ǫ ) ≤ e^{−2mǫ²},   (4.12)
which is equivalent to:

sup_{f∈F} (R(f) − R̂m(f)) ≤ E_{S′m} sup_{f∈F} (R(f) − R̂′m(f)) + √( (1/2m) log(1/δ) ) w.p. ≥ 1 − δ over Sm.   (4.13)

Let us upper-bound the expectation:

E_{S′m} sup_{f∈F} (R(f) − R̂′m(f)) = E_{S′m} sup_{f∈F} (E_{S″m} R̂″m(f) − R̂′m(f)) ≤ E_{S′m} E_{S″m} sup_{f∈F} (R̂″m(f) − R̂′m(f)) =

= E_{S′m} E_{S″m} sup_{f∈F} (1/m) Σ_{i=1}^{m} ( r(y″i, f(x″i)) − r(y′i, f(x′i)) ) = E_{S′m} E_{S″m} sup_{f∈F} (1/m) Σ_{i=1}^{m} ( r″i(f) − r′i(f) ) =

= E_{S′m} E_{S″m} E_{σm∼{−1,1}^m} sup_{f∈F} (1/m) Σ_{i=1}^{m} σi ( r″i(f) − r′i(f) ) ≤

≤ E_{S′m} E_{S″m} E_{σm∼{−1,1}^m} [ sup_{f∈F} (1/m) Σ_{i=1}^{m} σi r″i(f) + sup_{f∈F} (1/m) Σ_{i=1}^{m} (−σi) r′i(f) ] =

= 2 E_{S′m} E_{σm∼{−1,1}^m} sup_{f∈F} (1/m) Σ_{i=1}^{m} σi r(y′i, f(x′i)) = 2 E_{S′m} Rad(r ◦ F | S′m),   (4.14)

(introducing the Rademacher signs σi is valid since swapping x′i with x″i does not change the joint distribution),
where we have defined a function class r ◦ F such that ∀h ∈ r ◦ F h(x, y) = r(y, f(x)) for some f ∈ F, and the Rademacher
complexity of a class H of functions supported on [0, 1] conditioned on a dataset z1:m:

Rad(H | z1:m) = E_{σ1:m∼{−1,1}^m} sup_{h∈H} (1/m) Σ_{i=1}^{m} σi h(zi).   (4.15)
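The conditional Rademacher complexity (4.15) can be estimated by Monte Carlo when H is small; a sketch for a hypothetical finite class represented by its matrix of values h(zi) (the sizes are illustrative):

```python
# Monte Carlo estimate of the conditional Rademacher complexity (4.15).
import numpy as np

def rademacher(H_values, n_draws=10_000, seed=0):
    """H_values: array of shape (|H|, m) with entries h(z_i) in [0, 1]."""
    rng = np.random.default_rng(seed)
    m = H_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # For each sigma, take sup over h of (1/m) sum_i sigma_i h(z_i)
    return (sigma @ H_values.T / m).max(axis=1).mean()

rng = np.random.default_rng(1)
H_values = rng.uniform(size=(50, 200))       # |H| = 50 hypotheses, m = 200 points
est = rademacher(H_values)
bound = np.sqrt(2 * np.log(2 * 50) / 200)    # the finite-class bound, cf. (4.18)
print(f"estimate {est:.3f} <= bound {bound:.3f}")
```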

4.1.2 Upper-bounding the Rademacher complexity


The case of zero-one risk
Consider r(y, z) = r0/1(y, z) = [yz < 0]. In this case we have:

Rad(r ◦ F | Sm) = E_{σm} sup_{f∈F} (1/m) Σ_{i=1}^{m} σi ri(f) = E_{σm} max_{f∈FSm} (1/m) Σ_{i=1}^{m} σi ri(f) =

= (1/ms) log exp( s E_{σm} max_{f∈FSm} Σ_{i=1}^{m} σi ri(f) ) ≤ (1/ms) log E_{σm} exp( s max_{f∈FSm} Σ_{i=1}^{m} σi ri(f) ) =

= (1/ms) log E_{σm} exp( s max_{h∈r◦FSm ∪ (−r)◦FSm} Σ_{i=1}^{m} σi hi ) ≤ (1/ms) log Σ_{h∈r◦FSm ∪ (−r)◦FSm} E_{σm} exp( s Σ_{i=1}^{m} σi hi ) ≤

≤ (1/ms) log Σ_{h∈r◦FSm ∪ (−r)◦FSm} e^{ms²/2} = (1/ms) log( 2|FSm| e^{ms²/2} ) = (1/ms) log(2|FSm|) + s/2,   (4.16)

where s is any positive real number and FSm is the set of equivalence classes of functions from F, where two functions
are equivalent iff their outputs on Sm have identical signs; note that this class is finite: |FSm| ≤ 2^m. We have also used
the following lemma:
Lemma 6 (Hoeffding’s lemma [Hoeffding, 1963]). Let X be a random variable a.s.-supported on [a, b] with zero
mean. Then, for any positive real s,

E e^{sX} ≤ e^{(b−a)²s²/8}.   (4.17)

Since the upper bound (4.16) is valid for any s > 0, we can minimize it with respect to s. One can easily deduce
that the optimal s is √( (2/m) log(2|FSm|) ); plugging it into eq. (4.16) gives:

Rad(r ◦ F | Sm) ≤ √( (2/m) log(2|FSm|) ).   (4.18)
Define ΠF(m) = max_{Sm} |FSm| — a growth function of a function class F. The growth function shows how many
distinct labelings a function class F induces on datasets of varying sizes. Obviously, ΠF(m) ≤ 2^m and ΠF(m) is
monotonically non-decreasing. We say "F shatters Sm" whenever |FSm| = 2^m. Define a VC-dimension
[Vapnik and Chervonenkis, 1971] as the maximal m for which F shatters some Sm:

One can relate a growth function with a VC-dimension using the following lemma:

Lemma 7 (Sauer’s lemma [Sauer, 1972]). ΠF(m) ≤ Σ_{k=0}^{VC(F)} (m choose k).

Now we need to express the asymptotic behavior as m → ∞ in a convenient way. Let d = VC(F). For m ≤ d,
ΠF(m) = 2^m; consider m > d:

ΠF(m) ≤ Σ_{k=0}^{d} (m choose k) ≤ (m/d)^d Σ_{k=0}^{d} (m choose k)(d/m)^k ≤ (m/d)^d Σ_{k=0}^{m} (m choose k)(d/m)^k = (m/d)^d (1 + d/m)^m ≤ (em/d)^d.   (4.20)

Substituting it into (4.18) gives the final bound:

Rad(r ◦ F | Sm) ≤ √( (2/m)( log 2 + VC(F)(1 + log m − log VC(F)) ) ) = Θ_{m→∞}( √( 2 VC(F) (log m)/m ) ).   (4.21)

Hence for the bound to be non-vacuous, having VC(F) < m/(2 log m) is necessary. According to [Bartlett et al., 2019],
whenever F denotes a set of all models realizable by a fully-connected network of width U with W parameters,
VC(F) = Θ(WU). While the constant is not given here, this result suggests that the corresponding bound will be
vacuous for realistic nets trained on realistic datasets since W ≫ m there.

The case of a margin risk


Suppose now r is a γ-margin risk: r(y, z) = rγ(y, z) = [yz < γ]. In this case we can bound the true 0/1-risk as:

R0/1(f̂m) ≤ Rγ(f̂m) ≤ R̂m,γ(f̂m) + 2 E_{S′m} Rad(rγ ◦ F | S′m) + √( (1/2m) log(1/δ) ) w.p. ≥ 1 − δ over Sm.   (4.22)
Here we have a trade-off between the train risk and the Rademacher complexity: as γ grows larger the former term
grows too, but the latter one vanishes. One can hope that a good enough model fˆm should be able to classify the
dataset it was trained on with a sufficient margin, i.e. R̂m,γ (fˆm ) ≈ 0 for large enough γ.
In the case of a margin loss, a Rademacher complexity is upper-bounded with covering numbers:

Np(H, ǫ, Sm) = inf_{H̄⊆H} { |H̄| : ∀h ∈ H ∃h̄ ∈ H̄ : ( Σ_{k=1}^{m} |h(zk) − h̄(zk)|^p )^{1/p} < ǫ }.   (4.23)

Note that Np (H, ǫ, Sm ) grows as ǫ → 0 and decreases with p. There are several statements that link the Rademacher
complexity with covering numbers; we start with a simple one:
Theorem 8. Suppose H is a class of hypotheses supported on [0, 1]. Then ∀Sm and ∀ǫ > 0,

Rad(H | Sm) ≤ ǫ/m + √( (2/m) log(2 N1(H, ǫ, Sm)) ).   (4.24)

Proof. Take ǫ > 0. Let H̄ ⊆ H be such that ∀h ∈ H ∃h̄ ∈ H̄: Σ_{k=1}^{m} |h(zk) − h̄(zk)| < ǫ. Let h̄[h] ∈ H̄ be a
representative of h ∈ H. Then,

Rad(H | Sm) = E_{σ1:m} sup_{h∈H} (1/m) Σ_{k=1}^{m} σk h(zk) ≤

≤ E_{σ1:m} sup_{h∈H} (1/m) Σ_{k=1}^{m} σk ( h(zk) − h̄[h](zk) ) + E_{σ1:m} sup_{h∈H} (1/m) Σ_{k=1}^{m} σk h̄[h](zk) ≤

≤ ǫ/m + E_{σ1:m} sup_{h̄∈H̄} (1/m) Σ_{k=1}^{m} σk h̄(zk) = ǫ/m + Rad(H̄ | Sm) ≤ ǫ/m + √( (2/m) log(2|H̄|Sm) ).   (4.25)

Taking the infimum with respect to H̄ concludes the proof.


Note that for γ = 0 rγ = r0/1 and N1 (r0/1 ◦ F , ǫ, Sm ) → |FSm | as ǫ → 0; hence we get (4.18).
The next theorem is more involved:
Theorem 9 (Dudley entropy integral [Dudley, 1967]). Suppose H is a class of hypotheses supported on [−1, 1].
Then ∀Sm and ∀ǫ > 0,

Rad(H | Sm) ≤ 4ǫ/√m + (12/m) ∫_{ǫ}^{√m/2} √( log N2(H, t, Sm) ) dt.   (4.26)
Now the task is to upper-bound the covering number for H = rγ ◦ F. It is easier, however, to upper-bound it for
r̃γ ◦ F instead, where r̃γ is a soft γ-margin risk: r̃γ(y, z) = min(1, [1 − yz/γ]+), so that r̃γ(y, ·) is γ−1-Lipschitz. Indeed,

Np(r̃γ ◦ F, ǫ, Sm) ≤ Np(γ−1 F, ǫ, Sm) = Np(F, γǫ, Sm).   (4.27)
In this case it suffices to upper-bound the covering number for the model class F itself. Note also that we still have
an upper-bound for the true 0/1-risk:

R0/1(f̂m) ≤ R̃γ(f̂m) ≤ R̂m,γ(f̂m) + 2 E_{S′m} Rad(r̃γ ◦ F | S′m) + √( (1/2m) log(1/δ) ) w.p. ≥ 1 − δ over Sm.   (4.28)
When F is a set of all models induced by a neural network of a given architecture, Np(F, γǫ, Sm) is infinite.
Nevertheless, if we restrict F to a class of functions with uniformly bounded Lipschitz constant, the covering number
becomes finite, which implies a finite conditional Rademacher complexity. If we moreover assume that the data
have bounded support, then the expected Rademacher complexity becomes finite as well.
A set of all neural nets of a given architecture does not have a uniform Lipschitz constant, however, this is the
case if we assume weight norms to be a-priori bounded. For instance, consider a fully-connected network f (·; W0:L )
with L hidden layers without biases. Assume the activation function φ to have φ(0) = 0 and to be 1-Lipschitz.
Define:
Fs0:L ,b0:L = {f (·; W0:L ) : ∀l ∈ [L]0 kWl k2 ≤ sl , kWlT k2,1 ≤ bl }. (4.29)
Theorem 10 ([Bartlett et al., 2017]).

log N2(Fs0:L,b0:L, ǫ, Sm) ≤ C² (‖Xm‖F²/ǫ²) R²s0:L,b0:L,   (4.30)

where C = √( log(2 max nl²) ) and we have introduced a spectral complexity:

Rs0:L,b0:L = ( Π_{l=0}^{L} sl ) × ( Σ_{l=0}^{L} (bl/sl)^{2/3} )^{3/2} = ( Σ_{l=0}^{L} ( bl Π_{l′≠l} sl′ )^{2/3} )^{3/2}.   (4.31)
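Computing the spectral complexity (4.31) from a given set of weight matrices is straightforward; in this sketch the per-layer bounds sl, bl are read off a hypothetical set of weights rather than fixed a priori.

```python
# Spectral complexity (4.31) with s_l, b_l read off the weights themselves.
import numpy as np

def spectral_complexity(Ws):
    s = [np.linalg.norm(W, ord=2) for W in Ws]        # spectral norms ||W_l||_2
    # ||W_l^T||_{2,1}: sum of l2 norms of the rows of W_l
    b = [np.sum(np.linalg.norm(W, axis=1)) for W in Ws]
    return np.prod(s) * sum((bl / sl) ** (2 / 3) for sl, bl in zip(s, b)) ** 1.5

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
      for n_in, n_out in [(10, 8), (8, 8), (8, 2)]]
print(spectral_complexity(Ws))
```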

Plugging this result into Theorem 9 and noting eq. (4.27) gives:

Rad(r̃γ ◦ Fs0:L,b0:L | Sm) ≤ 4ǫ/√m + (12/m) ∫_{ǫ}^{√m/2} C (‖Xm‖F/(γt)) Rs0:L,b0:L dt =
= 4ǫ/√m + (12/m) C (‖Xm‖F/γ) Rs0:L,b0:L log( √m/(2ǫ) ).   (4.32)
Differentiating the right-hand side wrt ǫ gives:

d(rhs.)/dǫ = 4/√m − (12/m) C (‖Xm‖F/(γǫ)) Rs0:L,b0:L.   (4.33)

Hence the optimal ǫ is given by:

ǫopt = (3/√m) C (‖Xm‖F/γ) Rs0:L,b0:L.   (4.34)
Plugging it back into the bound gives:

Rad(r̃γ ◦ Fs0:L,b0:L | Sm) ≤ (12/m) C (‖Xm‖F/γ) Rs0:L,b0:L ( 1 − log( (6/m) C (‖Xm‖F/γ) Rs0:L,b0:L ) ).   (4.35)

From an a-priori bound to an a-posteriori bound


We thus have obtained a bound for a class of neural networks with a-priori bounded weight norms. Let θ(i0:L, j0:L) be
the following set of network weights:

θ(i0:L, j0:L) = {W0:L : ∀l ∈ [L]0 ‖Wl‖2 ≤ sl(il), ‖WlT‖2,1 ≤ bl(jl)},   (4.36)

where sl(·) and bl(·) are strictly monotonic functions on N growing to infinity. Correspondingly, define a set of
failure probabilities:

δ(i0:L, j0:L) = δ / Π_{l=0}^{L} ( il(il + 1) jl(jl + 1) ).   (4.37)
Note the following equality:

Σ_{j=1}^{∞} 1/(j(j + 1)) = Σ_{j=1}^{∞} ( 1/j − 1/(j + 1) ) = 1.   (4.38)

This implies the following:

Σ_{i0=1}^{∞} . . . Σ_{iL=1}^{∞} Σ_{j0=1}^{∞} . . . Σ_{jL=1}^{∞} δ(i0:L, j0:L) = δ.   (4.39)

Hence by applying the union bound, the following holds with probability ≥ 1 − δ over Sm:

sup_{f∈Fs0:L(i0:L),b0:L(j0:L)} (R(f) − R̂m(f)) ≤ E_{S′m} Rad(r̃γ ◦ Fs0:L(i0:L),b0:L(j0:L) | S′m) + √( (1/2m) log(1/δ(i0:L, j0:L)) ) ∀il, jl ∈ N.   (4.40)
Take the set of smallest i0:L and j0:L such that ‖Ŵl‖2 < il/L and ‖ŴlT‖2,1 < jl/L ∀l ∈ [L]0, for Ŵ0:L being the
weights of a learned network f̂m = A(Sm). Denote these sets by i∗0:L and j∗0:L; let s∗0:L = s0:L(i∗0:L)
and b∗0:L = b0:L(j∗0:L). Given this, f̂m ∈ Fs∗0:L,b∗0:L, and we have with probability ≥ 1 − δ over Sm:

R(f̂m) − R̂m(f̂m) ≤ sup_{f∈Fs∗0:L,b∗0:L} (R(f) − R̂m(f)) ≤ E_{S′m} Rad(r̃γ ◦ Fs∗0:L,b∗0:L | S′m) + √( (1/2m) log(1/δ(i∗0:L, j∗0:L)) ).   (4.41)

Let us express the corresponding spectral complexity in a more convenient form:

Rs∗0:L,b∗0:L = ( Σ_{l=0}^{L} ( b∗l Π_{l′≠l} s∗l′ )^{2/3} )^{3/2} ≤ ( Σ_{l=0}^{L} ( (‖ŴlT‖2,1 + ∆b∗l) Π_{l′≠l} (‖Ŵl′‖2 + ∆s∗l′) )^{2/3} )^{3/2},   (4.42)

where ∆s∗l = s∗l+1 − s∗l and ∆b∗l = b∗l+1 − b∗l. At the same time,

Rs∗0:L,b∗0:L ≥ ( Σ_{l=0}^{L} ( ‖ŴlT‖2,1 Π_{l′≠l} ‖Ŵl′‖2 )^{2/3} )^{3/2}.   (4.43)

These two bounds together give an upper-bound for Rad(r̃γ ◦ Fs∗0:L,b∗0:L | Sm) that depends explicitly on learned
weight norms but not on s∗0:L and b∗0:L.
Note that i∗l = il(s∗l) ≤ il(‖Ŵl‖2) + 1 and j∗l = jl(b∗l) ≤ jl(‖ŴlT‖2,1) + 1 ∀l ∈ [L]0, where il(sl) and jl(bl) are
inverse maps for sl(il) and bl(jl), respectively. This gives an upper-bound for log(δ(i∗0:L, j∗0:L)−1):

log( 1/δ(i∗0:L, j∗0:L) ) ≤ log(1/δ) + Σ_{l=0}^{L} ( log(1 + il(‖Ŵl‖2)) + log(2 + il(‖Ŵl‖2)) + log(1 + jl(‖ŴlT‖2,1)) + log(2 + jl(‖ŴlT‖2,1)) ).   (4.44)

To sum up, we have expressed the bound on the test-train risk difference in terms of the weights of the learned model
f̂m = A(Sm), thus arriving at an a-posteriori bound. Note that the bound is valid for any sequences sl(il) and bl(jl)
taken beforehand. Following [Bartlett et al., 2017], we can take, for instance, sl(il) = il/L and bl(jl) = jl/L.

4.1.3 Failure of uniform bounds


Recall the general uniform bound:

R(f̂m) − R̂m(f̂m) ≤ sup_{f∈F} (R(f) − R̂m(f)),   (4.45)

where f̂m = A(Sm). We have already discussed that the bound fails if F contains a "bad model" for which R(f) is
large, while R̂m(f) is small; hence we are interested in taking F as small as possible. We have also noted that the
smallest F we can consider is A(Mm), where M = supp(D).
Consider now the ideal case: ∃ǫ > 0 : R(f ) < ǫ ∀f ∈ F . In other words, all models of the class F generalize
well. In this case the bound (4.45) becomes simply:

R(fˆm ) − R̂m (fˆm ) ≤ ǫ w.p. ≥ 1 − δ over Sm , (4.46)

which is perfect. Our next step was to apply McDiarmid’s inequality: see eq. (4.13); in our case this results in:
sup_{f∈F} (R(f) − R̂m(f)) ≤ ǫ + √( (1/2m) log(1/δ) ) w.p. ≥ 1 − δ over Sm,   (4.47)

which is almost perfect as well. What happened next is that we tried to upper-bound the expected supremum:

E_{S′m} sup_{f∈F} (R(f) − R̂′m(f)) = E_{S′m} sup_{f∈F} (E_{S″m} R̂″m(f) − R̂′m(f)) ≤ E_{S′m} E_{S″m} sup_{f∈F} (R̂″m(f) − R̂′m(f)).   (4.48)

The last step is called "symmetrization". Note that having small true error does not imply having small empirical
error on any train dataset. [Nagarajan and Kolter, 2019] constructed a learning setup in which for any S″m there
exists a model f̃m ∈ F such that R̂″m(f̃m) ≈ 1; this is true even for F = A(Mm). Specifically, they provided a
simple algorithm to construct a specific dataset ¬(S″m) and take f̃m = A(¬(S″m)).

4.2 PAC-Bayesian bounds


4.2.1 At most countable case
Recall the following bound for finite F :
P( sup_{f∈F} (R(f) − R̂m(f)) ≥ ǫ ) = P( ∃f ∈ F : (R(f) − R̂m(f)) ≥ ǫ ) ≤ Σ_{f∈F} P(R(f) − R̂m(f) ≥ ǫ) ≤ |F| e^{−2mǫ²} ∀ǫ > 0.   (4.49)

When F has infinite cardinality, the bound above still holds, but it is vacuous. Consider an at most countable F and
an ǫ that depends on f. If we take ǫ(f) for which Σ_{f∈F} e^{−2mǫ²(f)} is finite, then we arrive at a finite bound:

P( ∃f ∈ F : (R(f) − R̂m(f)) ≥ ǫ(f) ) ≤ Σ_{f∈F} P( R(f) − R̂m(f) ≥ ǫ(f) ) ≤ Σ_{f∈F} e^{−2mǫ²(f)}.   (4.50)

2 2
For instance, consider some probability distribution P (f ) on F . Take ǫ(f ) such that e−2mǫ (f )
= P (f )e−2mǫ̃ for
some ǫ̃ ∈ R+ . Solving this equation gives:
s
1 1
ǫ(f ) = ǫ̃ + log . (4.51)
2m P (f )

Hence we have ∀ǫ̃ > 0:

P( ∃f ∈ F : (R(f) − R̂m(f)) ≥ √( ǫ̃² + (1/2m) log(1/P(f)) ) ) ≤ e^{−2mǫ̃²}.   (4.52)
Or, equivalently, w.p. ≥ 1 − δ over Sm we have ∀f ∈ F:

R(f) − R̂m(f) ≤ √( (1/2m)( log(1/δ) + log(1/P(f)) ) ).   (4.53)

4.2.2 General case


Let us refer to P(f) as a "prior distribution". Suppose our learning algorithm is stochastic and outputs a model
distribution Q(f) which we shall refer to as a "posterior":

fˆm ∼ Q̂m = A(Sm ). (4.54)

We shall now prove the following theorem:


Theorem 11 ([McAllester, 1999a]). For any δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm we have:

R(Q̂m) − R̂m(Q̂m) ≤ √( (1/(2m − 1))( log(4m/δ) + KL(Q̂m || P) ) ),   (4.55)

where R(Q) = E_{f∼Q} R(f) and R̂m(Q) = E_{f∼Q} R̂m(f).


Proof. The proof relies on the following lemmas:
Lemma 8 ([McAllester, 1999a]). For any probability distribution P on F, for any δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm
we have:

E_{f∼P} e^{(2m−1)(∆m(f))²} ≤ 4m/δ,   (4.56)

where ∆m(f) = |R(f) − R̂m(f)|.
Lemma 9 ([Donsker and Varadhan, 1985]). Let P and Q be probability distributions on X. Then for any h : X → R,

E_{x∼Q} h(x) ≤ log E_{x∼P} e^{h(x)} + KL(Q || P).   (4.57)

From the D-V lemma, taking X = F, h = (2m − 1)∆m², and Q = Q̂m:

E_{f∼Q̂m} (2m − 1)(∆m(f))² ≤ log E_{f∼P} e^{(2m−1)(∆m(f))²} + KL(Q̂m || P).   (4.58)

Hence from Lemma 8, w.p. ≥ 1 − δ over Sm we have:

E_{f∼Q̂m} (2m − 1)(∆m(f))² ≤ log(4m/δ) + KL(Q̂m || P).   (4.59)

A simple estimate concludes the proof:

R(Q̂m) − R̂m(Q̂m) ≤ |E_{f∼Q̂m} (R(f) − R̂m(f))| ≤ E_{f∼Q̂m} |R(f) − R̂m(f)| = E_{f∼Q̂m} ∆m(f) ≤
≤ √( E_{f∼Q̂m} (∆m(f))² ) ≤ √( (1/(2m − 1))( log(4m/δ) + KL(Q̂m || P) ) ).   (4.60)

Let us prove the D-V lemma first; we shall prove it in the case when P ≪ Q and Q ≪ P:

Proof of Lemma 9.

E_{x∼Q} h(x) − KL(Q || P) = E_{x∼Q} ( h(x) − log (dQ/dP)(x) ) = E_{x∼Q} log( e^{h(x)} (dP/dQ)(x) ) ≤
≤ log E_{x∼Q} ( e^{h(x)} (dP/dQ)(x) ) = log E_{x∼P} e^{h(x)},   (4.61)

where dQ/dP is a Radon–Nikodym derivative.


We now proceed with proving Lemma 8:

Proof of Lemma 8. Recall Markov’s inequality:

Theorem 12 (Markov’s inequality). Let X be a non-negative random variable. Then ∀a > 0,

P(X ≥ a) ≤ E X / a.   (4.62)
Hence taking a = 4m/δ, it suffices to show that

E_{Sm} E_{f∼P} e^{(2m−1)(∆m(f))²} ≤ 4m.   (4.63)

We are going to prove a stronger property:

E_{Sm} e^{(2m−1)(∆m(f))²} ≤ 4m ∀f ∈ F.   (4.64)

Note that from Hoeffding’s inequality we get:

PSm( ∆m(f) ≥ ǫ ) ≤ 2e^{−2mǫ²} ∀ǫ > 0 ∀f ∈ F.   (4.65)

First, assume that the distribution of ∆m(f) has a density ∀f ∈ F; denote it by pf(∆). In this case we can
directly upper-bound the expectation over Sm:

E_{Sm} e^{(2m−1)(∆m(f))²} = ∫_{0}^{∞} e^{(2m−1)ǫ²} pf(ǫ) dǫ = ∫_{0}^{∞} e^{(2m−1)ǫ²} ( −(d/dǫ) ∫_{ǫ}^{∞} pf(∆) d∆ ) dǫ =

= [ −e^{(2m−1)ǫ²} ∫_{ǫ}^{∞} pf(∆) d∆ ]_{ǫ=0}^{∞} + 2(2m − 1) ∫_{0}^{∞} ǫ e^{(2m−1)ǫ²} ∫_{ǫ}^{∞} pf(∆) d∆ dǫ ≤

≤ ∫_{0}^{∞} pf(∆) d∆ + 2(2m − 1) ∫_{0}^{∞} ǫ e^{(2m−1)ǫ²} ∫_{ǫ}^{∞} pf(∆) d∆ dǫ ≤

≤ 2 + 4(2m − 1) ∫_{0}^{∞} ǫ e^{(2m−1)ǫ²} e^{−2mǫ²} dǫ = 2 + 4(2m − 1) ∫_{0}^{∞} ǫ e^{−ǫ²} dǫ = 2 + 2(2m − 1) = 4m.   (4.66)

We now relax our assumption of density existence. Let µf be a distribution of ∆m(f). Consider a class M of
all non-negative sigma-additive measures on R+ such that a property similar to (4.65) holds:

µ([ǫ, ∞)) ≤ 2e^{−2mǫ²} ∀ǫ > 0 ∀µ ∈ M.   (4.67)
Note that M contains the probability distribution of ∆m(f) for any f ∈ F. Among these measures we shall choose
a specific one that maximizes an analogue of the left-hand side of (4.64):

µ∗ ∈ Arg max_{µ∈M} ∫_{0}^{∞} e^{(2m−1)∆²} µ(d∆).   (4.68)

Note that constraint (4.67) states that the measure of any right tail of the real line should be upper-bounded. However,
µ∗ should have as much mass to the right as possible. Hence constraint (4.67) should become an equality for this
specific µ∗:

µ∗([ǫ, ∞)) = 2e^{−2mǫ²} ∀ǫ > 0.   (4.69)

It follows that µ∗ has density p̃∗(∆) = 8m∆e^{−2m∆²}.
R∞ 2
Note that an inequality similar to (4.66) holds for p̃∗ . Moreover, since µ∗ maximizes 0 e(2m−1)∆ µ(d∆), we
have the following bound:
Z ∞ Z ∞
(2m−1)(∆m (f ))2 (2m−1)∆2 (2m−1)∆2 2
E Sm e = E ∆∼µf e = e µf (d∆) ≤ e(2m−1)∆ p̃∗ (∆) d∆ ≤ 4m. (4.70)
0 0

4.2.3 Applying PAC-Bayesian bounds to deterministic algorithms

Consider a deterministic learning rule A(Sm) ∼ Q̂m, where Q̂m is a Dirac delta (a point mass at f̂m). While this
situation is fine for the at most countable case, whenever F is uncountable and P(f) = 0 ∀f ∈ F, KL(Q̂m || P) = ∞
and we arrive at a vacuous bound.

Compression and coding


One work-around is to consider some discrete coding c, with encc() being an encoder and decc() being a decoder.
We assume that decc(encc(f)) ≈ f ∀f ∈ F and instantiate a bound of the form (4.53) for encc(f). Equivalently, we
shall write fc for encc(f). Following [Zhou et al., 2019], we take a prior that prioritizes models of small code-length:

Pc(fc) = (1/Z) m(|fc|) 2^{−|fc|},   (4.71)
where |fc | is a code-length for f , m(k) is some probability distribution on N, and Z is a normalizing constant. In
this case a KL-divergence is given as:

KL(δfc || Pc ) = log Z + |fc | log 2 − log(m(|fc |)). (4.72)

In order to make our bound as small as possible, we need to ensure that our learning algorithm, when fed
realistic data, outputs models of small code-length. One can ensure this by coding not the model f itself, but rather
the result of its compression via a compression algorithm C. We assume that a compressed model C(f) is still a model
from F. We also hope that its risk does not change significantly, R(C(f)) ≈ R(f), and that a learning algorithm tends
to output models which in a compressed form have small code-length. In this case we are able to upper-bound a
test-train risk difference for an encoded compressed model C(f)c instead of the original one.
When our models are neural nets parameterized with a set of weights θ, a typical form of a compressed model
is a tuple (S, Q, C), where
• S = s1:k ⊂ [dim θ] are locations of non-zero weights;
• C = c1:r ⊂ R is a codebook;

• Q = q1:k , qi ∈ [r] ∀i ∈ [k] are quantized values.


Then C(θ)i = cqj if i = sj else 0. In this case a naive coding for 32-bit precision gives:

|C(θ)|c = |S|c + |Q|c + |C|c ≤ k(log dim θ + log r) + 32r. (4.73)
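A back-of-the-envelope evaluation of the estimate (4.73), with hypothetical values of dim θ, k, and r (logs taken base 2, so the estimate is in bits):

```python
# Naive code-length estimate (4.73) for a hypothetical pruned/quantized net.
import numpy as np

dim_theta, k, r = 10**6, 50_000, 64
bits = k * (np.log2(dim_theta) + np.log2(r)) + 32 * r
print(f"|C(theta)|_c <= {bits / 8 / 1024:.0f} KiB "
      f"vs {dim_theta * 32 / 8 / 1024:.0f} KiB uncompressed")
```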

Stochastization
Another work-around is to voluntarily substitute f̂m with some Q̃m, presumably satisfying E_{f∼Q̃m} f = f̂m, such
that KL(Q̃m || P) is finite. In this case we get an upper-bound for R(Q̃m) instead of R(f̂m). One possible goal
may be to obtain as good a generalization guarantee as possible; in this case one can optimize the upper-bound
on R(Q̃m) wrt Q̃m. Another goal may be to get a generalization guarantee for f̂m itself; in this case we have to
somehow relate it with a generalization guarantee for Q̃m.
Let us discuss the former goal first. Our goal is to optimize the upper-bound on the test risk wrt a stochastic model Q:

R(Q) ≤ R̂m(Q) + √( (1/(2m − 1))( log(4m/δ) + KL(Q || P) ) ) → min_Q.   (4.74)

In order to make optimization via GD possible, we first substitute R̂m with its differentiable convex surrogate L̂m:

R(Q) ≤ L̂m(Q) + √( (1/(2m − 1))( log(4m/δ) + KL(Q || P) ) ) → min_Q.   (4.75)

The second thing we have to do in order to make GD optimization feasible is switching from searching in an abstract
model distribution space to searching in some Euclidean space. Let F be a space of models realizable by a given
neural network architecture. Let θ denote a set of weights. Following [Dziugaite and Roy, 2017], we consider an
optimization problem in a distribution space Q consisting of non-degenerate diagonal Gaussians:

Q = {N(µ, diag(exp λ)) : µ ∈ Rdim θ, λ ∈ Rdim θ}.   (4.76)
dim θ
In this case we substitute the model class F with a set of network weights Rdim θ. For Q ∈ Q and a Gaussian prior
P = N(µ∗, exp(λ∗) I) the KL-divergence is given as follows:

KL(Q || P) = (1/2) ( e^{−λ∗} ( 1T e^{λ} + ‖µ − µ∗‖2² ) + dim θ (λ∗ − 1) − 1T λ ).   (4.77)
Since both the KL term and the loss term are differentiable wrt (µ, λ), we can optimize the test risk bound via GD.
[Dziugaite and Roy, 2017] suggest starting the optimization process from µ = θ̂m, where θ̂m is the set of weights
for a model f̂m = A(Sm), and λ being a vector with sufficiently large negative components.
The next question is how to choose the prior. Note that the distribution we finally choose as a result of the bound
optimization does not take stochasticity of the initialization θ(0) of the algorithm A that finds θ̂m into account. For
this reason, the prior can depend on θ(0) ; following [Dziugaite and Roy, 2017], we take µ∗ = θ(0) . The rationale for
this is that in this case the KL-term depends on ‖µ − θ(0)‖2². If we hope that both optimization processes do
not lead us far away from their initializations, the KL-term will not be too large.
As for the prior log-variance λ∗, we apply the same technique as for obtaining an a-posteriori uniform
bound: see Section 4.1.2. Define λ∗j = log c − j/b, where c, b > 0, j ∈ N. Take δj = 6δ/(π²j²). Then we get a valid
bound for any j ≥ 1:

R(Q) ≤ L̂m(Q) + √( (1/(2m − 1))( log(4m/δj) + KL(Q || P(µ∗, λ∗j)) ) ) w.p. ≥ 1 − δj over Sm.   (4.78)
A union bound gives:

R(Q) ≤ L̂m(Q) + √( (1/(2m − 1))( log(4m/δj) + KL(Q || P(µ∗, λ∗j)) ) ) ∀j ∈ N w.p. ≥ 1 − δ over Sm.   (4.79)
This allows us to optimize the bound wrt j. However, optimization wrt real numbers is preferable since this allows
us to apply GD. In order to achieve this, we express j as a function of λ∗: j = b(log c − λ∗). This gives us the
following expression:

R(Q) ≤ L̂m(Q) + √( (1/(2m − 1))( log( 2π²m (b(log c − λ∗))² / (3δ) ) + KL(Q || P(µ∗, λ∗)) ) ) ∀λ∗ ∈ {λ∗j}_{j=1}^{∞} w.p. ≥ 1 − δ.   (4.80)
The expression above allows us to optimize its right-hand side wrt λ∗ via GD. However, we cannot guarantee that
the optimization result lies in {λ∗j }∞
j=1 . To remedy this, [Dziugaite and Roy, 2017] simply round the result to the
closest λ∗ in this set. To sum up, we take µ∗ = θ(0) and optimize the bound (4.80) wrt µ, λ, and λ∗ via GD.

A bound for a deterministic model
Recall that in the previous section we aimed to search for a stochastic model that optimizes the upper-bound for the
test risk. In the current section we shall discuss how to obtain a bound for a given deterministic model f̂m
in a PAC-Bayesian framework.
Consider a neural network fθ with L − 1 hidden layers with weights θ without biases; let φ(·) be an activation
function. Suppose our learning algorithm A outputs weights θ̂m when given a dataset Sm . In our current framework,
both the prior and the posterior are distributions on Rdim θ . Note that McAllester’s theorem (Theorem 11) requires
computing the KL-divergence between two distributions in the model space. Nevertheless, noting that weights are
mapped to models surjectively, we can upper-bound this term with the KL-divergence in the weight space:
Corollary 1 (of Theorem 11). For any δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm we have:

R(Q̂m) ≤ R̂m(Q̂m) + √( (1/(2m − 1))( log(4m/δ) + KL(Q̂m || P) ) ),   (4.81)

where R(Q) = E_{θ∼Q} R(fθ) and R̂m(Q) = E_{θ∼Q} R̂m(fθ).

For deterministic A, our Q̂m is degenerate, and the bound is vacuous. The bound is, however, valid for any
distribution Q̃m in the weight space. We take Q̃m = N(θ̂m, σ²Idim θ) for some σ given beforehand. We take the
prior as P = N(0, σ²Idim θ); in this case the train risk term and the KL-term in the right-hand side of Corollary 1
are given as follows:

R̂m(Q̃m) = E_{ξ∼N(0,σ²Idim θ)} R̂m( fθ̂m+ξ ),   KL(Q̃m || P) = ‖θ̂m‖2² / (2σ²).   (4.82)

This gives us the upper-bound for R(Q̃m ); our goal is, however, to bound R(fˆm ) instead. The following lemma tells
us that it is possible to substitute a risk of a stochastic model with a margin risk of a deterministic one:
Lemma 10 ([Neyshabur et al., 2018]). Let the prior P have density p. For any δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm, for
any deterministic θ and a random a.c. ξ such that

Pξ( max_{x∈X} |fθ+ξ(x) − fθ(x)| < γ/2 ) ≥ 1/2   (4.83)

we have:

R(fθ) ≤ R̂m,γ(fθ) + √( (1/(2m − 1))( log(16m/δ) + 2 KL(q′ || p) ) ),   (4.84)

where q′ denotes a probability density of θ + ξ.
This lemma requires the noise ξ to conform to some property; the next lemma will help us to choose the standard
deviation σ accordingly:

Lemma 11 ([Neyshabur et al., 2018]). Let φ(z) = [z]+. For any x ∈ XB, where XB = {x ∈ X : ‖x‖2 ≤ B}, for
any θ = vec({Wl}_{l=1}^{L}), and for any ξ = vec({Ul}_{l=1}^{L}) such that ∀l ∈ [L] ‖Ul‖2 ≤ L−1‖Wl‖2,

|fθ+ξ(x) − fθ(x)| ≤ eB ( Π_{l=1}^{L} ‖Wl‖2 ) Σ_{l=1}^{L} ‖Ul‖2/‖Wl‖2.   (4.85)
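Lemma 11 is easy to check numerically; below is a sketch with illustrative layer sizes, where the perturbations Ul are rescaled to satisfy ‖Ul‖2 = ‖Wl‖2/L.

```python
# Numerical sanity check of the perturbation bound (4.85) on a random ReLU net.
import numpy as np

relu = lambda h: np.maximum(h, 0.0)
rng = np.random.default_rng(0)
sizes = [(8, 6), (8, 8), (1, 8)]                   # (n_out, n_in) per layer, L = 3
Ws = [rng.standard_normal(s) for s in sizes]
L = len(Ws)
Us = []
for W in Ws:
    U = rng.standard_normal(W.shape)
    U *= np.linalg.norm(W, 2) / (L * np.linalg.norm(U, 2))   # ||U_l|| = ||W_l||/L
    Us.append(U)

def f(x, Ws):
    h = Ws[0] @ x
    for W in Ws[1:]:
        h = W @ relu(h)
    return h

B = 1.0
x = rng.standard_normal(6)
x *= B / np.linalg.norm(x)                          # ||x|| = B
lhs = np.abs(f(x, [W + U for W, U in zip(Ws, Us)]) - f(x, Ws))
rhs = (np.e * B * np.prod([np.linalg.norm(W, 2) for W in Ws])
       * sum(np.linalg.norm(U, 2) / np.linalg.norm(W, 2) for W, U in zip(Ws, Us)))
print(float(lhs), float(rhs))                       # lhs <= rhs
```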

These lemmas will allow us to prove the following result:


Theorem 13 ([Neyshabur et al., 2018]). Assume supp x = XB and φ(z) = [z]+; let n = maxl nl. For any δ ∈ (0, 1)
w.p. ≥ 1 − δ over Sm we have for any θ̂m:

R(fθ̂m) ≤ R̂γ,m(fθ̂m) + √( (1/(2m − 1))( log(8Lm/δ) + (1/2L) log m + 8e⁴L²n log(2Ln) (B R(θ̂m)/γ)² ) ),   (4.86)

where we have defined a spectral complexity:

R(θ) = ( Π_{l=1}^{L} ‖Wl‖2 ) √( Σ_{l=1}^{L} ‖Wl‖F²/‖Wl‖2² ).   (4.87)

Compare with the result of Bartlett and coauthors:


Theorem 14 ([Bartlett et al., 2017]). Assume supp x = XB and φ(z) = [z]+; let n = maxl nl. For any δ ∈ (0, 1)
w.p. ≥ 1 − δ over Sm we have for any θ̂m:

R(fθ̂m) ≤ R̂γ,m(fθ̂m) + Rad(r̃γ ◦ F≤θ̂m | Sm) + √( (1/2m) log(1/δ) ),   (4.88)
where we upper-bound the Rademacher complexity as

Rad(r̃γ ◦ F≤θ | Sm) ≤ (C/√m)(B R(θ, L−1)/γ) √(log(2n)) ( 1 − log( (C/(2√m))(B R(θ, 0)/γ) √(log(2n)) ) ),   (4.89)

and we define a spectral complexity as

R(θ, ∆) = ( Π_{l=1}^{L} (‖Wl‖2 + ∆) ) √( Σ_{l=1}^{L} (‖WlT‖2,1 + ∆)²/(‖Wl‖2 + ∆)² ).   (4.90)

Both bounds grow linearly with (B/γ) Π_{l=1}^{L} ‖Wl‖2, which is a very natural property. While the former result is
simpler, the latter does not depend explicitly on depth L and width n. Nevertheless, the proof of the latter result
is rather technically involved, while the proof of the former can be reproduced without substantial effort.
Proof of Theorem 13. First of all, define:

β = ( Π_{l=1}^{L} ‖Wl‖2 )^{1/L},   W̃l = (β/‖Wl‖2) Wl.   (4.91)

Since ReLU is homogeneous, fθ̃ = fθ. Also, Π_{l=1}^{L} ‖Wl‖2 = Π_{l=1}^{L} ‖W̃l‖2 and Σ_{l=1}^{L} ‖W̃l‖F²/‖W̃l‖2² = Σ_{l=1}^{L} ‖Wl‖F²/‖Wl‖2².
Hence both the model and the bound do not change if we substitute θ with θ̃. Hence w.l.o.g. assume that ‖Wl‖2 = β
∀l ∈ [L].
We now use Lemma 11 to find σ > 0 for which the condition of Lemma 10 is satisfied. In particular, we have
to upper-bound the probability of ‖Ul‖2 ≥ β/L for some l ∈ [L]. Notice that for ξ ∼ N(0, σ²Idim θ), Ul has i.i.d.
zero-centered Gaussian entries ∀l ∈ [L]. In the trivial case of 1 × 1 matrices, we can apply a simple tail bound:

Pξ∼N(0,σ²)(|ξ| ≥ ǫ) = 2Pξ∼N(0,1)(ξ ≥ ǫ/σ) ≤ 2e^{−ǫ²/(2σ²)}.   (4.92)
This bound follows from the Chernoff bound, which is a simple corollary of Markov’s inequality:

Theorem 15 (Chernoff bound). For a real-valued random variable X, for any a ∈ R, and for any λ ∈ R+ we have:

P(X ≥ a) ≤ E e^{λX} / e^{λa}.   (4.93)
Indeed,

Pξ∼N(0,1)(ξ ≥ a) ≤ inf_{λ>0} ( E e^{λξ} / e^{λa} ) = e^{−sup_λ (λa − log E e^{λξ})} = e^{−sup_λ (λa − λ²/2)} = e^{−a²/2},   (4.94)

where we have used the moment-generating function for Gaussians:

E_{ξ∼N(0,1)} e^{λξ} = Σ_{k=0}^{∞} λ^k E ξ^k / k! = Σ_{k=0}^{∞} λ^{2k} (2k − 1)!! / (2k)! = Σ_{k=0}^{∞} λ^{2k} / (2k)!! = Σ_{k=0}^{∞} λ^{2k} / (2^k k!) = e^{λ²/2}.   (4.95)

We can apply the same bound to a linear combination of i.i.d. standard Gaussians:

Pξ1:m∼N(0,1)( |Σ_{i=1}^{m} ai ξi| ≥ ǫ ) = Pξ∼N(0, Σ_{i=1}^{m} ai²)( |ξ| ≥ ǫ ) ≤ 2 exp( −ǫ² / (2 Σ_{i=1}^{m} ai²) ).   (4.96)

Moreover, a similar bound holds for matrix-linear combinations:

Theorem 16 ([Tropp, 2011]). Let A1:m be n × n deterministic matrices and let ξ1:m be i.i.d. standard Gaussians.
Then

P( ‖Σ_{i=1}^{m} ξi Ai‖2 ≥ ǫ ) ≤ n exp( −ǫ² / (2‖Σ_{i=1}^{m} Ai²‖2) ).   (4.97)

Why don’t we have a factor of 2 here?
Let us return to bounding the probability of ‖Ul‖2 ≥ β/L. For any l ∈ [L], Tropp’s theorem gives:

P(‖Ul‖2 ≥ t) ≤ P(‖Ũl‖2 ≥ t) = Pξ1:n,1:n∼N(0,1)( ‖Σ_{i,j=1}^{n} σ ξij 1ij‖2 ≥ t ) ≤ n e^{−t²/(2σ²n)},   (4.98)

where 1ij denotes the n × n matrix with a single unit entry at position (i, j), and Ũl is an n × n matrix with entries:

Ũl,ij = Ul,ij for 1 ≤ i ≤ nl, 1 ≤ j ≤ nl+1, and Ũl,ij ∼ N(0, σ²) otherwise.   (4.99)

Hence by a union bound:

P( ∃l ∈ [L] : ‖Ul‖2 ≥ t ) ≤ Ln e^{−t²/(2σ²n)}.   (4.100)

Equating the right-hand side to 1/2 gives t = σ√(2n log(2Ln)). Next, taking t ≤ β/L gives

σ ≤ σmax,1 = β / ( L√(2n log(2Ln)) )   (4.101)

and allows us to apply Lemma 11: w.p. ≥ 1/2 over ξ,

max_{x∈XB} |fθ+ξ(x) − fθ(x)| ≤ eBβ^{L−1} Σ_{l=1}^{L} ‖Ul‖2 ≤ eBβ^{L−1} Lσ√(2n log(2Ln)).   (4.102)

In order to apply Lemma 10 we need to ensure that this quantity is bounded by γ/2. This gives

σ ≤ σmax,2 = γ / ( 2eBβ^{L−1} L√(2n log(2Ln)) ).   (4.103)

Taking σ = σmax = min(σmax,1 , σmax,2 ) hence ensures the condition of Lemma 10. The problem is that σ now
depends on β and hence on θ̂m ; this means that the prior P = N (0, σ 2 Idim θ ) depends on θ̂m . For this reason, we
have to apply a union bound argument for choosing σ.
Let B̃ be a discrete subset of R+. Hence ∀β̃ ∈ B̃ ∀δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm, ∀θ such that σmax(β) ≥ σmax(β̃),

R(fθ) ≤ R̂m,γ(fθ) + √( (1/(2m − 1))( log(16m/δ) + ‖θ‖2²/σmax²(β̃) ) ).   (4.104)

A union bound gives ∀δ ∈ (0, 1) w.p. ≥ 1 − δ over Sm, ∀θ ∀β̃ ∈ B̃ such that σmax(β) ≥ σmax(β̃),

R(fθ) ≤ R̂m,γ(fθ) + √( (1/(2m − 1))( log(16m/δ) + ‖θ‖2²/σmax²(β̃) + log |B̃| ) ).   (4.105)

We need B̃ to be finite in order to have a finite bound. First note that for β L < γ/(2B) we have ∀x ∈ XB
|fθ (x)| ≤ β L B ≤ γ/2 which implies R̂m,γ (fθ ) = 1. In this case the bound is trivially true.


Second, for β^L > γ√m/(2B) the second term of the final bound (see Theorem 13) is greater than 1, and the
bound again becomes trivially true. Hence it suffices to take any finite B̃ with min(B̃) = βmin = (γ/(2B))^{1/L} and
max(B̃) = βmax = (γ√m/(2B))^{1/L}. Note that for β ∈ [βmin, βmax], σmax = σmax,2; indeed,

σmax,1/σmax,2 = 2eγ−1Bβ^L ≥ e > 1.   (4.106)

Hence σmax (β) ≥ σmax (β̃) is equivalent to β ≤ β̃.


We shall take B̃ such that the following holds:

∀β ∈ [βmin , βmax ] ∃β̃ ∈ B̃ : e−1 β̃ L−1 ≤ β L−1 ≤ eβ̃ L−1 . (4.107)

In this case, obviously, β ≤ β̃ and

σmax²(β̃) = γ² / ( 8e²B²β̃^{2L−2}L²n log(2Ln) ) ≥ γ² / ( 8e⁴B²β^{2L−2}L²n log(2Ln) ).   (4.108)

We shall prove that the following B̃ conforms to condition (4.107):

B̃ = { βmin(1 + 2k/L) }_{k=0}^{K},   K = max{ k : βmin(1 + 2k/L) ≤ βmax }.   (4.109)

Hence 2K = ⌊L(βmax/βmin − 1)⌋ = ⌊L(m^{1/2L} − 1)⌋. This gives:

log |B̃| = log(K + 1) ≤ log( Lm^{1/2L}/2 ) = log(L/2) + (1/2L) log m.   (4.110)

Indeed, for any β ∈ [βmin , βmax ] ∃β̃ ∈ B̃ : |β − β̃| ≤ βmin /L ≤ β̃/L. Hence

eβ̃ L−1 ≥ (β̃ + β̃/L)L−1 ≥ (β̃ + |β − β̃|)L−1 ≥ β L−1 , (4.111)

e−1 β̃ L−1 ≤ (β̃ − β̃/L)L−1 ≤ (β̃ − |β − β̃|)L−1 ≤ β L−1 , (4.112)


which proves condition (4.107).
Let us first write the expression under the (2m − 1)−1 factor:

log(16m/δ) + ‖θ‖2²/σmax²(β̃) + log |B̃| ≤ log(8Lm/δ) + (1/2L) log m + 8γ−2e⁴B²β^{2L}L²n log(2Ln) Σ_{l=1}^{L} ‖Wl‖F²/β².   (4.113)

This gives the final bound:

R(fθ) ≤ R̂m,γ(fθ) + √( (1/(2m − 1))( log(8Lm/δ) + (1/2L) log m + 8e⁴L²n log(2Ln) (B R(θ)/γ)² ) ),   (4.114)

where we have introduced a spectral complexity:

R(θ) = β^L √( Σ_{l=1}^{L} ‖Wl‖F²/β² ) = ( Π_{l=1}^{L} ‖Wl‖2 ) √( Σ_{l=1}^{L} ‖Wl‖F²/‖Wl‖2² ).   (4.115)

Proof of Lemma 10. Let θ and ξ conform Condition 4.83 and let θ′ = θ + ξ. Define:

Aθ = {θ′ : max |fθ′ (x) − fθ (x)| < γ/2}. (4.116)


x∈X

Following Condition 4.83, we get P(Aθ ) ≥ 1/2.

45
Since ξ has density, θ′ has density as well; denote it by q ′ . Define:
1 ′
q̃(θ̃) = q (θ̃)[θ̃ ∈ Aθ ], where Z = P(Aθ ). (4.117)
Z
Note that maxx∈X |fθ̃ (x) − fθ (x)| < γ/2 a.s. wrt θ̃ for θ̃ ∼ q̃(θ̃). Therefore:

R(fθ ) ≤ Rγ/2 (fθ̃ ) and R̂m,γ/2 (fθ̃ ) ≤ R̂m,γ (fθ ) a.s. wrt θ̃. (4.118)

Hence

R(fθ ) ≤ E θ̃ Rγ/2 (fθ̃ ) ≤ (w.p. ≥ 1 − δ over Sm )


s   s  
1 4m 1 4m
≤ E θ̃ R̂m,γ/2 (fθ̃ ) + log + KL(q̃ || p) ≤ R̂m,γ (fθ ) + log + KL(q̃ || p) . (4.119)
2m − 1 δ 2m − 1 δ

The only thing that remains is estimating the KL-term. Define:


1
q̃ c (θ̃) = q ′ (θ̃)[θ̃ ∈
/ Aθ ]. (4.120)
1−Z
We then get:

KL(q ′ || p) = KL(q̃Z + q̃ c (1 − Z) || p) = E θ′ ∼q′ (log(q̃(θ′ )Z + q̃ c (θ′ )(1 − Z)) − log p(θ′ )) =


= E b∼B(Z) E θ′ ∼q′ |b (log(q ′ (θ′ |1)Z + q ′ (θ′ |0)(1 − Z)) − (Z + (1 − Z)) log p(θ′ )) =
= Z(log Z + KL(q ′ |1 || p)) + (1 − Z)(log(1 − Z) + KL(q ′ |0 || p)) =
= ZKL(q̃ || p) + (1 − Z)KL(q̃ c || p) − H(B(Z)). (4.121)

This implies the following:

1
KL(q̃ || p) = (KL(q ′ || p) + H(B(Z)) − (1 − Z)KL(q̃ c || p)) ≤
Z
1
≤ (KL(q ′ || p) + log 2) ≤ 2 (KL(q ′ || p) + log 2) . (4.122)
P (Aθ )
Hence w.p. ≥ 1 − δ over Sm we have:
s   s  
1 4m 1 16m
R(fθ ) ≤ R̂m,γ (fθ ) + log + KL(q̃ || p) ≤ R̂m,γ (fθ ) + log + 2KL(q ′ || p) . (4.123)
2m − 1 δ 2m − 1 δ

Proof of Lemma 11. Recall the forward dynamics:

h2 (x; θ) = W1 x ∈ Rn2 , xl (x; θ) = φ(hl (x; θ)) ∈ Rnl , hl+1 (x; θ) = Wl xl (x; θ) ∈ Rnl+1 ∀l ∈ {2, . . . , L}. (4.124)

Assume that x, θ, and ξ are fixed. Define:

∆l = khl+1 (x; θ + ξ) − hl+1 (x; θ)k2 ∀l ∈ [L]. (4.125)

We are going to prove the following by induction:


l l
 l !
1 Y X kUl′ k2
∆l ≤ 1 + kxk2 kWl′ k2 . (4.126)
L kWl′ k 2
l′ =1 l =1

The induction base is given as:

∆1 = kh2 (x; θ + ξ) − h2 (x; θ)k2 = kU1 xk2 ≤ kU1 k2 kxk2 , (4.127)

46
and we prove the induction step below:

∆l = khl+1 (x; θ + ξ) − hl+1 (x; θ)k2 = k(Wl + Ul )xl (x; θ + ξ) − Wl xl (x; θ)k2 =
= k(Wl + Ul )(xl (x; θ + ξ) − xl (x; θ)) + Ul xl (x; θ)k2 ≤
≤ kWl + Ul k2 kxl (x; θ + ξ) − xl (x; θ)k2 + kUl k2 kxl (x; θ)k2 ≤
≤ (kWl k2 + kUl k2 )khl (x; θ + ξ) − hl (x; θ)k2 + kUl k2 khl (x; θ)k2 ≤
  l−1
1 Y
≤ kWl k2 1 + ∆l−1 + kUl k2 kxk2 kWl′ k2 ≤
L
l′ =1
l
 l ! l−1 l
1 Y X kUl′ k2 kUl k2 Y
≤ 1+ kxk2 kWl′ k2 + kxk2 kWl′ k2 ≤
L kWl′ k2 kWl k2
l′ =1 l′ =1 l′ =1
l
 l ! l
1 Y X kUl′ k2
≤ 1+ kxk2 kWl′ k2 . (4.128)
L kWl′ k2
l =1
′ ′ l =1

A simple estimate then gives the required statement:

kfθ+ξ (x) − fθ (x)k2 = khL+1 (x; θ + ξ) − hL+1 (x; θ)k2 =


L
 L ! L L
! L
1 Y X kUl k2 Y X kUl k2
= ∆L ≤ 1 + kxk2 kWl k2 ≤ eB kWl k2 . (4.129)
L kWl k2 kWl k2
l=1 l=1 l=1 l=1

47
Chapter 5

Neural tangent kernel

5.1 Gradient descent training as a kernel method


Consider a parametric model with scalar output f (x; θ) ∈ R and let θ ∈ Rd . We aim to minimize a loss
E x,y ℓ(y, f (x; θ)) via a gradient descent:

∂ℓ(y, z)
θ̇t = −ηE x,y ∇θ f (x; θt ). (5.1)
∂z z=f (x;θt )

If we define a feature map Φt (x) = ∇θ f (x; θt ), then we can express the model as:

f (x; θ) = f (x; θt ) + ΦTt (x)(θ − θt ) + O(kθ − θt k22 ). (5.2)

It is a locally linear model in the vicinity of θt given a feature map Φt .


We now multiply both parts of the equation (5.1) by ∇Tθ f (x′ ; θt ):

˙ ′ ∂ℓ(y, z)
ft (x ) = −ηE x,y Θ̂t (x′ , x), (5.3)
∂z z=ft (x)

where Θ̂t (x′ , x) = ∇Tθ f (x′ ; θt )∇θ f (x; θt ) and ft (x′ ) = f (x′ ; θt ).
Here Θ̂t is a kernel and Φt (x) = ∇θ f (x; θt ) is a corresponding feature map. We call Θ̂t an empirical tangent
kernel at time-step t. It depends on the initialization, hence it is random. Given a train dataset (~x, ~y ) of size m,
the evolution of the responses on this dataset writes as follows:

η ∂ℓ(~y, ~z)
f˙t (~x) = − Θ̂t (~x, ~x) . (5.4)
m ∂~z ~z=ft (~x)

We see that the gramian of the kernel maps loss gradients wrt model outputs to output increments. Note that while
dynamics (5.1) is complete, (5.3) is not, since Kt cannot be determined solely in terms of ft .
Nevertheless, if we consider linearized dynamics, Kt becomes independent of t and can computed once at the
initialization, thus making dynamics (5.3) complete. Let us define a linearized model:

flin (x; θ) = f (x; θ0 ) + ∇Tθ f (x; θ0 )(θ − θ0 ). (5.5)

This model then evolves similarly to f (eq. (5.3)), but with a kernel fixed at initialization:

˙ ′ ∂ℓ(y, z)
flin,t (x ) = −ηE x,y Θ̂0 (x′ , x). (5.6)
∂z z=flin,t (x)

The gradient descent dynamics becomes:



∂ℓ(y, z)
θ̇t = −ηE x,y ∇θ f (x; θ0 ). (5.7)
∂z z=flin,t (x)

48
5.1.1 Exact solution for a square loss
These equations are analytically solvable if we consider a square loss: ℓ(y, z) = 12 (y − z)2 , see [Lee et al., 2019].
Let (~x, ~y), where ~x = {xi }m y = {yi }m
i=1 and ~ x) = {f (xi )}m
i=1 , is a train set. Let f (~ i=1 ∈ R
m
be a vector of model
m×m
responses on the train data. Finally, let Θ̂t (~x, ~x) ∈ R be a Gramian of the kernel Θ̂t : Θ̂t (~x, ~x)ij = Θ̂t (xi , xj ).
Eq. (5.6) evaluated on train set becomes:
1
f˙lin,t (~x) = η Θ̂0 (~x, ~x)(~y − flin,t (~x)). (5.8)
m
Its solution writes as follows:
flin,t (~x) = ~y + e−ηΘ̂0 (~x,~x)t/m (f0 (~x) − ~y). (5.9)
Given this, the weight dynamics (5.7) becomes:
1
θ̇t = −η ∇θ f (~x; θ0 )e−ηΘ̂0 (~x,~x)t/m (f0 (~x) − ~y ), (5.10)
m
where we have assumed that ∇θ f (~x; θ0 ) ∈ Rd×m . Solving it gives:

θt = θ0 − ∇θ f (~x; θ0 )Θ̂−1 x, ~x)(I − e−ηΘ̂0 (~x,~x)t/m )(f0 (~x) − ~y).


0 (~ (5.11)

Substituting the solution back to (5.5) gives a model prediction on an arbitrary input x:

flin,t (x) = f0 (x) − Θ̂0 (x, ~x)Θ̂−1 x, ~x)(I − e−ηΘ̂0 (~x,~x)t/m )(f0 (~x) − ~y),
0 (~ (5.12)

where we have defined a row-vector Θ̂0 (x, ~x) with components Θ̂0,i (x, ~x) = ∇Tθ f (x; θ0 )∇θ f (xi ; θ0 ).

5.1.2 Convergence to a gaussian process


Consider a network with L hidden layers and no biases:

f (x) = WL φ(WL−1 . . . φ(W0 x)), (5.13)

where Wl ∈ Rnl+1 ×nl and a non-linearity φ is applied element-wise. Note that x ∈ Rn0 ; we denote with k = nL+1
the dimensionality of the output: f : Rn0 → Rk . We shall refer nl as the width of the l-th hidden layer.
Let us assume x is fixed. Define:

h1 = W0 x ∈ Rn1 , xl = φ(hl ) ∈ Rnl , hl+1 = Wl xl ∈ Rnl+1 ∀l ∈ [L]. (5.14)

Hence given x f (x) = hL+1 . Define also:


1
ql = E hTl hl . (5.15)
nl
Let us assume that the weights are initialized with zero-mean gaussians, so that the forward dynamics is
normalized:
σ2
 
Wlij ∼ N 0, w . (5.16)
nl
Obiously, all components of hl are distributed identically. Their means are zeros, let us compute the variances:
2
1 σw σ2 1 σ2
ql+1 = E xTl WlT Wl xl = E xTl xl = w E φ(hl )T φ(hl ) ∀l ∈ [L], E xT W0T W0 x = w kxk22 .
q1 =
nl+1 nl nl n1 n0
(5.17)
We are going to prove by induction that ∀l ∈ [L + 1] ∀i ∈ [nl ] hil converges weakly to N (0, ql ) as n1:l−1 → ∞
sequentially. Since components of W0 are gaussian, hi1 ∼ N (0, q1 ) ∀i ∈ [n0 ] — this gives the induction base. If all
hil converge weakly to N (0, ql ) as n1:l−1 → ∞ sequentially then limn1:l →∞ ql+1 = σw 2
E z∼N (0,ql ) (φ(z))2 . Hence by
i
the virture of CLT, hl+1 converges in distribution to N (0, ql+1 ) as n1:l → ∞ sequentially — this gives the induction
step.

49
Consider two inputs, x1 and x2 , together with their hidden representations h1l and h2l . Let us prove that
∀l ∈ [L + 1] ∀i ∈ [nl ] (h1,i 2,i T
l , hl ) converges in distribution to N (0, Σl ) as n1:l−1 → ∞ sequentially, where the
covariance matrix is defined as follows:
 11
ql12

ql 1
Σl = 12 ; qlab = E ha,T b
l hl , a, b ∈ {1, 2}. (5.18)
ql ql22 nl
We have already derived the dynamics for the diagonal terms in the subsequent limits of infinite width:

aa 2 kxa k22
lim ql+1 = σw E z∼N (0,qlaa ) (φ(z))2 , q1aa = σw
2
, a ∈ {1, 2}. (5.19)
n1:l →∞ n0
Consider the diagonal term:
2
12 1 σw
ql+1 = E h1l ,h2l E Wl φ(h1l )T WlT Wl φ(h2l ) = E 1 2 φ(h1l )T φ(h2l ). (5.20)
nl+1 nl hl ,hl
By induction hypothesis, as n1:l−1 → ∞ we have a weak limit:
 1,i 
hl
→ N (0, Σl ). (5.21)
h2,i
l

Hence
12 2
lim ql+1 = σw E (u1 ,u2 )T ∼N (0,Σl ) φ(u1 )φ(u2 ). (5.22)
n1:l →∞

Note that  1,i  X nl  1,j 


hl+1 ij xl
= Wl . (5.23)
h2,i
l+1 j=1
x2,j
l

Here we have a sum of nl i.i.d. random vectors with zero mean, and the covariance matrix of the sum is Σl+1 .
Hence by the multivariate CLT, (h1,i 2,i T
l+1 , hl+1 ) converges weakly to N (0, Σl+1 ) as n1:l → ∞ sequentially.
Similarly, for any k ≥ 1  1,i   1,j 
hl+1 nl xl
Wlij  . . .  .
X
 ...  = (5.24)
hk,i
l+1
j=1 x k,j
l

Again, these vectors converge to a gaussian by the multivariate CLT. Hence ∀l ∈ [L + 1] hil (·) converges weakly to
a gaussian process as n1:l−1 → ∞ sequentially. Note that a gaussian process is completely defined by its mean and
covariance functions:
ql (x, x) ql (x, x′ )
 

Σl (x, x ) = ∀l ∈ [L + 1]; (5.25)
ql (x′ , x) ql (x′ , x′ )
2
σw
ql+1 (x, x′ ) = σw
2
E (u,v)T ∼N (0,Σl (x,x′ )) φ(u)φ(v) ∀l ∈ [L], q1 (x, x′ ) = xT x′ . (5.26)
n0
Hence the model at initialization converges to a gaussian process with zero mean and covariance ΣL+1 (·, ·). This
GP is referred as NNGP, and qL+1 — as NNGP kernel.

5.1.3 The kernel diverges at initialization


For a fixed x, let us define the following quantity:

∂f i
Bli = ∈ Rnl ∀l ∈ [L + 1]. (5.27)
∂hl
We have then:
ij
Bli = Dl WlT Bl+1
i
∀l ∈ [L], BL+1 = δij , (5.28)
where Dl = diag(φ′ (hl )). This gives:
∇Wl f i (x) = Bl+1
i
xTl ∈ Rnl+1 ×nl . (5.29)

50
Define the scaled covariance for Bl components for two inputs:
2
σw
βlij (x, x′ ) = E Bli,T Bl′,j = E Bl+1
i,T ′,j
Wl Dl Dl′ WlT Bl+1 = i,T
E tr(Dl Dl′ )(Bl+1 ′,j
Bl+1 )=
nl
2 ij
= σw βl+1 (x, x′ )E (u,v)T ∼N (0,Σl (x,x′ )) φ′ (u)φ′ (v) ∀l ∈ [L − 1], (5.30)
2
σw
βLij = E BLi,T BL′,j = E BL+1
i,T ′
WL DL DL ′,j
WLT BL+1 = ′
E tr(DL DL 2
)δij = σw E (u,v)T ∼N (0,ΣL (x,x′ )) φ′ (u)φ′ (v)δij .
nL
(5.31)
Note that βlij = βl δij . Similarly to ql , define the following:

χl (x, x′ ) = σw
2
E (u,v)T ∼N (0,Σl (x,x′ )) φ′ (u)φ′ (v). (5.32)

This allows us to write:


L
Y
βl (x, x′ ) = χl′ (x, x′ ) ∀l ∈ [L]. (5.33)
l′ =l

In the case of non-scalar output (k > 1), the tangent kernel is a k × k matrix with components defined as:

Θ̂ij (x, x′ ) = ∇Tθ f i (x)∇θ f j (x′ ). (5.34)

For the sake of convenience, we introduce layer-wise tangent kernels:

Θ̂ij ′ T i j ′
l (x, x ) = tr(∇Wl f (x)∇Wl f (x )). (5.35)
PL
In this case Θ̂(x, x′ ) = l=0 Θ̂l (x, x′ ).
We denote Bl and hl evaluated at x′ by Bl′ and h′l , respectively. This allows us to write:
    i,T ′,j 
′,j
Θ̂ij
l (x, x′
) = tr φ(h l )B i,T
B
l+1 l+1 φ(h ′ T
l ) = φ(h ′ T
l ) φ(h l ) Bl+1 Bl+1 ∀l ∈ [L]. (5.36)

If we assume that the two scalar products are independent then the expected kernel is a product of expectations:
  i,T ′,j  nl ql+1 (x, x′ )
E Θ̂ij ′ ′ T
l (x, x ) = E φ(hl ) φ(hl ) E Bl+1 Bl+1 = 2
βl+1 (x, x′ )δij . ∀l ∈ [L]. (5.37)
σw

Hence (a) each kernel is diagonal, (b) l-th kernel expectation diverges as nl → ∞ ∀l ∈ [L].

5.1.4 NTK parameterization


It is possible to leverage the kernel divergence by altering the network parameterization:
σw σw
h1 = √ W0 x ∈ Rn1 , xl = φ(hl ) ∈ Rnl , hl+1 = √ Wl xl ∈ Rnl+1 ∀l ∈ [L]. (5.38)
n0 nl

In this case, the weights are standard gaussians:

Wlij ∼ N (0, 1) . (5.39)

∂f i
Bli = ∈ Rnl ∀l ∈ [L + 1]. (5.40)
∂hl
We have then:
σw ij
Bli = √ Dl WlT Bl+1
i
∀l ∈ [L], BL+1 = δij , (5.41)
nl
Both forward and backward dynamics at initialization remains unchanged. What changes are the gradients wrt
weights:
σw i
∇Wl f i (x) = √ Bl+1 xTl . (5.42)
nl

51
This results in a change of the tangent kernel scaling:
2
σw   i,T ′,j 
E Θ̂ij ′
l (x, x ) = E φ(h′l )T φ(hl ) E Bl+1 Bl+1 = ql+1 (x, x′ )βl+1 (x, x′ )δij ∀l ∈ [L]. (5.43)
nl
Now the kernel expectation neither diverges nor vanishes as n → ∞. Since the expectation is finite, the kernel itself
converges to it as n → ∞. Indeed, consider the l-th kernel:
2
σw   i,T ′,j 
Θ̂ij ′
l (x, x ) = φ(h′l )T φ(hl ) Bl+1 Bl+1 . (5.44)
nl
The first multiplier converges to ql+1 (x, x′ ) due to the Law of Large Numbers. Similar holds for the second one: it
converges to βl+1 (x, x′ )δij by the LLN. Together these two give:

plim . . . plim Θ̂ij ′ ij ′ ′ ′


l (x, x ) = E Θ̂l (x, x ) = ql+1 (x, x )βl+1 (x, x )δij ∀l ∈ [L]. (5.45)
nl →∞ n1 →∞

And for the whole kernel, we have:


L+1 L+1 L
!
X X Y
ij ′ ij ′ ′ ′ ′ ′
plim . . . plim Θ̂ (x, x ) = E Θ̂ (x, x ) = ql (x, x )βl (x, x )δij = ql (x, x ) χl′ (x, x ) δij . (5.46)
nL →∞ n1 →∞
l=1 l=1 l′ =l

See [Arora et al., 2019b] for the above expression for the expected kernel, and [Jacot et al., 2018] for the formal
proof of convergence in subsequent limits. See also [Arora et al., 2019b] for a convergence proof in stronger terms.

5.1.5 GD training and posterior inference in gaussian processes


Denote the limit kernel at initialization by Θ0 :
L+1 L
!
X Y
Θ0 (x, x′ ) = ql (x, x′ ) χl′ (x, x′ ) Ik×k . (5.47)
l=1 l′ =l

Unlike the empirical kernel, the limit one is deterministic. Similarly to Section 5.1.1, we assume that ~x is a training
set of size n, and k = 1. Then let Θ0 (~x, ~x) ∈ Rn×n be a Gramian for the limit kernel.
Given (a) the kernel has a deterministic limit, and (b) the model at initialization converges to a limit model,
the model trained to minimize square loss converges to the following limit model at any time t:

lim flin,t (x) = lim f0 (x) − Θ0 (x, ~x)Θ−1 x, ~x)(I − e−ηΘ0 (~x,~x)t/n )(lim f0 (~x) − ~y ).
0 (~ (5.48)

Looking at this expression we notice that since the limit model at initialization is a gaussian process (see Sec-
tion 5.1.2), the limit model is a gaussian process at any time t. Its mean and covariance are given as follows:

µlin,t (x) = Θ0 (x, ~x)Θ−1 x, ~x)(I − e−ηΘ0 (~x,~x)t/n )~y ;


0 (~ (5.49)

qlin,t (x, x′ ) = qL+1 (x, x′ )−


 
− Θ0 (x′ , ~x)Θ−10 (~
x , ~
x )(I − e −ηΘ0 (~
x,~
x)t/n
)qL+1 (~
x , x) + Θ 0 (x, ~
x )Θ −1
0 (~
x , ~
x )(I − e −ηΘ0 (~
x,~
x)t/n
)qL+1 (~
x , x′
) +
+ Θ0 (x, ~x)Θ−1 x, ~x)(I − e−ηΘ0 (~x,~x)t/n )qL+1 (~x, ~x)(I − e−ηΘ0 (~x,~x)t/n )Θ−1
0 (~ x, ~x)Θ0 (~x, x′ ). (5.50)
0 (~

Assume that the limit kernel is bounded away from zero: λmin (Θ0 (~x, ~x)) ≥ λ0 > 0. Given this, the model
converges to the following limit GP as t → ∞:

µlin,∞ (x) = Θ0 (x, ~x)Θ−1


0 (~
x, ~x)~y ; (5.51)

qlin,∞ (x, x′ ) = qL+1 (x, x′ ) + Θ0 (x, ~x)Θ−1 x, ~x)qL+1 (~x, ~x)Θ−1


0 (~ x, ~x)Θ0 (~x, x′ )−
0 (~
− Θ0 (x′ , ~x)Θ−1 x, ~x)qL+1 (~x, x) + Θ0 (x, ~x)Θ−1 x, ~x)qL+1 (~x, x′ ) . (5.52)

0 (~ 0 (~

52
Note that the exact bayesian posterior inference gives a different result:
−1
µlin (x | ~x) = qL+1 (x, ~x)qL+1 (~x, ~x)~y ; (5.53)
−1
qlin (x, x′ | ~x) = qL+1 (x, x′ ) − qL+1 (x, ~x)qL+1 (~x, ~x)qL+1 (~x, x′ ). (5.54)
Nevertheless, if we consider training only the last layer of the network, the tangent kernel becomes:
Θ(x, x′ ) = ΘL (x, x′ ) = qL+1 (x, x′ ). (5.55)
Given this, the two GPs, result of NN training and exact posterior, coincide.
Let us return to the assumption of positive defniteness of the limit kernel. [Du et al., 2019] proved that if no
inputs are parallel, this assumption holds:
Theorem 17. If for any i 6= j xTi xj < kxi k2 kxj k2 then λ0 := λmin (Θ0 (~x, ~x)) > 0.

5.2 Stationarity of the kernel


Assume k = 1; in this case NTK is scalar-valued. For analytic activation function φ we have the following:
∞   k  k
X d Θt (x1 , x2 ) t
E θ (Θt (x1 , x2 ) − Θ0 (x1 , x2 )) = Eθ . (5.56)
dtk
t=0 k!
k=1

Hence if we show that all derivatives of the NTK at t = 0 vanish as n → ∞, this would mean that the NTK does
not evolve with time for large n: Θt (x, x′ ) → Θ0 (x, x′ ) as n → ∞.
Let us consider l2 -loss: ℓ(y, z) = 12 (y − z)2 . Consider the first derivative:

d(∇Tθ ft (x1 )∇θ ft (x2 ))


   
dΘt (x1 , x2 )   
Eθ = E = Eθ θ̇tT ∇θ ∇Tθ ft (x1 )∇θ ft (x2 ) + (x1 ↔ x2 ) =

θ
dt
t=0 dt
t=0 t=0

= E x,y E θ (η(y − f0 (x))∇Tθ f0 (x)∇θ ∇Tθ f0 (x1 )∇θ f0 (x2 ) + (x1 ↔ x2 )). (5.57)
We shall show that it is O(n−1 ), and that it also implies that all higher-order derivatives are O(n−1 ) too.
From now on, we shall consider only initialization: t = 0; for this reason, we shall omit the subscript 0. Following
[Dyer and Gur-Ari, 2020], we start with a definition of a correlation function. Define a rank-k derivative tensor
Tµ1 ...µk as follows:
∂ k f (x)
Tµ1 ...µk (x; f ) = µ1 . (5.58)
∂θ . . . ∂θµk
For k = 0 we define T (x; f ) = f (x). We are now ready to define the correlation function C:
X  
C(x1 , . . . , xm ) = ∆(π)
µ1 ...µkm E θ Tµ1 ...µk1 (x1 )Tµk1 +1 ...µk2 (x2 ) . . . Tµkm−1 +1 ...µkm (xm ) . (5.59)
µ1 ,...,µkm

(π)
Here 0 ≤ k1 ≤ . . . ≤ km , km and m are even, π ∈ Skm is a permutation, and ∆µ1 ...µkm = δµπ(1) µπ(2) . . . δµπ(km −1) µπ(km ) .
For example,
X
E θ (f (x)∇Tθ f (x)∇θ ∇Tθ f (x1 )∇θ f (x2 )) = 2
E θ (f (x)∂µ f (x)∂µ,ν f (x1 )∂ν f (x2 )) =
µ,ν
X
= δµ1 µ2 δµ3 µ4 E θ (f (x)∂µ1 f (x)∂µ22 ,µ3 f (x1 )∂µ4 f (x2 )) = C(x, x, x1 , x2 ) (5.60)
µ1 ,µ2 ,µ3 ,µ4

is a correlation function with m = 4, k1 = 0, k2 = 1, k3 = 3, k4 = 4, and π(j) = j. Moreover, E θ ((f (x) −


y)∇Tθ f (x)∇θ ∇Tθ f (x1 )∇θ f (x2 )) is a correlation function too: consider fy (x) = f (x) − y instead of f (x). Hence the
whole (5.57) is a linear combination of correlation functions.
If two derivative tensors have two indices that are summed over, we shall say that they are contracted. Formally,
we shall say that Tµki−1 +1 ...µki (xi ) is contracted with Tµkj−1 +1 ...µkj (xj ) for 1 ≤ i, j ≤ m, if there exists an even
s ≤ km such that ki−1 < π(s − 1) ≤ ki , while kj−1 < π(s) ≤ kj , or vice versa.
Define the cluster graph GC (V, E) as a non-oriented non-weighted graph with vertices V = {v1 , . . . , vm } and
edges E = {(vi , vj ) | T (xi ) and T (xj ) are contracted in C}. Let ne be the number of even-sized connected compo-
nents of GC (V, E) and no be the number of odd-sized components.

53
Conjecture 1 ([Dyer and Gur-Ari, 2020]). If m is even, C(x1 , . . . , xm ) = On→∞ (nsC ), where sC = ne +no /2−m/2.
If m is odd, C(x1 , . . . , xm ) = 0.
Applying this conjecture to (5.60) gives C(x, x, x1 , x2 ) = O(n−1 ) (ne = 0, n0 = 2, m = 4), hence the whole
eq. (5.57) is O(n−1 ).
Let us show that having the first derivative of the NTK being O(n−1 ) implies all higher-order derivatives to be
O(n−1 ).
Lemma 12 ([Dyer and Gur-Ari, 2020]). Suppose Conjecture 1 holds. Let C(~x) = E θ F (~x; θ) be a correlation
function and suppose C(~x) = O(nsC ) for sC defined in Conjecture 1. Then E θ dk F (~x; θ)/dtk = O(nsC ) ∀k ≥ 1.
Proof. Consider the first derivative:

dF (~x)
Eθ = E θ (θ̇T ∇θ F (~x)) = E x,y E θ (η(y − f (x))∇Tθ f (x)∇θ F (~x)) =
dt
= ηE x,y E θ (y∇Tθ f (x)∇θ F (~x)) − ηE x,y E θ (f (x)∇Tθ f (x)∇θ F (~x)). (5.61)

This is a sum of linear combination of correlation functions. By Conjecture 1, the first sum evaluates to zero, while
the second one has m′ = m + 2, n′e even clusters, and n′o odd clusters. If ∇θ f (x) is contracted with an even cluster
of C, we have n′e = ne − 1, n′o = no + 2. In contrast, if ∇θ f (x) is contracted with an odd cluster of C, we have
n′e = ne + 1, n′o = no .
In the first case we have s′C = n′e + n′o /2 − m′ /2 = sC − 1, while for the second s′C = sC . In any case, the result
is a linear combination of correlation functions with s′C ≤ sC for each.

5.2.1 Finite width corrections for the NTK


Let us define O1,t (x) = ft (x) and for s ≥ 2

Os,t (x1 , . . . , xs ) = ∇Tθ Os−1,t (x1 , . . . , xs−1 )∇θ ft (xs ). (5.62)

In this case O2,t (x1 , x2 ) is the empirical kernel Θ̂t (x1 , x2 ). Note that Os,t evolves as follows:

Ȯs,t (x1 , . . . , xs ) = ηE x,y (y − ft (x))∇Tθ ft (x)∇θ Os,t (x1 , . . . , xs ) = ηE x,y (y − ft (x))Os+1,t (x1 , . . . , xs , x). (5.63)

Since Os has s derivative tensors and a single cluster, by the virtue of Conjecture 1, E θ Os,0 = O(n1−s/2 )
for even s and E θ Os,0 = 0 for odd s. At the same time, E θ Ȯs,0 = O(n1−(s+2)/2 ) = O(n−s/2 ) for even s and
E θ Ȯs,0 = O(n1−(s+1)/2 ) = O(n1/2−s/2 ) for odd s.
As for the second moments, we have E θ (Os,0 )2 = O(n2−s ) for even s and E θ (Os,0 )2 = O(n1−s ) for odd s.
Similarly, we have E θ (Ȯs,0 )2 = O(n2/2−(2s+2)/2 ) = O(n−s ) for even s and E θ (Ȯs,0 )2 = O(n2−(2s+2)/2 ) = O(n1−s )
for odd s.
The asymptotics for the first two moments implies the asymptotic for a random variable itself:
( (
O(n1−s/2 ) for even s; O(n−s/2 ) for even s;
Os,0 (x1:s ) = 1/2−s/2
Ȯ s,0 (x1:s ) = 1/2−s/2
(5.64)
O(n ) for odd s; O(n ) for odd s.

Lemma 12 gives ∀k ≥ 1: (
dk Os,t O(n−s/2 )

for even s;
k
(x1:s ) = (5.65)
dt t=0 O(n1/2−s/2 ) for odd s.
Then given an analytic activation function, we have ∀t ≥ 0:

(
dk Os,t k O(n−s/2 )

X t for even s;
Ȯs,t (x1:s ) = k
(x1:s ) = 1/2−s/2
(5.66)
dt t=0 k! O(n ) for odd s.
k=1

This allows us to write a finite system of ODE for the model evolution up to O(n−1 ) terms:

f˙t (x1 ) = ηE x,y (y − ft (x))Θt (x1 , x), f0 (x1 ) = f (x1 ; θ), θ ∼ N (0, I), (5.67)

54
Θ̇t (x1 , x2 ) = ηE x,y (y − ft (x))O3,t (x1 , x2 , x), Θ0 (x1 , x2 ) = ∇Tθ f0 (x1 )∇θ f0 (x2 ), (5.68)
Ȯ3,t (x1 , x2 , x3 ) = ηE x,y (y − ft (x))O4,t (x1 , x2 , x3 , x), O3,0 (x1 , x2 , x3 ) = ∇Tθ Θ0 (x1 , x2 )∇θ f0 (x3 ), (5.69)
Ȯ4,t (x1 , x2 , x3 , x4 ) = O(n−2 ), O4,0 (x1 , x2 , x3 , x4 ) = ∇Tθ O3,0 (x1 , x2 , x3 )∇θ f0 (x4 ). (5.70)
−1
Let us expand all the quantities wrt n :
(0) (1)
Os,t (x1:s ) = Os,t (x1:s ) + n−1 Os,t (x1:s ) + O(n−2 ), (5.71)
(k)
where Os,t (x1:s ) = Θn→∞ (1). Then the system above transforms into the following:

(0) (0) (0)


f˙t (x1 ) = ηE x,y (y − ft (x))Θt (x1 , x), lim f (x1 ; θ0 ), (5.72)
n→∞

(1) (0) (1) (1) (0)


f˙t (x1 ) = ηE x,y ((y − ft (x))Θt (x1 , x) − ft (x)Θt (x1 , x)), (5.73)
(0) (0) (0)
Θt (x1 , x2 ) = ∇Tθ f0 (x1 )∇θ f0 (x2 ), (5.74)
(1) (0) (1)
Θ̇t (x1 , x2 ) = ηE x,y (y − ft (x))O3,t (x1 , x2 , x), (5.75)
(1) (0) (1)
Ȯ3,t (x1 , x2 , x3 ) = ηE x,y (y − ft (x))O4,t (x1 , x2 , x3 , x), (5.76)
(1) (0) (0)
O4,t (x1 , x2 , x3 , x4 ) = ∇Tθ O3,0 (x1 , x2 , x3 )∇θ f0 (x4 ), (5.77)
where we have ignored the initial conditions for the time being. Integrating this system is straightforward:
(0)
(0) (0)
ft (~x) = ~y + e−ηΘ0 (~
x,~
x)t/n
(f0 (~x) − ~y), (5.78)

where ~x is a train dataset of size n. For the sake of brevity, let us introduce the following definition:
(0)
(0) (0)
∆ft (x) = e−ηΘ0 (x,~
x)t/n
(f0 (~x) − ~y). (5.79)

This gives: Z t
(1) (1) (1) (0)
O3,t (x1 , x2 , x3 ) = O3,0 (x1 , x2 , x3 ) − ηE x′ ,y′ O4,0 (x1 , x2 , x3 , x′ )∆ft′ (x′ ) dt′ . (5.80)
0

Z t
(1) (1) (1) (0)
Θt (x1 , x2 ) = Θ0 (x1 , x2 ) − ηE x′ ,y′ O3,0 (x1 , x2 , x)∆ft′ (x′ ) dt′ +
0
Z tZ t′′
(1) (0) (0)
+ η 2 E x′′ ,y′′ E x′ ,y′ O4,0 (x1 , x2 , x′′ , x′ )∆ft′ (x′ )∆ft′′ (x′′ ) dt′ dt′′ . (5.81)
0 0
 
(1) (0) (1) (1) (0)
f˙t (x1 ) = −ηE x,y ∆ft (x)Θt (x1 , x) + ft (x)Θt (x1 , x) , (5.82)

f˙(t) = Af (t) + g(t), f (0) = f0 ; (5.83)


At At At At −At
f (t) = C(t)e ; Ċ(t)e + C(t)Ae = C(t)Ae + g(t); Ċ(t) = g(t)e . (5.84)

(0)
(1)
ft (~x1 ) = e−ηΘ0 (~
x1 ,~
x)t/n
Ct (~x); (5.85)
(0)
(1) (0)
Ċt (~x) = −ηE x′ ,y′ e Θt (~x1 , x′ )∆ft (x′ );
ηΘ0 (~x,~
x1 )t/n
(5.86)
Z t
(0)
(1) ′ (1) (0)
Ct (~x) = f0 (~x) − ηE x′ ,y′ eηΘ0 (~x,~x1 )t /n Θt (~x1 , x′ )∆ft′ (x′ ) dt′ ; (5.87)
0

(0) (0)
Z t (0)
(1) x)t/n (1) x3 )t′ /n (1) (0)
ft (~x1 ) = e−ηΘ0 (~
x1 ,~
f0 (~x) − ηe−ηΘ0 (~
x1 ,~
x2 )t/n
E x′ ,y′ eηΘ0 (~
x2 ,~
Θt (~x3 , x′ )∆ft′ (x′ ) dt′ . (5.88)
0

55
5.2.2 Proof of Conjecture 1 for linear nets
Shallow nets.
We first consider shallow linear nets:
1
f (x) = √ aT W x. (5.89)
n
We shall use the following theorem:
Theorem 18 ([Isserlis, 1918]). Let z = (z1 , . . . , zl ) be a centered multivariate Gaussian variable. Then, for any
positive k, for any ordered set of indices i1:2k ,
1 X X Y
E z (zi1 · · · zi2k ) = E (ziπ(1) ziπ(2) ) · · · E (ziπ(m−1) ziπ(m) ) = E (zij1 zij2 ), (5.90)
2k k! 2 {j ,j }∈p
π∈S2k p∈P2k 1 2

2
where P2k is a set of all unordered pairings p of a 2k-element set, i.e.
[
2
P2k = {{π(1), π(2)}, . . . , {π(2k − 1), π(2k)}}. (5.91)
π∈S2k

At the same time,


E z (zi1 · · · zi2k−1 ) = 0. (5.92)
For example,

E z (z1 z2 z3 z4 ) = E z (z1 z2 )E z (z3 z4 ) + E z (z1 z3 )E z (z2 z4 ) + E z (z1 z4 )E z (z2 z3 ). (5.93)

Consider a correlation function without derivatives:

C(x1 , . . . , xm ) = E θ (f (x1 ) . . . f (xm )) = n−m/2 E θ (ai1 W i1 x1 . . . aim W im xm ) =


= n−m/2 E θ (ai1 · · · aim )E θ (W i1 x1 · · · W im xm ) =
  
ijw ijw
X Y X Y
= n−m/2 [m = 2k]  δija ija   δ 1 2 xTj1w xj2w  . (5.94)
1 2
2 {j a ,j a }∈p
pa ∈P2k 2 {j w ,j w }∈p
pw ∈P2k
1 2 a 1 2 w

For even m, we shall associate a graph γ with each pair (pa , pw ). Such a graph has m vertices (v1 , . . . , vm ). For any
{j1a , j2a } ∈ pa there is an edge (vj1a , vj2a ) marked a, and an edge (vj1w , vj2w ) marked W for any {j1w , j2w } ∈ pw . Hence
each vertex has a unique a-neighbor and a unique W -neighbor; these two can be the same vertex. Hence γ is a
union of cycles. We call γ a Feynman diagram of C.
Denote by Γ(C) a set of Feynman diagrams of C, and by lγ a number of cycles in the diagram γ. It is easy to
notice that each cycle contributes a factor of n when one takes a sum over i1 , . . . , im . Hence we have:
 
X Y
C(x1 , . . . , xm ) = n−m/2 [m = 2k] nlγ(pa ,pw ) xTj1w xj2w  = [m = 2k]On→∞ (nmaxγ∈Γ(C) lγ −m/2 ).
pa ,pw {j1w ,j2w }∈pw
(5.95)
Consider now a correlation function with derivatives. Assume there is an edge (vi , vj ) in GC ; hence corresponding
derivative tensors are contracted in C. In this case, we should consider only those Feynman diagrams γ that have
an edge (vi , vj ), either of a or of w type. Denoting a set of such diagrams as Γ(C), we get the same bound as before:

C(x1 , . . . , xm ) = [m = 2k]On→∞ (nmaxγ∈Γ(C) lγ −m/2 ). (5.96)

In order to illustrate this principle, let us consider the case m = 4. For simplicity, assume also all inputs to be
equal: x1 = . . . = x4 = x. If there are no derivatives, we have:

E θ ((f (x))4 ) = n−2 (δi1 i2 δi3 i4 + δi1 i3 δi2 i4 + δi1 i4 δi2 i3 )(δ i1 i2 δ i3 i4 + δ i1 i3 δ i2 i4 + δ i1 i4 δ i2 i3 )(xT x)2 = (3 + 6n−1 )(xT x)2 .
(5.97)

56
In this case there are three Feynman diagrams with two cycles each, and six diagrams with a single cycle. Let us
introduce a couple of contracted derivative tensors:

E θ ((f (x))2 ∇Tθ f (x)∇θ f (x)) = n−2 E θ (ai1 W i1 xai2 W i2 x(δi3 k W i3 xδ kl δi4 l W i4 x + ai3 δ i3 k δkl ai4 δ i4 l xT x)) =
= n−2 E θ (ai1 W i1 xai2 W i2 x(δi3 i4 W i3 xW i4 x + ai3 δ i3 i4 ai4 xT x)) =
= n−2 (δi1 i2 δi3 i4 )(δ i1 i2 δ i3 i4 + δ i1 i3 δ i2 i4 + δ i1 i4 δ i2 i3 )(xT x)2 + n−2 (δi1 i2 δi3 i4 + δi1 i3 δi2 i4 + δi1 i4 δi2 i3 )(δ i1 i2 δ i3 i4 )(xT x)2 =
= (2 + 4n−1 )(xT x)2 . (5.98)

Here we have only those Feynman diagrams that have an edge (v3 , v4 ). There are two such diagrams with two
cycles each, and four with a single cycle.
Note that if there is an edge in a cluster graph GC , there is also an edge, a or w type, in each γ from Γ(C).
Note also that each cycle in γ consists of even number of edges. Hence each cycle consists of even clusters and an
even number of odd clusters. Hence there could be at most ne + no /2 cycles in γ, which proves Conjecture 1 for
shallow linear nets:

C(x1 , . . . , xm ) = [m = 2k]On→∞ (nmaxγ∈Γ(C) lγ −m/2 ) = [m = 2k]On→∞ (nne +no /2−m/2 ). (5.99)

Deep nets.
In the case of a network with L hidden layers, there are L + 1 edges of types W0 , . . . WL adjacent to each node.
Feynman diagrams are still well-defined, however, it is not obvious how to define the number of loops in this case.
The correct way to do it is to count the loops in a corresponding double-line diagram. Given a Feynman diagram
γ, define the double-line diagram DL(γ) as follows:
(1) (L)
• Each vertex vi of γ maps to L vertices vi , . . . , vi in DL(γ).
(1) (1)
• An edge (vi , vj ) of type W0 maps to an edge (vi , vj ).
(L) (L)
• An edge (vi , vj ) of type WL maps to an edge (vi , vj ).
(l) (l) (l+1) (l+1)
• ∀l ∈ [L − 1] an edge (vi , vj ) of type Wl maps to a pair of edges: (vi , vj ) and (vi , vj ).
We see that each of the Lm vertices of a double-line diagram has degree 2; hence the number of loops is well-defined.
For L = 1, a double-line diagram recovers the corresponding Feynman diagram without edge types. For any L, we
have the following:
C(x1 , . . . , xm ) = [m = 2k]On→∞ (nmaxγ∈Γ(C) lγ −Lm/2 ), (5.100)
where now lγ is a number of loops in DL(γ).
In order to get intuition about this result, let us consider a network with two hidden layers. For the sake of
simplicity, assume x is a scalar:
1
f (x) = aT W vx. (5.101)
n
E θ (f (x1 )f (x2 )) = n−2 E θ (ai1 W i1 j1 vj1 x1 ai2 W i2 j2 vj2 x2 ) = n−2 δi1 i2 δ i1 i2 δ j1 j2 δj1 j2 x1 x2 = x1 x2 . (5.102)
Here both a and v result in a single Kronecker delta, hence they correspond to a single edge in a double-line diagram.
At the same time, W results in a product of two deltas, in its turn resulting in a pair of edges in the diagram.
Similar to the case of L = 1, contracted derivative tensors force the existence of corresponding edges in the
Feynman diagram. Given a Feynman diagram γ, define sγ = lγ − Lm/2, or, in other words, a number of loops in
DL(γ) minus a number of vertices in DL(γ) halved. Let cγ be a number of connected components of γ. We shall
prove that
m
sγ ≤ cγ − . (5.103)
2
Note that eq. (5.103) holds for L = 1 since all connected components of γ are loops in this case. Let us express
γ as a union of its connected components γ ′ ; given this, sγ = γ ′ sγ ′ . We are going to show that sγ ′ ≤ 1 − m′ /2,
P
where m′ is a number of vertices in the component γ ′ . The latter will imply sγ ≤ cγ − m/2.

57
Let v, e, and f be a number of vertices, a number of edges, and a number faces of γ ′ . We already know that
v = m′ , e = (L + 1)m′ /2, and f = lγ ′ . Hence sγ ′ = lγ ′ − Lm′ /2 = f − Lv/2. On the other hand, the Euler
characteristic of γ ′ is χ = v − e + f = sγ ′ + m′ (1 + L/2) − (L + 1)m′ /2 = sγ ′ + m′ /2. Since γ ′ is a triangulation of
a planar surface with at least one boundary, χ ≤ 1. Hence sγ ′ ≤ 1 − m′ /2, which was required.
Consequently, we may rewrite (5.100) as follows:

C(x1 , . . . , xm ) = [m = 2k]On→∞ (nmaxγ∈Γ(C) cγ −m/2 ). (5.104)

It is now easy to conclude that cγ ≤ ne + no /2. Indeed, each connected component of γ consists of connected
components of the cluster graph GC . Hence cγ ≤ ne + no . Moreover, each connected component of γ consists of
even number of vertices, hence it can contain only even number of odd connected components of GC . This gives
cγ ≤ ne + no /2, which is required.

5.3 GD convergence via kernel stability


Recall the model prediction dynamics (eq. (5.3)):

˙ ′ ∂ℓ(y, z)
ft (x ) = −ηE x,y Θ̂t (x′ , x). (5.105)
∂z z=ft (x)

On the train dataset (~x, ~y) we have the following:



η ∂ℓ(~y, ~z)
f˙t (~x) = − Θ̂t (~x, ~x) . (5.106)
m ∂~z ~z=ft (~x)

For the special case of square loss:


η
f˙t (~x) = Θ̂t (~x, ~x)(~y − ft (~x)). (5.107)
m
Let us consider the evolution of a loss:
 
d 1 η η
k~y − ft (~x)k2 = − (~y − ft (~x))T Θ̂t (~x, ~x)(~y − ft (~x)) ≤ − λmin (Θ̂t (~x, ~x))k~y − ft (~x)k22
2
(5.108)
dt 2 m m

Consider λmin ≥ 0 such that ∀t ≥ 0 λmin (Θ̂t (~x, ~x)) ≥ λmin . This allows us to solve the differential inequality:

k~y − ft (~x)k22 ≤ e−2ηλmin t/m k~y − f0 (~x)k22 . (5.109)

Hence having λmin > 0 ensures that the gradient descent converges to a zero-loss solution. There is a theorem
that guarantees that the least eigenvalue of the kernel stays separated away from zero for wide-enough NTK-
parameterized two-layered networks with ReLU activation:
Theorem 19 ([Du et al., 2019]). Consider the following model:
n
1 X
f (x; a1:n , w1:n ) = √ ai [wiT x]+ . (5.110)
n i=1

Assume we aim to minimize the square loss on the dataset (~x, ~y ) of size m via a gradient descent on the input
weights:
m
1 X
ẇi (t) = √ (yk − f (xk ; a1:n , w1:n (t)))ai [wiT (t)xk > 0]xk , wi (0) ∼ N (0, In0 ), ai ∼ U ({−1, 1}) ∀i ∈ [n].
n
k=1
(5.111)
Assume also that kxk k2 ≤ 1 and |yk | < 1 ∀k ∈ [m]. Let H ∞ be an expected gramian of the NTK at initialization
and let λ0 be its least eigenvalue:

Hkl = E w∼N (0,In0 ) [wT xk > 0][wT xl > 0]xTk xl , λ0 = λmin (H ∞ ). (5.112)

58
Then ∀δ ∈ (0, 1) taking
2 5 3 2 m6 m2
    6 
8 2m m
n > 2 max 4 , 2 log =Ω (5.113)
π λ0 δ 3 λ0 δ λ40 δ 3
guarantees that w.p. ≥ 1 − δ over initialization we have an exponential convergence to a zero-loss solution:
k~y − ft (~x)k22 ≤ e−λ0 t k~y − f0 (~x)k22 . (5.114)

Proof. From what was shown above, it suffices to show that λmin (H(t)) ≥ λ0 /2 with given probability for n
sufficiently large, where H(t) is a gram matrix of the NTK at time t:
n
1X T
Hkl (t) = [w (t)xk > 0][wiT (t)xl > 0]xTk xl . (5.115)
n i=1 i

We shall first show that H(0) ≥ 3λ0 /4:


Lemma 13. ∀δ ∈ (0, 1) taking n ≥ 128m2 λ−2 0 log(m/δ) guarantees that w.p. ≥ 1 − δ over initialization we have
kH(0) − H ∞ k2 ≤ λ0 /4 and λmin (H(0)) ≥ 3λ0 /4.
Next, we shall show that the initial Gram matrix H(0) is stable with respect to initial weights w1:n (0):
Lemma 14. ∀δ √ ∈ (0, 1) w.p. ≥ 1 − δ over initialization for any set of weights w1:n that satisfy ∀i ∈ [n] kwi (0) −
wi k2 ≤ R(δ) := ( 2π/16)δλ0 m−2 , the corresponding Gram matrix H satisfies kH − H(0)k < λ0 /4 and λmin (H) >
λ0 /2.
After that, we shall show that lower bounded eigenvalues of the Gram matrix gives exponential convergence on
the train set. Moreover, weights stay close to initialization, as the following lemma states:
2 −λ0 t
Lemma 15. Suppose for s ∈ [0, t] λmin (H(s)) p ≥ λ0 /2. Then we have k~y − ft (~x)k2 ≤ e k~y − f0 (~x)k22 and for

any i ∈ [n] kwi (t) − wi (0)k2 ≤ R := (2/λ0 ) (m/n)k~y − f0 (~x)k2 .
Finally, we shall show that when R′ < R(δ), the conditions of Lemma 14 and of Lemma 15 hold ∀t ≥ 0
simultaneously:
Lemma 16. Let δ ∈ (0, 1/3). If R′ < R(δ), then w.p. ≥ 1 − 3δ over initialization ∀t ≥ 0 λmin (H(t)) ≥ λ0 /2 and
∀i ∈ [n] kwi (t) − wi (0)k2 ≤ R′ and k~y − ft (~x)k22 ≤ e−λ0 t k~y − f0 (~x)k22 .
Hence for δ ∈ (0, 1), R′ < R(δ/3) suffices for the theorem to hold:
√ √
2 mk~y − f0 (~x)k2 2πδλ0
√ = R′ < R(δ/3) = , (5.116)
λ0 n 48m2
which is equivalent to:
29 32 m5 k~y − f0 (~x)k22
n> . (5.117)
πλ40 δ 2
We further bound:
E k~y − f0 (~x)k22 = E k~yk22 − 2~y T E f0 (~x) + E kf0 (~x)k22 ≤ 2m. (5.118)
Hence by Markov’s inequality, w.p. ≥ 1 − δ
E k~y − f0 (~x)k22 2m
k~y − f0 (~x)k22 ≤ ≤ . (5.119)
δ δ
By a union bound, in order to have the desired properties w.p. ≥ 1 − 2δ, we need:
210 32 m6
n> . (5.120)
πλ40 δ 3
If we want the things hold w.p. ≥ 1 − δ, noting Lemma 13, we finally need the following:
m6 m2
  
2m
n > max (213 32 /π) 4 3 , 28 2 log . (5.121)
λ0 δ λ0 δ

59
Let us prove the lemmas.
Proof of Lemma 13. Since all Hkl (0) are independent random variables, we can apply Hoeffding’s inequality for
each of them independently:
2

P(|Hkl (0) − Hkl | ≥ ǫ) ≤ 2e−nǫ /2 . (5.122)
−nǫ2 /2
p
For a given δ, take ǫ such that δ = 2e . This gives ǫ = −2 log(δ/2)/n, or,
p
∞ 2 log(1/δ)
|Hkl (0) − Hkl |≤ √ w.p. ≥ 1 − δ over initialization. (5.123)
n
Applying a union bound gives:
p r
∞ 2 log(m2 /δ) 8 log(m/δ)
|Hkl (0) − Hkl | ≤ √ ≤ ∀k, l ∈ [m] w.p. ≥ 1 − δ over initialization. (5.124)
n n
Hence m
X 8m2 log(m/δ)
kH(0) − H ∞ k22 ≤ kH(0) − H ∞ k2F ≤ ∞ 2
|Hkl (0) − Hkl | ≤ . (5.125)
n
k,l=1

In order to get kH(0) − H ∞ k2 ≤ λ0 /4, we need to solve:


r
8m2 log(m/δ) λ0
≤ . (5.126)
n 4
This gives:
128m2 log(m/δ)
n≥ . (5.127)
λ20
This gives kH(0) − H ∞ k2 ≤ λ0 /4, which implies:

λmin (H(0)) = λmin (H ∞ + (H(0) − H ∞ )) ≥ λmin (H ∞ ) − λmax (H(0) − H ∞ ) ≥ λ0 − λ0 /4 = 3λ0 /4. (5.128)

Proof of Lemma 14. We define the event in the space of wi (0) realizations:

Aki = {∃w : kw − wi (0)k2 ≤ R, [wT xk ≥ 0] 6= [wiT (0)xk ≥ 0]}. (5.129)


′ ′
When Aki holds, we can always take w = wki , where wki is defined as follows:
(
′ wi (0) − Rxk , if wiT (0)xk ≥ 0
wki = (5.130)
wi (0) + Rxk , if wiT (0)xk < 0.

Hence Aki is equivalent to the following:


′,T
A′ki = {[wki xk ≥ 0] 6= [wiT (0)xk ≥ 0]}. (5.131)

This event holds iff |wiT (0)xk | < R. Since wi (0) ∼ N (0, I), we have
2R
P(Aki ) = P(A′ki ) = Pz∼N (0,1) {|z| < R} ≤ √ . (5.132)

We can bound the entry-wise deviation of H ′ from the H(0) matrix:
!
n
′ 1 T X  T T ′,T ′,T

E |Hkl (0) − Hkl | =E x xl [wi (0)xk > 0][wi (0)xl > 0] − [wki xk > 0][wli xl > 0] ≤

n k i=1
n
1X 4R
≤ E [A′ki ∪ A′li ] ≤ √ . (5.133)
n i=1 2π

60
Pm ′

Hence E k,l=1 |Hkl (0) − Hkl | ≤ 4m2 R/ 2π. Hence by Markov’s inequality,
m
X
′ 4m2 R
|Hkl (0) − Hkl |≤ √ w.p. ≥ 1 − δ over initialization. (5.134)
k,l=1
2πδ
Pm
Since kH(0) − H ′ k2 ≤ kH(0) − H ′ kF ≤ k,l=1 |Hkl (0) − Hkl′
|, the same probabilistic bound holds for kH(0) − H ′ k2 .
Note that ∀k ∈ [m] ∀i ∈ [n] for any w ∈ R such that kw − wi (0)k2 ≤ R, [wT xk ≥ 0] 6= [wiT (0)xk ≥ 0] implies
n0
′,T
[wki xk ≥ 0] 6= [wiT (0)xk ≥ 0]. Hence ∀k, l ∈ [m] for any set of weights w1:n such that ∀i ∈ [n] kwi − wi (0)k2 ≤ R,

|Hkl (0) − Hkl | ≤ |Hkl (0) − Hkl |. This means that w.p. ≥ 1 − δ over initialization, for any set of weights w1:n such
that ∀i ∈ [n] kwi − wi (0)k2 ≤ R,
4m2 R
kH(0) − Hk2 ≤ kH(0) − H ′ k2 ≤ √ . (5.135)
2πδ
In order to get the required bound, it suffices to solve the equation:

4m2 R λ0 2πδλ0
√ = , which gives R = . (5.136)
2πδ 4 16m2

The bound on the minimal eigenvalue is then straightforward:

λmin (H) = λmin (H(0) + (H − H(0))) ≥ λmin (H(0)) − λmax (H − H(0)) ≥ 3λ0 /4 − λ0 /4 = λ0 /2. (5.137)

Proof of Lemma 15. For s ∈ [0, t] we have:

dk~y − fs (~x)k22
= −2(~y − fs (~x))T H(s)(~y − fs (~x)) ≤ −λ0 k~y − fs (~x)k22 , (5.138)
ds
which implies:
d(log(k~y − fs (~x)k22 ))
≤ −λ0 . (5.139)
ds
Hence
log(k~y − fs (~x)k22 ) ≤ log(k~y − f0 (~x)k22 ) − λ0 s, (5.140)
or, equivalently,
k~y − fs (~x)k22 ≤ e−λ0 s k~y − f0 (~x)k22 , (5.141)
which holds, for instance, for s = t. In order to bound weight deviation, we first bound the gradient norm:

m
dwi (s) 1 X
T

= √

(y k − f s (xk ))a i [wi (s)xk > 0]xk ≤

ds n
2
k=1

2
m r r
1 X m m −λ0 s/2
≤√ |yk − fs (xk )| ≤ k~y − fs (~x)k2 ≤ e k~y − f0 (~x)k2 . (5.142)
n n n
k=1

This gives ∀i ∈ [n]:


Z t Z t
dwi (s) dwi (s)
kwi (t) − wi (0)k2 = ds ≤

ds ds ≤


0 ds 2 0
√ 2 √
2 m   2 m
≤ √ 1 − e−λ0 t/2 k~y − f0 (~x)k2 ≤ √ k~y − f0 (~x)k2 . (5.143)
λ0 n λ0 n

61
Proof of Lemma 16. Proof by contradiction. Take δ ∈ (0, 1/3) and suppose that R′ < R(δ), however, w.p. > 3δ
over initialization ∃t∗ > 0 : either λmin (H(t∗ )) < λ0 /2, or ∃i ∈ [n] kwi (t∗ ) − wi (0)k2 > R′ , or k~y − ft∗ (~x)k2 >
exp(−λ0 t∗ )k~y − f0 (~x)k2 . If either of the last two holds, then by Lemma 15, ∃s ∈ [0, t∗ ] λmin (H(s)) < λ0 /2. If the
former holds, we can take s = t∗ . Hence by virtue of Lemma 14, for this particular s w.p. > 2δ over initialization
∃i ∈ [n] kwi (s) − wi (0)k2 > R(δ). Define:
 
t0 = inf t ≥ 0 : max kwi (t) − wi (0)k2 > R(δ) . (5.144)
i∈[n]

Note that w.p. > 2δ over initialization t0 ≤ s ≤ t∗ < ∞. Since wi (·) is a continuous map, w.p. > 2δ over
initialization maxi∈[n] kwi (t0 ) − wi (0)k2 = R(δ). Hence by Lemma 14, w.p. > δ over initialization ∀t ∈ [0, t0 ]
λmin (H(t)) ≥ λ0 /2. Hence by Lemma 15, ∀i ∈ [n] kwi (t0 ) − wi (0)k2 ≤ R′ . Hence w.p. > δ over initialization we
have a contradiction with maxi∈[n] kwi (t0 ) − wi (0)k2 = R(δ) and R′ < R(δ).

5.3.1 Component-wise convergence guarantees and kernel alignment


Denote ~u(t) = ft (~x). We have the following dynamics for quadratic loss:
n
d~u(t) 1 X
= H(t)(~y − ~u(t)), uk (0) = √ ai [wiT (0)xk ]+ ∀k ∈ [m], (5.145)
dt n i=1

where
n
1X T
Hkl (t) = [w (t)xk ≥ 0][wiT (t)xl ≥ 0]xTk xl . (5.146)
n i=1 i
Additionaly, following [Arora et al., 2019a], consider the limiting linearized dynamics:
d~u′ (t)
= H ∞ (~y − ~u′ (t)), u′k (0) = uk (0) ∀k ∈ [m], (5.147)
dt
where

Hkl = E Hkl (0) = E w∼N (0,I) [wT xk ≥ 0][wT xl ≥ 0]xTk xl . (5.148)
Solving the above gives: ∞
~u′ (t) = ~y + e−H
(~u(0) − ~y ) t
(5.149)
Pm
∞ T ∞ m
Consider an eigenvalue-eigenvector decomposition for H : H = k=1 λk ~vk ~vk , where {~vk }k=1 forms an orthonor-
mal basis in Rm and λ1 ≥ . . . ≥ λm ≥ 0. Note that exp(−H ∞ t) then has the same set of eigenvectors, and each
eigenvector ~vk corresponds to an eigenvalue exp(−λk t). Then the above solution is rewritten as:
m
X
~u′ (t) − ~
y=− e−λk t (~vkT (~y − ~u(0)))~vk , (5.150)
k=1

which implies
m
X
k~u′ (t) − ~y k22 = e−2λk t (~vkT (~y − ~u(0)))2 . (5.151)
k=1

We see that components ~vkT (~y − ~u(0)) that correspond to large λk decay faster. Hence convergence is fast if ∀k ∈ [m]
large ~vkT (~y − ~u(0)) implies large λk . In this case, we shall say that the initial kernel aligns well with the dataset.
It turns out, that realistic datasets align well with NTKs of realistic nets, however, datasets with random labels
do not. This observation substitutes a plausible explanation for a phenomenon noted in [Zhang et al., 2016]: large
networks learn corrupted datasets much slower than clean ones.
The above speculation is valid for the limiting linearized dynamics ~u′ (t). It turns out that given n large enough,
the true dynamics ~u(t) stays close to its limiting linearized version:
Theorem 20 ([Arora et al., 2019a]). Suppose λ0 = λmin (H ∞ ) > 0. Take ǫ > 0 and δ ∈ (0, 1). Then there exists a
constant Cn > 0 such that for
m7
n ≥ Cn 4 4 2 , (5.152)
λ0 δ ǫ
w.p. ≥ 1 − δ over initialization, ∀t ≥ 0 k~u(t) − ~u′ (t)k2 ≤ ǫ.

62
Proof. We start with stating a reformulation of Lemma 16:
6
Lemma 17. Let δ ∈ (0, 1). There exists Cn′ > 0 such that for n ≥ Cn′ λm
4 δ 3 , w.p. ≥ 1 − δ over initialization, ∀t ≥ 0
0

4 mk~y − ~
u (0)k 2
kwi (t) − wi (0)k2 ≤ R′ := √ ∀i ∈ [n]. (5.153)
n

We proceed with an analogue of Lemma 14:


Lemma 18. Let δ ∈ (0, 1). There exist CH , CZ > 0 such that w.p. ≥ 1 − δ over initialization, ∀t ≥ 0
s
m3 m2
kH(t) − H(0)kF ≤ CH 1/2 , kZ(t) − Z(0)k F ≤ C Z . (5.154)
n λ0 δ 3/2 n1/2 λ0 δ 3/2

The last lemma we need is an analogue of Lemma 13:



Lemma 19. Let δ ∈ (0, 1). There exist CH > 0 such that w.p. ≥ 1 − δ over initialization,
m m
kH(0) − H ∞ kF ≤ CH′
log . (5.155)
n1/2 δ

Let us elaborate the dynamics over:


d~u(t) ~
= H(t)(~y − ~u(t)) = H ∞ (~y − ~u(t)) + (H(t) − H ∞ )(~y − ~u(t)) = H ∞ (~y − ~u(t)) + ζ(t). (5.156)
dt

~
~u(t) = e−H t C(t). (5.157)
d~u(t) ~ ~

~ + e−H ∞ t dC(t) = H ∞ (~y − ~u(t)) + ζ(t)
= −H ∞ e−H t C(t) ~ − H ∞ ~y + e−H ∞ t dC(t) − ζ(t).
~ (5.158)
dt dt dt
~
dC(t) ∞
~
= eH t (H ∞ ~y + ζ(t)). (5.159)
dt
Z t
∞ ∞
~
C(t) = ~u(0) + (eH t − I)~y + ~ ) dτ.
eH τ ζ(τ (5.160)
0
Z t

~u(t) = ~y + e−H t (~u(0) − ~y) +

~ ) dτ.
eH (τ −t) ζ(τ (5.161)
0
Z t Z t
H ∞ (τ −t) ~ H ∞ (τ −t) ~
k~u(t) − ~u′ (t)k2 =

e ζ(τ ) dτ ≤ e ζ(τ ) dτ ≤

0

2 0 2
Z t Z t
~ )k2 −H ∞ τ ~
≤ max kζ(τ e dτ ≤ max kζ(τ )k2 e−λ0 τ dτ ≤
τ ∈[0,t] 0 2 τ ∈[0,t] 0

~ )k2 1 1 − e−λ0 t ≤ 1 max kζ(τ ~ )k2 . (5.162)



≤ max kζ(τ
τ ∈[0,t] λ0 λ0 τ ∈[0,t]

~ )k2 = k(H(τ ) − H ∞ )(~y − ~u(τ ))k2 ≤ (kH(τ ) − H(0)k2 + kH(0) − H ∞ k2 ) k~y − ~u(τ )k2 ≤
kζ(τ
≤ (kH(τ ) − H(0)kF + kH(0) − H ∞ kF ) k~y − ~u(0)k2 . (5.163)
p
Due to Lemma 19 and Lemma 18, and since k~y − ~u(0)k2 ≤ 2m/δ w.p. ≥ 1 − δ over initialization, we have

m3 m  m  r 2m √ m7/2 √ ′ m3/2 m
~
kζ(τ )k2 ≤ CH 1/2 + C ′
log = 2CH + 2C log (5.164)
H H
n λ0 δ 3/2 n1/2 δ δ n1/2 λ0 δ 2 n1/2 δ 1/2 δ
w.p. ≥ 1 − 3δ over initialization. Given ǫ > 0, we need
m7
n ≥ Cn (5.165)
λ40 δ 4 ǫ2
for some Cn > 0 in order to ensure k~u(t) − ~u′ (t)k2 ≤ ǫ w.p. ≥ 1 − δ over initialization.

63
Bibliography

[Arora et al., 2019a] Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019a). Fine-grained analysis of optimization
and generalization for overparameterized two-layer neural networks. In International Conference on Machine
Learning, pages 322–332.
[Arora et al., 2019b] Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019b). On exact
computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages
8141–8150.
[Bartlett et al., 2017] Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin
bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
[Bartlett et al., 2019] Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight vc-dimension
and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63–1.
[Donsker and Varadhan, 1985] Donsker, M. and Varadhan, S. (1985). Large deviations for stationary gaussian
processes. Communications in Mathematical Physics, 97(1-2):187–210.
[Draxler et al., 2018] Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. (2018). Essentially no barriers
in neural network energy landscape. In International Conference on Machine Learning, pages 1309–1318.
[Du et al., 2019] Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2019). Gradient descent provably optimizes over-
parameterized neural networks. In International Conference on Learning Representations.
[Dudley, 1967] Dudley, R. M. (1967). The sizes of compact subsets of hilbert space and continuity of gaussian
processes. Journal of Functional Analysis, 1(3):290–330.
[Dyer and Gur-Ari, 2020] Dyer, E. and Gur-Ari, G. (2020). Asymptotics of wide networks from feynman diagrams.
In International Conference on Learning Representations.
[Dziugaite and Roy, 2017] Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization
bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint
arXiv:1703.11008.
[Garipov et al., 2018] Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss
surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems,
pages 8789–8798.
[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feed-
forward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pages 249–256.
[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification. In Proceedings of the IEEE international conference on computer
vision, pages 1026–1034.
[Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of
the American Statistical Association, 58(301):13–30.

64
[Huang and Yau, 2019] Huang, J. and Yau, H.-T. (2019). Dynamics of deep neural networks and neural tangent
hierarchy. arXiv preprint arXiv:1909.08156.
[Isserlis, 1918] Isserlis, L. (1918). On a formula for the product-moment coefficient of any order of a normal frequency
distribution in any number of variables. Biometrika, 12(1/2):134–139.
[Jacot et al., 2018] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and gener-
alization in neural networks. In Advances in neural information processing systems, pages 8571–8580.
[Jin et al., 2017] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. (2017). How to escape saddle
points efficiently. In International Conference on Machine Learning, pages 1724–1732.
[Kawaguchi, 2016] Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in neural infor-
mation processing systems, pages 586–594.
[Laurent and Brecht, 2018] Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All local
minima are global. In International conference on machine learning, pages 2902–2907. PMLR.
[Lee et al., 2019] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J.
(2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural
information processing systems, pages 8572–8583.
[Lee et al., 2016] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges
to minimizers. In Conference on learning theory, pages 1246–1257.
[Lu and Kawaguchi, 2017] Lu, H. and Kawaguchi, K. (2017). Depth creates no bad local minima. arXiv preprint
arXiv:1702.08580.
[Marchenko and Pastur, 1967] Marchenko, V. A. and Pastur, L. A. (1967). Распределение собственных значений
в некоторых ансамблях случайных матриц. Математический сборник, 72(4):507–536.
[McAllester, 1999a] McAllester, D. A. (1999a). Pac-bayesian model averaging. In Proceedings of the twelfth annual
conference on Computational learning theory, pages 164–170.
[McAllester, 1999b] McAllester, D. A. (1999b). Some pac-bayesian theorems. Machine Learning, 37(3):355–363.
[McDiarmid, 1989] McDiarmid, C. (1989). On the method of bounded differences. Surveys in combinatorics,
141(1):148–188.
[Nagarajan and Kolter, 2019] Nagarajan, V. and Kolter, J. Z. (2019). Uniform convergence may be unable to explain
generalization in deep learning. In Advances in Neural Information Processing Systems, pages 11615–11626.
[Neyshabur et al., 2018] Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-bayesian approach to
spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representa-
tions.
[Neyshabur et al., 2015] Neyshabur, B., Tomioka, R., and Srebro, N. (2015). In search of the real inductive bias:
On the role of implicit regularization in deep learning. In ICLR (Workshop).
[Nguyen, 2019] Nguyen, Q. (2019). On connected sublevel sets in deep learning. In International Conference on
Machine Learning, pages 4790–4799.
[Nguyen and Hein, 2017] Nguyen, Q. and Hein, M. (2017). The loss surface of deep and wide neural networks. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2603–2612.
[Panageas and Piliouras, 2017] Panageas, I. and Piliouras, G. (2017). Gradient descent only converges to minimizers:
Non-isolated critical points and invariant regions. In 8th Innovations in Theoretical Computer Science Conference
(ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[Pennington et al., 2017] Pennington, J., Schoenholz, S., and Ganguli, S. (2017). Resurrecting the sigmoid in deep
learning through dynamical isometry: theory and practice. In Advances in neural information processing systems,
pages 4785–4795.

65
[Poole et al., 2016] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential
expressivity in deep neural networks through transient chaos. In Advances in neural information processing
systems, pages 3360–3368.
[Sauer, 1972] Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A,
13(1):145–147.
[Saxe et al., 2013] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics
of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
[Schoenholz et al., 2016] Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2016). Deep information
propagation. arXiv preprint arXiv:1611.01232.
[Tao, 2012] Tao, T. (2012). Topics in random matrix theory, volume 132. American Mathematical Soc.
[Tropp, 2011] Tropp, J. A. (2011). User-friendly tail bounds for sums of random matrices. Foundations of Compu-
tational Mathematics, 12(4):389–434.
[Vapnik and Chervonenkis, 1971] Vapnik, V. N. and Chervonenkis, A. Y. (1971). О равномерной сходимости
частот появления событий к их вероятностям. Теория вероятностей и ее применения, 16(2):264–279.
[Voiculescu, 1987] Voiculescu, D. (1987). Multiplication of certain non-commuting random variables. Journal of
Operator Theory, pages 223–235.
[Yu and Chen, 1995] Yu, X.-H. and Chen, G.-A. (1995). On the local minima free condition of backpropagation
learning. IEEE Transactions on Neural Networks, 6(5):1300–1303.
[Zhang et al., 2016] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep
learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
[Zhou et al., 2019] Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2019). Non-vacuous gener-
alization bounds at the imagenet scale: a PAC-bayesian compression approach. In International Conference on
Learning Representations.

66

You might also like