OptimML
Abstract
This document presents first order optimization methods and their applications to machine learning.
This is not a course on machine learning (in particular it does not cover modeling and statistical considerations); it is focused on the use and analysis of cheap methods that can scale to large datasets and models with many parameters. These methods are variations around the notion of “gradient descent”,
so that the computation of gradients plays a major role. This course covers basic theoretical properties
of optimization problems (in particular convex analysis and first order differential calculus), the gradient
descent method, the stochastic gradient method, automatic differentiation, shallow and deep networks.
Contents
1 Motivation in Machine Learning
1.1 Unconstrained optimization
1.2 Regression
1.3 Classification

5 Convergence Analysis
5.1 Quadratic Case
5.2 General Case
5.3 Acceleration

7 Regularization
7.1 Penalized Least Squares
7.2 Ridge Regression
7.3 Lasso
7.4 Iterative Soft Thresholding

8 Stochastic Optimization
8.1 Minimizing Sums and Expectation
8.2 Batch Gradient Descent (BGD)
8.3 Stochastic Gradient Descent (SGD)
8.4 Stochastic Gradient Descent with Averaging (SGA)
8.5 Stochastic Averaged Gradient Descent (SAG)

9 Automatic Differentiation
9.1 Finite Differences and Symbolic Calculus
9.2 Computational Graphs
9.3 Forward Mode of Automatic Differentiation
9.4 Reverse Mode of Automatic Differentiation
9.5 Feed-forward Compositions
9.6 Feed-forward Architecture
9.7 Recurrent Architectures
and try to devise “cheap” algorithms with a low computational cost per iteration to approximate a minimizer when it exists. The algorithms considered are first order, i.e. they make use of gradient information. In the following, we denote
argmin_x f(x) def= { x ∈ R^p ; f(x) = inf f },
to indicate the set of points (it is not necessarily a singleton since the minimizer might be non-unique) that achieve the minimum of the function f. One might have argmin f = ∅ (this situation is discussed below), but in case a minimizer exists, we denote the optimization problem as
min_{x ∈ R^p} f(x).   (1)
Figure 1: Left: linear regression, middle: linear classifier, right: loss function for classification.
In a typical learning scenario, f(x) is the empirical risk for regression or classification, and p is the number of parameters. For instance, in the simplest case of linear models, we are given data (a_i, y_i)_{i=1}^n where a_i ∈ R^p are the features. In the following, we denote by A ∈ R^{n×p} the matrix whose rows are the a_i.
1.2 Regression
For regression, y_i ∈ R, in which case
f(x) = (1/2) Σ_{i=1}^n (y_i − ⟨x, a_i⟩)² = (1/2)||Ax − y||²,   (3)
is the least squares quadratic risk function (see Fig. 1). Here ⟨u, v⟩ = Σ_{i=1}^p u_i v_i is the canonical inner product in R^p and ||·||² = ⟨·, ·⟩.
1.3 Classification
For classification, y_i ∈ {−1, 1}, in which case
f(x) = Σ_{i=1}^n ℓ(−y_i⟨x, a_i⟩) = L(−diag(y)Ax)   (4)
where ℓ is a smooth approximation of the 0-1 loss 1_{R^+}. For instance ℓ(u) = log(1 + exp(u)), and diag(y) ∈ R^{n×n} is the diagonal matrix with y_i along the diagonal (see Fig. 1, right). Here the separable loss function L : R^n → R is, for z ∈ R^n, L(z) = Σ_i ℓ(z_i).
Figure 3: Coercivity condition for least squares.
In order to show existence of a minimizer, and that the set of minimizers is bounded (otherwise one can have problems with optimization algorithms that could escape to infinity), one needs to show that one can replace the whole space R^p by a compact subset Ω ⊂ R^p (i.e. Ω is bounded and closed) and that f is continuous on Ω (one can replace this by a weaker condition, that f is lower semi-continuous, but we ignore this here). A way to show that one can consider only a bounded set is to show that f(x) → +∞ when ||x|| → +∞. Such a function is called coercive. In this case, one can choose any x_0 ∈ R^p and consider its associated sub-level set
Ω = {x ∈ R^p ; f(x) ⩽ f(x_0)}
which is bounded because of coercivity, and closed because f is continuous. One can actually show that for convex functions, having a bounded set of minimizers is equivalent to the function being coercive (this is not the case for non-convex functions, for instance f(x) = min(1, x²) has a single minimum but is not coercive).
Example 1 (Least squares). For instance, for the quadratic loss function f(x) = (1/2)||Ax − y||², coercivity holds if and only if ker(A) = {0} (this corresponds to the overdetermined setting). Indeed, if ker(A) ≠ {0} and x⋆ is a solution, then x⋆ + u is also a solution for any u ∈ ker(A), so that the set of minimizers is unbounded. On the contrary, if ker(A) = {0}, we will show later that the minimizer is unique, see Fig. 3. If ℓ is strictly convex, the same conclusion holds in the case of classification.
2.2 Convexity
Convex functions define the main class of functions which are somehow “simple” to optimize, in the sense that all minimizers are global minimizers, and that there are often efficient methods to find these minimizers (at least for smooth convex functions). A convex function is such that for any pair of points (x, y) ∈ (R^p)²,
∀ t ∈ [0, 1], f((1 − t)x + ty) ⩽ (1 − t)f(x) + tf(y)   (5)
which means that the function is below its secants (and actually also above its tangents when these are well defined), see Fig. 4. If x⋆ is a local minimizer of a convex f, then x⋆ is a global minimizer, i.e. x⋆ ∈ argmin f.
Convex functions are very convenient because they are stable under many transformations. In particular, if f, g are convex and a, b are positive, af + bg is convex (the set of convex functions is itself an infinite dimensional convex cone!) and so is max(f, g). If g : R^q → R is convex and B ∈ R^{q×p}, b ∈ R^q, then f(x) = g(Bx + b) is convex. This shows immediately that the square loss appearing in (3) is convex, since ||·||²/2 is convex (as a sum of squares). Also, similarly, if ℓ and hence L is convex, then the classification loss function (4) is itself convex.
Strict convexity. When f is convex, one can strengthen the condition (5) and impose that the inequality is strict for t ∈ ]0, 1[ (see Fig. 4, right), i.e.
∀ t ∈ ]0, 1[, f((1 − t)x + ty) < (1 − t)f(x) + tf(y).   (6)
In this case, if a minimizer x⋆ exists, then it is unique. Indeed, if x⋆_1 ≠ x⋆_2 were two different minimizers, one would have by strict convexity f((x⋆_1 + x⋆_2)/2) < (f(x⋆_1) + f(x⋆_2))/2 = f(x⋆_1), which is impossible.
Figure 4: Convex vs. non-convex functions ; Strictly convex vs. non strictly convex functions.
Figure 5: Comparison of convex functions f : Rp → R (for p = 1) and convex sets C ⊂ Rp (for p = 2).
Example 2 (Least squares). For the quadratic loss function f(x) = (1/2)||Ax − y||², strict convexity is equivalent to ker(A) = {0}. Indeed, we see later that its second derivative is ∂²f(x) = A⊤A and that strict convexity is implied by the eigenvalues of A⊤A being strictly positive. Since the eigenvalues of A⊤A are non-negative, this is equivalent to ker(A⊤A) = {0} (no vanishing eigenvalue), and A⊤Az = 0 implies ⟨A⊤Az, z⟩ = ||Az||² = 0, i.e. z ∈ ker(A).
the gradient vector, so that ∇f : R^p → R^p is a vector field. Here the partial derivatives (when they exist) are defined as
∂f(x)/∂x_k def= lim_{η→0} (f(x + ηδ_k) − f(x))/η
where δ_k = (0, . . . , 0, 1, 0, . . . , 0)⊤ ∈ R^p is the k-th canonical basis vector.
Beware that ∇f(x) can exist without f being differentiable. Differentiability of f at x reads
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + R(ε).   (7)
Here R(ε) = o(||ε||) denotes a quantity which decays faster than ε toward 0, i.e. R(ε)/||ε|| → 0 as ε → 0. Existence of partial derivatives corresponds to f being differentiable along the axes, while differentiability should hold for any converging sequence ε → 0 (i.e. not along a fixed direction). A counter-example in 2-D is f(x) = 2x_1x_2(x_1 + x_2)/(x_1² + x_2²) with f(0) = 0, which is affine (in fact linear) along each radial line, but with a different slope along each line.
Also, ∇f(x) is the only vector such that the relation (7) holds. This means that a possible strategy to both prove that f is differentiable and to obtain a formula for ∇f(x) is to show a relation of the form
f(x + ε) = f(x) + ⟨g, ε⟩ + o(||ε||),
in which case one necessarily has ∇f(x) = g.
Figure 6: Function with local maxima/minima (left), saddle point (middle) and global minimum (right).
f (x⋆ ) ⩽ f (x⋆ + εu) = f (x⋆ ) + ε⟨∇f (x⋆ ), u⟩ + o(ε) =⇒ ⟨∇f (x⋆ ), u⟩ ⩾ o(1) =⇒ ⟨∇f (x⋆ ), u⟩ ⩾ 0.
So applying this for u and −u in the previous equation shows that ⟨∇f (x⋆ ), u⟩ = 0 for all u, and hence
∇f (x⋆ ) = 0.
Note that the converse is not true in general, since one might have ∇f(x) = 0 but x is not a local minimum. For instance x = 0 for f(x) = −x² (here x is a maximizer) or f(x) = x³ (here x is neither a maximizer nor a minimizer, it is a saddle point), see Fig. 6. Note however that in practice, if ∇f(x⋆) = 0 but x⋆ is not a local minimum, then x⋆ tends to be an unstable equilibrium. Thus most often a gradient-based algorithm will converge to points with ∇f(x⋆) = 0 that are local minimizers. The following proposition shows that a much stronger result holds if f is convex.
Proposition 3. If f is convex and x⋆ a local minimum, then x⋆ is also a global minimum. If f is differen-
tiable and convex,
x⋆ ∈ argmin f (x) ⇐⇒ ∇f (x⋆ ) = 0.
x
Proof. For any x, there exists 0 < t < 1 small enough such that tx + (1 − t)x⋆ is close enough to x⋆, and so, since x⋆ is a local minimizer, f(x⋆) ⩽ f(tx + (1 − t)x⋆) ⩽ tf(x) + (1 − t)f(x⋆), which gives f(x⋆) ⩽ f(x).
Thus in this case, optimizing a function is the same as solving the equation ∇f(x) = 0 (actually p equations in p unknowns). In most cases it is impossible to solve this equation, but it often provides interesting information about solutions x⋆.
Here, we have used the fact that ||Aε||² = o(||ε||) and used the transpose matrix A⊤. This matrix is obtained by exchanging the rows and the columns, i.e. (A⊤)_{j,i} = A_{i,j} for i = 1, . . . , n and j = 1, . . . , p, but the way it should be remembered and used is that it obeys the following swapping rule for the inner product,
⟨Au, v⟩_{R^n} = ⟨u, A⊤v⟩_{R^p}.
Computing gradients of functions involving linear operators will necessarily require such a transposition step.
This computation shows that
∇f (x) = A⊤ (Ax − y). (8)
This implies that the solutions x⋆ minimizing f(x) satisfy the linear system (A⊤A)x⋆ = A⊤y. If A⊤A ∈ R^{p×p} is invertible, then f has a single minimizer, namely
x⋆ = (A⊤A)^{−1}A⊤y.   (9)
This shows that in this case, x⋆ depends linearly on the data y, and the corresponding linear operator (A⊤A)^{−1}A⊤ is often called the Moore-Penrose pseudo-inverse of A (which is not invertible in general, since typically p ≠ n). The condition that A⊤A is invertible is equivalent to ker(A) = {0}, since ker(A⊤A) = ker(A).
In particular, if n < p (under-determined regime: there are too many parameters or too few data) this can never hold. If n ⩾ p and the features a_i are “random” then ker(A) = {0} with probability one. In this overdetermined situation n ⩾ p, ker(A) ≠ {0} only holds if the features {a_i}_{i=1}^n span a linear space Im(A⊤) of dimension strictly smaller than the ambient dimension p.
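As an illustration (not part of the original text), here is a minimal NumPy sketch evaluating the least squares risk (3), its gradient (8), and the minimizer (9) when ker(A) = {0}; the synthetic data and the finite-difference check are only sanity tests.

```python
import numpy as np

def risk(A, y, x):
    # f(x) = 1/2 ||Ax - y||^2, see (3)
    r = A @ x - y
    return 0.5 * r @ r

def grad(A, y, x):
    # gradient (8): A^T (Ax - y)
    return A.T @ (A @ x - y)

rng = np.random.default_rng(0)
n, p = 50, 5                      # overdetermined: n >= p, so ker(A) = {0} almost surely
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x = rng.standard_normal(p)

# Minimizer (9): x* = (A^T A)^{-1} A^T y
x_star = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(grad(A, y, x_star), 0, atol=1e-8)

# Finite-difference sanity check of the gradient at a random point
eps, k = 1e-6, 2
delta = np.zeros(p); delta[k] = eps
fd = (risk(A, y, x + delta) - risk(A, y, x)) / eps
print(fd, grad(A, y, x)[k])       # the two numbers should nearly agree
```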
In particular, C_{k,k}/n is the variance along the axis k. More generally, for any unit vector u ∈ R^p, ⟨Cu, u⟩/n ⩾ 0 is the variance along the axis u.
For instance, in dimension p = 2,
C/n = (1/n) ( Σ_{i=1}^n a_{i,1}²   Σ_{i=1}^n a_{i,1}a_{i,2} ; Σ_{i=1}^n a_{i,1}a_{i,2}   Σ_{i=1}^n a_{i,2}² ).
Here (U⊤x)_k = ⟨x, u_k⟩ is the coordinate k of x in the basis U. Since ⟨Cx, x⟩ = ||Ax||², this shows that all the eigenvalues λ_k of C are non-negative, λ_k ⩾ 0.
Figure 7: Left: point clouds (ai )i with associated PCA directions, right: quadratic part of f (x).
If one assumes that the eigenvalues are ordered λ1 ⩾ λ2 ⩾ . . . ⩾ λp , then projecting the points ai on the
first m eigenvectors can be shown to be in some sense the best linear dimensionality reduction possible (see
next paragraph), and it is called Principal Component Analysis (PCA). It is useful to perform compression
or dimensionality reduction, but in practice, it is mostly used for data visualization in 2-D (m = 2) and 3-D
(m = 3).
The matrix C/n encodes the covariance, so one can approximate the point cloud by an ellipsoid whose main axes are the (u_k)_k and whose width along each axis is ∝ √λ_k (the standard deviations). If the data are approximately drawn from a Gaussian distribution, whose density is proportional to exp(−(1/2)⟨C^{−1}a, a⟩), then the fit is good. This should be contrasted with the shape of the quadratic part (1/2)⟨Cx, x⟩ of f(x), since the ellipsoid {x ; (1/n)⟨Cx, x⟩ ⩽ 1} has the same main axes, but the widths are the inverses 1/√λ_k. Figure 7 shows this in dimension p = 2.
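The following sketch (added for illustration; the correlated point cloud and the number m of kept components are arbitrary choices) computes the PCA directions as eigenvectors of the empirical covariance C/n and projects the points on the first m of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 200, 4, 2
A = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated point cloud, rows a_i

A = A - A.mean(axis=0)            # center the cloud (PCA is about the covariance)
C = A.T @ A                       # C = A^T A, so C/n is the empirical covariance
lam, U = np.linalg.eigh(C / n)    # eigenvalues in increasing order, columns of U orthonormal
order = np.argsort(lam)[::-1]     # reorder so that lambda_1 >= lambda_2 >= ...
lam, U = lam[order], U[:, order]

proj = A @ U[:, :m]               # coordinates of the points on the first m principal axes
print("standard deviations along the axes:", np.sqrt(lam))
print("projected cloud shape:", proj.shape)
```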
3.5 Classification
We can do a similar computation for the gradient of the classification loss (4). Assuming that L is differentiable, and using the Taylor expansion (7) at the point −diag(y)Ax, one has
f(x + ε) = L(−diag(y)Ax − diag(y)Aε) = f(x) + ⟨∇L(−diag(y)Ax), −diag(y)Aε⟩ + o(||ε||) = f(x) + ⟨ε, −A⊤ diag(y)∇L(−diag(y)Ax)⟩ + o(||ε||),
where we have used the fact that (AB)⊤ = B⊤A⊤ and that diag(y)⊤ = diag(y). This shows that
∇f(x) = −A⊤ diag(y) ∇L(−diag(y)Ax).
Since L(z) = Σ_i ℓ(z_i), one has ∇L(z) = (ℓ′(z_i))_{i=1}^n. For instance, for the logistic classification method, ℓ(u) = log(1 + exp(u)) so that ℓ′(u) = e^u/(1 + e^u) ∈ [0, 1] (which can be interpreted as the probability of predicting +1).
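A possible NumPy translation of this formula (illustrative only; the data are synthetic) evaluates the logistic classification loss (4) and its gradient ∇f(x) = −A⊤ diag(y)∇L(−diag(y)Ax):

```python
import numpy as np

def loss(A, y, x):
    # f(x) = sum_i log(1 + exp(-y_i <x, a_i>)), see (4) with l(u) = log(1 + e^u)
    u = -y * (A @ x)
    return np.sum(np.logaddexp(0.0, u))     # log(1 + e^u), numerically stable

def grad(A, y, x):
    # grad f(x) = -A^T diag(y) (l'(u_i))_i with l'(u) = e^u / (1 + e^u)
    u = -y * (A @ x)
    s = 1.0 / (1.0 + np.exp(-u))             # l'(u_i), interpreted as P(predict +1)
    return -A.T @ (y * s)

rng = np.random.default_rng(2)
n, p = 100, 3
A = rng.standard_normal((n, p))
y = np.sign(A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n))
x = np.zeros(p)
print(loss(A, y, x), grad(A, y, x))
```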
More generally, for f(x) = g(Bx) with B a linear map, one has
f(x + ε) = g(Bx + Bε) = g(Bx) + ⟨∇g(Bx), Bε⟩ + o(||Bε||) = f(x) + ⟨ε, B⊤∇g(Bx)⟩ + o(||ε||),
so that ∇f(x) = B⊤∇g(Bx), i.e. ∇(g ∘ B) = B⊤ ∘ (∇g) ∘ B,
where “◦” denotes the composition of functions.
To generalize this to compositions of possibly non-linear functions, one needs to use the notion of differential. For a function F : R^p → R^q, its differential at x is a linear operator ∂F(x) : R^p → R^q, i.e. it can be represented as a matrix (still denoted ∂F(x)) ∂F(x) ∈ R^{q×p}. The entries of this matrix are the partial derivatives: denoting F(x) = (F_1(x), . . . , F_q(x)),
[∂F(x)]_{i,j} = ∂F_i(x)/∂x_j,
and the differential is characterized by the expansion
F(x + ε) = F(x) + [∂F(x)](ε) + o(||ε||),   (12)
where [∂F(x)](ε) is the matrix-vector multiplication. As for the definition of the gradient, this matrix is the only one that satisfies this expansion, so it can be used as a way to compute this differential in practice.
For the special case q = 1, i.e. if f : R^p → R, then the differential ∂f(x) ∈ R^{1×p} and the gradient ∇f(x) ∈ R^{p×1} are linked by equating the Taylor expansions (12) and (7), which gives ∂f(x) = (∇f(x))⊤. Differentials compose according to the chain rule: for G : R^q → R^r and H : R^p → R^q,
∂(G ∘ H)(x) = [∂G(H(x))] × [∂H(x)],
where “×” is the matrix product. For instance, if H : R^p → R^q and G = g : R^q → R, then f = g ∘ H : R^p → R and one can compute its gradient as follows:
∇f(x) = (∂f(x))⊤ = ([∂g(H(x))] × [∂H(x)])⊤ = [∂H(x)]⊤ × [∂g(H(x))]⊤ = [∂H(x)]⊤ × ∇g(H(x)).
f (x − τ ∇f (x)) = f (x) − τ ⟨∇f (x), ∇f (x)⟩ + o(τ ) = f (x) − τ ||∇f (x)||2 + o(τ ).
So there are two possibilities: either ∇f(x) = 0, in which case we are already at a minimum (possibly a local minimizer if the function is non-convex), or if τ is chosen small enough,
f(x − τ∇f(x)) < f(x),
which means that moving from x to x − τ∇f(x) has improved the objective function.
Figure 8: Left: First order Taylor expansion in 1-D and 2-D. Right: orthogonality of gradient and level sets
and schematic of the proof.
Remark 2 (Orthogonality to level sets). The level sets of f are the sets of points sharing the same value of f, i.e. for any s ∈ R,
L_s def= {x ; f(x) = s}.
At some x ∈ R^p, denoting s = f(x), then x ∈ L_s (x belongs to its level set). The gradient vector ∇f(x) is orthogonal to the level set (as shown on Fig. 8, right), and points toward level sets of higher value (which is consistent with the previous computation showing that it is a valid ascent direction). Indeed, let us consider around x, inside L_s, a smooth curve of the form t ∈ R ↦ c(t) where c(0) = x. Then the function h(t) def= f(c(t)) is constant, h(t) = s, since c(t) belongs to the level set. So h′(t) = 0. But at the same time, we can compute its derivative at t = 0 as follows:
h(t) = f(c(0) + tc′(0) + o(t)) = h(0) + t⟨c′(0), ∇f(c(0))⟩ + o(t),
i.e. h′(0) = ⟨c′(0), ∇f(x)⟩ = 0, so that ∇f(x) is orthogonal to the tangent c′(0) of the curve c, which lies in the tangent plane of L_s (as shown on Fig. 8, right). Since the curve c is arbitrary, the whole tangent plane is thus orthogonal to ∇f(x).
Remark 3 (Local optimal descent direction). One can prove something even stronger: among all possible directions u with ||u|| = r, the direction −r ∇f(x)/||∇f(x)|| becomes the optimal one as r → 0 (so for very small steps this is locally the best choice). More precisely,
(1/r) argmin_{||u||=r} f(x + u) → −∇f(x)/||∇f(x)||  as r → 0.
Indeed, introducing a Lagrange multiplier λ ∈ R for this constrained optimization problem, one obtains that the optimal u satisfies ∇f(x + u) = λu and ||u|| = r. Thus u/r = ±∇f(x + u)/||∇f(x + u)||, and assuming that ∇f is continuous, when ||u|| = r → 0, this converges to u/||u|| = ±∇f(x)/||∇f(x)||. The sign ± should be +1 to obtain a maximizer and −1 for the minimizer.
The gradient descent algorithm then iterates
x_{k+1} def= x_k − τ_k ∇f(x_k),   (13)
where τ_k > 0 is the step size (also called learning rate). For a small enough τ_k, the previous discussion shows that the function f is decaying through the iterations. So intuitively, to ensure convergence, τ_k should be chosen small enough, but not too small so that the algorithm is as fast as possible. In general, one uses a fixed step size τ_k = τ, or tries to adapt τ_k at each iteration (see Fig. 9).
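As a minimal illustration (the quadratic test function and the value of τ are arbitrary choices), the iteration (13) with a fixed step size can be written as:

```python
import numpy as np

def gradient_descent(grad_f, x0, tau, n_iter):
    # Fixed step-size gradient descent x_{k+1} = x_k - tau * grad f(x_k), see (13)
    x = x0.copy()
    for _ in range(n_iter):
        x = x - tau * grad_f(x)
    return x

# Test on f(x) = 1/2 <Cx, x> - <b, x>, whose unique minimizer is C^{-1} b
C = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: C @ x - b
L = np.linalg.eigvalsh(C).max()            # largest eigenvalue of C
x = gradient_descent(grad_f, np.zeros(2), tau=1.0 / L, n_iter=200)
print(x, np.linalg.solve(C, b))            # the two vectors should be close
```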
Figure 9: Influence of τ on the gradient descent (left) and optimal step size choice (right).
Remark 4 (Greedy choice). Although this is in general too costly to perform exactly, one can use a “greedy” choice, where the step size is optimal at each iteration, i.e.
τ_k def= argmin_τ h(τ)  where  h(τ) def= f(x_k − τ∇f(x_k)).
Here h(τ) is a function of a single variable. One can compute the derivative of h as
h(τ + δ) = f(x_k − τ∇f(x_k) − δ∇f(x_k)) = f(x_k − τ∇f(x_k)) − δ⟨∇f(x_k − τ∇f(x_k)), ∇f(x_k)⟩ + o(δ).
One notes that at τ = τ_k, ∇f(x_k − τ∇f(x_k)) = ∇f(x_{k+1}) by definition of x_{k+1} in (13). Such an optimal τ = τ_k is thus characterized by
h′(τ_k) = −⟨∇f(x_k), ∇f(x_{k+1})⟩ = 0.
This means that for this greedy algorithm, two successive descent directions ∇f(x_k) and ∇f(x_{k+1}) are orthogonal (see Fig. 9).
Remark 5 (Armijo rule). Instead of looking for the optimal τ, one can look for an admissible τ which guarantees a large enough decay of the functional, in order to ensure convergence of the descent. Given some parameter 0 < α < 1 (which should actually be smaller than 1/2 in order to ensure a sufficient decay), one considers a τ to be valid for a descent direction d_k (for instance d_k = −∇f(x_k)) if it satisfies
f(x_k + τd_k) ⩽ f(x_k) + ατ⟨d_k, ∇f(x_k)⟩.   (14)
For small τ, one has f(x_k + τd_k) = f(x_k) + τ⟨d_k, ∇f(x_k)⟩ + o(τ), so that, assuming d_k is a valid descent direction (i.e. ⟨d_k, ∇f(x_k)⟩ < 0), condition (14) will always be satisfied for τ small enough (if f is convex, the set of allowable τ is of the form [0, τ_max]). In practice, one performs this line search by initializing τ very large, and decaying it, τ ← βτ (for β < 1), until (14) is satisfied. This approach is often called “backtracking” line search.
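Here is a small sketch of the backtracking strategy (the constants α = 0.3, β = 0.5, the initial τ and the test function are illustrative choices, not prescribed by the text):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.5, tau0=1.0):
    # Shrink tau until the Armijo condition (14) holds for the direction d = -grad f(x)
    g = grad_f(x)
    d = -g
    tau = tau0
    while f(x + tau * d) > f(x) + alpha * tau * (d @ g):
        tau *= beta
    return x + tau * d, tau

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: np.logaddexp(0, x[0]) + np.logaddexp(0, -x[0]) + (x[1] - 1.0) ** 2
grad_f = lambda x: np.array([2 * sigma(x[0]) - 1.0, 2 * (x[1] - 1.0)])

x = np.array([3.0, -2.0])
for _ in range(50):
    x, tau = backtracking_step(f, grad_f, x)
print(x)        # converges toward the minimizer (0, 1)
```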
5 Convergence Analysis
5.1 Quadratic Case
Convergence analysis for the quadratic case. We first analyze this algorithm in the case of the
quadratic loss, which can be written as
f(x) = (1/2)||Ax − y||² = (1/2)⟨Cx, x⟩ − ⟨x, b⟩ + cst  where  C def= A⊤A ∈ R^{p×p}  and  b def= A⊤y ∈ R^p.
We already saw in (9) that if ker(A) = {0}, which is equivalent to C being invertible, then there exists a single global minimizer x⋆ = (A⊤A)^{−1}A⊤y = C^{−1}b.
Note that a function of the form (1/2)⟨Cx, x⟩ − ⟨x, b⟩ is convex if and only if the symmetric matrix C is positive semi-definite, i.e. all its eigenvalues are non-negative (as already seen in (10)).
Proposition 4. For f(x) = (1/2)⟨Cx, x⟩ − ⟨b, x⟩ (C being symmetric positive semi-definite) with the eigenvalues of C upper-bounded by L and lower-bounded by µ > 0, assuming there exists (τ_min, τ_max) such that
0 < τ_min ⩽ τ_k ⩽ τ_max < 2/L,
then there exists 0 ⩽ ρ̃ < 1 such that
||x_k − x⋆|| ⩽ ρ̃^k ||x_0 − x⋆||.   (15)
The best rate ρ̃ is obtained for
τ_k = 2/(L + µ)  ⟹  ρ̃ def= (L − µ)/(L + µ) = 1 − 2ε/(1 + ε)  where  ε def= µ/L.   (16)
Proof. One iterate of gradient descent reads x_{k+1} = x_k − τ_k(Cx_k − b). Since the solution x⋆ (which by the way is unique by strict convexity) satisfies the first order condition Cx⋆ = b, it gives
x_{k+1} − x⋆ = x_k − x⋆ − τ_kC(x_k − x⋆) = (Id_p − τ_kC)(x_k − x⋆).
If S ∈ R^{p×p} is a symmetric matrix, one has
||Sz|| ⩽ ||S||_op ||z||  where  ||S||_op def= max_k |λ_k(S)|,
where λ_k(S) are the eigenvalues of S and σ_k(S) def= |λ_k(S)| are its singular values. Indeed, S can be diagonalized in an orthogonal basis U, so that S = U diag(λ_k(S))U⊤, and S⊤S = S² = U diag(λ_k(S)²)U⊤, so that ||Sz||² = ⟨S⊤Sz, z⟩ ⩽ max_k λ_k(S)² ||z||².
Applying this to S = Id_p − τ_kC gives
||x_{k+1} − x⋆|| ⩽ h(τ_k)||x_k − x⋆||  where  h(τ) def= ||Id_p − τC||_op = max(|1 − τµ|, |1 − τL|),
since for a quadratic function one has σ_min(C) = µ and σ_max(C) = L. Figure 10, right, shows a display of h(τ). One has that for 0 < τ < 2/L, h(τ) < 1. The optimal value is reached at τ⋆ = 2/(L + µ), and then
h(τ⋆) = 1 − 2µ/(L + µ) = (L − µ)/(L + µ).
Note that when ε def= µ/L ≪ 1, which is the typical setup for ill-posed problems, the contraction constant appearing in (16) scales like
ρ̃ ∼ 1 − 2ε.   (17)
The quantity ε in some sense reflects the inverse conditioning of the problem. For quadratic functions, it indeed corresponds exactly to the inverse of the condition number (which is the ratio of the largest to smallest singular value). The condition number is minimal and equal to 1 for orthogonal matrices.
The error decay rate (15), although it is geometric O(ρ^k), is called a “linear rate” in the optimization literature. It is a “global” rate because it holds for all k (and not only for large enough k).
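One can check the rate (16) numerically; the following sketch (with an arbitrary 2×2 matrix C) runs gradient descent with the optimal step 2/(L + µ) and compares the observed per-iteration contraction to ρ̃ = (L − µ)/(L + µ):

```python
import numpy as np

C = np.array([[4.0, 1.0], [1.0, 1.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
eig = np.linalg.eigvalsh(C)
mu, L = eig.min(), eig.max()
x_star = np.linalg.solve(C, b)

tau = 2.0 / (L + mu)                      # optimal step size, see (16)
rho = (L - mu) / (L + mu)                 # predicted contraction factor

x = np.zeros(2)
err_prev = np.linalg.norm(x - x_star)
for k in range(20):
    x = x - tau * (C @ x - b)
    err = np.linalg.norm(x - x_star)
    print(k, err / err_prev, "predicted:", rho)   # observed ratio vs rho
    err_prev = err
```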
Figure 10: Contraction constant h(τ ) for a quadratic function (right).
If ker(A) ≠ {0}, then C is not positive definite (some of its eigenvalues vanish), and the set of solutions is infinite. One can however still show a linear rate, by showing that the iterates x_k actually stay orthogonal to ker(A) and redoing the above proof replacing µ by the smallest non-zero eigenvalue of C. This analysis however leads to a very poor rate ρ (very close to 1) because µ can be arbitrarily close to 0. Furthermore, such a proof does not extend to non-quadratic functions. It is thus necessary to do a different theoretical analysis, which only shows a sublinear rate on the objective function f itself rather than on the iterates x_k.
Proposition 5. For f(x) = (1/2)⟨Cx, x⟩ − ⟨b, x⟩, assuming the eigenvalues of C are bounded by L, if 0 < τ_k = τ < 1/L is constant, then
f(x_k) − f(x⋆) ⩽ dist(x_0, argmin f)² / (8τk),
where
dist(x_0, argmin f) def= min_{x⋆ ∈ argmin f} ||x_0 − x⋆||.
Proof. We have Cx⋆ = b for any minimizer x⋆ and x_{k+1} = x_k − τ(Cx_k − b), so that as before x_k − x⋆ = (Id_p − τC)^k(x_0 − x⋆), and thus
f(x_k) − f(x⋆) = (1/2)⟨(Id_p − τC)^k C (Id_p − τC)^k (x_0 − x⋆), x_0 − x⋆⟩ ⩽ (σ_max(M_k)/2) ||x_0 − x⋆||²,
where we have denoted
M_k def= (Id_p − τC)^k C (Id_p − τC)^k.
Since x⋆ can be chosen arbitrarily, one can replace ||x_0 − x⋆|| by dist(x_0, argmin f). One has, for any ℓ, the following bound
σ_ℓ(M_k) = σ_ℓ(C)(1 − τσ_ℓ(C))^{2k} ⩽ 1/(4τk),
since one can show that (setting t = τσ_ℓ(C) ⩽ 1 because of the hypotheses)
∀ t ∈ [0, 1], (1 − t)^{2k} t ⩽ 1/(4k).
Indeed, one has
(1 − t)^{2k} t ⩽ (e^{−t})^{2k} t = (1/(2k)) (2kt)e^{−2kt} ⩽ (1/(2k)) sup_{u⩾0} u e^{−u} = 1/(2ek) ⩽ 1/(4k).
Hessian. If the function is twice differentiable along the axes, the Hessian matrix is
(∂²f)(x) = ( ∂²f(x)/(∂x_i ∂x_j) )_{1⩽i,j⩽p} ∈ R^{p×p}.
Recall that ∂²f(x)/(∂x_i ∂x_j) is the derivative along the direction x_j of the function x ↦ ∂f(x)/∂x_i. We also recall that ∂²f(x)/(∂x_i ∂x_j) = ∂²f(x)/(∂x_j ∂x_i), so that ∂²f(x) is a symmetric matrix.
A differentiable function f is said to be twice differentiable at x if
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + (1/2)⟨∂²f(x)ε, ε⟩ + o(||ε||²).   (18)
This means that one can approximate f near x by a quadratic function. The Hessian matrix is uniquely determined by this relation, so that if one is able to write down an expansion with some matrix H,
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + (1/2)⟨Hε, ε⟩ + o(||ε||²),
then equating this with the expansion (18) ensures that ∂²f(x) = H. This is thus a way to actually determine the Hessian without computing all the p² partial derivatives. This Hessian can equivalently be obtained by performing an expansion (i.e. computing the differential) of the gradient, since
∇f(x + ε) = ∇f(x) + [∂²f(x)](ε) + o(||ε||),
where [∂ 2 f (x)](ε) ∈ Rp denotes the multiplication of the matrix ∂ 2 f (x) with the vector ε.
One can show that a twice differentiable function f on Rp is convex if and only if for all x the symmetric
matrix ∂ 2 f (x) is positive semi-definite, i.e. all its eigenvalues are non-negative. Furthermore, if these
eigenvalues are strictly positive then f is strictly convex (but the converse is not true, for instance x4 is
strictly convex on R but its second derivative vanishes at x = 0).
For instance, for a quadratic function f(x) = (1/2)⟨Cx, x⟩ − ⟨x, u⟩, one has ∇f(x) = Cx − u and thus ∂²f(x) = C (which is thus constant). For the classification function (4), one has ∇f(x) = −A⊤ diag(y)∇L(−diag(y)Ax)
and thus
∇f(x + ε) = −A⊤ diag(y)∇L(−diag(y)Ax − diag(y)Aε) = ∇f(x) − A⊤ diag(y)[∂²L(−diag(y)Ax)](−diag(y)Aε) + o(||ε||).
Since ∇L(u) = (ℓ′(u_i))_i, one has ∂²L(u) = diag(ℓ′′(u_i)). This means that
∂²f(x) = A⊤ diag(y) × diag(ℓ′′(−diag(y)Ax)) × diag(y)A.
One verifies that this matrix is symmetric and positive semi-definite if ℓ is convex, so that ℓ′′ is non-negative.
Remark 6 (Second order optimality condition). The first use of the Hessian is to decide whether a point x⋆ with ∇f(x⋆) = 0 is a local minimum or not. Indeed, if ∂²f(x⋆) is a positive definite matrix (i.e. its eigenvalues are strictly positive), then x⋆ is a strict local minimum. Note that if ∂²f(x⋆) is only positive semi-definite (i.e. some of its eigenvalues might vanish) then one cannot deduce anything (such as for instance x³ on R). Conversely, if x⋆ is a local minimum then ∂²f(x⋆) is positive semi-definite.
Remark 7 (Second order algorithms). A second use is to define second order methods (such as Newton's algorithm), which converge faster than gradient descent, but are more costly. The generalized gradient descent reads
x_{k+1} = x_k − H_k∇f(x_k)
where H_k ∈ R^{p×p} is a positive symmetric matrix. One recovers the gradient descent when using H_k = τ_k Id_p, and Newton's algorithm corresponds to using the inverse of the Hessian, H_k = [∂²f(x_k)]^{−1}. Note that
f(x_{k+1}) = f(x_k) − ⟨H_k∇f(x_k), ∇f(x_k)⟩ + o(||H_k∇f(x_k)||).
Since H_k is positive, if x_k is not a minimizer, i.e. ∇f(x_k) ≠ 0, then ⟨H_k∇f(x_k), ∇f(x_k)⟩ > 0. So if H_k is small enough one has a valid descent method in the sense that f(x_{k+1}) < f(x_k). It is not the purpose of this chapter to explain these types of algorithms in more detail.
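As an illustration of the generalized descent with H_k = [∂²f(x_k)]⁻¹, here is a bare-bones Newton sketch (the test function is an arbitrary smooth convex example, and a practical implementation would add safeguards such as line search):

```python
import numpy as np

def newton(grad_f, hess_f, x0, n_iter=10):
    # Generalized descent x_{k+1} = x_k - H_k grad f(x_k) with H_k = [hess f(x_k)]^{-1}
    x = x0.copy()
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Example: f(x) = log(1 + e^{x1}) + log(1 + e^{-x1}) + (x2 - 1)^2 / 2
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
grad_f = lambda x: np.array([2 * sigma(x[0]) - 1.0, x[1] - 1.0])
hess_f = lambda x: np.diag([2 * sigma(x[0]) * (1 - sigma(x[0])), 1.0])
print(newton(grad_f, hess_f, np.array([1.0, 5.0])))   # converges quickly to (0, 1)
```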
The last use of the Hessian, which we explore next, is to study theoretically the convergence of the gradient descent. One simply needs to replace the boundedness of the eigenvalues of C for a quadratic function by a boundedness of the eigenvalues of ∂²f(x) for all x. Roughly speaking, the theoretical analysis of the gradient descent for a generic function is obtained by applying this approximation and using the proofs of the previous section.
Proof. We prove (19), using a Taylor expansion with integral remainder:
f(x′) − f(x) = ∫_0^1 ⟨∇f(x_t), x′ − x⟩ dt = ⟨∇f(x), x′ − x⟩ + ∫_0^1 ⟨∇f(x_t) − ∇f(x), x′ − x⟩ dt,
where x_t def= x + t(x′ − x). Using Cauchy-Schwarz, and then the smoothness hypothesis (R_L),
f(x′) − f(x) ⩽ ⟨∇f(x), x′ − x⟩ + ∫_0^1 L||x_t − x|| ||x′ − x|| dt ⩽ ⟨∇f(x), x′ − x⟩ + L||x′ − x||² ∫_0^1 t dt,
which gives the desired result since ||x_t − x|| ||x′ − x|| = t||x′ − x||².
The relation (19) shows that a smooth (resp. strongly convex) functional is bounded from above (resp. below) by a quadratic tangential majorant (resp. minorant).
Condition (20) thus reads that the singular values of ∂ 2 f (x) should be contained in the interval [µ, L].
The upper bound is also equivalent to ||∂ 2 f (x)||op ⩽ L where || · ||op is the operator norm, i.e. the largest
singular value. In the special case of a quadratic function of the form ⟨Cx, x⟩ − ⟨b, x⟩ (recall that necessarily
C is semi-definite symmetric positive for this function to be convex), ∂ 2 f (x) = C is constant, so that [µ, L]
can be chosen to be the range of the eigenvalues of C.
Convergence analysis. We now give a convergence theorem for a general convex function. In contrast to the quadratic case, if one does not assume strong convexity, one can only show a sub-linear rate on the function values (and no rate at all on the iterates themselves). It is only when one assumes strong convexity that a linear rate is obtained. Note that without strong convexity, the solution of the minimization problem is not necessarily unique.
Theorem 1. If f satisfies conditions (R_L), assuming there exists (τ_min, τ_max) such that
0 < τ_min ⩽ τ_k ⩽ τ_max < 2/L,
then x_k converges to a solution x⋆ of (1) and there exists C > 0 such that
f(x_k) − f(x⋆) ⩽ C/(k + 1).   (21)
If furthermore f is µ-strongly convex, then there exists 0 ⩽ ρ < 1 such that ||x_k − x⋆|| ⩽ ρ^k ||x_0 − x⋆||.
Proof. In the case where f is not strongly convex, we only prove (21), since the proof that x_k converges is more technical. Note indeed that if the minimizer x⋆ is non-unique, then it might be the case that the iterates x_k “cycle” while approaching the set of minimizers, but actually convexity of f prevents this kind of pathological behavior. For simplicity, we do the proof in the case τ_k = 1/L, but it extends to the general case. The L-smoothness property implies (19), which reads
f(x_{k+1}) ⩽ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)||x_{k+1} − x_k||².
Using the fact that x_{k+1} − x_k = −(1/L)∇f(x_k), one obtains
f(x_{k+1}) ⩽ f(x_k) − (1/L)||∇f(x_k)||² + (1/(2L))||∇f(x_k)||² ⩽ f(x_k) − (1/(2L))||∇f(x_k)||².   (22)
This shows that (f(x_k))_k is a decaying sequence. By convexity, one can then deduce that
f(x_{k+1}) − f(x⋆) ⩽ L||x_0 − x⋆||²/(2(k + 1)),
which gives (21) for C def= L||x_0 − x⋆||²/2.
If we now assume f is µ-strongly convex, then, using ∇f(x⋆) = 0, one has (µ/2)||x⋆ − x||² ⩽ f(x) − f(x⋆) for all x. Re-manipulating (25) gives
(µ/2)||x_{k+1} − x⋆||² ⩽ f(x_{k+1}) − f(x⋆) ⩽ (L/2)||x_k − x⋆||² − (L/2)||x⋆ − x_{k+1}||²,
and hence
||x_{k+1} − x⋆|| ⩽ sqrt(L/(L + µ)) ||x_k − x⋆||,   (26)
which is the desired result.
Note that in the low conditioning setting ε ≪ 1, one retrieves a dependency of the rate (26) similar to the one of quadratic functions (17); indeed
sqrt(L/(L + µ)) = (1 + ε)^{−1/2} ∼ 1 − ε/2.
5.3 Acceleration
The previous analysis shows that for L-smooth functions (i.e. with a hessian uniformly bounded by L,
||∂ 2 f (x)||op ⩽ L), the gradient descent with fixed step size converges with a speed on the function value
f (xk ) − min f = O(1/k). Even using various line search strategies, it is not possible to improve over this
rate. A way to improve this rate is by introducing some form of “momentum” extrapolation and rather
consider a pair of variables (xk , yk ) with the following update rule, for some step size s (which should be
smaller than 1/L)
xk+1 = yk − s∇f (yk )
yk+1 = xk+1 + βk (xk+1 − xk )
where the extrapolation parameter satisfies 0 < β_k < 1. The case of a fixed β_k = β corresponds to the so-called “heavy-ball” method. In order for the method to bring an improvement over the 1/k “worst case” rate (which does not mean it improves in all possible cases), one needs to rather use an increasing momentum β_k → 1, one popular choice being
β_k = (k − 1)/(k + 2) ∼ 1 − 3/k.
This corresponds to the so-called “Nesterov” acceleration (although Nesterov used a slightly different choice, with the same 1 − 3/k asymptotic behavior).
When using s ⩽ 1/L, one can show that f(x_k) − min f = O(||x_0 − x⋆||²/(sk²)), so that in the worst case scenario, the convergence rate is improved. Note however that in some situations, acceleration actually deteriorates the rates. For instance, if the function is strongly convex (and even in the simple case f(x) = ||x||²), Nesterov acceleration does not enjoy a linear convergence rate.
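A minimal sketch of the accelerated scheme (the ill-conditioned least squares test problem is an arbitrary choice; s = 1/L and β_k = (k − 1)/(k + 2) as above):

```python
import numpy as np

def nesterov(grad_f, x0, s, n_iter):
    # x_{k+1} = y_k - s grad f(y_k),  y_{k+1} = x_{k+1} + beta_k (x_{k+1} - x_k)
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, n_iter + 1):
        x = y - s * grad_f(y)
        beta = (k - 1) / (k + 2)
        y = x + beta * (x - x_prev)
        x_prev = x
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 20)) * np.linspace(0.1, 1.0, 20)  # ill-conditioned columns
y_data = rng.standard_normal(100)
grad_f = lambda x: A.T @ (A @ x - y_data)
L = np.linalg.norm(A, 2) ** 2                                   # Lipschitz constant of the gradient
x = nesterov(grad_f, np.zeros(20), s=1.0 / L, n_iter=500)
print(np.linalg.norm(A.T @ (A @ x - y_data)))                   # small residual gradient
```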
A way to interpret this scheme is by looking at a time-continuous ODE limit when s → 0. In contrast to the classical gradient descent, the step size here should be taken as τ = √s so that the time evolves as t = √s k. The update reads
(x_{k+1} − x_k)/τ = (1 − 3/k)(x_k − x_{k−1})/τ − τ∇f(y_k),
which can be re-written as
(x_{k+1} + x_{k−1} − 2x_k)/τ² + (3/(kτ)) (x_k − x_{k−1})/τ + ∇f(y_k) = 0.
Assuming (x_k, y_k) → (x(t), y(t)), one obtains in the limit the following second order ODE
x″(t) + (3/t) x′(t) + ∇f(x(t)) = 0  with  x(0) = x_0, x′(0) = 0.
This corresponds to the movement of a ball in the potential field f, where the term (3/t)x′(t) plays the role of a friction which vanishes in the limit. So for small t, the method is similar to a gradient descent x′ = −∇f(x), while for large t, it resembles a Newtonian evolution x″ = −∇f(x) (which keeps oscillating without converging). The momentum decay rate 3/t is very important: it is the only rule which enables the speed improvement from 1/k to 1/k².
its Legendre transform. In the case of such “Legendre-type” entropy functions, ∇ψ : dom(ψ) → dom(ψ⋆) and ∇ψ⋆ are bijections, each being the inverse of the other.
One then defines the associated Bregman divergence
D_ψ(x|y) def= ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.
It is positive, convex in x (but not necessarily in y), not necessarily symmetric, and “distance-like”.
For ψ = ||·||²/2 one has ∇ψ = ∇ψ⋆ = Id, and one recovers the (squared) Euclidean distance. For ψ_KL(x) = Σ_i x_i log(x_i) − x_i + 1 one has ∇ψ = log and ∇ψ⋆ = exp, and one obtains the relative entropy, also known as the Kullback-Leibler divergence,
D_{ψ_KL}(x|y) = Σ_i x_i log(x_i/y_i) − x_i + y_i.
When ψ_Burg(x) = Σ_i (−log(x_i) + x_i − 1) on R^d_+, one has ∇ψ_Burg(x) = 1 − 1/x, and the associated divergence is
D_{ψ_Burg}(x|y) = Σ_i (−log(x_i/y_i) + x_i/y_i − 1).   (27)
For instance, if h(s) = s log(s) − s + 1 is the Shannon entropy, this defines the quantum Shannon entropy, which is jointly convex in x and y. Only for ψ = ψ_KL does one have D_{ψ_KL} = C_{ψ_KL}.
The fact that ψ is of Legendre type allows one to ignore the constraint, and the solution satisfies the following first order condition
∇f(x_k) + (1/τ)[∇ψ(x_{k+1}) − ∇ψ(x_k)] = 0,
so that it can be explicitly computed as
x_{k+1} = ∇ψ⋆(∇ψ(x_k) − τ∇f(x_k)).   (29)
For ψ = ||·||²/2 one recovers the usual Euclidean gradient descent. For ψ(x) = Σ_i x_i log(x_i), this defines the multiplicative updates
x_{k+1} = x_k ⊙ exp(−τ∇f(x_k))
where ⊙ is the entry-wise multiplication of vectors.
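A small sketch of these multiplicative updates (the objective and the step τ are illustrative choices; positivity of the iterates is preserved automatically):

```python
import numpy as np

def mirror_descent_entropy(grad_f, x0, tau, n_iter):
    # Multiplicative updates x_{k+1} = x_k * exp(-tau * grad f(x_k)),
    # i.e. mirror descent for the Shannon entropy psi(x) = sum_i x_i log(x_i)
    x = x0.copy()
    for _ in range(n_iter):
        x = x * np.exp(-tau * grad_f(x))
    return x

# Minimize f(x) = 1/2 ||x - c||^2 over positive vectors
c = np.array([2.0, -1.0, 0.5])
grad_f = lambda x: x - c
x0 = np.ones(3)                    # strictly positive initialization
x = mirror_descent_entropy(grad_f, x0, tau=0.1, n_iter=2000)
print(x)                           # close to max(c, 0) = (2, 0, 0.5), the minimizer on R^d_+
```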
Note that introducing the “dual” variable u_k ≜ ∇ψ(x_k), one has
u_{k+1} = u_k − τ∇f(x_k) = u_k − τ h(u_k)  where  h(u) ≜ ∇f(∇ψ⋆(u)).   (30)
Note however that in general h is not a gradient field, so this is not in general a gradient flow.
so that this is a gradient flow on a very particular type of manifold, of “Hessian type”. Note that if ψ = f ,
then one recovers the flow associated to Newton’s method.
Convergence. Convergence theory (ensuring convergence and rates) for mirror descent is the same as for the usual gradient descent, and one needs to consider relative L-smoothness, and if possible also relative µ-strong convexity.
If L < +∞, then one has f(x_k) − f(x⋆) ⩽ O(D_ψ(x⋆|x_0)/k), while if both 0 < µ ⩽ L < +∞, then D_ψ(x_k|x⋆) ⩽ O(D_ψ(x⋆|x_0)(1 − µ/L)^k). The advantages of using a Bregman geometry are two-fold: it can improve the conditioning µ/L (some functions might be non-smooth for the Euclidean geometry but smooth for some Bregman geometry, and this can avoid introducing constraints in the optimization problem), and it can also lower the radius of the domain D_ψ(x⋆|x_0). For instance, assuming the solution belongs to the simplex, and using x_0 = 1_d/d, then D_{ψ_KL}(x⋆|x_0) ⩽ log(d), whereas for the ℓ² Euclidean distance, one only has the bound ||x⋆ − x_0||² ⩽ d.
so that, denoting z(t) the gradient flow ż = −∇g(z) of g, and x(t) ≜ φ(z(t)), one has ẋ(t) = [∂φ(z(t))]ż(t) and thus x(t) solves the following equation
ẋ(t) = −Q(z(t))∇f(x(t))  where  Q(z) ≜ [∂φ(z)][∂φ(z)]⊤.
So unless φ is a bijection, this is not a gradient flow over the x variable. If φ is a bijection, then this is a gradient flow associated to the field of tensors (“manifold”) Q(φ^{−1}(x)). The issue is that even in this case, in general this metric might fail to be of Hessian type, so this does not correspond to a mirror descent flow.
Dual parameterization. If ψ is an entropy function, then for the parameterization x = ∇ψ⋆(z), i.e. φ = ∇ψ⋆, one has Q(z) = [∂²ψ⋆(z)]², i.e. Q(φ^{−1}(x)) = [∂²ψ(x)]^{−2}, which is not of Hessian type in general, but rather a squared-Hessian manifold. For instance, when ψ⋆(z) = exp(z), then Q(φ^{−1}(x)) = diag(x_i²), which surprisingly is the Hessian metric associated to Burg's entropy −Σ_i log(x_i).
Example: power-type parameterization. We consider the power entropies (28), on R^d_+, for α ⩽ 1, for which
H(x) = [∂²ψ(x)]^{−1} ∝ diag(x_i^{2−α}).
Remark that when using the parameterization x = φ(z) = (z_i^b)_i, then
Q(φ^{−1}(x)) = [∂φ(z)][∂φ(z)]⊤ ∝ diag(z_i^{2(b−1)}) = diag(x_i^{2(b−1)/b}),
so if one selects 2(1 − 1/b) = 2 − α, i.e. 2/b = α, the re-parameterized flow is equal to the flow on a Hessian manifold. For instance, when setting b = 2, α = 1, i.e. using the parameterization x = z², one retrieves the flow on the manifold for the Shannon entropy (“Fisher-Rao” geometry). Note that when b → +∞, one obtains α = 0, i.e. the flow is the one of Burg's entropy ψ(x) = −Σ_i log(x_i) (which we saw above as also being associated to the parameterization x = exp(z)).
Counter-example: SDP matrices. We now consider positive semi-definite symmetric matrices X ∈ S^{d×d}_+, together with the parameterization X = φ(Z) = ZZ⊤ for Z ∈ R^{d×d}. In this case, denoting g(Z) = f(ZZ⊤), one has
∇g(Z) = [∇f(X) + ∇f(X)⊤]Z,
so that the flow Ż = −∇g(Z) is equivalent to the following flow on symmetric matrices (and it maintains positivity as well)
Ẋ = −(X[∇_S f(X)] + [∇_S f(X)]X)   (32)
where the symmetric gradient is
∇_S f(X) ≜ [∇f(X)] + [∇f(X)]⊤.
So most likely (32) cannot be written as a usual gradient flow on a manifold which would be the Hessian of a convex function. To mimic the diagonal (or vectorial) case above, the most natural quantity would have been the spectral entropy ψ(X) ≜ tr(X log(X) − X + Id), whose gradient is log(X), but unfortunately there is no closed form expression for the differential of the matrix logarithm. Another simpler approach, mimicking Burg's entropy, is to use ψ(X) = −tr(log(X)) = −log det(X), because its Hessian and its inverse can be computed:
∂²ψ(X) : S ↦ X^{−1}SX^{−1}.
where the loss is coercive and such that ℓ(·, y_i) has a unique minimizer at y_i. The typical example is f(x) = ||Ax − y||² for ℓ(u, v) = (u − v)². We do not impose that L is convex, and simply assume convergence of the considered optimization method to the set of global minimizers. The set of global minimizers is thus the affine space
argmin f = {x ; Ax = y}.
The simplest optimization method is just gradient descent, x_{k+1} = x_k − τ∇f(x_k). As τ → 0, one defines x(t) = x_k for t = kτ and considers the flow ẋ(t) = −∇f(x(t)).
The implicit bias of the descent (and of the flow) is given by the orthogonal projection.
Proposition 7. If x_k → x⋆ ∈ argmin f, then x⋆ = argmin_{x ∈ argmin f} ||x − x_0||, i.e. x⋆ is the orthogonal projection of x_0 onto argmin f.
The following Proposition, whose proof can be found in [?], generalizes this result to the case of an arbitrary mirror flow.
Proposition 8. If xk defined by (29) (resp. x(t) defined by (31)) is such that xk (resp. x(t)) converges to
x⋆ ∈ argmin f , then
x⋆ = argmin Dψ (x|x0 ). (33)
x∈argmin f
Proof. From the dual variable evolution (30), since ∇f (x) ∈ Im(A⊤ ), one has that yk − y0 ∈ Im(A⊤ ), so
that in the limit
y ⋆ − y0 = ∇ψ(x⋆ ) − ∇ψ(x0 ) ∈ Im(A⊤ ). (34)
Note that ∇ Dψ (x|x0 ) = ∇ψ(x) − ∇ψ(x0 ), and Im(A⊤ ) = Ker(A)⊥ is the space orthogonal to argmin f so
that (34) are the optimality conditions of the strictly convex problem (33).
In particular, for the Shannon entropy (equivalently when using the x = z² parameterization), as x_0 → 0, by doing the expansion of KL(x|x_0) one has
x⋆ → argmin_{x ∈ argmin f, x⩾0} Σ_i |log((x_0)_i)| x_i,
which is a weighted ℓ¹ norm (so in particular it induces sparsity in the solution: it is a Lasso-type problem).
When using more general parameterizations of the form x = z^b for b > 0, this corresponds to using the power entropy ψ_α for α = 2/b, and one can check that the associated limit bias for small x_0 is still an ℓ¹ norm, but with a different weighting scheme. For x = exp(z) (or b → +∞) one obtains Burg's entropy defined in (27), so that the limit bias is Σ_i x_i/(x_0)_i. The use of the x = z² parameterization (which can be generalized to x = u ⊙ v for signed vectors) was introduced in [?], and its associated implicit regularization is detailed in [?, ?]. It is possible to analyze this sparsity-inducing behavior in a quantitative way, see for instance [?, Thm. 2]. One can generalize this parameterization to arbitrary (not only positive) vectors by using x = u² − v² or x = u ⊙ v, and the same type of bias appears, now rather with a (weighted) ℓ¹ norm.
7 Regularization
When the number n of samples is not large enough with respect to the dimension p of the model, it makes sense to regularize the empirical risk minimization problem.
We assume for simplicity that R is positive and coercive, i.e. R(x) → +∞ as ||x|| → +∞. The following proposition shows that in the small λ limit, the regularization selects a subset of the possible minimizers. This is especially useful when ker(A) ≠ {0}, i.e. when the equation Ax = y has an infinite number of solutions.
Proposition 9. If (x_{λ_k})_k is a sequence of minimizers of f_{λ_k} with λ_k → 0, then this sequence is bounded, and any accumulation point x⋆ is a solution of the constrained optimization problem
argmin_{Ax=y} R(x).
For the ridge penalty R(x) = ||x||², this limit is the minimal norm solution argmin_{Ax=y} ||x||, and one can use the identity
(A⊤A + λId_p) A⊤(AA⊤ + λId_n)^{−1} = A⊤(AA⊤ + λId_n)(AA⊤ + λId_n)^{−1} = A⊤
to express the minimizer of f_λ using either a p×p or an n×n matrix inversion.
If ker(A) = {0} (overdetermined setting), A⊤A ∈ R^{p×p} is an invertible matrix, and (A⊤A + λId_p)^{−1} → (A⊤A)^{−1}, so that
x_0 = A⁺y  where  A⁺ def= (A⊤A)^{−1}A⊤.
Conversely, if ker(A⊤) = {0}, or equivalently Im(A) = R^n (underdetermined setting), then one has
x_0 = A⁺y  where  A⁺ def= A⊤(AA⊤)^{−1}.
In the special case where n = p and A is invertible, both definitions of A⁺ coincide, and A⁺ = A^{−1}. In the general case (where A is neither injective nor surjective), A⁺ can be computed using the Singular Value Decomposition (SVD). The matrix A⁺ is often called the Moore-Penrose pseudo-inverse.
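A sketch comparing the two expressions of the ridge estimator and its λ → 0 limit (synthetic data; both formulas should agree up to numerical precision):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 30                       # underdetermined: ker(A) != {0}
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 1e-3

# Two equivalent expressions of the ridge solution (p x p vs n x n inversion)
x1 = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
x2 = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), y)
print(np.allclose(x1, x2))

# As lambda -> 0, the ridge solution converges to A^+ y (minimal norm solution)
x_pinv = np.linalg.pinv(A) @ y
print(np.linalg.norm(x2 - x_pinv))  # small for small lambda
```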
7.3 Lasso
The Lasso corresponds to using an ℓ¹ penalty
R(x) def= ||x||₁ = Σ_{k=1}^p |x_k|.
Proof. One has f_λ(x) = Σ_k (1/2)(x_k − y_k)² + λ|x_k|, so that one needs to find the minimum of the 1-D function x ∈ R ↦ (1/2)(x − y)² + λ|x|. We can do this minimization “graphically” as shown on Fig. 12. For x > 0, one has F′(x) = x − y + λ, which vanishes at x = y − λ. The minimum is thus at x = y − λ when λ ⩽ y, and stays at 0 for all λ > y. The problem is symmetric with respect to the switch (x, y) ↦ (−x, −y).
Here, S_λ(x) def= sign(x) max(|x| − λ, 0) (applied entry-wise) is the celebrated soft-thresholding non-linear function.
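A one-line implementation of the soft-thresholding map (applied entry-wise to a vector):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0), applied entry-wise
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.7, 3.0]), lam=0.5))
# [-1.5 -0.   0.   0.2  2.5]
```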
Figure 12: Evolution with λ of the function F def= (1/2)||· − y||² + λ|·|.
We notice that f_τ(x, x) = f(x), and the quadratic part of this function reads
K(x, x′) def= −(1/2)||Ax − Ax′||² + (1/(2τ))||x − x′||² = (1/2)⟨((1/τ)Id_p − A⊤A)(x − x′), x − x′⟩.
This quantity K(x, x′) is positive if λ_max(A⊤A) ⩽ 1/τ (maximum eigenvalue), i.e. τ ⩽ 1/||A||²_op, where we recall that ||A||_op = σ_max(A) is the operator (algebra) norm. This shows that f_τ(x, x′) is a valid surrogate functional, in the sense that
f(x) ⩽ f_τ(x, x′),   f_τ(x, x) = f(x),   and   f(·) − f_τ(·, x′) is smooth.
We also note that this majorant f_τ(·, x′) is convex. This leads to define
x_{k+1} def= argmin_x f_τ(x, x_k).   (40)
Equation (41) defines the iterative soft-thresholding algorithm (ISTA). It follows from a valid convex surrogate function if τ ⩽ 1/||A||²_op, but one can actually show that it converges to a solution of the Lasso as soon as τ < 2/||A||²_op, which is exactly as for the classical gradient descent.
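A compact sketch of the resulting iterations for the Lasso (synthetic sparse data; τ = 1/||A||²_op):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, y, lam, n_iter=500):
    # Iterative soft thresholding: gradient step on 1/2||Ax-y||^2, then soft threshold
    tau = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - tau * A.T @ (A @ x - y), tau * lam)
    return x

rng = np.random.default_rng(5)
n, p = 40, 100
A = rng.standard_normal((n, p))
x_true = np.zeros(p); x_true[:5] = rng.standard_normal(5)   # sparse ground truth
y = A @ x_true + 0.01 * rng.standard_normal(n)
x = ista(A, y, lam=0.1)
print(np.count_nonzero(x), "non-zero coefficients")         # recovers a sparse vector
```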
8 Stochastic Optimization
We detail some important stochastic gradient descent methods, which make it possible to perform optimization in the setting where the number of samples n is large, or even infinite.
Problem (42) can be seen as a special case of (43), when using the discrete empirical uniform measure π = (1/n)Σ_{i=1}^n δ_i and setting f(x, i) = f_i(x). One can also view (42) as a discretized “empirical” version of (43) when drawing (z_i)_i i.i.d. according to z and defining f_i(x) = f(x, z_i). In this setup, (42) converges to (43) as n → +∞.
A typical example of such a class of problems is empirical risk minimization for linear models, where in these cases
f_i(x) = ℓ(⟨a_i, x⟩, y_i)  and  f(x, z) = ℓ(⟨a, x⟩, y)   (44)
for z = (a, y) ∈ Z = (A = R^p) × Y (typically Y = R for regression or Y = {−1, +1} for classification), where ℓ is some loss function. We illustrate below the methods on binary logistic classification, where
L(s, y) def= log(1 + exp(−sy)).   (45)
But this extends to arbitrary parametric models, and in particular deep neural networks.
While some algorithms (in particular batch gradient descent) are specific to finite sums (42), the stochastic methods we detail next work verbatim (with the same convergence guarantees) in the expectation case (43). For the sake of simplicity, we however do the exposition for the finite sum case, which is sufficient in the vast majority of cases. But one should keep in mind that n can be arbitrarily large, so it is not acceptable in this setting to use algorithms whose complexity per iteration depends on n.
If the functions f_i(x) are very similar (the extreme case being that they are all equal), then of course there is a gain in using stochastic optimization (since in this case ∇f_i ≈ ∇f but ∇f_i is n times cheaper). But in general stochastic optimization methods are not necessarily faster than batch gradient descent. If n is not too large, so that one can afford the price of doing a few non-stochastic iterations, then deterministic methods can be faster. But if n is so large that one cannot do even a single deterministic iteration, then stochastic methods allow one to have a fine-grained scheme by breaking the cost of deterministic iterations into smaller chunks. Another advantage is that they are quite easy to parallelize.
Figure 13: Evolution of the error of the BGD for logistic classification.
The batch gradient descent (BGD) simply applies the gradient descent iterations (13) to f, and the step size should be chosen as 0 < τ_min < τ_k < τ_max def= 2/L where L is the Lipschitz constant of the gradient ∇f. In particular, in this deterministic setting, this step size should not go to zero, and this ensures quite fast convergence (even linear rates if f is strongly convex).
The computation of the gradient in our setting reads
∇f(x) = (1/n) Σ_{i=1}^n ∇f_i(x)  where  ∇f_i(x) = L′(⟨a_i, x⟩, y_i) a_i,   (46)
and L′(s, y) ∈ R denotes the derivative with respect to the first variable, i.e. the derivative of the map s ∈ R ↦ L(s, y) ∈ R. This computation shows that evaluating the full gradient requires a pass over all n samples.
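As an illustration, batch gradient descent for the logistic loss (45) can be written by averaging the per-sample gradients as in (46); the synthetic data and the step size 1/L (using a standard bound on the Lipschitz constant, stated here as an assumption) are illustrative choices.

```python
import numpy as np

def grad_f(A, y, x):
    # (46): grad f(x) = (1/n) sum_i grad f_i(x), with f_i(x) = log(1 + exp(-y_i <a_i, x>))
    n = A.shape[0]
    s = -y / (1.0 + np.exp(y * (A @ x)))        # derivative of the logistic loss w.r.t. <a_i, x>
    return A.T @ s / n

rng = np.random.default_rng(6)
n, p = 500, 10
A = rng.standard_normal((n, p))
y = np.sign(A @ rng.standard_normal(p))

L = np.linalg.norm(A, 2) ** 2 / (4 * n)         # assumed bound on the Lipschitz constant of grad f
x = np.zeros(p)
for k in range(1000):
    x = x - (1.0 / L) * grad_f(A, y, x)          # fixed step size, as for BGD
print(np.linalg.norm(grad_f(A, y, x)))           # small gradient norm at the end
```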
Figure 16: Display of a large number of trajectories k 7→ xk ∈ R generated by several runs of SGD. On the
top row, each curve is a trajectory, and the bottom row displays the corresponding density.
Note that each step of a batch gradient descent has complexity O(np), while a step of SGD only has complexity O(p). SGD is thus advantageous when n is very large, and one cannot afford to do several passes through the data. In some situations, SGD can provide accurate results even with k ≪ n, exploiting the redundancy between the samples.
A crucial question is the choice of the step size schedule τ_k. It must tend to 0 in order to cancel the noise induced on the gradient by the stochastic sampling. But it should not go to zero too fast, in order for the method to keep converging.
A typical schedule that ensures both properties is to have asymptotically τ_k ∼ k^{−1} for k → +∞. We thus propose to use
τ_k def= τ_0 / (1 + k/k_0),   (49)
where k_0 indicates roughly the number of iterations serving as a “warmup” phase.
Figure 15: Schematic view of SGD iterates.
Figure 16 shows a simple 1-D example to minimize f_1(x) + f_2(x) for x ∈ R and f_1(x) = (x − 1)² and f_2(x) = (x + 1)². One can see how the density of the distribution of x_k progressively clusters around the minimizer x⋆ = 0. Here the distribution of x_0 is uniform on [−1/2, 1/2].
The following theorem shows the convergence in expectation with a 1/√k rate on the objective.
Theorem 2. We assume f is µ-strongly convex as defined in (S_µ) (i.e. µ Id_p ⪯ ∂²f(x) if f is C²), and is such that ||∇f_i(x)||² ⩽ C². For the step size choice τ_k = 1/(µ(k + 1)), one has
E(||x_k − x⋆||²) ⩽ R/(k + 1)  where  R = max(||x_0 − x⋆||², C²/µ²),   (50)
where E indicates an expectation with respect to the i.i.d. sampling performed at each iteration.
Figure 17: Evolution of the error of the SGD for logistic classification (dashed line shows BGD).
Considering only the expectation with respect to the random sampling of the index i(k), one has
where we used the fact (48) that the gradient is unbiased. Taking now the full expectation with respect to
all the other previous iterates, and using (51) one obtains
E(||xk+1 − x⋆ ||2 ) ⩽ E(||xk − x⋆ ||2 ) − 2µτk E(||xk − x⋆ ||2 ) + τk2 C 2 = (1 − 2µτk )E(||xk − x⋆ ||2 ) + τk2 C 2 . (52)
We show by recursion that the bound (50) holds. We denote ε_k def= E(||x_k − x⋆||²). For k = 0, it indeed holds that
ε_0 = ||x_0 − x⋆||² ⩽ max(||x_0 − x⋆||², C²/µ²) = R.
We now assume that ε_k ⩽ R/(k + 1). Using (52) in the case of τ_k = 1/(µ(k + 1)), one has, denoting m = k + 1,
ε_{k+1} ⩽ (1 − 2µτ_k)ε_k + τ_k²C² = (1 − 2/m)ε_k + C²/(µm)²
⩽ (1 − 2/m)(R/m) + R/m² = (1/m − 1/m²)R = ((m − 1)/m²) R ⩽ R/(m + 1),
where we used C²/µ² ⩽ R and (m − 1)(m + 1) = m² − 1 ⩽ m².
A weakness of SGD (as well as the SGA scheme studied next) is that it only weakly benefits from the strong convexity of f. This is in sharp contrast with BGD, which enjoys a fast linear rate for strongly convex functionals, see Theorem 1.
Figure 17 displays the evolution of the energy f(x_k). It overlays on top (black dashed curve) the convergence of the batch gradient descent, with a careful scaling of the number of iterations to account for the fact that the complexity of a batch iteration is n times larger.
This defines the Stochastic Gradient Descent with Averaging (SGA) algorithm.
Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:
x̃_{k+1} = (1/k) x_k + ((k − 1)/k) x̃_k.
In this case, a typical choice of step size decay is rather of the form
τ_k def= τ_0 / (1 + √(k/k_0)).
Notice that the step size now goes to 0 much more slowly, at rate k^{−1/2}.
Typically, because the averaging stabilizes the iterates, the choice of (k0 , τ0 ) is less important than for
SGD.
Bach proves that for logistic classification, SGA leads to a faster convergence than SGD (the constants involved are smaller), since in contrast to SGD, SGA is adaptive to the local strong convexity of E.
9 Automatic Differentiation
The main computational bottleneck of gradient descent methods (batch or stochastic) is the computation of gradients ∇f(x). For simple functionals, such as those encountered in ERM for linear models, and also for MLPs with a single hidden layer, it is possible to compute these gradients in closed form, and the main computational burden is the evaluation of matrix-vector products. For more complicated functionals (such as those involving deep networks), computing the formula for the gradient quickly becomes cumbersome. Even worse: computing these gradients using the usual chain rule formula is sub-optimal. We present methods to compute these gradients recursively and in an optimal manner. The purpose of this approach is to automate this computational step.
Figure 18: Evolution of log10 (f (xk ) − f (x⋆ )) for SGD, SGA and SAG.
∀ k = s + 1, . . . , t, xk = fk (x1 , . . . , xk−1 )
where f_k is a function which only depends on the previous variables, see Fig. 19. One can represent this algorithm using a directed acyclic graph (DAG), linking the variables involved in f_k to x_k. The nodes of this graph are thus conveniently ordered by their indexing, and the directed edges only link a variable to another one with a strictly larger index. The evaluation of f(x) thus corresponds to a forward traversal of this graph.
Note that the goal of automatic differentiation is not to define an efficient computational graph: it is up to the user to provide this graph. Computing an efficient graph associated to a mathematical formula is a complicated combinatorial problem, which still has to be solved by the user. Automatic differentiation thus leverages the availability of an efficient graph to provide an efficient algorithm to evaluate derivatives.

Figure 20: Relation between the variables for the forward (left) and backward (right) modes.
The forward mode computes, for a given input variable x_1, all the derivatives ∂x_k/∂x_1 by the recursion
∀ k = s + 1, . . . , t,  ∂x_k/∂x_1 = Σ_{ℓ∈parent(k)} [∂x_k/∂x_ℓ] × ∂x_ℓ/∂x_1.
The notation “parent(k)” denotes the nodes ℓ < k of the graph that are connected to k, see Figure 20, left. Here the quantities being computed (i.e. stored in computer variables) are the derivatives ∂x_ℓ/∂x_1, and × denotes in full generality matrix-matrix multiplications. We have put in [. . .] an informal notation, since here ∂x_k/∂x_ℓ should be interpreted not as a numerical variable but as the derivative of the function f_k, which can be evaluated on the fly (we assume that the derivatives of the functions involved are accessible in closed form).
Assuming all the involved functions ∂f_k/∂x_ℓ have the same complexity (which is likely to be the case if all the n_k are for instance scalar or have the same dimension), and that the number of parent nodes is bounded, one sees that the complexity of this scheme is p times the complexity of the evaluation of f (since this needs to be repeated p times for ∂/∂x_1, . . . , ∂/∂x_p). For a large p, this is prohibitive.
Figure 21: Example of a simple computational graph.
We consider the function
f(x, y) = y log(x) + √(y log(x)),   (53)
whose computational graph is displayed on Figure 21. The iterations of the forward mode to compute the derivative with respect to x read
∂x/∂x = 1,  ∂y/∂x = 0
∂a/∂x = [∂a/∂x] ∂x/∂x = (1/x) ∂x/∂x   {x ↦ a = log(x)}
∂b/∂x = [∂b/∂a] ∂a/∂x + [∂b/∂y] ∂y/∂x = y ∂a/∂x + 0   {(y, a) ↦ b = ya}
∂c/∂x = [∂c/∂b] ∂b/∂x = (1/(2√b)) ∂b/∂x   {b ↦ c = √b}
∂f/∂x = [∂f/∂b] ∂b/∂x + [∂f/∂c] ∂c/∂x = 1 · ∂b/∂x + 1 · ∂c/∂x   {(b, c) ↦ f = b + c}
One needs to run another forward pass to compute the derivative with respect to y:
∂x/∂y = 0,  ∂y/∂y = 1
∂a/∂y = [∂a/∂x] ∂x/∂y = 0   {x ↦ a = log(x)}
∂b/∂y = [∂b/∂a] ∂a/∂y + [∂b/∂y] ∂y/∂y = 0 + a ∂y/∂y   {(y, a) ↦ b = ya}
∂c/∂y = [∂c/∂b] ∂b/∂y = (1/(2√b)) ∂b/∂y   {b ↦ c = √b}
∂f/∂y = [∂f/∂b] ∂b/∂y + [∂f/∂c] ∂c/∂y = 1 · ∂b/∂y + 1 · ∂c/∂y   {(b, c) ↦ f = b + c}
Dual numbers. A convenient way to implement this forward pass is to make use of so-called “dual numbers”, which form an algebra over the reals where numbers have the form x + εx′, ε being a symbol obeying the rule ε² = 0. Here (x, x′) ∈ R² and x′ is intended to store a derivative with respect to some input variable. These numbers thus obey the following arithmetic operations:
(x + εx′)(y + εy′) = xy + ε(xy′ + yx′)  and  1/(x + εx′) = 1/x − ε x′/x².
If f is a polynomial or a rational function, from these rules one has that
f(x + ε) = f(x) + εf′(x).
Using this definition, one has that
f(g(x + ε)) = f(g(x) + εg′(x)) = f(g(x)) + εf′(g(x))g′(x),
which corresponds to the usual chain rule. More generally, if f(x_1, . . . , x_s) is a function implemented using these overloaded basic functions, one has
f(x_1 + ε, x_2, . . . , x_s) = f(x_1, . . . , x_s) + ε (∂f/∂x_1)(x_1, . . . , x_s),
and this evaluation is equivalent to applying the forward mode of automatic differentiation to compute (∂f/∂x_1)(x_1, . . . , x_s) (and similarly for the other variables).
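A minimal Python sketch of dual numbers (only the operations needed for the example (53) are overloaded; this is an illustration, not a full implementation):

```python
import math

class Dual:
    # Numbers of the form x + eps * dx with eps^2 = 0
    def __init__(self, x, dx=0.0):
        self.x, self.dx = x, dx
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.x + o.x, self.dx + o.dx)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.x * o.x, self.x * o.dx + self.dx * o.x)

def dlog(a):  return Dual(math.log(a.x), a.dx / a.x)
def dsqrt(a): return Dual(math.sqrt(a.x), a.dx / (2 * math.sqrt(a.x)))

def f(x, y):
    # f(x, y) = y log(x) + sqrt(y log(x)), the example (53)
    b = y * dlog(x)
    return b + dsqrt(b)

# Seeding dx = 1 on the first argument computes df/dx; idem for y
print(f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dx)   # df/dx at (2, 3)
print(f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dx)   # df/dy at (2, 3)
```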
The reverse mode instead computes the differentials ∂x_t/∂x_k, i.e. the derivative of the output node with respect to all the inner nodes.
The method initializes the derivative of the final node,
∂x_t/∂x_t = Id_{n_t×n_t},
and then iteratively makes use, from the last node to the first, of the following recursion formula
∀ k = t − 1, t − 2, . . . , 1,  ∂x_t/∂x_k = Σ_{m∈son(k)} ∂x_t/∂x_m × ∂x_m/∂x_k = Σ_{m∈son(k)} ∂x_t/∂x_m × ∂f_m(x_1, . . . , x_m)/∂x_k.
The notation “son(k)” denotes the nodes m > k of the graph that are connected to k, see Figure 20, right.
Back-propagation. In the special case where xt ∈ R, then ∂xt/∂xk = [∇xk f(x)]⊤ ∈ R^{1×nk}, and one can write the recursion on the gradient vectors as follows
∀ k = t − 1, t − 2, . . . , 1,   ∇xk f(x) = Σ_{m ∈ son(k)} (∂fm(x1, . . . , xm)/∂xk)⊤ (∇xm f(x)),
where (∂fm(x1, . . . , xm)/∂xk)⊤ ∈ R^{nk×nm} is the adjoint of the Jacobian of fm. This form of recursion using adjoints is often referred to as “back-propagation”, and is the most frequent setting in applications to ML.
In general, when nt = 1, the backward mode is the optimal way to compute the gradient of a function. Its drawback is that it necessitates the pre-computation of all the intermediate variables (xk)_{k=p}^{t}, which can be prohibitive in terms of memory usage when t is large. There exist check-pointing methods to alleviate this issue, but they are outside the scope of this course.
Figure 22: Complexity of forward (left) and backward (right) modes for composition of functions.
Simple example. We consider once again the function f of (53); the iterations of the reverse mode read
∂f/∂f = 1
∂f/∂c = (∂f/∂f)(∂f/∂c) = (∂f/∂f)·1                                       {c ↦ f = b + c}
∂f/∂b = (∂f/∂c)(∂c/∂b) + (∂f/∂f)(∂f/∂b) = (∂f/∂c)(1/(2√b)) + (∂f/∂f)·1   {b ↦ c = √b, b ↦ f = b + c}
∂f/∂a = (∂f/∂b)(∂b/∂a) = (∂f/∂b)·y                                       {a ↦ b = ya}
∂f/∂y = (∂f/∂b)(∂b/∂y) = (∂f/∂b)·a                                       {y ↦ b = ya}
∂f/∂x = (∂f/∂a)(∂a/∂x) = (∂f/∂a)·(1/x)                                   {x ↦ a = log(x)}
The advantage of the reverse mode is that a single traversal of the computational graph allows one to compute both derivatives with respect to x and y, while the forward mode necessitates two passes.
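For completeness, here is a minimal Python sketch of this reverse sweep (illustrative, not from the notes): the forward pass stores the intermediate values, and a single backward pass accumulates the derivative of f with respect to every node, yielding both input derivatives at once.

import math

def reverse_mode(x, y):
    # Forward sweep: evaluate and store all the nodes of the graph.
    a = math.log(x)
    b = y * a
    c = math.sqrt(b)
    f = b + c
    # Backward sweep: propagate df/d(node) from the output to the inputs.
    df_df = 1.0
    df_dc = df_df * 1.0                                # f = b + c
    df_db = df_dc / (2 * math.sqrt(b)) + df_df * 1.0   # c = sqrt(b) and f = b + c
    df_da = df_db * y                                  # b = y a
    df_dy = df_db * a                                  # b = y a
    df_dx = df_da / x                                  # a = log(x)
    return f, df_dx, df_dy

f_val, dfdx, dfdy = reverse_mode(2.0, 3.0)   # both derivatives in one traversal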
An important special case is when the computational graph is a simple chain, i.e. f is a feed-forward composition
f = ft ◦ ft−1 ◦ . . . ◦ f2 ◦ f1.    (54)
The forward evaluation computes, starting from the input x0 = x,
∀ k = 1, . . . , t,   xk = fk(xk−1),
and, denoting Ak = ∂fk(xk−1) the Jacobian of fk at xk−1, the chain rule gives
∂f(x) = At × At−1 × . . . × A2 × A1.
The forward (resp. backward) mode corresponds to the computation of this product of Jacobians from right to left (resp. left to right).
Figure 23: Computational graph for a feedforward architecture.
We note that the computation of the product A × B of A ∈ Rn×p with B ∈ Rp×q necessitates npq
operations. As shown on Figure 22, the complexity of the forward and backward modes are
Σ_{k=1}^{t−1} n0 nk nk+1   and   Σ_{k=0}^{t−2} nt nk nk+1.
So if nt ≪ n0 (which is the typical case in ML scenarios, where nt = 1), then the backward mode is cheaper.
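As a quick illustration (not from the notes; the dimensions are arbitrary), the following Python snippet evaluates the two sums above for a chain of Jacobians, showing that with nt = 1 the backward (left-to-right) ordering is far cheaper.

def forward_cost(dims):
    # dims = [n_0, n_1, ..., n_t]; right-to-left product accumulates an n_k x n_0 matrix.
    return sum(dims[0] * dims[k] * dims[k + 1] for k in range(1, len(dims) - 1))

def backward_cost(dims):
    # Left-to-right product accumulates an n_t x n_{k+1} matrix.
    return sum(dims[-1] * dims[k] * dims[k + 1] for k in range(0, len(dims) - 2))

dims = [1000, 500, 500, 500, 1]                  # n_0 = 1000 inputs, n_t = 1 output
print(forward_cost(dims), backward_cost(dims))   # 500500000 vs 1000000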
A feed-forward architecture computes, from an input x0 and parameters θ = (θk)_{k=0}^{t−1},
∀ k = 1, . . . , t,   xk = fk(xk−1, θk−1),    (55)
and the function to be differentiated is f(θ) = L(xt), where L : R^{nt} → R is some loss function (for instance a least square or logistic prediction risk). Figure 23, top, displays the associated computational graph.
One can use the reverse mode of automatic differentiation to compute the gradient of f by computing successively the gradients with respect to all the (xk, θk). One initializes
∇xt f = ∇L(xt)
and then, for k = t, t − 1, . . . , 1, iterates
zk−1 = [∂x fk(xk−1, θk−1)]⊤ zk   and   ∇θk−1 f = [∂θ fk(xk−1, θk−1)]⊤ (∇xk f),    (57)
where we denoted zk =def ∇xk f(θ) the gradient with respect to xk.
Multilayer perceptron. For instance, a feedforward deep network (fully connected for simplicity) corresponds to using
∀ xk−1 ∈ R^{nk−1},   fk(xk−1, θk−1) = ρ(θk−1 xk−1),    (58)
where θk−1 ∈ R^{nk×nk−1} are the neuron’s weights and ρ a fixed pointwise non-linearity, see Figure 24. One has, for a vector zk ∈ R^{nk} (typically equal to ∇xk f),
[∂x fk(xk−1, θk−1)]⊤ zk = θk−1⊤ (ρ′(θk−1 xk−1) ⊙ zk)   and   [∂θ fk(xk−1, θk−1)]⊤ zk = (ρ′(θk−1 xk−1) ⊙ zk) xk−1⊤,
where ⊙ denotes the entry-wise product and ρ′ is applied entry-wise.
Figure 24: Multi-layer perceptron parameterization.
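To make (57)–(58) concrete, here is an illustrative Python sketch (not from the notes; the choice ρ = tanh and the least-square loss L(xt) = ½∥xt − y∥² are arbitrary) of the forward pass and of the backward recursion.

import numpy as np

def rho(u):  return np.tanh(u)            # pointwise non-linearity (arbitrary choice)
def drho(u): return 1.0 - np.tanh(u)**2   # its entry-wise derivative

def mlp_loss_and_grads(theta, x0, y):
    # Forward pass (58): x_k = rho(theta_{k-1} x_{k-1}), storing all iterates.
    xs = [x0]
    for W in theta:
        xs.append(rho(W @ xs[-1]))
    loss = 0.5 * np.sum((xs[-1] - y) ** 2)   # L(x_t) = 1/2 ||x_t - y||^2
    # Backward recursion (57), using the adjoint formulas for (58).
    z = xs[-1] - y                           # z_t = gradient of L at x_t
    grads = [None] * len(theta)
    for k in range(len(theta), 0, -1):
        W, x_prev = theta[k - 1], xs[k - 1]
        s = drho(W @ x_prev) * z             # rho'(theta_{k-1} x_{k-1}) ⊙ z_k
        grads[k - 1] = np.outer(s, x_prev)   # gradient with respect to theta_{k-1}
        z = W.T @ s                          # z_{k-1}, gradient with respect to x_{k-1}
    return loss, grads

rng = np.random.default_rng(0)
theta = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
loss, grads = mlp_loss_and_grads(theta, rng.standard_normal(3), np.zeros(2))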
Link with adjoint state method. One can interpret (55) as a time discretization of a continuous ODE. One imposes that the dimension nk = n is fixed, and denotes by x(t) ∈ Rn a continuous time evolution, so that xk ≈ x(kτ) in the limit where k → +∞ and kτ → t. Imposing then the structure
fk (xk−1 , θk−1 ) = xk−1 + τ u(xk−1 , θk−1 , kτ ) (59)
where u(x, θ, t) ∈ Rn is a parameterized vector field, one obtains, as τ → 0, the non-linear ODE
ẋ(t) = u(x(t), θ(t), t) (60)
with x(t = 0) = x0 .
Denoting z(t) = ∇x(t) f(θ) the “adjoint” vector field, the discrete equations (62) become, in the limit τ → 0, the so-called adjoint equations, which form a linear ODE
ż(t) = −[∂x u(x(t), θ(t), t)]⊤ z(t)   and   ∇θ(t) f(θ) = [∂θ u(x(t), θ(t), t)]⊤ z(t).
Note that the correct normalization is (1/τ) ∇θk−1 f → ∇θ(t) f(θ).
Figure 26: Recurrent residual perceptron parameterization.
Similarly, writing h(x, θ) = x + τ u(x, θ), letting (k, kτ ) → (+∞, t), one obtains the forward non-linear ODE
with a time-stationary vector field
ẋ(t) = u(x(t), θ)
and the following linear backward adjoint equation, for f(θ) = L(x(T), θ),
ż(t) = −[∂x u(x(t), θ)]⊤ z(t)   and   ∇θ f(θ) = ∇θ L(x(T), θ) + ∫_0^T [∂θ u(x(t), θ)]⊤ z(t) dt.    (63)
Mitigating memory requirement. The main issue when applying this backpropagation method to compute ∇f(θ) is that it requires a large memory to store all the iterates (xk)_{k=0}^{t}. A workaround is to use checkpointing, which stores some of these intermediate results and partially re-runs the forward algorithm to reconstruct the missing values during the backward pass. Clever hierarchical methods perform this recursively in order to only require log(t) stored values and a log(t) increase in the numerical complexity.
In some situations, it is possible to avoid the storage of the forward results altogether, if one assumes that the algorithm can be run backward. This means that there exist functions gk such that
xk = gk(xk+1, . . . , xt).
In practice, this function typically also depends on a few extra variables, in particular on the input values (x0, . . . , xs).
An example of this situation is when one can split the (continuous time) variable as x(t) = (r(t), s(t))
and the vector field u in the continuous ODE (60) has a symplectic structure of the form u((r, s), θ, t) =
(F (s, θ, t), G(r, θ, t)). One can then use a leapfrog integration scheme, which defines
rk+1 = rk + τ F (sk , θk , τ k) and sk+1 = sk + τ G(rk+1 , θk+1/2 , τ (k + 1/2)).
One can reverse these equations exactly as
sk = sk+1 − τ G(rk+1, θk+1/2, τ(k + 1/2))   and   rk = rk+1 − τ F(sk, θk, τ k).
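A small Python check of this exact reversibility (illustrative; the fields F and G are arbitrary choices and the θ and time arguments are dropped for brevity): reversing the two updates in the opposite order recovers the previous state without storing it.

import numpy as np

def F(s): return -s          # arbitrary illustrative field
def G(r): return r ** 2      # arbitrary illustrative field

def leapfrog_forward(r, s, tau):
    r = r + tau * F(s)
    s = s + tau * G(r)
    return r, s

def leapfrog_backward(r, s, tau):
    # Undo the two updates in the reverse order: exact, no stored state needed.
    s = s - tau * G(r)
    r = r - tau * F(s)
    return r, s

r0, s0 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
r1, s1 = leapfrog_forward(r0, s0, tau=0.1)
rb, sb = leapfrog_backward(r1, s1, tau=0.1)
assert np.allclose(rb, r0) and np.allclose(sb, s0)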
Fixed point maps. In some applications (some of which are detailed below), the iterates xk converge to some x⋆(θ), which is thus a fixed point of the iterations.
Instead of applying the back-propagation to compute the gradient of f (θ) = L(xt , θ), one can thus apply the
implicit function theorem to compute the gradient of f ⋆ (θ) = L(x⋆ (θ), θ). Indeed, one has
∇f ⋆ (θ) = [∂x⋆ (θ)]⊤ (∇x L(x⋆ (θ), θ)) + ∇θ L(x⋆ (θ), θ). (64)
Using the implicit function theorem, applied to the equation h(x⋆(θ), θ) = 0 characterizing the fixed point, one can compute the Jacobian as
∂x⋆(θ) = − [∂h/∂x(x⋆(θ), θ)]^{−1} ∂h/∂θ(x⋆(θ), θ).
In practice, one replaces x⋆(θ) by xt in these formulas, which produces an approximation of ∇f(θ). The disadvantage of this method is that it requires the resolution of a linear system, but its advantage is that it bypasses the memory storage issue of the backpropagation algorithm.
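A scalar Python sketch of this strategy (illustrative; the fixed-point map, the loss and all names are invented for the example): iterate to convergence, obtain ∂x⋆(θ) from the implicit function theorem (here a 1×1 “linear system”), and plug it into (64).

import numpy as np

# Illustrative optimality condition h(x, theta) = x - 0.5*cos(x) - theta = 0,
# whose solution is the limit of the contraction x <- 0.5*cos(x) + theta.
def solve_fixed_point(theta, n_iter=100):
    x = 0.0
    for _ in range(n_iter):
        x = 0.5 * np.cos(x) + theta
    return x

def L(x, theta):
    return x ** 2          # illustrative loss evaluated at the fixed point

def grad_via_implicit(theta):
    x = solve_fixed_point(theta)        # in practice x_t is used in place of x*
    dh_dx  = 1.0 + 0.5 * np.sin(x)      # dh/dx at (x*, theta)
    dh_dth = -1.0                       # dh/dtheta
    dx_dth = -dh_dth / dh_dx            # implicit function theorem (1x1 solve)
    return dx_dth * 2.0 * x             # formula (64); here grad_theta L = 0

# Finite-difference check:
t0, eps = 0.3, 1e-6
fd = (L(solve_fixed_point(t0 + eps), t0 + eps)
      - L(solve_fixed_point(t0 - eps), t0 - eps)) / (2 * eps)
# grad_via_implicit(t0) ≈ fd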
Argmin layers. One can define a mapping from some parameter θ to a point x(θ) by solving a parametric optimization problem
x(θ) = argmin_x E(x, θ).
The simplest approach to solve this problem is to use a gradient descent scheme, initialized with x0 = 0 and iterating
xk+1 = xk − τ ∇x E(xk, θ).
This has the form (59) when using the vector field u(x, θ) = −∇x E(x, θ).
Using formula (64) in this case, where h = ∇x E, one obtains
∇f⋆(θ) = −[∂²E/∂x∂θ(x⋆(θ), θ)]⊤ [∂²E/∂x²(x⋆(θ), θ)]^{−1} (∇x L(x⋆(θ), θ)) + ∇θ L(x⋆(θ), θ).
In the special case where the function f(θ) is the minimized value itself, i.e. f(θ) = E(x⋆(θ), θ) (that is, L = E), one can still apply formula (64), which becomes much simpler since in this case ∇x L(x⋆(θ), θ) = 0, so that
∇f⋆(θ) = ∇θ L(x⋆(θ), θ).    (66)
This result is often called Danskin’s theorem or the envelope theorem.
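A short numerical illustration of (66) (not from the notes; the quadratic energy E and the names are arbitrary): the gradient of θ ↦ E(x⋆(θ), θ) is simply the partial gradient of E in θ, evaluated at the minimizer, with no differentiation through x⋆.

import numpy as np

def E(x, theta):         return 0.5 * np.dot(x, x) - np.dot(theta, x)   # illustrative energy
def grad_x_E(x, theta):  return x - theta
def grad_th_E(x, theta): return -x

def x_star(theta, n_iter=200, tau=0.1):
    x = np.zeros_like(theta)               # gradient descent from x_0 = 0
    for _ in range(n_iter):
        x = x - tau * grad_x_E(x, theta)
    return x

theta = np.array([0.3, -1.2, 2.0])
g = grad_th_E(x_star(theta), theta)        # formula (66)
# Finite-difference check along the first coordinate:
eps, e0 = 1e-6, np.array([1.0, 0.0, 0.0])
fd = (E(x_star(theta + eps * e0), theta + eps * e0)
      - E(x_star(theta - eps * e0), theta - eps * e0)) / (2 * eps)
# g[0] ≈ fd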
Sinkhorn’s algorithm. Sinkhorn’s algorithm approximates the optimal transport distance between two histograms a ∈ Rn and b ∈ Rm using the following recursion on multipliers, initialized as x0 =def (u0, v0) = (1n, 1m),
uk+1 = a/(K vk)   and   vk+1 = b/(K⊤ uk),
where ·/· is the pointwise division and K ∈ R_+^{n×m} is a kernel. Denoting θ =def (a, b) ∈ R^{n+m} and xk =def (uk, vk) ∈ R^{n+m}, the OT distance is then approximately equal to
f(θ) = E(xt, θ) =def ⟨a, log(ut)⟩ + ⟨b, log(vt)⟩ − ε⟨ut, K vt⟩.
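A small numpy sketch of these iterations (illustrative; the cost matrix, the regularization ε and the histograms are arbitrary choices), evaluating f(θ) with the expression above.

import numpy as np

def sinkhorn(a, b, K, n_iter=500):
    # Recursion u <- a/(K v), v <- b/(K^T u), initialized with vectors of ones.
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        u, v = a / (K @ v), b / (K.T @ u)
    return u, v

eps = 0.1
C = np.abs(np.linspace(0, 1, 5)[:, None] - np.linspace(0, 1, 4)[None, :])  # cost matrix
K = np.exp(-C / eps)                                                       # Gibbs kernel
a, b = np.ones(5) / 5, np.ones(4) / 4
u, v = sinkhorn(a, b, K)
f_val = a @ np.log(u) + b @ np.log(v) - eps * u @ (K @ v)   # approximate OT value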
One has the following differential operators
[∂x h(x, θ)]⊤ = −K⊤ diag(θ/(Kx)²),   [∂θ h(x, θ)]⊤ = diag(1/(Kx)).
As for the argmin layer, at convergence xk → x⋆(θ) one finds a minimizer of E, so that ∇x L(x⋆(θ), θ) = 0, and thus the gradient of f⋆(θ) = E(x⋆(θ), θ) can be computed using (66), i.e.