OptimML
Abstract
This document presents first order optimization methods and their applications to machine learning.
This is not a course on machine learning (in particular it does not cover modeling and statistical considerations); it is focused on the use and analysis of cheap methods that can scale to large datasets and models with many parameters. These methods are variations around the notion of “gradient descent”,
so that the computation of gradients plays a major role. This course covers basic theoretical properties
of optimization problems (in particular convex analysis and first order differential calculus), the gradient
descent method, the stochastic gradient method, automatic differentiation, shallow and deep networks.
Contents
1 Motivation in Machine Learning
1.1 Unconstrained optimization
1.2 Regression
1.3 Classification

5 Convergence Analysis
5.1 Quadratic Case
5.2 General Case
5.3 Acceleration

7 Regularization
7.1 Penalized Least Squares
7.2 Ridge Regression
7.3 Lasso
7.4 Iterative Soft Thresholding

8 Stochastic Optimization
8.1 Minimizing Sums and Expectation
8.2 Batch Gradient Descent (BGD)
8.3 Stochastic Gradient Descent (SGD)
8.4 Stochastic Gradient Descent with Averaging (SGA)
8.5 Stochastic Averaged Gradient Descent (SAG)

9 Automatic Differentiation
9.1 Finite Differences and Symbolic Calculus
9.2 Computational Graphs
9.3 Forward Mode of Automatic Differentiation
9.4 Reverse Mode of Automatic Differentiation
9.5 Feed-forward Compositions
9.6 Feed-forward Architecture
9.7 Recurrent Architectures
and try to devise “cheap” algorithms with a low computational cost per iteration to approximate a minimizer when it exists. The algorithms considered are first order, i.e. they make use of gradient information. In the following, we denote
argmin_x f(x) def= { x ∈ R^p ; f(x) = inf f },
to indicate the set of points (it is not necessarily a singleton since the minimizer might be non-unique) that achieve the minimum of the function f. One might have argmin f = ∅ (this situation is discussed below), but in case a minimizer exists, we denote the optimization problem as
min_{x ∈ R^p} f(x).   (1)
Figure 1: Left: linear regression, middle: linear classifier, right: loss function for classification.
In a typical learning scenario, f(x) is the empirical risk for regression or classification, and p is the number of parameters. For instance, in the simplest case of linear models, we are given data (a_i, y_i)_{i=1}^n where a_i ∈ R^p are the features. In the following, we denote by A ∈ R^{n×p} the matrix whose rows are the a_i.
1.2 Regression
For regression, y_i ∈ R, in which case
f(x) = (1/2) Σ_{i=1}^n (y_i − ⟨x, a_i⟩)² = (1/2)||Ax − y||²,   (3)
is the least squares quadratic risk function (see Fig. 1). Here ⟨u, v⟩ = Σ_{i=1}^p u_i v_i is the canonical inner product in R^p and ||·||² = ⟨·, ·⟩.
1.3 Classification
For classification, y_i ∈ {−1, 1}, in which case
f(x) = Σ_{i=1}^n ℓ(−y_i⟨x, a_i⟩) = L(−diag(y)Ax)   (4)
where ℓ is a smooth approximation of the 0-1 loss 1_{R^+}. For instance ℓ(u) = log(1 + exp(u)), and diag(y) ∈ R^{n×n} is the diagonal matrix with y_i along the diagonal (see Fig. 1, right). Here the separable loss function L : R^n → R is, for z ∈ R^n, L(z) = Σ_i ℓ(z_i).
Figure 3: Coercivity condition for least squares.
In order to show existence of a minimizer, and that the set of minimizers is bounded (otherwise one can have problems with optimization algorithms that could escape to infinity), one needs to show that one can replace the whole space R^p by a compact subset Ω ⊂ R^p (i.e. Ω is bounded and closed) and that f is continuous on Ω (one can replace this by a weaker condition, that f is lower semi-continuous, but we ignore this here). A way to show that one can consider only a bounded set is to show that f(x) → +∞ when ||x|| → +∞. Such a function is called coercive. In this case, one can choose any x_0 ∈ R^p and consider its associated sub-level set
Ω = {x ∈ R^p ; f(x) ⩽ f(x_0)}
which is bounded because of coercivity, and closed because f is continuous. One can actually show that for convex functions, having a bounded set of minimizers is equivalent to the function being coercive (this is not the case for non-convex functions, for instance f(x) = min(1, x²) has a single minimum but is not coercive).
Example 1 (Least squares). For instance, for the quadratic loss function f(x) = (1/2)||Ax − y||², coercivity holds if and only if ker(A) = {0} (this corresponds to the overdetermined setting). Indeed, if ker(A) ≠ {0} and x⋆ is a solution, then x⋆ + u is also a solution for any u ∈ ker(A), so that the set of minimizers is unbounded. On the contrary, if ker(A) = {0}, we will show later that the minimizer is unique, see Fig. 3. If ℓ is strictly convex, the same conclusion holds in the case of classification.
2.2 Convexity
Convex functions define the main class of functions which are somehow “simple” to optimize, in the sense that all minimizers are global minimizers, and that there are often efficient methods to find these minimizers (at least for smooth convex functions). A convex function is such that for any pair of points (x, y) ∈ (R^p)²,
∀ t ∈ [0, 1], f((1 − t)x + ty) ⩽ (1 − t)f(x) + tf(y)   (5)
which means that the function is below its secants (and actually also above its tangents when these are well defined), see Fig. 4. If x⋆ is a local minimizer of a convex f, then x⋆ is a global minimizer, i.e. x⋆ ∈ argmin f.
Convex functions are very convenient because they are stable under many transformations. In particular, if f, g are convex and a, b are positive, af + bg is convex (the set of convex functions is itself an infinite dimensional convex cone!) and so is max(f, g). If g : R^q → R is convex and B ∈ R^{q×p}, b ∈ R^q, then f(x) = g(Bx + b) is convex. This shows immediately that the square loss appearing in (3) is convex, since ||·||²/2 is convex (as a sum of squares). Also, similarly, if ℓ and hence L is convex, then the classification loss function (4) is itself convex.
Strict convexity. When f is convex, one can strengthen the condition (5) and impose that the inequality is strict for t ∈ ]0, 1[ (see Fig. 4, right), i.e.
∀ t ∈ ]0, 1[, f((1 − t)x + ty) < (1 − t)f(x) + tf(y).   (6)
In this case, if a minimizer x⋆ exists, then it is unique. Indeed, if x⋆_1 ≠ x⋆_2 were two different minimizers, one would have by strict convexity f((x⋆_1 + x⋆_2)/2) < (f(x⋆_1) + f(x⋆_2))/2 = f(x⋆_1), which is impossible.
Figure 4: Convex vs. non-convex functions ; Strictly convex vs. non strictly convex functions.
Figure 5: Comparison of convex functions f : Rp → R (for p = 1) and convex sets C ⊂ Rp (for p = 2).
Example 2 (Least squares). For the quadratic loss function f(x) = (1/2)||Ax − y||², strict convexity is equivalent to ker(A) = {0}. Indeed, we see later that its second derivative is ∂²f(x) = A⊤A and that strict convexity is implied by the eigenvalues of A⊤A being strictly positive. Since the eigenvalues of A⊤A are non-negative, this is equivalent to ker(A⊤A) = {0} (no vanishing eigenvalue), and A⊤Az = 0 implies ⟨A⊤Az, z⟩ = ||Az||² = 0, i.e. z ∈ ker(A).
the gradient vector, so that ∇f : R^p → R^p is a vector field. Here the partial derivatives (when they exist) are defined as
∂f(x)/∂x_k def= lim_{η→0} (f(x + ηδ_k) − f(x))/η
where δ_k = (0, . . . , 0, 1, 0, . . . , 0)⊤ ∈ R^p is the k-th canonical basis vector.
Beware that ∇f(x) can exist without f being differentiable. Differentiability of f at x reads
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + R(ε).   (7)
Here R(ε) = o(||ε||) denotes a quantity which decays faster than ε toward 0, i.e. R(ε)/||ε|| → 0 as ε → 0. Existence of partial derivatives corresponds to f being differentiable along the axes, while differentiability should hold for any converging sequence ε → 0 (i.e. not along a fixed direction). A counter-example in 2-D is f(x) = 2x_1x_2(x_1 + x_2)/(x_1² + x_2²) with f(0) = 0, which is affine (in fact linear) along each radial line, but with a different slope along each line.
Also, ∇f(x) is the only vector such that the relation (7) holds. This means that a possible strategy to both prove that f is differentiable and to obtain a formula for ∇f(x) is to show a relation of the form
f(x + ε) = f(x) + ⟨g, ε⟩ + o(||ε||),
in which case one necessarily has ∇f(x) = g.
Figure 6: Function with local maxima/minima (left), saddle point (middle) and global minimum (right).
f (x⋆ ) ⩽ f (x⋆ + εu) = f (x⋆ ) + ε⟨∇f (x⋆ ), u⟩ + o(ε) =⇒ ⟨∇f (x⋆ ), u⟩ ⩾ o(1) =⇒ ⟨∇f (x⋆ ), u⟩ ⩾ 0.
So applying this for u and −u in the previous equation shows that ⟨∇f (x⋆ ), u⟩ = 0 for all u, and hence
∇f (x⋆ ) = 0.
Note that the converse is not true in general, since one might have ∇f(x) = 0 but x is not a local minimum. For instance x = 0 for f(x) = −x² (here x is a maximizer) or f(x) = x³ (here x is neither a maximizer nor a minimizer, it is a saddle point), see Fig. 6. Note however that in practice, if ∇f(x⋆) = 0 but x⋆ is not a local minimum, then x⋆ tends to be an unstable equilibrium. Thus most often a gradient-based algorithm will converge to points with ∇f(x⋆) = 0 that are local minimizers. The following proposition shows that a much stronger result holds if f is convex.
Proposition 3. If f is convex and x⋆ a local minimum, then x⋆ is also a global minimum. If f is differen-
tiable and convex,
x⋆ ∈ argmin f (x) ⇐⇒ ∇f (x⋆ ) = 0.
x
Proof. For any x, there exists 0 < t < 1 small enough such that tx + (1 − t)x⋆ is close enough to x⋆, and so, since x⋆ is a local minimizer, f(x⋆) ⩽ f(tx + (1 − t)x⋆) ⩽ tf(x) + (1 − t)f(x⋆), which gives f(x⋆) ⩽ f(x).
Thus in this case, optimizing a function is the same as solving the equation ∇f(x) = 0 (actually p equations in p unknowns). In most cases it is impossible to solve this equation, but it often provides interesting information about solutions x⋆.
Here, we have used the fact that ||Aε||² = o(||ε||) and used the transpose matrix A⊤. This matrix is obtained by exchanging the rows and the columns, i.e. (A⊤)_{j,i} = A_{i,j} for i = 1, . . . , n and j = 1, . . . , p, but the way it should be remembered and used is that it obeys the following swapping rule for the inner product,
⟨Au, v⟩_{R^n} = ⟨u, A⊤v⟩_{R^p}.
Computing gradients of functions involving linear operators will necessarily require such a transposition step.
This computation shows that
∇f (x) = A⊤ (Ax − y). (8)
This implies that the solutions x⋆ minimizing f(x) satisfy the linear system (A⊤A)x⋆ = A⊤y. If A⊤A ∈ R^{p×p} is invertible, then f has a single minimizer, namely
x⋆ = (A⊤A)^{−1}A⊤y.   (9)
This shows that in this case, x⋆ depends linearly on the data y, and the corresponding linear operator (A⊤A)^{−1}A⊤ is often called the Moore-Penrose pseudo-inverse of A (which is not invertible in general, since typically p ≠ n). The condition that A⊤A is invertible is equivalent to ker(A) = {0}, since ker(A⊤A) = ker(A).
In particular, if n < p (under-determined regime: there are too many parameters or too few data) this can never hold. If n ⩾ p and the features a_i are “random” then ker(A) = {0} with probability one. In this overdetermined situation n ⩾ p, ker(A) ≠ {0} only holds if the features {a_i}_{i=1}^n span a linear space Im(A⊤) of dimension strictly smaller than the ambient dimension p.
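As an illustration (not part of the original text), here is a minimal NumPy sketch evaluating the least squares risk (3), its gradient (8), and the minimizer (9) when ker(A) = {0}; the synthetic data and the finite-difference check are only sanity tests.

```python
import numpy as np

def risk(A, y, x):
    # f(x) = 1/2 ||Ax - y||^2, see (3)
    r = A @ x - y
    return 0.5 * r @ r

def grad(A, y, x):
    # gradient (8): A^T (Ax - y)
    return A.T @ (A @ x - y)

rng = np.random.default_rng(0)
n, p = 50, 5                      # overdetermined: n >= p, so ker(A) = {0} almost surely
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x = rng.standard_normal(p)

# Minimizer (9): x* = (A^T A)^{-1} A^T y
x_star = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(grad(A, y, x_star), 0, atol=1e-8)

# Finite-difference sanity check of the gradient at a random point
eps, k = 1e-6, 2
delta = np.zeros(p); delta[k] = eps
fd = (risk(A, y, x + delta) - risk(A, y, x)) / eps
print(fd, grad(A, y, x)[k])       # the two numbers should nearly agree
```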
In particular, C_{k,k}/n is the variance along the axis k. More generally, for any unit vector u ∈ R^p, ⟨Cu, u⟩/n ⩾ 0 is the variance along the axis u.
For instance, in dimension p = 2,
C/n = (1/n) ( Σ_{i=1}^n a_{i,1}²   Σ_{i=1}^n a_{i,1}a_{i,2} ; Σ_{i=1}^n a_{i,1}a_{i,2}   Σ_{i=1}^n a_{i,2}² ).
Here (U⊤x)_k = ⟨x, u_k⟩ is the coordinate k of x in the basis U. Since ⟨Cx, x⟩ = ||Ax||², this shows that all the eigenvalues λ_k of C are non-negative, λ_k ⩾ 0.
Figure 7: Left: point clouds (ai )i with associated PCA directions, right: quadratic part of f (x).
If one assumes that the eigenvalues are ordered λ1 ⩾ λ2 ⩾ . . . ⩾ λp , then projecting the points ai on the
first m eigenvectors can be shown to be in some sense the best linear dimensionality reduction possible (see
next paragraph), and it is called Principal Component Analysis (PCA). It is useful to perform compression
or dimensionality reduction, but in practice, it is mostly used for data visualization in 2-D (m = 2) and 3-D
(m = 3).
The matrix C/n encodes the covariance, so one can approximate the point cloud by an ellipsoid whose main axes are the (u_k)_k and whose width along each axis is ∝ √λ_k (the standard deviations). If the data are approximately drawn from a Gaussian distribution, whose density is proportional to exp(−(1/2)⟨C^{−1}a, a⟩), then the fit is good. This should be contrasted with the shape of the quadratic part (1/2)⟨Cx, x⟩ of f(x), since the ellipsoid {x ; (1/n)⟨Cx, x⟩ ⩽ 1} has the same main axes, but the widths are the inverses 1/√λ_k. Figure 7 shows this in dimension p = 2.
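The following sketch (added for illustration; the correlated point cloud and the number m of kept components are arbitrary choices) computes the PCA directions as eigenvectors of the empirical covariance C/n and projects the points on the first m of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 200, 4, 2
A = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated point cloud, rows a_i

A = A - A.mean(axis=0)            # center the cloud (PCA is about the covariance)
C = A.T @ A                       # C = A^T A, so C/n is the empirical covariance
lam, U = np.linalg.eigh(C / n)    # eigenvalues in increasing order, columns of U orthonormal
order = np.argsort(lam)[::-1]     # reorder so that lambda_1 >= lambda_2 >= ...
lam, U = lam[order], U[:, order]

proj = A @ U[:, :m]               # coordinates of the points on the first m principal axes
print("standard deviations along the axes:", np.sqrt(lam))
print("projected cloud shape:", proj.shape)
```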
3.5 Classification
We can do a similar computation for the gradient of the classification loss (4). Assuming that L is differentiable, and using the Taylor expansion (7) at the point −diag(y)Ax, one has
f(x + ε) = L(−diag(y)Ax − diag(y)Aε) = f(x) + ⟨∇L(−diag(y)Ax), −diag(y)Aε⟩ + o(||ε||) = f(x) + ⟨ε, −A⊤ diag(y)∇L(−diag(y)Ax)⟩ + o(||ε||),
where we have used the fact that (AB)⊤ = B⊤A⊤ and that diag(y)⊤ = diag(y). This shows that
∇f(x) = −A⊤ diag(y) ∇L(−diag(y)Ax).
Since L(z) = Σ_i ℓ(z_i), one has ∇L(z) = (ℓ′(z_i))_{i=1}^n. For instance, for the logistic classification method, ℓ(u) = log(1 + exp(u)) so that ℓ′(u) = e^u/(1 + e^u) ∈ [0, 1] (which can be interpreted as the probability of predicting +1).
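A possible NumPy translation of this formula (illustrative only; the data are synthetic) evaluates the logistic classification loss (4) and its gradient ∇f(x) = −A⊤ diag(y)∇L(−diag(y)Ax):

```python
import numpy as np

def loss(A, y, x):
    # f(x) = sum_i log(1 + exp(-y_i <x, a_i>)), see (4) with l(u) = log(1 + e^u)
    u = -y * (A @ x)
    return np.sum(np.logaddexp(0.0, u))     # log(1 + e^u), numerically stable

def grad(A, y, x):
    # grad f(x) = -A^T diag(y) (l'(u_i))_i with l'(u) = e^u / (1 + e^u)
    u = -y * (A @ x)
    s = 1.0 / (1.0 + np.exp(-u))             # l'(u_i), interpreted as P(predict +1)
    return -A.T @ (y * s)

rng = np.random.default_rng(2)
n, p = 100, 3
A = rng.standard_normal((n, p))
y = np.sign(A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n))
x = np.zeros(p)
print(loss(A, y, x), grad(A, y, x))
```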
More generally, for f(x) = g(Bx) with B a linear map, one has
f(x + ε) = g(Bx + Bε) = g(Bx) + ⟨∇g(Bx), Bε⟩ + o(||Bε||) = f(x) + ⟨ε, B⊤∇g(Bx)⟩ + o(||ε||),
so that ∇f(x) = B⊤∇g(Bx), i.e. ∇(g ∘ B) = B⊤ ∘ (∇g) ∘ B,
where “◦” denotes the composition of functions.
To generalize this to compositions of possibly non-linear functions, one needs to use the notion of differential. For a function F : R^p → R^q, its differential at x is a linear operator ∂F(x) : R^p → R^q, i.e. it can be represented as a matrix (still denoted ∂F(x)) ∂F(x) ∈ R^{q×p}. The entries of this matrix are the partial derivatives: denoting F(x) = (F_1(x), . . . , F_q(x)),
[∂F(x)]_{i,j} = ∂F_i(x)/∂x_j,
and the differential is characterized by the expansion
F(x + ε) = F(x) + [∂F(x)](ε) + o(||ε||),   (12)
where [∂F(x)](ε) is the matrix-vector multiplication. As for the definition of the gradient, this matrix is the only one that satisfies this expansion, so it can be used as a way to compute this differential in practice.
For the special case q = 1, i.e. if f : R^p → R, then the differential ∂f(x) ∈ R^{1×p} and the gradient ∇f(x) ∈ R^{p×1} are linked by equating the Taylor expansions (12) and (7), which gives ∂f(x) = (∇f(x))⊤. Differentials compose according to the chain rule: for G : R^q → R^r and H : R^p → R^q,
∂(G ∘ H)(x) = [∂G(H(x))] × [∂H(x)],
where “×” is the matrix product. For instance, if H : R^p → R^q and G = g : R^q → R, then f = g ∘ H : R^p → R and one can compute its gradient as follows:
∇f(x) = (∂f(x))⊤ = ([∂g(H(x))] × [∂H(x)])⊤ = [∂H(x)]⊤ × [∂g(H(x))]⊤ = [∂H(x)]⊤ × ∇g(H(x)).
f (x − τ ∇f (x)) = f (x) − τ ⟨∇f (x), ∇f (x)⟩ + o(τ ) = f (x) − τ ||∇f (x)||2 + o(τ ).
So there are two possibilities: either ∇f(x) = 0, in which case we are already at a minimum (possibly a local minimizer if the function is non-convex), or if τ is chosen small enough,
f(x − τ∇f(x)) < f(x),
which means that moving from x to x − τ∇f(x) has improved the objective function.
Figure 8: Left: First order Taylor expansion in 1-D and 2-D. Right: orthogonality of gradient and level sets
and schematic of the proof.
Remark 2 (Orthogonality to level sets). The level sets of f are the sets of points sharing the same value of f, i.e. for any s ∈ R,
L_s def= {x ; f(x) = s}.
At some x ∈ R^p, denoting s = f(x), then x ∈ L_s (x belongs to its level set). The gradient vector ∇f(x) is orthogonal to the level set (as shown on Fig. 8, right), and points toward level sets of higher value (which is consistent with the previous computation showing that it is a valid ascent direction). Indeed, let us consider around x, inside L_s, a smooth curve of the form t ∈ R ↦ c(t) where c(0) = x. Then the function h(t) def= f(c(t)) is constant, h(t) = s, since c(t) belongs to the level set. So h′(t) = 0. But at the same time, we can compute its derivative at t = 0 as follows:
h(t) = f(c(0) + tc′(0) + o(t)) = h(0) + t⟨c′(0), ∇f(c(0))⟩ + o(t),
i.e. h′(0) = ⟨c′(0), ∇f(x)⟩ = 0, so that ∇f(x) is orthogonal to the tangent c′(0) of the curve c, which lies in the tangent plane of L_s (as shown on Fig. 8, right). Since the curve c is arbitrary, the whole tangent plane is thus orthogonal to ∇f(x).
Remark 3 (Local optimal descent direction). One can prove something even stronger: among all possible directions u with ||u|| = r, the direction −r ∇f(x)/||∇f(x)|| becomes the optimal one as r → 0 (so for very small steps this is locally the best choice). More precisely,
(1/r) argmin_{||u||=r} f(x + u) → −∇f(x)/||∇f(x)||  as r → 0.
Indeed, introducing a Lagrange multiplier λ ∈ R for this constrained optimization problem, one obtains that the optimal u satisfies ∇f(x + u) = λu and ||u|| = r. Thus u/r = ±∇f(x + u)/||∇f(x + u)||, and assuming that ∇f is continuous, when ||u|| = r → 0, this converges to u/||u|| = ±∇f(x)/||∇f(x)||. The sign ± should be +1 to obtain a maximizer and −1 for the minimizer.
The gradient descent algorithm then iterates
x_{k+1} def= x_k − τ_k ∇f(x_k),   (13)
where τ_k > 0 is the step size (also called learning rate). For a small enough τ_k, the previous discussion shows that the function f is decaying through the iterations. So intuitively, to ensure convergence, τ_k should be chosen small enough, but not too small so that the algorithm is as fast as possible. In general, one uses a fixed step size τ_k = τ, or tries to adapt τ_k at each iteration (see Fig. 9).
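As a minimal illustration (the quadratic test function and the value of τ are arbitrary choices), the iteration (13) with a fixed step size can be written as:

```python
import numpy as np

def gradient_descent(grad_f, x0, tau, n_iter):
    # Fixed step-size gradient descent x_{k+1} = x_k - tau * grad f(x_k), see (13)
    x = x0.copy()
    for _ in range(n_iter):
        x = x - tau * grad_f(x)
    return x

# Test on f(x) = 1/2 <Cx, x> - <b, x>, whose unique minimizer is C^{-1} b
C = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: C @ x - b
L = np.linalg.eigvalsh(C).max()            # largest eigenvalue of C
x = gradient_descent(grad_f, np.zeros(2), tau=1.0 / L, n_iter=200)
print(x, np.linalg.solve(C, b))            # the two vectors should be close
```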
Figure 9: Influence of τ on the gradient descent (left) and optimal step size choice (right).
Remark 4 (Greedy choice). Although this is in general too costly to perform exactly, one can use a “greedy” choice, where the step size is optimal at each iteration, i.e.
τ_k def= argmin_τ h(τ)  where  h(τ) def= f(x_k − τ∇f(x_k)).
Here h(τ) is a function of a single variable. One can compute the derivative of h as
h(τ + δ) = f(x_k − τ∇f(x_k) − δ∇f(x_k)) = f(x_k − τ∇f(x_k)) − δ⟨∇f(x_k − τ∇f(x_k)), ∇f(x_k)⟩ + o(δ).
One notes that at τ = τ_k, ∇f(x_k − τ∇f(x_k)) = ∇f(x_{k+1}) by definition of x_{k+1} in (13). Such an optimal τ = τ_k is thus characterized by
h′(τ_k) = −⟨∇f(x_k), ∇f(x_{k+1})⟩ = 0.
This means that for this greedy algorithm, two successive descent directions ∇f(x_k) and ∇f(x_{k+1}) are orthogonal (see Fig. 9).
Remark 5 (Armijo rule). Instead of looking for the optimal τ, one can look for an admissible τ which guarantees a large enough decay of the functional, in order to ensure convergence of the descent. Given some parameter 0 < α < 1 (which should actually be smaller than 1/2 in order to ensure a sufficient decay), one considers a τ to be valid for a descent direction d_k (for instance d_k = −∇f(x_k)) if it satisfies
f(x_k + τd_k) ⩽ f(x_k) + ατ⟨d_k, ∇f(x_k)⟩.   (14)
For small τ, one has f(x_k + τd_k) = f(x_k) + τ⟨d_k, ∇f(x_k)⟩ + o(τ), so that, assuming d_k is a valid descent direction (i.e. ⟨d_k, ∇f(x_k)⟩ < 0), condition (14) will always be satisfied for τ small enough (if f is convex, the set of allowable τ is of the form [0, τ_max]). In practice, one performs this line search by initializing τ very large, and decaying it, τ ← βτ (for β < 1), until (14) is satisfied. This approach is often called “backtracking” line search.
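Here is a small sketch of the backtracking strategy (the constants α = 0.3, β = 0.5, the initial τ and the test function are illustrative choices, not prescribed by the text):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.5, tau0=1.0):
    # Shrink tau until the Armijo condition (14) holds for the direction d = -grad f(x)
    g = grad_f(x)
    d = -g
    tau = tau0
    while f(x + tau * d) > f(x) + alpha * tau * (d @ g):
        tau *= beta
    return x + tau * d, tau

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: np.logaddexp(0, x[0]) + np.logaddexp(0, -x[0]) + (x[1] - 1.0) ** 2
grad_f = lambda x: np.array([2 * sigma(x[0]) - 1.0, 2 * (x[1] - 1.0)])

x = np.array([3.0, -2.0])
for _ in range(50):
    x, tau = backtracking_step(f, grad_f, x)
print(x)        # converges toward the minimizer (0, 1)
```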
5 Convergence Analysis
5.1 Quadratic Case
Convergence analysis for the quadratic case. We first analyze this algorithm in the case of the
quadratic loss, which can be written as
f(x) = (1/2)||Ax − y||² = (1/2)⟨Cx, x⟩ − ⟨x, b⟩ + cst  where  C def= A⊤A ∈ R^{p×p}  and  b def= A⊤y ∈ R^p.
We already saw in (9) that if ker(A) = {0}, which is equivalent to C being invertible, then there exists a single global minimizer x⋆ = (A⊤A)^{−1}A⊤y = C^{−1}b.
Note that a function of the form (1/2)⟨Cx, x⟩ − ⟨x, b⟩ is convex if and only if the symmetric matrix C is positive semi-definite, i.e. all its eigenvalues are non-negative (as already seen in (10)).
Proposition 4. For f(x) = (1/2)⟨Cx, x⟩ − ⟨b, x⟩ (C being symmetric positive semi-definite) with the eigenvalues of C upper-bounded by L and lower-bounded by µ > 0, assuming there exists (τ_min, τ_max) such that
0 < τ_min ⩽ τ_k ⩽ τ_max < 2/L,
then there exists 0 ⩽ ρ̃ < 1 such that
||x_k − x⋆|| ⩽ ρ̃^k ||x_0 − x⋆||.   (15)
The best rate ρ̃ is obtained for
τ_k = 2/(L + µ)  ⟹  ρ̃ def= (L − µ)/(L + µ) = 1 − 2ε/(1 + ε)  where  ε def= µ/L.   (16)
Proof. One iterate of gradient descent reads x_{k+1} = x_k − τ_k(Cx_k − b). Since the solution x⋆ (which by the way is unique by strict convexity) satisfies the first order condition Cx⋆ = b, it gives
x_{k+1} − x⋆ = x_k − x⋆ − τ_kC(x_k − x⋆) = (Id_p − τ_kC)(x_k − x⋆).
If S ∈ R^{p×p} is a symmetric matrix, one has
||Sz|| ⩽ ||S||_op ||z||  where  ||S||_op def= max_k |λ_k(S)|,
where λ_k(S) are the eigenvalues of S and σ_k(S) def= |λ_k(S)| are its singular values. Indeed, S can be diagonalized in an orthogonal basis U, so that S = U diag(λ_k(S))U⊤, and S⊤S = S² = U diag(λ_k(S)²)U⊤, so that ||Sz||² = ⟨S⊤Sz, z⟩ ⩽ max_k λ_k(S)² ||z||².
Applying this to S = Id_p − τ_kC gives
||x_{k+1} − x⋆|| ⩽ h(τ_k)||x_k − x⋆||  where  h(τ) def= ||Id_p − τC||_op = max(|1 − τµ|, |1 − τL|),
since for a quadratic function one has σ_min(C) = µ and σ_max(C) = L. Figure 10, right, shows a display of h(τ). One has that for 0 < τ < 2/L, h(τ) < 1. The optimal value is reached at τ⋆ = 2/(L + µ), and then
h(τ⋆) = 1 − 2µ/(L + µ) = (L − µ)/(L + µ).
Note that when ε def= µ/L ≪ 1, which is the typical setup for ill-posed problems, the contraction constant appearing in (16) scales like
ρ̃ ∼ 1 − 2ε.   (17)
The quantity ε in some sense reflects the inverse conditioning of the problem. For quadratic functions, it indeed corresponds exactly to the inverse of the condition number (which is the ratio of the largest to smallest singular value). The condition number is minimal and equal to 1 for orthogonal matrices.
The error decay rate (15), although it is geometric O(ρ^k), is called a “linear rate” in the optimization literature. It is a “global” rate because it holds for all k (and not only for large enough k).
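One can check the rate (16) numerically; the following sketch (with an arbitrary 2×2 matrix C) runs gradient descent with the optimal step 2/(L + µ) and compares the observed per-iteration contraction to ρ̃ = (L − µ)/(L + µ):

```python
import numpy as np

C = np.array([[4.0, 1.0], [1.0, 1.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
eig = np.linalg.eigvalsh(C)
mu, L = eig.min(), eig.max()
x_star = np.linalg.solve(C, b)

tau = 2.0 / (L + mu)                      # optimal step size, see (16)
rho = (L - mu) / (L + mu)                 # predicted contraction factor

x = np.zeros(2)
err_prev = np.linalg.norm(x - x_star)
for k in range(20):
    x = x - tau * (C @ x - b)
    err = np.linalg.norm(x - x_star)
    print(k, err / err_prev, "predicted:", rho)   # observed ratio vs rho
    err_prev = err
```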
Figure 10: Contraction constant h(τ ) for a quadratic function (right).
If ker(A) ≠ {0}, then C is not positive definite (some of its eigenvalues vanish), and the set of solutions is infinite. One can however still show a linear rate, by showing that the iterates x_k actually stay orthogonal to ker(A) and redoing the above proof replacing µ by the smallest non-zero eigenvalue of C. This analysis however leads to a very poor rate ρ (very close to 1) because µ can be arbitrarily close to 0. Furthermore, such a proof does not extend to non-quadratic functions. It is thus necessary to do a different theoretical analysis, which only shows a sublinear rate on the objective function f itself rather than on the iterates x_k.
Proposition 5. For f(x) = (1/2)⟨Cx, x⟩ − ⟨b, x⟩, assuming the eigenvalues of C are bounded by L, if 0 < τ_k = τ < 1/L is constant, then
f(x_k) − f(x⋆) ⩽ dist(x_0, argmin f)² / (8τk),
where
dist(x_0, argmin f) def= min_{x⋆ ∈ argmin f} ||x_0 − x⋆||.
Proof. We have Cx⋆ = b for any minimizer x⋆ and x_{k+1} = x_k − τ(Cx_k − b), so that as before x_k − x⋆ = (Id_p − τC)^k(x_0 − x⋆), and thus
f(x_k) − f(x⋆) = (1/2)⟨(Id_p − τC)^k C (Id_p − τC)^k (x_0 − x⋆), x_0 − x⋆⟩ ⩽ (σ_max(M_k)/2) ||x_0 − x⋆||²,
where we have denoted
M_k def= (Id_p − τC)^k C (Id_p − τC)^k.
Since x⋆ can be chosen arbitrarily, one can replace ||x_0 − x⋆|| by dist(x_0, argmin f). One has, for any ℓ, the following bound
σ_ℓ(M_k) = σ_ℓ(C)(1 − τσ_ℓ(C))^{2k} ⩽ 1/(4τk),
since one can show that (setting t = τσ_ℓ(C) ⩽ 1 because of the hypotheses)
∀ t ∈ [0, 1], (1 − t)^{2k} t ⩽ 1/(4k).
Indeed, one has
(1 − t)^{2k} t ⩽ (e^{−t})^{2k} t = (1/(2k)) (2kt)e^{−2kt} ⩽ (1/(2k)) sup_{u⩾0} u e^{−u} = 1/(2ek) ⩽ 1/(4k).
Hessian. If the function is twice differentiable along the axes, the Hessian matrix is
(∂²f)(x) = ( ∂²f(x)/(∂x_i ∂x_j) )_{1⩽i,j⩽p} ∈ R^{p×p}.
Recall that ∂²f(x)/(∂x_i ∂x_j) is the derivative along the direction x_j of the function x ↦ ∂f(x)/∂x_i. We also recall that ∂²f(x)/(∂x_i ∂x_j) = ∂²f(x)/(∂x_j ∂x_i), so that ∂²f(x) is a symmetric matrix.
A differentiable function f is said to be twice differentiable at x if
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + (1/2)⟨∂²f(x)ε, ε⟩ + o(||ε||²).   (18)
This means that one can approximate f near x by a quadratic function. The Hessian matrix is uniquely determined by this relation, so that if one is able to write down an expansion with some matrix H,
f(x + ε) = f(x) + ⟨∇f(x), ε⟩ + (1/2)⟨Hε, ε⟩ + o(||ε||²),
then equating this with the expansion (18) ensures that ∂²f(x) = H. This is thus a way to actually determine the Hessian without computing all the p² partial derivatives. This Hessian can equivalently be obtained by performing an expansion (i.e. computing the differential) of the gradient, since
∇f(x + ε) = ∇f(x) + [∂²f(x)](ε) + o(||ε||),
where [∂ 2 f (x)](ε) ∈ Rp denotes the multiplication of the matrix ∂ 2 f (x) with the vector ε.
One can show that a twice differentiable function f on Rp is convex if and only if for all x the symmetric
matrix ∂ 2 f (x) is positive semi-definite, i.e. all its eigenvalues are non-negative. Furthermore, if these
eigenvalues are strictly positive then f is strictly convex (but the converse is not true, for instance x4 is
strictly convex on R but its second derivative vanishes at x = 0).
For instance, for a quadratic function f(x) = (1/2)⟨Cx, x⟩ − ⟨x, u⟩, one has ∇f(x) = Cx − u and thus ∂²f(x) = C (which is thus constant). For the classification function (4), one has ∇f(x) = −A⊤ diag(y)∇L(−diag(y)Ax)
and thus
∇f(x + ε) = −A⊤ diag(y)∇L(−diag(y)Ax − diag(y)Aε) = ∇f(x) − A⊤ diag(y)[∂²L(−diag(y)Ax)](−diag(y)Aε) + o(||ε||).
Since ∇L(u) = (ℓ′(u_i))_i, one has ∂²L(u) = diag(ℓ′′(u_i)). This means that
∂²f(x) = A⊤ diag(y) × diag(ℓ′′(−diag(y)Ax)) × diag(y)A.
One verifies that this matrix is symmetric and positive semi-definite if ℓ is convex, so that ℓ′′ is non-negative.
Remark 6 (Second order optimality condition). The first use of the Hessian is to decide whether a point x⋆ with ∇f(x⋆) = 0 is a local minimum or not. Indeed, if ∂²f(x⋆) is a positive definite matrix (i.e. its eigenvalues are strictly positive), then x⋆ is a strict local minimum. Note that if ∂²f(x⋆) is only positive semi-definite (i.e. some of its eigenvalues might vanish) then one cannot deduce anything (such as for instance x³ on R). Conversely, if x⋆ is a local minimum then ∂²f(x⋆) is positive semi-definite.
Remark 7 (Second order algorithms). A second use is to define second order methods (such as Newton's algorithm), which converge faster than gradient descent, but are more costly. The generalized gradient descent reads
x_{k+1} = x_k − H_k∇f(x_k)
where H_k ∈ R^{p×p} is a positive symmetric matrix. One recovers the gradient descent when using H_k = τ_k Id_p, and Newton's algorithm corresponds to using the inverse of the Hessian, H_k = [∂²f(x_k)]^{−1}. Note that
f(x_{k+1}) = f(x_k) − ⟨H_k∇f(x_k), ∇f(x_k)⟩ + o(||H_k∇f(x_k)||).
Since H_k is positive, if x_k is not a minimizer, i.e. ∇f(x_k) ≠ 0, then ⟨H_k∇f(x_k), ∇f(x_k)⟩ > 0. So if H_k is small enough one has a valid descent method in the sense that f(x_{k+1}) < f(x_k). It is not the purpose of this chapter to explain these types of algorithms in more detail.
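As an illustration of the generalized descent with H_k = [∂²f(x_k)]⁻¹, here is a bare-bones Newton sketch (the test function is an arbitrary smooth convex example, and a practical implementation would add safeguards such as line search):

```python
import numpy as np

def newton(grad_f, hess_f, x0, n_iter=10):
    # Generalized descent x_{k+1} = x_k - H_k grad f(x_k) with H_k = [hess f(x_k)]^{-1}
    x = x0.copy()
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Example: f(x) = log(1 + e^{x1}) + log(1 + e^{-x1}) + (x2 - 1)^2 / 2
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
grad_f = lambda x: np.array([2 * sigma(x[0]) - 1.0, x[1] - 1.0])
hess_f = lambda x: np.diag([2 * sigma(x[0]) * (1 - sigma(x[0])), 1.0])
print(newton(grad_f, hess_f, np.array([1.0, 5.0])))   # converges quickly to (0, 1)
```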
The last use of the Hessian, which we explore next, is to study theoretically the convergence of the gradient descent. One simply needs to replace the boundedness of the eigenvalues of C for a quadratic function by a boundedness of the eigenvalues of ∂²f(x) for all x. Roughly speaking, the theoretical analysis of the gradient descent for a generic function is obtained by applying this approximation and using the proofs of the previous section.
Proof. We prove (19), using a Taylor expansion with integral remainder:
f(x′) − f(x) = ∫_0^1 ⟨∇f(x_t), x′ − x⟩ dt = ⟨∇f(x), x′ − x⟩ + ∫_0^1 ⟨∇f(x_t) − ∇f(x), x′ − x⟩ dt,
where x_t def= x + t(x′ − x). Using Cauchy-Schwarz, and then the smoothness hypothesis (R_L),
f(x′) − f(x) ⩽ ⟨∇f(x), x′ − x⟩ + ∫_0^1 L||x_t − x|| ||x′ − x|| dt ⩽ ⟨∇f(x), x′ − x⟩ + L||x′ − x||² ∫_0^1 t dt,
which gives the desired result since ||x_t − x|| ||x′ − x|| = t||x′ − x||².
The relation (19) shows that a smooth (resp. strongly convex) functional is bounded from above (resp. below) by a quadratic tangential majorant (resp. minorant).
Condition (20) thus reads that the singular values of ∂ 2 f (x) should be contained in the interval [µ, L].
The upper bound is also equivalent to ||∂ 2 f (x)||op ⩽ L where || · ||op is the operator norm, i.e. the largest
singular value. In the special case of a quadratic function of the form ⟨Cx, x⟩ − ⟨b, x⟩ (recall that necessarily
C is semi-definite symmetric positive for this function to be convex), ∂ 2 f (x) = C is constant, so that [µ, L]
can be chosen to be the range of the eigenvalues of C.
Convergence analysis. We now give a convergence theorem for a general convex function. In contrast to the quadratic case, if one does not assume strong convexity, one can only show a sub-linear rate on the function values (and no rate at all on the iterates themselves). It is only when one assumes strong convexity that a linear rate is obtained. Note that without strong convexity, the solution of the minimization problem is not necessarily unique.
Theorem 1. If f satisfies conditions (R_L), assuming there exists (τ_min, τ_max) such that
0 < τ_min ⩽ τ_k ⩽ τ_max < 2/L,
then x_k converges to a solution x⋆ of (1) and there exists C > 0 such that
f(x_k) − f(x⋆) ⩽ C/(k + 1).   (21)
If furthermore f is µ-strongly convex, then there exists 0 ⩽ ρ < 1 such that ||x_k − x⋆|| ⩽ ρ^k ||x_0 − x⋆||.
Proof. In the case where f is not strongly convex, we only prove (21), since the proof that x_k converges is more technical. Note indeed that if the minimizer x⋆ is non-unique, then it might be the case that the iterates x_k “cycle” while approaching the set of minimizers, but actually convexity of f prevents this kind of pathological behavior. For simplicity, we do the proof in the case τ_k = 1/L, but it extends to the general case. The L-smoothness property implies (19), which reads
f(x_{k+1}) ⩽ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)||x_{k+1} − x_k||².
Using the fact that x_{k+1} − x_k = −(1/L)∇f(x_k), one obtains
f(x_{k+1}) ⩽ f(x_k) − (1/L)||∇f(x_k)||² + (1/(2L))||∇f(x_k)||² ⩽ f(x_k) − (1/(2L))||∇f(x_k)||².   (22)
This shows that (f(x_k))_k is a decaying sequence. By convexity, one can then deduce that
f(x_{k+1}) − f(x⋆) ⩽ L||x_0 − x⋆||²/(2(k + 1)),
which gives (21) for C def= L||x_0 − x⋆||²/2.
If we now assume f is µ-strongly convex, then, using ∇f(x⋆) = 0, one has (µ/2)||x⋆ − x||² ⩽ f(x) − f(x⋆) for all x. Re-manipulating (25) gives
(µ/2)||x_{k+1} − x⋆||² ⩽ f(x_{k+1}) − f(x⋆) ⩽ (L/2)||x_k − x⋆||² − (L/2)||x⋆ − x_{k+1}||²,
and hence
||x_{k+1} − x⋆|| ⩽ sqrt(L/(L + µ)) ||x_k − x⋆||,   (26)
which is the desired result.
Note that in the low conditioning setting ε ≪ 1, one retrieves a dependency of the rate (26) similar to the one of quadratic functions (17); indeed
sqrt(L/(L + µ)) = (1 + ε)^{−1/2} ∼ 1 − ε/2.
5.3 Acceleration
The previous analysis shows that for L-smooth functions (i.e. with a hessian uniformly bounded by L,
||∂ 2 f (x)||op ⩽ L), the gradient descent with fixed step size converges with a speed on the function value
f (xk ) − min f = O(1/k). Even using various line search strategies, it is not possible to improve over this
rate. A way to improve this rate is by introducing some form of “momentum” extrapolation and rather
consider a pair of variables (xk , yk ) with the following update rule, for some step size s (which should be
smaller than 1/L)
xk+1 = yk − s∇f (yk )
yk+1 = xk+1 + βk (xk+1 − xk )
where the extrapolation parameter satisfies 0 < β_k < 1. The case of a fixed β_k = β corresponds to the so-called “heavy-ball” method. In order for the method to bring an improvement over the 1/k “worst case” rate (which does not mean it improves in all possible cases), one needs to rather use an increasing momentum β_k → 1, one popular choice being
β_k = (k − 1)/(k + 2) ∼ 1 − 3/k.
This corresponds to the so-called “Nesterov” acceleration (although Nesterov used a slightly different choice, with the same 1 − 3/k asymptotic behavior).
When using s ⩽ 1/L, one can show that f(x_k) − min f = O(||x_0 − x⋆||²/(sk²)), so that in the worst case scenario, the convergence rate is improved. Note however that in some situations, acceleration actually deteriorates the rates. For instance, if the function is strongly convex (and even in the simple case f(x) = ||x||²), Nesterov acceleration does not enjoy a linear convergence rate.
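A minimal sketch of the accelerated scheme (the ill-conditioned least squares test problem is an arbitrary choice; s = 1/L and β_k = (k − 1)/(k + 2) as above):

```python
import numpy as np

def nesterov(grad_f, x0, s, n_iter):
    # x_{k+1} = y_k - s grad f(y_k),  y_{k+1} = x_{k+1} + beta_k (x_{k+1} - x_k)
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, n_iter + 1):
        x = y - s * grad_f(y)
        beta = (k - 1) / (k + 2)
        y = x + beta * (x - x_prev)
        x_prev = x
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 20)) * np.linspace(0.1, 1.0, 20)  # ill-conditioned columns
y_data = rng.standard_normal(100)
grad_f = lambda x: A.T @ (A @ x - y_data)
L = np.linalg.norm(A, 2) ** 2                                   # Lipschitz constant of the gradient
x = nesterov(grad_f, np.zeros(20), s=1.0 / L, n_iter=500)
print(np.linalg.norm(A.T @ (A @ x - y_data)))                   # small residual gradient
```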
A way to interpret this scheme is by looking at a time-continuous ODE limit when s → 0. In contrast to the classical gradient descent, the step size here should be taken as τ = √s so that the time evolves as t = √s k. The update reads
(x_{k+1} − x_k)/τ = (1 − 3/k)(x_k − x_{k−1})/τ − τ∇f(y_k),
which can be re-written as
(x_{k+1} + x_{k−1} − 2x_k)/τ² + (3/(kτ)) (x_k − x_{k−1})/τ + ∇f(y_k) = 0.
Assuming (x_k, y_k) → (x(t), y(t)), one obtains in the limit the following second order ODE
x″(t) + (3/t) x′(t) + ∇f(x(t)) = 0  with  x(0) = x_0, x′(0) = 0.
This corresponds to the movement of a ball in the potential field f, where the term (3/t)x′(t) plays the role of a friction which vanishes in the limit. So for small t, the method is similar to a gradient descent x′ = −∇f(x), while for large t, it resembles a Newtonian evolution x″ = −∇f(x) (which keeps oscillating without converging). The momentum decay rate 3/t is very important: it is the only rule which enables the speed improvement from 1/k to 1/k².
its Legendre transform. In the case of such “Legendre-type” entropy functions, ∇ψ : dom(ψ) → dom(ψ⋆) and ∇ψ⋆ are bijections, each being the inverse of the other.
One then defines the associated Bregman divergence
D_ψ(x|y) def= ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.
It is positive, convex in x (but not necessarily in y), not necessarily symmetric, and “distance-like”.
For ψ = ||·||²/2 one has ∇ψ = ∇ψ⋆ = Id, and one recovers the (squared) Euclidean distance. For ψ_KL(x) = Σ_i x_i log(x_i) − x_i + 1 one has ∇ψ = log and ∇ψ⋆ = exp, and one obtains the relative entropy, also known as the Kullback-Leibler divergence,
D_{ψ_KL}(x|y) = Σ_i x_i log(x_i/y_i) − x_i + y_i.
When ψ_Burg(x) = Σ_i (−log(x_i) + x_i − 1) on R^d_+, one has ∇ψ_Burg(x) = 1 − 1/x, and the associated divergence is
D_{ψ_Burg}(x|y) = Σ_i (−log(x_i/y_i) + x_i/y_i − 1).   (27)
For instance, if h(s) = s log(s) − s + 1 is the Shannon entropy, this defines the quantum Shannon entropy, which is jointly convex in x and y. Only for ψ = ψ_KL does one have D_{ψ_KL} = C_{ψ_KL}.
The fact that ψ is of Legendre type allows one to ignore the constraint, and the solution satisfies the following first order condition
∇f(x_k) + (1/τ)[∇ψ(x_{k+1}) − ∇ψ(x_k)] = 0,
so that it can be explicitly computed as
x_{k+1} = ∇ψ⋆(∇ψ(x_k) − τ∇f(x_k)).   (29)
For ψ = ||·||²/2 one recovers the usual Euclidean gradient descent. For ψ(x) = Σ_i x_i log(x_i), this defines the multiplicative updates
x_{k+1} = x_k ⊙ exp(−τ∇f(x_k))
where ⊙ is the entry-wise multiplication of vectors.
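A small sketch of these multiplicative updates (the objective and the step τ are illustrative choices; positivity of the iterates is preserved automatically):

```python
import numpy as np

def mirror_descent_entropy(grad_f, x0, tau, n_iter):
    # Multiplicative updates x_{k+1} = x_k * exp(-tau * grad f(x_k)),
    # i.e. mirror descent for the Shannon entropy psi(x) = sum_i x_i log(x_i)
    x = x0.copy()
    for _ in range(n_iter):
        x = x * np.exp(-tau * grad_f(x))
    return x

# Minimize f(x) = 1/2 ||x - c||^2 over positive vectors
c = np.array([2.0, -1.0, 0.5])
grad_f = lambda x: x - c
x0 = np.ones(3)                    # strictly positive initialization
x = mirror_descent_entropy(grad_f, x0, tau=0.1, n_iter=2000)
print(x)                           # close to max(c, 0) = (2, 0, 0.5), the minimizer on R^d_+
```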
Note that introducing the “dual” variable u_k ≜ ∇ψ(x_k), one has
u_{k+1} = u_k − τ∇f(x_k) = u_k − τ h(u_k)  where  h(u) ≜ ∇f(∇ψ⋆(u)).   (30)
Note however that in general h is not a gradient field, so this is not in general a gradient flow.
so that this is a gradient flow on a very particular type of manifold, of “Hessian type”. Note that if ψ = f ,
then one recovers the flow associated to Newton’s method.
Convergence. Convergence theory (ensuring convergence and rates) for mirror descent is the same as for the usual gradient descent, and one needs to consider relative L-smoothness, and if possible also relative µ-strong convexity.
If L < +∞, then one has f(x_k) − f(x⋆) ⩽ O(D_ψ(x⋆|x_0)/k), while if both 0 < µ ⩽ L < +∞, then D_ψ(x_k|x⋆) ⩽ O(D_ψ(x⋆|x_0)(1 − µ/L)^k). The advantages of using a Bregman geometry are two-fold: it can improve the conditioning µ/L (some functions might be non-smooth for the Euclidean geometry but smooth for some Bregman geometry, and this can avoid introducing constraints in the optimization problem), and it can also lower the radius of the domain D_ψ(x⋆|x_0). For instance, assuming the solution belongs to the simplex, and using x_0 = 1_d/d, then D_{ψ_KL}(x⋆|x_0) ⩽ log(d), whereas for the ℓ² Euclidean distance, one only has the bound ||x⋆ − x_0||² ⩽ d.
so that, denoting z(t) the gradient flow ż = −∇g(z) of g, and x(t) ≜ φ(z(t)), one has ẋ(t) = [∂φ(z(t))]ż(t) and thus x(t) solves the following equation
ẋ(t) = −Q(z(t))∇f(x(t))  where  Q(z) ≜ [∂φ(z)][∂φ(z)]⊤.
So unless φ is a bijection, this is not a gradient flow over the x variable. If φ is a bijection, then this is a gradient flow associated to the field of tensors (“manifold”) Q(φ^{−1}(x)). The issue is that even in this case, in general this metric might fail to be of Hessian type, so this does not correspond to a mirror descent flow.
Dual parameterization. If ψ is an entropy function, then for the parameterization x = ∇ψ⋆(z), i.e. φ = ∇ψ⋆, one has Q(z) = [∂²ψ⋆(z)]², i.e. Q(φ^{−1}(x)) = [∂²ψ(x)]^{−2}, which is not of Hessian type in general, but rather a squared-Hessian manifold. For instance, when ψ⋆(z) = exp(z), then Q(φ^{−1}(x)) = diag(x_i²), which surprisingly is the Hessian metric associated to Burg's entropy −Σ_i log(x_i).
Example: power-type parameterization. We consider the power entropies (28), on R^d_+, for α ⩽ 1, for which
H(x) = [∂²ψ(x)]^{−1} ∝ diag(x_i^{2−α}).
Remark that when using the parameterization x = φ(z) = (z_i^b)_i, then
Q(φ^{−1}(x)) = [∂φ(z)][∂φ(z)]⊤ ∝ diag(z_i^{2(b−1)}) = diag(x_i^{2(b−1)/b}),
so if one selects 2(1 − 1/b) = 2 − α, i.e. 2/b = α, the re-parameterized flow is equal to the flow on a Hessian manifold. For instance, when setting b = 2, α = 1, i.e. using the parameterization x = z², one retrieves the flow on the manifold for the Shannon entropy (“Fisher-Rao” geometry). Note that when b → +∞, one obtains α = 0, i.e. the flow is the one of Burg's entropy ψ(x) = −Σ_i log(x_i) (which we saw above as also being associated to the parameterization x = exp(z)).
Counter-example: SDP matrices. We now consider positive semi-definite symmetric matrices X ∈ S^{d×d}_+, together with the parameterization X = φ(Z) = ZZ⊤ for Z ∈ R^{d×d}. In this case, denoting g(Z) = f(ZZ⊤), one has
∇g(Z) = [∇f(X) + ∇f(X)⊤]Z,
so that the flow Ż = −∇g(Z) is equivalent to the following flow on symmetric matrices (and it maintains positivity as well)
Ẋ = −(X[∇_S f(X)] + [∇_S f(X)]X)   (32)
where the symmetric gradient is
∇_S f(X) ≜ [∇f(X)] + [∇f(X)]⊤.
So most likely (32) cannot be written as a usual gradient flow on a manifold which would be the Hessian of a convex function. To mimic the diagonal (or vectorial) case above, the most natural quantity would have been the spectral entropy ψ(X) ≜ tr(X log(X) − X + Id), whose gradient is log(X), but unfortunately there is no closed form expression for the differential of the matrix logarithm. Another simpler approach, mimicking Burg's entropy, is to use ψ(X) = −tr(log(X)) = −log det(X), because its Hessian and its inverse can be computed:
∂²ψ(X) : S ↦ X^{−1}SX^{−1}.
where the loss is coercive and such that ℓ(·, y_i) has a unique minimizer at y_i. The typical example is f(x) = ||Ax − y||² for ℓ(u, v) = (u − v)². We do not impose that L is convex, and simply assume convergence of the considered optimization method to the set of global minimizers. The set of global minimizers is thus the affine space
argmin f = {x ; Ax = y}.
The simplest optimization method is just gradient descent, x_{k+1} = x_k − τ∇f(x_k). As τ → 0, one defines x(t) = x_k for t = kτ and considers the flow ẋ(t) = −∇f(x(t)).
The implicit bias of the descent (and of the flow) is given by the orthogonal projection.
Proposition 7. If x_k → x⋆ ∈ argmin f, then x⋆ = argmin_{x ∈ argmin f} ||x − x_0||, i.e. x⋆ is the orthogonal projection of x_0 onto argmin f.
The following Proposition, whose proof can be found in [?], generalizes this result to the case of an arbitrary mirror flow.
Proposition 8. If xk defined by (29) (resp. x(t) defined by (31)) is such that xk (resp. x(t)) converges to
x⋆ ∈ argmin f , then
x⋆ = argmin Dψ (x|x0 ). (33)
x∈argmin f
Proof. From the dual variable evolution (30), since ∇f (x) ∈ Im(A⊤ ), one has that yk − y0 ∈ Im(A⊤ ), so
that in the limit
y ⋆ − y0 = ∇ψ(x⋆ ) − ∇ψ(x0 ) ∈ Im(A⊤ ). (34)
Note that ∇ Dψ (x|x0 ) = ∇ψ(x) − ∇ψ(x0 ), and Im(A⊤ ) = Ker(A)⊥ is the space orthogonal to argmin f so
that (34) are the optimality conditions of the strictly convex problem (33).
In particular, for the Shannon entropy (equivalently when using the x = z² parameterization), as x_0 → 0, by doing the expansion of KL(x|x_0) one has
x⋆ → argmin_{x ∈ argmin f, x⩾0} Σ_i |log((x_0)_i)| x_i,
which is a weighted ℓ¹ norm (so in particular it induces sparsity in the solution: it is a Lasso-type problem).
When using more general parameterizations of the form x = z^b for b > 0, this corresponds to using the power entropy ψ_α for α = 2/b, and one can check that the associated limit bias for small x_0 is still an ℓ¹ norm, but with a different weighting scheme. For x = exp(z) (or b → +∞) one obtains Burg's entropy defined in (27), so that the limit bias is Σ_i x_i/(x_0)_i. The use of the x = z² parameterization (which can be generalized to x = u ⊙ v for signed vectors) was introduced in [?], and its associated implicit regularization is detailed in [?, ?]. It is possible to analyze this sparsity-inducing behavior in a quantitative way, see for instance [?, Thm. 2]. One can generalize this parameterization to arbitrary (not only positive) vectors by using x = u² − v² or x = u ⊙ v, and the same type of bias appears, now rather with a (weighted) ℓ¹ norm.
7 Regularization
When the number n of samples is not large enough with respect to the dimension p of the model, it makes sense to regularize the empirical risk minimization problem.
We assume for simplicity that R is positive and coercive, i.e. R(x) → +∞ as ||x|| → +∞. The following proposition shows that in the small λ limit, the regularization selects a subset of the possible minimizers. This is especially useful when ker(A) ≠ {0}, i.e. when the equation Ax = y has an infinite number of solutions.
Proposition 9. If (x_{λ_k})_k is a sequence of minimizers of f_{λ_k} with λ_k → 0, then this sequence is bounded, and any accumulation point x⋆ is a solution of the constrained optimization problem
argmin_{Ax=y} R(x).
For the ridge penalty R(x) = ||x||², this limit is the minimal norm solution argmin_{Ax=y} ||x||, and one can use the identity
(A⊤A + λId_p) A⊤(AA⊤ + λId_n)^{−1} = A⊤(AA⊤ + λId_n)(AA⊤ + λId_n)^{−1} = A⊤
to express the minimizer of f_λ using either a p×p or an n×n matrix inversion.
If ker(A) = {0} (overdetermined setting), A⊤A ∈ R^{p×p} is an invertible matrix, and (A⊤A + λId_p)^{−1} → (A⊤A)^{−1}, so that
x_0 = A⁺y  where  A⁺ def= (A⊤A)^{−1}A⊤.
Conversely, if ker(A⊤) = {0}, or equivalently Im(A) = R^n (underdetermined setting), then one has
x_0 = A⁺y  where  A⁺ def= A⊤(AA⊤)^{−1}.
In the special case where n = p and A is invertible, both definitions of A⁺ coincide, and A⁺ = A^{−1}. In the general case (where A is neither injective nor surjective), A⁺ can be computed using the Singular Value Decomposition (SVD). The matrix A⁺ is often called the Moore-Penrose pseudo-inverse.
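A sketch comparing the two expressions of the ridge estimator and its λ → 0 limit (synthetic data; both formulas should agree up to numerical precision):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 30                       # underdetermined: ker(A) != {0}
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 1e-3

# Two equivalent expressions of the ridge solution (p x p vs n x n inversion)
x1 = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
x2 = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), y)
print(np.allclose(x1, x2))

# As lambda -> 0, the ridge solution converges to A^+ y (minimal norm solution)
x_pinv = np.linalg.pinv(A) @ y
print(np.linalg.norm(x2 - x_pinv))  # small for small lambda
```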
7.3 Lasso
The Lasso corresponds to using an ℓ¹ penalty
R(x) def= ||x||₁ = Σ_{k=1}^p |x_k|.
Proof. One has f_λ(x) = Σ_k (1/2)(x_k − y_k)² + λ|x_k|, so that one needs to find the minimum of the 1-D function x ∈ R ↦ (1/2)(x − y)² + λ|x|. We can do this minimization “graphically” as shown on Fig. 12. For x > 0, one has F′(x) = x − y + λ, which vanishes at x = y − λ. The minimum is thus at x = y − λ when λ ⩽ y, and stays at 0 for all λ > y. The problem is symmetric with respect to the switch (x, y) ↦ (−x, −y).
Here, S_λ(x) def= sign(x) max(|x| − λ, 0) (applied entry-wise) is the celebrated soft-thresholding non-linear function.
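A one-line implementation of the soft-thresholding map (applied entry-wise to a vector):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0), applied entry-wise
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.7, 3.0]), lam=0.5))
# [-1.5 -0.   0.   0.2  2.5]
```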
Figure 12: Evolution with λ of the function F def= (1/2)||· − y||² + λ|·|.
We notice that f_τ(x, x) = f(x), and the quadratic part of this function reads
K(x, x′) def= −(1/2)||Ax − Ax′||² + (1/(2τ))||x − x′||² = (1/2)⟨((1/τ)Id_p − A⊤A)(x − x′), x − x′⟩.
This quantity K(x, x′) is positive if λ_max(A⊤A) ⩽ 1/τ (maximum eigenvalue), i.e. τ ⩽ 1/||A||²_op, where we recall that ||A||_op = σ_max(A) is the operator (algebra) norm. This shows that f_τ(x, x′) is a valid surrogate functional, in the sense that
f(x) ⩽ f_τ(x, x′),   f_τ(x, x) = f(x),   and   f(·) − f_τ(·, x′) is smooth.
We also note that this majorant f_τ(·, x′) is convex. This leads to define
x_{k+1} def= argmin_x f_τ(x, x_k).   (40)
Equation (41) defines the iterative soft-thresholding algorithm (ISTA). It follows from a valid convex surrogate function if τ ⩽ 1/||A||²_op, but one can actually show that it converges to a solution of the Lasso as soon as τ < 2/||A||²_op, which is exactly as for the classical gradient descent.
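A compact sketch of the resulting iterations for the Lasso (synthetic sparse data; τ = 1/||A||²_op):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, y, lam, n_iter=500):
    # Iterative soft thresholding: gradient step on 1/2||Ax-y||^2, then soft threshold
    tau = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - tau * A.T @ (A @ x - y), tau * lam)
    return x

rng = np.random.default_rng(5)
n, p = 40, 100
A = rng.standard_normal((n, p))
x_true = np.zeros(p); x_true[:5] = rng.standard_normal(5)   # sparse ground truth
y = A @ x_true + 0.01 * rng.standard_normal(n)
x = ista(A, y, lam=0.1)
print(np.count_nonzero(x), "non-zero coefficients")         # recovers a sparse vector
```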
8 Stochastic Optimization
We detail some important stochastic gradient descent methods, which make it possible to perform optimization in the setting where the number of samples n is large, or even infinite.
Problem (42) can be seen as a special case of (43), when using the discrete empirical uniform measure π = (1/n)Σ_{i=1}^n δ_i and setting f(x, i) = f_i(x). One can also view (42) as a discretized “empirical” version of (43) when drawing (z_i)_i i.i.d. according to z and defining f_i(x) = f(x, z_i). In this setup, (42) converges to (43) as n → +∞.
A typical example of such a class of problems is empirical risk minimization for linear models, where in these cases
f_i(x) = ℓ(⟨a_i, x⟩, y_i)  and  f(x, z) = ℓ(⟨a, x⟩, y)   (44)
for z = (a, y) ∈ Z = (A = R^p) × Y (typically Y = R for regression or Y = {−1, +1} for classification), where ℓ is some loss function. We illustrate below the methods on binary logistic classification, where
L(s, y) def= log(1 + exp(−sy)).   (45)
But this extends to arbitrary parametric models, and in particular deep neural networks.
While some algorithms (in particular batch gradient descent) are specific to finite sums (42), the stochastic methods we detail next work verbatim (with the same convergence guarantees) in the expectation case (43). For the sake of simplicity, we however do the exposition for the finite sum case, which is sufficient in the vast majority of cases. But one should keep in mind that n can be arbitrarily large, so it is not acceptable in this setting to use algorithms whose complexity per iteration depends on n.
If the functions f_i(x) are very similar (the extreme case being that they are all equal), then of course there is a gain in using stochastic optimization (since in this case ∇f_i ≈ ∇f but ∇f_i is n times cheaper). But in general stochastic optimization methods are not necessarily faster than batch gradient descent. If n is not too large, so that one can afford the price of doing a few non-stochastic iterations, then deterministic methods can be faster. But if n is so large that one cannot do even a single deterministic iteration, then stochastic methods allow one to have a fine-grained scheme by breaking the cost of deterministic iterations into smaller chunks. Another advantage is that they are quite easy to parallelize.
Figure 13: Evolution of the error of the BGD for logistic classification.
The batch gradient descent (BGD) simply applies the gradient descent iterations (13) to f, and the step size should be chosen as 0 < τ_min < τ_k < τ_max def= 2/L where L is the Lipschitz constant of the gradient ∇f. In particular, in this deterministic setting, this step size should not go to zero, and this ensures quite fast convergence (even linear rates if f is strongly convex).
The computation of the gradient in our setting reads
∇f(x) = (1/n) Σ_{i=1}^n ∇f_i(x)  where  ∇f_i(x) = L′(⟨a_i, x⟩, y_i) a_i,   (46)
and L′(s, y) ∈ R denotes the derivative with respect to the first variable, i.e. the derivative of the map s ∈ R ↦ L(s, y) ∈ R. This computation shows that evaluating the full gradient requires a pass over all n samples.
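As an illustration, batch gradient descent for the logistic loss (45) can be written by averaging the per-sample gradients as in (46); the synthetic data and the step size 1/L (using a standard bound on the Lipschitz constant, stated here as an assumption) are illustrative choices.

```python
import numpy as np

def grad_f(A, y, x):
    # (46): grad f(x) = (1/n) sum_i grad f_i(x), with f_i(x) = log(1 + exp(-y_i <a_i, x>))
    n = A.shape[0]
    s = -y / (1.0 + np.exp(y * (A @ x)))        # derivative of the logistic loss w.r.t. <a_i, x>
    return A.T @ s / n

rng = np.random.default_rng(6)
n, p = 500, 10
A = rng.standard_normal((n, p))
y = np.sign(A @ rng.standard_normal(p))

L = np.linalg.norm(A, 2) ** 2 / (4 * n)         # assumed bound on the Lipschitz constant of grad f
x = np.zeros(p)
for k in range(1000):
    x = x - (1.0 / L) * grad_f(A, y, x)          # fixed step size, as for BGD
print(np.linalg.norm(grad_f(A, y, x)))           # small gradient norm at the end
```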
Figure 16: Display of a large number of trajectories k 7→ xk ∈ R generated by several runs of SGD. On the
top row, each curve is a trajectory, and the bottom row displays the corresponding density.
Note that each step of a batch gradient descent has complexity O(np), while a step of SGD only has complexity O(p). SGD is thus advantageous when n is very large, and one cannot afford to do several passes through the data. In some situations, SGD can provide accurate results even with k ≪ n, exploiting the redundancy between the samples.
A crucial question is the choice of the step size schedule τ_k. It must tend to 0 in order to cancel the noise induced on the gradient by the stochastic sampling. But it should not go to zero too fast, in order for the method to keep converging.
A typical schedule that ensures both properties is to have asymptotically τ_k ∼ k^{−1} for k → +∞. We thus propose to use
τ_k def= τ_0 / (1 + k/k_0),   (49)
where k_0 indicates roughly the number of iterations serving as a “warmup” phase.
Figure 15: Schematic view of SGD iterates.
Figure 16 shows a simple 1-D example to minimize f_1(x) + f_2(x) for x ∈ R and f_1(x) = (x − 1)² and f_2(x) = (x + 1)². One can see how the density of the distribution of x_k progressively clusters around the minimizer x⋆ = 0. Here the distribution of x_0 is uniform on [−1/2, 1/2].
The following theorem shows the convergence in expectation with a 1/√k rate on the objective.
Theorem 2. We assume f is µ-strongly convex as defined in (S_µ) (i.e. µ Id_p ⪯ ∂²f(x) if f is C²), and is such that ||∇f_i(x)||² ⩽ C². For the step size choice τ_k = 1/(µ(k + 1)), one has
E(||x_k − x⋆||²) ⩽ R/(k + 1)  where  R = max(||x_0 − x⋆||², C²/µ²),   (50)
where E indicates an expectation with respect to the i.i.d. sampling performed at each iteration.
Figure 17: Evolution of the error of the SGD for logistic classification (dashed line shows BGD).
Considering only the expectation with respect to the random sampling of the index i(k), one has
where we used the fact (48) that the gradient is unbiased. Taking now the full expectation with respect to
all the other previous iterates, and using (51) one obtains
E(||xk+1 − x⋆ ||2 ) ⩽ E(||xk − x⋆ ||2 ) − 2µτk E(||xk − x⋆ ||2 ) + τk2 C 2 = (1 − 2µτk )E(||xk − x⋆ ||2 ) + τk2 C 2 . (52)
We show by recursion that the bound (50) holds. We denote ε_k def= E(||x_k − x⋆||²). For k = 0, it indeed holds that
ε_0 = ||x_0 − x⋆||² ⩽ max(||x_0 − x⋆||², C²/µ²) = R.
We now assume that ε_k ⩽ R/(k + 1). Using (52) in the case of τ_k = 1/(µ(k + 1)), one has, denoting m = k + 1,
ε_{k+1} ⩽ (1 − 2µτ_k)ε_k + τ_k²C² = (1 − 2/m)ε_k + C²/(µm)²
⩽ (1 − 2/m)(R/m) + R/m² = (1/m − 1/m²)R = ((m − 1)/m²) R ⩽ R/(m + 1),
where we used C²/µ² ⩽ R and (m − 1)(m + 1) = m² − 1 ⩽ m².
A weakness of SGD (as well as the SGA scheme studied next) is that it only weakly benefits from the strong convexity of f. This is in sharp contrast with BGD, which enjoys a fast linear rate for strongly convex functionals, see Theorem 1.
Figure 17 displays the evolution of the energy f(x_k). It overlays on top (black dashed curve) the convergence of the batch gradient descent, with a careful scaling of the number of iterations to account for the fact that the complexity of a batch iteration is n times larger.
This defines the Stochastic Gradient Descent with Averaging (SGA) algorithm.
Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:
x̃_{k+1} = (1/k) x_k + ((k − 1)/k) x̃_k.
In this case, a typical choice of step size decay is rather of the form
τ_k def= τ_0 / (1 + √(k/k_0)).
Notice that the step size now goes to 0 much more slowly, at rate k^{−1/2}.
Typically, because the averaging stabilizes the iterates, the choice of (k0 , τ0 ) is less important than for
SGD.
Bach proves that for logistic classification, SGA leads to a faster convergence than SGD (the constants involved are smaller), since in contrast to SGD, SGA is adaptive to the local strong convexity of E.
9 Automatic Differentiation
The main computational bottleneck of gradient descent methods (batch or stochastic) is the computation of gradients ∇f(x). For simple functionals, such as those encountered in ERM for linear models, and also for MLPs with a single hidden layer, it is possible to compute these gradients in closed form, and the main computational burden is the evaluation of matrix-vector products. For more complicated functionals (such as those involving deep networks), computing the formula for the gradient quickly becomes cumbersome. Even worse: computing these gradients using the usual chain rule formula is sub-optimal. We present methods to compute these gradients recursively and in an optimal manner. The purpose of this approach is to automate this computational step.
Figure 18: Evolution of log10 (f (xk ) − f (x⋆ )) for SGD, SGA and SAG.
∀ k = s + 1, . . . , t, xk = fk (x1 , . . . , xk−1 )
where f_k is a function which only depends on the previous variables, see Fig. 19. One can represent this algorithm using a directed acyclic graph (DAG), linking the variables involved in f_k to x_k. The nodes of this graph are thus conveniently ordered by their indexing, and the directed edges only link a variable to another one with a strictly larger index. The evaluation of f(x) thus corresponds to a forward traversal of this graph.
Note that the goal of automatic differentiation is not to define an efficient computational graph: it is up to the user to provide this graph. Computing an efficient graph associated to a mathematical formula is a complicated combinatorial problem, which still has to be solved by the user. Automatic differentiation thus leverages the availability of an efficient graph to provide an efficient algorithm to evaluate derivatives.

Figure 20: Relation between the variables for the forward (left) and backward (right) modes.
The forward mode computes, for a given input variable x_1, all the derivatives ∂x_k/∂x_1 by the recursion
∀ k = s + 1, . . . , t,  ∂x_k/∂x_1 = Σ_{ℓ∈parent(k)} [∂x_k/∂x_ℓ] × ∂x_ℓ/∂x_1.
The notation “parent(k)” denotes the nodes ℓ < k of the graph that are connected to k, see Figure 20, left. Here the quantities being computed (i.e. stored in computer variables) are the derivatives ∂x_ℓ/∂x_1, and × denotes in full generality matrix-matrix multiplications. We have put in [. . .] an informal notation, since here ∂x_k/∂x_ℓ should be interpreted not as a numerical variable but as the derivative of the function f_k, which can be evaluated on the fly (we assume that the derivatives of the functions involved are accessible in closed form).
Assuming all the involved functions ∂f_k/∂x_ℓ have the same complexity (which is likely to be the case if all the n_k are for instance scalar or have the same dimension), and that the number of parent nodes is bounded, one sees that the complexity of this scheme is p times the complexity of the evaluation of f (since this needs to be repeated p times for ∂/∂x_1, . . . , ∂/∂x_p). For a large p, this is prohibitive.
Figure 21: Example of a simple computational graph.
We consider the function
f(x, y) = y log(x) + √(y log(x)),   (53)
whose computational graph is displayed on Figure 21. The iterations of the forward mode to compute the derivative with respect to x read
∂x/∂x = 1,  ∂y/∂x = 0
∂a/∂x = [∂a/∂x] ∂x/∂x = (1/x) ∂x/∂x   {x ↦ a = log(x)}
∂b/∂x = [∂b/∂a] ∂a/∂x + [∂b/∂y] ∂y/∂x = y ∂a/∂x + 0   {(y, a) ↦ b = ya}
∂c/∂x = [∂c/∂b] ∂b/∂x = (1/(2√b)) ∂b/∂x   {b ↦ c = √b}
∂f/∂x = [∂f/∂b] ∂b/∂x + [∂f/∂c] ∂c/∂x = 1 · ∂b/∂x + 1 · ∂c/∂x   {(b, c) ↦ f = b + c}
One needs to run another forward pass to compute the derivative with respect to y:
∂x/∂y = 0,  ∂y/∂y = 1
∂a/∂y = [∂a/∂x] ∂x/∂y = 0   {x ↦ a = log(x)}
∂b/∂y = [∂b/∂a] ∂a/∂y + [∂b/∂y] ∂y/∂y = 0 + a ∂y/∂y   {(y, a) ↦ b = ya}
∂c/∂y = [∂c/∂b] ∂b/∂y = (1/(2√b)) ∂b/∂y   {b ↦ c = √b}
∂f/∂y = [∂f/∂b] ∂b/∂y + [∂f/∂c] ∂c/∂y = 1 · ∂b/∂y + 1 · ∂c/∂y   {(b, c) ↦ f = b + c}
Dual numbers. A convenient way to implement this forward pass is to make use of so-called “dual numbers”, which form an algebra over the reals where numbers have the form x + εx′, ε being a symbol obeying the rule ε² = 0. Here (x, x′) ∈ R² and x′ is intended to store a derivative with respect to some input variable. These numbers thus obey the following arithmetic operations:
(x + εx′)(y + εy′) = xy + ε(xy′ + yx′)  and  1/(x + εx′) = 1/x − ε x′/x².
If f is a polynomial or a rational function, from these rules one has that
f(x + ε) = f(x) + εf′(x).
Using this definition, one has that
f(g(x + ε)) = f(g(x) + εg′(x)) = f(g(x)) + εf′(g(x))g′(x),
which corresponds to the usual chain rule. More generally, if f(x_1, . . . , x_s) is a function implemented using these overloaded basic functions, one has
f(x_1 + ε, x_2, . . . , x_s) = f(x_1, . . . , x_s) + ε (∂f/∂x_1)(x_1, . . . , x_s),
and this evaluation is equivalent to applying the forward mode of automatic differentiation to compute (∂f/∂x_1)(x_1, . . . , x_s) (and similarly for the other variables).
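A minimal Python sketch of dual numbers (only the operations needed for the example (53) are overloaded; this is an illustration, not a full implementation):

```python
import math

class Dual:
    # Numbers of the form x + eps * dx with eps^2 = 0
    def __init__(self, x, dx=0.0):
        self.x, self.dx = x, dx
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.x + o.x, self.dx + o.dx)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.x * o.x, self.x * o.dx + self.dx * o.x)

def dlog(a):  return Dual(math.log(a.x), a.dx / a.x)
def dsqrt(a): return Dual(math.sqrt(a.x), a.dx / (2 * math.sqrt(a.x)))

def f(x, y):
    # f(x, y) = y log(x) + sqrt(y log(x)), the example (53)
    b = y * dlog(x)
    return b + dsqrt(b)

# Seeding dx = 1 on the first argument computes df/dx; idem for y
print(f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dx)   # df/dx at (2, 3)
print(f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dx)   # df/dy at (2, 3)
```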
The reverse mode instead computes the differentials ∂x_t/∂x_k, i.e. the derivative of the output node with respect to all the inner nodes.
The method initializes the derivative of the final node,
∂x_t/∂x_t = Id_{n_t×n_t},
and then iteratively makes use, from the last node to the first, of the following recursion formula
∀ k = t − 1, t − 2, . . . , 1,  ∂x_t/∂x_k = Σ_{m∈son(k)} ∂x_t/∂x_m × ∂x_m/∂x_k = Σ_{m∈son(k)} ∂x_t/∂x_m × ∂f_m(x_1, . . . , x_m)/∂x_k.
The notation “son(k)” denotes the nodes m > k of the graph that are connected to k, see Figure 20, right.
Back-propagation. In the special case where xt ∈ R, then ∂xt/∂xk = [∇xk f(x)]⊤ ∈ R^{1×nk}, and one can write the recursion on the gradient vectors as follows
∀ k = t − 1, t − 2, . . . , 1,   ∇xk f(x) = Σ_{m ∈ son(k)} (∂fm(x1, . . . , xm)/∂xk)⊤ (∇xm f(x)),
where (∂fm(x1, . . . , xm)/∂xk)⊤ ∈ R^{nk×nm} is the adjoint of the Jacobian of fm. This form of recursion using adjoints is often referred to as “back-propagation”, and is the most frequent setting in applications to ML.
In general, when nt = 1, the backward mode is the optimal way to compute the gradient of a function. Its drawback is that it necessitates the pre-computation of all the intermediate variables (xk)_{k=p}^{t}, which can be prohibitive in terms of memory usage when t is large. There exist check-pointing methods to alleviate this issue, but they are outside the scope of this course.
Figure 22: Complexity of forward (left) and backward (right) modes for composition of functions.
Simple example. We consider once again the function f of (53); the iterations of the reverse mode read
∂f/∂f = 1
∂f/∂c = (∂f/∂f)(∂f/∂c) = (∂f/∂f)·1                                       {c ↦ f = b + c}
∂f/∂b = (∂f/∂c)(∂c/∂b) + (∂f/∂f)(∂f/∂b) = (∂f/∂c)(1/(2√b)) + (∂f/∂f)·1   {b ↦ c = √b, b ↦ f = b + c}
∂f/∂a = (∂f/∂b)(∂b/∂a) = (∂f/∂b)·y                                       {a ↦ b = ya}
∂f/∂y = (∂f/∂b)(∂b/∂y) = (∂f/∂b)·a                                       {y ↦ b = ya}
∂f/∂x = (∂f/∂a)(∂a/∂x) = (∂f/∂a)·(1/x)                                   {x ↦ a = log(x)}
The advantage of the reverse mode is that a single traversal of the computational graph allows one to compute both derivatives with respect to x and y, while the forward mode necessitates two passes.
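For completeness, here is a minimal Python sketch of this reverse sweep (illustrative, not from the notes): the forward pass stores the intermediate values, and a single backward pass accumulates the derivative of f with respect to every node, yielding both input derivatives at once.

import math

def reverse_mode(x, y):
    # Forward sweep: evaluate and store all the nodes of the graph.
    a = math.log(x)
    b = y * a
    c = math.sqrt(b)
    f = b + c
    # Backward sweep: propagate df/d(node) from the output to the inputs.
    df_df = 1.0
    df_dc = df_df * 1.0                                # f = b + c
    df_db = df_dc / (2 * math.sqrt(b)) + df_df * 1.0   # c = sqrt(b) and f = b + c
    df_da = df_db * y                                  # b = y a
    df_dy = df_db * a                                  # b = y a
    df_dx = df_da / x                                  # a = log(x)
    return f, df_dx, df_dy

f_val, dfdx, dfdy = reverse_mode(2.0, 3.0)   # both derivatives in one traversal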
An important special case is when the computational graph is a simple chain, i.e. f is a feed-forward composition
f = ft ◦ ft−1 ◦ . . . ◦ f2 ◦ f1.    (54)
The forward evaluation computes, starting from the input x0 = x,
∀ k = 1, . . . , t,   xk = fk(xk−1),
and, denoting Ak = ∂fk(xk−1) the Jacobian of fk at xk−1, the chain rule gives
∂f(x) = At × At−1 × . . . × A2 × A1.
The forward (resp. backward) mode corresponds to the computation of this product of Jacobians from right to left (resp. left to right).
Figure 23: Computational graph for a feedforward architecture.
We note that the computation of the product A × B of A ∈ Rn×p with B ∈ Rp×q necessitates npq
operations. As shown on Figure 22, the complexity of the forward and backward modes are
Σ_{k=1}^{t−1} n0 nk nk+1   and   Σ_{k=0}^{t−2} nt nk nk+1.
So if nt ≪ n0 (which is the typical case in ML scenarios, where nt = 1), then the backward mode is cheaper.
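As a quick illustration (not from the notes; the dimensions are arbitrary), the following Python snippet evaluates the two sums above for a chain of Jacobians, showing that with nt = 1 the backward (left-to-right) ordering is far cheaper.

def forward_cost(dims):
    # dims = [n_0, n_1, ..., n_t]; right-to-left product accumulates an n_k x n_0 matrix.
    return sum(dims[0] * dims[k] * dims[k + 1] for k in range(1, len(dims) - 1))

def backward_cost(dims):
    # Left-to-right product accumulates an n_t x n_{k+1} matrix.
    return sum(dims[-1] * dims[k] * dims[k + 1] for k in range(0, len(dims) - 2))

dims = [1000, 500, 500, 500, 1]                  # n_0 = 1000 inputs, n_t = 1 output
print(forward_cost(dims), backward_cost(dims))   # 500500000 vs 1000000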
A feed-forward architecture computes, from an input x0 and parameters θ = (θk)_{k=0}^{t−1},
∀ k = 1, . . . , t,   xk = fk(xk−1, θk−1),    (55)
and the function to be differentiated is f(θ) = L(xt), where L : R^{nt} → R is some loss function (for instance a least square or logistic prediction risk). Figure 23, top, displays the associated computational graph.
One can use the reverse mode of automatic differentiation to compute the gradient of f by computing successively the gradients with respect to all the (xk, θk). One initializes
∇xt f = ∇L(xt)
and then, for k = t, t − 1, . . . , 1, iterates
zk−1 = [∂x fk(xk−1, θk−1)]⊤ zk   and   ∇θk−1 f = [∂θ fk(xk−1, θk−1)]⊤ (∇xk f),    (57)
where we denoted zk =def ∇xk f(θ) the gradient with respect to xk.
Multilayer perceptron. For instance, a feedforward deep network (fully connected for simplicity) corresponds to using
∀ xk−1 ∈ R^{nk−1},   fk(xk−1, θk−1) = ρ(θk−1 xk−1),    (58)
where θk−1 ∈ R^{nk×nk−1} are the neuron’s weights and ρ a fixed pointwise non-linearity, see Figure 24. One has, for a vector zk ∈ R^{nk} (typically equal to ∇xk f),
[∂x fk(xk−1, θk−1)]⊤ zk = θk−1⊤ (ρ′(θk−1 xk−1) ⊙ zk)   and   [∂θ fk(xk−1, θk−1)]⊤ zk = (ρ′(θk−1 xk−1) ⊙ zk) xk−1⊤,
where ⊙ denotes the entry-wise product and ρ′ is applied entry-wise.
Figure 24: Multi-layer perceptron parameterization.
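To make (57)–(58) concrete, here is an illustrative Python sketch (not from the notes; the choice ρ = tanh and the least-square loss L(xt) = ½∥xt − y∥² are arbitrary) of the forward pass and of the backward recursion.

import numpy as np

def rho(u):  return np.tanh(u)            # pointwise non-linearity (arbitrary choice)
def drho(u): return 1.0 - np.tanh(u)**2   # its entry-wise derivative

def mlp_loss_and_grads(theta, x0, y):
    # Forward pass (58): x_k = rho(theta_{k-1} x_{k-1}), storing all iterates.
    xs = [x0]
    for W in theta:
        xs.append(rho(W @ xs[-1]))
    loss = 0.5 * np.sum((xs[-1] - y) ** 2)   # L(x_t) = 1/2 ||x_t - y||^2
    # Backward recursion (57), using the adjoint formulas for (58).
    z = xs[-1] - y                           # z_t = gradient of L at x_t
    grads = [None] * len(theta)
    for k in range(len(theta), 0, -1):
        W, x_prev = theta[k - 1], xs[k - 1]
        s = drho(W @ x_prev) * z             # rho'(theta_{k-1} x_{k-1}) ⊙ z_k
        grads[k - 1] = np.outer(s, x_prev)   # gradient with respect to theta_{k-1}
        z = W.T @ s                          # z_{k-1}, gradient with respect to x_{k-1}
    return loss, grads

rng = np.random.default_rng(0)
theta = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
loss, grads = mlp_loss_and_grads(theta, rng.standard_normal(3), np.zeros(2))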
Link with adjoint state method. One can interpret (55) as a time discretization of a continuous ODE. One imposes that the dimension nk = n is fixed, and denotes by x(t) ∈ Rn a continuous time evolution, so that xk ≈ x(kτ) in the limit where k → +∞ and kτ → t. Imposing then the structure
fk (xk−1 , θk−1 ) = xk−1 + τ u(xk−1 , θk−1 , kτ ) (59)
where u(x, θ, t) ∈ Rn is a parameterized vector field, one obtains, as τ → 0, the non-linear ODE
ẋ(t) = u(x(t), θ(t), t) (60)
with x(t = 0) = x0 .
Denoting z(t) = ∇x(t) f(θ) the “adjoint” vector field, the discrete equations (62) become, in the limit τ → 0, the so-called adjoint equations, which form a linear ODE
ż(t) = −[∂x u(x(t), θ(t), t)]⊤ z(t)   and   ∇θ(t) f(θ) = [∂θ u(x(t), θ(t), t)]⊤ z(t).
Note that the correct normalization is (1/τ) ∇θk−1 f → ∇θ(t) f(θ).
Figure 26: Recurrent residual perceptron parameterization.
Similarly, writing h(x, θ) = x + τ u(x, θ), letting (k, kτ ) → (+∞, t), one obtains the forward non-linear ODE
with a time-stationary vector field
ẋ(t) = u(x(t), θ)
and the following linear backward adjoint equation, for f(θ) = L(x(T), θ),
ż(t) = −[∂x u(x(t), θ)]⊤ z(t)   and   ∇θ f(θ) = ∇θ L(x(T), θ) + ∫_0^T [∂θ u(x(t), θ)]⊤ z(t) dt.    (63)
Mitigating memory requirement. The main issue when applying this backpropagation method to compute ∇f(θ) is that it requires a large memory to store all the iterates (xk)_{k=0}^{t}. A workaround is to use checkpointing, which stores some of these intermediate results and partially re-runs the forward algorithm to reconstruct the missing values during the backward pass. Clever hierarchical methods perform this recursively in order to only require log(t) stored values and a log(t) increase in the numerical complexity.
In some situations, it is possible to avoid the storage of the forward results altogether, if one assumes that the algorithm can be run backward. This means that there exist functions gk such that
xk = gk(xk+1, . . . , xt).
In practice, this function typically also depends on a few extra variables, in particular on the input values (x0, . . . , xs).
An example of this situation is when one can split the (continuous time) variable as x(t) = (r(t), s(t))
and the vector field u in the continuous ODE (60) has a symplectic structure of the form u((r, s), θ, t) =
(F (s, θ, t), G(r, θ, t)). One can then use a leapfrog integration scheme, which defines
rk+1 = rk + τ F (sk , θk , τ k) and sk+1 = sk + τ G(rk+1 , θk+1/2 , τ (k + 1/2)).
One can reverse these equations exactly as
sk = sk+1 − τ G(rk+1, θk+1/2, τ(k + 1/2))   and   rk = rk+1 − τ F(sk, θk, τ k).
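A small Python check of this exact reversibility (illustrative; the fields F and G are arbitrary choices and the θ and time arguments are dropped for brevity): reversing the two updates in the opposite order recovers the previous state without storing it.

import numpy as np

def F(s): return -s          # arbitrary illustrative field
def G(r): return r ** 2      # arbitrary illustrative field

def leapfrog_forward(r, s, tau):
    r = r + tau * F(s)
    s = s + tau * G(r)
    return r, s

def leapfrog_backward(r, s, tau):
    # Undo the two updates in the reverse order: exact, no stored state needed.
    s = s - tau * G(r)
    r = r - tau * F(s)
    return r, s

r0, s0 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
r1, s1 = leapfrog_forward(r0, s0, tau=0.1)
rb, sb = leapfrog_backward(r1, s1, tau=0.1)
assert np.allclose(rb, r0) and np.allclose(sb, s0)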
Fixed point maps. In some applications (some of which are detailed below), the iterates xk converge to some x⋆(θ), which is thus a fixed point of the iterations.
Instead of applying the back-propagation to compute the gradient of f (θ) = L(xt , θ), one can thus apply the
implicit function theorem to compute the gradient of f ⋆ (θ) = L(x⋆ (θ), θ). Indeed, one has
∇f ⋆ (θ) = [∂x⋆ (θ)]⊤ (∇x L(x⋆ (θ), θ)) + ∇θ L(x⋆ (θ), θ). (64)
Using the implicit function theorem, applied to the equation h(x⋆(θ), θ) = 0 characterizing the fixed point, one can compute the Jacobian as
∂x⋆(θ) = − [∂h/∂x(x⋆(θ), θ)]^{−1} ∂h/∂θ(x⋆(θ), θ).
In practice, one replaces x⋆(θ) by xt in these formulas, which produces an approximation of ∇f(θ). The disadvantage of this method is that it requires the resolution of a linear system, but its advantage is that it bypasses the memory storage issue of the backpropagation algorithm.
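A scalar Python sketch of this strategy (illustrative; the fixed-point map, the loss and all names are invented for the example): iterate to convergence, obtain ∂x⋆(θ) from the implicit function theorem (here a 1×1 “linear system”), and plug it into (64).

import numpy as np

# Illustrative optimality condition h(x, theta) = x - 0.5*cos(x) - theta = 0,
# whose solution is the limit of the contraction x <- 0.5*cos(x) + theta.
def solve_fixed_point(theta, n_iter=100):
    x = 0.0
    for _ in range(n_iter):
        x = 0.5 * np.cos(x) + theta
    return x

def L(x, theta):
    return x ** 2          # illustrative loss evaluated at the fixed point

def grad_via_implicit(theta):
    x = solve_fixed_point(theta)        # in practice x_t is used in place of x*
    dh_dx  = 1.0 + 0.5 * np.sin(x)      # dh/dx at (x*, theta)
    dh_dth = -1.0                       # dh/dtheta
    dx_dth = -dh_dth / dh_dx            # implicit function theorem (1x1 solve)
    return dx_dth * 2.0 * x             # formula (64); here grad_theta L = 0

# Finite-difference check:
t0, eps = 0.3, 1e-6
fd = (L(solve_fixed_point(t0 + eps), t0 + eps)
      - L(solve_fixed_point(t0 - eps), t0 - eps)) / (2 * eps)
# grad_via_implicit(t0) ≈ fd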
Argmin layers. One can define a mapping from some parameter θ to a point x(θ) by solving a parametric optimization problem
x(θ) = argmin_x E(x, θ).
The simplest approach to solve this problem is to use a gradient descent scheme, initialized with x0 = 0 and iterating
xk+1 = xk − τ ∇x E(xk, θ).
This has the form (59) when using the vector field u(x, θ) = −∇x E(x, θ).
Using formula (64) in this case, where h = ∇x E, one obtains
∇f⋆(θ) = −[∂²E/∂x∂θ(x⋆(θ), θ)]⊤ [∂²E/∂x²(x⋆(θ), θ)]^{−1} (∇x L(x⋆(θ), θ)) + ∇θ L(x⋆(θ), θ).
In the special case where the function f(θ) is the minimized value itself, i.e. f(θ) = E(x⋆(θ), θ) (that is, L = E), one can still apply formula (64), which becomes much simpler since in this case ∇x L(x⋆(θ), θ) = 0, so that
∇f⋆(θ) = ∇θ L(x⋆(θ), θ).    (66)
This result is often called Danskin’s theorem or the envelope theorem.
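A short numerical illustration of (66) (not from the notes; the quadratic energy E and the names are arbitrary): the gradient of θ ↦ E(x⋆(θ), θ) is simply the partial gradient of E in θ, evaluated at the minimizer, with no differentiation through x⋆.

import numpy as np

def E(x, theta):         return 0.5 * np.dot(x, x) - np.dot(theta, x)   # illustrative energy
def grad_x_E(x, theta):  return x - theta
def grad_th_E(x, theta): return -x

def x_star(theta, n_iter=200, tau=0.1):
    x = np.zeros_like(theta)               # gradient descent from x_0 = 0
    for _ in range(n_iter):
        x = x - tau * grad_x_E(x, theta)
    return x

theta = np.array([0.3, -1.2, 2.0])
g = grad_th_E(x_star(theta), theta)        # formula (66)
# Finite-difference check along the first coordinate:
eps, e0 = 1e-6, np.array([1.0, 0.0, 0.0])
fd = (E(x_star(theta + eps * e0), theta + eps * e0)
      - E(x_star(theta - eps * e0), theta - eps * e0)) / (2 * eps)
# g[0] ≈ fd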
Sinkhorn’s algorithm. Sinkhorn’s algorithm approximates the optimal transport distance between two histograms a ∈ Rn and b ∈ Rm using the following recursion on multipliers, initialized as x0 =def (u0, v0) = (1n, 1m),
uk+1 = a/(K vk)   and   vk+1 = b/(K⊤ uk),
where ·/· is the pointwise division and K ∈ R_+^{n×m} is a kernel. Denoting θ =def (a, b) ∈ R^{n+m} and xk =def (uk, vk) ∈ R^{n+m}, the OT distance is then approximately equal to
f(θ) = E(xt, θ) =def ⟨a, log(ut)⟩ + ⟨b, log(vt)⟩ − ε⟨ut, K vt⟩.
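A small numpy sketch of these iterations (illustrative; the cost matrix, the regularization ε and the histograms are arbitrary choices), evaluating f(θ) with the expression above.

import numpy as np

def sinkhorn(a, b, K, n_iter=500):
    # Recursion u <- a/(K v), v <- b/(K^T u), initialized with vectors of ones.
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        u, v = a / (K @ v), b / (K.T @ u)
    return u, v

eps = 0.1
C = np.abs(np.linspace(0, 1, 5)[:, None] - np.linspace(0, 1, 4)[None, :])  # cost matrix
K = np.exp(-C / eps)                                                       # Gibbs kernel
a, b = np.ones(5) / 5, np.ones(4) / 4
u, v = sinkhorn(a, b, K)
f_val = a @ np.log(u) + b @ np.log(v) - eps * u @ (K @ v)   # approximate OT value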
One has the following differential operators
[∂x h(x, θ)]⊤ = −K⊤ diag(θ/(Kx)²),   [∂θ h(x, θ)]⊤ = diag(1/(Kx)).
As for the argmin layer, at convergence xk → x⋆(θ) one finds a minimizer of E, so that ∇x L(x⋆(θ), θ) = 0, and thus the gradient of f⋆(θ) = E(x⋆(θ), θ) can be computed using (66), i.e.