Foundations of Data Science
Gabriel Peyré
CNRS & DMA
École Normale Supérieure
[email protected]
https://mathematical-tours.github.io
www.numerical-tours.com
October 6, 2019
Chapter 1
Inverse Problems
y = Φf0 + w ∈ H
where w ∈ H models the acquisition noise. In this section, we do not use a random noise model, and simply
assume that ||w||H is bounded.
In most applications, H = R^P is finite dimensional, because the hardware involved in the acquisition
can only record a finite (and often small) number P of observations. Furthermore, in order to implement
a recovery process numerically on a computer, it also makes sense to restrict attention to S = R^N, where
N is the number of points on the discretization grid, and is usually very large, N ≫ P. However, in order to
perform a mathematical analysis of the recovery process, and to be able to introduce meaningful models on
the unknown f0, it still makes sense to consider infinite dimensional functional spaces (especially for the data
space S).
The difficulty of this problem is that the direct inversion of Φ is in general impossible or not advisable
because Φ^{−1} has a large norm or is even discontinuous. This is further aggravated by the addition of some
measurement noise w, so that the relation Φ^{−1}y = f0 + Φ^{−1}w would lead to an explosion of the noise Φ^{−1}w.
We now give a few representative examples of forward operators Φ.
Denoising. The case of the identity operator Φ = IdS , S = H corresponds to the classical denoising
problem, already treated in Chapters ?? and ??.
De-blurring and super-resolution. For a general operator Φ, the recovery of f0 is more challenging,
and this requires performing both an inversion and a denoising. For many problems, these two goals are in
contradiction, since inverting the operator usually increases the noise level. This is for instance the case for
the deblurring problem, where Φ is a translation invariant operator that corresponds to a low-pass filtering
with some kernel h
Φf = f ⋆ h. (1.1)
One can for instance consider this convolution over S = H = L²(T^d), see Proposition ??. In practice, this
convolution is followed by a sampling on a grid, Φf = {(f ⋆ h)(x_k) ; 0 ≤ k < P}, see Figure 1.1, middle, for
an example of a low resolution image Φf0. Inverting such an operator has important industrial applications, for instance to
upsample the content of digital photos and to compute high definition videos from low definition videos.
Interpolation and inpainting. Inpainting corresponds to interpolating missing pixels in an image. This
is modelled by a diagonal operator over the spatial domain
(Φf)(x) = 0 if x ∈ Ω, and (Φf)(x) = f(x) if x ∉ Ω, (1.2)
where Ω ⊂ [0, 1]^d (continuous model) or Ω ⊂ {0, . . . , N − 1} (discrete model) is the set of missing pixels. Figure 1.1,
right, shows an example of a damaged image Φf0.
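As a minimal sketch of the masking operator (1.2) in the discrete setting, the following Python/NumPy snippet applies Φ to an image. The helper name `inpainting_operator`, the random mask and the image size are illustrative assumptions, not part of the original text. The snippet also checks numerically that Φ is an orthogonal projector (Φ = Φ* = Φ²), which is why the values on Ω cannot be recovered by a direct inversion.

```python
import numpy as np

def inpainting_operator(f, missing):
    """Masking operator of (1.2): zero on the missing set, identity elsewhere.

    `missing` is a boolean array (hypothetical name) that is True on Omega.
    """
    return np.where(missing, 0.0, f)

rng = np.random.default_rng(0)
f0 = rng.standard_normal((8, 8))      # stand-in for the unknown image
g = rng.standard_normal((8, 8))       # second image, used to test self-adjointness
missing = rng.random((8, 8)) < 0.3    # Omega: roughly 30% of the pixels

y = inpainting_operator(f0, missing)  # damaged observation Phi f0

# Phi is idempotent and self-adjoint, i.e. an orthogonal projector.
is_idempotent = np.allclose(inpainting_operator(y, missing), y)
lhs = np.sum(inpainting_operator(f0, missing) * g)   # <Phi f0, g>
rhs = np.sum(f0 * inpainting_operator(g, missing))   # <f0, Phi g>
```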
Medical imaging. Most medical imaging acquisition devices only give indirect access to the signal of
interest, and are usually well approximated by such a linear operator Φ. In scanners, the acquisition operator is
the Radon transform, which, thanks to the Fourier slice theorem, is equivalent to partial Fourier measurements
along radial lines. Magnetic resonance imaging (MRI) is also equivalent to partial Fourier measurements
Φf = { f̂(x) ; x ∈ Ω }. (1.3)
Here, Ω is a set of radial lines for a scanner, and of smooth curves (e.g. spirals) for MRI.
Other indirect measurements are obtained from electric or magnetic field measurements of the brain activity
(corresponding to MEG/EEG). Denoting Ω ⊂ R³ the region around which measurements are performed (e.g.
the head), in a crude approximation of these measurements, one can assume Φf = {(ψ ⋆ f)(x) ; x ∈ ∂Ω}
where ψ(x) is a kernel accounting for the decay of the electric or magnetic field, e.g. ψ(x) = 1/||x||².
Regression for supervised learning. While the focus of this chapter is on imaging science, a closely related
problem is supervised learning using a linear model. The typical notations associated to this problem are
usually different, which causes confusion. This problem is detailed in Chapter ??, which draws connections
between regression and inverse problems. In statistical learning, one observes n pairs (x_i, y_i)_{i=1}^n,
where the features are x_i ∈ R^p. One seeks a linear prediction model of the form y_i = ⟨β, x_i⟩ where
the unknown parameter is β ∈ R^p. Storing all the x_i as rows of a matrix X ∈ R^{n×p}, supervised learning
aims at approximately solving Xβ ≈ y. The problem is similar to the inverse problem Φf = y where one
performs the change of variable Φ → X and f → β, with dimensions (P, N) → (n, p). In statistical learning,
one does not assume a well specified model y = Φf0 + w, and the major difference is that the matrix X
is random, which adds extra “noise” which needs to be controlled as n → +∞. The recovery is performed by
the normalized ridge regression problem
min_β (1/(2n)) ||Xβ − y||² + λ||β||²
so that the natural change of variable should be (1/n) X*X ∼ Φ*Φ (empirical covariance) and (1/n) X*y ∼ Φ*y.
The law of large numbers shows that (1/n) X*X and (1/n) X*y are contaminated by a noise of amplitude 1/√n,
which plays the role of ||w||.
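To make the change of variable concrete, here is a hedged sketch of the normalized ridge regression solved through its normal equations. Setting the gradient (1/n) X*(Xβ − y) + 2λβ to zero gives ((1/n) X*X + 2λ Id) β = (1/n) X*y. The function name `ridge`, the synthetic data and the value of λ are assumptions made for illustration only.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (1/(2n)) ||X beta - y||^2 + lam ||beta||^2 via the normal equations."""
    n, p = X.shape
    C = X.T @ X / n          # empirical covariance, plays the role of Phi^* Phi
    u = X.T @ y / n          # empirical correlation, plays the role of Phi^* y
    return np.linalg.solve(C + 2.0 * lam * np.eye(p), u)

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))                  # random design matrix
beta0 = rng.standard_normal(p)                   # ground-truth parameter (synthetic)
y = X @ beta0 + 0.1 * rng.standard_normal(n)     # noisy observations

lam = 1e-3
beta = ridge(X, y, lam)

# Gradient of the ridge objective at the computed solution (should vanish).
grad = X.T @ (X @ beta - y) / n + 2.0 * lam * beta
```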
Proof. We first analyze the problem, and notice that if Φ = UΣV^⊤ with Σ = diag_m(σ_m), then ΦΦ^⊤ =
UΣ²U^⊤ and then V^⊤ = Σ^{−1}U^⊤Φ. We can use this insight. Since ΦΦ^⊤ is a positive symmetric matrix,
we write its eigendecomposition as ΦΦ^⊤ = UΣ²U^⊤ where Σ = diag_{m=1}^R(σ_m) with σ_m > 0. We then define
V := Φ^⊤UΣ^{−1}. One then verifies that
V^⊤V = (Σ^{−1}U^⊤Φ)(Φ^⊤UΣ^{−1}) = Σ^{−1}U^⊤(UΣ²U^⊤)UΣ^{−1} = Id_R and UΣV^⊤ = UΣΣ^{−1}U^⊤Φ = Φ.
This theorem is still valid with complex matrices, replacing ⊤ by ∗. Expression (1.4) describes Φ as a sum
of rank-1 matrices u_m v_m^⊤. One usually orders the singular values (σ_m)_m in decaying order σ_1 ≥ . . . ≥ σ_R. If
these values are distinct, then the SVD is unique up to ±1 sign changes on the singular vectors.
The left singular vectors U form an orthonormal basis of Im(Φ), while the right singular vectors V form an
orthonormal basis of Im(Φ^⊤) = ker(Φ)^⊥. The decomposition (1.4) is often called the “reduced” SVD because
one has only kept the R non-zero singular values. The “full” SVD is obtained by completing U and V to
define orthonormal bases of the full spaces R^P and R^N. Then Σ becomes a rectangular matrix of size P × N.
A typical example is for Φf = f ⋆ h over R^P = R^N, in which case the Fourier transform diagonalizes the
convolution, i.e.
Φ = (u_m)_m^∗ diag(ĥ_m) (u_m)_m (1.5)
where (u_m)_n := (1/√N) e^{(2iπ/N) nm}, so that the singular values are σ_m = |ĥ_m| (removing the zero values) and the
singular vectors are (u_m)_m and (θ_m u_m)_m where θ_m := |ĥ_m|/ĥ_m is a unit complex number.
Computing the SVD of a full matrix Φ ∈ R^{N×N} has complexity O(N³).
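The diagonalization of a periodic convolution by the Fourier transform can be checked numerically. The sketch below, which assumes a random filter `h` and a small size N chosen only for illustration, builds the circulant matrix of Φf = f ⋆ h, compares Φf with the FFT-based product, and verifies that the singular values returned by a dense SVD coincide with the sorted moduli |ĥ_m|.

```python
import numpy as np

N = 16
rng = np.random.default_rng(0)
h = rng.standard_normal(N)            # arbitrary filter (illustrative)
f = rng.standard_normal(N)            # arbitrary signal (illustrative)

# Circulant matrix of the periodic convolution: (Phi f)[n] = sum_k h[(n - k) mod N] f[k]
idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
Phi = h[idx]

# The same convolution computed as a pointwise product of Fourier transforms.
conv_fft = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f)))

# Singular values from a dense SVD versus the moduli of the Fourier coefficients of h.
sigma = np.linalg.svd(Phi, compute_uv=False)           # sorted in decreasing order
h_hat_abs = np.sort(np.abs(np.fft.fft(h)))[::-1]
```

The dense SVD costs O(N³) operations, whereas the Fourier route only requires an FFT of h.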
Compact operators. One can extend the decomposition to compact operators Φ : S → H between
separable Hilbert spaces. A compact operator is such that ΦB_1 is pre-compact where B_1 = {s ∈ S ; ||s|| ≤ 1}
is the unit ball. This means that from any sequence (Φs_k)_k where s_k ∈ B_1 one can extract a converging
sub-sequence. Note that in infinite dimension, the identity operator Φ : S → S is never compact.
Compact operators Φ can be shown to be equivalently defined as those for which an expansion of the
form (1.4) holds
Φ = Σ_{m=1}^{+∞} σ_m u_m v_m^⊤ (1.6)
where (σ_m)_m is a decaying sequence converging to 0, σ_m → 0. Here in (1.6) convergence holds in the operator
norm, which is the algebra norm on linear operators inherited from those of S and H
||Φ||_{L(S,H)} := max_{||u||_S ≤ 1} ||Φu||_H.
where dy is the Lebesgue measure. An example of such a setting which generalizes (1.5) is when Φf = f ⋆ h
on T^d = (R/2πZ)^d, which corresponds to a translation invariant kernel k(x, y) = h(x − y), in which case
u_m(x) = (2π)^{−d/2} e^{iωx}, σ_m = |ĥ_m|. Another example on Ω = [0, 1] is the integration, (Φf)(x) = ∫_0^x f(y)dy,
which corresponds to k being the indicator of the “triangle”, k(x, y) = 1_{x≤y}.
Pseudo inverse. In the case where w = 0, it makes sense to try to directly solve Φf = y. The two obstructions
to this are that one does not necessarily have y ∈ Im(Φ) and, even so, there is an infinite number of solutions if
ker(Φ) ≠ {0}. The usual workaround is to solve this equation in the least squares sense
f⁺ := argmin_{Φf = y⁺} ||f||_S where y⁺ := Proj_{Im(Φ)}(y) = argmin_{z ∈ Im(Φ)} ||y − z||_H.
The following proposition shows how to compute this least squares solution using the SVD and by solving
linear systems involving either ΦΦ* or Φ*Φ.
Proposition 2. One has
f⁺ = Φ⁺y where Φ⁺ = V diag_m(1/σ_m) U*. (1.7)
In case that Im(Φ) = H, one has Φ⁺ = Φ*(ΦΦ*)^{−1}. In case that ker(Φ) = {0}, one has Φ⁺ = (Φ*Φ)^{−1}Φ*.
Proof. Since U is an ortho-basis of Im(Φ), y⁺ = UU*y, and thus Φf = y⁺ reads UΣV*f = UU*y and
hence V*f = Σ^{−1}U*y. Decomposing orthogonally f = f0 + r where f0 ∈ ker(Φ)^⊥ and r ∈ ker(Φ), one
has that f0 = VV*f = VΣ^{−1}U*y = Φ⁺y is fixed (it does not depend on r). Minimizing ||f||² = ||f0||² + ||r||² is thus equivalent
to minimizing ||r|| and hence r = 0, which is the desired result. If Im(Φ) = H, then R = P so that
ΦΦ* = UΣ²U* is the eigen-decomposition of an invertible matrix and (ΦΦ*)^{−1} = UΣ^{−2}U*. One then verifies
Φ*(ΦΦ*)^{−1} = VΣU*UΣ^{−2}U* = VΣ^{−1}U*, which is the desired result. One deals similarly with the second case.
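Proposition 2 can be illustrated numerically. The following sketch builds Φ⁺ from the reduced SVD as in (1.7) and compares it with NumPy's `np.linalg.pinv` and with the closed form Φ*(ΦΦ*)^{−1}, which applies here because a generic random matrix with P < N is surjective. The helper name `pinv_svd`, the relative threshold `tol` and the sizes are assumptions of this example.

```python
import numpy as np

def pinv_svd(Phi, tol=1e-10):
    """Pseudo-inverse V diag(1/sigma_m) U^* keeping only the non-zero singular values."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    keep = s > tol * s.max()                       # reduced SVD: drop (numerically) zero values
    return (Vt[keep].T / s[keep]) @ U[:, keep].T

rng = np.random.default_rng(0)
P, N = 4, 7
Phi = rng.standard_normal((P, N))   # generic matrix of rank P, hence Im(Phi) = R^P
y = rng.standard_normal(P)

Phi_plus = pinv_svd(Phi)
f_plus = Phi_plus @ y               # minimal norm solution of Phi f = y

# Any other solution differs from f_plus by an element r of ker(Phi) and has a larger norm.
v = rng.standard_normal(N)
r = v - Phi_plus @ (Phi @ v)        # projection of v on ker(Phi)
```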
For convolution operators Φf = f ⋆ h, one has
Φ⁺y = y ⋆ h⁺ where ĥ⁺_m = ĥ_m^{−1} if ĥ_m ≠ 0, and ĥ⁺_m = 0 if ĥ_m = 0.
so that the recovery error is ||Φ⁺y − f0⁺|| = ||Φ⁺w||. This quantity can be as large as ||w||/σ_R if w ∝ u_R. The
noise is thus amplified by the inverse 1/σ_R of the smallest non-zero singular value, which can be
very large. In infinite dimension, one typically has R = +∞, so that the inverse is actually not bounded
(discontinuous). It is thus mandatory to replace Φ⁺ by a regularized approximate inverse, which should have
the form
Φ⁺_λ = V diag_m(µ_λ(σ_m)) U* (1.8)
where µ_λ, indexed by some parameter λ > 0, is a regularization of the inverse, that should typically satisfy
µ_λ(σ) ≤ C_λ < +∞ and lim_{λ→0} µ_λ(σ) = 1/σ.
Figure 1.2, left, shows a typical example of such a regularized inverse curve, obtained by thresholding.
where J is some regularization functional which should at least be continuous on S. The simplest example
is the quadratic norm J = ||·||²_S,
f_λ := argmin_{f ∈ S} ||y − Φf||²_H + λ||f||² (1.10)
which is indeed a special case of (1.8) as proved in Proposition 3 below. In this case, the regularized solution
is obtained by solving a linear system
which is the desired expression since (Σ² + λ)^{−1}Σ = diag(µ_λ(σ_m))_m.
A special case is when Φf = f ⋆ h is a convolution operator. In this case, the regularized inverse is
computed in O(N log(N)) operations using the FFT as follows
f̂_{λ,m} = ĥ*_m ŷ_m / (|ĥ_m|² + λ).
Figure 1.2 contrasts the regularization curve associated to quadratic regularization (1.11) (right) to the
simpler thresholding curve (left).
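The FFT-based inversion can be sketched as follows for a 1-D periodic deconvolution. The code computes the minimizer (Φ*Φ + λ Id)^{−1}Φ*y of (1.10), which for a circular convolution reduces to a pointwise division by |ĥ_m|² + λ in the Fourier domain, and compares it with a dense linear solve. The Gaussian kernel, the noise level, the value of λ and the function name `tikhonov_deconv` are assumptions chosen for illustration.

```python
import numpy as np

def tikhonov_deconv(y, h, lam):
    """Quadratic (Tikhonov) deconvolution in O(N log N) using the FFT."""
    h_hat = np.fft.fft(h)
    y_hat = np.fft.fft(y)
    f_hat = np.conj(h_hat) * y_hat / (np.abs(h_hat) ** 2 + lam)
    return np.real(np.fft.ifft(f_hat))

N = 64
n = np.arange(N)
dist = np.minimum(n, N - n)                    # periodic distance to the origin
h = np.exp(-dist**2 / (2 * 2.0**2))            # periodic Gaussian low-pass kernel
h /= h.sum()

f0 = (np.abs(n - N / 2) < 8).astype(float)     # piecewise constant test signal
rng = np.random.default_rng(0)
y = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f0))) + 0.01 * rng.standard_normal(N)

lam = 1e-2
f_lam = tikhonov_deconv(y, h, lam)

# Reference: the same solution obtained by solving (Phi^T Phi + lam Id) f = Phi^T y densely.
Phi = h[(n[:, None] - n[None, :]) % N]
f_dense = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
```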
The question is to understand how to choose λ as a function of the noise level ||w||_H in order to guarantee
that f_λ → f0 and furthermore establish convergence speed. One first needs to ensure at least f0 = f0⁺, which
in turn requires that f0 ∈ Im(Φ*) = ker(Φ)^⊥. Indeed, an important drawback of linear recovery methods
(such as quadratic regularization) is that necessarily f_λ ∈ Im(Φ*) = ker(Φ)^⊥ so that no information can
be recovered inside ker(Φ). Non-linear methods must be used to achieve a “super-resolution” effect and
recover this missing information.
Source condition. In order to ensure convergence speed, one quantifies this condition and imposes a so-called
source condition of order β, which reads
In some sense, the larger β, the farther f0 is away from ker(Φ), and thus the “easier” the inversion problem. This
condition means that there should exist z ∈ R^P such that f0 = V diag(σ_m^{2β}) V*z, i.e. z = V diag(σ_m^{−2β}) V*f0.
In order to control the strength of this source condition, we assume ||z|| ≤ ρ where ρ > 0. The source
condition thus corresponds to the following constraint
Σ_m σ_m^{−2β} ⟨f0, v_m⟩² ≤ ρ² < +∞. (S_{β,ρ})
This is a Sobolev-type constraint, similar to those imposed in ??. A prototypical example is for a low-pass
filter Φf = f ⋆ h where h has a slow polynomial-like decay of frequency, i.e. |ĥ_m| ∼ 1/m^α for large m. In this
case, since v_m is the Fourier basis, the source condition (S_{β,ρ}) reads
Σ_m ||m||^{2αβ} |f̂_m|² ≤ ρ² < +∞,
Sublinear convergence speed. The following theorem shows that this source condition leads to a convergence
speed of the regularization. Imposing a bound ||w|| ≤ δ on the noise, the theoretical analysis of the
inverse problem thus depends on the parameters (δ, ρ, β). Assuming f0 ∈ ker(Φ)^⊥, the goal of the theoretical
analysis corresponds to studying the speed of convergence of f_λ toward f0, when using y = Φf0 + w as δ → 0.
This requires deciding how λ should depend on δ.
Theorem 1. Assuming the source condition (S_{β,ρ}) with 0 < β ≤ 2, then the solution of (1.10) for ||w|| ≤ δ
satisfies
||f_λ − f0|| ≤ C ρ^{1/(β+1)} δ^{β/(β+1)}
for a constant C which depends only on β, and for a choice
λ ∼ δ^{2/(β+1)} ρ^{−2/(β+1)}.
f_λ = Φ⁺_λ(Φf0 + w) = f_λ⁰ + Φ⁺_λ w where f_λ⁰ := Φ⁺_λ(Φf0),
Figure 1.2: Bounding µ_λ(σ) ≤ C_λ = 1/(2√λ).
so that f_λ = f_λ⁰ + Φ⁺_λ w, one has for any regularized inverse of the form (1.8)
The term ||f_λ − f_λ⁰|| is a variance term which accounts for residual noise, and thus decays when λ increases
(more regularization). The term ||f_λ⁰ − f0|| is independent of the noise, it is a bias term coming from the
approximation (smoothing) of f0, and thus increases when λ increases. The choice of an optimal λ thus
results in a bias-variance tradeoff between these two terms. Assuming
∀ σ > 0, µ_λ(σ) ≤ C_λ
The bias term is bounded as, since f_{0,m}²/σ_m^{2β} = z_m²,
||f_λ⁰ − f0||² = Σ_m (1 − µ_λ(σ_m)σ_m)² f_{0,m}² = Σ_m ((1 − µ_λ(σ_m)σ_m) σ_m^β)² f_{0,m}²/σ_m^{2β} ≤ D_{λ,β}² ρ² (1.14)
where we assumed
∀ σ > 0, (1 − µ_λ(σ)σ)σ^β ≤ D_{λ,β}. (1.15)
Note that for β > 2, one has D_{λ,β} = +∞. Putting (1.14) and (1.15) together, one obtains
In the case of the regularization (1.10), one has µ_λ(σ) = σ/(σ² + λ), and thus (1 − µ_λ(σ)σ)σ^β = λσ^β/(σ² + λ). For β ≤ 2,
one verifies (see Figures 1.2 and 1.3) that
C_λ = 1/(2√λ) and D_{λ,β} = c_β λ^{β/2},
for some constant c_β. Equalizing the contributions of the two terms in (1.16) (a better constant would be
reached by finding the best λ) leads to selecting δ/√λ = λ^{β/2} ρ, i.e. λ = (δ/ρ)^{2/(β+1)}. With this choice,
||f_λ − f0|| = O(δ/√λ) = O(δ(δ/ρ)^{−1/(β+1)}) = O(δ^{β/(β+1)} ρ^{1/(β+1)}).
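The two constants entering this bias-variance tradeoff can be checked numerically for the quadratic regularization µ_λ(σ) = σ/(σ² + λ). The sketch below evaluates the suprema on a logarithmic grid of σ (the grid, the values of λ, δ, ρ and the function names are assumptions of this example). For β = 1 a direct computation gives c_β = 1/2, and with λ = (δ/ρ)^{2/(β+1)} the sum of the two bounds C_λ δ + D_{λ,β} ρ equals (δρ)^{1/2}.

```python
import numpy as np

def mu(sigma, lam):
    """Regularized inverse curve of the quadratic regularization (1.10)."""
    return sigma / (sigma**2 + lam)

def bias_factor(sigma, lam, beta):
    """(1 - mu(sigma) sigma) sigma^beta = lam sigma^beta / (sigma^2 + lam)."""
    return lam * sigma**beta / (sigma**2 + lam)

lam = 1e-2
sigma = np.logspace(-4, 2, 6001)                   # grid of singular values (illustrative)

C_num = mu(sigma, lam).max()                       # expected: 1 / (2 sqrt(lam))
D_num = bias_factor(sigma, lam, beta=1.0).max()    # expected for beta = 1: sqrt(lam) / 2

# Parameter choice balancing variance and bias for beta = 1.
delta, rho, beta = 1e-3, 1.0, 1.0
lam_star = (delta / rho) ** (2 / (beta + 1))
bound = delta / (2 * np.sqrt(lam_star)) + 0.5 * lam_star ** (beta / 2) * rho
```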
Figure 1.3: Bounding λσ^β/(λ + σ²) ≤ D_{λ,β}.
This theorem shows that using a larger β ≤ 2 leads to faster convergence rates as ||w|| drops to zero. The
rate (1.13) however suffers from a “saturation” effect: indeed, choosing β > 2 does not help (it gives the
same rate as β = 2), and the best possible rate is thus
||f_λ − f0|| = O(ρ^{1/3} δ^{2/3}).
By choosing alternative regularization functions µ_λ and choosing β large enough, one can show that
it is possible to reach a rate ||f_λ − f0|| = O(δ^{1−κ}) for an arbitrarily small κ > 0. Figure 1.2 contrasts the
regularization curve associated to quadratic regularization (1.11) (right) to the simpler thresholding curve
(left), which does not suffer from saturation. Quadratic regularization however is much simpler to implement
because it does not need to compute an SVD, is defined using a variational optimization problem and is
computable as the solution of a linear system. One cannot however reach a linear rate ||f_λ − f0|| = O(||w||).
Such rates are achievable using non-linear sparse ℓ¹ regularizations as detailed in Chapter ??.
Convex regularization. Following (1.9), the ill-posed problem of recovering an approximation of the
high resolution image f0 ∈ R^N from noisy measurements y = Φf0 + w ∈ R^P is regularized by solving a convex
optimization problem
f_λ ∈ argmin_{f ∈ R^N} E(f) := (1/2)||y − Φf||² + λJ(f) (1.17)
where ||y − Φf||² is the data fitting term (here ||·|| is the ℓ² norm on R^P) and J(f) is a convex functional on
R^N.
The Lagrange multiplier λ weights the importance of these two terms, and is in practice difficult to
set. Simulations can be performed on a high resolution signal f0 to calibrate the multiplier by minimizing the
super-resolution error ||f0 − f̃||, but this is usually difficult to do on real life problems.
In the case where there is no noise, w = 0, the Lagrange multiplier λ should be set as small as possible. In
the limit where λ → 0, the unconstrained optimization problem (1.17) becomes a constrained optimization,
as the following proposition explains. Let us stress that, without loss of generality, we can assume that
y ∈ Im(Φ), because one has the orthogonal decomposition
lim_{||f||→+∞} J(f) = +∞
i.e.
∀ K, ∃ R, ||f|| ≥ R ⟹ |J(f)| ≥ K.
This means that its non-empty level sets {f ; J(f) ≤ c} are bounded (and hence compact) for all c.
Proposition 4. We assume that J is coercive and that y ∈ Im(Φ). Then, if for each λ, f_λ is a solution of (1.17),
then (f_λ)_λ is a bounded set and any accumulation point f⋆ is a solution of
Proof. Denoting h any solution to (1.18), which in particular satisfies Φh = y, because of the optimality
condition of f_λ for (1.17), one has
(1/(2λ))||Φf_λ − y||² + J(f_λ) ≤ (1/(2λ))||Φh − y||² + J(h) = J(h).
This shows that J(f_λ) ≤ J(h) so that, since J is coercive, the set (f_λ)_λ is bounded and thus one can consider an
accumulation point f_{λ_k} → f⋆ for k → +∞. Since ||Φf_{λ_k} − y||² ≤ λ_k J(h), one has in the limit Φf⋆ = y, so that
f⋆ satisfies the constraints in (1.18). Furthermore, by continuity of J, passing to the limit in J(f_{λ_k}) ≤ J(h),
one obtains J(f⋆) ≤ J(h) so that f⋆ is a solution of (1.18).
Note that it is possible to extend this proposition to the case where J is not necessarily coercive on the
full space (for instance the TV functionals in Section 1.4.1 below) but on the orthogonal to ker(Φ). The
proof is more difficult.
Quadratic Regularization. The simplest class of prior functionals are quadratic, and can be written as
J(f) = (1/2)||Gf||²_{R^K} = (1/2)⟨Lf, f⟩_{R^N} (1.19)
where G ∈ R^{K×N} and where L = G*G ∈ R^{N×N} is a positive semi-definite matrix. The special case (1.10) is
recovered when setting G = L = Id_N.
Writing down the first order optimality conditions for (1.17) leads to
hence, if
ker(Φ) ∩ ker(G) = {0},
then (1.19) has a unique minimizer f_λ, which is obtained by solving a linear system
Φf = h ⋆ f, (1.22)
and using G = ∇, a discretization of the gradient operator, such as for instance using first order finite
differences (??). This corresponds to the discrete Sobolev prior introduced in Section ??. Such an operator
computes, for a d-dimensional signal f ∈ R^N (for instance a 1-D signal for d = 1 or an image when d = 2), an
approximation ∇f_n ∈ R^d of the gradient vector at each sample location n. Thus typically, ∇ : f → (∇f_n)_n ∈
R^{N×d} maps to d-dimensional vector fields. Then −∇* : R^{N×d} → R^N is a discretized divergence operator. In
this case, ∆ = −G*G is a discretization of the Laplacian, which is itself a convolution operator. One then
has
f̂_{λ,m} = ĥ*_m ŷ_m / (|ĥ_m|² − λ d̂_{2,m}), (1.23)
where d̂_2 is the Fourier transform of the filter d_2 corresponding to the Laplacian. For instance, in dimension
1, using first order finite differences, the expression for d̂_{2,m} is given in (??).
Af = b where A := Φ*Φ + λL and b := Φ*y.
It is possible to solve this linear system exactly with direct methods for moderate N (up to a few thousands),
and the numerical complexity for a generic A is O(N³). Since the involved matrix A is symmetric, the
best option is to use a Cholesky factorization A = BB* where B is lower-triangular. In favorable cases, this
factorization (possibly with some re-ordering of the rows and columns) can take advantage of some sparsity
in A.
For large N, such exact resolution is not an option, and one should use approximate iterative solvers, which
only access A through matrix-vector multiplications. This is especially advantageous for imaging applications,
where such multiplications are in general much faster than a naive O(N²) explicit computation. If the matrix
A is highly sparse, this typically necessitates O(N) operations. In the case where A is symmetric and positive
definite (which is the case here), the most well known method is the conjugate gradient method, which is
actually an optimization method solving
min_{f ∈ R^N} E(f) := Q(f) := (1/2)⟨Af, f⟩ − ⟨f, b⟩ (1.24)
which is equivalent to the initial minimization (1.17). Instead of doing a naive gradient descent (as studied
in Section 2.1 below), starting from an arbitrary f^(0), it computes a new iterate f^(ℓ+1) from the previous
iterates as
f^(ℓ+1) := argmin_f { E(f) ; f ∈ f^(ℓ) + Span(∇E(f^(0)), . . . , ∇E(f^(ℓ))) }.
The crucial and remarkable fact is that this minimization can be computed in closed form at the cost of two
matrix-vector products per iteration, for ℓ ≥ 1 (posing initially d^(0) = ∇E(f^(0)) = Af^(0) − b)
f^(ℓ+1) = f^(ℓ) − τ_ℓ d^(ℓ) where d^(ℓ) = g^(ℓ) + (||g^(ℓ)||²/||g^(ℓ−1)||²) d^(ℓ−1) and τ_ℓ = ⟨g^(ℓ), d^(ℓ)⟩/⟨Ad^(ℓ), d^(ℓ)⟩ (1.25)
where g^(ℓ) := ∇E(f^(ℓ)) = Af^(ℓ) − b. It can also be shown that the directions d^(ℓ) are orthogonal for the inner
product induced by A (they are “conjugate”), so that after
ℓ = N iterations, the conjugate gradient computes the unique solution f^(ℓ) of the linear system Af = b. It is
however rarely used this way (as an exact solver), and in practice far fewer than N iterates are computed.
It should also be noted that iterations (1.25) can be carried over for an arbitrary smooth convex function
E, and it typically improves over the gradient descent (although in practice quasi-Newton methods are often
preferred).
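The iterations (1.25) can be sketched in a few lines. The code below is a minimal illustration for the quadratic energy (1/2)⟨Af, f⟩ − ⟨f, b⟩, whose gradient is Af − b, with A = Φ*Φ + λ Id accessed only through a matrix-vector routine. The names `conjugate_gradient` and `A_mul`, the random Φ, the sizes and the stopping tolerance are assumptions of this example, and the gradient is updated recursively rather than recomputed.

```python
import numpy as np

def conjugate_gradient(A_mul, b, f_init, n_iter, tol=1e-10):
    """Conjugate gradient for A f = b, A symmetric positive definite, given f -> A f."""
    f = f_init.copy()
    g = A_mul(f) - b                 # gradient g^(0) = A f^(0) - b
    d = g.copy()                     # initial direction d^(0) = g^(0)
    for _ in range(n_iter):
        g_norm2 = g @ g
        if np.sqrt(g_norm2) <= tol:  # gradient small enough: stop
            break
        Ad = A_mul(d)
        tau = (g @ d) / (d @ Ad)     # optimal step along d
        f = f - tau * d
        g_new = g - tau * Ad         # equals A f_new - b, without a second product
        d = g_new + (g_new @ g_new) / g_norm2 * d
        g = g_new
    return f

rng = np.random.default_rng(0)
P, N, lam = 20, 30, 0.1
Phi = rng.standard_normal((P, N))
y = rng.standard_normal(P)

A_mul = lambda f: Phi.T @ (Phi @ f) + lam * f   # A = Phi^T Phi + lam Id, never formed explicitly
b = Phi.T @ y

f_cg = conjugate_gradient(A_mul, b, np.zeros(N), n_iter=100)
f_ref = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), b)   # dense reference solution
```

Because the gradient is updated as g − τ Ad, this variant uses a single product with A per iteration.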
images. This can clearly be seen in the convolutive case (1.23), where the restoration operator Φ⁺_λΦ is a
filtering, which tends to smooth sharp parts of the data.
This phenomenon can also be understood because the restored data f_λ always belongs to Im(Φ*) =
ker(Φ)^⊥, and thus cannot contain “high frequency” details that are lost in the kernel of Φ. To alleviate this
shortcoming, and recover missing information in the kernel, it is thus necessary to consider non-quadratic
and in fact non-smooth regularization.
Total variation. The most well known instance of such a non-quadratic and non-smooth regularization is
the total variation prior. For a smooth function f : R^d → R, this amounts to replacing the quadratic Sobolev
energy (often called Dirichlet energy)
J_Sob(f) := (1/2) ∫_{R^d} ||∇f||²_{R^d} dx,
where ∇f(x) = (∂_{x_1}f(x), . . . , ∂_{x_d}f(x))^⊤ is the gradient, by the (vectorial) L¹ norm of the gradient
J_TV(f) := ∫_{R^d} ||∇f||_{R^d} dx.
We refer also to Section ?? about these priors. Simply “removing” the square ² inside the integral might
seem like a small change, but in fact it is a game changer.
Indeed, while J_Sob(1_Ω) = +∞ where 1_Ω is the indicator of a set Ω with finite perimeter |Ω| < +∞, one can
show that J_TV(1_Ω) = |Ω|, if one interprets ∇f as a distribution Df (actually a vectorial Radon measure) and
∫_{R^d} ||∇f||_{R^d} dx is replaced by the total mass |Df|(Ω) of this distribution m = Df
|m|(Ω) = sup { ∫_{R^d} ⟨h(x), dm(x)⟩ ; h ∈ C(R^d → R^d), ∀ x, ||h(x)|| ≤ 1 }.
The total variation of a function such that Df has a bounded total mass (a so-called bounded variation
function) is hence defined as
J_TV(f) := sup { ∫_{R^d} f(x) div(h)(x) dx ; h ∈ C_c^1(R^d; R^d), ||h||_∞ ≤ 1 }.
Generalizing the fact that J_TV(1_Ω) = |Ω|, the functional co-area formula reads
J_TV(f) = ∫_R H^{d−1}(L_t(f)) dt where L_t(f) = {x ; f(x) = t}
and where H^{d−1} is the Hausdorff measure of dimension d − 1; for instance, for d = 2, if L has finite perimeter
|L|, then H^{d−1}(L) = |L| is the perimeter of L.
Discretized Total variation. For discretized data f ∈ R^N, one can define a discretized TV semi-norm
as detailed in Section ??, and it reads, generalizing (??) to any dimension
J_TV(f) = Σ_n ||∇f_n||_{R^d}
where ∇f_n ∈ R^d is a finite difference gradient at the location indexed by n.
The discrete total variation prior J_TV(f) defined in (??) is a convex but non differentiable function of f,
since a term of the form ||∇f_n|| is non differentiable if ∇f_n = 0. We defer to chapters ?? and 2 the study of
advanced non-smooth convex optimization techniques that allow one to handle this kind of functionals.
In order to be able to use simple gradient descent methods, one needs to smooth the TV functional. The
general machinery proceeds by replacing the non-smooth ℓ² Euclidean norm ||·|| by a smoothed version, for
instance
∀ u ∈ R^d, ||u||_ε := √(ε² + ||u||²).
This leads to the definition of a smoothed approximate TV functional, already introduced in (??),
J^ε_TV(f) := Σ_n ||∇f_n||_ε
||u||_ε → ||u|| as ε → 0, and ||u||_ε = ε + (1/(2ε))||u||² + O(1/ε²) as ε → +∞,
which suggests that J^ε_TV interpolates between J_TV and J_Sob.
The resulting inverse regularization problem (1.17) thus reads
f_λ := argmin_{f ∈ R^N} E(f) := (1/2)||y − Φf||² + λJ^ε_TV(f) (1.26)
It is a strictly convex problem (because ||·||_ε is strictly convex for ε > 0) so that its solution f_λ is unique.
where we used O(||r||²_{R^N}) in place of o(||r||_{R^N}) (for differentiable functions) because we assume here E is of
class C¹ (i.e. the gradient is continuous).
For such a function, the gradient descent algorithm is defined as
f^(ℓ+1) := f^(ℓ) − τ_ℓ ∇E(f^(ℓ)), (1.28)
where the step size τ_ℓ > 0 should be small enough to guarantee convergence, but large enough for this
algorithm to be fast.
We refer to Section 2.1 for a detailed analysis of the convergence of the gradient descent, and a study of
the influence of the step size τ_ℓ.
∇G(f) = Af − b.
In particular, one retrieves that the first order optimality condition ∇G(f) = 0 is equivalent to the linear
system Af = b.
For the quadratic fidelity term G(f) = (1/2)||Φf − y||², one thus obtains
In the special case of the regularized TV problem (1.26), the gradient of E reads
Recall that the chain rule for differentials reads ∂(G_1 ◦ G_2) = ∂G_1 ◦ ∂G_2, but that gradient vectors are actually
transposes of differentials, so that for E = F ◦ H where F : R^P → R and H : R^N → R^P, one has
so that
∇J^ε_TV(f) = − div(N^ε(∇f)),
where N^ε(u) = (u_n/||u_n||_ε)_n is the smoothed-normalization operator of vector fields (the differential of ||·||_{1,ε}),
and where div = −∇* is minus the adjoint of the gradient.
Since div = −∇*, their operator norms are equal, || div ||_op = ||∇||_op, and are for instance equal to 2√d
for the discretized gradient operator. Computation shows that the Hessian of ||·||_ε is bounded by 1/ε, so
that for the smoothed-TV functional, the Lipschitz constant of the gradient is upper-bounded by
L = ||∇||²/ε + ||Φ||²_op.
Furthermore, this functional is strongly convex because ||·||_ε is ε-strongly convex, and the Hessian is lower
bounded by
µ = ε + σ_min(Φ)²
where σ_min(Φ) is the smallest singular value of Φ. For ill-posed problems, typically σ_min(Φ) = 0 or is very
small, so that both L and µ degrade (they tend respectively to +∞ and 0) as ε → 0, so that gradient descent
becomes prohibitive for small ε, and it is thus required to use dedicated non-smooth optimization methods
detailed in the following chapters. On the good news side, note however that in many cases, using a small but
non-zero value for ε often leads to visually more pleasing results, since it introduces a small blurring
which diminishes the artifacts (and in particular the so-called “stair-casing” effect) of TV regularization.
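As an illustration of gradient descent on a smoothed TV energy, here is a hedged 1-D sketch for the denoising case Φ = Id, using periodic forward differences for ∇ and the smoothed norm √(ε² + u²). Several choices are assumptions of this example rather than statements of the text: the step size τ = 1/L with L = 1 + 4λ/ε (the value 4 is the squared operator norm of periodic forward differences in 1-D, and the factor λ accounts for the weight of the regularizer in the energy), the test signal, and all function names.

```python
import numpy as np

def grad(f):
    """Periodic forward differences: (grad f)_n = f_{n+1} - f_n."""
    return np.roll(f, -1) - f

def div(u):
    """Discrete divergence, div = -grad^*: (div u)_n = u_n - u_{n-1}."""
    return u - np.roll(u, 1)

def energy(f, y, lam, eps):
    """E(f) = 1/2 ||f - y||^2 + lam * sum_n sqrt(eps^2 + (grad f)_n^2)."""
    return 0.5 * np.sum((f - y) ** 2) + lam * np.sum(np.sqrt(eps**2 + grad(f) ** 2))

def grad_energy(f, y, lam, eps):
    """Gradient of E: (f - y) - lam * div(grad f / ||grad f||_eps)."""
    u = grad(f)
    return (f - y) - lam * div(u / np.sqrt(eps**2 + u**2))

rng = np.random.default_rng(0)
N = 64
f0 = (np.arange(N) >= N // 2).astype(float)      # step signal (illustrative)
y = f0 + 0.1 * rng.standard_normal(N)            # noisy observation

lam, eps = 0.5, 0.1
L = 1.0 + lam * 4.0 / eps                        # upper bound on the Lipschitz constant of grad E
tau = 1.0 / L                                    # step size, below the limit 2 / L

f = y.copy()
energies = [energy(f, y, lam, eps)]
for _ in range(2000):
    f = f - tau * grad_energy(f, y, lam, eps)
    energies.append(energy(f, y, lam, eps))
```

With this step size the energy is non-increasing along the iterations; decreasing ε makes L grow like 1/ε and therefore slows the method down, as discussed above.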
1.5.1 Deconvolution
The blurring operator (1.1) is diagonal over Fourier, so that quadratic regularizations are easily solved
using Fast Fourier Transforms when considering periodic boundary conditions. We refer to (1.22) and the
corresponding explanations. TV regularization in contrast cannot be solved with fast Fourier techniques, and is
thus much slower.
1.5.2 Inpainting
For the inpainting problem, the operator defined in (1.2) is diagonal in space
In the noiseless case, the recovery (1.18) is solved using a projected gradient descent. For the Sobolev
energy, the algorithm iterates
f^(ℓ+1) = P_y(f^(ℓ) + τ∆f^(ℓ)).
which converges if τ < 2/||∆|| = 1/4. Figure 1.4 shows some iterations of this algorithm, which progressively
interpolates within the missing area.
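The projected gradient iteration for Sobolev inpainting can be sketched as follows on a small 2-D example. We assume here a 5-point periodic Laplacian, take P_y to be the operation that resets the known pixels to their observed values, and use τ = 0.2 < 1/4. The ramp image, the square hole and the names `laplacian` and `sobolev_inpaint` are illustrative assumptions. Since a linear ramp is discrete-harmonic, the iterations should fill the hole with the ramp itself.

```python
import numpy as np

def laplacian(f):
    """5-point finite difference Laplacian with periodic boundary conditions."""
    return (np.roll(f, 1, axis=0) + np.roll(f, -1, axis=0)
            + np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1) - 4.0 * f)

def sobolev_inpaint(y, missing, tau=0.2, n_iter=500):
    """Iterate f <- P_y(f + tau * Laplacian(f)), where P_y enforces the known pixels."""
    f = y.copy()
    for _ in range(n_iter):
        f = f + tau * laplacian(f)       # gradient step on the Sobolev energy
        f[~missing] = y[~missing]        # projection on the constraint Phi f = y
    return f

n = 16
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
f0 = i + 2.0 * j                         # linear ramp (illustrative image)

missing = np.zeros((n, n), dtype=bool)
missing[5:11, 5:11] = True               # Omega: a square hole away from the borders

y = np.where(missing, 0.0, f0)           # observation y = Phi f0
f_rec = sobolev_inpaint(y, missing)
```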
[Figure panels: k = 1, k = 10, k = 20, k = 100.]
[Figure panels: image f0; observation y = Φf0; Sobolev f⋆ (SNR=?dB); TV f⋆ (SNR=?dB).]
Figure 1.7: Principle of tomography acquisition.
We thus simplify the acquisition process over the discrete domain and model it as directly computing samples
of the Fourier transform
Φf = (f̂[ω])_{ω∈Ω} ∈ R^P
where Ω is a discrete set of radial lines in the Fourier plane, see Figure 1.8, right.
In this discrete setting, recovering from tomography measurements y = Rf0 is equivalent to
inpainting missing Fourier frequencies, and we consider partial noisy Fourier measurements
where w[ω] is some measurement noise, assumed here to be Gaussian white noise for simplicity.
The pseudo-inverse f⁺ = R⁺y defined in (1.7) of these partial Fourier measurements reads
f̂⁺[ω] = y[ω] if ω ∈ Ω, and f̂⁺[ω] = 0 if ω ∉ Ω.
Figure 1.9 shows examples of pseudo-inverse reconstructions for increasing sizes of Ω. This reconstruction
exhibits serious artifacts because of the bad handling of Fourier frequencies (zero padding of missing frequencies).
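The zero-filling pseudo-inverse is simple to implement with an orthonormal FFT, in which case Φ is a unitary transform followed by a selection and Φ⁺ = Φ*. The sketch below is illustrative only: the construction of the radial mask, the image, the sizes and the function names are assumptions. Note that the result is real-valued only when Ω is symmetric with respect to the origin, so the code keeps it complex.

```python
import numpy as np

def radial_mask(n, n_lines):
    """Boolean mask of n_lines discrete lines through the origin of the Fourier plane."""
    k = np.fft.fftfreq(n) * n                     # integer frequencies in FFT ordering
    kx, ky = np.meshgrid(k, k, indexing="ij")
    mask = np.zeros((n, n), dtype=bool)
    for theta in np.pi * np.arange(n_lines) / n_lines:
        # keep frequencies at distance < 1/2 from the line of angle theta
        mask |= np.abs(kx * np.sin(theta) - ky * np.cos(theta)) < 0.5
    return mask

def partial_fourier(f, mask):
    """Phi f: samples of the orthonormal 2-D Fourier transform on the mask."""
    return np.fft.fft2(f, norm="ortho")[mask]

def pseudo_inverse(y, mask):
    """Phi^+ y: zero-fill the missing frequencies and invert the Fourier transform."""
    f_hat = np.zeros(mask.shape, dtype=complex)
    f_hat[mask] = y
    return np.fft.ifft2(f_hat, norm="ortho")

n = 32
rng = np.random.default_rng(0)
f0 = rng.standard_normal((n, n))                  # stand-in for the image
mask = radial_mask(n, n_lines=8)

y = partial_fourier(f0, mask)
f_plus = pseudo_inverse(y, mask)
```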
The total variation regularization (??) reads
f⋆ ∈ argmin_f (1/2) Σ_{ω∈Ω} |y[ω] − f̂[ω]|² + λ||f||_TV.
It is especially suitable for medical imaging where organs of the body are of relatively constant gray value,
thus resembling the cartoon image model introduced in Section ??. Figure 1.10 compares this total
variation recovery to the pseudo-inverse for a synthetic cartoon image. This shows the ability of the total
variation to recover sharp features when inpainting Fourier measurements. This should be contrasted with the
difficulties that TV regularization faces when inpainting over the spatial domain, as shown in Figure ??.
[Figure panels: image f; Radon sub-sampling; Fourier domain.]
[Figure panels: image f0; pseudo-inverse; TV.]
Chapter 2
Convex Optimization
The main references for this chapter are [?, ?, ?], see also [?, ?, ?].
We consider a general convex optimization problem
min f (x) (2.1)
x∈H
where H = R^N is a finite dimensional Hilbertian (i.e. Euclidean) space, and try to devise “cheap” algorithms
with a low computational cost per iteration. The class of algorithms considered are first order, i.e. they
make use of gradient information.
Proposition 5. Conditions (R_L) and (S_µ) imply
∀ (x, x′), f(x) + ⟨∇f(x), x′ − x⟩ + (µ/2)||x − x′||² ≤ f(x′) ≤ f(x) + ⟨∇f(x), x′ − x⟩ + (L/2)||x − x′||². (2.3)
If f is of class C², conditions (R_L) and (S_µ) are equivalent to
where ∂²f(x) ∈ R^{N×N} is the Hessian of f, and where ⪯ is the natural order on symmetric matrices, i.e.
where x_t := x + t(x′ − x). Using Cauchy-Schwarz, and then the smoothness hypothesis (R_L)
f(x′) − f(x) ≤ ⟨∇f(x), x′ − x⟩ + ∫_0^1 L||x_t − x|| ||x′ − x|| dt ≤ ⟨∇f(x), x′ − x⟩ + L||x′ − x||² ∫_0^1 t dt
which gives the desired result since ||x_t − x||²/t = t||x′ − x||².
The relation (2.3) shows that a smooth (resp. strongly convex) functional is below a quadratic tangential
majorant (resp. above a quadratic tangential minorant).
Condition (2.4) thus reads that the singular values of ∂²f(x) should be contained in the interval [µ, L].
The upper bound is also equivalent to ||∂²f(x)||_op ≤ L where ||·||_op is the operator norm, i.e. the largest
singular value. In the special case of a quadratic function Q of the form (1.24), ∂²f(x) = A is constant, so
that [µ, L] can be chosen to be the range of the singular values of A.
In order to get some insight on the convergence proof and the associated speed, we first study the simple
case of a quadratic functional.
Proposition 6. For f(x) = (1/2)⟨Ax, x⟩ − ⟨b, x⟩ with the singular values of A upper-bounded by L, assuming
there exists (τ_min, τ_max) such that
0 < τ_min ≤ τ_ℓ ≤ τ_max < 2/L (2.5)
then there exists 0 ≤ ρ̃ < 1 such that
If the singular values are lower bounded by µ, then the best rate ρ̃ is obtained for
Figure 2.1: Contraction constant h(τ) for a quadratic function (right).
Since the solution x⋆ (which by the way is unique by strict convexity) satisfies the first order condition
Ax⋆ = b, it gives
x^(ℓ+1) − x⋆ = x^(ℓ) − x⋆ − τ_ℓ A(x^(ℓ) − x⋆) = (Id_N − τ_ℓ A)(x^(ℓ) − x⋆).
One thus has to study the contraction ratio of the linear map Id_N − τ_ℓ A, i.e. its largest singular value,
which reads
h(τ) := ||Id_N − τA||_2 = σ_max(Id_N − τA) = max(|1 − τσ_max(A)|, |1 − τσ_min(A)|).
For a quadratic function, one has σ_min(A) = µ, σ_max(A) = L. Figure 2.1, right, shows a display of h(τ). One
has that for 0 < τ < 2/L, h(τ) < 1.
Note that when the ratio $\varepsilon := \mu / L \ll 1$ is small (which is the typical setup for ill-posed problems), then the contraction constant appearing in (2.7) scales like
$$ \tilde\rho \sim 1 - 2\varepsilon. \tag{2.8} $$
The quantity ε in some sense reflects the inverse-conditioning of the problem. For a quadratic function, it indeed corresponds exactly to the inverse of the condition number (which is the ratio of the largest to smallest singular value). The condition number is minimum and equal to 1 for orthogonal matrices.
The error decay rate (2.6), although it is geometrical $O(\tilde\rho^{\,\ell})$, is called a “linear rate” in the optimization literature. It is a “global” rate because it holds for all ℓ (and not only for large enough ℓ).
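The linear rate can be observed numerically. The following is a minimal sketch, assuming a random symmetric positive definite matrix A (an illustrative choice, not taken from the text) and the constant step $\tau = 2/(L+\mu)$, for which the contraction ratio $h(\tau)$ above equals $(L-\mu)/(L+\mu)$. The function name `gradient_descent_quadratic` is hypothetical.

```python
import numpy as np

def gradient_descent_quadratic(A, b, x0, tau, n_iter):
    """Iterate x <- x - tau * (A x - b), the gradient step for f(x) = 0.5 <Ax, x> - <b, x>."""
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(n_iter):
        x = x - tau * (A @ x - b)
        iterates.append(x.copy())
    return iterates

# Illustrative data (assumption): a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 10))
A = B.T @ B + 0.1 * np.eye(10)
b = rng.standard_normal(10)

eigs = np.linalg.eigvalsh(A)          # sorted eigenvalues, equal to the singular values here
mu, L = eigs[0], eigs[-1]
tau = 2.0 / (L + mu)                  # constant step size
rho = (L - mu) / (L + mu)             # contraction ratio h(tau) for this step

x_star = np.linalg.solve(A, b)        # minimizer, solves A x = b
iterates = gradient_descent_quadratic(A, b, np.zeros(10), tau, 50)
errors = [np.linalg.norm(x - x_star) for x in iterates]
```

The list `errors` can then be compared with the geometric envelope `rho**l * errors[0]`.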
We now give a convergence theorem for a general convex function. In contrast to the quadratic case, if one does not assume strong convexity, one can only show a sub-linear rate on the function values (and no rate at all on the iterates themselves!). It is only when one assumes strong convexity that a linear rate is obtained. Note that without strong convexity, the solution of the minimization problem is not necessarily unique.
Theorem 2. If f satisfies condition $(R_L)$, assuming there exists $(\tau_{\min}, \tau_{\max})$ such that
$$ 0 < \tau_{\min} \leq \tau_\ell \leq \tau_{\max} < \frac{2}{L}, \tag{2.9} $$
then $x^{(\ell)}$ converges to a solution $x^\star$ of (2.1) and there exists $C > 0$ such that
$$ f(x^{(\ell)}) - f(x^\star) \leq \frac{C}{\ell + 1}. \tag{2.10} $$
If furthermore f is µ-strongly convex, then there exists $0 \leq \rho < 1$ such that $\|x^{(\ell)} - x^\star\| \leq \rho^\ell \|x^{(0)} - x^\star\|$.
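Proof. In the case where f is not strongly convex, we only prove (2.10) since the proof that $x^{(\ell)}$ converges is more technical. Note indeed that if the minimizer $x^\star$ is non-unique, then it might be the case that the iterates $x^{(\ell)}$ “cycle” while approaching the set of minimizers, but actually convexity of f prevents this kind of pathological behavior. For simplicity, we do the proof in the case $\tau_\ell = 1/L$, but it extends to the general case. The L-smoothness property implies (2.3), which reads
$$ f(x^{(\ell+1)}) \leq f(x^{(\ell)}) + \langle \nabla f(x^{(\ell)}), x^{(\ell+1)} - x^{(\ell)} \rangle + \frac{L}{2} \|x^{(\ell+1)} - x^{(\ell)}\|^2. $$
Using the fact that $x^{(\ell+1)} - x^{(\ell)} = -\frac{1}{L} \nabla f(x^{(\ell)})$, one obtains
$$ f(x^{(\ell+1)}) \leq f(x^{(\ell)}) - \frac{1}{L} \|\nabla f(x^{(\ell)})\|^2 + \frac{1}{2L} \|\nabla f(x^{(\ell)})\|^2 = f(x^{(\ell)}) - \frac{1}{2L} \|\nabla f(x^{(\ell)})\|^2. \tag{2.11} $$
This shows that $(f(x^{(\ell)}))_\ell$ is a decaying sequence. By convexity, $f(x^{(\ell)}) + \langle \nabla f(x^{(\ell)}), x^\star - x^{(\ell)} \rangle \leq f(x^\star)$. Plugging this into (2.11) and completing the square gives
$$ f(x^{(\ell+1)}) - f(x^\star) \leq \langle \nabla f(x^{(\ell)}), x^{(\ell)} - x^\star \rangle - \frac{1}{2L} \|\nabla f(x^{(\ell)})\|^2 = \frac{L}{2} \left( \|x^{(\ell)} - x^\star\|^2 - \|x^{(\ell+1)} - x^\star\|^2 \right). \tag{2.14} $$
Summing these inequalities for $\ell = 0, \ldots, k$, the right-hand side telescopes, and since $f(x^{(\ell+1)}) \geq f(x^{(k+1)})$ for $\ell \leq k$, one obtains
$$ f(x^{(k+1)}) - f(x^\star) \leq \frac{L \|x^{(0)} - x^\star\|^2}{2(k+1)} $$
which gives (2.10) for $C := L \|x^{(0)} - x^\star\|^2 / 2$.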
If we now assume f is µ-strongly convex, then, using $\nabla f(x^\star) = 0$, one has $\frac{\mu}{2} \|x^\star - x\|^2 \leq f(x) - f(x^\star)$ for all x. Re-manipulating (2.14) gives
$$ \frac{\mu}{2} \|x^{(\ell+1)} - x^\star\|^2 \leq f(x^{(\ell+1)}) - f(x^\star) \leq \frac{L}{2} \left( \|x^{(\ell)} - x^\star\|^2 - \|x^\star - x^{(\ell+1)}\|^2 \right), $$
and hence
$$ \|x^{(\ell+1)} - x^\star\| \leq \sqrt{\frac{L}{L + \mu}} \, \|x^{(\ell)} - x^\star\|, \tag{2.15} $$
which is the desired result.
Note that in the low conditioning setting $\varepsilon \ll 1$, one retrieves a dependency of the rate (2.15) similar to the one of quadratic functions (2.8), indeed
$$ \sqrt{\frac{L}{L + \mu}} = (1 + \varepsilon)^{-\frac{1}{2}} \sim 1 - \frac{1}{2} \varepsilon. $$
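The sub-linear bound on the function values can also be checked numerically. Below is a minimal sketch, assuming an under-determined least squares energy $f(x) = \frac{1}{2}\|Ax - y\|^2$ (convex and L-smooth with $L = \|A\|^2$, but not strongly convex) and the step $\tau = 1/L$ used in the proof. The random data and the name `gradient_descent` are assumptions made for the illustration.

```python
import numpy as np

def gradient_descent(grad, x0, tau, n_iter):
    """Plain gradient descent with a constant step size, returning all iterates."""
    x = x0.copy()
    out = [x.copy()]
    for _ in range(n_iter):
        x = x - tau * grad(x)
        out.append(x.copy())
    return out

# Illustrative data (assumption): a wide matrix, so f is not strongly convex.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 20))
y = rng.standard_normal(5)
f = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
grad = lambda x: A.T @ (A @ x - y)

L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
x_star = np.linalg.pinv(A) @ y           # one minimizer among infinitely many
xs = gradient_descent(grad, np.zeros(20), 1.0 / L, 200)

gaps = [f(x) - f(x_star) for x in xs]
# Bound from the proof: f(x^(k)) - f(x*) <= L ||x^(0) - x*||^2 / (2k) for k >= 1.
bounds = [L * np.linalg.norm(xs[0] - x_star) ** 2 / (2 * k) for k in range(1, len(xs))]
```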
2.1.2 Sub-gradient Descent
The gradient descent (2.2) cannot be applied on a non-smooth function f. One can use in place of a gradient a sub-gradient, which defines the sub-gradient descent
$$ x^{(\ell+1)} := x^{(\ell)} - \tau_\ell g^{(\ell)} \quad \text{where} \quad g^{(\ell)} \in \partial f(x^{(\ell)}). \tag{2.16} $$
The main issue with this scheme is that to ensure convergence, the step sizes $\tau_\ell$ should go to zero. One can easily convince oneself why by looking at the iterates on a function $f(x) = |x|$.
Theorem 3. If $\sum_\ell \tau_\ell = +\infty$ and $\sum_\ell \tau_\ell^2 < +\infty$, then $x^{(\ell)}$ converges to a minimizer of f.
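The role of the decaying step size can be seen on $f(x) = |x|$. The sketch below is only an illustration: it uses `sign(x)` as the sub-gradient (with the value 0 at x = 0), and compares a constant step with the step $\tau_\ell = 1/(\ell+1)$, which satisfies both summability conditions. The function name and the starting point are assumptions.

```python
import numpy as np

def subgradient_descent_abs(x0, step, n_iter):
    """Sub-gradient descent on f(x) = |x|, using sign(x) as the sub-gradient."""
    x = x0
    for l in range(n_iter):
        x = x - step(l) * np.sign(x)
    return x

# Constant step: the iterates end up oscillating around 0 without converging.
x_const = subgradient_descent_abs(0.35, lambda l: 0.1, 1000)
# Decaying step with divergent sum and summable squares: the iterates go to 0.
x_decay = subgradient_descent_abs(0.35, lambda l: 1.0 / (l + 1), 1000)
```

With the constant step 0.1, the iterates jump between values close to +0.05 and -0.05, whereas with the decaying step the distance to the minimizer is bounded by the last step size.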
We now consider a constrained problem
$$ \min_{x \in C} f(x) $$
where $C \subset \mathbb{R}^S$ is a closed convex set and $f : \mathbb{R}^S \to \mathbb{R}$ is a smooth convex function (at least of class $C^1$).
The gradient descent algorithm (2.2) is generalized to solve a constrained problem using the projected gradient descent
$$ x^{(\ell+1)} := \mathrm{Proj}_C \left( x^{(\ell)} - \tau_\ell \nabla f(x^{(\ell)}) \right), \tag{2.18} $$
where $\mathrm{Proj}_C$ is the orthogonal projector on C
$$ \mathrm{Proj}_C(x) = \underset{x' \in C}{\mathrm{argmin}} \ \|x - x'\| $$
which is always uniquely defined because C is closed and convex. The following theorem shows that all the convergence properties of the classical gradient descent carry over to this projected algorithm.
Theorem 4. Theorems ?? and 2 still hold when replacing iterations (2.2) by (2.18).
Proof. The proof of Theorem ?? extends because the projector is contractant (non-expansive), $\|\mathrm{Proj}_C(x) - \mathrm{Proj}_C(x')\| \leq \|x - x'\|$, so that the strict contraction properties of the gradient descent are maintained by this projection.
The main bottleneck that often prevents the use of (2.18) is that the projector is often complicated to compute. We are however lucky since for $\ell^1$ minimization, one can apply this method in a straightforward manner.
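As an illustration of the projected iterations (2.18), here is a minimal sketch for a case where the projector is trivial: the non-negativity constraint $C = \{x \,;\, x \geq 0\}$, for which $\mathrm{Proj}_C$ is an entrywise clipping. The least squares energy, the random data and the names `projected_gradient` and `proj_pos` are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def projected_gradient(grad, proj, x0, tau, n_iter):
    """Projected gradient descent: x <- proj(x - tau * grad(x))."""
    x = proj(x0)
    for _ in range(n_iter):
        x = proj(x - tau * grad(x))
    return x

# Illustrative problem (assumption): non-negative least squares.
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 10))
y = rng.standard_normal(40)
grad = lambda x: A.T @ (A @ x - y)          # gradient of 0.5 * ||A x - y||^2
proj_pos = lambda x: np.maximum(x, 0.0)     # orthogonal projector on {x ; x >= 0}

L = np.linalg.norm(A, 2) ** 2
x_pgd = projected_gradient(grad, proj_pos, np.zeros(10), 1.0 / L, 2000)
```

At convergence, `x_pgd` is a fixed point of the map `x -> proj_pos(x - grad(x) / L)`, which characterizes the constrained minimizer.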
Figure 2.2: Proximal map and projection map.
Proof. The proximal map of || · ||1 was derived in Proposition ??. For the quadratic case
Note that in some cases, the proximal map of a non-convex function is well defined, for instance $\mathrm{Prox}_{\tau \|\cdot\|_0}$ is the hard thresholding associated to the threshold $\sqrt{2\tau}$, see Proposition ??.
which is the optimality condition for z − y = Proxf (x).
One has
$$ z = \mathrm{Prox}_{f(\cdot - y)}(x) \iff 0 \in z - x + \partial f(z - y) \iff 0 \in (z - y) - x' + \partial f(z - y) $$
where we defined $x' := x - y$, and this is the optimality condition for $z - y = \mathrm{Prox}_f(x')$.
The following proposition is very useful.
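Proposition 10. If $A \in \mathbb{R}^{P \times N}$ is a tight frame, i.e. $A A^* = \mathrm{Id}_P$, then
$$ \mathrm{Prox}_{f \circ A} = A^* \circ \mathrm{Prox}_f \circ A + \mathrm{Id}_N - A^* A. $$
In particular, if A is orthogonal, then $\mathrm{Prox}_{f \circ A} = A^* \circ \mathrm{Prox}_f \circ A$.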
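This formula can be checked numerically. The sketch below assumes $f = t\|\cdot\|_1$ (whose proximal map is the soft thresholding) and a random matrix A with orthonormal rows obtained from a QR factorization. These choices and the function names are illustrative assumptions. The candidate point is compared against the energy $\frac{1}{2}\|x - z\|^2 + t\|Az\|_1$ that defines $\mathrm{Prox}_{f \circ A}(x)$.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l1_tight_frame(A, x, t):
    """Prox of z -> t * ||A z||_1 at x, assuming A A^T = Id:
    A^T Prox_f(A x) + x - A^T A x."""
    Ax = A @ x
    return x + A.T @ (soft_threshold(Ax, t) - Ax)

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # Q has orthonormal columns
A = Q.T                                           # A is 3 x 8 and A A^T = Id_3
x = rng.standard_normal(8)
t = 0.4

p = prox_l1_tight_frame(A, x, t)
# Energy minimized by the proximal map of f o A at the point x.
objective = lambda z: 0.5 * np.sum((x - z) ** 2) + t * np.sum(np.abs(A @ z))
```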
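Link with duality. One has the following fundamental relation between the proximal operator of a function and of its Legendre-Fenchel transform.
Theorem 5 (Moreau decomposition). One has
$$ \mathrm{Prox}_{\tau f} = \mathrm{Id} - \tau \, \mathrm{Prox}_{f^* / \tau}(\cdot / \tau). $$
This theorem shows that the proximal operator of f is simple to compute if and only if the proximal operator of $f^*$ is also simple. As a particular instantiation, since the Legendre-Fenchel transform of $\|\cdot\|_1$ is the indicator of the $\ell^\infty$ unit ball, one can re-write the soft thresholding as follows
$$ \mathrm{Prox}_{\tau \|\cdot\|_1}(x) = x - \tau \, \mathrm{Proj}_{\|\cdot\|_\infty \leq 1}(x / \tau) = x - \mathrm{Proj}_{\|\cdot\|_\infty \leq \tau}(x) \quad \text{where} \quad \mathrm{Proj}_{\|\cdot\|_\infty \leq \tau}(x) = \min(\max(x, -\tau), \tau). $$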
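The two expressions of the soft thresholding can be compared directly. This is a small sketch, with function names chosen for the illustration, where the projection on the $\ell^\infty$ ball of radius τ is the entrywise clipping `np.clip`.

```python
import numpy as np

def soft_threshold(x, tau):
    """Usual closed form of the proximal map of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def soft_threshold_via_moreau(x, tau):
    """Same map, written as x minus its projection on the l-infinity ball of radius tau."""
    return x - np.clip(x, -tau, tau)

x = np.array([-2.0, -0.3, 0.0, 0.5, 1.7])
direct = soft_threshold(x, 0.5)
via_moreau = soft_threshold_via_moreau(x, 0.5)
```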
In the special case where $f = \iota_C$ where C is a closed convex cone, then
$$ (\iota_C)^* = \iota_{C^\circ} \quad \text{where} \quad C^\circ := \{ y \;;\; \forall x \in C, \ \langle x, y \rangle \leq 0 \} \tag{2.24} $$
and $C^\circ$ is the so-called polar cone. Cones are fundamental objects in convex optimization because they are invariant by duality, in the sense of (2.24) (if C is not a cone, its Legendre transform would not be an indicator). Using (2.24), one obtains the celebrated Moreau polar decomposition
$$ x = \mathrm{Proj}_C(x) +^\perp \mathrm{Proj}_{C^\circ}(x) $$
where “$+^\perp$” denotes an orthogonal sum (the terms in the sum are mutually orthogonal). In the case where $C = V$ is a linear space, this corresponds to the usual decomposition $\mathbb{R}^N = V \oplus^\perp V^\perp$.
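A simple instance of this decomposition, given here only as an illustration, is the non-negative orthant $C = \{x \,;\, x \geq 0\}$, whose polar cone is the non-positive orthant. Both projectors are then entrywise clippings. The function names are assumptions.

```python
import numpy as np

def proj_cone(x):
    """Projection on C = {x ; x >= 0}, the non-negative orthant."""
    return np.maximum(x, 0.0)

def proj_polar(x):
    """Projection on the polar cone of C, which is {y ; y <= 0}."""
    return np.minimum(x, 0.0)

x = np.array([1.5, -0.2, 0.0, -3.0, 0.7])
p, q = proj_cone(x), proj_polar(x)   # x = p + q with <p, q> = 0
```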
Link with Moreau-Yosida regularization. The following proposition shows that the proximal operator
can be interpreted as performing a gradient descent step on the Moreau-Yosida smoothed version fµ of f ,
defined in (??).
This shows that being a minimizer of f is equivalent to being a fixed point of $\mathrm{Prox}_{\tau f}$. This suggests the following fixed point iterations, which are called the proximal point algorithm
$$ x^{(\ell+1)} := \mathrm{Prox}_{\tau_\ell f}(x^{(\ell)}). \tag{2.27} $$
In contrast to the gradient descent fixed point scheme, the proximal point method converges for any sequence of steps.
This implicit step (2.27) should be compared with a gradient descent step (2.2)
$$ x^{(\ell+1)} := (\mathrm{Id} - \tau_\ell \nabla f)(x^{(\ell)}). $$
One sees that the implicit resolvent $(\mathrm{Id} + \tau_\ell \partial f)^{-1}$ replaces the explicit step $\mathrm{Id} - \tau_\ell \nabla f$. For small $\tau_\ell$ and smooth f, they are equivalent at first order. But the implicit step is well defined even for non-smooth functions, and the scheme (the proximal point) is always convergent (whereas the explicit step size should be small enough for the gradient descent to converge). This is in line with the general idea that implicit stepping (e.g. implicit Euler for integrating ODE, which is very similar to the proximal point method) is more stable. Of course, the drawback is that explicit steps are very easy to implement whereas in general proximal maps are hard to compute (most of the time as hard as solving the initial problem).
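The difference in stability can be seen on the one-dimensional quadratic $f(x) = \frac{a}{2} x^2$, for which both steps are available in closed form. In this sketch (an illustration with arbitrarily chosen values of a and τ), the step size exceeds the bound $2/L = 2/a$, so the explicit iterations diverge while the implicit ones still converge.

```python
def explicit_step(x, tau, a):
    """Gradient step (Id - tau * grad f)(x) for f(x) = a/2 * x^2."""
    return x - tau * a * x

def implicit_step(x, tau, a):
    """Proximal step (Id + tau * grad f)^{-1}(x) for f(x) = a/2 * x^2."""
    return x / (1.0 + tau * a)

a, tau = 10.0, 0.5           # tau is larger than 2 / L = 0.2
x_exp = x_imp = 1.0
for _ in range(20):
    x_exp = explicit_step(x_exp, tau, a)   # multiplied by (1 - tau * a) = -4 each time
    x_imp = implicit_step(x_imp, tau, a)   # divided by (1 + tau * a) = 6 each time
```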
2.3.2 Forward-Backward
It is in general impossible to compute $\mathrm{Prox}_{\gamma f}$ so that the proximal point algorithm is not implementable. In order to derive more practical algorithms, it is important to restrict the class of considered functions, by imposing some structure on the function to be minimized. We consider functions of the form
$$ \min_x E(x) := f(x) + g(x) \tag{2.28} $$
where $g \in \Gamma_0(H)$ can be arbitrary, but f needs to be smooth.
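One can modify the fixed point derivation (2.25) to account for this special structure
$$ x^\star \in \mathrm{argmin} \ f + g \iff 0 \in \nabla f(x^\star) + \partial g(x^\star) \iff x^\star - \tau \nabla f(x^\star) \in (\mathrm{Id} + \tau \partial g)(x^\star) $$
$$ \iff x^\star = (\mathrm{Id} + \tau \partial g)^{-1} \circ (\mathrm{Id} - \tau \nabla f)(x^\star). $$
This fixed point suggests the following algorithm, the celebrated Forward-Backward
$$ x^{(\ell+1)} := \mathrm{Prox}_{\tau_\ell g} \left( x^{(\ell)} - \tau_\ell \nabla f(x^{(\ell)}) \right). \tag{2.29} $$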
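As an illustration of the iterations (2.29), here is a minimal sketch for the Lasso, assuming $f(x) = \frac{1}{2}\|Ax - y\|^2$ and $g = \lambda \|\cdot\|_1$, so that $\mathrm{Prox}_{\tau g}$ is the soft thresholding at level $\tau\lambda$. The random data, the constant step $\tau = 1/L$ with $L = \|A\|^2$ and the function names are assumptions made for the example.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def forward_backward_lasso(A, y, lam, n_iter):
    """Forward-Backward for f(x) = 0.5 ||A x - y||^2 and g = lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    tau = 1.0 / L
    x = np.zeros(A.shape[1])
    energies = []
    for _ in range(n_iter):
        # Explicit (forward) gradient step on f, implicit (backward) step on g.
        x = soft_threshold(x - tau * A.T @ (A @ x - y), tau * lam)
        energies.append(0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x)))
    return x, energies

# Illustrative data (assumption): noisy measurements of a sparse vector.
rng = np.random.default_rng(4)
A = rng.standard_normal((20, 50))
x0 = np.zeros(50)
x0[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = A @ x0 + 0.01 * rng.standard_normal(20)
x_fb, energies = forward_backward_lasso(A, y, lam=0.1, n_iter=300)
```

With this step size the recorded energies are non-increasing along the iterations.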
Derivation using surrogate functionals. An intuitive way to derive this algorithm, and also a way to prove its convergence, is to use the concept of surrogate functional.
To derive an iterative algorithm, we modify the energy $E(x)$ to obtain a surrogate functional $E(x, x^{(\ell)})$ whose minimization corresponds to a simpler optimization problem, and define the iterations as
$$ x^{(\ell+1)} := \underset{x}{\mathrm{argmin}} \ E(x, x^{(\ell)}). \tag{2.30} $$
In order to ensure convergence, this function should satisfy the following property
$$ E(x) \leq E(x, x') \quad \text{and} \quad E(x, x) = E(x) \tag{2.31} $$
and $E(x) - E(x, x')$ should be a smooth function. Property (2.31) guarantees that E is decaying along the iterations
$$ E(x^{(\ell+1)}) \leq E(x^{(\ell)}) $$
and it is simple to check that actually all accumulation points of $(x^{(\ell)})_\ell$ are stationary points of E.
In order to derive a valid surrogate $E(x, x')$ for our functional (2.28), since we assume f is L-smooth (i.e. satisfies $(R_L)$), let us recall the quadratic majorant (2.3)
$$ f(x) \leq f(x') + \langle \nabla f(x'), x - x' \rangle + \frac{L}{2} \|x - x'\|^2, $$
so that for $0 < \tau < \frac{1}{L}$, the function
$$ E(x, x') := f(x') + \langle \nabla f(x'), x - x' \rangle + \frac{1}{2\tau} \|x - x'\|^2 + g(x) \tag{2.32} $$
satisfies the surrogate conditions (2.31). The following proposition shows that minimizing the surrogate
functional corresponds to the computation of a so-called proximal operator.
Proposition 13. The update (2.30) for the surrogate (2.32) is exactly (2.29).
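Proof. This follows from the fact that
$$ \langle \nabla f(x'), x - x' \rangle + \frac{1}{2\tau} \|x - x'\|^2 = \frac{1}{2\tau} \|x - (x' - \tau \nabla f(x'))\|^2 + \mathrm{cst}, $$
where the constant does not depend on x.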
Convergence of FB. Although we impose $\tau < 1/L$ to ensure the majorization property, one can actually show convergence under the same hypothesis as for the gradient descent, i.e. $0 < \tau < 2/L$, with the same convergence rates. This means that Theorem 4 for the projected gradient descent extends to FB.
Theorem 7. Theorems ?? and 2 still hold when replacing iterations (2.2) by (2.29).
Note furthermore that the projected gradient descent algorithm (2.18) is recovered as a special case of (2.29) when setting $g = \iota_C$ the indicator of the constraint set, since $\mathrm{Prox}_{\tau g} = \mathrm{Proj}_C$ in this case.
Of course the difficult point is to be able to compute in closed form $\mathrm{Prox}_{\tau g}$ in (2.29), and this is usually possible only for very simple functions. We have already seen such an example in Section ?? for the resolution of $\ell^1$-regularized inverse problems (the Lasso).
2.3.3 Douglas-Rachford
We consider here the structured minimization problem
$$ \min_x f(x) + g(x) \tag{2.33} $$
but contrary to the Forward-Backward setting studied in Section 2.3.2, no smoothness is imposed on f. We here suppose that we can compute easily the proximal map of f and g.
Example 1 (Constrained Lasso). An example of a problem of the form (2.33) where one can apply Douglas-Rachford is the noiseless constrained Lasso problem (??)
$$ \min_{Ax = y} \|x\|_1 $$
where one can use $f = \iota_{C_y}$ where $C_y := \{x \;;\; Ax = y\}$ and $g = \|\cdot\|_1$. As noted in Section ??, this problem is equivalent to a linear program. The proximal operator of g is the soft thresholding as stated in (2.20), while the proximal operator of f is the orthogonal projector on the affine space $C_y$, which can be computed by solving a linear system as stated in (??) (this is especially convenient for inpainting or deconvolution problems, where this is achieved efficiently).
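The Douglas-Rachford iterations read
$$ \tilde x^{(\ell+1)} := \left( 1 - \frac{\mu}{2} \right) \tilde x^{(\ell)} + \frac{\mu}{2} \, \mathrm{rProx}_{\tau g}(\mathrm{rProx}_{\tau f}(\tilde x^{(\ell)})) \quad \text{and} \quad x^{(\ell+1)} := \mathrm{Prox}_{\tau f}(\tilde x^{(\ell+1)}), \tag{2.34} $$
where we have used the following shortcuts
$$ \mathrm{rProx}_{\tau f}(x) := 2 \, \mathrm{Prox}_{\tau f}(x) - x. $$
One can show that for any value of $\tau > 0$, any $0 < \mu < 2$, and any $\tilde x^{(0)}$, $x^{(\ell)} \to x^\star$ which is a minimizer of $f + g$.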
Note that it is of course possible to inter-change the roles of f and g, which defines another set of
iterations.
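The iterations (2.34) can be sketched on the constrained Lasso of Example 1. This is only an illustration under stated assumptions: the reflected proximal map is taken to be $\mathrm{rProx}_{\tau f}(x) = 2\,\mathrm{Prox}_{\tau f}(x) - x$, the projector on $\{x \,;\, Ax = y\}$ is computed with a pseudo-inverse (convenient for a small dense matrix, not necessarily the most efficient choice), and the random data and function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def douglas_rachford_bp(A, y, tau=1.0, mu=1.0, n_iter=1000):
    """Douglas-Rachford iterations for min ||x||_1 subject to A x = y."""
    pinv = np.linalg.pinv(A)
    prox_f = lambda x: x + pinv @ (y - A @ x)   # projector on {x ; A x = y}
    prox_g = lambda x: soft_threshold(x, tau)   # prox of tau * ||.||_1
    rprox_f = lambda x: 2 * prox_f(x) - x       # reflected proximal maps
    rprox_g = lambda x: 2 * prox_g(x) - x
    xt = np.zeros(A.shape[1])
    for _ in range(n_iter):
        xt = (1 - mu / 2) * xt + (mu / 2) * rprox_g(rprox_f(xt))
    return prox_f(xt)

# Illustrative data (assumption): noiseless measurements of a sparse vector.
rng = np.random.default_rng(5)
A = rng.standard_normal((20, 50))
x0 = np.zeros(50)
x0[[4, 19, 33]] = [1.0, -2.0, 1.5]
y = A @ x0
x_dr = douglas_rachford_bp(A, y)
```

The returned point satisfies the constraint by construction, since it is the output of the projector. Its $\ell^1$ norm can be compared with that of the least squares solution `np.linalg.pinv(A) @ y`, which is also feasible.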
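More than two functions. Another set of iterations can be obtained by “symmetrizing” the algorithm. More generally, if we have K functions $(f_k)_k$, we re-write
$$ \min_x \sum_k f_k(x) = \min_{X = (x_1, \ldots, x_K)} f(X) + g(X) \quad \text{where} \quad f(X) = \sum_k f_k(x_k) \quad \text{and} \quad g(X) = \iota_\Delta(X) $$
where $\Delta := \{X \;;\; x_1 = \ldots = x_K\}$ is the diagonal. The proximal operator of g is the orthogonal projector on ∆, which replaces each $x_k$ by the average $\frac{1}{K} \sum_j x_j$, while the proximal operator of f is easily computed from those of the $(f_k)_k$ using (2.22). One can thus apply DR iterations (2.34).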
Handling a linear operator. One can handle a minimization of the form (2.36) by introducing extra variables
$$ \inf_x f_1(x) + f_2(Ax) = \inf_{z = (x, y)} f(z) + g(z) \quad \text{where} \quad \begin{cases} f(z) = f_1(x) + f_2(y) \\ g(z) = \iota_C(x, y), \end{cases} $$
where $C = \{(x, y) \;;\; Ax = y\}$. This problem can be handled using DR iterations (2.34), since the proximal operator of f is obtained from those of $(f_1, f_2)$ using (2.22), while the proximal operator of g is the projector on C, which can be computed in two alternative ways as the following proposition shows.
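Proposition 14. One has
$$ \mathrm{Proj}_C(x, y) = (x + A^* \tilde y, \ y - \tilde y) = (\tilde x, A \tilde x) \quad \text{where} \quad \begin{cases} \tilde y := (\mathrm{Id}_P + A A^*)^{-1} (y - A x) \\ \tilde x := (\mathrm{Id}_N + A^* A)^{-1} (A^* y + x). \end{cases} \tag{2.35} $$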
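The two ways of computing the projector can be compared numerically. The sketch below uses the sign convention $\tilde y = (\mathrm{Id}_P + AA^*)^{-1}(y - Ax)$, under which $(x + A^*\tilde y, y - \tilde y)$ satisfies the constraint, and a small random dense matrix. The function names and the data are assumptions made for the illustration.

```python
import numpy as np

def proj_graph_primal(A, x, y):
    """Projection of (x, y) on C = {(x, y) ; A x = y}, solving an N x N system."""
    N = A.shape[1]
    xt = np.linalg.solve(np.eye(N) + A.T @ A, A.T @ y + x)
    return xt, A @ xt

def proj_graph_dual(A, x, y):
    """Same projection, solving a P x P system instead."""
    P = A.shape[0]
    yt = np.linalg.solve(np.eye(P) + A @ A.T, y - A @ x)
    return x + A.T @ yt, y - yt

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 7))
x = rng.standard_normal(7)
y = rng.standard_normal(4)
x1, y1 = proj_graph_primal(A, x, y)
x2, y2 = proj_graph_dual(A, x, y)
```

Which form is preferable depends on whether P or N is smaller, since the linear system to solve has the corresponding size.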
Remark 1 (Inversion of linear operator). At many places (typically to compute some sort of projector) one has to invert matrices of the form $A A^*$, $A^* A$, $\mathrm{Id}_P + A A^*$ or $\mathrm{Id}_N + A^* A$ (see for instance (2.35)). There are some cases where this can be done efficiently. Typical examples where this is simple are inpainting inverse problems where $A A^*$ is diagonal, and deconvolution or partial Fourier measurement (e.g. fMRI) for which $A^* A$ is diagonalized using the FFT. If this inversion is too costly, one needs to use more advanced methods, based on duality, which avoid the inversion and only require applications of A and $A^*$. They are however typically converging more slowly.
We consider here a problem of the form
$$ \min_x f(x) + g(Ax) \tag{2.36} $$
but furthermore assume that f is µ-strongly convex, and we assume for simplicity that both $(f, g)$ are continuous. If f were also smooth (but it does not need to be!), one could think about using the Forward-Backward algorithm (2.29). But the main issue is that in general $\mathrm{Prox}_{\tau g \circ A}$ cannot be computed easily even if one can compute $\mathrm{Prox}_{\tau g}$. An exception to this is when A is a tight frame, as exposed in Proposition 10, but in practice it is rarely the case.
Example 2 (TV denoising). A typical example, which was the one used by Antonin Chambolle [?] to develop this class of method, is the total variation denoising
$$ \min_x \frac{1}{2} \|y - x\|^2 + \lambda \|\nabla x\|_{1,2} $$
where $\nabla x \in \mathbb{R}^{N \times d}$ is the gradient (a vector field) of a signal (d = 1) or image (d = 2) x, and $\|\cdot\|_{1,2}$ is the vectorial-$\ell^1$ norm (also called $\ell^1 - \ell^2$ norm), such that for a d-dimensional vector field $(v_i)_{i=1}^N$
$$ \|v\|_{1,2} := \sum_i \|v_i\|. $$
Here
$$ f = \frac{1}{2} \|\cdot - y\|^2 \quad \text{and} \quad g = \lambda \|\cdot\|_{1,2} $$
so that f is µ = 1 strongly convex, and one sets $A = \nabla$ the linear operator.
Applying Fenchel-Rockafellar Theorem ?? (since strong duality holds, all involved functions being continuous), one has that
$$ p^\star = \sup_u \ - g^*(u) - f^*(-A^* u). $$
But more importantly, since f is µ-strongly convex, one has that $f^*$ is smooth with a 1/µ-Lipschitz gradient. One can thus use the Forward-Backward algorithm (2.29) on (minus the energy of) this problem, which reads
$$ u^{(\ell+1)} = \mathrm{Prox}_{\tau_\ell g^*} \left( u^{(\ell)} + \tau_\ell A \nabla f^*(-A^* u^{(\ell)}) \right). $$
To guarantee convergence, the step size $\tau_\ell$ should be smaller than 2/L where L is the Lipschitz constant of $A \circ \nabla f^* \circ A^*$, which is smaller than $\|A\|^2 / \mu$.
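Last but not least, once some (not necessarily unique) dual minimizer $u^\star$ is computed, the primal-dual relationships (??) ensure that one retrieves the unique primal minimizer $x^\star$ as
$$ x^\star = \nabla f^*(-A^* u^\star). $$
For the TV denoising problem of Example 2, $g^*$ is the indicator of the set $\{v \;;\; \max_i \|v_i\| \leq \lambda\}$, so that $\mathrm{Prox}_{\tau g^*}$ is the orthogonal projector on this set,
$$ \mathrm{Prox}_{\tau g^*}(v) = \left( \min(\|v_i\|, \lambda) \frac{v_i}{\|v_i\|} \right)_i, $$
and
$$ f^*(h) = \frac{1}{2} \|h\|^2 + \langle h, y \rangle \quad \text{and} \quad \nabla f^*(h) = h + y. $$
Furthermore, µ = 1 and $A^* A = \Delta$ is the usual finite difference approximation of the Laplacian, so that $\|A\|^2 = \|\Delta\| = 4d$ where d is the dimension.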
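The dual Forward-Backward scheme for TV denoising can be sketched in dimension d = 1. In this illustration, A is taken to be the forward finite difference operator without boundary wrap-around, so that the proximal map of $g^*$ is an entrywise clipping to $[-\lambda, \lambda]$, and the primal point is recovered as $x = \nabla f^*(-A^* u) = y - A^* u$. The step size 0.25 is chosen below the bound $2\mu / \|A\|^2 = 1/2$. The test signal, the discretization and the function names are assumptions.

```python
import numpy as np

def grad1d(x):
    """Forward finite differences, A : R^N -> R^(N-1)."""
    return np.diff(x)

def grad1d_adj(u):
    """Adjoint A^* of grad1d (minus a discrete divergence)."""
    return np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))

def tv_denoise_dual_fb(y, lam, n_iter=2000, tau=0.25):
    """Forward-Backward on the dual of min_x 0.5 ||y - x||^2 + lam ||grad x||_1."""
    u = np.zeros(y.size - 1)
    for _ in range(n_iter):
        x = y - grad1d_adj(u)                         # x = grad f^*(-A^* u)
        u = np.clip(u + tau * grad1d(x), -lam, lam)   # projection on |u_i| <= lam
    return y - grad1d_adj(u), u

# Illustrative data (assumption): a noisy piecewise constant signal.
rng = np.random.default_rng(6)
x_true = np.concatenate((np.zeros(20), np.ones(20), 0.3 * np.ones(20)))
y = x_true + 0.1 * rng.standard_normal(60)
lam = 0.5
x_tv, u_tv = tv_denoise_dual_fb(y, lam)
energy = lambda x: 0.5 * np.sum((y - x) ** 2) + lam * np.sum(np.abs(grad1d(x)))
```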
where $A = \nabla$, $f = \frac{1}{2} \|y - K \cdot\|^2$ and $g = \lambda \|\cdot\|_{1,2}$. Note however that with such a splitting, one will have to compute the proximal operator of f, which, following (2.21), requires inverting either $\mathrm{Id}_P + A A^*$ or $\mathrm{Id}_N + A^* A$, see Remark 1.
A standard primal-dual algorithm, which is detailed in [], reads
$$ z^{(\ell+1)} := \mathrm{Prox}_{\sigma g^*}(z^{(\ell)} + \sigma A(\tilde x^{(\ell)})) $$
$$ x^{(\ell+1)} := \mathrm{Prox}_{\tau f}(x^{(\ell)} - \tau A^*(z^{(\ell+1)})) $$
$$ \tilde x^{(\ell+1)} := x^{(\ell+1)} + \theta (x^{(\ell+1)} - x^{(\ell)}) $$
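These three updates can be sketched on one-dimensional TV denoising, i.e. assuming $K = \mathrm{Id}$ so that $\mathrm{Prox}_{\tau f}(v) = (v + \tau y)/(1 + \tau)$ is available in closed form, with $A$ the forward finite difference operator and $\mathrm{Prox}_{\sigma g^*}$ the clipping to $[-\lambda, \lambda]$. The parameters $\sigma = \tau = 0.45$ (so that $\sigma \tau \|A\|^2 < 1$ when $\|A\|^2 \leq 4$), $\theta = 1$, and the function name are assumptions made for this illustration.

```python
import numpy as np

def primal_dual_tv(y, lam, n_iter=2000, sigma=0.45, tau=0.45, theta=1.0):
    """Primal-dual iterations for min_x 0.5 ||x - y||^2 + lam ||grad x||_1 in 1-D."""
    grad = lambda x: np.diff(x)
    grad_adj = lambda z: np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))
    x = y.copy()
    x_bar = y.copy()
    z = np.zeros(y.size - 1)
    for _ in range(n_iter):
        # Dual update: prox of sigma * g^* is the projection on |z_i| <= lam.
        z = np.clip(z + sigma * grad(x_bar), -lam, lam)
        # Primal update: closed-form prox of tau * f for f = 0.5 ||. - y||^2.
        x_new = (x - tau * grad_adj(z) + tau * y) / (1.0 + tau)
        # Extrapolation step.
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x

# Illustrative data (assumption): a noisy step signal.
rng = np.random.default_rng(7)
y = np.concatenate((np.zeros(15), np.ones(15))) + 0.1 * rng.standard_normal(30)
x_pd = primal_dual_tv(y, lam=0.5)
```

A simple sanity check is the regime of very large λ, where the TV denoising solution is the constant signal equal to the mean of y.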
Bibliography
[1] Stephane Mallat. A wavelet tour of signal processing: the sparse way. Academic press, 2008.