Data-Driven Approaches To Inverse Problems
CIME 2023
University of Cambridge
Abstract
Inverse problems are concerned with the reconstruction of unknown physical quantities
using indirect measurements and are fundamental across diverse fields such as medical
imaging (MRI, CT), remote sensing (Radar), and material sciences (electron microscopy).
These problems serve as critical tools for visualizing internal structures beyond what is
visible to the naked eye, enabling quantification, diagnosis, prediction, and discovery.
However, most inverse problems are ill-posed, necessitating robust mathematical treat-
ment to yield meaningful solutions. While classical approaches provide mathematically
rigorous and computationally stable solutions, they are constrained by the ability to ac-
curately model solution properties and implement them efficiently. This limitation has
motivated data-driven approaches that learn solution properties directly from data.
These notes offer an introduction to this data-driven paradigm for inverse problems, cov-
ering methods such as data-driven variational models, plug-and-play approaches, learned
iterative schemes (also known as learned unrolling), and learned post-processing. The
first part of these notes will provide an introduction to inverse problems, discuss classical
solution strategies, and present some applications. The second part will delve into modern
data-driven approaches, with a particular focus on adversarial regularization and provably
convergent linear plug-and-play denoisers. Throughout the presentation of these method-
ologies, their theoretical properties will be discussed, and numerical examples will be
provided for image denoising, deconvolution, and computed tomography reconstruction.
The lecture series will conclude with a discussion of open problems and future perspectives
in the field.
Contents
4 Perspectives
4.1 On Task Adaptation
4.2 The Data Driven - Knowledge Informed Paradigm
Bibliography
List of Figures
1.1 An overview of various fundamental image processing tasks.
1.2 Overview of various biological, biomedical and clinical research applications using image analysis.
1.3 Applications in conservation, sustainability, and digital humanities, showcasing various remote sensing and image analysis techniques.
1.4 Applications in Physical Sciences, including materials science, computational fluid dynamics, astrophysics, and geophysics.
1.5 Illustration of non-uniqueness in CT reconstruction (left) and a description of ill-posedness in inverse problems (right). Courtesy of Samuli Siltanen.
1.6 Regularization visualized.
2.1 Examples of different noise models and corresponding data fidelity terms, with example images. See works by Werner and Hohage [2012], Hohage and Werner [2014].
2.2 Example of TV denoised image of rectangles. The total variation penalizes small irregularities/oscillations while respecting intrinsic image features such as edges.
2.3 Properties of Total Variation Smoothing.
2.4 Comparison of regularization methods, showcasing that the convex relaxation of the ℓ0 penalty in TV by ℓ1 successfully achieves sparsity and is a more natural prior for denoising.
2.5 MRI reconstruction example: (a) Ground truth Shepp-Logan phantom. (b) Undersampled k-space (Fourier) data. (c) Reconstruction via zero-filling the undersampled k-space and inverse Fourier transform. (d) Reconstruction using a Total Variation (TV) regularized approach.
2.6 Example of binary Chan–Vese segmentation compared to Mumford–Shah segmentation [Mumford and Shah, 1989, Pock et al., 2009, Getreuer, 2012].
3.5 Illustration of the artifacts that appear when learned operators are applied repeatedly without convergence guarantees. Example borrowed from [Gilton et al., 2021a].
3.6 Comparison of CT reconstructions: (a) a good quality reconstruction, (b) the corresponding sinogram data, and (c) a poor quality reconstruction.
3.7 Diagram illustrating the concept of spectral filtering. From [Hauptmann et al., 2024].
3.8 Spectral Filtering for Convergent Regularisation.
4.1 Biomedical imaging pathway: The path from imaging data acquisition to prediction, diagnosis, and treatment planning features several processing and analysis steps which usually are performed sequentially. CT data and segmentation are courtesy of Evis Sala and Ramona Woitek.
4.2 Task-adapted reconstruction, with CNN-based MRI reconstruction (task X) and CNN-based MRI segmentation (task D). Both are trained jointly with combined loss CℓX + (1 − C)ℓD for varying C ∈ [0, 1]. All figures from [Adler et al., 2022].
Chapter 1
Introduction to Inverse Problems
Fig. 1.1: An overview of various fundamental image processing tasks.
(a) Image Denoising: Given a noisy image y = u + n, the task is to compute a denoised version u∗ [Ke and Schönlieb, 2021].
(b) Image Segmentation: Given an image u on a domain Ω, compute the characteristic function χS of a region of interest S ⊂ Ω [Grah et al., 2017].
(c) Image Reconstruction: Compute u from indirect, noisy measurements y = A(u) + n, with A a known operator [Benning and Burger, 2018, Arridge et al., 2019].
(d) Image Classification: Given a set of images ui, the task is to assign appropriate labels yi to each image [Aviles-Rivero et al., 2019].
In this chapter, we begin by exploring the concept of well-posedness and its significance in
the context of inverse problems. This will lead us to the notion of ill-posedness of inverse
problems. These are often characterized by high sensitivity to noise, meaning even small
errors or perturbations in the input data can lead to large variations in the solution. We
will then investigate a range of knowledge-driven regularization techniques designed to
mitigate the effects of ill-posedness and stabilize the solution process.
• General Image Processing: Image processing more generally involves solving mul-
tiple intertwined inverse problems, with many applications spanning environmental
conservation, remote sensing, and digital humanities, as illustrated in Figure 1.3. For
instance, in conservation and environmental science, LiDAR data can be used for
detailed tree monitoring and forest assessment (Figure 1.3a), or multispectral and
hyperspectral imagery for landcover analysis (Figure 1.3b). Remote sensing data,
coupled with image processing, also aids in understanding dynamical systems, such
as analyzing traffic flow for urban planning and infrastructure management (Fig-
ure 1.3c). Multi-modal image fusion (Figure 1.3d), with data from different sensors,
can be used for improving data representations, applicable in many fields ranging
from remote sensing to medical imaging. In the realm of digital humanities, compu-
tational image processing plays a vital role for virtual art restoration and interpreta-
tion (Figure 1.3e), where imaging can help unveil hidden details, analyze materials,
or digitally restore damaged cultural heritage artifacts. These varied applications
all rely on extracting meaningful information from image data, often necessitating
a range of image processing steps such as image reconstruction, enhancement, seg-
mentation, feature extraction, deblurring, denoising and registration, many of which
(c) Mitosis analysis: [Grah et al., 2017]
(d) Spatio-temporal MRI: [Aviles-Rivero et al., 2018, 2021]
(e) Tumour segmentation: [Buddenkotte et al., 2023]
[Remaining panel: graph-based classification of an X-ray dataset, with graph construction (initial and final graph) followed by a classifier whose outputs include effusion, atelectasis, emphysema, pneumonia, mass, and infiltration.]
Fig. 1.2: Overview of various biological, biomedical and clinical research applications using image analysis.
• Physical Sciences: Inverse problems are foundational across the physical sciences,
enabling researchers to probe and understand phenomena from the vastness of cos-
mic structures down to the intricacies of material microstructures, as illustrated in
Figure 1.4. In astrophysics, for instance, inverse problems arise when reconstructing
images of distant celestial objects from data collected by telescopes [Starck, 1998].
This process is complicated by vast distances, interference from various light sources,
faint signals, and atmospheric disturbances. A particularly famous example is the
first imaging of a black hole (Figure 1.4c) [Akiyama et al., 2019].
(a) Tree monitoring w/ LiDAR: [Lee et al., 2015, 2016] (b) Landcover analysis: [Sellars et al., 2019]
(c) Analysing traffic: EPSRC project (d) Multi-modal image fusion: [Bungert et al., 2018]
(e) Virtual art restoration and interpretation: Mathematics for Applications in Cultural Heritage (MACH)
https://mach.maths.cam.ac.uk funded by the Leverhulme Trust; [Calatroni et al., 2018, Parisotto et al., 2019,
2020]
Fig. 1.3: Applications in conservation, sustainability, and digital humanities, showcasing various remote
sensing and image analysis techniques.
(a) Material Sciences: [Tovey et al., 2019] (b) CFD : [Benning et al., 2014]
Fig. 1.4: Applications in Physical Sciences, including materials science, computational fluid dynamics,
astrophysics, and geophysics.
1.1 Well-posed and ill-posed problems
Consider the forward measurement model y = Au.
While the forward problem is generally assumed to be well-posed, inverse problems are
typically ill-posed, meaning they violate at least one of the conditions for well-posedness.
This focus on ill-posedness arises from physical motivations rather than being an abstract
concern; it stems directly from the fact that a vast majority of practical problems in science
and engineering are indeed ill-posed. The following simple examples illustrate common
issues that arise when these conditions are not met.
• (Instability) Finally suppose that n = m, and that there exists an inverse A−1 :
Rm → Rn . Suppose further that the condition number κ = λ1 /λm is very large,
where λ1 and λm are the biggest and smallest eigenvalues of A. Such a matrix is
said to be ill-conditioned. In this case, the problem is sensitive to even the smallest
errors in the measurement. The naive reconstruction ũ = A−1 yn = u + A−1 n would
not produce a meaningful solution, but would instead be dominated by the noise
component A−1 n.
The last example illustrates one of the key questions of inverse problem theory:
Example 1.1.3 (Blurring in the continuum). One common and illustrative inverse prob-
lem encountered in image processing is deblurring. Imagine we have an image that has
been blurred, perhaps by camera motion or an out-of-focus lens. Our goal is to recover the
original, sharp image. This seemingly straightforward task quickly reveals the challenges
inherent in many inverse problems.
Let us consider this in a continuous one-dimensional setting. Suppose our observed blurred
function, y(x) : R → R, results from the convolution of an original sharp function, u(x),
with a blurring kernel, i.e. y(x) = (Gσ ∗ u) (x), where
Gσ(x) := (1/(2πσ²)) e^{−|x|²/(2σ²)}    (Gaussian kernel)
with standard deviation σ, dictating the extent of the Gaussian blur. Our objective is
to reconstruct the original sharp function u from the observed blurred function y. This
turns out to be equivalent to inverting the heat equation. To be precise, the blurred
measurement y(x) can be seen as a solution to the heat equation at a specific time t = σ 2 /2,
with u(x) being the initial condition. Therefore, attempting to retrieve u(x) from y(x) is
analogous to solving the heat equation backward in time. Ill-posedness in this example
arises from a lack of continuous dependence of the solution on the data: small errors in
the measurement y can lead to very large errors in the reconstructed u. From Fourier
theory, we can write the following, where F and F −1 denote the Fourier transform and
inverse Fourier transform respectively:
y = √(2π) F⁻¹( FGσ · Fu ),        u = (1/√(2π)) F⁻¹( Fy / FGσ ).
Instead of measuring a blurry y, suppose now that we measure a blurry and noisy yδ =
y + nδ , with deblurred solution uδ . Then,
√(2π) |u − uδ| = | F⁻¹( F(y − yδ) / FGσ ) | = | F⁻¹( Fnδ / FGσ ) |.
Now, for high frequencies, F(nδ) will not decay while FGσ will be very small (the Gaussian
symbol decays rapidly, reflecting the smoothing, compact nature of the convolution operator).
Hence, the high-frequency components of the error are amplified!
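The following NumPy sketch (with illustrative signal, blur width, and noise level of our own choosing, not taken from the notes) makes this amplification concrete: dividing by the Gaussian symbol in Fourier space recovers the signal from noise-free data, but is dominated by amplified high-frequency noise as soon as a tiny perturbation is added.

import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(-1.0, 1.0, n, endpoint=False)
u = (np.abs(x) < 0.3).astype(float)                 # sharp signal: a box function

sigma = 0.05                                        # blur width (illustrative value)
g = np.exp(-x**2 / (2 * sigma**2)); g /= g.sum()
G = np.fft.fft(np.fft.ifftshift(g))                 # Fourier symbol of the blur, F G_sigma

y_hat = G * np.fft.fft(u)                           # noise-free blurred data (Fourier domain)
noise_hat = np.fft.fft(1e-6 * rng.standard_normal(n))

u_exact = np.fft.ifft(y_hat / G).real               # naive deconvolution, exact data: fine
u_noisy = np.fft.ifft((y_hat + noise_hat) / G).real # naive deconvolution, noisy data: blows up

print(np.abs(u_exact - u).max(), np.abs(u_noisy - u).max())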
In many practical linear inverse problems, the condition of continuity is the first to break
down. This failure of continuous dependence of the solution on the data leads to extreme
amplification of noise. In addition, the uniqueness condition often fails in under-sampled
inverse problems. While under-sampling has physical advantages, such as faster data
acquisition or reduced exposure (e.g., in medical imaging), the trade-off is significant
noise amplification, thereby making the ill-posedness even more severe.
f(θ, s) = (Ru)(θ, s) = ∫_{x·θ=s} u(x) dx = ∫_{θ⊥} u(sθ + y) dy
for θ ∈ S¹ and θ⊥ the unit vector orthogonal to θ. Effectively, the Radon transform
in two dimensions integrates the function u over lines in R². Since S¹ is the unit circle
S¹ = {θ ∈ R² : ∥θ∥ = 1}, we can choose for instance θ = (cos(φ), sin(φ))⊤, for φ ∈ [0, 2π),
and parameterize the Radon transform in terms of φ and s, i.e.
f(φ, s) = (Ru)(φ, s) = ∫_R u( s cos(φ) − t sin(φ), s sin(φ) + t cos(φ) ) dt
Note that, with respect to the origin of the reference coordinate system, φ determines the
angle of the line along which one wants to integrate, while s is the offset of that line from
the center of the coordinate system. It can be shown that the Radon transform is linear and
continuous, and even compact. Visually, CT simply turns images into sinograms:
[Illustration: the Radon transform R maps an image to its sinogram, to which measurement noise n is added.]
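As a quick illustration, and assuming scikit-image is available (the phantom, angles, and noise level below are our own choices, not the notes'), a sinogram and a classical filtered back-projection reconstruction can be computed as follows.

import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

u = rescale(shepp_logan_phantom(), 0.25)               # small test image
angles = np.linspace(0.0, 180.0, 60, endpoint=False)   # projection angles phi, in degrees

sinogram = radon(u, theta=angles)                      # f(phi, s): the sinogram
noisy_sinogram = sinogram + 0.5 * np.random.randn(*sinogram.shape)

# filtered back-projection (the filter_name argument assumes a recent scikit-image)
fbp = iradon(noisy_sinogram, theta=angles, filter_name="ramp")
print(u.shape, sinogram.shape, fbp.shape)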
Informally, following [Epstein, 2007, Candès, 2021a], we can construct eigenfunctions of the
operator R∗ R, where R∗ is the adjoint of the Radon transform. For a formal presentation
see [Candès, 2021b]. For g(x) = ei⟨k,x⟩ , we have:
R∗R[g](x) = (1/∥k∥) e^{i⟨k,x⟩}
Definition 1.2.1. Let A ∈ L(U, V) be a bounded operator. A family {Rα}α>0 of contin-
uous operators is called a regularization (or regularization operator) of A† if
Rα y → A†y = u†   as α → 0, for all y in the domain of A†.
Here, α > 0 acts as a tuning parameter balancing the effect of the data fidelity term
∥Au − yn ∥2 , which ensures consistency with the observed measurements. The regulariza-
tion term R(u) aims to incorporate prior knowledge about the reconstruction, penalizing
u if it is not “realistic”.
The selection of an appropriate regularizer R and the tuning of the parameter α are
critical for designing effective regularization methods. By introducing regularization, we
aim to achieve more than just any solution: we seek a problem that is well-posed and
[Fig. 1.6: Regularization visualized. Schematic relating, via the operator A between spaces U and V, the true solution u†, the noisy data yδ at distance δ from the clean data y, the unstable reconstruction A†yδ, and the regularized reconstruction u(yδ, α).]
whose solution is a good approximation of the true solution. This is normally captured
through the properties of existence, uniqueness, stability and convergence. Here, we will
present this for the simple setting of finite dimensions and strongly convex regularizers,
and a proof for this setting can be found in Mukherjee et al. [2024].
u† = argmin_{u∈Rⁿ} R(u)   subject to   Au = y⁰
This naturally extends to more complex settings [Shumaylov et al., 2024, Pöschl, 2008,
Scherzer et al., 2009, Mukherjee et al., 2023]. In words, Existence, Uniqueness and Stability
ensure that Definition 1.1.1 is satisfied and the regularized inverse problem is well-posed.
Convergent Regularization on the other hand shows that “solution is close to the orig-
inal”, and thus that formulating the inverse problem using a variational formulation is
reasonable.
Recall the Bayes theorem, which provides us with a way to statistically invert: for
y ∈ Rm , u ∈ Rn
p(u | y) = p(y | u) p(u) / p(y).
The likelihood p(y | u) is determined by the forward model and the statistics of the
measurement error, and p(u) encodes our prior knowledge on u. In practice, p(y) is
usually a normalizing factor which may be ignored. The maximum a-posteriori (MAP)
estimate is the maximizer u∗ of the posterior distribution:
u∗ = argmax_u p(u | y).
This interpretation provides a useful connection between the
variational formulation and noise statistics. To be precise, the likelihood can be interpreted
as a fidelity term, and the prior as regularization. See e.g. [Pereyra, 2019, Tan et al.,
2024a]. For Gaussian noise, for example, the regularization parameter is given by the
variance of the noise model as shown below.
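As a short worked sketch of this connection (a standard derivation; the corresponding display is missing from this excerpt): assume Gaussian noise n ∼ N(0, σ²I), so that p(y | u) ∝ exp(−∥Au − y∥²/(2σ²)), and a Gibbs-type prior p(u) ∝ exp(−λR(u)). Taking negative logarithms in the MAP problem gives

u∗ = argmax_u p(u | y) = argmin_u { −log p(y | u) − log p(u) }
   = argmin_u { (1/(2σ²)) ∥Au − y∥² + λ R(u) }
   = argmin_u { ½ ∥Au − y∥² + α R(u) },   with α = σ²λ,

so the effective regularization parameter scales with the noise variance σ².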
Chapter 2
Variational Models and PDEs for Inverse Imaging
In this chapter we will explore the relationship between variational models and the theory
of Partial Differential Equations (PDEs). This connection will equip us with powerful
analytical and computational tools for analyzing and solving inverse problems. We begin
by investigating the impact of various regularization choices, with a particular emphasis
on Total Variation (TV) regularization due to its efficacy in preserving edges - a property
fundamental in image processing tasks. Because regularizers like TV often result in non-
smooth optimization problems, we will then introduce key concepts from classical convex
analysis. This provides the essential numerical tools for minimizing such variational ener-
gies. These tools are fundamental to many established methods and have led to a diverse
“zoo” of regularizers designed for various imaging tasks. Finally, we will finish the sec-
tion by highlighting the inherent limitations of purely model-driven approaches, thereby
motivating the exploration of data-driven and hybrid techniques in subsequent sections.
The general variational approach considered throughout is
min_u { D(Au, y) + α R(u) }.        (2.1)
Here, the data fidelity D(Au, y) enforces alignment between the forward model applied to u
and the observed data y. A common choice for D is a least-squares distance measure, such
as ½∥Au − y∥², however in general the choice of D depends on the data statistics (see Section 1.2.3),
and some examples are shown in Figure 2.1. The second term R(u) is a functional that
incorporates a-priori information about the image, acting as a regularizer, and α > 0 is
a weighting parameter that balances the influence of this prior information against fidelity
to the data. Some basic examples of forward operators include:
Example 2.1.1 (1D Tikhonov). Classical Tikhonov regularization often employs simple
quadratic regularizers like R(u) = ½ ∫_Ω u² dx or, more commonly for images, R(u) =
½ ∫_Ω |∇u|² dx. The latter term penalizes large gradients, encouraging smoothness in the
solution. However, this choice implies that the reconstructed image u possesses a certain
degree of regularity. Most crucially for imaging, the reconstruction cannot exhibit
sharp discontinuities like object boundaries or fine edges within an image.
To see this, consider a one-dimensional scenario where u : [0, 1] → R and u ∈ H 1 (0, 1),
i.e. u is in L² with an L² (weak) derivative. For any 0 < s < t < 1, we have:
u(t) − u(s) = ∫_s^t u′(r) dr ≤ √(t − s) ( ∫_s^t |u′(r)|² dr )^{1/2} ≤ √(t − s) ∥u∥_{H¹(0,1)}
This inequality shows that u must be Hölder continuous with exponent 1/2 (i.e., u ∈
C 1/2 (0, 1)), precluding jump discontinuities.
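This over-smoothing can also be observed numerically. Below is a minimal sketch (our own discretization and parameter values) of one-dimensional Tikhonov denoising, min_u ½∥u − y∥² + (α/2)∥Du∥², solved via its linear optimality system (I + αDᵀD)u = y; the reconstructed step edge is visibly smeared out.

import numpy as np

n, alpha = 200, 5.0
u_true = np.zeros(n); u_true[n // 2:] = 1.0                  # step edge
y = u_true + 0.05 * np.random.default_rng(1).standard_normal(n)

D = np.diff(np.eye(n), axis=0)                               # forward differences, (n-1) x n
L = D.T @ D                                                  # discrete negative Laplacian

u_tik = np.linalg.solve(np.eye(n) + alpha * L, y)            # (I + alpha L) u = y
print(np.diff(u_tik).max())                                  # much smaller than the true jump of 1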
Example 2.1.2 (2D Tikhonov). Extending this to a two-dimensional image u ∈ H 1 ((0, 1)2 ),
Fig. 2.2: Example of TV denoised image of rectangles. The total variation penalizes small irregulari-
ties/oscillations while respecting intrinsic image features such as edges.
one can show that for almost every y ∈ (0, 1), the function x 7→ u(x, y) (a horizontal slice
of the image) belongs to H 1 (0, 1). This is because:
∫₀¹ ( ∫₀¹ | ∂u(x, y)/∂x |² dx ) dy ≤ ∥u∥²_{H¹} < ∞
This implies that u cannot have jumps across vertical lines in the image (and similarly for
horizontal lines).
The space BV (Ω) is particularly well-suited for images because, unlike H 1 (Ω), BV
functions can have jump discontinuities (edges). Minimizing the total variation
penalizes small irregularities and oscillations while respecting intrinsic image features
such as edges. See Figure 2.3 for a visualisation of properties of TV. Heuristically, the total
variation of a function quantifies the “amount of jumps” or oscillations it contains; thus,
Fig. 2.3: Properties of Total Variation (TV) smoothing. (a-b) TV penalizes small irregularities and
oscillations, and tends to preserve edges. (c) The total variation measures the size of the jump discontinu-
ity. Overall, the total variation penalizes small irregularities/oscillations while respecting intrinsic image
features such as edges [Rudin et al., 1992].
noisy images, which typically have many rapid oscillations, have a large TV value. Owing
to these desirable properties, TV regularization has become a widely used technique in
image processing and inverse problems. It promotes solutions that are piecewise smooth
yet retain sharp edges, a property that is crucial in image processing, see e.g. Figures 2.3
and 2.4d. The non-differentiability of the TV term, however, necessitates specialized
optimization algorithms, such as primal-dual methods or general splittings [Lions and
Mercier, 1979, Combettes and Wajs, 2005, Hintermüller and Stadler, 2003].
Example 2.2.2 (Compressed sensing). In compressed sensing [Candes et al., 2006, Poon,
2015], TV regularization plays a vital role. Images often exhibit sparse gradients (large
areas of constant intensity), a key assumption in compressed sensing. For u ∈ W 1,1 (Ω),
the total variation coincides with the L¹ norm of its gradient: |Du|(Ω) = ∫_Ω |∇u| dx = ∥∇u∥_{L¹(Ω)}.
The L1 norm is well-known for promoting sparsity. While the L0 norm (counting non-
zero gradient values) would be ideal for enforcing sparse gradients, it leads to compu-
tationally intractable (NP-hard) problems. The L1 norm serves as a convex relaxation,
making optimization feasible while still encouraging solutions with few non-zero gradient
values, characteristic of piecewise constant regions. Remarkably, if the underlying data
is indeed sparse, TV regularization enables near-perfect reconstruction from significantly
undersampled data, for example Figures 2.4 and 2.5.
Example 2.2.3 (MRI). Magnetic Resonance Imaging ( [Lustig, 2008, Fessler, 2008]) is
a medical imaging technique that measures the response of atomic nuclei in a strong
magnetic field. The measured data in MRI is essentially a sampled Fourier transform
of the object being imaged. In many MRI applications, acquiring a full set of Fourier
measurements is time-consuming and can be uncomfortable for the patient. Compressed
sensing offers a way to speed up the process by acquiring only a subset of the Fourier
data. This is known as undersampled Fourier acquisition, see Figure 2.5b.
y = (Fu)|Λ + n,
where F denotes the Fourier transform operator, Λ is the set of (undersampled) Fourier
coefficients and n represents noise in the measurement process. We often seek a piecewise
constant image, which means the image has distinct regions with constant intensities (like
different tissues in the body). The goal, then, is to identify a piecewise constant function
Fig. 2.4: Comparison of regularization methods. This showcases that the convex relaxation of the ℓ0 penalty
in TV by ℓ1 successfully achieves sparsity and is a more natural prior for denoising.
u consistent with the data y. As before, minimizing ∥∇u∥₀ under data consistency is NP-
hard, and ℓ1 can be used as convex relaxation, leading to total variation minimization:
min_u α∥∇u∥₁ + ½ ∥ (Fu)|_Λ − y ∥².
(a) Ground truth (b) Undersampled Fourier (c) Zero-filling (d) TV solution
Fig. 2.5: MRI reconstruction example: (a) Ground truth Shepp-Logan phantom. (b) Undersampled
k-space (Fourier) data. (c) Reconstruction via zero-filling the undersampled k-space and inverse Fourier
transform. (d) Reconstruction using a Total Variation (TV) regularized approach.
Example 2.2.4 (Sets of Finite Perimeter and the Co-area Formula). For instance, if
Ω ⊂ R2 is an open set and D is a subset with a C 1,1 boundary, the total variation of its
characteristic function u = χD (1 inside D, 0 outside) is simply the perimeter of D within
Ω: |Du|(Ω) = H1 (∂D ∩ Ω). See Ambrosio et al. [2000]. More generally, the co-area
formula states that for any u ∈ BV (Ω):
|Du|(Ω) = ∫_{−∞}^{+∞} Per({u > s}; Ω) ds
where Per({u > s}; Ω) = ∥Dχ{u>s} ∥(Ω) is the perimeter of the superlevel set of u at level
s. This formula reveals that the total variation of u is the integral of the perimeters of all
its level sets.
Example 2.2.5 (Chan–Vese Segmentation). The Chan–Vese model [Chan and Vese, 2001]
is a popular variational approach for image segmentation that leverages the TV regular-
ization. It stems from the Mumford–Shah functional [Mumford and Shah, 1989], which
aims to find an optimal piecewise smooth approximation of a given image. The Chan–
Vese model simplifies this by assuming that the image can be segmented into regions with
constant intensities.
Let Ω ⊂ R2 represent the image domain, and let y : Ω → R denote the given image. The
Chan–Vese model seeks to partition Ω into two regions, represented by a binary function
χ : Ω → {0, 1}. The objective functional to be minimized is:
min_{χ, c₁, c₂} α|Dχ|(Ω) + ∫_Ω (y − c₁)² χ + ∫_Ω (y − c₂)² (1 − χ),
where χ is the binary segmentation function, c1 and c2 are the average intensities within the
regions where χ = 1 and χ = 0, respectively. Solving the original Chan–Vese formulation
Fig. 2.6: Example of binary Chan–Vese segmentation compared to Mumford-Shah segmentation. [Mum-
ford and Shah, 1989, Pock et al., 2009, Getreuer, 2012].
with the binary constraint is computationally challenging. A common approach [Cai et al.,
2013, 2019] to address this is to relax the binary constraint. This leads to the following
convex optimization problem (with given c1 and c2 ):
min_v α|Dv|(Ω) + ∫_Ω (y − c₁)² v + ∫_Ω (y − c₂)² (1 − v),
with the relaxed segmentation function v ∈ [0, 1]. The final binary segmentation is then
typically obtained by thresholding the resulting v. See Figure 2.6 for a visual example.
Example 2.3.1 (Nonlinear Image Smoothing: The ROF Model). The ROF model seeks
to determine an image u that remains close to a noisy observation y while also possessing
a minimal total variation. The corresponding optimization problem is:
min_u α|Du|(Ω) + ½∥u − y∥².
To understand the process by which u evolves to minimize this energy functional, we can
examine its (sub)gradient flow. This flow describes the path of steepest descent for the
functional. It is given by the differential inclusion [Bellettini et al., 2002a]:
u_t ∈ −α ∂|Du|(Ω) − (u − y),   i.e.   u_t = −α p − (u − y)   for some p ∈ ∂|Du|(Ω).
In this expression, ut denotes the derivative of u with respect to an artificial time variable
t (representing the evolution of the flow), and p is an element from the subdifferential of
the TV term.
In regions of the image where the gradient is component-wise non-zero, the subdifferential
reduces to a singleton, and the gradient flow can be expressed more explicitly as the following
nonlinear PDE:
u_t = α div( Du / |Du| ) − (u − y),   in Ω.
This equation is a nonlinear diffusion equation. Its key characteristic is that the effective
diffusion coefficient is inversely proportional to the magnitude of the image gradient, |Du|.
This property leads to a highly desirable selective smoothing behavior: in relatively flat
regions of the image where |Du| is small (often dominated by noise), the diffusion is strong,
leading to significant smoothing. Conversely, near sharp edges where |Du| is large, the
diffusion is weak, which helps to preserve these important structural features of the image
while reducing noise elsewhere.
This section concerns convex minimization problems of the form min_u J(u) + H(u),
where J and H are proper and convex functions, and one or both may be Lipschitz
differentiable. Note that for this section the notation has changed from the usual, as will
become clear from Example 2.4.1 - both J and H can take on the role of the data fidelity.
Example 2.4.1 (ROF problem). For a given noisy image y ∈ Rn , the ROF model from
Example 2.3.1 seeks to find an image u by solving:
min_u α∥∇u∥_{2,1} + ½∥u − y∥²₂,
where the TV term is ∥∇u∥_{2,1} = Σ_{ij} |(∇u)_{ij}| = Σ_{ij} √( (u_x)²_{ij} + (u_y)²_{ij} ).
Several algorithms have been developed to compute minimizers of such functionals. One
approach involves regularising the TV term to make it differentiable. For instance, one
might consider instead solving the following regularized ROF problem:
min_u α Σ_{ij} √( (u_x)²_{ij} + (u_y)²_{ij} + ϵ ) + ½∥u − y∥²₂
for a small 0 < ϵ ≪ 1. The regularized TV is differentiable in the classical sense, therefore
we can apply classical numerical algorithms to compute a minimizer, e.g. gradient descent,
conjugate gradient methods etc.
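A minimal sketch of this smoothed approach in one dimension (step size, ϵ, and test signal are our own choices): plain gradient descent on E(u) = α Σᵢ √((Du)ᵢ² + ϵ) + ½∥u − y∥².

import numpy as np

def smoothed_tv_denoise(y, alpha=1.0, eps=1e-3, tau=0.01, iters=5000):
    # gradient descent on alpha * sum(sqrt((Du)^2 + eps)) + 0.5 * ||u - y||^2  (1D)
    u = y.copy()
    for _ in range(iters):
        du = np.diff(u)                               # forward differences D u
        w = du / np.sqrt(du**2 + eps)                 # derivative of sqrt(t^2 + eps)
        grad_tv = np.concatenate(([-w[0]], w[:-1] - w[1:], [w[-1]]))   # D^T w
        u = u - tau * (alpha * grad_tv + (u - y))
    return u

y = np.concatenate([np.zeros(100), np.ones(100)]) + 0.1 * np.random.randn(200)
u = smoothed_tv_denoise(y)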
Recall that the subdifferential of a convex functional J at x ∈ X is defined as
∂J(x) := { p ∈ X′ : J(v) ≥ J(x) + ⟨p, v − x⟩ for all v ∈ X },
where X′ denotes the dual space of X. It is obvious from this definition that 0 ∈ ∂J(x)
if and only if x is a minimizer of J.
Example 2.4.5 (Subdifferential of the ℓ1 norm). To illustrate the concept of the sub-
differential for a common non-smooth function in imaging, consider the ℓ1-norm. Let
X = ℓ¹(Λ) and J(x) := ∥x∥₁, with Λ = {1, . . . , n} or N. The subdifferential is given by
∂J(x) = { p ∈ ℓ^∞(Λ) : p_i = sign(x_i) if x_i ≠ 0, and p_i ∈ [−1, 1] if x_i = 0 }.
Recall the Legendre–Fenchel transform (convex conjugate) of f, defined by f∗(p) :=
sup_x { ⟨p, x⟩ − f(x) }. What is more, when f is a proper, convex, lsc function, applying the
Legendre–Fenchel transform twice returns the original function: ∀x ∈ Rⁿ : f∗∗(x) = f(x).
Example 2.4.7 (One homogeneous functions). For example, for a function J that is one-
homogeneous (i.e., J(λu) = λJ(u) for every u and λ > 0), its Legendre–Fenchel transform
is the characteristic function of a closed convex set K:
J∗(v) = χ_K(v) = { 0 if v ∈ K,  +∞ otherwise }.
Since J∗∗ = J (as J is proper, convex, and lsc), we can recover J(u) from its transform:
J(u) = sup_{v∈K} ⟨v, u⟩.
Recall the proximal operator prox_{τJ}(y) := argmin_u { J(u) + (1/(2τ))∥u − y∥² }, and let
u∗ = prox_{τJ}(y). The optimality condition for this minimization is:
0 ∈ ∂J(u∗) + (u∗ − y)/τ,
which can be rewritten as u∗ = (I + τ ∂J)−1 y. Furthermore, Moreau’s identity provides a
relationship between the proximal map of J and its convex conjugate J ∗ :
y = prox_{τJ}(y) + τ prox_{(1/τ)J∗}( y/τ ).
This identity implies that if prox_{τJ} is known, prox_{(1/τ)J∗} can also be computed.
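For instance (a small self-contained sketch of our own), the proximal map of J = ∥·∥₁ is soft-thresholding, its conjugate is the indicator of the ℓ∞ unit ball, and Moreau's identity can be checked numerically:

import numpy as np

def prox_l1(y, tau):
    # prox_{tau ||.||_1}(y): soft-thresholding
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def project_linf_ball(y, radius=1.0):
    # prox of the indicator of {||.||_inf <= radius}: orthogonal projection
    return np.clip(y, -radius, radius)

y, tau = np.random.randn(5), 0.7
# Moreau identity: y = prox_{tau J}(y) + tau * prox_{(1/tau) J*}(y / tau)
lhs = prox_l1(y, tau) + tau * project_linf_ball(y / tau)
print(np.allclose(lhs, y))   # True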
Consider now the primal problem
min_{u∈X} J(Au) + H(u),
where J : Y → (−∞, +∞] and H : X → (−∞, +∞] are convex, lower semi-continuous
(l.s.c.) functions, and A : X → Y is a bounded linear operator. Using the definition of
the convex conjugate, we have J(Au) = sup_{p∈Y} (⟨p, Au⟩ − J∗(p)). Substituting this into
the primal problem leads to
min_{u∈X} sup_{p∈Y} ⟨p, Au⟩ − J∗(p) + H(u)   and   sup_{p∈Y} −J∗(p) − H∗(−A∗p),
where the latter is the dual problem, H∗ is the convex conjugate of H, and A∗ is the
adjoint of A. Under the above assumptions, there exists at least one solution p∗ (see, e.g.,
Ekeland and Temam [1999], Borwein and Luke [2015]). If u∗ solves the primal problem
and p∗ solves the dual problem, then (u∗ , p∗ ) is a saddle-point of the Lagrangian L(u, p)
defined as follows, which provides a link between primal and dual solutions:
L(u, p) := ⟨p, Au⟩ + H(u) − J∗(p),
such that for all (u, p) ∈ X × Y , we have L(u∗ , p) ≤ L(u∗ , p∗ ) ≤ L(u, p∗ ). Moreover, we
can define the primal-dual gap, a measure of suboptimality, as
G(u, p) := [ J(Au) + H(u) ] + [ J∗(p) + H∗(−A∗p) ],
which is non-negative and vanishes exactly at saddle points.
Example 2.4.8 (Dual ROF). To show how duality can simplify or offer new perspectives,
we can derive the dual of the ROF problem from Example 2.4.1. Let A = ∇, J (z) =
α∥z∥2,1 , and H(u) = 21 ∥u − y∥22 . The convex conjugate of J is
J∗(p) = χ_{{∥·∥_{2,∞} ≤ α}}(p) = { 0 if |p_{i,j}|₂ ≤ α for all i, j,  +∞ otherwise }.
The conjugate of H is H∗(q) = ½∥q + y∥²₂ − ½∥y∥²₂. Substituting into the dual formulation,
we get:
max_p −J∗(p) − ½∥∇∗p∥²₂ + ⟨∇∗p, y⟩
  = − min_p { J∗(p) + ½∥∇∗p − y∥²₂ − ½∥y∥² }.
So the dual ROF problem is equivalent to solving:
min_p { ½∥∇∗p − y∥²₂ : ∥p_{i,j}∥₂ ≤ α for all i, j }.
This dual problem is a constrained least-squares problem, which can be easier to solve
than the primal non-smooth problem. From the optimality conditions of the saddle-point
problem, we also have the relationship u = y − ∇∗ p connecting the primal and dual
solutions.
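The constrained dual can be solved, for example, by projected gradient descent on p (a one-dimensional sketch with our own step size, in the spirit of Chambolle-type projection algorithms, which are not spelled out in this excerpt); the primal solution is then recovered via u = y − ∇∗p.

import numpy as np

def dual_rof_1d(y, alpha, tau=0.2, iters=500):
    # projected gradient on  min_p 0.5*||D^T p - y||^2  s.t. |p_i| <= alpha  (1D ROF dual)
    n = len(y)
    D = np.diff(np.eye(n), axis=0)               # forward-difference operator, (n-1) x n
    p = np.zeros(n - 1)
    for _ in range(iters):
        grad = D @ (D.T @ p - y)                 # gradient of the smooth dual objective
        p = np.clip(p - tau * grad, -alpha, alpha)   # gradient step followed by projection
    return y - D.T @ p                           # primal recovery: u = y - D^T p

y = np.concatenate([np.zeros(64), np.ones(64)]) + 0.1 * np.random.randn(128)
u = dual_rof_1d(y, alpha=0.4)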
With these tools, we can now introduce several iterative algorithms designed for non-
smooth convex optimization.
Given a step size τ > 0, the proximal point algorithm iterates u^{k+1} = prox_{τJ}(u^k).
The iterate u^{k+1} is the unique minimizer of the proximal subproblem J(v) + (1/(2τ))∥v − u^k∥²₂.
This algorithm can be interpreted as an explicit gradient descent step on the Moreau–
Yosida regularization of J. The Moreau–Yosida regularization of J (or envelope) with
parameter τ > 0 is:
Jτ(ū) := min_v { J(v) + ∥v − ū∥²₂ / (2τ) }.
It can be shown that Jτ is continuously differentiable (even if J is not) with gradient:
∇Jτ(ū) = ( ū − prox_{τJ}(ū) ) / τ.
Thus, the proximal descent update uk+1 = proxτ J (uk ) can be rewritten as uk+1 = uk −
τ ∇Jτ (uk ), which is an explicit gradient descent step on the smoothed function Jτ .
Consider now composite problems of the form min_u J(u) + H(u),
where J is “simple” (its prox is easily computable) and H is differentiable with a Lipschitz
continuous gradient (Lipschitz constant L_H). The Forward-Backward splitting algorithm,
also known as the proximal gradient algorithm, combines an explicit gradient descent step
on H (forward step) and an implicit proximal step on J (backward step):
u^{k+1} = prox_{τJ}( u^k − τ∇H(u^k) ).
A point u is a minimizer of the composite objective if and only if it is a fixed point of this
iteration, which corresponds to the optimality condition 0 ∈ ∇H(u) + ∂J (u). If the step
size τ satisfies 0 < τ ≤ 1/LH , the iterates uk converge to a minimizer.
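A minimal sketch of this forward-backward (ISTA-type) iteration for the sparse recovery problem min_u ½∥Au − y∥² + α∥u∥₁, on a random test problem of our own:

import numpy as np

def ista(A, y, alpha, iters=500):
    # forward-backward splitting for 0.5*||Au - y||^2 + alpha*||u||_1
    tau = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of grad H
    u = np.zeros(A.shape[1])
    for _ in range(iters):
        v = u - tau * A.T @ (A @ u - y)                               # forward (gradient) step on H
        u = np.sign(v) * np.maximum(np.abs(v) - tau * alpha, 0.0)     # backward (prox) step on J
    return u

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 200))
u_true = np.zeros(200); u_true[rng.choice(200, 8, replace=False)] = 1.0
u_rec = ista(A, A @ u_true, alpha=0.05)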
This algorithm is closely related to other optimization techniques like the augmented
Lagrangian method and the alternating direction method of multipliers (ADMM) [Arrow
et al., 1958, Pock et al., 2009, Esser et al., 2010].
Analytical Properties:
• Function Space: The natural function space for TV regularization is the space of
functions of bounded variation, BV (Ω). This space is non-reflexive, necessitating
the use of specialized compactness properties for analysis [Ambrosio et al., 2000,
Ambrosio, 1990, De Giorgi and Ambrosio, 1988, De Giorgi, 1992].
• Stability: Novel metrics like Bregman divergences can be employed for deriving
stability estimates in TV-regularized problems [Burger et al., 2007, Hofmann et al.,
2007, Schönlieb et al., 2009].
• Non-differentiability: The non-differentiability of the TV term requires tools from
convex analysis, such as subgradients, and leads to the study of TV flow via dif-
ferential inclusions or viscosity solutions [Chen et al., 1999, Ambrosio and Soner,
1996, Caselles et al., 2007, Alter et al., 2005, Caselles and Chambolle, 2006, Bellet-
tini et al., 2002a,b, Paolini, 2003, Novaga and Paolini, 2005, Bellettini et al., 2006].
Analysis often draws upon geometric measure theory [Federer, 1996, 2014, Allard
and Almgren, 1986, Allard, 2008].
Numerical Properties:
This “regularizer zoo” highlights the diversity of approaches developed to address the
specific challenges posed by different image reconstruction problems. The selection of an
appropriate regularizer requires careful consideration of the image properties, the degra-
dation process, and the desired characteristics of the reconstruction.
This limitation highlights the need for a paradigm shift towards data-driven recon-
struction methods, which leverage the power of overparameterized models like support
vector machines and neural networks [Goodfellow et al., 2016]. These models, trained on
vast amounts of data, can learn intricate patterns and relationships that may be diffi-
cult to capture through explicit mathematical formulations. The next section will delve
into existing paradigms and challenges of data-driven reconstruction, exploring how it
can complement and even surpass knowledge-driven methods in the quest for accurate
and robust image recovery. Furthermore, we will investigate the emerging field of hybrid
approaches that combine the strengths of both data-driven and knowledge-driven tech-
niques, potentially leading to a new generation of image reconstruction algorithms that
push the boundaries of performance and applicability.
Chapter 3
Data-Driven Approaches to Inverse Problems
(a) Ground-truth (b) FBP: 21.61 dB, 0.17 (c) TV: 25.74 dB, 0.80 (d) LPD: 29.51 dB, 0.85
Fig. 3.1: Limited angle CT reconstruction: Heavily ill-posed problem. Deep Learning cannot do magic
and also hits the boundaries of what is mathematically possible. The fully learned method LPD (Section 3.1) in
(d) begins hallucinating, as highlighted in red boxes, despite resulting in better performance metrics (here
PSNR and SSIM).
The deep learning revolution of the 2010s fundamentally transformed our approach
to complex imaging tasks by challenging traditional modeling paradigms. As computa-
tional power and data availability dramatically increased, neural networks demonstrated
their ability to learn intricate representations directly from massive datasets, often out-
performing carefully crafted mathematical models. Consequently, given the abundance
of image data available today, a natural question emerges: why meticulously handcraft
Fig. 3.2: Sparse view CT reconstruction: top row is based on mathematical/handcrafted models; bottom
row is using novel deep learning based models. For this problem, deep learning methods result in both
improved metrics (here PSNR and SSIM) and visually better reconstructions. Photo courtesy of Subhadip
Mukherjee [Mukherjee et al., 2023].
models when we can potentially derive effective priors simply by providing sufficient data
to overparameterized models?
Deep learning exemplifies this paradigm, using extensive computational resources to train
highly flexible, over-parameterized neural networks that can adapt to diverse imaging
tasks and datasets, especially in high dimensions, while remaining computationally effi-
cient. An example in the context of inverse problems is shown in Figure 3.2 – learned
methods consistently and significantly outperform knowledge driven methods like TV reg-
ularization. Despite their power, these models often sacrifice interpretability and demand
substantial training data to achieve good performance. An example of this in the context
of inverse problems is shown in Figure 3.1 - for a significantly ill-posed problem, fully
learned methods begin hallucinating, despite resulting in better performance metrics.
We note, however, that the derivation of models directly from data is by no means exclusive
to deep neural networks; in fact, such approaches predate and extend beyond them, con-
stituting a rich methodological landscape within machine learning and signal processing.
To illustrate, classical learning techniques have long been employed to explore data-driven
regularization models, for example:
• Sparse Coding and Dictionary Learning: These methods aim to find a sparse
representation of images in terms of a learned dictionary of basic building blocks (atoms).
Some examples include Elad and Aharon [2006], Aharon et al. [2006], Mairal et al.
[2009], Rubinstein et al. [2010], Moreau and Bruna [2016], Chandrasekaran et al.
[2011], DeVore [2007], Fadili et al. [2009], Mallat and Zhang [1993], Elad and Aharon
[2006], Rubinstein et al. [2009], Papyan et al. [2017], Peyré [2009].
• Bilevel Parameter Learning: Some examples include Calatroni et al. [2017a], Kunisch and Pock [2013], De los
Reyes et al. [2016], Haber et al. [2009], Langer [2017], Horesh et al. [2010].
The key distinctions between the paradigms emerge not just in methodology, but in their
philosophical approach: knowledge-driven models seek to understand through explicit
modeling assumptions, while data-driven models pursue understanding through statistical
learning and pattern recognition.
The rest of this section will discuss deep learning more generally, will present a number
of recent approaches within the data-driven paradigm, progressively advancing towards
methodological frameworks that intersect the two paradigms, exemplified through meth-
ods using deep neural networks as regularizers.
The usefulness of the resulting model is influenced by all of these model design parameters,
as well as by the quality of the training set and the optimization approach used. This, in turn,
makes it difficult to understand the internal workings of such models and to interpret their
outputs, rendering deep learning models effectively a “black box”.
Despite these challenges, deep learning offers interesting opportunities for inverse problems thanks
to its ability to produce highly accurate and computationally fast solutions. However, to
fully leverage the potential of deep learning, it is essential to integrate these techniques
with established mathematical modeling principles. This synergy is crucial to ensure pre-
dictable and reliable (in a certain sense) behavior in the resulting solutions, as oftentimes
interpretability goes hand in hand with mathematical guarantees, but often comes at the
expense of either reduced performance or computational intractability. A central goal of
the field is thus to find the sweet spot between computational power and mathematical
guarantees.
z⁰ = x ∈ X,
z^{k+1} = f^k( z^k, θ^k ),   k = 0, . . . , K − 1,
where z k ∈ X k represents the feature vector at the k-th layer, with X k being the corre-
sponding feature space, and f k : X k × Θk → X k+1 is the non-linear transformation at
the k-th layer, parameterized by θk . A common choice for f k is an affine transformation
composed with a pointwise non-linearity, i.e. f^k(z^k, θ^k) = σ( W^k z^k + b^k ),
where W k is a weight matrix (for imaging tasks often represented by a convolution oper-
ator), bk is a bias vector, σ is an element-wise non-linear activation function (e.g., ReLU,
tanh).
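For concreteness, a minimal NumPy sketch of such a layered map ΨΘ (widths and initialization are arbitrary choices of ours):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    # z^0 = x, z^{k+1} = sigma(W^k z^k + b^k); the final layer is left affine
    z = x
    for k, (W, b) in enumerate(params):
        z = W @ z + b
        if k < len(params) - 1:
            z = relu(z)
    return z

rng = np.random.default_rng(0)
widths = [16, 32, 32, 1]                                    # feature-space dimensions X^k
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]
out = forward(rng.standard_normal(16), params)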
The training process aims to optimize the network parameters θ by minimizing a loss
function Ln over a given dataset {(xn , cn )}n , often with an added regularization term
R(θ) (this time to regularize the training):
min_{θ∈Θ} (1/N) Σ_{n=1}^N L_n( Ψ(x_n, θ), c_n ) + R(θ)
This generic framework can be adapted and applied to various mathematical imaging
tasks, such as image classification, segmentation, and reconstruction, by appropriately
defining the network architecture, loss function, and training data.
Each ΛΘk can be viewed as a residual layer in a neural network ΨΘ (y) with N layers, which
reconstructs u from y. These schemes are typically derived from an iterative method
designed to solve a variational regularization problem and several variations of learned
iterative schemes exist, each with a different formulation for the layers ΛΘk . Writing the
steps as
u^{k+1} = Λ_{θ^k}( u^k, A∗(Au^k − y) )   for k = 0, . . . , N − 1,
for some neural networks Λθk : X × X → X, the main examples (described in more detail
below) are
for some neural network Γθ : X → X with an architecture that does not involve data or the
forward operator (or its adjoint), which only enter into the evaluation of h = D(A(·), y).
An illustration is shown in Figure 3.3. Parameters are then learned from supervised data.
Variational Networks. First proposed by Hammernik et al. [2018], Kobler et al. [2017],
they represent one of the earliest learned iterative schemes and are notable for their ex-
plicit connection to variational regularization. These networks are inspired by variational
regularization models where the regularization functional incorporates parameterizations
that extend beyond traditional handcrafted regularisers like TV. More specifically, their
development draws from the Fields of Experts (FoE) model and, by extension, conditional
shrinkage fields [Schmidt and Roth, 2014], which allow for adaptive parameters across it-
erations. Variational Networks are defined by unrolling an iterative scheme designed to
minimize a functional comprising both a data discrepancy component and a regularizer.
Such networks can be interpreted as performing block incremental gradient descent on a
learned variational energy or as learned non-linear diffusion.
Learned Gradient, Proximal, Primal Dual. These are all further extensions of the
idea introducing increasingly more freedom to parameters [Adler and Öktem, 2017], with
Adler and Öktem [2018] generalizing the steps even further to include steps in both primal
(image) and dual (measurement) spaces, inspired by the primal-dual hybrid gradient from
Section 2.4.4. This can be summarized with a more general parameterization of the
gradient steps, potentially including some R regularization information, e.g. using TV
[Kiss et al., 2025]. An illustration is shown in Figure 3.4.
Empirical evidence suggests such approaches result in models that are easier to train
and demonstrate excellent reconstruction quality for mildly ill-posed inverse problems and
offer considerable versatility, for instance in task-adapted reconstruction [Adler et al.,
2022, Lunz et al., 2018]. While intuitively appealing, this approach can lose connection
to the original variational problem, leading to a lack of theoretical guarantees. Despite
this, this generic recipe has been used to propose numerous methods, leveraging various
iterative algorithms such as gradient descent, proximal descent, and primal-dual methods.
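Schematically, a learned gradient-type scheme unrolls N such steps, as in the sketch below; the stand-in "networks" (here just scaled gradient steps) would in practice be small CNNs with trainable parameters, and nothing in this sketch is tied to a specific published architecture.

import numpy as np

def unrolled_gradient_scheme(y, A, networks, u0):
    # u^{k+1} = Lambda_{theta_k}(u^k, A^*(A u^k - y)),  k = 0, ..., N-1
    u = u0
    for net in networks:                      # one (independently parameterized) block per layer
        h = A.T @ (A @ u - y)                 # gradient of the data fidelity 0.5*||Au - y||^2
        u = u + net(u, h)                     # learned residual update
    return u

# toy stand-in "networks": scaled gradient steps with per-layer step sizes
networks = [lambda u, h, s=s: -s * h for s in np.linspace(0.9, 0.2, 10)]
A = np.random.randn(30, 50); A /= np.linalg.norm(A, 2)
u_true = np.random.randn(50)
u_rec = unrolled_gradient_scheme(A @ u_true, A, networks, np.zeros(50))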
[Fig. 3.4: Illustration of a learned iterative scheme alternating between image and measurement space via A and A∗.]
While this may seem gloomy, many current research efforts are focused on addressing
these limitations, including the provably convergent deep equilibrium formulations discussed next.
Fig. 3.5: Illustration of the artifacts that appear when learned operators are applied repeatedly
without convergence guarantees. Example borrowed from [Gilton et al., 2021a].
Deep equilibrium models define the reconstruction implicitly as a fixed point u = ΓΘ(u; y).
This formulation naturally leads to iterative schemes that provably converge (under certain
assumptions) to a fixed point as the number of iterations approaches infinity. For instance,
consider a deep equilibrium gradient descent scheme where
ΓΘ(u; y) = u − η ( A∗(Au − y) + RΘ(u) ).
Here, A is the forward operator, A∗ is its adjoint, η is a step size, and RΘ is a trainable
neural network representing a gradient of a learned regularizer. Note that these models
are more general than gradient descent, as convergence can be ensured even when RΘ
is not a gradient of a function. To ensure convergence, ΓΘ can be constrained to be
a contraction mapping. This, once again, is not simply an academic exercise and has
a significant effect in practice, guaranteeing convergence to a fixed point as showcased
in Figure 3.5 on top row, compared to iterate divergence for a non-constrained model,
showcased on the bottom row. Sufficient conditions for convergence in the context of deep
equilibrium gradient descent can be given in terms of the contractivity of ΓΘ.
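In code, the corresponding fixed-point iteration looks as follows (a sketch; R_theta stands in for the trained network and is replaced here by the gradient of a simple quadratic smoothness prior so that the example is self-contained):

import numpy as np

def deep_equilibrium_gd(y, A, R_theta, eta=1.0, tol=1e-8, max_iter=2000):
    # iterate u <- u - eta * (A^T (A u - y) + R_theta(u)) up to a fixed point
    u = np.zeros(A.shape[1])
    for _ in range(max_iter):
        u_new = u - eta * (A.T @ (A @ u - y) + R_theta(u))
        if np.linalg.norm(u_new - u) < tol:
            break
        u = u_new
    return u

lam = 0.1
D = np.diff(np.eye(50), axis=0)
R_theta = lambda u: lam * (D.T @ (D @ u))          # stand-in "learned regularizer gradient"

A = np.random.randn(20, 50); A /= np.linalg.norm(A, 2)
u_star = deep_equilibrium_gd(np.random.randn(20), A, R_theta)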
Remark 3.1.2. An interesting additional avenue with convergent learned iterative schemes
is the ability to accelerate their convergence by increasing the memory of the iterations,
i.e. by extending the dependency of each iteration from just the previous iterate to a
couple of previous iterates, for instance via Anderson acceleration as in Gilton et al.
[2021b]. While convergence to a fixed point is a desirable property, further investigation is
needed to characterize the properties of this fixed point and its relationship to the solution
of the underlying inverse problem, e.g. in analogy with Obmann and Haltmeier [2023].
where y ∈ Y is the measured data, A is a linear and bounded forward operator, X and Y
are Banach spaces, α > 0 is a regularization parameter, and R(u) is the regularizer.
Instead of learning the entire reconstruction mapping from y to u as in Section 3.1, this
approach concentrates on learning a data-adaptive regularizer R(u). The goal is for
R(u) to effectively capture prior knowledge about the desired solution, promoting re-
constructions with desirable characteristics (e.g., “good-looking” images) while penalizing
undesirable features. Learning the regulariser offers several advantages:
In essence, this approach seeks to learn an “image prior” that is both data-driven and
amenable to mathematical analysis.
How to learn? Given some parametric model Rθ for the regularizer, its parameters θ
still need to be learned from data. Over the past decades, a variety of paradigms have
been introduced for learning the regularizer given an image distribution. Analogous to
Section 3.1, the direct approach is to train parameters such that the optimal solution in
Equation (3.4) minimizes the ℓ2 loss over training data. This results in the so called bilevel
learning discussed in Section 3.0.1. While parameter hypergradients can be computed via
implicit differentiation or unrolling, they can quickly become computationally infeasible,
necessitating approximations. Alternatively, interpreting the problem as maximum-a-
posteriori estimation (Section 1.2.3), the prior can be learned directly from data. See
Habring and Holler [2024] or Dimakis et al. [2022] for an overview.
In what follows, we consider one specific approach for learning Rθ , which relies neither
on the bilevel structure, nor on learning the whole prior. Instead, we view the regulariser
as a “classifier” that distinguishes between desirable and undesirable solutions. This de-
couples the problem of learning the regulariser from the underlying variational problem.
Fig. 3.6: Comparison of CT reconstructions: (a) a good quality reconstruction, (b) the corresponding
sinogram data, and (c) a poor quality reconstruction.
Inspired by the Wasserstein GAN framework [Arjovsky et al., 2017], the 1-Wasserstein
distance between the clean and noisy distributions is employed as a weakly supervised
loss for the regulariser. The 1-Wasserstein distance between Pn and PU is given by:
Wass₁(P_n, P_U) = sup_{R ∈ 1-Lip} { E_{U∼P_n}[R(U)] − E_{U∼P_U}[R(U)] },
where the supremum is taken over all 1-Lipschitz functions R. By finding an appropriate
R, this formulation allows the regulariser to be trained to capture image statistics effectively
without requiring paired examples of “good” and “bad” images. In practice, the regulariser
is parameterized as:
RΘ(u) = ΨΘ(u) + ρ₀∥u∥²₂,
where ΨΘ(u) is a (potentially convex [Mukherjee et al., 2024] or weakly convex [Shumaylov
et al., 2024]) convolutional neural network (CNN) and ρ₀∥u∥²₂ is an additional ℓ2 regularization term
that facilitates analysis and ensures coercivity. Ensuring exact 1-Lipschitzness turns out
to be rather complicated, and the network is trained by minimizing the following loss
function [Lunz et al., 2018]:
min_Θ E_{U∼P_U}[ΨΘ(U)] − E_{U∼P_n}[ΨΘ(U)] + µ · E[ ( ∥∇_u ΨΘ(U)∥ − 1 )²₊ ]
The first two terms encourage the regularizer to distinguish between “good” and “bad”
images, while the third term softly enforces a Lipschitz constraint on the regularizer.
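In a deep learning framework this training loss can be written roughly as follows (a PyTorch-style sketch under the assumption that psi is the network ΨΘ and good/noisy are batches drawn from P_U and P_n; following common Wasserstein-GAN practice the gradient penalty is evaluated on random interpolates between the two batches):

import torch

def adversarial_reg_loss(psi, good, noisy, mu=10.0):
    # E[psi(good)] - E[psi(noisy)] + mu * E[(||grad_u psi|| - 1)_+^2]
    t = torch.rand(good.shape[0], *([1] * (good.dim() - 1)), device=good.device)
    x = (t * good + (1 - t) * noisy).requires_grad_(True)
    grad = torch.autograd.grad(psi(x).sum(), x, create_graph=True)[0]
    penalty = torch.clamp(grad.flatten(1).norm(dim=1) - 1.0, min=0.0) ** 2
    return psi(good).mean() - psi(noisy).mean() + mu * penalty.mean()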
A theoretical justification for using such a loss can be seen by analyzing the gradient
descent flow of the regularizer with respect to the Wasserstein distance. Under certain
assumptions, it can be shown that this flow leads to the fastest decrease in Wasserstein
distance for any regularization functional with normalized gradients.
Assumption 3.2.3 (Low Noise Assumption (LNA)). The pushforward of the noisy dis-
tribution under the projection equals the clean distribution, (PM )# (Pn ) = Pr . This
corresponds to an assumption that the noise level is low in comparison to manifold cur-
vature.
Theorem 3.2.4. Assume DMA and LNA. Then the distance function to the data manifold,
u ↦ min_{v∈M} ∥u − v∥₂,
is a maximizer of the Wasserstein loss in Equation (3.5).
Remark 3.2.5. The functional in Equation (3.5) does not necessarily have a unique maxi-
mizer. However, in certain settings it can be shown to be unique almost everywhere, see
Staudt et al. [2022], Milne et al. [2022].
It is worth mentioning that while theoretically appealing, recent work [Stanczuk et al.,
2021] has shown that the practical success of Wasserstein GANs may not be solely at-
tributed to their ability to approximate the Wasserstein distance.
3.2.2.1 Extensions
The development of learned regularizers, particularly adversarial ones, is an active research
area with several important possible extensions:
Plug-and-Play (PnP) methods build on operator splitting techniques for optimizing the variational objective in Equation (2.1),
decoupling the regularization step from the data fidelity term, and replacing regularization
steps with sophisticated denoisers. PnP methods are rooted in operator splitting tech-
niques, such as the Alternating Direction Method of Multipliers (ADMM) [Setzer, 2011].
Consider a reformulation of Equation (2.1), introducing an auxiliary variable v:
min_{u,v} { D(Au, y) + αR(v) }   subject to   u = v.
The crucial insight for PnP methods is that this decouples the measurement fidelity step
from the reconstruction regularization, done by denoising v. The (regularizing) v-update
step in Equation (3.7) can be recognized as the proximal operator of the regularizer R
scaled by α/λ. This allows for the replacement of the proximal operator with any effective
denoising algorithm D, such as BM3D [Dabov et al., 2009], non-local means [Buades et al.,
2005], or deep learning-based denoisers. This flexibility gives rise to the name “Plug-and-
Play”.
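A generic PnP-ADMM sketch (our own notation and toy operators; denoise stands for any off-the-shelf denoiser such as BM3D or a trained network, and the u-update assumes a least-squares data fit so that it reduces to a linear solve):

import numpy as np

def pnp_admm(y, A, denoise, lam=1.0, iters=50):
    # Plug-and-Play ADMM for 0.5*||Au - y||^2 + alpha*R(v), s.t. u = v,
    # with the proximal map of R replaced by the denoiser
    n = A.shape[1]
    u = np.zeros(n); v = np.zeros(n); w = np.zeros(n)        # w: scaled dual variable
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(iters):
        u = np.linalg.solve(AtA + lam * np.eye(n), Aty + lam * (v - w))   # data-fidelity step
        v = denoise(u + w)                                    # "regularization" step: plug in a denoiser
        w = w + u - v                                         # dual update
    return u

denoise = lambda z: np.convolve(z, np.ones(5) / 5, mode="same")   # toy stand-in denoiser
A = np.random.randn(40, 100); A /= np.linalg.norm(A, 2)
u = pnp_admm(np.random.randn(40), A, denoise)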
A significant challenge in this case, particularly when utilizing learned denoisers, is the
adjustment of regularization strength. These denoisers are often trained for a specific noise
level σ, yet the effective noise within PnP iterations can vary, and the overall regularization
must be adapted to the noise present in the measurements y δ . Empirically, this has been
approached by tuning regularization strength by denoiser scaling [Xu et al., 2020].
For a linear denoiser Dσ to be the proximal operator of some underlying functional J, it must
be symmetric and positive semi-definite [Moreau, 1965, Gribonval and Nikolova, 2020].
For the resulting functional to be convex as well, Dσ must be non-expansive, i.e. we will
assume that its eigenvalues live in the interval [0, 1] for the sake of contractivity. Lastly,
we will assume that the operator norm is bounded from below, such that the inverse
is well defined and is a bounded operator. If these conditions hold, the functional J is
uniquely determined by Dσ (up to an additive constant). The objective then becomes
controlling the regularization strength by effectively scaling this underlying functional
J. The difficulty is that one typically only has access to the denoiser Dσ , not J itself.
However, when the denoiser is linear, it turns out to be possible to appropriately modify
the denoiser based on the following observations. By definition of a proximal operator
Dσ = proxJ = (id + ∂J)−1 .
On the other hand, since Dσ is linear, Dσ⁻¹ is linear, and by the above ∂J =: W is also
linear. As a result, J(x) = ½⟨x, W x⟩ up to an additive constant. Inverting, we have
W = Dσ−1 − Id. Therefore,
J(x) = ½ ⟨ x, (Dσ⁻¹ − Id) x ⟩.
We can control the regularization strength, by scaling the regularization functional J:
τ J(x) = ½ ⟨ x, ( [ τ Dσ⁻¹ − (τ − 1) Id ] − Id ) x ⟩,
resulting in
prox_{τJ} = ( τ Dσ⁻¹ − (τ − 1) Id )⁻¹ = gτ(Dσ).
Here gτ : R → R, given by gτ (λ) = λ/(τ − λ(τ − 1)) is applied to Dσ using the functional
calculus. This implies that applying this filter (as illustrated conceptually in Figure 3.7)
effectively transforms the original denoiser Dσ = proxJ into gτ (Dσ ) = proxτ J . Further-
more Figure 3.8a illustrates the effect on eigenvalues of the resulting linear denoiser as a
function of original eigenvalues.
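Numerically, for a symmetric linear denoiser given as a matrix, the filtered denoiser gτ(Dσ) can be formed via its eigendecomposition (a small sketch of our own, with a synthetic denoiser):

import numpy as np

def spectrally_filtered_denoiser(D, tau):
    # apply g_tau(lambda) = lambda / (tau - lambda*(tau - 1)) to a symmetric denoiser matrix D
    lam, Q = np.linalg.eigh(D)                 # eigenvalues assumed to lie in [0, 1]
    g = lam / (tau - lam * (tau - 1.0))
    return (Q * g) @ Q.T                       # Q diag(g_tau(lambda)) Q^T

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # random orthonormal eigenvectors
lam = np.linspace(1.0, 0.01, 64)                     # denoiser eigenvalues in (0, 1]
D_sigma = (Q * lam) @ Q.T                            # synthetic symmetric linear denoiser
D_tau = spectrally_filtered_denoiser(D_sigma, tau=2.0)   # tau > 1: stronger regularization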
J ↦ prox_J = Dσ,    τJ ↦ prox_{τJ} = gτ(Dσ)
Fig. 3.7: Diagram illustrating the concept of spectral filtering. From [Hauptmann et al., 2024].
This approach differs from traditional spectral regularization methods (e.g., Tikhonov
regularization, Landweber iteration) [Engl et al., 1996], where filtering is typically applied
to the singular values of the forward operator A. In contrast, here the denoiser (and thus
the implicit prior) is modified.
It turns out to be possible to show convergent regularization in general for other spectral
filters satisfying technical conditions that (1 − gτ (λ)) / (τ gτ (λ)) is bounded above and
below by positive values and converges as τ → 0. Under such conditions one achieves
convergent PnP regularization by Theorem 5 in Hauptmann et al. [2024], and an example
of this is illustrated in Figures 3.8b and 3.8c.
[Fig. 3.8 panels: (a) “The effect of spectral filtering on denoiser eigenvalues”: filtered eigenvalue versus input eigenvalue λ, for τ = 0.1, 0.5, 1.0, 2.0. (b) “Convergent regularisation by spectral filtering”: reconstruction error ∥x̂(yδ, τδ) − x†∥ versus noise level δ. (c) Illustration of convergent regularization via a selection of snapshots x̂(yδ, τδ) from the plot in (b), for increasing δ, alongside the ground truth x†.]
Fig. 3.8: Further concepts in spectral filtering and application to CT reconstruction. (a) Eigenvalue
spectral filtering. (b) Spectral filtering to control regularisation strength for convergent regularisation. (c)
Resulting images from the CT reconstruction. The linear denoiser filter is gτ (λ) = λ/(τ − λ(τ − 1)). All
illustrations from [Hauptmann et al., 2024].
3.4 Outlooks
The preceding sections have largely focused on methodologies that adapt existing math-
ematical frameworks to incorporate deep learning techniques. Examples include learned
iterative schemes, where neural networks replace components of classical algorithms, and
Plug-and-Play (PnP) methods, where denoisers replace proximal steps within splitting algorithms. While these ap-
proaches have demonstrated considerable empirical success, they often represent incremen-
tal adaptations rather than fundamental redesigns. Consequently, they can sometimes lack
a deep theoretical grounding or may appear as ad-hoc integrations rather than solutions
derived from first principles tailored to the unique characteristics of deep learning.
A central challenge and a key direction for future research is to move beyond merely
“plugging in” deep learning components into pre-existing structures. To fully harness the
potential of high-capacity, overparameterized models, the development of new frameworks
that are fundamentally designed with deep learning in mind is essential. Such a paradigm
shift would likely involve concerted efforts across several interconnected areas.
A persistent theme in this endeavor is navigating the trade-off between capacity and
guarantees. Constraints are necessary for interpretability and reliability, but excessive
constraints limit the expressive power of deep learning. Finding the sweet spot is key.
This might involve:
• Developing more flexible constraints: Can we design constraints that are less
restrictive but still ensure desirable properties like stability and convergence?
• Learning constraints from data: Can we use data to learn optimal constraints
that balance capacity and guarantees?
Ultimately, the future of data-driven inverse problems lies in integrating deep learning
more deeply with the underlying mathematical principles. This will require both theoret-
ical and practical innovations, but the potential rewards are immense.
Chapter 4
Perspectives
Fig. 4.1: Biomedical imaging pathway: The path from imaging data acquisition to prediction, diagnosis,
and treatment planning features several processing and analysis steps which are usually performed
sequentially. CT data and segmentation are courtesy of Evis Sala and Ramona Woitek.
4.1 On Task Adaptation
A key observation driving current research is that the metrics used to evaluate the quality
of a reconstruction should be intrinsically linked to the ultimate objective of the entire
workflow. For instance, in clinical medical imaging, the primary goal is rarely the re-
construction of a visually appealing image, but rather to enable accurate diagnosis, guide
treatment planning, or monitor therapeutic response. This motivates the concept of task-
adapted reconstruction [Adler et al., 2022, Wu et al., 2018], wherein the reconstruction
process is explicitly tailored to optimize performance on a specific downstream task, such
as segmentation, classification, or quantitative parameter estimation.
Training neural networks often involves highly non-convex optimization, and joint train-
ing for multiple tasks does not fundamentally alter this characteristic. In fact, combining
tasks within a unified learning pipeline can create opportunities for synergy, where the
optimization process for one task can provide beneficial regularization or feature represen-
tations for another. The inherent non-convexity of end-to-end learned systems means that
extending them to incorporate downstream tasks does not necessarily introduce greater
optimization challenges than those already present in learning the reconstruction alone.
Consider, for example, the task of detecting ovarian cancer from medical images (ra-
diomics). This process typically involves reconstructing an image from sensor measure-
ments, followed by segmentation of potential tumorous regions, and then extraction of
quantitative imaging features for statistical analysis and classification. By jointly opti-
mizing the reconstruction and segmentation processes within a single deep learning model,
it is possible to guide the reconstruction to produce images that are not only faithful to
the measured data but are also more amenable to accurate segmentation by the learned
segmentation module.
Example 4.1.1 (Joint Reconstruction and Segmentation, Adler et al. [2022]). For example, in
tomographic reconstruction, we can jointly optimize the reconstruction and segmentation
processes using a combined loss function. Consider a CNN-based MRI reconstruction (X)
and CNN-based MRI segmentation (D).
(a) Minimal loss values for various values of C, showing the benefit of training jointly for reconstruction
and segmentation.
Fig. 4.2: Task-adapted reconstruction, with CNN-based MRI reconstruction (task X) and CNN-based
MRI segmentation (task D). Both are trained jointly with combined loss CℓX + (1 − C)ℓD for varying
C ∈ [0, 1]. All figures from [Adler et al., 2022].
(θ∗, ϑ∗) ∈ arg min(θ,ϑ)∈Θ×Ξ (1/m) Σi=1,...,m ℓjoint((xi, τ(zi)), (A†θ(yi), Tϑ ◦ A†θ(yi))),
where
ℓjoint((x, d), (x′, d′)) := (1 − C) ℓX(x, x′) + C ℓD(d, d′) for fixed C ∈ [0, 1].
This loss function balances the reconstruction error (ℓX ) and the segmentation error (ℓD ),
allowing for a trade-off between the two tasks. Figure 4.2a illustrates that whenever
segmentation performance is the ultimate task to be solved, training primarily for recon-
struction results in poor segmentations, while training primarily for segmentation results
in poor reconstructions. Most significantly, training jointly for both tasks with equal
weighting actually yields better segmentations than training for segmentation alone.
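As a minimal sketch of how such a joint objective can be implemented (our own illustration in PyTorch, not the code of Adler et al. [2022]; recon_net and seg_net are hypothetical networks standing in for A†θ and Tϑ), the two losses are combined with a fixed weight C and both sets of parameters are optimized together:

import itertools
import torch
import torch.nn.functional as F

def joint_loss(recon_net, seg_net, y, x_true, d_true, C=0.5):
    # Joint task-adapted loss: (1 - C) * l_X + C * l_D, for fixed C in [0, 1].
    x_hat = recon_net(y)                    # reconstruction A_theta^dagger(y)
    d_hat = seg_net(x_hat)                  # task output T_vartheta(A_theta^dagger(y))
    l_X = F.mse_loss(x_hat, x_true)         # reconstruction error l_X
    l_D = F.cross_entropy(d_hat, d_true)    # segmentation error l_D
    return (1.0 - C) * l_X + C * l_D

# Joint training step over (theta, vartheta): a single optimizer covers both networks.
# opt = torch.optim.Adam(itertools.chain(recon_net.parameters(), seg_net.parameters()), lr=1e-4)
# loss = joint_loss(recon_net, seg_net, y, x_true, d_true, C=0.5)
# opt.zero_grad(); loss.backward(); opt.step()

With the weighting of the displayed objective, C = 0 corresponds to training for reconstruction only and C = 1 to training for segmentation only; sweeping C reproduces the kind of trade-off study shown in Figure 4.2a.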
This co-adaptation can lead to improved accuracy and robustness in both the reconstruc-
tion and the segmentation, ultimately enhancing the reliability of the radiomic analysis
and the diagnostic outcome. The degrees of freedom inherent in solving an ill-posed in-
verse problem can be strategically utilized to favor solutions that, while consistent with
the data, also possess features conducive to the success of the subsequent task.
This task-adapted approach not only enhances the efficiency and accuracy of medical im-
age analysis but also has broader implications for healthcare accessibility. By automating
or streamlining certain tasks, such as segmentation, it reduces the reliance on special-
ized expertise, potentially making advanced imaging techniques more widely available in
settings with limited resources.
4.2 The Data Driven - Knowledge Informed Paradigm
Towards more useful theoretical tools: The paradigm shift driven by deep learning
also necessitates a corresponding evolution in our theoretical approaches. Much of the
traditional analysis in inverse problems has focused on model properties, optimization
landscapes, and convergence proofs, often treating the model in relative isolation from
the data that was used to create it. However, deep learning models are fundamentally
data-centric; their behavior, efficacy, and potential failure modes are inextricably linked
to the characteristics of the training dataset. Therefore, future analytical efforts must
pivot to more explicitly account for this data dependency. It is no longer sufficient to
analyze convex regularizers in abstraction. Rigorous analysis must now encompass the
training dataset itself: its size, diversity, representativeness, potential inherent biases,
and the precise manner in which these factors influence the learned model’s generalization
capabilities, and its robustness to distributional shifts.
As deep learning systems become more complex and their decision-making processes more
opaque, explainability emerges as a critical concern. If a model produces a reconstruc-
tion specifically optimized for a downstream task, understanding why the reconstruction
appears as it does, and how specific features (or apparent artifacts) contribute to the
downstream decision, is crucial for validation, debugging, and building trust, especially
in safety-critical applications like medicine. Future research must focus on developing
methods that can provide insights into these complex, end-to-end trained systems.
Bibliography
J. Adler and O. Öktem. Solving ill-posed inverse problems using iterative deep neural
networks. Inverse Problems, 33(12):124007, 2017.
J. Adler, S. Lunz, O. Verdier, C.-B. Schönlieb, and O. Öktem. Task adapted reconstruction
for inverse problems. Inverse Problems, 38(7):075006, 2022.
W. K. Allard and F. J. Almgren. Geometric measure theory and the calculus of variations,
volume 44. American Mathematical Soc., 1986.
L. Alvarez, P.-L. Lions, and J.-M. Morel. Image selective smoothing and edge detection
by nonlinear diffusion. ii. SIAM Journal on numerical analysis, 29(3):845–866, 1992.
L. Ambrosio. Metric space valued functions of bounded variation. Annali della Scuola
Normale Superiore di Pisa-Classe di Scienze, 17(3):439–478, 1990.
L. Ambrosio and H. M. Soner. Level set approach to mean curvature flow in arbitrary
codimension. Journal of differential geometry, 43(4):693–737, 1996.
L. Ambrosio, N. Fusco, and D. Pallara. Functions of bounded variation and free disconti-
nuity problems. Oxford university press, 2000.
S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb. Solving inverse problems using
data-driven models. Acta Numerica, 28:1–174, 2019.
M. Bachmayr and M. Burger. Iterative total variation schemes for nonlinear inverse
problems. Inverse Problems, 25(10):105004, 2009.
G. Bellettini, V. Caselles, and M. Novaga. The total variation flow in Rᴺ. Journal of
Differential Equations, 184(2):475–525, 2002a.
G. Bellettini, V. Caselles, and M. Novaga. The total variation flow in Rᴺ. Journal of
Differential Equations, 184(2):475–525, 2002b.
G. Bellettini, M. Novaga, and E. Paolini. Global solutions to the gradient flow equation
of a nonconvex functional. SIAM journal on mathematical analysis, 37(5):1657–1687,
2006.
M. Benning and M. Burger. Modern regularization methods for inverse problems. Acta
numerica, 27:1–111, 2018.
A. L. Bertozzi, S. Esedoglu, and A. Gillette. Inpainting of binary images using the Cahn–
Hilliard equation. IEEE Transactions on Image Processing, 16(1):285–291, 2006.
A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In
2005 IEEE computer society conference on computer vision and pattern recognition
(CVPR’05), volume 2, pages 60–65. IEEE, 2005.
A. Buades, B. Coll, and J.-M. Morel. Non-local means denoising. Image Processing On
Line, 1:208–212, 2011.
M. Burger, G. Gilboa, S. Osher, and J. Xu. Nonlinear inverse scale space methods. 2006.
M. Burger, E. Resmerita, and L. He. Error estimation for bregman iterations and inverse
scale space methods in image restoration. Computing, 81:109–135, 2007.
X. Cai, R. Chan, and T. Zeng. A two-stage image segmentation method using a con-
vex variant of the Mumford–Shah model and thresholding. SIAM Journal on Imaging
Sciences, 6(1):368–390, 2013.
X. Cai, R. Chan, C.-B. Schonlieb, G. Steidl, and T. Zeng. Linkage between piecewise
constant Mumford–Shah model and Rudin–Osher–Fatemi model and its virtue in image
segmentation. SIAM Journal on Scientific Computing, 41(6):B1310–B1340, 2019.
E. J. Candès. Lecture 10. Course Notes for MATH 262/CME 372: Applied Fourier
Analysis and Elements of Modern Signal Processing, 2021a. URL https://candes.
su.domains/teaching/math262/Lectures/Lecture10.pdf.
E. J. Candès. Lecture 11. Course Notes for MATH 262/CME 372: Applied Fourier
Analysis and Elements of Modern Signal Processing, 2021b. URL https://candes.
su.domains/teaching/math262/Lectures/Lecture11.pdf.
E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and in-
accurate measurements. Communications on Pure and Applied Mathematics: A Journal
Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
M. Carriero, A. Leaci, and F. Tomarelli. Plastic free discontinuities and special bounded
hessian. Comptes rendus de l’Académie des sciences. Série 1, Mathématique, 314(8):
595–600, 1992.
V. Caselles and J. Morel. Introduction to the special issue on partial differential equations
and geometry-driven diffusion in image processing and analysis. IEEE transactions on
image processing, 7(3):269–273, 1998.
V. Caselles, F. Catté, T. Coll, and F. Dibos. A geometric model for active contours in
image processing. Numerische mathematik, 66:1–31, 1993.
V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable,
randomized, and parallel algorithms for big data analytics. IEEE Signal Processing
Magazine, 31(5):32–43, 2014.
A. Chambolle and C. Dossal. On the convergence of the iterates of the “fast iterative
shrinkage/thresholding algorithm”. Journal of Optimization theory and Applications,
166:968–982, 2015.
A. Chambolle and P.-L. Lions. Image recovery via total variation minimization and related
problems. Numerische Mathematik, 76:167–188, 1997.
T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on image
processing, 10(2):266–277, 2001.
Y.-G. Chen, Y. Giga, and S. Goto. Uniqueness and existence of viscosity solutions of
generalized mean curvature flow equations. In Fundamental Contributions to the Con-
tinuum Theory of Evolving Phase Interfaces in Solids: A Collection of Reprints of 14
Seminal Papers, pages 375–412. Springer, 1999.
L. Chizat and F. Bach. On the global convergence of gradient descent for over-
parameterized models using optimal transport. Advances in neural information pro-
cessing systems, 31, 2018.
K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. BM3D image denoising with shape-
adaptive principal component analysis. In SPARS’09-Signal Processing with Adaptive
Sparse Structured Representations, 2009.
E. De Giorgi and L. Ambrosio. Un nuovo tipo di funzionale del calcolo delle variazioni.
Atti della Accademia Nazionale dei Lincei. Classe di Scienze Fisiche, Matematiche e
Naturali. Rendiconti Lincei. Matematica e Applicazioni, 82(2):199–210, 1988.
M. V. de Hoop, M. Lassas, and C. A. Wong. Deep learning architectures for nonlinear op-
erator functions and nonlinear inverse problems. Mathematical Statistics and Learning,
4(1):1–86, 2022.
J. C. De los Reyes, C.-B. Schönlieb, and T. Valkonen. The structure of optimal parameters
for image restoration problems. Journal of Mathematical Analysis and Applications, 434
(1):464–500, 2016.
M. J. Ehrhardt, P. Markiewicz, and C.-B. Schönlieb. Faster PET reconstruction with non-
smooth priors by randomization and preconditioning. Physics in Medicine & Biology,
64(22):225019, 2019.
I. Ekeland and R. Temam. Convex analysis and variational problems. SIAM, 1999.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over
learned dictionaries. IEEE Transactions on Image processing, 15(12):3736–3745, 2006.
E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-
dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging
Sciences, 3(4):1015–1046, 2010.
C. Etmann, R. Ke, and C.-B. Schönlieb. iUNets: learnable invertible up- and downsampling
for large-scale inverse problems. In 2020 IEEE 30th International Workshop on Machine
Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2020.
M. J. Fadili, J.-L. Starck, J. Bobin, and Y. Moudden. Image decomposition and separation
using sparse representations: An overview. Proceedings of the IEEE, 98(6):983–994,
2009.
M. Fornasier and C.-B. Schönlieb. Subspace correction methods for total variation and
ℓ1-minimization. SIAM Journal on Numerical Analysis, 47(5):3397–3428, 2009.
M. Genzel, J. Macdonald, and M. März. Solving inverse problems with deep neural
networks–robustness included? IEEE transactions on pattern analysis and machine
intelligence, 45(1):1119–1134, 2022.
G. Gilboa and S. Osher. Nonlocal operators with applications to image processing. Mul-
tiscale Modeling & Simulation, 7(3):1005–1028, 2009.
D. Gilton, G. Ongie, and R. Willett. Neumann networks for linear inverse problems in
imaging. IEEE Transactions on Computational Imaging, 6:328–343, 2019.
D. Gilton, G. Ongie, and R. Willett. Deep equilibrium architectures for inverse problems
in imaging. IEEE Transactions on Computational Imaging, 7:1123–1133, 2021a.
D. Gilton, G. Ongie, and R. Willett. Model adaptation for inverse problems in imaging.
IEEE Transactions on Computational Imaging, 7:661–674, 2021b.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. URL
http://www.deeplearningbook.org.
E. Haber, L. Horesh, and L. Tenorio. Numerical methods for the design of large-scale
nonlinear discrete ill-posed inverse problems. Inverse Problems, 26(2):025002, 2009.
J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique.
Princeton university bulletin, pages 49–52, 1902.
T. Hohage and F. Werner. Convergence rates for inverse problems with impulsive noise.
SIAM Journal on Numerical Analysis, 52(3):1203–1221, 2014.
L. Horesh, E. Haber, and L. Tenorio. Optimal experimental design for the large-scale
nonlinear ill-posed problem of impedance imaging. Large-Scale Inverse Problems and
Quantification of Uncertainty, pages 273–290, 2010.
S. Hurault, A. Leclaire, and N. Papadakis. Gradient step denoiser for convergent plug-
and-play. arXiv preprint arXiv:2110.03220, 2021.
J. Kaipio and E. Somersalo. Statistical and computational inverse problems, volume 160.
Springer Science & Business Media, 2006.
R. Ke and C.-B. Schönlieb. Unsupervised image restoration using partially linear denoisers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5796–5812,
2021.
N. Khelifa, F. Sherry, and C.-B. Schönlieb. Enhanced denoising and convergent regulari-
sation using tweedie scaling. arXiv preprint arXiv:2503.05956, 2025.
P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators.
SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding.
In Proceedings of the 26th annual international conference on machine learning, pages
689–696, 2009.
S. Masnou and J.-M. Morel. Level lines based disocclusion. In Proceedings 1998 Interna-
tional Conference on Image Processing. ICIP98 (Cat. No. 98CB36269), pages 259–263.
IEEE, 1998.
T. Milne, É. Bilocq, and A. Nachman. A new method for determining Wasserstein 1
optimal transport maps from Kantorovich potentials, with deep learning applications.
arXiv preprint arXiv:2211.00820, 2022.
T. Moreau and J. Bruna. Understanding trainable sparse coding via matrix factorization.
arXiv preprint arXiv:1609.00285, 2016.
S. Osher, A. Solé, and L. Vese. Image decomposition and restoration using total variation
minimization and the H⁻¹ norm. Multiscale Modeling & Simulation, 1(3):349–370, 2003.
M. Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing,
26:745–760, 2016.
P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE
Transactions on pattern analysis and machine intelligence, 12(7):629–639, 1990.
G. Peyré. Sparse modeling of textures. Journal of mathematical imaging and vision, 34:
17–31, 2009.
D. L. Phillips. A technique for the numerical solution of certain integral equations of the
first kind. Journal of the ACM (JACM), 9(1):84–97, 1962.
C. Poon. On the role of total variation in compressed sensing. SIAM Journal on Imaging
Sciences, 8(1):682–720, 2015.
C. Pöschl. Tikhonov regularization with general residual term. PhD thesis, Leopold
Franzens Universität Innsbruck, 2008.
P. Putzky and M. Welling. Recurrent inference machines for solving inverse problems.
arXiv preprint arXiv:1706.04008, 2017.
Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by
denoising (RED). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal
algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 2774–2781,
2014.
C.-B. Schönlieb, A. Bertozzi, M. Burger, and L. He. Image inpainting using a fourth-order
total variation flow. In SAMPTA’09, pages Special–session, 2009.
J. Schwab, S. Antholzer, and M. Haltmeier. Deep null space learning for inverse problems:
convergence analysis and rates. Inverse Problems, 35(2):025008, 2019.
S. Setzer. Operator splittings, bregman methods and frame shrinkage in image processing.
International Journal of Computer Vision, 92:265–280, 2011.
Z. Shumaylov, J. Budd, S. Mukherjee, and C.-B. Schönlieb. Weakly convex regularisers for
inverse problems: Convergence of critical points and primal-dual optimisation. In Forty-
first International Conference on Machine Learning, 2024. URL https://openreview.
net/forum?id=E8FpcUyPuS.
Z. Shumaylov, P. Zaika, J. Rowbottom, F. Sherry, M. Weber, and C.-B. Schönlieb. Lie al-
gebra canonicalization: Equivariant neural operators under arbitrary lie groups. In
The Thirteenth International Conference on Learning Representations, 2025. URL
https://openreview.net/forum?id=7PLpiVdnUC.
J. Starck. Image processing and data analysis, the multi-scale approach. Cambridge Uni-
versity Press, 1998.
J. Sun, H. Li, Z. Xu, et al. Deep ADMM-Net for compressive sensing MRI. Advances in
neural information processing systems, 29, 2016.
J. Tang, S. Mukherjee, and C.-B. Schönlieb. Stochastic primal-dual deep unrolling. arXiv
preprint arXiv:2110.10093, 2021.
J. Tang, S. Mukherjee, and C.-B. Schönlieb. Iterative operator sketching framework for
large-scale imaging inverse problems. In ICASSP 2025-2025 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
M. Terris, A. Repetti, J.-C. Pesquet, and Y. Wiaux. Building firmly nonexpansive con-
volutional neural networks. In ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 8658–8662. IEEE, 2020.
M. Unser and T. Blu. Fractional splines and wavelets. SIAM review, 42(1):43–67, 2000.
T. Valkonen. A primal–dual hybrid gradient method for nonlinear operators with appli-
cations to MRI. Inverse Problems, 30(5):055012, 2014.
C. Vonesch, T. Blu, and M. Unser. Generalized Daubechies wavelet families. IEEE trans-
actions on signal processing, 55(9):4415–4429, 2007.
D. Wu, K. Kim, B. Dong, G. E. Fakhri, and Q. Li. End-to-end lung nodule detection
in computed tomography. In Machine Learning in Medical Imaging: 9th International
Workshop, MLMI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain,
September 16, 2018, Proceedings 9, pages 37–45. Springer, 2018.
X. Xu, J. Liu, Y. Sun, B. Wohlberg, and U. S. Kamilov. Boosting the performance of plug-
and-play priors via denoiser scaling, 2020. URL https://arxiv.org/abs/2002.11546.