Published as a conference paper at ICLR 2024

DECOMPOSED DIFFUSION SAMPLER FOR ACCELERATING LARGE-SCALE INVERSE PROBLEMS

Hyungjin Chung1 , Suhyeon Lee2 , Jong Chul Ye2


1 Dept. of Bio & Brain Engineering, KAIST; 2 Kim Jae Chul Graduate School of AI, KAIST
{[Link], [Link], [Link]}@[Link]

ABSTRACT

Krylov subspace, which is generated by multiplying a given vector by the matrix of


a linear transformation and its successive powers, has been extensively studied in
classical optimization literature to design algorithms that converge quickly for large
linear inverse problems. For example, the conjugate gradient method (CG), one
of the most popular Krylov subspace methods, is based on the idea of minimizing
the residual error in the Krylov subspace. However, with the recent advancement
of high-performance diffusion solvers for inverse problems, it is not clear how
classical wisdom can be synergistically combined with modern diffusion models.
In this study, we propose a novel and efficient diffusion sampling strategy that
synergistically combines the diffusion sampling and Krylov subspace methods.
Specifically, we prove that if the tangent space at a denoised sample by Tweedie’s
formula forms a Krylov subspace, then the CG initialized with the denoised data
ensures the data consistency update to remain in the tangent space. This negates
the need to compute the manifold-constrained gradient (MCG), leading to a more
efficient diffusion sampling method. Our method is applicable regardless of the
parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art
reconstruction quality on challenging real-world medical inverse imaging problems,
including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover,
our proposed method achieves more than 80 times faster inference time than
the previous state-of-the-art method. Code is available at [Link]
HJ-harry/DDS

1 INTRODUCTION
Diffusion models (Ho et al., 2020; Song et al., 2021b) are state-of-the-art generative models that learn to generate data by gradual denoising, starting from a reference distribution (e.g. Gaussian).
In addition to superior sample quality in the unconditional sampling of the data distribution, it was
shown that one can use unconditional diffusion models as generative priors that model the data
distribution (Chung et al., 2022a; Kawar et al., 2022), and incorporate the information from the
forward physics model along with the measurement y to sample from the posterior distribution
p(x|y). This property is especially intriguing when seen in the context of Bayesian inference, as we
can separate the parameterized prior pθ (x) from the measurement model (i.e. likelihood) p(y|x) to
construct the posterior pθ (x|y) ∝ pθ (x)p(y|x). In other words, one can use the same pre-trained
neural network model regardless of the forward model at hand. Throughout the manuscript, we refer
to this class of methods as Diffusion model-based Inverse problem Solvers (DIS).
For inverse problems in medical imaging, e.g. magnetic resonance imaging (MRI), computed
tomography (CT), it is often required to accelerate the measurement process by reducing the number of
measurements. However, the data acquisition scheme may vary vastly according to the circumstances
(e.g. vendor, sequence, etc.), and hence the reconstruction algorithm needs to be adaptable to
different possibilities. Supervised learning schemes show weakness in this aspect as they overfit to
the measurement types that were used during training. As such, it is easy to see that diffusion models
will be particularly strong in this aspect as they are agnostic to the different forward models at
inference. Indeed, it was shown in some of the pioneering works that diffusion-based reconstruction
algorithms have high generalizability (Jalal et al., 2021; Chung & Ye, 2022; Song et al., 2022; Chung
et al., 2023b).


[Figure 2 — image not reproduced in this extraction. (a) Multi-coil CS-MRI: Mask, U-Net, E2E-varnet, Score-MRI (4000), Ours (49), Ground Truth for the noiseless case; A†y, DPS (50), DPS (1000), Ours (49), Ground Truth for the noisy case (σ = 0.05). (b) 3D Sparse-view CT: FBP, DiffusionMBIR (4000), Ours (19), Ground Truth shown in the xy, yz, and xz planes.]

Figure 2: Representative reconstruction results. (a) Multi-coil MRI reconstruction, (b) 3D sparse-view CT. Numbers in parenthesis: NFE. Yellow numbers in bottom left corner: PSNR/SSIM.

[Figure 1 — plot not reproduced in this extraction. Parallel imaging MR reconstruction evaluation, PSNR [dB] vs. NFE (log scale, NFE = 20 to 4000), comparing DDS (ours), Score-MRI (Chung & Ye, 2022), and Jalal et al. (2021). Reconstruction from 1D uniform random ×4 acceleration (Zbontar et al., 2018).]

While showing superiority in performance and generalization capacity, slow inference time is known to be a critical drawback of diffusion models. One of the most widely acknowledged accelerated diffusion sampling strategies is the denoising diffusion implicit model (DDIM) (Song et al., 2021a), where the stochastic ancestral sampling of denoising diffusion probabilistic models (DDPM) can be transitioned to deterministic sampling, thereby accelerating the sampling. Accordingly, DDIM sampling has been well incorporated in solving low-level vision inverse problems (Kawar et al., 2022; Song et al., 2023). In a recent application of DDIM for linear image restoration tasks,
Wang et al. (2023) proposed an algorithm dubbed denoising diffusion null-space models (DDNM),
where one-step null-space modification is made to impose consistency. However, the sampling
strategy is not successful in practical large-scale medical imaging contexts when the forward model
is significantly more complex (e.g. parallel imaging (PI) CS-MRI, 3D modalities). Furthermore, it is
unclear how the algorithm is related to the existing literature of conditional sampling approaches that
take into account the geometry of the manifold (Chung et al., 2022a; 2023a).
On the other hand, in classical optimization literature, Krylov subspace methods have been extensively
studied to deal with large-scale inverse problems due to their rapid convergence rates (Liesen &


Strakos, 2013). Specifically, consider a typical linear inverse problem


y = Ax, (1)
where A is the linear mapping and the goal is to retrieve x from the measurement y. Without loss
of generality, throughout the paper we assume that A in (1) is square. Otherwise, we can obtain
an equivalent inverse problem with symmetric linear mapping à as ỹ := A∗ y = A∗ Ax := Ãx.
This is because the solution to the normal equation A∗ Ax = A∗ y is indeed a solution to Ax = y
if A∗ has full column rank, which holds in most of the ill-posed inverse problem cases. Given an
initial guess x̂, Krylov subspace methods seek an approximate solution x(l) from an affine subspace
x̂ + Kl , where the l-th order Krylov subspace Kl is defined by
$$\mathcal{K}_l := \mathrm{Span}(b, Ab, \cdots, A^{l-1}b), \qquad b := y - A\hat{x} \qquad (2)$$
For example, the conjugate gradient (CG) method is a special class of the Krylov subspace method
that minimizes the residual in the Krylov subspace. Krylov subspace methods are particularly useful
for large-scale problems thanks to their fast convergence (Liesen & Strakos, 2013).
Inspired by this, here we are interested in developing a method that synergistically combines Krylov
subspace methods with diffusion models such that it can be effectively used for large-scale inverse
problems. Specifically, based on a novel observation that a diffusion posterior sampling (DPS) (Chung
et al., 2023a) with the manifold constrained gradient (MCG) (Chung et al., 2022a) is equivalent to
one-step projected gradient on the tangent space at the “denoised" data by Tweedie’s formula, we
provide multi-step update scheme on the tangent space using Krylov subspace methods. Specifically,
we show that the multiple CG updates are guaranteed to remain in the tangent space, and subsequently
generated sample with the addition of the noise component can be correctly transferred to the next
noisy manifold. This eliminates the need for the computationally demanding MCG while permitting
multiple economical CG steps at each ancestral diffusion sampling, resulting in a more efficient
DDIM sampling. Our analysis holds for both variance-preserving (VP) and variance-exploding (VE)
sampling schemes.
The combined strategy, dubbed Decomposed Diffusion Sampling (DDS), yields better performance
with much-reduced sampling time (20∼50 NFE, ×80 ∼ 200 acceleration; See Fig. 1 for comparison,
Fig. 2 for representative results), and is shown to be applicable to a variety of challenging large scale
inverse problem tasks: multi-coil MRI reconstruction and 3D CT reconstruction.

2 BACKGROUND
Krylov subspace methods Consider the linear system (1). In classical projection based methods
such as Krylov subspace methods (Liesen & Strakos, 2013), for given two subspace K and L, we
define an approximate problem to find x ∈ K such that
y − Ax ⊥ L (3)
This is a basic projection step, and the sequence of such steps is applied. For example, with non-zero
estimate x̂, the associated problem is to find x ∈ x̂ + K such that y − Ax ⊥ L, which is equivalent
to finding δ ∈ K such that
b − Aδ ⊥ L, δ := x − x̂, b := y − Ax̂. (4)
In terms of the choice of the two subspaces, the CG method chooses the two subspaces K and L as
the same Krylov subspace:
$$\mathcal{K} = \mathcal{L} = \mathcal{K}_l := \mathrm{Span}(b, Ab, \cdots, A^{l-1}b). \qquad (5)$$
Then, CG attempts to find the solution to the following optimization problem:
$$\min_{x\in\hat{x}+\mathcal{K}_l}\|y-Ax\|^2 \qquad (6)$$
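To make the subspace property concrete, the following NumPy/SciPy sketch (an illustration, not taken from the paper) runs M CG steps on a small symmetric positive-definite system starting from an arbitrary initialization x̂, and verifies numerically that the update x^(M) − x̂ lies in K_M = Span(b, Ab, ..., A^{M−1}b) with b = y − Ax̂:

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, M = 32, 4

# Random symmetric positive-definite A and a consistent measurement y = A x_true
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
x_true = rng.standard_normal(n)
y = A @ x_true

x_hat = rng.standard_normal(n)    # arbitrary initialization (plays the role of the "denoised" point)
b = y - A @ x_hat                 # initial residual

# M-step CG iterate starting from x_hat (maxiter caps the number of CG steps)
x_M, _ = cg(A, y, x0=x_hat, maxiter=M)

# Krylov basis K_M = Span(b, Ab, ..., A^{M-1} b)
K = np.stack([np.linalg.matrix_power(A, k) @ b for k in range(M)], axis=1)

# The CG update x_M - x_hat should be (numerically) contained in K_M
coef, res, *_ = np.linalg.lstsq(K, x_M - x_hat, rcond=None)
print("residual of projection onto K_M:", float(res[0]) if res.size else 0.0)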

Krylov subspace methods can also be extended to nonlinear problems via zero-finding. Specifically,
the optimization problem minx ℓ(x) can be equivalently converted to a zero-finding problem of its
gradient, i.e. ∇x ℓ(x) = 0. If we consider a non-linear forward imaging operator A(·), we can define
ℓ(x) = ∥y − A(x)∥2 /2. Then, one can use, for example, Newton-Krylov method (Knoll & Keyes,


2004) to linearize the problem near the current solution and apply standard Krylov methods to solve
the current problem. Now, given the optimization problem, we can see the fundamental differences
between the gradient-based approaches and Krylov methods. Specifically, gradient methods are based
on the iteration:
x(i+1) = x(i) − γ∇x ℓ(x(i) ) (7)
which stops updating when ∇x ℓ(x(i) ) ≃ 0. On the other hand, Krylov subspace methods try to find
x ∈ Kl by increasing l to achieve a better approximation of ∇x ℓ(x) = 0. This difference allows
us to devise a computationally efficient algorithm when combined with diffusion models, which we
investigate in this paper. See Appendix A.2 for further mathematical background.

Diffusion models Diffusion models (Ho et al., 2020) attempt to model the data distribution pdata (x)
by constructing a hierarchical latent variable model
$$p_\theta(x_0) = \int p_\theta(x_T)\prod_{t=1}^{T}p_\theta^{(t)}(x_{t-1}\mid x_t)\,dx_{1:T}, \qquad (8)$$

where x{1,...,T} ∈ R^d are noisy latent variables that have the same dimension as the data random vector x0 ∈ R^d, defined by the Markovian forward conditional densities

$$q(x_t\mid x_{t-1}) = \mathcal{N}\big(x_t \mid \sqrt{\beta_t}\,x_{t-1},\,(1-\beta_t)I\big), \qquad (9)$$
$$q(x_t\mid x_0) = \mathcal{N}\big(x_t \mid \sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I\big). \qquad (10)$$

Here, the noise schedule βt is an increasing sequence of t, with $\bar\alpha_t := \prod_{i=1}^{t}\alpha_i$, $\alpha_t := 1-\beta_t$. Training of diffusion models amounts to training a multi-noise level residual denoiser (i.e. epsilon matching)

$$\min_\theta\,\mathbb{E}_{x_t\sim q(x_t|x_0),\,x_0\sim p_{\mathrm{data}}(x_0),\,\epsilon\sim\mathcal{N}(0,I)}\Big[\big\|\epsilon_\theta^{(t)}(x_t)-\epsilon\big\|_2^2\Big],$$

such that $\epsilon_{\theta^*}^{(t)}(x_t)\simeq \frac{x_t-\sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}}$. Furthermore, it can be shown that epsilon matching is equivalent to the denoising score matching (DSM) (Hyvärinen & Dayan, 2005; Song & Ermon, 2019) objective up to a constant with different parameterization

$$\min_\theta\,\mathbb{E}_{x_t,x_0,\epsilon}\Big[\big\|s_\theta^{(t)}(x_t)-\nabla_{x_t}\log q(x_t|x_0)\big\|_2^2\Big], \qquad (11)$$

such that $s_{\theta^*}^{(t)}(x_t)\simeq -\frac{x_t-\sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\epsilon_{\theta^*}^{(t)}(x_t)/\sqrt{1-\bar\alpha_t}$. For notational simplicity, we often denote $\hat{s}_t,\hat\epsilon_t,\hat{x}_t$ instead of $s_{\theta^*}^{(t)}(x_t),\epsilon_{\theta^*}^{(t)}(x_t),x_{\theta^*}^{(t)}(x_t)$, representing the estimated score, noise, and noiseless image, respectively. Sampling from (8) can be implemented by ancestral sampling, which iteratively performs

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\hat\epsilon_t\right)+\sqrt{\tilde\beta_t}\,\epsilon, \qquad (12)$$

where $\tilde\beta_t := \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$.
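For reference, a minimal PyTorch-style sketch of one ancestral sampling step (12) follows; eps_model, alphas, alphas_bar, and betas are hypothetical placeholders for a trained noise predictor and its schedule, not objects defined in the paper:

import torch

def ancestral_step(x_t, t, eps_model, alphas, alphas_bar, betas):
    """One DDPM ancestral sampling step, Eq. (12): x_t -> x_{t-1}.

    eps_model(x_t, t) is assumed to return the predicted noise eps_hat_t."""
    eps_hat = eps_model(x_t, t)
    alpha_t, alpha_bar_t = alphas[t], alphas_bar[t]
    alpha_bar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
    beta_tilde = (1 - alpha_bar_prev) / (1 - alpha_bar_t) * betas[t]

    mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(beta_tilde) * noise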

One can also view ancestral sampling (12) as solving the reverse generative stochastic differential
equation (SDE) of the variance preserving (VP) linear forward SDE (Song et al., 2021b). Additionally,
one can construct the variance exploding (VE) SDE by setting q(xt |x0 ) = N (xt |x0 , σt2 I), which
is in a form of Brownian motion. In Appendix A.1 we further review the background on diffusion
models under the score/SDE perspective.

DDIM Seen either from the variational or the SDE perspective, diffusion models are inevitably
slow to sample from. To overcome this issue, DDIM (Song et al., 2021a) proposes another method of
sampling which only requires matching the marginal distributions q(xt |x0 ). Specifically, the update
rule is given as follows

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_t + \sqrt{1-\bar\alpha_{t-1}-\eta^2\tilde\beta_t^2}\;\epsilon_{\theta^*}^{(t)}(x_t) + \eta\tilde\beta_t\epsilon = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_t + \tilde{w}_t, \qquad (13)$$

where x̂t is the denoised estimate

$$\hat{x}_t := x_{\theta^*}^{(t)}(x_t) := \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_{\theta^*}^{(t)}(x_t)\right), \qquad (14)$$

which can also be equivalently derived from Tweedie's formula (Efron, 2011), and w̃t denotes the total noise given by

$$\tilde{w}_t := \sqrt{1-\bar\alpha_{t-1}-\eta^2\tilde\beta_t^2}\;\epsilon_{\theta^*}^{(t)}(x_t) + \eta\tilde\beta_t\epsilon. \qquad (15)$$

In (13), η ∈ [0, 1] is a parameter controlling the stochasticity of the update rule: η = 0.0 leads to fully deterministic sampling, whereas η = 1.0 with $\tilde\beta_t = \sqrt{(1-\bar\alpha_{t-1})/(1-\bar\alpha_t)}\,\sqrt{1-\bar\alpha_t/\bar\alpha_{t-1}}$ recovers the ancestral sampling of DDPMs.
It is important to note that the noise component w̃t properly matches the forward marginal (Song et al., 2021a). The direction w̃t of this transition is determined by the vector sum of the deterministic and the stochastic directional components. Accordingly, assuming optimality of $\epsilon_{\theta^*}^{(t)}$, the total noise w̃t in (15) can be represented by

$$\tilde{w}_t = \sqrt{1-\bar\alpha_{t-1}}\,\tilde\epsilon \qquad (16)$$

for some ϵ̃ ∼ N(0, I) (see Appendix B). In other words, (13) is equivalently represented by $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_t + \sqrt{1-\bar\alpha_{t-1}}\,\tilde\epsilon$ for some ϵ̃ ∼ N(0, I). Therefore, it can be seen that the difference between DDIM and DDPM lies only in the degree of dependence on the deterministic estimate of the noise component, with feasible intermediate values η ∈ (0, 1).
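A minimal sketch of one DDIM update (13)-(15), with the same hypothetical eps_model and schedule placeholders as above, could look as follows; η = 0 gives the deterministic sampler and η = 1 recovers ancestral sampling:

import torch

def ddim_step(x_t, t, eps_model, alphas_bar, beta_tilde, eta=0.0):
    """One DDIM step, Eqs. (13)-(15): Tweedie denoising followed by re-noising."""
    eps_hat = eps_model(x_t, t)
    a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]

    # Eq. (14): denoised estimate x_hat_t via Tweedie's formula
    x_hat = (x_t - torch.sqrt(1 - a_t) * eps_hat) / torch.sqrt(a_t)

    # Eq. (15): total noise, split into deterministic and stochastic parts
    sigma = eta * beta_tilde[t]
    w = torch.sqrt(1 - a_prev - sigma**2) * eps_hat + sigma * torch.randn_like(x_t)

    return torch.sqrt(a_prev) * x_hat + w      # Eq. (13)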

3 DECOMPOSED DIFFUSION SAMPLING


Conditional diffusion for inverse problems The conditional diffusion sampling for inverse prob-
lems (Kawar et al., 2022; Chung et al., 2023a;b) attempts to solve the following optimization problem:
$$\min_{x\in\mathcal{M}}\ell(x) \qquad (17)$$

where ℓ(x) denotes the data consistency (DC) loss (i.e., ℓ(x) = ∥y − Ax∥2 /2 for linear inverse
problems) and M represents the clean data manifold. Consequently, it is essential to navigate in a
way that minimizes cost while also identifying the correct clean manifold. Accordingly, most of the
approaches use standard reverse diffusion (e.g. (12)), alternated with an operation to minimize the
DC loss.
Recently, Chung et al. (2023a) proposed DPS, where the updated estimate from the noisy sample
xt ∈ Mt is constrained to stay on the same noisy manifold Mt . This is achieved by computing the
MCG (Chung et al., 2022a) on a noisy sample xt ∈ Mt as ∇xt^mcg ℓ(xt) := ∇xt ℓ(x̂t), where x̂t is the denoised sample in (14) through Tweedie's formula. The resulting algorithm can be stated as follows:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\big(\hat{x}_t - \gamma_t\nabla_{x_t}\ell(\hat{x}_t)\big) + \tilde{w}_t \qquad (18)$$
where γt > 0 denotes the step size. Since the parameterized score function ϵθ∗^(t)(xt) is trained with samples supported on Mt, ϵθ∗^(t) shows good performance on denoising xt ∼ Mt, allowing precise
transition to Mt−1 . Therefore, by performing (18) from t = T to t = 0, we can solve the
optimization problem (17) with x0 ∈ M. Unfortunately, the computation of MCG for DPS requires
computationally expensive backpropagation and is often unstable (Poole et al., 2022; Du et al., 2023).

Key observation By applying the chain rule for the MCG term in (18), we have
$$\nabla_{x_t}\ell(\hat{x}_t) = \frac{\partial\hat{x}_t}{\partial x_t}\nabla_{\hat{x}_t}\ell(\hat{x}_t)$$
where we use the denominator layout for vector calculus. Since ∇x̂t ℓ(x̂t) is a standard gradient, the main complexity of the MCG arises from the Jacobian term ∂x̂t/∂xt.
In the following Proposition 1, we show that if the underlying clean manifold forms an affine subspace, then the Jacobian term ∂x̂t/∂xt is indeed the orthogonal projection on the clean manifold up to a scale


factor. Note that the affine subspace assumption is widely used in the diffusion literature that has
been used when 1) studying the possibility of score estimation and distribution recovery (Chen et al.,
2023), 2) showing the possibility of signal recovery (Rout et al., 2023a;b), and the most relevantly, 3)
showing the geometrical view of the clean and the noisy manifolds (Chung et al., 2022a). Although
it is difficult to assume in practice that the clean manifold forms an affine subspace, it could be
approximated by piece-wise linear regions represented by the tangent subspace at x̂t . Therefore,
Proposition 1 is still valid in that approximate regime.
Proposition 1 (Manifold Constrained Gradient). Suppose the clean data manifold M is represented
as an affine subspace and assume a uniform distribution on M. Then,
$$\frac{\partial\hat{x}_t}{\partial x_t} = \frac{1}{\sqrt{\bar\alpha_t}}P_{\mathcal{M}} \qquad (19)$$
$$\hat{x}_t - \gamma_t\nabla_{x_t}\ell(\hat{x}_t) = P_{\mathcal{M}}\big(\hat{x}_t - \zeta_t\nabla_{\hat{x}_t}\ell(\hat{x}_t)\big) \qquad (20)$$
for some ζt > 0, where PM denotes the orthogonal projection to M.

Now, (20) in Proposition 1 indicates that if the clean data manifold is an affine subspace, the DPS
corresponds to the projected gradient on the clean manifold. Nonetheless, a notable limitation of
MCG is its inefficient use of a single projected gradient step for each ancestral diffusion sampling.
Motivated by this, we aim to explore extensions that allow computationally efficient multi-step
optimization steps per each ancestral sampling.
Specifically, let Tt denote the tangent space on the clean manifold at a denoised sample x̂t in (14).
Suppose, furthermore, that there exists the l-th order Krylov subspace:
$$\mathcal{K}_{t,l} := \mathrm{Span}(b, Ab, \cdots, A^{l-1}b), \qquad b := y - A\hat{x}_t \qquad (21)$$
such that Tt = x̂t + Kt,l. Then, using the property of CG in (6), it is easy to see that the M-step CG update with M ≤ l starting from x̂t is confined in Tt, since it corresponds to the solution of
$$\min_{x\in\hat{x}_t+\mathcal{K}_M}\|y-Ax\|^2 \qquad (22)$$

and KM ⊂ Kl when M ≤ l. This offers a pivotal insight. It shows that if the tangent space at each
denoised sample is representable by a Krylov subspace, there’s no need to compute the MCG. Rather,
the standard CG method suffices to guarantee that the updated samples stay within the tangent space.
To sum up, our DDS algorithm is as follows:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}'_t + \tilde{w}_t, \qquad (23)$$
$$\hat{x}'_t = \mathrm{CG}(A^*A,\, A^*y,\, \hat{x}_t,\, M), \qquad M \le l \qquad (24)$$
where CG(·) denotes the M -step CG for the normal equation starting from x̂t . In contrast,
DDNM (Wang et al., 2023) for noiseless image restoration problems uses the following update
instead of (24):
x̂′t = (I − A† A)x̂t + A† y, (25)

where A† denotes the pseudo-inverse of A. Unfortunately, (25) in DDNM does not ensure that the
update signal x̂′t lies in Tt due to A† .
Therefore, for large-scale inverse problems, we find that CG outperforms naive projections (25)
by quite large margins. This is to be expected, as CG iteration enforces the update to stay on Tt
whereas the orthogonal projections in DDNM do not guarantee this property. In practice, even when
our Krylov subspace assumptions cannot be guaranteed, we empirically validate in Appendix F.1
that DDS indeed keeps xt closest to the noisy manifold Mt , which, in turn, shows that DDS keeps
the update close to the clean manifold M. Moreover, it is worth emphasizing that gradient-based
methods (Jalal et al., 2021; Chung et al., 2022a; 2023a) often fail when choosing the “theoretically
correct” step sizes of the likelihood. To fix this, several heuristics on the choice of step sizes are
required (e.g. choosing the step size ∝ 1/∥y − Ax̂t ∥2 ), which easily breaks when varying the NFE.
In this regard, DDS is beneficial in that it is free from the cumbersome step-size tuning process.
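As a rough sketch (under the assumption of real-valued tensors, with A, At, and eps_model as hypothetical function handles rather than objects defined in the paper), one DDS iteration combining Tweedie denoising (14), the M-step CG update (24), and the DDIM transition (23) could be written as:

import torch

def cg_normal(A, At, y, x0, n_iter=5):
    """n_iter CG steps on the normal equations A^T A x = A^T y, initialized at x0."""
    x = x0.clone()
    r = At(y) - At(A(x))
    p = r.clone()
    rs = torch.sum(r * r)
    for _ in range(n_iter):
        Ap = At(A(p))
        alpha = rs / torch.sum(p * Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = torch.sum(r * r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def dds_step(x_t, t, eps_model, A, At, y, alphas_bar, beta_tilde, eta=0.15, M=5):
    """One DDS iteration, Eqs. (23)-(24): Tweedie denoising, M-step CG, DDIM re-noising."""
    eps_hat = eps_model(x_t, t)
    a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
    x_hat = (x_t - torch.sqrt(1 - a_t) * eps_hat) / torch.sqrt(a_t)    # Eq. (14)
    x_hat = cg_normal(A, At, y, x_hat, n_iter=M)                        # Eq. (24)
    sigma = eta * beta_tilde[t]
    w = torch.sqrt(1 - a_prev - sigma**2) * eps_hat + sigma * torch.randn_like(x_t)
    return torch.sqrt(a_prev) * x_hat + w                               # Eq. (23)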


Columns (left to right): TV | U-Net (Zbontar et al., 2018) | E2E-Varnet (Sriram et al., 2020) | Jalal et al. (2100) | Score-MRI (4000 × 2 × C∗) | DPS (1000) | DDS VP (99) | DDS VP (49) | DDS VP (19)

Uniform 1D, ×4
  PSNR [dB]: 27.32±0.43 | 31.77±0.89 | 32.96±0.59 | 32.49±2.10 | 33.25±1.18 | 30.56±0.66 | 34.88±0.74 | 34.61±0.32 | 32.73±2.04
  SSIM:      0.662±0.17 | 0.846±0.11 | 0.856±0.11 | 0.868±0.08 | 0.857±0.08 | 0.840±0.20 | 0.954±0.11 | 0.956±0.08 | 0.927±0.08
Uniform 1D, ×8
  PSNR [dB]: 25.02±2.21 | 29.51±0.37 | 31.98±0.35 | 32.19±2.45 | 32.01±2.30 | 30.29±0.33 | 31.62±1.88 | 30.16±1.19 | 30.33±2.35
  SSIM:      0.532±0.05 | 0.780±0.05 | 0.828±0.08 | 0.835±0.16 | 0.821±0.15 | 0.811±0.18 | 0.876±0.08 | 0.830±0.04 | 0.891±0.16
Gaussian 1D, ×4
  PSNR [dB]: 30.55±1.77 | 32.66±0.26 | 34.15±1.40 | 33.98±1.25 | 34.25±1.33 | 32.47±1.09 | 35.12±1.37 | 35.15±0.39 | 34.63±1.95
  SSIM:      0.789±0.06 | 0.866±0.12 | 0.878±0.19 | 0.881±0.12 | 0.885±0.08 | 0.838±0.20 | 0.963±0.15 | 0.961±0.06 | 0.957±0.09
Gaussian 1D, ×8
  PSNR [dB]: 27.98±1.28 | 31.64±1.12 | 33.15±2.09 | 32.76±2.43 | 32.43±0.95 | 30.47±2.32 | 33.27±1.06 | 33.43±0.75 | 32.83±1.29
  SSIM:      0.747±0.21 | 0.841±0.09 | 0.868±0.18 | 0.870±0.13 | 0.855±0.13 | 0.839±0.16 | 0.937±0.07 | 0.947±0.15 | 0.940±0.09
Gaussian 2D, ×8
  PSNR [dB]: 29.20±2.37 | 24.51±0.69 | 20.97±1.24 | 30.97±1.14 | 31.43±1.23 | 29.65±1.26 | 33.99±1.30 | 34.55±1.69 | 32.55±1.54
  SSIM:      0.781±0.09 | 0.724±0.10 | 0.642±0.08 | 0.812±0.17 | 0.831±0.18 | 0.795±0.12 | 0.948±0.13 | 0.956±0.15 | 0.916±0.14
Gaussian 2D, ×15
  PSNR [dB]: 26.28±2.28 | 14.93±3.33 | 16.66±4.02 | 27.34±1.97 | 29.17±0.98 | 26.30±1.34 | 27.86±1.67 | 25.75±1.77 | 25.66±2.03
  SSIM:      0.547±0.19 | 0.372±0.29 | 0.435±0.26 | 0.692±0.13 | 0.704±0.08 | 0.688±0.11 | 0.732±0.10 | 0.695±0.09 | 0.693±0.08
VD Poisson disk, ×8
  PSNR [dB]: 29.52±1.26 | 20.89±3.09 | 20.70±3.08 | 32.60±1.88 | 31.98±0.51 | 31.05±0.46 | 35.31±0.79 | 35.36±0.41 | 35.39±0.57
  SSIM:      0.562±0.11 | 0.576±0.10 | 0.592±0.18 | 0.833±0.05 | 0.816±0.07 | 0.811±0.08 | 0.897±0.07 | 0.875±0.09 | 0.915±0.11
VD Poisson disk, ×15
  PSNR [dB]: 26.19±2.36 | 16.01±5.59 | 18.82±3.30 | 30.22±1.89 | 29.59±1.22 | 30.02±1.72 | 34.84±1.44 | 35.18±0.97 | 34.59±1.50
  SSIM:      0.510±0.20 | 0.537±0.21 | 0.548±0.19 | 0.749±0.17 | 0.702±0.15 | 0.753±0.15 | 0.934±0.06 | 0.931±0.05 | 0.940±0.05

Table 1: Quantitative metrics for noiseless parallel imaging reconstruction. Numbers in parenthesis: NFE. ∗: expressed as ×2 × C as the network is evaluated for the real/imag part of each coil. Bold: best. Mean ± 1 std.
Furthermore, our CG step can be easily extended for noisy image restoration problems. Unlike the
DDNM approach that relies on the singular value decomposition to handle noise, which is non-trivial
to perform on forward operators in medical imaging (e.g. PI CS-MRI, CT), we can simply minimize
the cost function
$$\ell(x) = \frac{\gamma}{2}\|y - Ax\|_2^2 + \frac{1}{2}\|x - \hat{x}_t\|_2^2, \qquad (26)$$
by performing CG iteration CG(γA∗ A + I, x̂t + γA∗ y, x̂t , M ) in the place of (24), where γ is
a hyper-parameter that weights the proximal regularization (Parikh & Boyd, 2014). Finally, our
method can also be readily extended to accelerate DiffusionMBIR (Chung et al., 2023b) for 3D
CT reconstruction by adhering to the same principles. Specifically, we implement an optimization
strategy to impose the conditioning:
$$\min_x \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|D_z x\|_1, \qquad (27)$$

where D z is the finite difference operator that is applied to the z-axis that is not learned through the
diffusion prior, and unlike Chung et al. (2023b), the optimization is performed in the clean manifold
starting from the denoised x̂t rather than the noisy manifold starting from xt . As the additional
prior is only imposed in the direction that is orthogonal to the axial slice dimension (xy) captured by
the diffusion prior (i.e. manifold M), (27) can be solved effectively with the alternating direction
method of multipliers (ADMM) (Boyd et al., 2011) after sampling 2D diffusion slice by slice. See
Appendix C for implementation details.
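For the noisy case, a corresponding sketch of the data-consistency step in (26) — again assuming real-valued tensors and hypothetical operator handles A, At — simply swaps the CG system for the proximally regularized normal equations (γA∗A + I)x = x̂t + γA∗y:

import torch

def dc_noisy(A, At, y, x_hat, gamma=0.95, n_iter=5):
    """Approximately minimize Eq. (26): n_iter CG steps on (gamma*A^T A + I) x = x_hat + gamma*A^T y,
    starting from the denoised estimate x_hat."""
    def lhs(v):                       # (gamma * A^T A + I) v
        return gamma * At(A(v)) + v

    rhs = x_hat + gamma * At(y)
    x = x_hat.clone()
    r = rhs - lhs(x)
    p = r.clone()
    rs = torch.sum(r * r)
    for _ in range(n_iter):
        Ap = lhs(p)
        alpha = rs / torch.sum(p * Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = torch.sum(r * r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x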

4 EXPERIMENTS
Problem setting. We have the following general measurement model (see Fig. 3 for illustration of
the imaging physics).
$$y = P\,T\,s\,x =: Ax, \qquad y\in\mathbb{C}^n,\; A\in\mathbb{C}^{n\times d}, \qquad (28)$$
where P is the sub-sampling matrix, T is the discrete transform matrix (i.e. Fourier, Radon), and s = I when we have a single-array measurement including CT, and s = [s^(1), . . . , s^(c)] when we have a c-coil parallel imaging (PI) measurement.
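For concreteness, a simplified PyTorch sketch of the parallel-imaging forward model in (28) — sub-sampling P, Fourier transform T, coil sensitivities s; tensor shapes and function names are illustrative assumptions, not the paper's implementation — is:

import torch

def A_mri(x, coil_sens, mask):
    """Forward operator of Eq. (28) for c-coil parallel-imaging CS-MRI:
    coil-wise sensitivity weighting, 2D Fourier transform, then k-space sub-sampling.

    x:         (H, W) complex image
    coil_sens: (c, H, W) complex coil sensitivities s^(1..c)
    mask:      (H, W) binary sub-sampling mask P"""
    coil_imgs = coil_sens * x                                # s x
    kspace = torch.fft.fft2(coil_imgs, norm="ortho")         # T (Fourier)
    return mask * kspace                                     # P (sub-sampling)

def At_mri(y, coil_sens, mask):
    """Adjoint A^*: zero-fill, inverse Fourier, coil combination with conjugate sensitivities."""
    coil_imgs = torch.fft.ifft2(mask * y, norm="ortho")
    return torch.sum(coil_sens.conj() * coil_imgs, dim=0)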
We conduct experiments on two distinguished applications—accelerated MRI, and 3D CT recon-
struction. For the former, we follow the evaluation protocol of Chung & Ye (2022) and test our
method on the fastMRI knee dataset (Zbontar et al., 2018) on diverse sub-sampling patterns. We
provide comparisons against representative DIS methods Score-MRI (Chung & Ye, 2022), Jalal et al.
(2021), DPS (Chung et al., 2023a). Notably, for DPS, we use the DDIM sampling strategy to show
that the strength of DDS not only comes from the DDIM sampling strategy but also the use of the
sampling together with the CG update steps. The optimal η values for DPS (DDIM) are obtained
through grid search. We do not compare against (Song et al., 2022) as the method cannot cover the
multi-coil setting. Other than DIS, we also compare against strong supervised learning baselines:
E2E-Varnet (Sriram et al., 2020), U-Net (Zbontar et al., 2018); and compressed sensing baseline: total


variation reconstruction (Block et al., 2007). For the latter, we follow Chung et al. (2023b) and test
both sparse-view CT reconstruction (SV-CT), and limited angle CT reconstruction (LA-CT) on the
AAPM 256×256 dataset. We compare against representative DIS methods DiffusionMBIR (Chung
et al., 2023b), Song et al. (2022), MCG (Chung et al., 2022a), and DPS (Chung et al., 2023a);
Supervised baselines Lahiri et al. (2022) and FBPConvNet (Jin et al., 2017); Compressed sensing
baseline ADMM-TV. For all proposed methods, we employ M = 5, η = 0.15 for 19 NFE, η = 0.5
for 49 NFE, η = 0.8 for 99 NFE unless specified otherwise. While we only compare against a single
CS baseline, it was reported in previous works that diffusion model-based solvers largely outperform
the classic CS baselines (Jalal et al., 2021; Luo et al., 2023), for example, L1-wavelet (Lustig et al.,
2007) and L1-ESPiRiT (Uecker et al., 2014). For PI CS-MRI experiments, we employ the rejection
sampling based on a residual-based criterion to ensure stability. Further experimental details can be
found in appendix G.

4.1 ACCELERATED MRI

Improvement over DDNM. Fixing the sampling strategy the same, we inspect the effect of the
three different data consistency imposing strategies: Score-MRI (Chung & Ye, 2022), DDNM (Wang
et al., 2023), and DDS. For DDS, we additionally search for the optimal number of CG iterations
per sampling step. In Tab. 2, we see that under the low NFE regime, the score-MRI DC strategy
has significantly worse performance than the proposed methods, even when using the same DDIM
sampling strategy. Moreover, we see that overall, DDS outperforms DDNM by a few db in PSNR. We
see that 5 CG iterations per denoising step strike a good balance. One might question the additional
computational overhead of introducing the iterative CG into the already slow diffusion sampling.
Nonetheless, from our experiments, we see that on average, a single CG iteration takes about 0.004
sec. Consequently, it only takes about 0.2 sec more than the analytic counterpart when using 50 NFE
(Analytic: 4.51 sec vs. CG(5): 4.71 sec.).
Improvement on VE (Chung & Ye, 2022). Keeping the pre-trained model intact from Chung & Ye (2022), we switch from the Score-MRI sampling to Algorithm 5, and report the reconstruction results from uniform 1D ×4 accelerated measurements in Tab. 6. Note that Score-MRI uses 2000 PC as the default setting, which amounts to 4000 NFE, reaching 33.25 PSNR. We see almost no degradation in quality down to 200 NFE, but the performance rapidly degrades as we move down to 100, and completely fails when we set the NFE ≤ 50. On the other hand, by switching to the proposed solver, we are able to achieve reconstruction quality better than Score-MRI (4000 NFE) with only 100 NFE sampling. Moreover, we see that we can reduce the NFE down to 30 and still achieve decent reconstructions. This is a useful property for a reconstruction algorithm, as we can trade off reconstruction quality with speed. However, we observe several downsides of using the VE parameterization, including numerical instability with large NFE, which we analyze in detail in Appendix F.

Table 2: Ablation study on the DC strategy. 49 NFE VP DDIM sampling strategy, uniform 1D ×4 acc. reconstruction. DDS columns are labeled by the number of CG iterations per sampling step.

            Score-MRI | DDNM  | DDS (1) | DDS (3) | DDS (5) | DDS (10)
PSNR [dB]   26.48     | 31.36 | 31.51   | 33.78   | 34.61   | 32.48
SSIM        0.688     | 0.932 | 0.934   | 0.952   | 0.956   | 0.949
Parallel Imaging with VP parameterization (Noiseless). We conduct thorough PI reconstruction
experiments with 4 different types of sub-sampling patterns following Chung & Ye (2022). Algo-
rithm 2 in supplementary material is used for all experiments. Quantitative results are shown in
Tab. 1 (Also see Fig. 7 for qualitative results). As the proposed method is based on diffusion models,
it is agnostic to the sub-sampling patterns, generalizing well to all the different sampling patterns,
whereas supervised learning-based methods such as U-Net and E2E-Varnet fail dramatically on 2D
subsampling patterns. Furthermore, to emphasize that the proposed method is agnostic to the imaging
forward model, we show for the first time in the DIS literature that DDS is capable of reconstructing
from non-cartesian MRI sub-sampling patterns that involve non-uniform Fast-Fourier Transform
(NUFFT) (Fessler & Sutton, 2003). See Appendix F.2.
In Tab. 1, we see that DDS sets the new state-of-the-art in most cases even when the NFE is
constrained to < 100. Note that this is a dramatic improvement over the previous method Chung &
Ye (2022), as for parallel imaging, Score-MRI required 120k(C = 15) NFE for the reconstruction
of a single image. Contrarily, DDS is able to outperform score-MRI with 49 NFE, and performs


Method — values are reported as PSNR↑ and SSIM↑ for, in order: 8-view (Axial∗, Coronal, Sagittal), 4-view (Axial∗, Coronal, Sagittal), 2-view (Axial∗, Coronal, Sagittal).
DDS VP (19) 32.31 0.904 35.82 0.975 33.03 0.931 30.59 0.906 30.38 0.947 27.90 0.903 25.43 0.844 24.38 0.862 22.10 0.769
DDS VP (49) 33.86 0.930 35.39 0.974 32.97 0.937 31.48 0.918 30.81 0.949 28.43 0.900 25.85 0.857 24.60 0.871 22.69 0.791
DDS VE (99) 33.43 0.932 34.35 0.972 32.01 0.935 31.23 0.915 30.62 0.958 28.52 0.914 25.09 0.833 24.15 0.855 22.26 0.785
DiffusionMBIR (Chung et al., 2023b) (4000) 33.49 0.942 35.18 0.967 32.18 0.910 30.52 0.914 30.09 0.938 27.89 0.871 24.11 0.810 23.15 0.841 21.72 0.766
DPS (Chung et al., 2023a) (1000) 27.86 0.858 27.07 0.860 23.66 0.744 26.96 0.842 26.09 0.817 22.70 0.737 22.16 0.773 21.56 0.784 19.34 0.698
Score-Med (Song et al., 2022) (4000) 29.10 0.882 27.93 0.875 24.23 0.759 28.20 0.867 27.48 0.889 25.08 0.783 24.07 0.808 23.70 0.822 20.95 0.720
MCG (Chung et al., 2022a) (4000) 28.61 0.873 28.05 0.884 24.45 0.765 27.33 0.855 26.52 0.863 23.04 0.745 24.69 0.821 23.52 0.806 20.71 0.685
Lahiri et al. (2022) 21.38 0.711 23.89 0.769 20.81 0.716 20.37 0.652 21.41 0.721 18.40 0.665 19.74 0.631 19.92 0.720 17.34 0.650
FBPConvNet (Jin et al., 2017) 16.57 0.553 19.12 0.774 18.11 0.714 16.45 0.529 19.47 0.713 15.48 0.610 16.31 0.521 17.05 0.521 11.07 0.483
ADMM-TV 16.79 0.645 18.95 0.772 17.27 0.716 13.59 0.618 15.23 0.682 14.60 0.638 10.28 0.409 13.77 0.616 11.49 0.553

Table 4: Quantitative evaluation of SV-CT on the AAPM 256×256 test set (mean values; std values
in Tab. 8). (Numbers in parenthesis): NFE, Bold: best.

on par with score-MRI with 19 NFE. Even disregarding the additional ×2C more NFEs required
for score-MRI to account for the multi-coil complex valued acquisition, the proposed method still
achieves ×80 ∼ ×200 acceleration. We note that on average, our method takes about 4.7 seconds for
49 NFE, and about 2.25 seconds for 19 NFE on a single commodity GPU (RTX 3090).
Noisy multi-coil MRI reconstruction. One of the most intriguing properties of the proposed DDS is the ease of handling measurement noise without careful computation of singular value decomposition (SVD), which is non-trivial to perform for our tasks. With (26), we can solve it with CG, arriving at Algorithm 3 in the supplementary material. For comparison, methods that try to cope with measurement noise via SVD in the diffusion model context (Kawar et al., 2022; Wang et al., 2023) are not applicable and cannot be compared. One work that does not require computation of SVD and hence is applicable is DPS (Chung et al., 2023a), relying on backpropagation. To test the efficacy of DDS on noisy inverse problems, we add a rather heavy complex Gaussian noise (σ = 0.05) to the k-space multi-coil measurements and reconstruct with Algorithm 3 by setting γ = 0.95 found through grid search. In Tab. 3, we see that DDS far outperforms DPS (Chung et al., 2023a) with 1000 NFE by a large margin, while being about ×40 faster as DPS requires the heavy computation of backpropagation.

Table 3: Quantitative metrics for noisy parallel imaging reconstruction. Numbers in parenthesis: NFE.

Mask Pattern      Acc.   Metric      TV     DPS (1000)   DDS VP (49)
Uniform 1D        ×4     PSNR [dB]   24.19  24.40        29.47
                         SSIM        0.687  0.656        0.866
Uniform 1D        ×8     PSNR [dB]   23.02  24.60        26.77
                         SSIM        0.638  0.666        0.827
VD Poisson disk   ×8     PSNR [dB]   23.07  23.48        30.95
                         SSIM        0.609  0.592        0.890
VD Poisson disk   ×15    PSNR [dB]   20.92  23.57        29.36
                         SSIM        0.554  0.622        0.853

4.2 3D CT RECONSTRUCTION

Sparse-view CT. Similar to the accelerated MRI experiments, we aim to both 1) improve the original
VE model of Chung et al. (2023b), and 2) train a new VP model better suitable for DDS. Inspecting
Tab. 4, we see that by using Algorithm 6 in supplementary material, we are able to reduce the NFE
to 100 and still achieve results that are on par with DiffusionMBIR with 4000 NFE. However, we
observe similar instabilities with the VE parameterization. Additionally, we find it crucial to initialize
the optimization process with CG and later switch to ADMM-TV using a CG solver for proper
convergence (see appendix E.2 for discussion). Switching to the VP parameterization and using
Algorithm 4, we now see that DDS achieves the new state-of-the-art with ≤ 49 NFE. Notably, this
decreases the sampling time to ∼ 25 min for 49 NFE, and ∼ 10 min wall-clock time for 19 NFE on a
single RTX 3090 GPU, compared to the painful 2 days for DiffusionMBIR. In Tab. 7, we see similar
improvements that were seen from SV-CT, where DDS significantly outperforms DiffusionMBIR
while being several orders of magnitude faster.

5 CONCLUSION

In this work, we present Decomposed Diffusion Sampling (DDS), a general DIS for challenging
real-world medical imaging inverse problems. Leveraging the geometric view of diffusion models and
the property of the CG solvers on the tangent space, we show that performing numerical optimization
schemes on the denoised representation is superior to the previous methods of imposing DC. Further,
we devise a fast sampler based on DDIM that works well for both VE/VP settings. With extensive
experiments on multi-coil MRI reconstruction and 3D CT reconstruction, we show that DDS achieves
superior quality while being ≥ ×80 faster than the previous DIS.


Ethics Statement We recognize the profound potential of our approach to revolutionize diagnostic
procedures, enhance patient care, and reduce the need for invasive techniques. However, we are
also acutely aware of the ethical considerations surrounding patient data privacy and the potential
for misinterpretation of generated images. All medical data used in our experiments were publicly
available and fully anonymized, ensuring the utmost respect for patient confidentiality. We advocate
for rigorous validation and clinical collaboration before any real-world application of our findings, to
ensure both the safety and efficacy of our proposed methods in a medical setting.

Reproducibility Statement For every different application and different circumstances (noise-
less/noisy, VE/VP, 2D/3D), we provide tailored algorithms (see Appendices C and E) to ensure maximum
reproducibility. All the hyper-parameters used in the algorithms are detailed in Section 4 and
Appendix G. Code is open-sourced at [Link]

ACKNOWLEDGMENTS
This research was supported by the National Research Foundation of Korea(NRF)(RS-2023-
00262527), Field-oriented Technology Development Project for Customs Administration funded by
the Korean government (the Ministry of Science & ICT and the Korea Customs Service) through
the National Research Foundation (NRF) of Korea under Grant NRF2021M3I1A1097910 & NRF-
2021M3I1A1097938, Korea Medical Device Development Fund grant funded by the Korea gov-
ernment (the Ministry of Science and ICT, the Ministry of Trade, Industry, and Energy, the Min-
istry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711137899,
KMDF_PR_20200901_0015), and Culture, Sports, and Tourism R&D Program through the Korea
Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2023.

REFERENCES
Kai Tobias Block, Martin Uecker, and Jens Frahm. Undersampled radial MRI with multiple coils.
Iterative image reconstruction using a total variation constraint. Magnetic Resonance in Medicine:
An Official Journal of the International Society for Magnetic Resonance in Medicine, 57(6):
1086–1098, 2007.
Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the
alternating direction method of multipliers. Now Publishers Inc, 2011.
Guangyong Chen, Fengyuan Zhu, and Pheng Ann Heng. An efficient statistical method for image
noise level estimation. In Proceedings of the IEEE International Conference on Computer Vision,
pp. 477–485, 2015.
Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estima-
tion and distribution recovery of diffusion models on low-dimensional data. arXiv preprint
arXiv:2302.07194, 2023.
Hyungjin Chung and Jong Chul Ye. Score-based diffusion models for accelerated mri. Medical
Image Analysis, pp. 102479, 2022.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for
inverse problems using manifold constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022a. URL
[Link]
Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-Closer-Diffuse-Faster: Accelerating
Conditional Diffusion Models for Inverse Problems through Stochastic Contraction. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022b.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In International Conference on
Learning Representations, 2023a. URL [Link]
Hyungjin Chung, Dohoon Ryu, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Solving
3d inverse problems using pre-trained 2d diffusion models. IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2023b.


Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural
Information Processing Systems, 2021.
Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha
Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Composi-
tional generation with energy-based diffusion models and mcmc. In International Conference on
Machine Learning, pp. 8489–8510. PMLR, 2023.
Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association,
106(496):1602–1614, 2011.
Jeffrey A Fessler and Bradley P Sutton. Nonuniform fast fourier transforms using min-max interpola-
tion. IEEE transactions on signal processing, 51(2):560–574, 2003.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems, 33:6840–6851, 2020.
Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research, 6(4), 2005.
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jonathan Tamir.
Robust compressed sensing mri with deep generative priors. Advances in Neural Information
Processing Systems, 34, 2021.
Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional
neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):
4509–4522, 2017.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-
based generative models. In Proc. NeurIPS, 2022.
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration
models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances
in Neural Information Processing Systems, 2022. URL [Link]
kxXvopt9pWK.
Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Soft truncation: A
universal training technique of score-based diffusion model for high precision score estimation.
International conference on machine learning, 2022.
Dana A Knoll and David E Keyes. Jacobian-free newton–krylov methods: a survey of approaches
and applications. Journal of Computational Physics, 193(2):357–397, 2004.
Anish Lahiri, Marc Klasky, Jeffrey A Fessler, and Saiprasad Ravishankar. Sparse-view cone beam ct
reconstruction using data-consistent supervised and adversarial learning from scarce training data.
arXiv preprint arXiv:2201.09318, 2022.
Jörg Liesen and Zdenek Strakos. Krylov subspace methods: principles and analysis. Oxford
University Press, 2013.
Guanxiong Luo, Moritz Blumenthal, Martin Heide, and Martin Uecker. Bayesian mri reconstruction
with joint uncertainty estimation using diffusion models. Magnetic Resonance in Medicine, 90(1):
295–311, 2023.
Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed
sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journal of the
International Society for Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):
127–239, 2014.
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d
diffusion. arXiv, 2022.


Matteo Ronchetti. Torchradon: Fast differentiable routines for computed tomography. arXiv preprint
arXiv:2009.14788, 2020.
Litu Rout, Advait Parulekar, Constantine Caramanis, and Sanjay Shakkottai. A theoretical jus-
tification for image inpainting using denoising diffusion probabilistic models. arXiv preprint
arXiv:2302.01217, 2023a.
Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G Dimakis, and Sanjay
Shakkottai. Solving linear inverse problems provably via posterior sampling with latent diffusion
models. arXiv preprint arXiv:2307.00619, 2023b.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In 9th
International Conference on Learning Representations, ICLR, 2021a.
Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion
models for inverse problems. In International Conference on Learning Representations, 2023.
URL [Link]
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
In Advances in Neural Information Processing Systems, volume 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In 9th
International Conference on Learning Representations, ICLR, 2021b.
Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging
with score-based generative models. In International Conference on Learning Representations,
2022. URL [Link]
Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C Lawrence Zitnick, Nafissa Yakubova,
Florian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI re-
construction. In International Conference on Medical Image Computing and Computer-Assisted
Intervention, pp. 64–73. Springer, 2020.
Martin Uecker, Peng Lai, Mark J Murphy, Patrick Virtue, Michael Elad, John M Pauly, Shreyas S
Vasanawala, and Michael Lustig. ESPIRiT—an eigenvalue approach to autocalibrating parallel
MRI: where SENSE meets GRAPPA. Magnetic resonance in medicine, 71(3):990–1001, 2014.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computa-
tion, 23(7):1661–1674, 2011.
Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion
null-space model. In The Eleventh International Conference on Learning Representations, 2023.
URL [Link]
Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J Muckley,
Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, et al. fastMRI: An open dataset and
benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.
Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser:
Residual learning of deep CNN for image denoising. IEEE transactions on image processing, 26
(7):3142–3155, 2017.


A PRELIMINARIES
A.1 DIFFUSION MODELS

Let us define a random variable x0 ∼ p(x0 ) = pdata (x0 ), where pdata denotes the data distri-
bution. In diffusion models, we construct a continuous Gaussian perturbation kernel p(xt |x0 ) =
N(xt; st x0, st^2 σt^2 I) with t ∈ [0, 1], which smooths out the distribution. As t → 1, the marginal
distribution pt (xt ) is smoothed such that it approximates the Gaussian distribution, which becomes
our reference distribution to sample from. Using the reparametrization trick, one can directly sample
xt = st x0 + st σt z, z ∼ N (0, I). (29)
Diffusion models aim to revert the data noising process. Remarkably, it was shown that the data
noising process and the denoising process can both be represented as a stochastic differential equation
(SDE), governed by the score function ∇xt log p(xt ) (Song et al., 2021b; Karras et al., 2022).
Namely, the forward/reverse diffusion SDE can be succinctly represented as (assuming st = 1 for simplicity)
$$dx_{\pm} = -\dot{\sigma}_t\sigma_t\nabla_{x_t}\log p(x_t)\,dt \pm \dot{\sigma}_t\sigma_t\nabla_{x_t}\log p(x_t)\,dt + \sqrt{2\dot{\sigma}_t\sigma_t}\,dw_t, \qquad (30)$$
where wt is the standard Wiener process. Here, the + sign denotes the forward process, where (30)
collapses to a Brownian motion. With the − sign, the process runs backward, and we see that the score
function ∇xt log p(xt ) governs the reverse SDE. In other words, in order to run reverse diffusion
sampling (i.e. generative modeling), we need access to the score function of the data distribution.
The procedure called score matching, where one tries to train a parametrized model sθ to approximate
∇xt log p(xt ) can be done through score matching (Hyvärinen & Dayan, 2005). As explicit and
implicit score matching methods are costly to perform, the most widely used training method in the
modern sense is the so-called denoising score matching (DSM) (Vincent, 2011)
$$\min_\theta\,\mathbb{E}_{x_t,x_0,\epsilon}\Big[\big\|s_\theta^{(t)}(x_t)-\nabla_{x_t}\log p(x_t|x_0)\big\|_2^2\Big], \qquad (31)$$
which is easy to train as our perturbation kernel is Gaussian. Once sθ∗ is trained, we can use it as a
plug-in approximation of the score function to plug into (30).
The score function has close relation to the posterior mean E[x0 |xt ], which can be formally linked
through Tweedie’s formula
Lemma 1 (Tweedie’s formula). Given a Gaussian perturbation kernel p(xt |x0 ) =
N (xt ; st x0 , σt2 I), the posterior mean is given by
$$\mathbb{E}[x_0|x_t] = \frac{1}{s_t}\left(x_t + \sigma_t^2\nabla_{x_t}\log p(x_t)\right) \qquad (32)$$

Proof.
$$\begin{aligned}
\nabla_{x_t}\log p(x_t) &= \frac{\nabla_{x_t}p(x_t)}{p(x_t)} && (33)\\
&= \frac{1}{p(x_t)}\nabla_{x_t}\int p(x_t|x_0)p(x_0)\,dx_0 && (34)\\
&= \frac{1}{p(x_t)}\int \nabla_{x_t}p(x_t|x_0)p(x_0)\,dx_0 && (35)\\
&= \frac{1}{p(x_t)}\int p(x_t|x_0)\nabla_{x_t}\log p(x_t|x_0)p(x_0)\,dx_0 && (36)\\
&= \int p(x_0|x_t)\nabla_{x_t}\log p(x_t|x_0)\,dx_0 && (37)\\
&= \int p(x_0|x_t)\frac{s_t x_0 - x_t}{\sigma_t^2}\,dx_0 && (38)\\
&= \frac{s_t\,\mathbb{E}[x_0|x_t] - x_t}{\sigma_t^2}. && (39)
\end{aligned}$$


In other words, having access to the score function is equivalent to having access to the posterior
mean through time t, which we extensively leverage in this work.
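As a quick sanity check of Lemma 1 (our addition, not part of the original appendix), the Gaussian case where everything is analytic can be verified numerically:

import numpy as np

# x0 ~ N(0, I), s_t = 1, x_t = x0 + sigma * z  =>  p(x_t) = N(0, (1 + sigma^2) I),
# so the score is -x_t / (1 + sigma^2) and the posterior mean is x_t / (1 + sigma^2).
rng = np.random.default_rng(0)
sigma = 0.7
x_t = rng.standard_normal(5)

score = -x_t / (1.0 + sigma**2)                  # analytic grad log p(x_t)
posterior_mean_tweedie = x_t + sigma**2 * score  # Eq. (32) with s_t = 1
posterior_mean_direct = x_t / (1.0 + sigma**2)   # known Gaussian posterior mean

print(np.allclose(posterior_mean_tweedie, posterior_mean_direct))  # True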

A.2 KRYLOV SUBSPACE METHODS

Consider the problem y = Ax, where x, y ∈ R^N, and we have a square matrix A ∈ R^{N×N}. We define a fixed point iteration
$$x = \hat{B}x + \hat{y}, \qquad \hat{B} := I - D^{-1}A, \quad \hat{y} := D^{-1}y, \qquad (40)$$
where D is some diagonal matrix. One can show that the iteration in (40) converges to the solution x∗ = A^{-1}y if and only if ρ(B̂) < 1, where ρ(·) denotes the spectral radius of the matrix. More specifically, we further have that
$$\limsup_{n\to\infty}\|x_n - x^*\|^{1/n} \le \rho(\hat{B}). \qquad (41)$$
Here, unless we know the solution x∗, we cannot compute the error vector
$$d_n := x_n - x^*. \qquad (42)$$
Instead, however, what we have access to is the residual
$$b_n = y - Ax_n = -Ad_n. \qquad (43)$$
For simplicity, let D = I and thus B̂ = I − A. The residual now becomes
$$b_n = y - Ax_n = \hat{B}x_n + y - x_n = x_{n+1} - x_n. \qquad (44)$$
Then, with (44), the Jacobi iteration reads
$$x_{n+1} := x_n + b_n \qquad (45)$$
$$b_{n+1} := b_n - Ab_n = \hat{B}b_n. \qquad (46)$$
This also implies the following
$$b_n \in \mathrm{Span}(b, Ab, \cdots, A^n b) \qquad (47)$$
$$x_n \in x_0 + \mathrm{Span}(b, Ab, \cdots, A^{n-1}b), \qquad (48)$$
with the latter obtained by shifting the subspace of b_{n−1}. The Krylov subspace is exactly defined by this span, i.e. K_n(A) := Span(b, Ab, · · · , A^n b).
In Krylov subspace methods, the idea is to generate a sequence of approximate solutions xn ∈
x0 + Kn (A) of Ax = y, so that the sequence of the residuals bn ∈ Kn+1 (A) converges to the zero
vector.
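A small NumPy illustration (ours, for intuition only) of the Jacobi-type iteration (45)-(46) and its residual recursion is:

import numpy as np

rng = np.random.default_rng(1)
n = 8

# Diagonally dominant A so that rho(I - A) < 1 and the iteration (45)-(46) converges
A = 0.1 * rng.standard_normal((n, n)) + np.eye(n)
y = rng.standard_normal(n)

x = np.zeros(n)
b = y - A @ x                      # initial residual b_0
B_hat = np.eye(n) - A              # the D = I case

for k in range(5):
    x = x + b                      # Eq. (45): x_{n+1} = x_n + b_n
    b_next = b - A @ b             # Eq. (46): b_{n+1} = b_n - A b_n
    assert np.allclose(b_next, B_hat @ b)   # residual recursion equals applying B_hat
    b = b_next

print("final residual norm:", np.linalg.norm(y - A @ x))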

B PROOFS
Lemma 2 (Total noise). Assuming optimality of ϵθ∗^(t), the total noise w̃t in (13) can be represented by
$$\tilde{w}_t = \sqrt{1-\bar\alpha_{t-1}}\,\tilde\epsilon \qquad (49)$$
for some ϵ̃ ∼ N(0, I). In other words, (23) is equivalently represented by $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_t + \sqrt{1-\bar\alpha_{t-1}}\,\tilde\epsilon$ for some ϵ̃ ∼ N(0, I).

Proof. Given the independence of the estimated ϵ̂t and the stochastic ϵ ∼ N(0, I), along with the Gaussianity, the noise variance in (15) is given as $1-\bar\alpha_{t-1}-\eta^2\tilde\beta_t^2 + \eta^2\tilde\beta_t^2 = 1-\bar\alpha_{t-1}$, recovering a sample from q(xt−1 | x0).
Proposition 1 (Manifold Constrained Gradient). Suppose the clean data manifold M is represented as an affine subspace and assume a uniform distribution on M. Then,
$$\frac{\partial\hat{x}_t}{\partial x_t} = \frac{1}{\sqrt{\bar\alpha_t}}P_{\mathcal{M}} \qquad (19)$$
$$\hat{x}_t - \gamma_t\nabla_{x_t}\ell(\hat{x}_t) = P_{\mathcal{M}}\big(\hat{x}_t - \zeta_t\nabla_{\hat{x}_t}\ell(\hat{x}_t)\big) \qquad (20)$$
for some ζt > 0, where PM denotes the orthogonal projection to M.



Proof. First, we provide a proof for (19). Thanks to the forward sampling $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\tilde\epsilon$, we can obtain the closed-form expression of the likelihood:
$$p(x_t|x_0) = \frac{1}{(2\pi(1-\bar\alpha(t)))^{d/2}}\exp\left(-\frac{\|x_t-\sqrt{\bar\alpha(t)}\,x_0\|^2}{2(1-\bar\alpha(t))}\right), \qquad (50)$$
which is a Gaussian distribution. Using the Bayes' theorem, we have
$$p(x_t) = \int p(x_t|x_0)p(x_0)\,dx_0. \qquad (51)$$

According to the assumption, p(x0) is a uniform distribution on the subspace M. To incorporate this in (51), we first compute p(xt) by modeling p(x0) as the zero-mean Gaussian distribution with the isotropic variance σ² and then take the limit σ → ∞. More specifically, we have
$$p(x_0) = \frac{1}{(2\pi\sigma^2)^{l/2}}\exp\left(-\frac{\|P_{\mathcal{M}}x_0\|^2}{2\sigma^2}\right), \qquad (52)$$
where we use PM x0 = x0 as x0 ∈ M. Therefore, we have
$$p(x_t, x_0) = p(x_t|x_0)p(x_0) = \frac{1}{(2\pi(1-\bar\alpha(t)))^{d/2}}\frac{1}{(2\pi\sigma^2)^{l/2}}\exp\big(-d(x_t,x_0)\big)$$
where
$$\begin{aligned}
d(x_t,x_0) &= \frac{\|P_{\mathcal{M}}^{\perp}x_t\|^2}{2(1-\bar\alpha(t))} + \frac{\|P_{\mathcal{M}}x_0\|^2}{2\sigma^2} + \frac{\|P_{\mathcal{M}}x_t-\sqrt{\bar\alpha(t)}\,P_{\mathcal{M}}x_0\|^2}{2(1-\bar\alpha(t))}\\
&= \frac{\|P_{\mathcal{M}}x_0-\mu_t\|^2}{2s(t)} + \frac{\|P_{\mathcal{M}}^{\perp}x_t\|^2 + c(t)\|P_{\mathcal{M}}x_t\|^2}{2(1-\bar\alpha(t))}
\end{aligned}$$
and
$$s(t) = \left(\frac{1}{\sigma^2}+\frac{\bar\alpha(t)}{1-\bar\alpha(t)}\right)^{-1} \qquad (53)$$
$$\mu_t = \frac{\frac{\bar\alpha(t)}{1-\bar\alpha(t)}}{\frac{1}{\sigma^2}+\frac{\bar\alpha(t)}{1-\bar\alpha(t)}}\,x_t \qquad (54)$$
$$c(t) = \frac{1}{1+\frac{\bar\alpha(t)}{1-\bar\alpha(t)}\sigma^2} \qquad (55)$$

Therefore, after integrating out with respect to x0, we have
$$\log p(x_t) = -\frac{\|P_{\mathcal{M}}^{\perp}x_t\|^2 + c(t)\|P_{\mathcal{M}}x_t\|^2}{2(1-\bar\alpha(t))} + \mathrm{const.}$$
leading to
$$\nabla_{x_t}\log p(x_t) = -\frac{1}{1-\bar\alpha(t)}P_{\mathcal{M}}^{\perp}x_t - \frac{c(t)}{1-\bar\alpha(t)}P_{\mathcal{M}}x_t$$
Furthermore, using (55), for the uniform distribution we have limσ→∞ c(t) = 0. Therefore,
$$\lim_{\sigma\to\infty}\nabla_{x_t}\log p(x_t) = -\frac{1}{1-\bar\alpha(t)}P_{\mathcal{M}}^{\perp}x_t$$
Now, using Tweedie's denoising formula in (14), we have
$$\hat{x}_t = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_{\theta^*}^{(t)}(x_t)\right) \qquad (56)$$
$$\;\;= \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t + (1-\bar\alpha_t)\nabla_{x_t}\log p(x_t)\right) \qquad (57)$$
where we use
$$s_{\theta^*}^{(t)}(x_t) = \nabla_{x_t}\log p(x_t) = -\epsilon_{\theta^*}^{(t)}(x_t)/\sqrt{1-\bar\alpha_t}$$
Accordingly, we have
$$\hat{x}_t = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - P_{\mathcal{M}}^{\perp}x_t\big) = \frac{1}{\sqrt{\bar\alpha_t}}P_{\mathcal{M}}x_t \qquad (58)$$
Therefore, we have
$$\frac{\partial\hat{x}_t}{\partial x_t} = \frac{1}{\sqrt{\bar\alpha_t}}P_{\mathcal{M}}. \qquad (59)$$

Second, we provide a proof for (20). Since x̂t ∈ M, we have x̂t = PM x̂t. Thus, using (59), we have
$$\begin{aligned}
\hat{x}_t - \gamma_t\nabla_{x_t}\ell(\hat{x}_t) &= \hat{x}_t - \gamma_t\frac{\partial\hat{x}_t}{\partial x_t}\nabla_{\hat{x}_t}\ell(\hat{x}_t)\\
&= P_{\mathcal{M}}\hat{x}_t - \frac{\gamma_t}{\sqrt{\bar\alpha_t}}P_{\mathcal{M}}\nabla_{\hat{x}_t}\ell(\hat{x}_t)\\
&= P_{\mathcal{M}}\big(\hat{x}_t - \zeta_t\nabla_{\hat{x}_t}\ell(\hat{x}_t)\big)
\end{aligned}$$
where $\zeta_t := \gamma_t/\sqrt{\bar\alpha_t}$. Q.E.D.

For score functions trained in the context of DDPM (VP-SDE), DDIM sampling of (13) can directly
be used. However, for VE-SDE that was not developed under the variational framework, it is unclear
how to construct a sampler equivalent of DDIM for VP-SDE. As one of our goals is to enable the
direct adoption of pre-trained diffusion models regardless of the framework, our goal is to devise a
fast sampler tailored for VE-SDE. Now, the idea of the decomposition of DDIM steps can be easily
adopted to address this issue.
First, observe that the forward conditional density for VE-SDE can be written as q(xt |x0 ) =
N (xt |x0 , σt2 I), where σt is taken to be a geometric series following Song et al. (2021b). This
directly leads to the following result:
Proposition 2 (VE Decomposition). The following update rule recovers a sample from the marginal
q(xt−1 |x0 ) ∀η ∈ [0, 1].
$$x_{t-1} = \hat{x}_t + \tilde{w}_t \qquad (60)$$
where
$$\hat{x}_t := x_t + \sigma_t^2\hat{s}_t \qquad (61)$$
$$\tilde{w}_t = -\sigma_{t-1}\sigma_t\sqrt{1-\tilde\beta^2\eta^2}\,\hat{s}_t + \sigma_{t-1}\eta\tilde\beta_t\epsilon \qquad (62)$$

Proof. From the equivalent parameterization as in (14), we have the following relations in VE-SDE
$$\hat{s}_t = -\frac{x_t - \hat{x}_t}{\sigma_t^2} \qquad (63)$$
$$\hat\epsilon_t = \frac{x_t - \hat{x}_t}{\sigma_t} = -\sigma_t\hat{s}_t \qquad (64)$$
$$\hat{x}_t = x_t + \sigma_t^2\hat{s}_t. \qquad (65)$$
Plugging in, we have
$$x_{t-1} = \hat{x}_t + \sigma_{t-1}\sqrt{1-\tilde\beta^2\eta^2}\,\hat\epsilon_t + \sigma_{t-1}\eta\tilde\beta\epsilon. \qquad (66)$$
Since the variance can be now computed as
$$\left(\sigma_{t-1}\sqrt{1-\tilde\beta^2\eta^2}\right)^2 + \left(\sigma_{t-1}\eta\tilde\beta\right)^2 = \sigma_{t-1}^2, \qquad (67)$$
we have $x_{t-1} \sim q(x_{t-1}; \hat{x}_t, \sigma_{t-1}^2 I) = q(x_{t-1}|x_0)$ by the assumption.


Again, x̂t arises from Tweedie's formula. With η = 1 and β̃² = 1 − σ²_{t−1}/σ²_t, we recover the original (ancestral) VE-SDE sampling, whereas η = 0 leads to the deterministic variant of VE-SDE, which we call VE-DDIM. We can now use the usual VP-DDIM (13) or VE-DDIM (60), depending on the training strategy of the pre-trained score function.
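For concreteness, one VE-DDIM update of Proposition 2 can be written in a few lines. The following is a minimal PyTorch sketch of ours, not taken from any released implementation; score_fn is a placeholder for the pre-trained score network sθ∗, and the sigmas are assumed to be Python floats.

import math
import torch

def ve_ddim_step(x_t, sigma_t, sigma_tm1, score_fn, eta=0.0):
    # Tweedie denoising: x_hat_t = x_t + sigma_t^2 * s_hat_t   (Eq. (61))
    s_hat = score_fn(x_t, sigma_t)
    x_hat = x_t + sigma_t ** 2 * s_hat
    # With beta_tilde^2 = 1 - sigma_{t-1}^2 / sigma_t^2, eta = 1 recovers ancestral VE sampling,
    # while eta = 0 gives the deterministic VE-DDIM update.
    beta_tilde = math.sqrt(1.0 - sigma_tm1 ** 2 / sigma_t ** 2)
    # Noise term w_tilde_t of Eq. (62)
    w = (-sigma_tm1 * sigma_t * math.sqrt(1.0 - beta_tilde ** 2 * eta ** 2) * s_hat
         + sigma_tm1 * eta * beta_tilde * torch.randn_like(x_t))
    return x_hat + w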
Here, we present our main geometric observations in the VE context, which are analogous to
Proposition 1 and Lemma 2. Their proofs are straightforward corollaries, and hence, are omitted.
Proposition 3 (VE-DDIM Decomposition). Under the same assumption as Proposition 1, we have

\frac{\partial \hat{x}_t}{\partial x_t} = P_M    (68)
\hat{x}_t - \gamma_t \nabla_{x_t} \ell(\hat{x}_t) = P_M \left( \hat{x}_t - \gamma_t \nabla_{\hat{x}_t} \ell(\hat{x}_t) \right),    (69)
where PM denotes the orthogonal projection to the subspace M.

Accordingly, our DDS algorithm for VE-DDIM is as follows:

x_{t-1} = \hat{x}'_t + \tilde{w}_t,    (70)
\hat{x}'_t = \text{CG}(A^*A, A^*y, \hat{x}_t, M), \quad M \le l,    (71)

where CG(·) denotes the M-step CG for the normal equation, initialized at x̂t.

C A LGORITHMS

In the following tables, we list all the DDS algorithms used throughout the manuscript. For simplicity, we define CG(A, y, x, M) to denote M conjugate gradient steps for the linear system with operator A and right-hand side y, initialized at x. For completeness, we include pseudo-code of the CG method in Algorithm 1, which is used throughout this work.

Algorithm 1 Conjugate Gradient (CG)

Require: A, y, x0, M
 1: r0 ← y − Ax0
 2: p0 ← r0
 3: for k = 0 : M − 1 do
 4:   αk ← (rk⊤ rk)/(pk⊤ A pk)
 5:   xk+1 ← xk + αk pk
 6:   rk+1 ← rk − αk A pk
 7:   βk ← (rk+1⊤ rk+1)/(rk⊤ rk)
 8:   pk+1 ← rk+1 + βk pk
 9: end for
10: return xM
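The pseudo-code above translates almost line-for-line into a matrix-free implementation. The sketch below is our own illustration; it only assumes that the system operator is given as a callable computing matrix-vector products (e.g., A∗A or I + γA∗A in the data-consistency steps of the algorithms below), and real-valued tensors (for complex MR data, the inner products would additionally use conjugation).

import torch

def cg(A, y, x0, n_iter=5, eps=1e-10):
    # Minimal conjugate gradient for A x = y, with A a callable returning A @ x.
    # x0 is the initialization (e.g., the Tweedie denoised estimate in DDS),
    # and n_iter corresponds to M in Algorithm 1.
    x = x0.clone()
    r = y - A(x)
    p = r.clone()
    rs_old = torch.sum(r * r)
    for _ in range(n_iter):
        Ap = A(p)
        alpha = rs_old / (torch.sum(p * Ap) + eps)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = torch.sum(r * r)
        p = r + (rs_new / (rs_old + eps)) * p
        rs_old = rs_new
    return x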

D D ISCUSSION ON C ONDITIONING

Projection type Methods that belong to this class aim to directly replace1 the range-space component of the intermediate noisy sample xt with information from the measurement y. Two representative works that utilize projection are Chung & Ye (2022) and Song et al. (2022). In Chung & Ye (2022), we use

xt = (I − A†A)x′t + A†y,    (72)

where the information from y fills in the range space of A†. However, this is clearly problematic from the geometric viewpoint, as the sample may fall off the noisy manifold.
1 Both hard (Chung & Ye, 2022) or soft (Song et al., 2022) constraints can be utilized.
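As a toy illustration of (72), the snippet below applies projection-type data consistency with an explicit pseudo-inverse on a small dense operator. It is our own example and only a stand-in: actual MRI/CT operators are matrix-free, so A† would be realized with FFTs or iterative solvers rather than torch.linalg.pinv.

import torch

def projection_dc(x_prime, A, y):
    # Hard data consistency of Eq. (72): x = (I - A^+ A) x' + A^+ y
    # A: (m, n) dense measurement matrix (toy stand-in for the imaging operator)
    A_pinv = torch.linalg.pinv(A)                 # A^+
    return x_prime - A_pinv @ (A @ x_prime) + A_pinv @ y

# Toy usage on an undersampled linear system
A = torch.randn(32, 64)
x_true = torch.randn(64)
y = A @ x_true
x = projection_dc(torch.randn(64), A, y)
print(torch.allclose(A @ x, y, atol=1e-4))        # the measurement is satisfied exactly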


Algorithm 2 DDS (PI MRI; VP; noiseless)

Require: ϵθ∗, N, {αt}Nt=1, η, A, M
 1: xN ∼ N(0, I)
 2: for t = N : 2 do
 3:   ϵ̂t ← ϵθ∗(xt)
 4:   ▷ Tweedie denoising
 5:   x̂t ← (xt − √(1 − ᾱt) ϵ̂t)/√ᾱt
 6:   ▷ Data consistency
 7:   x̂′t ← CG(A∗A, A∗y, x̂t, M)
 8:   ϵ ∼ N(0, I)
 9:   ▷ DDIM sampling
10:   xt−1 ← √ᾱt−1 x̂′t + √(1 − ᾱt−1 − η²β̃t²) ϵ̂t + ηβ̃t ϵ
11: end for
12: x0 ← (x1 − √(1 − ᾱ1) ϵθ∗(x1))/√ᾱ1
13: return x0
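For reference, one loop iteration of Algorithm 2 can be sketched as follows. The function and variable names (eps_model, A, At, cg) are placeholders that we introduce for illustration, not part of a released codebase; cg refers to the helper sketched earlier, and the scalars ᾱt, ᾱt−1, β̃t, η are assumed to be given for the current step.

import torch

def dds_vp_step(x_t, t, eps_model, A, At, y, alpha_bar, alpha_bar_prev,
                beta_tilde, eta, cg_iter=5):
    # Tweedie denoising (line 5 of Algorithm 2)
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar) ** 0.5 * eps_hat) / alpha_bar ** 0.5
    # Data consistency: M-step CG on the normal equation A*A x = A*y,
    # initialized at the denoised estimate (line 7)
    x0_hat = cg(lambda v: At(A(v)), At(y), x0_hat, n_iter=cg_iter)
    # DDIM update (line 10)
    noise = torch.randn_like(x_t)
    coeff = (1 - alpha_bar_prev - (eta * beta_tilde) ** 2) ** 0.5
    return alpha_bar_prev ** 0.5 * x0_hat + coeff * eps_hat + eta * beta_tilde * noise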

Algorithm 3 DDS (PI MRI; VP; noisy)

Require: ϵθ∗, N, {αt}Nt=1, η, A, M, γ
 1: xN ∼ N(0, I)
 2: for t = N : 2 do
 3:   ϵ̂t ← ϵθ∗(xt)
 4:   ▷ Tweedie denoising
 5:   x̂t ← (xt − √(1 − ᾱt) ϵ̂t)/√ᾱt
 6:   ▷ Data consistency
 7:   ACG ← I + γA∗A
 8:   yCG ← x̂t + γA∗y
 9:   x̂′t ← CG(ACG, yCG, x̂t, M)
10:   ϵ ∼ N(0, I)
11:   ▷ DDIM sampling
12:   xt−1 ← √ᾱt−1 x̂′t + √(1 − ᾱt−1 − η²β̃t²) ϵ̂t + ηβ̃t ϵ
13: end for
14: x0 ← (x1 − √(1 − ᾱ1) ϵθ∗(x1))/√ᾱ1
15: return x0
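For noisy measurements, lines 7–9 of Algorithm 3 solve the regularized normal equation (I + γA∗A)x = x̂t + γA∗y, i.e., the proximal problem min_x ½∥x − x̂t∥² + (γ/2)∥Ax − y∥², with a few CG steps. A sketch of how the CG operator can be wrapped is given below; as before, A, At, and cg are our placeholder callables, and γ must be chosen by the user.

def noisy_dc(x0_hat, A, At, y, gamma, cg_iter=5):
    # (I + gamma * A^*A) x = x0_hat + gamma * A^* y, solved with a few CG steps
    A_cg = lambda v: v + gamma * At(A(v))
    b_cg = x0_hat + gamma * At(y)
    return cg(A_cg, b_cg, x0_hat, n_iter=cg_iter)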

Gradient type From the Bayesian reconstruction perspective, it is natural to incorporate the gradient of the likelihood as ∇ log p(xt|y) = ∇ log p(xt) + ∇ log p(y|xt). While one can use ∇ log p(xt) ≃ sθ∗(xt), the term ∇ log p(y|xt) is intractable for all t ≠ 0 (Chung et al., 2023a), so one has to resort to approximations. In a similar spirit to Score-MRI (Chung & Ye, 2022), one can simply use gradient steps of the form

xt = x′t + ξt A∗(y − Ax′t),    (73)

where ξt is the step size (Jalal et al., 2021). Nevertheless, p(y|xt) is far from Gaussian as t gets further away from 0, and hence it is hard to interpret or analyze where the gradient steps in the direction of A∗(y − Ax′t) lead. Albeit not in the context of MRI, a more recent approach (Chung et al., 2023a) proposes to use

xt = x′t − ξt ∇x_{t+1} ∥y − Ax̂_{t+1}∥²₂.    (74)

As x̂_{t+1} is the Tweedie denoised estimate and is free from Gaussian noise, (74) can be thought of as minimizing the residual on the noiseless data manifold M. However, care must be taken, since taking the gradient with respect to x_{t+1} corresponds to computing automatic differentiation through the neural net sθ∗, often slowing down the compute by about ×2 (Chung et al., 2023a).
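A sketch of the gradient-type update of (74) with automatic differentiation is given below. It is our own schematic in a real-valued toy setting, with eps_model, A, and the step size ξ as placeholders; the extra backward pass through the network is what causes the roughly ×2 slowdown mentioned above.

import torch

def dps_correction(x_next, eps_model, t, A, y, alpha_bar, xi=1.0):
    # Gradient of the data-fidelity term in Eq. (74), computed by differentiating
    # through the Tweedie denoiser (and hence through eps_model).
    x_next = x_next.detach().requires_grad_(True)
    eps_hat = eps_model(x_next, t)
    x0_hat = (x_next - (1 - alpha_bar) ** 0.5 * eps_hat) / alpha_bar ** 0.5
    residual = torch.sum((y - A(x0_hat)) ** 2)
    grad, = torch.autograd.grad(residual, x_next)   # backprop through eps_model
    return xi * grad    # to be subtracted from the unconditional update x'_t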

T2I vs. DIS For the former, the likelihood is usually given as a neural net-parameterized function
pϕ (y|x) (e.g. classifier, implicit gradient from CFG), whereas for the latter, the likelihood is given as
an analytic distribution arising from some linear/non-linear forward operator (Kawar et al., 2022;
Chung et al., 2023a).


Algorithm 4 DDS (3D CT recon; VP)

Require: ϵθ∗, N, {αt}Nt=1, η, A, M, {λt}Nt=1, ρ
 1: xN ∼ N(0, I)
 2: zN+1 ← torch.zeros_like(xN)
 3: wN+1 ← torch.zeros_like(xN)
 4: for t = N : 2 do
 5:   ϵ̂t ← ϵθ∗(xt)
 6:   ▷ Tweedie denoising
 7:   x̂t ← (xt − √(1 − ᾱt) ϵ̂t)/√ᾱt
 8:   ▷ Data consistency
 9:   ACG ← ATA + ρDzTDz
10:   bCG ← ATy + ρDzT(zt+1 − wt+1)
11:   x̂′t ← CG(ACG, bCG, x̂t, M)
12:   zt ← Sλt/ρ(Dz x̂′t + wt+1)
13:   wt ← wt+1 + Dz x̂′t − zt
14:   ϵ ∼ N(0, I)
15:   ▷ DDIM sampling
16:   xt−1 ← √ᾱt−1 x̂′t + √(1 − ᾱt−1 − η²β̃t²) ϵ̂t + ηβ̃t ϵ
17: end for
18: x0 ← (x1 − √(1 − ᾱ1) ϵθ∗(x1))/√ᾱ1
19: return x0

Algorithm 5 DDS (PI MRI; VE)

Require: sθ∗, N, {σt}Nt=1, η, A, M
 1: xN ∼ N(0, σN² I)
 2: for t = N : 2 do
 3:   ŝt ← sθ∗(xt)
 4:   ▷ Tweedie denoising
 5:   x̂t ← xt + σt² ŝt
 6:   ▷ Data consistency
 7:   x̂′t ← CG(A∗A, A∗y, x̂t, M)
 8:   ϵ ∼ N(0, I)
 9:   ▷ DDIM sampling
10:   xt−1 ← x̂′t − σt−1σt√(1 − β̃²η²) ŝt + σt−1ηβ̃ ϵ
11: end for
12: x0 ← x1 + σ1² sθ∗(x1)
13: return x0

E A LGORITHMIC DETAILS

Algorithms 2 and 4 provide the VP counterparts of the DDS VE algorithms presented in Algorithms 5 and 6, respectively. The only differences are that the model is now parameterized with ϵθ, which estimates the residual noise component, and that a different noise schedule is used.

E.1 ACCELERATED MRI

As stated in section 4.1, using ≥ 200 NFE with Algorithm 5 degrades the performance due to numerical instability. To counteract this issue, we run the same iteration only down to t = N/50 =: k, and directly acquire the final reconstruction with Tweedie's formula:

x0 = xk + σk² sθ∗(xk).    (75)


Algorithm 6 DDS (3D CT recon; VE)

Require: sθ∗, N, {σt}Nt=1, η, A, M, {λt}Nt=1, ρ
 1: xN ∼ N(0, σN² I)
 2: zN+1 ← torch.zeros_like(xN)
 3: wN+1 ← torch.zeros_like(xN)
 4: for t = N : 2 do
 5:   ŝt ← sθ∗(xt)
 6:   ▷ Tweedie denoising
 7:   x̂t ← xt + σt² ŝt
 8:   ▷ Data consistency
 9:   ACG ← ATA + ρDzTDz
10:   bCG ← ATy + ρDzT(zt+1 − wt+1)
11:   x̂′t ← CG(ACG, bCG, x̂t, M)
12:   zt ← Sλt/ρ(Dz x̂′t + wt+1)
13:   wt ← wt+1 + Dz x̂′t − zt
14:   ϵ ∼ N(0, I)
15:   ▷ DDIM sampling
16:   xt−1 ← x̂′t − σt−1σt√(1 − β̃²η²) ŝt + σt−1ηβ̃ ϵ
17: end for
18: x0 ← x1 + σ1² sθ∗(x1)
19: return x0

Figure 3: Illustration of the imaging forward model used in this work. (a) 3D sparse-view CT:
the forward matrix A transforms the 3D voxel space into 2D projections. (b) Multi-coil CS-MRI:
the forward matrix A first applies Fourier transform to turn the image into k-space. Subsequently,
sensitivity maps are applied as element-wise product to achieve multi-coil measurements. Finally, the
multi-coil measurements are sub-sampled with the masks.

E.2 3D CT RECONSTRUCTION

Recall that to perform 3D reconstruction, we resort to the following optimization problem (omitting the indices for simplicity):

\hat{x}^* = \arg\min_{\hat{x}} \frac{1}{2}\|A\hat{x} - y\|_2^2 + \lambda\|D_z\hat{x}\|_1.    (76)


One can utilize ADMM to stably solve the above problem. Here, we include the steps to arrive at Algorithms 6 and 4 for completeness. Reformulating into a constrained optimization problem, we have

\min_{\hat{x}, z} \frac{1}{2}\|y - A\hat{x}\|_2^2 + \lambda\|z\|_1    (77)
\text{s.t.}\quad z = D_z\hat{x}.    (78)

Then, ADMM can be implemented as separate update steps for the primal and the dual variables:

\hat{x}_{j+1} = \arg\min_{\hat{x}} \frac{1}{2}\|y - A\hat{x}\|_2^2 + \frac{\rho}{2}\|D_z\hat{x} - z_j + w_j\|_2^2    (79)
z_{j+1} = \arg\min_{z} \lambda\|z\|_1 + \frac{\rho}{2}\|D_z\hat{x}_{j+1} - z + w_j\|_2^2    (80)
w_{j+1} = w_j + D_z\hat{x}_{j+1} - z_{j+1}.    (81)
We have a closed-form solution for (79):

\hat{x}_{j+1} = (A^T A + \rho D_z^T D_z)^{-1}(A^T y + \rho D_z^T(z_j - w_j)),    (82)

which can be numerically solved by iterative CG:

\hat{x}_{j+1} = \text{CG}(A_{CG}, b_{CG}, \hat{x}_j, M)    (83)
A_{CG} := A^T A + \rho D_z^T D_z    (84)
b_{CG} := A^T y + \rho D_z^T(z_j - w_j).    (85)

Moreover, as (80) is in the form of a proximal mapping (Parikh & Boyd, 2014), we have

z_{j+1} = S_{\lambda/\rho}(D_z\hat{x}_{j+1} + w_j).    (86)
Thus, we have the following update steps
x̂j+1 = CG(ACG , bCG , x̂j , M )
z j+1 = Sλ/ρ (D z x̂j+1 + wj )
wj+1 = wj + D z x̂j+1 − z j+1 .
Usually, such an update step is performed in an iterative fashion until convergence. However, in
our algorithm for 3D reconstruction (Algorithm 6, 4), we only use a single iteration of ADMM per
each step of denoising. This is possible as we share the primal variable z and the dual variable w as
global variables throughout the update steps, leading to proper convergence with a single iteration
(fast variable sharing technique (Chung et al., 2023b)).
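For completeness, a minimal sketch of this single ADMM-TV iteration (lines 9–13 of Algorithms 4 and 6) is given below. Dz/Dzt are simple finite differences along the slice dimension, soft is the soft-thresholding operator Sλ/ρ, and cg is the helper from Appendix C; all names are our own placeholders, A and At are assumed to be matrix-free callables, and λ, ρ must be supplied by the user.

import torch

def Dz(x):
    # forward difference along the slice (z) axis, assuming x has shape (D, H, W)
    return torch.cat([x[1:] - x[:-1], torch.zeros_like(x[:1])], dim=0)

def Dzt(v):
    # adjoint of the forward difference above
    out = torch.zeros_like(v)
    out[0] = -v[0]
    out[1:-1] = v[:-2] - v[1:-1]
    out[-1] = v[-2]
    return out

def soft(x, thr):
    # soft-thresholding S_thr(x)
    return torch.sign(x) * torch.clamp(x.abs() - thr, min=0.0)

def admm_tv_iteration(x0_hat, A, At, y, z, w, lam, rho, cg_iter=5):
    # x-update: CG on (A^T A + rho Dz^T Dz) x = A^T y + rho Dz^T (z - w)
    A_cg = lambda v: At(A(v)) + rho * Dzt(Dz(v))
    b_cg = At(y) + rho * Dzt(z - w)
    x = cg(A_cg, b_cg, x0_hat, n_iter=cg_iter)
    # z- and w-updates, shared as global variables across denoising steps
    z = soft(Dz(x) + w, lam / rho)
    w = w + Dz(x) - z
    return x, z, w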

Stabilizing Algorithm 6 As stated in section 4.2, in order to stabilize Algorithm 6, we found it beneficial to first impose data consistency with simple CG update steps without the TV prior, as used in our MRI experiments. We iterate our solver from t = N down to N/2 using standard CG updates, and then switch to the ADMM-TV scheme from t = N/2 − 1 down to 2.

F A DDITIONAL E XPERIMENTS
F.1 N OISE OFFSET

For this experiment, we take 50 random proton density (PD) weighted images from the fastMRI validation dataset, and add Gaussian noise of level σGT = 7.00 [×10−2] to the images. For each noisy image, we apply the following DC step for each method:

1. Score-MRI (Chung & Ye, 2022): (72)

2. Jalal et al. (Jalal et al., 2021): (73) with the step size used in the official implementation2
3. DPS (Chung et al., 2023a): (74) with step size 1.0
4. DDNM (Wang et al., 2023): (25)
5. DDS (CG): CG applied to the Tweedie denoised estimate, M = 5.

Figure 4: DDS reconstruction of CS-MRI on a radial sampling trajectory. Col 1: sampling trajectory, 2: zero-filled reconstruction + density compensation, 3: DDS (99 NFE), 4: ground truth.

Once the update step is performed, the Gaussian noise level of the updated samples is estimated with
the method from Chen et al. (2015). Note that as the estimation method is imperfect, there is already
a gap between the ground truth noise level and the estimated noise level.

Method            No process   Score-MRI   Jalal et al.   DPS     DDNM    DDS (CG)
σest              7.556        5.959       6.303          8.527   8.256   7.859
|σest − σest^np|  0.000        1.597       1.253          0.917   0.700   0.303

Table 5: Noise offset experiment. Gaussian noise level estimated with Chen et al. (2015). Real noise level: σGT = 7.00 [×10−2]; σest^np = 7.56 [×10−2].

F.2 N ON -C ARTESIAN MRI TRAJECTORIES

To put an emphasis on the fact that the proposed method is agnostic to the forward imaging model
at hand, for the first time in DIS literature, we conduct MRI reconstruction experiments on the
non-Cartesian trajectory, which involves non-uniform Fast Fourier Transform (NUFFT) (Fessler &
Sutton, 2003). In Fig. 4 we show that DDS is able to reconstruct high-fidelity images from radial
trajectory sampling, even under aggressive accelerations.

F.3 S TOCHASTICITY IN SAMPLING

For our sampling strategy, the stochasticity of the sampling process is determined by the parameter η ∈ [0, 1]. When η → 0, the sampling becomes deterministic, whereas when η → 1, the sampling becomes maximally stochastic. It is known that for unconditional sampling with a low NFE, setting η close to 0 leads to better performance (Song et al., 2021a). In Fig. 6, we see a similar trend when we set NFE = 20. However, for NFE ≥ 50, we observe that the choice of η does not matter much, as the performance fluctuates within a range that can just as well be attributed to the inherent stochasticity of the sampling procedure. This differs from the observation made in Song et al. (2021a), and can be attributed to the conditional sampling strategy that the proposed method uses.

F.4 I NSTABILITY IN VE PARAMETERIZATION

NFE                              4000    500     200     100     50      30
Score-MRI (Chung & Ye, 2022)     33.25   33.19   33.13   31.67   3.015   3.239
Ours                             32.07   31.16   31.99   33.69   31.79   30.40

Table 6: PSNR [dB] of uniform 1D ×4 acc. reconstruction with varying NFEs.

One observation made in this experiment, however, is that using ≥ 200 NFEs for the proposed method degrades the performance. We find that this degradation stems from the numerical pathologies that arise when VE-SDE is combined with parameterizing the neural network as sθ. Specifically, the score function is parameterized to estimate sθ∗(xt) ≃ −(xt − x0)/σt², which is on the order of ϵ/σt. Near t = 0, σt attains a

2 [Link]


Figure 5: Evolution of the reconstruction error through time. ±1.0σ plot. (a) VE parameterized with
sθ , (b) VP parameterized with ϵθ , (c) Visualization of x̂t .

Figure 6: Ablation study on the selection of η for Algorithm 2 (PSNR [dB] versus η ∈ [0, 1] for NFE = 20, 50, 100).

                                        Axial*            Coronal           Sagittal
Method                                  PSNR↑   SSIM↑     PSNR↑   SSIM↑     PSNR↑   SSIM↑
DDS VP (49)                             35.07   0.963     32.90   0.955     29.34   0.861
DiffusionMBIR (Chung et al., 2023b)     34.92   0.956     32.48   0.947     28.82   0.832
MCG (Chung et al., 2022a)               26.01   0.838     24.55   0.823     21.59   0.706
Song et al. (Song et al., 2022)         27.80   0.852     25.69   0.869     22.03   0.735
DPS (Chung et al., 2023a)               25.32   0.829     24.80   0.818     21.50   0.698
Lahiri et al. (Lahiri et al., 2022)     28.08   0.931     26.02   0.856     23.24   0.812
Zhang et al. (Zhang et al., 2017)       26.76   0.879     25.77   0.874     22.92   0.841
ADMM-TV                                 23.19   0.793     22.96   0.758     19.95   0.782

Table 7: Quantitative evaluation of LA-CT (90°) on the AAPM 256×256 test set. Bold: Best.

8-view 4-view 2-view


Axial∗ Coronal Sagittal Axial∗ Coronal Sagittal Axial∗ Coronal Sagittal
Method PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑
DDS VP (19) 1.38 0.05 1.51 0.06 1.76 0.06 1.96 0.07 1.26 0.10 2.73 0.11 2.01 0.08 2.89 0.10 2.01 0.21
DDS VP (49) 1.96 0.08 1.93 0.10 1.34 0.05 2.01 0.12 2.10 0.11 1.96 0.10 2.90 0.23 2.08 0.17 2.78 0.19
DDS VE (99) 1.75 0.09 1.47 0.11 1.55 0.10 1.98 0.13 2.22 0.07 1.80 0.15 2.52 0.19 2.90 0.17 2.75 0.20
DiffusionMBIR (Chung et al., 2023b) (4000) 1.50 0.09 1.37 0.07 1.93 0.08 1.83 0.13 1.70 0.15 1.28 0.10 1.58 0.13 2.37 0.15 2.58 0.20
DPS (Chung et al., 2023a) (1000) 1.95 0.19 2.05 0.14 2.21 0.17 1.71 0.16 2.33 0.17 2.00 0.12 3.01 0.21 3.29 0.19 3.57 0.29
Score-Med (Song et al., 2022) (4000) 1.68 0.15 1.57 0.18 1.88 0.24 2.14 0.13 2.40 0.15 3.07 0.22 2.69 0.29 2.50 0.19 3.22 0.19
MCG (Chung et al., 2022a) (4000) 1.55 0.15 1.60 0.14 1.61 0.14 1.93 0.20 1.70 0.18 1.88 0.17 2.13 0.16 2.66 0.19 2.78 0.27
Lahiri et al. (2022) 1.28 0.12 1.52 0.10 2.08 0.13 1.30 0.17 1.57 0.10 1.44 0.12 2.40 0.15 2.31 0.22 3.07 0.24
FBPConvNet (Jin et al., 2017) 2.05 0.24 3.74 0.15 3.45 0.29 3.67 0.31 3.51 0.30 3.33 0.27 3.17 0.25 3.04 0.31 4.07 0.22
ADMM-TV 2.73 0.16 2.57 0.18 2.81 0.16 2.90 0.25 2.88 0.16 3.44 0.25 3.01 0.22 2.92 0.19 2.70 0.16

Table 8: Standard deviation of the quantitative metrics presented in Table. 4, which correspond to the
results for sparse-view CT reconstruction. Mean values in Tab. 4. (Numbers in parenthesis): NFE.

very small value, e.g., σ0 = 0.01 (Kim et al., 2022), meaning that the score function has to approximate relatively large values in this regime, leading to numerical instability. This phenomenon is further illustrated in Fig. 5 (a), where the reconstruction (i.e., denoising) error has a rather odd trend of jumping up and down, and completely diverges as t → 0. This may be less of an issue for samplers such as PC, where x̂t is not directly used, but it becomes a much bigger problem when


the proposed sampler is used. In fact, for NFE > 200, we find that simply truncating the last few evolutions is necessary to yield the results reported in Tab. 6 (see Appendix E.1 for details). Such instabilities worsened when we tried scaling our experiments to complex-valued PI reconstruction, since the network was trained only on magnitude images. On the other hand, the reconstruction errors for VP models trained with epsilon matching show a much more stable evolution of the denoised estimates, suggesting that VP is indeed a better fit for the proposed methodology. Hence, all experiments reported hereafter use a network parameterized with ϵθ trained within the VP framework, and stack the real/imaginary parts in the channel dimension to account for the complex-valued nature of MR imagery and to avoid using ×2 NFE for a single level of denoising.
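The real/imaginary stacking mentioned above simply views a complex-valued MVUE image as a two-channel real tensor. A minimal sketch (our own illustration, not code from the released models):

import torch

def complex_to_channels(x):
    # (H, W) complex tensor -> (2, H, W) real tensor fed to the epsilon-matching network
    return torch.stack([x.real, x.imag], dim=0)

def channels_to_complex(x):
    # inverse mapping, applied before the data-consistency (CG) step in k-space
    return torch.complex(x[0], x[1])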

G E XPERIMENTAL D ETAILS

G.1 DATASETS

fastMRI knee. We conduct all PI experiments with fastMRI knee dataset (Zbontar et al., 2018).
Following the practices from Chung & Ye (2022), among the 973 volumes of training data, we drop
the first/last five slices from each volume. As the test set, we select 10 volumes from the validation
dataset, which consists of 281 2D slices. While Chung et al. (2022b) used DICOM magnitude
images to train the network, we used the minimum-variance unbiased estimates (MVUE) (Jalal
et al., 2021) complex-valued data by stacking the real and imaginary parts of the images into the
channel dimension. When performing inference, we pre-compute the coil sensitivity maps with
ESPiRiT (Uecker et al., 2014).
AAPM. AAPM 2016 CT low-dose grand challenge data leveraged in Chung et al. (2022a; 2023b)
is used. From the filtered backprojection (FBP) reconstruction of size 512 × 512, we resize all the
images to have the size 256 × 256 in the axial dimension. We use the same 1 volume of testing data
that was used in Chung et al. (2023b). This corresponds to 448 axial slices, 256 coronal, and 256
sagittal slices in total. To generate sparse-view and limited angle measurements, we use the parallel
view geometry for simplicity, with the torch-radon (Ronchetti, 2020) package.

G.2 N ETWORK TRAINING

VE models for both MRI/CT are taken from the official github repositories (Chung & Ye, 2022; Chung
et al., 2022a), which are both based on ncsnpp model of Score-SDE (Song et al., 2021b). In order to
train our VP epsilon matching models, we take the U-Net implementation from ADM (Dhariwal &
Nichol, 2021), and train each model for 1M iterations with the batch size of 4, initial learning rate of
1e − 4 on a single RTX 3090 GPU. Training took about 3 days for each task.

G.3 R EJECTION SAMPLING

When running Algorithm 2 in the low-NFE regime (e.g., 19, 49), we see that our method sometimes yields infeasible reconstructions. Sampling 100 reconstructions for 1D uniform ×4 acc., 5% of the samples were infeasible for 19 NFE, and 3% were infeasible for 49 NFE. In such cases, we simply compute the Euclidean norm of the residual ∥y − Ax̂∥ for the reconstructed sample x̂, and reject the sample if the residual exceeds some threshold value τ. Even when we account for the additional time cost of re-sampling the rejected reconstructions, we still achieve dramatic acceleration compared to previous methods (Song et al., 2022; Chung & Ye, 2022).
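In code, the rejection criterion is a thin wrapper around any of the samplers above; the sketch below is schematic, with sample_dds, A, and the threshold τ as placeholders of our own.

import torch

def sample_with_rejection(sample_dds, A, y, tau, max_tries=10):
    # Re-run the sampler until the measurement residual ||y - A x_hat|| falls below tau
    for _ in range(max_tries):
        x_hat = sample_dds(y)
        if torch.linalg.norm(y - A(x_hat)) <= tau:
            return x_hat
    return x_hat  # fall back to the last sample if all tries are rejected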

G.4 C OMPARISON METHODS

G.4.1 ACCELERATED MRI

Score-MRI (Chung & Ye, 2022) We use the official pre-trained model3 with 2000 PC sampling.
Note that for PI, this amounts to running the sampling procedure per coil. When reducing the number
of NFE presented in Fig. 1, we use linear discretization with wider bins.
3 [Link]


Jalal et al. (2021). As the official pre-trained model is trained on fastMRI brain MVUE images, in
order to perform fair comparison, we train the NCSN v2 model with the same fastMRI knee MVUE
images for 1M iterations in the setting identical to when training the model for the proposed method.
For inference, we follow the default annealing step sizes as proposed in the original paper. We use 3
Langevin dynamics steps per noise scale for 700 discretizations, which amounts to a total of 2100
NFE. When reducing the number of NFE presented in Fig. 1, we keep the 3 Langevin steps intact,
and use linear discretization with wider bins.
E2E-Varnet (Sriram et al., 2020), U-Net. We train the supervised learning-based methods with
Gaussian 1D subsampling as performed in Chung & Ye (2022), adhering to the official implementation
and the default settings of the original work.
TV. We use the implementation in [Link].TotalVariation4 , with calibrated sensitivity
maps with ESPiRiT (Uecker et al., 2014). The parameters for the optimizer are found via grid search
on 50 validation images.

G.4.2 3D CT RECONSTRUCTION
DiffusionMBIR (Chung et al., 2023b), MCG (Chung et al., 2022a). Both methods use the same
score function as provided in the official repository5 . For both DiffusionMBIR and Score-CT, we
use 2000 PC sampler (4000 NFE). For DiffusionMBIR, we set λ = 10.0, ρ = 0.04, which is the
advised parameter setting for the AAPM 256×256 dataset. For Chung et al., we use the iterating
ART projections as used in the comparison study in Chung et al. (2023b).
Lahiri et al. (2022). We use two stages of 3D U-Net based CNN architectures. For each greedy
optimization process, we train the network for 300 epochs. For CG, we use 30 iterations at each stage.
Networks were trained with the Adam optimizer with a static learning rate of 1e − 4, batch size of 2.
FBPConvNet Jin et al. (2017), Zhang et al. (2017). Both methods utilize the same network
architecture and the same training strategy, only differing in the application (SV, LA). We use a
standard 2D U-Net architecture and train the models with 300 epochs. Networks were trained with
the Adam optimizer with a learning rate of 1e − 4, batch size 8.
ADMM-TV. Following the protocol of Chung et al. (2023b), we optimize the following objective

x^* = \arg\min_x \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|Dx\|_{2,1},    (87)

with D := [Dx, Dy, Dz], which is solved with standard ADMM and CG. Hyper-parameters are set identically to Chung et al. (2023b).

H Q UALITATIVE RESULTS

4 [Link]
5 [Link]



Figure 7: Comparison of parallel imaging reconstruction results. (a) subsampling mask (1st row:
uniform1D×4, 2nd row: Gaussian1D×8, 3rd row: Gaussian2D×8, 4th row: variable density
poisson disc ×8), (b) U-Net (Zbontar et al., 2018), (c) E2E-VarNet (Sriram et al., 2020), (d) Score-
MRI (Chung & Ye, 2022) (4000 × 2 × c NFE), (e) DDS (49 NFE), (f) ground truth.



Figure 8: Comparison of noisy (σ = 0.05) parallel imaging reconstruction results. (a) subsampling mask (1st row: uniform1D×4, 2nd row: Gaussian1D×4, 3rd row: Gaussian2D×8, 4th row: variable density poisson disc ×8), (b) Zero-filled, (c) DPS (Chung et al., 2023a) (50 NFE), (d) DPS (Chung et al., 2023a) (1000 NFE), (e) DDS (49 NFE), (f) ground truth. Numbers in the top right corners denote PSNR and SSIM, respectively.



Figure 9: Comparison of 3D CT reconstruction results. (Top): 8-view SV-CT, (Bottom): 90◦ LA-CT.
(a) FBP, (b) Lahiri et al., (c) Chung et al., (d) DiffusionMBIR (4000 NFE), (e) DDS (49 NFE), (f)
ground truth. Numbers in top right corners denote PSNR and SSIM, respectively.
