Decomposed Diffusion Sampler For Accelerating Large Scale Inverse Problems
ABSTRACT
1 INTRODUCTION
Diffusion models (Ho et al., 2020; Song et al., 2021b) are state-of-the-art generative models that learn to generate data by gradual denoising, starting from a reference distribution (e.g. Gaussian).
In addition to superior sample quality in the unconditional sampling of the data distribution, it was
shown that one can use unconditional diffusion models as generative priors that model the data
distribution (Chung et al., 2022a; Kawar et al., 2022), and incorporate the information from the
forward physics model along with the measurement y to sample from the posterior distribution
p(x|y). This property is especially intriguing when seen in the context of Bayesian inference, as we
can separate the parameterized prior pθ (x) from the measurement model (i.e. likelihood) p(y|x) to
construct the posterior pθ (x|y) ∝ pθ (x)p(y|x). In other words, one can use the same pre-trained
neural network model regardless of the forward model at hand. Throughout the manuscript, we refer
to this class of methods as Diffusion model-based Inverse problem Solvers (DIS).
For inverse problems in medical imaging, e.g. magnetic resonance imaging (MRI) and computed tomography (CT), it is often required to accelerate the measurement process by reducing the number of
measurements. However, the data acquisition scheme may vary vastly according to the circumstances
(e.g. vendor, sequence, etc.), and hence the reconstruction algorithm needs to be adaptable to
different possibilities. Supervised learning schemes show weakness in this aspect as they overfit to
the measurement types that were used during training. As such, it is easy to see that diffusion models
will be particularly strong in this aspect as they are agnostic to the different forward models at
inference. Indeed, it was shown in some of the pioneering works that diffusion-based reconstruction
algorithms have high generalizability (Jalal et al., 2021; Chung & Ye, 2022; Song et al., 2022; Chung
et al., 2023b).
Figure 2: Representative reconstruction results. (a) Multi-coil MRI reconstruction, (b) 3D sparse-view CT (columns: FBP, DiffusionMBIR (4000), Ours (19), ground truth). Numbers in parentheses: NFE. Yellow numbers in bottom left corner: PSNR/SSIM.
2 BACKGROUND
Krylov subspace methods Consider the linear system (1). In classical projection-based methods such as Krylov subspace methods (Liesen & Strakos, 2013), given two subspaces K and L, we define an approximate problem: find x ∈ K such that
y − Ax ⊥ L. (3)
This is a basic projection step, and the sequence of such steps is applied. For example, with non-zero
estimate x̂, the associated problem is to find x ∈ x̂ + K such that y − Ax ⊥ L, which is equivalent
to finding δ ∈ K such that
b − Aδ ⊥ L, δ := x − x̂, b := y − Ax̂. (4)
For the choice of the two subspaces, the conjugate gradient (CG) method takes K and L to be the same Krylov subspace:
K = L = K_l := Span(b, Ab, · · · , A^{l−1}b). (5)
Then, CG attempts to find the solution to the following optimization problem:
min_{x ∈ x̂ + K_l} ∥y − Ax∥². (6)
Krylov subspace methods can also be extended to nonlinear problems via zero-finding. Specifically, the optimization problem min_x ℓ(x) can be equivalently converted to a zero-finding problem for its gradient, i.e. ∇_x ℓ(x) = 0. If we consider a non-linear forward imaging operator A(·), we can define ℓ(x) = ∥y − A(x)∥²/2. Then, one can use, for example, the Newton–Krylov method (Knoll & Keyes,
2004) to linearize the problem near the current solution and apply standard Krylov methods to solve
the current problem. Now, given the optimization problem, we can see the fundamental differences
between the gradient-based approaches and Krylov methods. Specifically, gradient methods are based
on the iteration:
x(i+1) = x(i) − γ∇x ℓ(x(i) ) (7)
which stops updating when ∇x ℓ(x(i) ) ≃ 0. On the other hand, Krylov subspace methods try to find
x ∈ Kl by increasing l to achieve a better approximation of ∇x ℓ(x) = 0. This difference allows
us to devise a computationally efficient algorithm when combined with diffusion models, which we
investigate in this paper. See Appendix A.2 for further mathematical background.
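To make the contrast with the gradient iteration (7) concrete, the following is a minimal NumPy sketch of the CG iteration for a symmetric positive semi-definite system. The operator handle `apply_A`, the right-hand side `b`, and the initialization `x0` are placeholders, and real-valued data is assumed (for complex data, the inner products would use the real part of the Hermitian product).

```python
import numpy as np

def cg(apply_A, b, x0, n_iter=5, tol=1e-10):
    """M-step conjugate gradient for A x = b with A symmetric PSD.

    apply_A is a callable computing the matrix-vector product v -> A v.
    After l iterations the iterate lies in x0 + K_l, the Krylov subspace spanned
    by {r0, A r0, ..., A^{l-1} r0} with r0 = b - A x0 (cf. Eqs. (5)-(6))."""
    x = x0.copy()
    r = b - apply_A(x)                      # initial residual generates the Krylov subspace
    p = r.copy()
    rs = float(np.vdot(r, r).real)
    for _ in range(n_iter):
        Ap = apply_A(p)
        alpha = rs / (float(np.vdot(p, Ap).real) + tol)
        x = x + alpha * p                   # iterate stays within x0 + K_l
        r = r - alpha * Ap
        rs_new = float(np.vdot(r, r).real)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# For a linear inverse problem y = Ax, one would typically run CG on the
# normal equation: x_hat = cg(lambda v: A.T @ (A @ v), A.T @ y, x0)
```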
Diffusion models Diffusion models (Ho et al., 2020) attempt to model the data distribution p_data(x) by constructing a hierarchical latent variable model
p_θ(x_0) = ∫ p_θ(x_T) ∏_{t=1}^{T} p_θ^{(t)}(x_{t−1}|x_t) dx_{1:T}, (8)
where x_{1,...,T} ∈ ℝ^d are noisy latent variables that have the same dimension as the data random vector x_0 ∈ ℝ^d, defined by the Markovian forward conditional densities
q(x_t|x_{t−1}) = N(x_t | √(1 − β_t) x_{t−1}, β_t I), (9)
q(x_t|x_0) = N(x_t | √ᾱ_t x_0, (1 − ᾱ_t) I). (10)
Here, the noise schedule β_t is an increasing sequence in t, with ᾱ_t := ∏_{i=1}^{t} α_i, α_t := 1 − β_t.
Training of diffusion models amounts to training a multi-noise-level residual denoiser (i.e. epsilon matching)
min_θ E_{x_t∼q(x_t|x_0), x_0∼p_data(x_0), ϵ∼N(0,I)} [∥ϵ_θ^{(t)}(x_t) − ϵ∥²₂],
such that ϵ_θ*^{(t)}(x_t) ≃ (x_t − √ᾱ_t x_0)/√(1 − ᾱ_t). Furthermore, it can be shown that epsilon matching is equivalent to
the denoising score matching (DSM) (Hyvärinen & Dayan, 2005; Song & Ermon, 2019) objective up to a constant with a different parameterization
min_θ E_{x_t, x_0, ϵ} [∥s_θ^{(t)}(x_t) − ∇_{x_t} log q(x_t|x_0)∥²₂], (11)
such that s_θ*^{(t)}(x_t) ≃ −(x_t − √ᾱ_t x_0)/(1 − ᾱ_t) = −ϵ_θ*^{(t)}(x_t)/√(1 − ᾱ_t). For notational simplicity, we often denote ŝ_t, ϵ̂_t, x̂_t instead of s_θ*^{(t)}(x_t), ϵ_θ*^{(t)}(x_t), x_θ*^{(t)}(x_t), representing the estimated score, noise, and noiseless
image, respectively. Sampling from (8) can be implemented by ancestral sampling, which iteratively
performs
x_{t−1} = (1/√α_t)(x_t − ((1 − α_t)/√(1 − ᾱ_t)) ϵ̂_t) + β̃_t ϵ, (12)
where β̃_t² := ((1 − ᾱ_{t−1})/(1 − ᾱ_t)) β_t.
One can also view ancestral sampling (12) as solving the reverse generative stochastic differential
equation (SDE) of the variance preserving (VP) linear forward SDE (Song et al., 2021b). Additionally,
one can construct the variance exploding (VE) SDE by setting q(xt |x0 ) = N (xt |x0 , σt2 I), which
is in a form of Brownian motion. In Appendix A.1 we further review the background on diffusion
models under the score/SDE perspective.
DDIM Seen either from the variational or the SDE perspective, diffusion models are inevitably
slow to sample from. To overcome this issue, DDIM (Song et al., 2021a) proposes another method of
sampling which only requires matching the marginal distributions q(xt |x0 ). Specifically, the update
rule is given as follows:
x_{t−1} = √ᾱ_{t−1} x̂_t + √(1 − ᾱ_{t−1} − η²β̃_t²) ϵ_θ*^{(t)}(x_t) + η β̃_t ϵ = √ᾱ_{t−1} x̂_t + w̃_t. (13)
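As a concrete reference, below is a minimal sketch of one DDIM update of (13) with the epsilon parameterization. `eps_model` is a placeholder for the pretrained noise predictor ϵ_θ, `abar_t` and `abar_prev` are the scalar values ᾱ_t and ᾱ_{t−1}, and β̃_t is taken as the usual DDIM noise scale.

```python
import math
import torch

def ddim_step(x_t, eps_model, t, abar_t, abar_prev, eta=0.15):
    """One DDIM update (Eq. (13)). Returns (x_{t-1}, x0_hat)."""
    eps = eps_model(x_t, t)
    # Tweedie-style denoised estimate x0_hat
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)
    # beta_tilde controls the stochasticity: eta = 0 gives deterministic DDIM,
    # eta = 1 recovers ancestral (DDPM-like) sampling
    beta_tilde = math.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) \
        * math.sqrt(1.0 - abar_t / abar_prev)
    w = math.sqrt(max(1.0 - abar_prev - (eta * beta_tilde) ** 2, 0.0)) * eps \
        + eta * beta_tilde * torch.randn_like(x_t)
    return math.sqrt(abar_prev) * x0_hat + w, x0_hat
```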
where ℓ(x) denotes the data consistency (DC) loss (i.e., ℓ(x) = ∥y − Ax∥2 /2 for linear inverse
problems) and M represents the clean data manifold. Consequently, it is essential to navigate in a
way that minimizes cost while also identifying the correct clean manifold. Accordingly, most of the
approaches use standard reverse diffusion (e.g. (12)), alternated with an operation to minimize the
DC loss.
Recently, Chung et al. (2023a) proposed DPS, where the updated estimate from the noisy sample x_t ∈ M_t is constrained to stay on the same noisy manifold M_t. This is achieved by computing the MCG (Chung et al., 2022a) on a noisy sample x_t ∈ M_t as ∇^{mcg}_{x_t} ℓ(x_t) := ∇_{x_t} ℓ(x̂_t), where x̂_t is the denoised sample in (14) obtained through Tweedie's formula. The resulting algorithm can be stated as follows:
x_{t−1} = √ᾱ_{t−1} (x̂_t − γ_t ∇_{x_t} ℓ(x̂_t)) + w̃_t, (18)
where γ_t > 0 denotes the step size. Since the parameterized score function ϵ_θ*^{(t)}(x_t) is trained with samples supported on M_t, ϵ_θ*^{(t)} shows good performance in denoising x_t ∼ M_t, allowing a precise transition to M_{t−1}. Therefore, by performing (18) from t = T to t = 0, we can solve the optimization problem (17) with x_0 ∈ M. Unfortunately, the computation of the MCG for DPS requires computationally expensive backpropagation and is often unstable (Poole et al., 2022; Du et al., 2023).
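For illustration, a minimal PyTorch sketch of one DPS-style update of (18) with a deterministic transition (η = 0) is given below; `eps_model`, `A`, and `y` are placeholders for the pretrained denoiser, a (differentiable) linear forward operator, and the measurement. The point is that the MCG requires backpropagating the data-consistency loss through the network.

```python
import torch

def dps_step(x_t, eps_model, t, abar_t, abar_prev, y, A, gamma):
    """One DPS-style update (Eq. (18)) with a deterministic DDIM transition (eta = 0).

    Computing grad_{x_t} l(x0_hat) -- the MCG -- requires backpropagating the
    data-consistency loss through eps_model, which is the expensive part."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1.0 - abar_t) ** 0.5 * eps) / abar_t ** 0.5
    dc_loss = 0.5 * torch.sum(torch.abs(y - A(x0_hat)) ** 2)
    grad = torch.autograd.grad(dc_loss, x_t)[0]      # backprop through the network
    x0_hat, eps = x0_hat.detach(), eps.detach()
    w = (1.0 - abar_prev) ** 0.5 * eps               # eta = 0 noise term of Eq. (13)
    return abar_prev ** 0.5 * (x0_hat - gamma * grad) + w
```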
Key observation By applying the chain rule to the MCG term in (18), we have
∇_{x_t} ℓ(x̂_t) = (∂x̂_t/∂x_t) ∇_{x̂_t} ℓ(x̂_t),
where we use the denominator layout for vector calculus. Since ∇_{x̂_t} ℓ(x̂_t) is a standard gradient, the main complexity of the MCG arises from the Jacobian term ∂x̂_t/∂x_t.
In the following Proposition 1, we show that if the underlying clean manifold forms an affine subspace, then the Jacobian term ∂x̂_t/∂x_t is indeed the orthogonal projection onto the clean manifold up to a scale factor. Note that the affine subspace assumption is widely used in the diffusion literature, for example when 1) studying the possibility of score estimation and distribution recovery (Chen et al., 2023), 2) showing the possibility of signal recovery (Rout et al., 2023a;b), and, most relevantly, 3) establishing the geometric view of the clean and noisy manifolds (Chung et al., 2022a). Although
it is difficult to assume in practice that the clean manifold forms an affine subspace, it could be
approximated by piece-wise linear regions represented by the tangent subspace at x̂t . Therefore,
Proposition 1 is still valid in that approximate regime.
Proposition 1 (Manifold Constrained Gradient). Suppose the clean data manifold M is an affine subspace and assume a uniform distribution on M. Then,
∂x̂_t/∂x_t = (1/√ᾱ_t) P_M, (19)
x̂_t − γ_t ∇_{x_t} ℓ(x̂_t) = P_M (x̂_t − ζ_t ∇_{x̂_t} ℓ(x̂_t)) (20)
for some ζ_t > 0, where P_M denotes the orthogonal projection onto M.
Now, (20) in Proposition 1 indicates that if the clean data manifold is an affine subspace, DPS corresponds to a projected gradient step on the clean manifold. Nonetheless, a notable limitation of MCG is that it spends only a single projected gradient step per ancestral diffusion sampling step. Motivated by this, we aim to explore extensions that allow computationally efficient multi-step optimization per ancestral sampling step.
Specifically, let T_t denote the tangent space of the clean manifold at a denoised sample x̂_t in (14). Suppose, furthermore, that there exists an l-th order Krylov subspace
K_{t,l} := Span(b, Ab, · · · , A^{l−1}b), b := y − Ax̂_t (21)
such that
T_t = x̂_t + K_{t,l}.
Then, using the property of CG in (6), it is easy to see that the M-step CG update with M ≤ l starting from x̂_t is confined to T_t, since it corresponds to the solution of
min_{x ∈ x̂_t + K_M} ∥y − Ax∥² (22)
and KM ⊂ Kl when M ≤ l. This offers a pivotal insight. It shows that if the tangent space at each
denoised sample is representable by a Krylov subspace, there’s no need to compute the MCG. Rather,
the standard CG method suffices to guarantee that the updated samples stay within the tangent space.
To sum up, our DDS algorithm is as follows:
x_{t−1} = √ᾱ_{t−1} x̂'_t + w̃_t, (23)
x̂'_t = CG(A*A, A*y, x̂_t, M), M ≤ l, (24)
where CG(·) denotes the M -step CG for the normal equation starting from x̂t . In contrast,
DDNM (Wang et al., 2023) for noiseless image restoration problems uses the following update
instead of (24):
x̂′t = (I − A† A)x̂t + A† y, (25)
where A† denotes the pseudo-inverse of A. Unfortunately, (25) in DDNM does not ensure that the
update signal x̂′t lies in Tt due to A† .
Therefore, for large-scale inverse problems, we find that CG outperforms naive projections (25)
by quite large margins. This is to be expected, as CG iteration enforces the update to stay on Tt
whereas the orthogonal projections in DDNM do not guarantee this property. In practice, even when
our Krylov subspace assumptions cannot be guaranteed, we empirically validate in Appendix F.1
that DDS indeed keeps xt closest to the noisy manifold Mt , which, in turn, shows that DDS keeps
the update close to the clean manifold M. Moreover, it is worth emphasizing that gradient-based
methods (Jalal et al., 2021; Chung et al., 2022a; 2023a) often fail when choosing the “theoretically
correct” step sizes of the likelihood. To fix this, several heuristics on the choice of step sizes are
required (e.g. choosing the step size ∝ 1/∥y − Ax̂t ∥2 ), which easily breaks when varying the NFE.
In this regard, DDS is beneficial in that it is free from the cumbersome step-size tuning process.
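Putting the pieces together, the following is a minimal sketch of one DDS update of (23)-(24) with the VP/epsilon parameterization. `eps_model`, the forward operator `A`, and its adjoint `AT` are placeholders, and the inlined CG loop assumes real-valued signals for brevity (for complex MRI data the inner products would take the real part of the Hermitian product).

```python
import math
import torch

def dds_step(x_t, eps_model, t, abar_t, abar_prev, A, AT, y, M=5, eta=0.15):
    """One DDS update: Tweedie denoising, M-step CG on the normal equation
    A*A x = A*y initialized at the denoised sample (Eq. (24)), then DDIM renoising (Eq. (23))."""
    with torch.no_grad():                            # no backpropagation through the network
        eps = eps_model(x_t, t)
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)
    # --- M-step CG, confined to x0_hat + K_M ---
    x, b = x0_hat, AT(y)
    r = b - AT(A(x)); p = r.clone(); rs = (r * r).sum()
    for _ in range(M):
        Ap = AT(A(p))
        a = rs / ((p * Ap).sum() + 1e-10)
        x, r = x + a * p, r - a * Ap
        rs_new = (r * r).sum()
        p, rs = r + (rs_new / rs) * p, rs_new
    x0_hat = x
    # --- stochastic renoising; total noise std is sqrt(1 - abar_{t-1}) (Lemma 2) ---
    beta_tilde = math.sqrt((1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - abar_t / abar_prev))
    w = math.sqrt(max(1.0 - abar_prev - (eta * beta_tilde) ** 2, 0.0)) * eps \
        + eta * beta_tilde * torch.randn_like(x_t)
    return math.sqrt(abar_prev) * x0_hat + w
```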
6
Published as a conference paper at ICLR 2024
Table 1: Quantitative metrics (PSNR [dB] / SSIM, mean ± 1 std) for noiseless parallel imaging reconstruction across sub-sampling patterns: Uniform 1D (×4, ×8), Gaussian 1D (×4, ×8), Gaussian 2D (×8, ×15), and VD Poisson disk (×8, ×15). Numbers in parentheses: NFE. *: expressed as ×2 × C as the network is evaluated for the real/imaginary part of each coil. Bold: best.
Furthermore, our CG step can be easily extended for noisy image restoration problems. Unlike the
DDNM approach that relies on the singular value decomposition to handle noise, which is non-trivial
to perform on forward operators in medical imaging (e.g. PI CS-MRI, CT), we can simply minimize
the cost function
ℓ(x) = (γ/2)∥y − Ax∥²₂ + (1/2)∥x − x̂_t∥²₂, (26)
by performing the CG iteration CG(γA*A + I, x̂_t + γA*y, x̂_t, M) in place of (24), where γ is a hyper-parameter that weights the proximal regularization (Parikh & Boyd, 2014). Finally, our
method can also be readily extended to accelerate DiffusionMBIR (Chung et al., 2023b) for 3D
CT reconstruction by adhering to the same principles. Specifically, we implement an optimization
strategy to impose the conditioning:
min_x (1/2)∥Ax − y∥²₂ + λ∥D_z x∥₁, (27)
where D_z is the finite difference operator applied along the z-axis, the direction that is not learned through the diffusion prior, and unlike Chung et al. (2023b), the optimization is performed on the clean manifold starting from the denoised x̂_t rather than on the noisy manifold starting from x_t. As the additional
prior is only imposed in the direction that is orthogonal to the axial slice dimension (xy) captured by
the diffusion prior (i.e. manifold M), (27) can be solved effectively with the alternating direction
method of multipliers (ADMM) (Boyd et al., 2011) after sampling 2D diffusion slice by slice. See
Appendix C for details in implementation.
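As a reference, a minimal sketch of the noisy data-consistency step implied by (26) follows: M CG iterations on (γA*A + I)x = x̂_t + γA*y, initialized at the denoised sample. `A` and `AT` are placeholders for the forward operator and its adjoint; real-valued signals are assumed.

```python
import torch

def proximal_cg(A, AT, y, x0_hat, gamma=0.95, M=5):
    """Noisy data-consistency step of Eq. (26): M CG iterations on
    (gamma A*A + I) x = x0_hat + gamma A*y, initialized at the denoised sample."""
    op = lambda v: gamma * AT(A(v)) + v
    b = x0_hat + gamma * AT(y)
    x = x0_hat.clone()
    r = b - op(x); p = r.clone(); rs = (r * r).sum()
    for _ in range(M):
        Ap = op(p)
        a = rs / ((p * Ap).sum() + 1e-10)
        x, r = x + a * p, r - a * Ap
        rs_new = (r * r).sum()
        p, rs = r + (rs_new / rs) * p, rs_new
    return x
```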
4 EXPERIMENTS
Problem setting. We have the following general measurement model (see Fig. 3 for an illustration of the imaging physics):
y = P T s x =: Ax, y ∈ ℂⁿ, A ∈ ℂ^{n×d}, (28)
where P is the sub-sampling matrix, T is the discrete transform matrix (i.e. Fourier, Radon), s = I when we have a single-array measurement (e.g. CT), and s = [s^{(1)}, . . . , s^{(c)}] when we have a c-coil parallel imaging (PI) measurement.
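For concreteness, a minimal PyTorch sketch of the multi-coil CS-MRI instance of (28) and its adjoint is given below; the FFT centering/normalization conventions and the tensor layout are assumptions for illustration, not the authors' implementation. A CT instance would replace the Fourier transform with a Radon transform (e.g. via torch-radon).

```python
import torch

def mri_forward(x, sens_maps, mask):
    """Multi-coil CS-MRI forward model of Eq. (28): coil sensitivities, 2D FFT, sub-sampling.
    x: (H, W) complex image, sens_maps: (C, H, W) complex coil maps, mask: (H, W) binary."""
    coil_imgs = sens_maps * x[None]                                     # s x
    kspace = torch.fft.fftshift(
        torch.fft.fft2(torch.fft.ifftshift(coil_imgs, dim=(-2, -1)), norm="ortho"),
        dim=(-2, -1))                                                   # T s x
    return mask[None] * kspace                                          # P T s x

def mri_adjoint(y, sens_maps, mask):
    """Adjoint A*: conjugate sensitivities applied to the inverse FFT of the masked k-space."""
    coil_imgs = torch.fft.fftshift(
        torch.fft.ifft2(torch.fft.ifftshift(mask[None] * y, dim=(-2, -1)), norm="ortho"),
        dim=(-2, -1))
    return (sens_maps.conj() * coil_imgs).sum(dim=0)
```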
We conduct experiments on two distinct applications: accelerated MRI and 3D CT reconstruction. For the former, we follow the evaluation protocol of Chung & Ye (2022) and test our method on the fastMRI knee dataset (Zbontar et al., 2018) with diverse sub-sampling patterns. We provide comparisons against the representative DIS methods Score-MRI (Chung & Ye, 2022), Jalal et al. (2021), and DPS (Chung et al., 2023a). Notably, for DPS, we use the DDIM sampling strategy to show that the strength of DDS comes not only from the DDIM sampling strategy but also from combining it with the CG update steps. The optimal η values for DPS (DDIM) are obtained
through grid search. We do not compare against (Song et al., 2022) as the method cannot cover the
multi-coil setting. Other than DIS, we also compare against strong supervised learning baselines:
E2E-Varnet (Sriram et al., 2020), U-Net (Zbontar et al., 2018); and compressed sensing baseline: total
variation reconstruction (Block et al., 2007). For the latter, we follow Chung et al. (2023b) and test
both sparse-view CT reconstruction (SV-CT), and limited angle CT reconstruction (LA-CT) on the
AAPM 256×256 dataset. We compare against representative DIS methods DiffusionMBIR (Chung
et al., 2023b), Song et al. (2022), MCG (Chung et al., 2022a), and DPS (Chung et al., 2023a);
Supervised baselines Lahiri et al. (2022) and FBPConvNet (Jin et al., 2017); Compressed sensing
baseline ADMM-TV. For all proposed methods, we employ M = 5, η = 0.15 for 19 NFE, η = 0.5
for 49 NFE, η = 0.8 for 99 NFE unless specified otherwise. While we only compare against a single
CS baseline, it was reported in previous works that diffusion model-based solvers largely outperform
the classic CS baselines (Jalal et al., 2021; Luo et al., 2023), for example, L1-wavelet (Lustig et al.,
2007) and L1-ESPiRiT (Uecker et al., 2014). For PI CS-MRI experiments, we employ rejection sampling based on a residual criterion to ensure stability. Further experimental details can be found in Appendix G.
Improvement over DDNM. Fixing the sampling strategy, we inspect the effect of the
three different data consistency imposing strategies: Score-MRI (Chung & Ye, 2022), DDNM (Wang
et al., 2023), and DDS. For DDS, we additionally search for the optimal number of CG iterations
per sampling step. In Tab. 2, we see that under the low NFE regime, the score-MRI DC strategy
has significantly worse performance than the proposed methods, even when using the same DDIM
sampling strategy. Moreover, we see that overall, DDS outperforms DDNM by a few dB in PSNR. We find that 5 CG iterations per denoising step strikes a good balance. One might question the additional
computational overhead of introducing the iterative CG into the already slow diffusion sampling.
Nonetheless, from our experiments, we see that on average, a single CG iteration takes about 0.004
sec. Consequently, it only takes about 0.2 sec more than the analytic counterpart when using 50 NFE
(Analytic: 4.51 sec vs. CG(5): 4.71 sec.).
Improvement on VE (Chung & Ye, 2022). Keeping the pre-trained model intact from Chung & Ye (2022), we switch from the Score-MRI sampling to Algorithm 5, and report the reconstruction results from uniform 1D ×4 accelerated measurements in Tab. 6. Note that Score-MRI uses 2000 PC as the default setting, which amounts to 4000 NFE, reaching 33.25 PSNR. We see almost no degradation in quality down to 200 NFE, but the performance rapidly degrades as we move down to 100, and completely fails when we set the NFE ≤ 50. On the other hand, by switching to the proposed solver, we are able to achieve reconstruction quality better than Score-MRI (4000 NFE) with only 100 NFE sampling. Moreover, we see that we can reduce the NFE down to 30 and still achieve decent reconstructions. This is a useful property for a reconstruction algorithm, as we can trade off reconstruction quality with speed. However, we observe several downsides of using the VE parameterization, including numerical instability with large NFE, which we analyze in detail in appendix F.

Table 2: Ablation study on the DC strategy. 49 NFE VP DDIM sampling strategy, uniform 1D ×4 acc. reconstruction.
            Score-MRI   DDNM    DDS (ours), # CG iterations
                                  1       3       5       10
PSNR [dB]   26.48       31.36    31.51   33.78   34.61   32.48
SSIM        0.688       0.932    0.934   0.952   0.956   0.949
Parallel Imaging with VP parameterization (Noiseless). We conduct thorough PI reconstruction
experiments with 4 different types of sub-sampling patterns following Chung & Ye (2022). Algo-
rithm 2 in supplementary material is used for all experiments. Quantitative results are shown in
Tab. 1 (Also see Fig. 7 for qualitative results). As the proposed method is based on diffusion models,
it is agnostic to the sub-sampling patterns, generalizing well to all the different sampling patterns,
whereas supervised learning-based methods such as U-Net and E2E-Varnet fail dramatically on 2D
subsampling patterns. Furthermore, to emphasize that the proposed method is agnostic to the imaging forward model, we show for the first time in the DIS literature that DDS is capable of reconstructing from non-Cartesian MRI sub-sampling patterns that involve the non-uniform fast Fourier transform (NUFFT) (Fessler & Sutton, 2003). See Appendix F.2.
In Tab. 1, we see that DDS sets the new state of the art in most cases even when the NFE is constrained to < 100. Note that this is a dramatic improvement over the previous method of Chung & Ye (2022): for parallel imaging, Score-MRI required 120k (C = 15) NFE for the reconstruction of a single image. In contrast, DDS is able to outperform Score-MRI with 49 NFE, and performs on par with it with 19 NFE. Even disregarding the additional ×2C NFEs required for Score-MRI to account for the multi-coil, complex-valued acquisition, the proposed method still achieves a ×80 ∼ ×200 acceleration. We note that on average, our method takes about 4.7 seconds for 49 NFE, and about 2.25 seconds for 19 NFE on a single commodity GPU (RTX 3090).

Table 4: Quantitative evaluation of SV-CT on the AAPM 256×256 test set (mean values; std values in Tab. 8). Numbers in parentheses: NFE. Bold: best.
Noisy multi-coil MRI reconstruction. One of the most intriguing properties of the proposed DDS is the ease of handling measurement noise without careful computation of the singular value decomposition (SVD), which is non-trivial to perform for our tasks. With (26), we can solve it with CG, arriving at Algorithm 3 in the supplementary material. For comparison, methods that try to cope with measurement noise via the SVD in the diffusion model context (Kawar et al., 2022; Wang et al., 2023) are not applicable and cannot be compared. One work that does not require computation of the SVD and hence is applicable is DPS (Chung et al., 2023a), which relies on backpropagation. To test the efficacy of DDS on noisy inverse problems, we add rather heavy complex Gaussian noise (σ = 0.05) to the k-space multi-coil measurements and reconstruct with Algorithm 3, setting γ = 0.95 found through grid search. In Tab. 3, we see that DDS far outperforms DPS (Chung et al., 2023a) with 1000 NFE, while being about ×40 faster, as DPS requires heavy backpropagation.

Table 3: Quantitative metrics for noisy parallel imaging reconstruction. Numbers in parentheses: NFE.
Mask pattern      Acc.   Metric       TV      DPS (1000)   DDS VP (49)
Uniform 1D        ×4     PSNR [dB]    24.19   24.40        29.47
                         SSIM         0.687   0.656        0.866
Uniform 1D        ×8     PSNR [dB]    23.02   24.60        26.77
                         SSIM         0.638   0.666        0.827
VD Poisson disk   ×8     PSNR [dB]    23.07   23.48        30.95
                         SSIM         0.609   0.592        0.890
VD Poisson disk   ×15    PSNR [dB]    20.92   23.57        29.36
                         SSIM         0.554   0.622        0.853
4.2 3D CT RECONSTRUCTION
Sparse-view CT. Similar to the accelerated MRI experiments, we aim to both 1) improve the original
VE model of Chung et al. (2023b), and 2) train a new VP model better suited for DDS. Inspecting
Tab. 4, we see that by using Algorithm 6 in supplementary material, we are able to reduce the NFE
to 100 and still achieve results that are on par with DiffusionMBIR with 4000 NFE. However, we
observe similar instabilities with the VE parameterization. Additionally, we find it crucial to initialize
the optimization process with CG and later switch to ADMM-TV using a CG solver for proper
convergence (see appendix E.2 for discussion). Switching to the VP parameterization and using
Algorithm 4, we now see that DDS achieves the new state-of-the-art with ≤ 49 NFE. Notably, this
decreases the sampling time to ∼ 25 min for 49 NFE, and ∼ 10 min wall-clock time for 19 NFE on a
single RTX 3090 GPU, compared to the painful 2 days for DiffusionMBIR. In Tab. 7, we see similar
improvements that were seen from SV-CT, where DDS significantly outperforms DiffusionMBIR
while being several orders of magnitude faster.
5 CONCLUSION
In this work, we present Decomposed Diffusion Sampling (DDS), a general DIS for challenging
real-world medical imaging inverse problems. Leveraging the geometric view of diffusion models and
the property of the CG solvers on the tangent space, we show that performing numerical optimization
schemes on the denoised representation is superior to the previous methods of imposing DC. Further,
we devise a fast sampler based on DDIM that works well for both VE/VP settings. With extensive
experiments on multi-coil MRI reconstruction and 3D CT reconstruction, we show that DDS achieves
superior quality while being ≥ ×80 faster than the previous DIS.
Ethics Statement We recognize the profound potential of our approach to revolutionize diagnostic
procedures, enhance patient care, and reduce the need for invasive techniques. However, we are
also acutely aware of the ethical considerations surrounding patient data privacy and the potential
for misinterpretation of generated images. All medical data used in our experiments were publicly
available and fully anonymized, ensuring the utmost respect for patient confidentiality. We advocate
for rigorous validation and clinical collaboration before any real-world application of our findings, to
ensure both the safety and efficacy of our proposed methods in a medical setting.
Reproducibility Statement For every application and circumstance (noiseless/noisy, VE/VP, 2D/3D), we provide tailored algorithms (see Appendices C and E) to ensure maximum reproducibility. All the hyper-parameters used in the algorithms are detailed in Section 4 and Appendix G. Code is open-sourced at [Link]
ACKNOWLEDGMENTS
This research was supported by the National Research Foundation of Korea(NRF)(RS-2023-
00262527), Field-oriented Technology Development Project for Customs Administration funded by
the Korean government (the Ministry of Science & ICT and the Korea Customs Service) through
the National Research Foundation (NRF) of Korea under Grant NRF2021M3I1A1097910 & NRF-
2021M3I1A1097938, Korea Medical Device Development Fund grant funded by the Korea gov-
ernment (the Ministry of Science and ICT, the Ministry of Trade, Industry, and Energy, the Min-
istry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711137899,
KMDF_PR_20200901_0015), and Culture, Sports, and Tourism R&D Program through the Korea
Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2023.
REFERENCES
Kai Tobias Block, Martin Uecker, and Jens Frahm. Undersampled radial MRI with multiple coils.
Iterative image reconstruction using a total variation constraint. Magnetic Resonance in Medicine:
An Official Journal of the International Society for Magnetic Resonance in Medicine, 57(6):
1086–1098, 2007.
Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the
alternating direction method of multipliers. Now Publishers Inc, 2011.
Guangyong Chen, Fengyuan Zhu, and Pheng Ann Heng. An efficient statistical method for image
noise level estimation. In Proceedings of the IEEE International Conference on Computer Vision,
pp. 477–485, 2015.
Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estima-
tion and distribution recovery of diffusion models on low-dimensional data. arXiv preprint
arXiv:2302.07194, 2023.
Hyungjin Chung and Jong Chul Ye. Score-based diffusion models for accelerated mri. Medical
Image Analysis, pp. 102479, 2022.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for
inverse problems using manifold constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022a. URL
[Link]
Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-Closer-Diffuse-Faster: Accelerating
Conditional Diffusion Models for Inverse Problems through Stochastic Contraction. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022b.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In International Conference on
Learning Representations, 2023a. URL [Link]
Hyungjin Chung, Dohoon Ryu, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Solving
3d inverse problems using pre-trained 2d diffusion models. IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2023b.
Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural
Information Processing Systems, 2021.
Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha
Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Composi-
tional generation with energy-based diffusion models and mcmc. In International Conference on
Machine Learning, pp. 8489–8510. PMLR, 2023.
Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association,
106(496):1602–1614, 2011.
Jeffrey A Fessler and Bradley P Sutton. Nonuniform fast fourier transforms using min-max interpola-
tion. IEEE transactions on signal processing, 51(2):560–574, 2003.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems, 33:6840–6851, 2020.
Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research, 6(4), 2005.
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jonathan Tamir.
Robust compressed sensing mri with deep generative priors. Advances in Neural Information
Processing Systems, 34, 2021.
Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional
neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):
4509–4522, 2017.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-
based generative models. In Proc. NeurIPS, 2022.
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration
models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances
in Neural Information Processing Systems, 2022. URL [Link]
kxXvopt9pWK.
Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Soft truncation: A
universal training technique of score-based diffusion model for high precision score estimation.
International conference on machine learning, 2022.
Dana A Knoll and David E Keyes. Jacobian-free newton–krylov methods: a survey of approaches
and applications. Journal of Computational Physics, 193(2):357–397, 2004.
Anish Lahiri, Marc Klasky, Jeffrey A Fessler, and Saiprasad Ravishankar. Sparse-view cone beam ct
reconstruction using data-consistent supervised and adversarial learning from scarce training data.
arXiv preprint arXiv:2201.09318, 2022.
Jörg Liesen and Zdenek Strakos. Krylov subspace methods: principles and analysis. Oxford
University Press, 2013.
Guanxiong Luo, Moritz Blumenthal, Martin Heide, and Martin Uecker. Bayesian mri reconstruction
with joint uncertainty estimation using diffusion models. Magnetic Resonance in Medicine, 90(1):
295–311, 2023.
Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed
sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journal of the
International Society for Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):
127–239, 2014.
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d
diffusion. arXiv, 2022.
Matteo Ronchetti. Torchradon: Fast differentiable routines for computed tomography. arXiv preprint
arXiv:2009.14788, 2020.
Litu Rout, Advait Parulekar, Constantine Caramanis, and Sanjay Shakkottai. A theoretical jus-
tification for image inpainting using denoising diffusion probabilistic models. arXiv preprint
arXiv:2302.01217, 2023a.
Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G Dimakis, and Sanjay
Shakkottai. Solving linear inverse problems provably via posterior sampling with latent diffusion
models. arXiv preprint arXiv:2307.00619, 2023b.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In 9th
International Conference on Learning Representations, ICLR, 2021a.
Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion
models for inverse problems. In International Conference on Learning Representations, 2023.
URL [Link]
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
In Advances in Neural Information Processing Systems, volume 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In 9th
International Conference on Learning Representations, ICLR, 2021b.
Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging
with score-based generative models. In International Conference on Learning Representations,
2022. URL [Link]
Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C Lawrence Zitnick, Nafissa Yakubova,
Florian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI re-
construction. In International Conference on Medical Image Computing and Computer-Assisted
Intervention, pp. 64–73. Springer, 2020.
Martin Uecker, Peng Lai, Mark J Murphy, Patrick Virtue, Michael Elad, John M Pauly, Shreyas S
Vasanawala, and Michael Lustig. ESPIRiT—an eigenvalue approach to autocalibrating parallel
MRI: where SENSE meets GRAPPA. Magnetic resonance in medicine, 71(3):990–1001, 2014.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computa-
tion, 23(7):1661–1674, 2011.
Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion
null-space model. In The Eleventh International Conference on Learning Representations, 2023.
URL [Link]
Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J Muckley,
Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, et al. fastMRI: An open dataset and
benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.
Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser:
Residual learning of deep CNN for image denoising. IEEE transactions on image processing, 26
(7):3142–3155, 2017.
A PRELIMINARIES
A.1 DIFFUSION MODELS
Let us define a random variable x0 ∼ p(x0 ) = pdata (x0 ), where pdata denotes the data distri-
bution. In diffusion models, we construct a continuous Gaussian perturbation kernel p(xt |x0 ) =
N (xt ; st x0 , s2t σt2 I) with t ∈ [0, 1], which smooths out the distribution. As t → 1, the marginal
distribution pt (xt ) is smoothed such that it approximates the Gaussian distribution, which becomes
our reference distribution to sample from. Using the reparametrization trick, one can directly sample
xt = st x0 + st σt z, z ∼ N (0, I). (29)
Diffusion models aim to revert the data noising process. Remarkably, it was shown that the data
noising process and the denoising process can both be represented as a stochastic differential equation
(SDE), governed by the score function ∇xt log p(xt ) (Song et al., 2021b; Karras et al., 2022).
Namely, the forward/reverse diffusion SDE can be succinctly represented as (assuming s_t = 1 for simplicity)
dx_± = −σ̇_t σ_t ∇_{x_t} log p(x_t) dt ± σ̇_t σ_t ∇_{x_t} log p(x_t) dt + √(2σ̇_t σ_t) dw_t, (30)
where wt is the standard Wiener process. Here, the + sign denotes the forward process, where (30)
collapses to a Brownian motion. With the − sign, the process runs backward, and we see that the score
function ∇xt log p(xt ) governs the reverse SDE. In other words, in order to run reverse diffusion
sampling (i.e. generative modeling), we need access to the score function of the data distribution.
Score matching refers to training a parametrized model s_θ to approximate ∇_{x_t} log p(x_t) (Hyvärinen & Dayan, 2005). As explicit and implicit score matching methods are costly to perform, the most widely used training method in the modern sense is the so-called denoising score matching (DSM) (Vincent, 2011)
min_θ E_{x_t, x_0, ϵ} [∥s_θ^{(t)}(x_t) − ∇_{x_t} log p(x_t|x_0)∥²₂], (31)
which is easy to train as our perturbation kernel is Gaussian. Once s_θ* is trained, we can use it as a plug-in approximation of the score function in (30).
The score function is closely related to the posterior mean E[x_0|x_t], which can be formally linked through Tweedie's formula.
Lemma 1 (Tweedie's formula). Given a Gaussian perturbation kernel p(x_t|x_0) = N(x_t; s_t x_0, σ_t² I), the posterior mean is given by
E[x_0|x_t] = (1/s_t)(x_t + σ_t² ∇_{x_t} log p(x_t)). (32)
Proof.
∇_{x_t} log p(x_t) = ∇_{x_t} p(x_t) / p(x_t) (33)
= (1/p(x_t)) ∇_{x_t} ∫ p(x_t|x_0) p(x_0) dx_0 (34)
= (1/p(x_t)) ∫ ∇_{x_t} p(x_t|x_0) p(x_0) dx_0 (35)
= (1/p(x_t)) ∫ p(x_t|x_0) ∇_{x_t} log p(x_t|x_0) p(x_0) dx_0 (36)
= ∫ p(x_0|x_t) ∇_{x_t} log p(x_t|x_0) dx_0 (37)
= ∫ p(x_0|x_t) (s_t x_0 − x_t)/σ_t² dx_0 (38)
= (s_t E[x_0|x_t] − x_t)/σ_t². (39)
In other words, having access to the score function is equivalent to having access to the posterior
mean through time t, which we extensively leverage in this work.
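As a small illustration of how this is used, a sketch of posterior-mean (Tweedie) denoising from a learned score or noise predictor is given below; `score_model` and `eps_model` are placeholders for the pretrained networks.

```python
import math

def tweedie_from_score(x_t, score_model, t, s_t, sigma_t):
    """Posterior mean E[x0 | x_t] via Lemma 1 for p(x_t|x_0) = N(x_t; s_t x_0, sigma_t^2 I)."""
    return (x_t + sigma_t ** 2 * score_model(x_t, t)) / s_t

def tweedie_from_eps(x_t, eps_model, t, abar_t):
    """Equivalent VP/epsilon form: E[x0 | x_t] = (x_t - sqrt(1 - abar_t) eps) / sqrt(abar_t)."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / math.sqrt(abar_t)
```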
B PROOFS
Lemma 2 (Total noise). Assuming optimality of ϵ_θ*^{(t)}, the total noise w̃_t in (13) can be represented by
w̃_t = √(1 − ᾱ_{t−1}) ϵ̃ (49)
for some ϵ̃ ∼ N(0, I). In other words, (23) is equivalently represented by x_{t−1} = √ᾱ_{t−1} x̂_t + √(1 − ᾱ_{t−1}) ϵ̃ for some ϵ̃ ∼ N(0, I).
Proof. Given the independence of the estimated ϵ̂_t and the stochastic ϵ ∼ N(0, I), along with Gaussianity, the noise variance in (15) is given as 1 − ᾱ_{t−1} − η²β̃_t² + η²β̃_t² = 1 − ᾱ_{t−1}, recovering a sample from q(x_{t−1}|x_0).
Proposition 1 (Manifold Constrained Gradient). Suppose the clean data manifold M is an affine subspace and assume a uniform distribution on M. Then,
∂x̂_t/∂x_t = (1/√ᾱ_t) P_M, (19)
x̂_t − γ_t ∇_{x_t} ℓ(x̂_t) = P_M (x̂_t − ζ_t ∇_{x̂_t} ℓ(x̂_t)) (20)
for some ζ_t > 0, where P_M denotes the orthogonal projection onto M.
Proof. First, we provide a proof for (19). Thanks to the forward sampling x_{t−1} = √ᾱ_{t−1} x_0 + √(1 − ᾱ_{t−1}) ϵ̃, we can obtain the closed-form expression of the likelihood:
p(x_t|x_0) = (2π(1 − ᾱ(t)))^{−d/2} exp(−∥x_t − √ᾱ(t) x_0∥² / (2(1 − ᾱ(t)))), (50)
which is a Gaussian distribution. Using Bayes' theorem, we have
p(x_t) = ∫ p(x_t|x_0) p(x_0) dx_0. (51)
According to the assumption, p(x_0) is a uniform distribution on the subspace M. To incorporate this in (51), we first compute p(x_t) by modeling p(x_0) as a zero-mean Gaussian distribution with isotropic variance σ² and then take the limit σ → ∞. More specifically, we have
p(x_0) = (2πσ²)^{−l/2} exp(−∥P_M x_0∥² / (2σ²)), (52)
where we use P_M x_0 = x_0 as x_0 ∈ M. Therefore, we have
p(x_t, x_0) = p(x_t|x_0) p(x_0) = (2π(1 − ᾱ(t)))^{−d/2} (2πσ²)^{−l/2} exp(−d(x_t, x_0)),
where
d(x_t, x_0) = ∥P_M^⊥ x_t∥²/(2(1 − ᾱ(t))) + ∥P_M x_0∥²/(2σ²) + ∥P_M x_t − √ᾱ(t) P_M x_0∥²/(2(1 − ᾱ(t)))
= ∥P_M x_0 − μ_t∥²/(2s(t)) + (∥P_M^⊥ x_t∥² + c(t)∥P_M x_t∥²)/(2(1 − ᾱ(t))),
and
s(t) = (1/σ² + ᾱ(t)/(1 − ᾱ(t)))^{−1}, (53)
μ_t = (√ᾱ(t)/(1 − ᾱ(t))) / (1/σ² + ᾱ(t)/(1 − ᾱ(t))) · x_t, (54)
c(t) = 1 / (1 + (ᾱ(t)/(1 − ᾱ(t))) σ²). (55)
where we use
s_θ*^{(t)}(x_t) = ∇_{x_t} log p(x_t) = −ϵ_θ*^{(t)}(x_t)/√(1 − ᾱ_t).
Accordingly, we have
x̂_t = (1/√ᾱ_t)(x_t − P_M^⊥ x_t) = (1/√ᾱ_t) P_M x_t. (58)
Therefore, we have
∂x̂_t/∂x_t = (1/√ᾱ_t) P_M. (59)
Second, we provide a proof for (20). Since x̂_t ∈ M, we have x̂_t = P_M x̂_t. Thus, using (59), we have
x̂_t − γ_t ∇_{x_t} ℓ(x̂_t) = x̂_t − γ_t (∂x̂_t/∂x_t) ∇_{x̂_t} ℓ(x̂_t)
= P_M x̂_t − (γ_t/√ᾱ_t) P_M ∇_{x̂_t} ℓ(x̂_t)
= P_M (x̂_t − ζ_t ∇_{x̂_t} ℓ(x̂_t)),
where ζ_t := γ_t/√ᾱ_t. Q.E.D.
For score functions trained in the context of DDPM (VP-SDE), the DDIM sampling of (13) can be used directly. However, for VE-SDE, which was not developed under the variational framework, it is unclear how to construct the equivalent of a DDIM sampler. As one of our goals is to enable the direct adoption of pre-trained diffusion models regardless of the framework, we devise a fast sampler tailored to VE-SDE. The idea of the decomposition of DDIM steps can be easily adopted to address this issue.
First, observe that the forward conditional density for VE-SDE can be written as q(xt |x0 ) =
N (xt |x0 , σt2 I), where σt is taken to be a geometric series following Song et al. (2021b). This
directly leads to the following result:
Proposition 2 (VE Decomposition). The following update rule recovers a sample from the marginal q(x_{t−1}|x_0) for all η ∈ [0, 1]:
x_{t−1} = x̂_t + w̃_t, (60)
where
x̂_t := x_t + σ_t² ŝ_t, (61)
w̃_t := −σ_{t−1} σ_t √(1 − β̃²η²) ŝ_t + σ_{t−1} η β̃ ϵ. (62)
Proof. From the equivalent parameterization as in (14), we have the following relations in VE-SDE:
ŝ_t = −(x_t − x̂_t)/σ_t², (63)
ϵ̂_t = (x_t − x̂_t)/σ_t = −σ_t ŝ_t, (64)
x̂_t = x_t + σ_t² ŝ_t. (65)
Plugging in, we have
x_{t−1} = x̂_t + σ_{t−1} √(1 − β̃²η²) ϵ̂_t + σ_{t−1} η β̃ ϵ. (66)
Since the variance can now be computed as
(σ_{t−1} √(1 − β̃²η²))² + (σ_{t−1} η β̃)² = σ_{t−1}², (67)
we have x_{t−1} ∼ q(x_{t−1}; x̂_t, σ_{t−1}² I) = q(x_{t−1}|x_0) by the assumption.
Again, x̂_t arises from Tweedie's formula. With η = 1 and β̃² = 1 − σ_{t−1}²/σ_t², we can recover the original VE-SDE, whereas η = 0 leads to the deterministic variation of VE-SDE, which we call VE-DDIM. We can now use the usual VP-DDIM (13) or VE-DDIM (60) depending on the training strategy of the pre-trained score function.
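A minimal sketch of one VE-DDIM update of (60)-(62) follows; `score_model` is a placeholder for the pretrained VE score network, and β̃ is taken as √(1 − σ_{t−1}²/σ_t²) per the identification above.

```python
import math
import torch

def ve_ddim_step(x_t, score_model, t, sigma_t, sigma_prev, eta=0.15):
    """One VE-DDIM update (Proposition 2): Tweedie denoising x0_hat = x_t + sigma_t^2 s_hat,
    followed by renoising to the next noise level sigma_{t-1}. Returns (x_{t-1}, x0_hat)."""
    s_hat = score_model(x_t, sigma_t)
    x0_hat = x_t + sigma_t ** 2 * s_hat                         # Tweedie's formula, Eq. (61)
    beta_tilde = math.sqrt(max(1.0 - sigma_prev ** 2 / sigma_t ** 2, 0.0))
    w = -sigma_prev * sigma_t * math.sqrt(max(1.0 - (eta * beta_tilde) ** 2, 0.0)) * s_hat \
        + sigma_prev * eta * beta_tilde * torch.randn_like(x_t)  # Eq. (62)
    return x0_hat + w, x0_hat
```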
Here, we present our main geometric observations in the VE context, which are analogous to
Proposition 1 and Lemma 2. Their proofs are straightforward corollaries, and hence, are omitted.
Proposition 3 (VE-DDIM Decomposition). Under the same assumptions as Proposition 1, we have
∂x̂_t/∂x_t = P_M, (68)
x̂_t − γ_t ∇_{x_t} ℓ(x̂_t) = P_M (x̂_t − γ_t ∇_{x̂_t} ℓ(x̂_t)), (69)
where P_M denotes the orthogonal projection onto the subspace M.
C ALGORITHMS
In the following, we list all the DDS algorithms used throughout the manuscript. For simplicity, we define CG(A, y, x, M) to be running M steps of conjugate gradient with initialization x. For completeness, we include pseudo-code of the CG method in Algorithm 1, which is used throughout the work.
D DISCUSSION ON CONDITIONING
Projection type Methods that belong to this class aim to directly replace1 what we have in the range
space of the intermediate noisy xi with the information from the measurement y. Two representative
works that utilize projection are Chung & Ye (2022) and Song et al. (2022). In Chung & Ye (2022),
we use
xt = (I − A† A)x′t + A† y, (72)
where the information from y will be used to fill in the range space of A† . However, this is clearly
problematic when considering the geometric viewpoint, as the sample may fall off the noisy manifold.
¹Both hard (Chung & Ye, 2022) and soft (Song et al., 2022) constraints can be utilized.
Gradient type In the Bayesian reconstruction perspective, it is natural to incorporate the gradi-
ent of the likelihood as ∇ log p(x_t|y) = ∇ log p(x_t) + ∇ log p(y|x_t). Here, while one can use ∇ log p(x_t) ≃ s_θ*(x_t), ∇ log p(y|x_t) is intractable for all t ≠ 0 (Chung et al., 2023a), so one has to resort to approximations of ∇ log p(y|x_t). In a similar spirit to Score-MRI (Chung & Ye, 2022), one
can simply use gradient steps of the form
xt = x′t − ξt A∗ (y − Ax′t ), (73)
where ξ_t is the step size (Jalal et al., 2021). Nevertheless, ∇_{x_t} log p(x_t|y) is far from Gaussian as t gets further away from 0, and hence it is hard to interpret or analyze where gradient steps in the direction of A*(y − Ax'_t) lead. Albeit not in the context of MRI, a more recent approach (Chung
et al., 2023a) proposes to use
xt = x′t − ξi ∇xt+1 ∥y − Ax̂t+1 ∥22 . (74)
As x̂t+1 is the Tweedie denoised estimate and is free from Gaussian noise, (74) can be thought of as
minimizing the residual on the noiseless data manifold M. However, care must be taken since taking
the gradient with respect to xt corresponds to computing automatic differentiation through the neural
net sθ∗ , often slowing down the compute by about ×2 (Chung et al., 2023a).
T2I vs. DIS For the former, the likelihood is usually given as a neural net-parameterized function
pϕ (y|x) (e.g. classifier, implicit gradient from CFG), whereas for the latter, the likelihood is given as
an analytic distribution arising from some linear/non-linear forward operator (Kawar et al., 2022;
Chung et al., 2023a).
E ALGORITHMIC DETAILS
We provide the VP counterparts of the DDS VE algorithms (Algorithms 5 and 6) in Algorithms 2 and 4. The only differences are that the model is now parameterized with ϵθ, which estimates the residual noise component, and that we have a different noise schedule.
As stated in section 4.1, using ≥ 200 NFE when using Algorithm 5 degrades the performance due to
the numerical instability. To counteract this issue, we use the same iteration until i ≤ N/50 =: k,
and directly acquire the final reconstruction by Tweedie’s formula:
Figure 3: Illustration of the imaging forward model used in this work. (a) 3D sparse-view CT:
the forward matrix A transforms the 3D voxel space into 2D projections. (b) Multi-coil CS-MRI:
the forward matrix A first applies Fourier transform to turn the image into k-space. Subsequently,
sensitivity maps are applied as element-wise product to achieve multi-coil measurements. Finally, the
multi-coil measurements are sub-sampled with the masks.
E.2 3D CT RECONSTRUCTION
Recall that to perform 3D reconstruction, we resort to the following optimization problem (omitting
the indices for simplicity)
x̂* = arg min_{x̂} (1/2)∥Ax̂ − y∥²₂ + λ∥D_z x̂∥₁. (76)
One can utilize ADMM to stably solve the above problem. Here, we include the steps to arrive at Algorithms 6 and 4 for completeness. Reformulating into a constrained optimization problem, we have
min_{x̂,z} (1/2)∥y − Ax̂∥²₂ + λ∥z∥₁ (77)
s.t. z = D_z x̂. (78)
Then, ADMM can be implemented as separate update steps for the primal and the dual variables
x̂^{j+1} = arg min_{x̂} (1/2)∥y − Ax̂∥²₂ + (ρ/2)∥D_z x̂ − z^j + w^j∥²₂ (79)
z^{j+1} = arg min_{z} λ∥z∥₁ + (ρ/2)∥D_z x̂^{j+1} − z + w^j∥²₂ (80)
w^{j+1} = w^j + D_z x̂^{j+1} − z^{j+1}. (81)
We have a closed-form solution for (79)
x̂^{j+1} = (A^T A + ρD_z^T D_z)^{−1}(A^T y + ρD_z^T(z − w)), (82)
which can be numerically solved by iterative CG
x̂^{j+1} = CG(A_CG, b_CG, x̂^j, M), (83)
A_CG := A^T A + ρD_z^T D_z, (84)
b_CG := A^T y + ρD_z^T(z − w). (85)
Moreover, as (80) is in the form of a proximal mapping (Parikh & Boyd, 2014), we have
z^{j+1} = S_{λ/ρ}(D_z x̂^{j+1} + w^j). (86)
Thus, we have the following update steps
x̂^{j+1} = CG(A_CG, b_CG, x̂^j, M)
z^{j+1} = S_{λ/ρ}(D_z x̂^{j+1} + w^j)
w^{j+1} = w^j + D_z x̂^{j+1} − z^{j+1}.
Usually, such update steps are performed iteratively until convergence. However, in our algorithms for 3D reconstruction (Algorithms 6 and 4), we only use a single iteration of ADMM per denoising step. This is possible as we share the primal variable z and the dual variable w as global variables throughout the update steps, leading to proper convergence with a single iteration (the fast variable sharing technique of Chung et al. (2023b)).
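A minimal sketch of this single ADMM iteration is given below; `A`/`AT` and `Dz`/`DzT` are placeholders for the CT projector and the z-direction finite-difference operator together with their adjoints, and z, w are kept as global (shared) variables by the caller.

```python
import torch

def soft_threshold(v, lam):
    """Proximal map of lam * ||.||_1, i.e. S_lam in Eq. (86)."""
    return torch.sign(v) * torch.clamp(torch.abs(v) - lam, min=0.0)

def admm_tv_z_iteration(x_hat, z, w, A, AT, y, Dz, DzT, lam, rho, M=5):
    """One ADMM iteration for Eq. (76), used once per denoising step."""
    # x-update: M CG steps on (A^T A + rho Dz^T Dz) x = A^T y + rho Dz^T(z - w), Eqs. (82)-(85)
    op = lambda v: AT(A(v)) + rho * DzT(Dz(v))
    b = AT(y) + rho * DzT(z - w)
    x = x_hat.clone()
    r = b - op(x); p = r.clone(); rs = (r * r).sum()
    for _ in range(M):
        Ap = op(p)
        a = rs / ((p * Ap).sum() + 1e-10)
        x, r = x + a * p, r - a * Ap
        rs_new = (r * r).sum()
        p, rs = r + (rs_new / rs) * p, rs_new
    # z- and w-updates, Eqs. (80)-(81)
    z = soft_threshold(Dz(x) + w, lam / rho)
    w = w + Dz(x) - z
    return x, z, w
```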
F ADDITIONAL EXPERIMENTS
F.1 NOISE OFFSET
Once the update step is performed, the Gaussian noise level of the updated samples is estimated with
the method from Chen et al. (2015). Note that as the estimation method is imperfect, there is already
a gap between the ground truth noise level and the estimated noise level.
To emphasize that the proposed method is agnostic to the forward imaging model at hand, for the first time in the DIS literature, we conduct MRI reconstruction experiments on a non-Cartesian trajectory, which involves the non-uniform fast Fourier transform (NUFFT) (Fessler &
Sutton, 2003). In Fig. 4 we show that DDS is able to reconstruct high-fidelity images from radial
trajectory sampling, even under aggressive accelerations.
For our sampling strategy, the stochasticity of the sampling process is determined by the parameter
η ∈ [0, 1]. When η → 0, the sampling becomes deterministic, whereas when η → 1, the sampling
becomes maximally stochastic. It is known in the literature that for unconditional sampling with a low NFE, setting η close to 0 leads to better performance (Song et al., 2021a). In Fig. 6, we see a similar trend when we set NFE = 20. However, when we set NFE ≥ 50, we observe that the choice of η does not matter much, as the metric fluctuates within a range that can just as well be attributed to the inherent stochasticity of the sampling procedure. This is different from the
observation made in (Song et al., 2021a), which can be thought of as stemming from the conditional
sampling strategy that the proposed method uses.
Figure 5: Evolution of the reconstruction error ∥x_0 − x_θ*(x_t)∥² through time (±1.0σ plot). (a) VE parameterized with s_θ, (b) VP parameterized with ϵ_θ, (c) visualization of the denoised estimates x̂_t.

Figure 6: Ablation study on the selection of η for Algorithm 2 (PSNR [dB] versus η ∈ [0, 1] for NFE = 20, 50, 100).

Table 7: Quantitative evaluation of LA-CT (90°) on the AAPM 256×256 test set. Bold: best.
Method                                 Axial* PSNR / SSIM   Coronal PSNR / SSIM   Sagittal PSNR / SSIM
DDS VP (49)                            35.07 / 0.963        32.90 / 0.955         29.34 / 0.861
DiffusionMBIR (Chung et al., 2023b)    34.92 / 0.956        32.48 / 0.947         28.82 / 0.832
MCG (Chung et al., 2022a)              26.01 / 0.838        24.55 / 0.823         21.59 / 0.706
Song et al. (2022)                     27.80 / 0.852        25.69 / 0.869         22.03 / 0.735
DPS (Chung et al., 2023a)              25.32 / 0.829        24.80 / 0.818         21.50 / 0.698
Lahiri et al. (2022)                   28.08 / 0.931        26.02 / 0.856         23.24 / 0.812
Zhang et al. (2017)                    26.76 / 0.879        25.77 / 0.874         22.92 / 0.841
ADMM-TV                                23.19 / 0.793        22.96 / 0.758         19.95 / 0.782

Table 8: Standard deviations of the quantitative metrics presented in Tab. 4 (sparse-view CT reconstruction); mean values in Tab. 4. Numbers in parentheses: NFE.

One observation made in this experiment, however, is that using ≥ 200 NFE for the proposed method degrades the performance. We find that this degradation stems from the numerical pathologies that arise when VE-SDE is combined with parameterizing the neural network with s_θ. Specifically, the score function is parameterized to estimate s_θ*(x_t) ≃ −(x_t − x_0)/σ_t² = −ϵ/σ_t. Near t = 0, σ_t attains a very small value, e.g. σ_0 = 0.01 (Kim et al., 2022), meaning that the score function has to approximate relatively large values in this regime, leading to numerical instability. This phenomenon is further illustrated in Fig. 5 (a), where the reconstruction (i.e. denoising) error has a rather odd trend of jumping up and down, and completely diverges as t → 0. This may be less of an issue for samplers such as PC, where the x̂_i are not directly used, but becomes a much bigger problem when
² [Link]
the proposed sampler is used. In fact, for NFE > 200, we find that simply truncating the last few
evolutions is necessary to yield the result reported in Tab. 6 (See appendix E.1 for details).
Such instabilities worsened when we tried scaling our experiments to complex-valued PI reconstruc-
tion due to the network only being trained on magnitude images. On the other hand, the reconstruction
errors for VP trained with epsilon matching show a much more stable evolution of the denoised reconstructions, suggesting that it is indeed a better fit for the proposed methodology. Hence, all experiments reported hereafter use a network parameterized with ϵθ trained within a VP framework, with the real/imaginary parts stacked in the channel dimension to account for the complex-valued nature of the MR imagery and to avoid using ×2 NFE for a single level of denoising.
G EXPERIMENTAL DETAILS
G.1 DATASETS
fastMRI knee. We conduct all PI experiments with fastMRI knee dataset (Zbontar et al., 2018).
Following the practices from Chung & Ye (2022), among the 973 volumes of training data, we drop
the first/last five slices from each volume. As the test set, we select 10 volumes from the validation
dataset, which consists of 281 2D slices. While Chung et al. (2022b) used DICOM magnitude
images to train the network, we used the minimum-variance unbiased estimates (MVUE) (Jalal
et al., 2021) complex-valued data by stacking the real and imaginary parts of the images into the
channel dimension. When performing inference, we pre-compute the coil sensitivity maps with
ESPiRiT (Uecker et al., 2014).
AAPM. AAPM 2016 CT low-dose grand challenge data leveraged in Chung et al. (2022a; 2023b)
is used. From the filtered backprojection (FBP) reconstruction of size 512 × 512, we resize all the
images to have the size 256 × 256 in the axial dimension. We use the same 1 volume of testing data
that was used in Chung et al. (2023b). This corresponds to 448 axial slices, 256 coronal, and 256
sagittal slices in total. To generate sparse-view and limited angle measurements, we use the parallel
view geometry for simplicity, with the torch-radon (Ronchetti, 2020) package.
VE models for both MRI/CT are taken from the official github repositories (Chung & Ye, 2022; Chung
et al., 2022a), which are both based on ncsnpp model of Score-SDE (Song et al., 2021b). In order to
train our VP epsilon matching models, we take the U-Net implementation from ADM (Dhariwal &
Nichol, 2021), and train each model for 1M iterations with a batch size of 4 and an initial learning rate of 1e−4 on a single RTX 3090 GPU. Training took about 3 days for each task.
When running Algorithm 2 in the low NFE regime (e.g. 19, 49), we see that our method sometimes
yields infeasible reconstructions. Sampling 100 reconstructions for 1D uniform ×4 acc., we see
that 5% of the samples were infeasible for 19 NFE, and 3% of the samples were infeasible for 49
NFE. In such cases, we simply compute the Euclidean norm of the residual ∥y − Ax̂∥ with the
reconstructed sample x̂, and reject the sample if the residual exceeds some threshold value τ . Even
when we consider the additional time costs that arise from re-sampling the rejected reconstructions,
we still achieve a dramatic acceleration over previous methods (Song et al., 2022; Chung & Ye, 2022).
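A sketch of this residual-based rejection loop is given below, with `run_dds` standing in for one full DDS reconstruction; the threshold τ and the retry cap are assumptions for illustration.

```python
import torch

def sample_with_rejection(run_dds, A, y, tau, max_tries=5):
    """Residual-based rejection sampling used for stability in the low-NFE regime:
    re-run the sampler whenever the measurement residual exceeds a threshold tau."""
    x_hat = run_dds()
    for _ in range(max_tries - 1):
        if torch.linalg.norm((y - A(x_hat)).flatten()) <= tau:
            break
        x_hat = run_dds()            # reject and re-sample
    return x_hat
```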
Score-MRI (Chung & Ye, 2022) We use the official pre-trained model3 with 2000 PC sampling.
Note that for PI, this amounts to running the sampling procedure per coil. When reducing the number
of NFE presented in Fig. 1, we use linear discretization with wider bins.
3
[Link]
Jalal et al. (2021). As the official pre-trained model is trained on fastMRI brain MVUE images, in
order to perform fair comparison, we train the NCSN v2 model with the same fastMRI knee MVUE
images for 1M iterations in the setting identical to when training the model for the proposed method.
For inference, we follow the default annealing step sizes as proposed in the original paper. We use 3
Langevin dynamics steps per noise scale for 700 discretizations, which amounts to a total of 2100
NFE. When reducing the number of NFE presented in Fig. 1, we keep the 3 Langevin steps intact,
and use linear discretization with wider bins.
E2E-Varnet (Sriram et al., 2020), U-Net. We train the supervised learning-based methods with
Gaussian 1D subsampling as performed in Chung & Ye (2022), adhering to the official implementation
and the default settings of the original work.
TV. We use the implementation in [Link].TotalVariation4 , with calibrated sensitivity
maps with ESPiRiT (Uecker et al., 2014). The parameters for the optimizer are found via grid search
on 50 validation images.
G.4.2 3D CT RECONSTRUCTION
DiffusionMBIR (Chung et al., 2023b), MCG (Chung et al., 2022a). Both methods use the same
score function as provided in the official repository5 . For both DiffusionMBIR and Score-CT, we
use 2000 PC sampler (4000 NFE). For DiffusionMBIR, we set λ = 10.0, ρ = 0.04, which is the
advised parameter setting for the AAPM 256×256 dataset. For Chung et al., we use iterative ART projections as used in the comparison study in Chung et al. (2023b).
Lahiri et al. (2022). We use two stages of 3D U-Net based CNN architectures. For each greedy
optimization process, we train the network for 300 epochs. For CG, we use 30 iterations at each stage.
Networks were trained with the Adam optimizer with a static learning rate of 1e − 4, batch size of 2.
FBPConvNet Jin et al. (2017), Zhang et al. (2017). Both methods utilize the same network
architecture and the same training strategy, only differing in the application (SV, LA). We use a
standard 2D U-Net architecture and train the models with 300 epochs. Networks were trained with
the Adam optimizer with a learning rate of 1e − 4, batch size 8.
ADMM-TV. Following the protocol of Chung et al. (2023b), we optimize the following objective
x* = arg min_x (1/2)∥Ax − y∥²₂ + λ∥Dx∥_{2,1}, (87)
with D := [D x , D y , D z ], and is solved with standard ADMM and CG. Hyper-parameters are set
identical to Chung et al. (2023b).
H QUALITATIVE RESULTS
4
[Link]
5
[Link]
Figure 7: Comparison of parallel imaging reconstruction results. (a) Subsampling mask (1st row: uniform 1D ×4, 2nd row: Gaussian 1D ×8, 3rd row: Gaussian 2D ×8, 4th row: variable density Poisson disc ×8), (b) U-Net (Zbontar et al., 2018), (c) E2E-VarNet (Sriram et al., 2020), (d) Score-MRI (Chung & Ye, 2022) (4000 × 2 × c NFE), (e) DDS (49 NFE), (f) ground truth.
Figure: Comparison of noisy parallel imaging reconstruction results. (a) Mask, (b) zero-filled, (c) DPS (50), (d) DPS (1000), (e) DDS (49), (f) ground truth. Numbers denote PSNR/SSIM.
Figure 9: Comparison of 3D CT reconstruction results. (Top): 8-view SV-CT, (Bottom): 90° LA-CT. (a) FBP, (b) Lahiri et al., (c) Chung et al., (d) DiffusionMBIR (4000 NFE), (e) DDS (49 NFE), (f) ground truth. Numbers in top right corners denote PSNR and SSIM, respectively.