Deep Neural Networks for N-MRI image processing
Candidate: Hamza Bouzidi (Matricola 1915027)
Advisor: Christian Napoli
A.A. 2020-2021
Table of Contents
1. Introduction
5.2 Noise and signal model for multiple correlated coils
6. Implemented Methodology
6.1 Optimizing the input pipeline
6.2 Metrics
7. Experimental Results
8. Conclusions
1. Introduction
In current clinical practice, medical images play a prominent role in diagnosing and treating several diseases. These images are frequently corrupted by noise introduced, for example, by the radio-frequency coils or by movements of the patients. Noise in MRI scans affects not only the visual quality but also the subsequent analysis. Hence, estimation and removal of noise from MR images are essential; for this purpose several enhancement techniques are applied [1–8], and a number of them are based on deep learning. These methods try to infer the clean image from the noisy input. One of the main motivations of this work is that the denoising of MRI images using CNNs has not been extensively studied in the literature.
Moreover, the noise model present in MR images is very different from that of natural images [16]. This happens for several reasons, and it is therefore evident that techniques developed for natural image denoising may not work directly on MRI. In this thesis the denoising is performed on the raw data in k-space (frequency domain) rather than in magnitude space. The reason for using the denoiser network on this type of data is that the noise in k-space is still additive and Gaussian.
MRI is a non-invasive technique and safer than CT, X-ray, and other modalities. It also provides better soft-tissue contrast and image resolution for diagnostic purposes [9], [10]. The MRI modality is built upon the phenomenon of nuclear magnetic resonance, developed through the notable works of Richard Ernst, Paul C. Lauterbur, and Sir Peter Mansfield, who won the Nobel prize in 1991, 2003, and 2003 respectively. The crux of the modality is the utilization of an external magnetic field acting on the hydrogen atoms of the body, largely present in the form of water. The protons of the H-atoms are aligned by the external field; under a radio-frequency (RF) pulse, the protons release their energy, generating the measured signal. Different weightings can be acquired, namely the T1, T2, and PD (Proton Density) modalities, shown in figure 2.1.
Figure 2.1: Sample images from the simulated BrainWeb Database [13]
Several artifacts can appear during the image acquisition process. They can mainly be classified as:
• Machine related: such as magnetic field inhomogeneity, etc.
• Patient related: such as body movement, holding the breath for a long time, etc.
Many of the artifacts mentioned above are taken care of by the MR scanner itself, while noise removal remains an open problem (the denoising problem). Figure 2.2 shows two real sample images of different human subjects from benchmark databases [14], [15], where noise is clearly visible.
Figure 2.2: Sample images from Real Databases. On the left Oasis [14], and right BRATS [15]
The image denoising problem is, in fact, an inverse problem that tries to reconstruct the true noise-free image [9]; hence, it can ease the subsequent steps and is a crucial ingredient of the medical image analysis process. Denoising and the correction of other artifacts, such as intensity inhomogeneity (bias correction), can be considered independent problems, or one can use the first as a guided input for the other. An inaccurate noise model may lead to doubtful results, which is why acquisitions at high SNR are preferred in MRI [16]. A lot of effort has been put into building a statistical noise model for MRI [16], [17], [18], [19]. Similarly, efforts have been made to estimate the parameters of such models [20], [21], [22], [23].
Classical denoising filters have also been modified to account for the nature of MRI data [24], [25], [26], [27]. However, one needs to take care of the tissue information: structures such as Gray Matter (GM) and White Matter (WM) play a significant role in differentiating a healthy brain from an abnormal one and in clinical examinations [28]. So even a small loss of anatomical detail matters; an extensive survey of MRI denoising methods can be found in [1].
The term k-space often refers to the temporary image space, usually a matrix, in which the raw MRI data are stored. The signal is acquired through a quadrature detector that provides the real and imaginary parts of the signal. Each part is assumed to be affected by white noise; the main source of this noise is the RF coil resistance [29], and the final effect also depends on the receiver bandwidth.
The real and imaginary parts from k-space are reconstructed into a complex image through the inverse Fourier transform. The images usually obtained in MRI are magnitude images; other images derived from the phase of the complex image can also be found, but the most common ones remain the magnitude images, which this study is based on. Discarding the phase information, however, changes the statistics of the noise: magnitude images cannot be divided into a signal part and a noise part since, after taking the modulus, the noise is no longer additive. Thus, the distribution of the measured magnitude M given the underlying intensity I is given by:
p(M) = (M/σ²) · exp(−(I² + M²)/(2σ²)) · I₀(I·M/σ²)   (3.1)
where I₀ is the modified 0-th order Bessel function of the first kind. This is the Rice (or Rician) distribution. A Gaussian approximation of this distribution can be made only if I/σ ≫ 1 (I/σ being the signal-to-noise ratio in x-space). So a magnitude image with a high level of noise will be far from this Gaussian approximation of the signal, and it will suffer from what is called the "Rician bias".
As a consequence, clinical MRI with low SNR is also harder to read. New-generation scanners can achieve excellent imaging quality, but the effect of Rician noise still causes problems in many new acquisition modalities [32], or when the signal is low because of the low concentration of the imaged nuclei.
Image denoising is the task of removing the effect of noise from an image: denoising should restore the noisy image to the condition it was in before the noise was applied (the original image). The performance of a denoiser is evaluated by how close the restored image is to the original one. Still, the denoised image inevitably loses some details in the process of denoising, since noise and fine structures can be hard to separate. The requirements of a good denoiser can be summarized as follows: flat areas should be smooth, edges should not be blurred, textures should be preserved, and new details not present in the original image (artifacts) should not be generated.
In the past few years, several deep learning methods have been proposed for denoising. In this work the denoising of magnitude images is studied both on the image and on the raw data in k-space.
The method is inspired by the work of Zhang and collaborators [35] (DnCNN), where residual learning and Batch Normalization are both used to speed up the training process and to boost the denoising performance: instead of predicting the clean image directly, the network gradually separates the noise from the clean image in the hidden layers. The convolutional kernels have size 3x3x1.
More details about neural networks and residual learning are given in the following chapters and in section 6.3.
A first experiment compares the blind denoising of Gaussian and Rician distributed noise at the same quality of image, in terms of PSNR.
First, the images affected by Gaussian and Rician noise are defined. For the Gaussian case:

Mgauss = I + ϵ   (3.2)

where I is the original image and ϵ is a zero-mean Gaussian noise with standard deviation σ. For the Rician case, the corrupted image is obtained as the magnitude of the original image after adding independent zero-mean Gaussian noise to its real and imaginary parts.
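As an illustrative sketch (not the exact code used in this work), the two noise models above can be simulated as follows, assuming the clean image is a NumPy array normalized to [0, 1]; the function names and σ value are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma):
    # M_gauss = I + eps, eps ~ N(0, sigma^2), as in eq. (3.2)
    return img + rng.normal(0.0, sigma, img.shape)

def add_rician_noise(img, sigma):
    # Treat the clean image as the real part of a complex signal, corrupt real
    # and imaginary parts with independent Gaussian noise, then take the modulus.
    real = img + rng.normal(0.0, sigma, img.shape)
    imag = rng.normal(0.0, sigma, img.shape)
    return np.sqrt(real ** 2 + imag ** 2)

image = rng.random((128, 128))                 # stand-in for a magnitude slice
noisy_gauss = add_gaussian_noise(image, 0.05)
noisy_rice = add_rician_noise(image, 0.05)
```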
The DnCNN is trained on 400 images from the training split of the dataset. The results for Gaussian noise are reported in figures 3.2 and 3.4a, and for Rician noise in figures 3.3 and 3.4b. The metric used to evaluate the performance is the PSNR, and the results are compared with a classical filtering technique, the Wiener filter, which is highly effective for white noise removal [37].
The results show that, at the same level of noise, Rician-corrupted images are harder to restore than Gaussian-corrupted ones.
Figure 3.2: Performance of the DnCNN on blind Gaussian denoising. From left, clockwise: the noisy version of the image, the image processed by DnCNN, the original image, and the image processed with the Wiener filter. Each point is an image in the test set, the same color refers to the same level of applied noise, and the dotted red line marks no PSNR improvement after denoising [PSNR(processed) = PSNR(noisy)].
Figure 3.3: Performance of the DnCNN on blind Rician denoising. From left, clockwise: the noisy version of the image, the image processed by DnCNN, the original image, and the image processed with the Wiener filter. Each point is an image in the test set, the same color refers to the same level of applied noise, and the dotted red line marks no PSNR improvement after denoising [PSNR(processed) = PSNR(noisy)].
Figure 3.4: Average PSNR over the test dataset after the application of the DnCNN (blue) and of the Wiener filter (orange), as a function of the standard deviation of the noise, for the Gaussian and Rician models.
For blind Gaussian denoising, the DnCNN always performs better than a Wiener filter applied to the same image, while in the case of Rician noise the Wiener filter outperforms it in PSNR at high noise levels.
In the next chapters, the theory of neural networks will be explained, with more details on residual learning given in section 4.1.2.
First, the methods were validated on simulated data. This means that the conclusions do not transfer immediately to real data, but it grants the possibility to control every step of the pipeline. The implementation is based on TensorFlow, with execution on GPU. The corruption is simulated by adding complex white noise in the frequency domain, with a controlled noise standard deviation. Figure 3.5 shows the effect of the addition of noise in k-space on the magnitude images.
A DnCNN was trained for the task of denoising directly in k-space; this network (Kspace-Dn) learns the mapping between clean and noisy images by minimization of the loss defined in section 3.2.2. To compare the results with a network of the same complexity, a DnCNN was also trained on the noisy images in magnitude space (M-Dn). Both networks are trained for 300 epochs with Adam [47] and a learning rate lr = 10⁻³; then the learning rate is reduced to lr = 10⁻⁴ and the network is trained for another 100 epochs.
The network maps the noisy version of the signal to the clean one. The mapping should be consistent both in k-space and in the final reconstructed magnitude image; that is why a term comparing the reconstructed magnitude with the ground truth is added to the loss. Here S_Y is the two-channel output of the network, S = (S_R, S_I) represents the real and imaginary parts of the ground-truth signal in k-space, M the ground truth in magnitude space, and reco(S_Y) the reconstruction of the output two-channel signal obtained by taking the modulus of its 2D Inverse Discrete Fourier Transform.
3.2.3 Testing
To test the networks, data from a realistic brain phantom is used, containing fine details never seen during training. This test is therefore a measure of the generalization capability of the two networks.
3.2.4 Results
Figure 3.6 : Denoising results for (a) axial and (b) coronal views of
the brain phantom both on k-space and magnitude space.
Denoising in magnitude seems to create smoother surfaces at the
cost of losing details.
Two networks were trained, Kspace-Dn (Ks-Dn) and M-Dn, with pairs of original and noisy simulated MRIs of the simple phantom. Figure 3.6 shows a few images denoised with both Ks-Dn and M-Dn for visual inspection. It is noticed that M-Dn favors smoother surfaces at the cost of losing details; the reason may be an incorrect noise estimation during the blind denoising task. It can be seen that the network working in k-space always outperforms the network trained with magnitude images in terms of image quality. This was the motivation for working in k-space in the rest of this thesis.
The aim of this chapter is to shed light on the basic concepts of Deep Learning and Convolutional Neural Networks (CNNs). CNNs have become increasingly important in the past years for their huge success in computer vision tasks such as classification, detection, and localisation; they are also one of the most used models in computer-aided diagnosis (CAD).
This field is in continuous development, and the best models for a given task change every month, so the focus here is laid on the fundamentals; a more complete treatment can be found in [40].
CNNs owe their success to many properties, chief among them being a scalable feature-learning architecture that, for a given task, optimises the features directly from the data. CNNs are a particular kind of Artificial Neural Network (ANN); ANNs are computers whose architecture is modeled after the brain.
Each node of an ANN is a simplified model of a real neuron, which sends off a new signal, or fires, if it receives a sufficiently strong input signal from the other nodes to which it is connected.
A feedforward network is one where input data travels in one direction only, passing through artificial neural nodes and exiting through output nodes; the aim of this section is to describe this basic model. An example could be a classifier that maps an element x to its label y: the network defines a parameterized mapping y = f(x; W), which models the original function f* and learns the parameters W that give the best function approximation.
They are called feedforward networks because the information flows from the input x to the output y without feedback between layers, and they are typically built as a chain of functions: f¹ is called the first layer of the network, then f², up to the last layer f^N, and the length of the chain is called the depth of the model. The architecture of the network can be complex and composed of many layers.
The objective of the training is to match our function f to the original f*; the layers can be shaped freely to better approximate f*, and the role of the training algorithm is to select the best parameters, or weights, for these layers.
The intermediate layers are called hidden because their desired output is not specified during training; they are composed of several hidden units that perform the basic computation in a neural network. One way to think of the architecture is that the final layer is a simple linear model that operates not on the raw input but on the learned (complex) features.
Connections come only from the previous layer, and elements in the same layer are not directly connected. This architecture is called a multilayer fully connected network, since every element is connected to all the elements of the previous layer. It is the most basic example of a neural network, but it still finds applications.
Each unit computes a weighted sum of the many inputs that it receives from the previous layer, applies an activation function, and supplies the result as input to the next layer. The purpose of the activation function is also to keep values in a sensible range: values in the input layers are generally centered around zero, but after repeated linear combinations they can get beyond the range of their original scale, which is where the activation functions come into play, forcing values back within an acceptable range.
One of the most used activation functions is the Rectified Linear Unit (ReLU), which outputs 0 if the input is negative and is linear when the input is positive:

relu(x) = max(x, 0)   (4.2)

One benefit of ReLUs is that their gradient does not vanish: when x > 0 the gradient is constant, which enables faster learning. The other benefit of ReLUs is sparsity, which arises when x ≤ 0: the more such units exist in a layer, the sparser the resulting representation.
Sigmoids, on the other hand, are always likely to generate some non-zero values, resulting in dense representations.
Supervised learning is a type of machine learning which applies an algorithm to map one input to one output. For supervised learning to work, a labeled set of data points that the model can learn from is needed; in image recognition tasks, for example, labelers may be asked to tag all the images in a dataset. In machine learning, a properly labeled dataset that you use as the objective standard to train and assess a given model is often called the "ground truth." The accuracy of our trained model will depend on the accuracy of our ground truth, so spending the time and resources to ensure highly accurate labeling is essential.
Then, in order to evaluate how well our algorithm performs the task, a quantitative metric is needed; it is the key to driving the learning phase and granting the ability to generalize the task to other data that will be unlabelled. The metric should be generic enough to be well defined for all our data examples. Our learning goal should also be evaluated on data that is not present in the training set, to check if the model has learned adequately; thus a different dataset, called the test set, is used for this purpose, and the performance measured on this set is the indicator of how well the model is doing. The learning algorithm needs the ability to:
• Make the error on the training set small.
• Generalize the learning in order to be good enough also on the test set, and make the gap of performance between the training and test sets as small as possible.
One interesting task that can be solved with machine learning is denoising, the subject of this thesis. In a denoising task the algorithm is given as input a corrupted version x of a clean image y, and the model tries to restore x into a state similar to y. In this case the metric has to calculate the similarity between the two images; in the case of images, pixel values are usually worked with. The objective function is usually called the error function, or the loss function L(x). The exact form of L(x) is problem dependent, but in general a low value of it results in high performance for the algorithm. The argument that minimises the loss function is denoted by x*, such that:

x* = arg min_x L(x)

Gradient descent is the method generally adopted to solve the equation ∇_x L(x) = 0; more details are available in [41].
The gradient descent update is x ← x − ϵ ∇_x L(x), with ϵ the learning rate, a positive constant that defines the size of the step; choosing the best learning rate is also a matter of discussion.
After this overview of neural networks, a specific and very popular type of neural network, and the main subject of this thesis, is introduced: Convolutional Neural Networks (CNNs), which are a subset of deep learning algorithms. CNNs can be trained both in a supervised and in an unsupervised fashion, and they rely on the convolution operation; the use of this operation in one or more layers is what defines a CNN.
s(t) = (x ∗ w)(t) = ∫ x(a) w(t − a) da   (4.3)

where x is the function mapping to a specific value in the input data, and w represents the kernel. The output function s(t) is usually called a feature map.
If the input values are discrete, the convolution operation can be rewritten using a summation:

s(t) = (x ∗ w)(t) = Σₐ x(a) w(t − a)   (4.4)
One can then use a two-dimensional kernel K applied to a two-dimensional input I, and the operation can be written as follows:

S(i, j) = (I ∗ K)(i, j) = Σₘ Σₙ I(m, n) K(i − m, j − n)   (4.5)

That is, for a given output pixel positioned in row i and column j, the kernel parameters are combined with the input pixels to produce the output value at i and j.
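As a minimal illustrative sketch of this operation, the following NumPy function applies a small 2D kernel to an image with stride 1 and no padding. Note that, as in most deep learning frameworks, the kernel is not flipped (strictly a cross-correlation), which is equivalent to eq. (4.5) up to a flip of the kernel; the example image and kernel are arbitrary.

```python
import numpy as np

def conv2d(image, kernel):
    """Direct sliding-window implementation of a 2D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum of the input pixels covered by the kernel at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])   # simple horizontal gradient kernel
print(conv2d(img, edge_kernel))                      # 3x3 feature map
```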
The convolution can therefore be seen as an operation between a kernel and a small portion of a larger image. Usually, the kernels adopted for image processing in CNNs are significantly smaller than the image they process. In a convolutional layer the same kernel is applied to all elements of the input, meaning that the same operation is repeated across the image space (weight sharing).
The effect of a linear kernel can be seen in figure 4.2: a kernel K of 2x2 pixels is applied to an image I. The kernel slides over the image with a step of S pixels, called the stride. If the value of the stride is changed, the size of the output changes as well: if the stride is 1, the kernel processes every element of the input, while if the stride is bigger, some positions are skipped and the output is smaller. Each output value is a linear combination of the input pixels covered by the kernel, and the coefficients of this combination are learned during training.
A limitation of the feature maps produced by convolutional layers is that they record the precise position of features in the input. This means that small movements in the position of a feature in the input image will result in a different feature map; this can happen with re-cropping, rotation, shifting, and other small perturbations. The pooling operation addresses this by replacing the output of the net at a certain location with aggregate information over all the nearby input units. Max pooling, for example, calculates the maximum, or largest, value in each patch of each feature map. The results are downsampled, or pooled, feature maps that highlight the most present feature in the patch, rather than the average presence of the feature in the patch.
CNNs have been used for several tasks, so there is no typical architecture that defines the best CNN model for every task. But when processing images, there are some guidelines that are valid most of the time.
One of the best-known vision architectures to date is VGG (Simonyan and Zisserman, 2015), which was a popular solution for image classification, and many later models derive from it. The distinctive thing about VGG16 is that, instead of having a large number of hyperparameters, it uses layers of 3x3 filters with stride 1, always with the same padding, and 2x2 max-pool layers with stride 2, which made the idea of small kernels popular.
A convolutional layer with N input channels and M output feature maps consists of operators that perform a linear combination over K×K windows, with K the size of the kernels. The number of filters controls how many feature maps are produced, while the size of the kernel controls how many elements are combined together. Thus, the number of parameters of the layer will be:
PW = N * M * K * K (4.6)
After the convolutional operations, the output elements are the inputs of an activation function and, possibly, of a pooling stage that reduces the spatial size. A typical CNN is composed of multiple such blocks, each of them with an input size reduced due to the pooling stage; this part is called the subsampling path, as the input resolution decreases block after block. The output of the last convolutional block gets flattened into what is called the feature vector, which is the input to the classification task.
As an example of the architecture just explained, figure 4.3 shows a network with the same structure. From left to right: an input matrix of 128 by 128 units, which can represent the analysed data or the output of a previous convolutional block, followed by convolution and pooling with a filter of stride 2; after the pooling, the output is reshaped as a vector and used as input for a two-layer fully connected classifier.
The CNN described here is one of the most basic implementations of the idea, but this is not a problem, since our objective is not to survey CNN architectures.
Another family of networks produces an output that has the same shape as the input. It is called a Fully Convolutional Neural Network (FCNN), and it was introduced first for the task of semantic segmentation; the segmentation task is the assignment of each pixel of the input to a class. In a brain image, the task could be the separation of a white matter zone from the rest of the tissue. The difference between the FCNN and the CNN architecture previously described is that the FCNN brings the intermediate feature maps of the subsampling path back to the size of the input image through the transposed convolution operation, so that the output has the same resolution as the input.
Unlike the CNN with a fully contractive path, in an FCNN it is possible that each output unit depends only on a part of the input. This area is usually called the receptive field; every pixel outside of this area will not contribute to that output unit.
The case without any pooling operation is interesting for the receptive field: the ability to process larger areas then depends only on the depth of the network. The receptive field grows from the initial size of the kernel K by the stride S of the operator in each direction for each additional layer, giving a patch size of

PS = K + 2S(D − 1)   (4.7)

for a network of depth D. Figure 4.5 shows an example for a network with a kernel of size 3 by 3 and a stride of 1 at depth d, which will have an effective patch width of 2d + 1.
This kind of architecture has a small receptive field, but it preserves the spatial information: pooling operations improve translation invariance, but at the same time they make the output less dependent on the exact spatial position. A pooling-free network will therefore perform extremely well for a task in which the definition and sharpness of details matter, such as denoising. Residual learning and batch normalization are at the core of the denoising strategy in the work of Zhang and collaborators [35]. When both are utilized, they speed up the training process and boost the denoising performance.
In the standard formulation, the network learns to predict the latent clean image gradually in the hidden layers, separating the noise from the original image:

NN_W(X) ∼ y   (4.8)
The residual learning formulation instead aims to map the output of the network to the noise part of the input. The denoised image is then obtained by subtracting the network output from the noisy input, X − NN_W(X).
The loss function can be written using a pixel-wise mean squared error:

L(y_pred, y) = (1/n) Σᵢ (yᵢ_pred − yᵢ)²   (4.9)
In order to simplify the notation, the sum over the pixels will not be written
explicitly in the next steps. In normal learning for which the output is
ypred = NNW (X ) the loss function is :
LW = | ypred − y |2 (4.10)
While for the residual learning approach, where the network output y_pred = NN_W(X) estimates the noise and the denoised image is X − y_pred, the loss becomes:

L(y_pred, y) = | X − y_pred − y |²   (4.11)
For the additive noise model, where it is possible to write X = y + ϵ, the loss reduces to:

L(y_pred, y) = | y_pred − ϵ |²   (4.12)

and this leads to y_pred ∼ ϵ. This slight modification of the loss function helps the neural network to find a solution that focuses on the noise part of the input.
The original paper's authors show that a simple neural network applying this strategy can decrease training time and has a remarkable generalization ability (the training can be transferred to related tasks).
Batch Normalization (BN) is the other key ingredient. Normalizing the inputs to zero mean and unit standard deviation has been known for a long time [44] to be beneficial to training, and BN extends this idea across the intermediate layers within a deep network [Szegedy et al.]. Training with too large learning rates can lead to divergence, while small learning rates yield little progress along the flat directions of the optimization landscape and may harm generalization performance. With BN the activations cannot grow uncontrollably, since their means and variances are normalized:
O_{b,c,x,y} ← γ_c · (I_{b,c,x,y} − μ_c) / √(σ_c² + ϵ) + β_c   ∀ b, c, x, y   (4.13)
Here I_{b,c,x,y} and O_{b,c,x,y} are the four-dimensional input and output tensors of a BN layer, indexed by batch example b, channel c, and the two spatial dimensions x, y. For each channel, BN first subtracts the mean μ_c, computed over all features b in the entire mini-batch and all spatial x, y locations. Subsequently, BN divides the centered activation by the standard deviation σ_c and applies the learnable scale γ_c and shift β_c.
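A minimal NumPy sketch of the training-time computation in eq. (4.13) is shown below; at inference time, frameworks use running statistics instead of the mini-batch ones, and γ, β are learned during training.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of a (B, C, H, W) tensor, following eq. (4.13)."""
    # per-channel mean and variance over the batch and the spatial dimensions
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # learnable per-channel scale and shift
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 3, 16, 16)                      # mini-batch of 8 examples, 3 channels
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=(0, 2, 3)), out.std(axis=(0, 2, 3)))  # ~0 and ~1 per channel
```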
In this section, light is shed on the MRI dataset used in this work, the FastMRI dataset [45]. It is a very good candidate for the denoising task because it consists of raw multi-coil k-space acquisitions of good quality, to which simulated noise can be added to train our Dn-CNN. Some of the acquisitions can be noisy, but they form just a minority; in any case, the noise present in the acquisition will be negligible compared to the artificial noise added for training purposes. The presence of some noise in the real data is even a perk, as it makes any denoising method more robust. The acquisitions are multi-coil, which implies that each of the coil k-space matrices will be different (see figure 5.1), because each of the coils is more sensitive to different spatial regions.
Among the diverse datasets available in FastMRI, the one used in this thesis for the DnCNN denoising is the knee k-space data. It consists of multi-coil raw data stored for 1,594 scans acquired for the purpose of diagnostic knee MRI. A single fully sampled MRI volume was acquired for each scan on one of several clinical systems (including the Siemens Magnetom Skyra and Magnetom Aera). Data acquisition used a 15-channel knee coil array and the standard clinical protocol of the NYU School of Medicine. The dataset includes data from two pulse sequences, yielding coronal proton-density weighted images with (PDFS) and without (PD, 796 scans) fat suppression (see figure 5.2). Sequence parameters are matched, as far as possible, between the systems. The following sequence parameters were used: echo train length 4, slice thickness 3 mm, no gap between slices. The timing varied between systems, with a repetition time (TR) ranging between 2200 and 3000 milliseconds and the echo time (TE) varying accordingly.
The total numbers of patients used for training and testing are shown in Table 5.1.
The dataset also provides 6970 fully sampled brain scans; a portion of 255 of them was later used for additional testing of the solution, but no training was performed on it. A future version of our denoiser will be trained on the whole dataset, which provides examples from multiple sequences and contrasts.

Table 5.1: Patients and slices used from the FastMRI knee training set (973 patients in total).
Used in Train: 350 patients, 10236 slices
Used in Test: 100 patients, 2959 slices
S_l(x) denotes the complex signal at the l-th coil in x-space, which corresponds to the inverse Fourier transform of s_l(k), S_l(x) = F⁻¹{s_l(k)}. In the single-coil case this is directly the complex image signal, while in the multiple-coil case one complex image is available per coil, and in order to get the final image it is necessary to combine all of them.
Figure 5.3 : Test of reconstruction using a simple unweighted SoS, the green looking
pictures show k-space data up to 15 coils, below them the Individual coil spatial
images from fully sampled data, and on the right the reconstructed image from the
total coils.
The most popular approach, adopted here for the reconstruction of the composite magnitude signal (CMS) in multiple-coil acquisitions, is the Sum of Squares (SoS); it has been proven to be one of the most efficient approximations of the Spatial Matched Filter (SMF). The advantage of using SoS is that it does not require a prior estimation of the coil sensitivities, and thus the CMS can be directly computed from the coil images. More sophisticated methods exist to reconstruct the CMS from multiple signals, but for the sake of simplicity the SoS is used in this work.
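A minimal sketch of the SoS reconstruction is given below, assuming fully sampled multi-coil k-space stored as a complex (n_coils, H, W) NumPy array; the FFT centering conventions are an assumption and may differ from the FastMRI data loaders.

```python
import numpy as np

def sos_reconstruction(kspace):
    """Sum-of-Squares reconstruction of the composite magnitude image.

    kspace: complex array of shape (n_coils, H, W), fully sampled data.
    """
    # per-coil complex images: centered 2D inverse FFT of each coil's k-space
    coil_imgs = np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(kspace, axes=(-2, -1)), axes=(-2, -1)),
        axes=(-2, -1),
    )
    # SoS: square root of the sum of the squared coil magnitudes
    return np.sqrt(np.sum(np.abs(coil_imgs) ** 2, axis=0))

rng = np.random.default_rng(0)
kspace = rng.standard_normal((8, 320, 320)) + 1j * rng.standard_normal((8, 320, 320))
magnitude = sos_reconstruction(kspace)   # (320, 320) real-valued image
```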
Noise in MRI originates in k-space and, given the fact that it affects all frequencies equally (all the k-space locations), it can be modeled as additive white Gaussian noise (AWGN). In this case, the acquired signal at the l-th coil in k-space can be modeled as:

s_l(k) = a_l(k) + n_l(k)

with a_l(k) the noise-free signal in the l-th coil of a total of L, n_l(k) a complex Gaussian noise term, and s_l(k) the received noisy signal at coil l. This is the standard assumption for noise in MRI. To get the complex image domain, the inverse Fourier transform of s_l(k) is applied to each slice and every coil.
In practice, multi-coil acquisitions show a particular coupling [16], which means that the noisy samples at each k-space location are correlated from coil to coil. Under this assumption the noise is described, both in k-space and in the image domain, by a covariance matrix Σ which is non-diagonal. Given the assumption that each coil initially has Gaussian noise with the same variance σ₀², the covariance matrix can be written as:

Σ = σ₀² · ⎡ 1  ρ  ⋯  ρ ⎤
          ⎢ ρ  1  ⋯  ρ ⎥
          ⎢ ⋮  ⋮  ⋱  ⋮ ⎥
          ⎣ ρ  ρ  ⋯  1 ⎦   (5.4)
Usually ρ_{i,j} is significant in multi-coil systems, but its value depends on the hardware. The exact values of the correlation matrix for the scanners present in the dataset are not available; here a larger correlation is assumed between coils that are first neighbours, and ρ_{i,j} = 0.15 between coils that are second neighbours, as shown in figure 5.4.
Figure 5.4 : Covariance matrix for the correlated acquisition, the non-diagonal elements
are the correlations between each coil, the correlation between neighbour pairs is bigger
than the distant ones.
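A minimal sketch of how such correlated complex noise can be generated with a Cholesky factorization is shown below; for simplicity a single ρ is used for all coil pairs, whereas the covariance actually used in this work is neighbour-dependent as in figure 5.4, and the function name and values are illustrative.

```python
import numpy as np

def correlated_coil_noise(n_coils, shape, sigma0, rho, rng=None):
    """Complex Gaussian k-space noise with coil-to-coil correlation (eq. 5.4 style)."""
    rng = rng or np.random.default_rng(0)
    # covariance sigma0^2 * [(1 - rho) * I + rho * ones]
    cov = sigma0 ** 2 * ((1 - rho) * np.eye(n_coils) + rho * np.ones((n_coils, n_coils)))
    chol = np.linalg.cholesky(cov)
    n_px = int(np.prod(shape))
    # color independent white samples across coils, separately for real and imaginary parts
    real = chol @ rng.standard_normal((n_coils, n_px))
    imag = chol @ rng.standard_normal((n_coils, n_px))
    return (real + 1j * imag).reshape(n_coils, *shape)

noise = correlated_coil_noise(n_coils=8, shape=(320, 320), sigma0=16e-3, rho=0.15)
# noisy_kspace = clean_kspace + noise   # added directly to the multi-coil k-space data
```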
To reduce the computational load during training, which turns out to be heavy without preprocessing, the data are preprocessed. Low-pass filtering (a.k.a. smoothing) removes high spatial-frequency noise from a digital image. Low-pass filters usually employ a moving-window operator, which affects one pixel of the image at a time, changing its value as a function of a local neighbourhood; the window moves over the image so that all the pixels are affected. This produces slightly smoothed ground-truth images and smaller examples, with reduced memory needed for training. This filtering does not produce significant effects on the final image.
The complex k-space signal for each coil is normalized, which helps convergence during training and standardizes the examples. Since not every coil has the same sensitivity at each volume point, a good solution is to normalize by the maximum value over the whole volume. Moreover, not all the coils are used in the end: only 8 are selected out of 15, fixed for all examples. This reduces the overall quality of the final reconstructed image, since less data is used, but it is needed to fit the memory budget. Finally, mixed-precision training is adopted, which reduces memory usage compared to single precision.
Mixed precision uses both 16-bit and 32-bit variables during training; lower-precision data types in the model use less memory, exploit the presence of Tensor Cores that accelerate 16-bit computations, and can be read from memory faster. The powerful thing about this method is that it is possible to double the size of the mini-batch with the same memory, and thus roughly double the rate of training.
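A minimal sketch of how mixed precision is typically enabled in TensorFlow/Keras is shown below; the layer sizes and the 16-channel input are placeholders for the actual Dn-CNN, not the exact training script used here.

```python
import tensorflow as tf

# Enable mixed precision globally: computations run in float16 where safe,
# while variables are kept in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu",
                           input_shape=(None, None, 16)),
    # ... intermediate layers ...
    # force the last layer back to float32 so the loss is computed in full precision
    tf.keras.layers.Conv2D(16, 3, padding="same", dtype="float32"),
])

# In a custom training loop, wrap the optimizer with loss scaling to avoid
# float16 underflow in the gradients.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))
```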
6. Implemented Methodology
6.1 Optimizing the input pipeline
The tf.data API (https://www.tensorflow.org/guide/data) enables building complex input pipelines from simple, reusable transformations. A naive pipeline reads one entry from file and then uses the data for training; when the model is training the input pipeline is idle, and when the pipeline is reading data the model is idle, so clear inefficiencies can be seen. The following optimizations were applied (a sketch follows this list):
• Parallel mapping: the preprocessing transformation is parallelized, letting the tf.data runtime tune the level of parallelism dynamically at runtime (AUTOTUNE).
• Caching: a large dataset is cached in local storage; this saves operations like file opening and data reading from being executed during each epoch.
• Shuffle: the dataset is shuffled with a given buffer size parameter, which affects the randomness of the example order.
• Prefetch: overlaps the preprocessing and the model execution of a training step; for example, when the model is executing training step n, the input pipeline is already reading the data for step n+1. Again, AUTOTUNE is used to tune the value dynamically at runtime.
• Batch: takes batch-size entries (set to 64) and stacks them into a single batched example.
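A minimal, self-contained sketch of such a pipeline is shown below; the synthetic tensors and the noise-adding preprocessing stand in for the actual FastMRI file reading.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def preprocess(clean):
    # placeholder preprocessing: add white noise to build a (noisy, clean) pair
    noisy = clean + tf.random.normal(tf.shape(clean), stddev=0.01)
    return noisy, clean

slices = tf.random.normal((256, 64, 64, 16))          # stand-in for k-space slices
dataset = (
    tf.data.Dataset.from_tensor_slices(slices)
    .map(preprocess, num_parallel_calls=AUTOTUNE)     # parallel preprocessing
    .cache()                                          # avoid re-reading/decoding every epoch
    .shuffle(buffer_size=1000)                        # shuffle with a buffer
    .batch(64)                                        # mini-batches of 64 examples
    .prefetch(AUTOTUNE)                               # overlap step n training with step n+1 loading
)
```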
6.2 Metrics
The following metrics will be used to assess the performance of the DnCNN.
• The Signal-to-Noise Ratio (SNR) is defined as the ratio between the mean intensity of the signal and the background standard deviation:

SNR = μ_signal / σ_background   (6.1)
• For MRI, the Peak Signal-to-Noise Ratio (PSNR) will be used. The PSNR is defined from the mean squared error (MSE); for a pair of N×M real images I and I*, the MSE can be written as:

MSE = (1/(N·M)) Σᵢ₌₁ᴺ Σⱼ₌₁ᴹ [I(i, j) − I*(i, j)]²   (6.2)

The PSNR is then the mean squared error between the two images normalized to the maximum value that a pixel can assume, PSNR = 10·log₁₀(MAX² / MSE).
• The Structural Similarity Index (SSI, or SSIM) quantifies the similarity of the visible structures in the image. The SSIM between a patch m of the original image and the corresponding patch n of the restored image is computed from the local means μ_m, μ_n, the variances σ_m², σ_n², and the covariance σ_mn; c1 and c2 are small regularising constants set to 0.01 and 0.03:

SSIM(m, n) = [(2·μ_m·μ_n + c1)(2·σ_mn + c2)] / [(μ_m² + μ_n² + c1)(σ_m² + σ_n² + c2)]
• The residual map compares the pixel intensities of the original and restored images: each value of the residual map D_{i,j} is the squared difference of the corresponding pixels, D_{i,j} = [I(i, j) − I*(i, j)]².
The average of the residual map over the image is also reported; this metric gives a global, pixel-wise measure of the error. SSI and residual maps are two complementary metrics, since SSI is computed over small regions and gives information about the relation between neighbouring pixels, while the residual map is purely local to each pixel.
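As an illustrative sketch, the three image-quality measures above can be computed with standard TensorFlow primitives, assuming batched magnitude images of shape (B, H, W, 1) normalized to [0, 1].

```python
import tensorflow as tf

def psnr(restored, reference, max_val=1.0):
    # 10 * log10(MAX^2 / MSE), computed per image
    return tf.image.psnr(restored, reference, max_val=max_val)

def ssim(restored, reference, max_val=1.0):
    # structural similarity from local means, variances and covariances
    return tf.image.ssim(restored, reference, max_val=max_val)

def residual_map(restored, reference):
    # pixel-wise squared difference D_ij
    return tf.square(restored - reference)

a = tf.random.uniform((4, 64, 64, 1))
b = tf.clip_by_value(a + tf.random.normal(a.shape, stddev=0.05), 0.0, 1.0)
print(psnr(b, a).numpy(), ssim(b, a).numpy(), tf.reduce_mean(residual_map(b, a)).numpy())
```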
The chosen architecture is a Dn-CNN without any downsampling path (no pooling) and with a residual learning strategy. Its output units have a small receptive field, which means that only partial, local information and not the entire input is used to compute (and train) each output. The receptive field of the network is 41 x 41 pixels, and the input is formed by the real and imaginary parts of the complex k-space data for every single coil; since 8 coils are used, the number of input channels is 16. Thus, the Dn-CNN effectively denoises areas of the image based only on local information and not on the anatomical structures, since only a patch of the input is seen at each output unit.
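A Keras sketch of such a network is given below. It is not the exact model used in this work: the depth of 20 (which yields the 41 x 41 receptive field via the 2d + 1 rule above), the 64 filters, and the placement of the residual subtraction inside the model are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dncnn(channels=16, depth=20, filters=64):
    """DnCNN-style residual denoiser: 3x3 convolutions, BN + ReLU, no pooling."""
    inputs = layers.Input(shape=(None, None, channels))   # real+imag of 8 coils
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    for _ in range(depth - 2):
        x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    noise = layers.Conv2D(channels, 3, padding="same")(x)
    # residual learning: the network estimates the noise, the clean estimate is input - noise
    outputs = layers.Subtract()([inputs, noise])
    return tf.keras.Model(inputs, outputs)

model = build_dncnn()
model.summary()
```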
The network was trained for blind denoising, varying the standard deviation of the noise in the range σ ∈ [5, 20] · 10⁻³ in k-space. Considering that the starting quality of the original images is not homogeneous, and that scans acquired both with and without fat suppression are present, it is difficult to measure the effect of this noise on the whole dataset in terms of a single quality figure.
Overall, this random noise addition over this range of standard deviations produces images that are visibly damaged. The network will then learn to estimate the noise content and remove it. In a multi-coil scan with a strong correlation between the noise in the coils, this task tends to be particularly difficult at the image level, because there the noise is not stationary and depends on the intensity of the image.
The supervised learning approach requires ground truth images, and the learning process finds the set of weights that minimizes the loss function. The network that processes data in k-space maps a noisy signal to the clean one, and the minimization of the loss should reproduce the latter. Since our metrics are based on the reconstructed version of the image, a term comparing the reconstructed magnitude with the ground truth is added. With S_y the output of the network mapping the noisy input to the ground truth, the loss function used in the training is defined as:

L_k = MSE(S_y, S) + β · MSE(SoS(S_y), M)   (6.8)

where S is the ground truth signal in k-space and M is the ground truth magnitude image. The two terms in the equation compare, by means of the MSE, both the raw k-space output and its magnitude reconstruction with their respective ground truths. The parameter β is used to balance the two terms: it controls how much the reconstruction in magnitude space is weighted in the loss function. It is not kept constant during training.
The network is trained with the Adam [47] optimiser for 300 epochs with a learning rate of 3 · 10⁻³; then the learning rate is reduced to 3 · 10⁻⁴ and the network is trained until the validation loss stops decreasing.
The weight β is ramped up to its final value of 5 · 10³ between epoch 50 and 200. The final value is chosen so that the reconstruction term has the same order of magnitude as the k-space term.
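A sketch of eq. (6.8) as a TensorFlow loss is given below. It assumes the network output packs the real and imaginary parts of the 8 coils along the channel axis, and it omits the FFT centering and normalization used in the actual reconstruction; the function names are illustrative.

```python
import tensorflow as tf

def sos_from_channels(s, n_coils=8):
    """Rebuild complex coil images from (B, H, W, 2*n_coils) channels and apply SoS."""
    real, imag = s[..., :n_coils], s[..., n_coils:]
    coil_k = tf.transpose(tf.complex(real, imag), [0, 3, 1, 2])   # (B, coils, H, W)
    coil_img = tf.signal.ifft2d(coil_k)                           # inverse FFT per coil
    return tf.sqrt(tf.reduce_sum(tf.abs(coil_img) ** 2, axis=1))  # (B, H, W)

def k_space_loss(s_y, s_true, m_true, beta=1.0):
    """L_k = MSE(S_y, S) + beta * MSE(SoS(S_y), M), following eq. (6.8)."""
    k_term = tf.reduce_mean(tf.square(s_y - s_true))
    img_term = tf.reduce_mean(tf.square(sos_from_channels(s_y) - m_true))
    return k_term + beta * img_term
```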
The Noise2Noise (N2N) approach shows that it is possible to train a denoising neural network using only multiple corrupted examples of the same target, without explicitly using the clean target itself. This method allows exploiting datasets for which no clean ground truth exists.
The idea is based on the fact that the loss function L can be thought of as a generalized point estimator, i.e. an operator that uses sample data to calculate a single value which serves as the best guess of the unknown target; training then minimizes the expected loss over the distribution of the pairs of noisy/clean images (X, y). The conditional distribution P(y | X) describes all the possible correct observations that can be matched to a noisy example: a visualisation of this can be all the possible positions of an edge when the borders are noisy, or the exact value of a pixel corrupted by noise.
It can be noticed that if the target is replaced with another noisy observation that has the same expectation value (for example with additive white noise), then the solution of the minimization still holds; for example, in the case of the L2 loss it is possible to train using corrupted targets if their noise has zero mean. The loss function can then be written explicitly, remembering that the output of the network is compared with another noisy realization rather than with the clean image.
For the blind denoising task, there is no need to use the same standard deviation of noise for input and target, and the input to the network does not need to be of better quality with respect to the target: it is possible to ask the network to map an image with higher PSNR to another with lower PSNR without harming the training. This is particularly convenient for blind denoising, since the noise level can be unknown at every step.
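A minimal sketch of how noisy/noisy training pairs can be built for this setting is shown below; it uses uncorrelated white noise for simplicity, whereas the actual training uses the coil-correlated model of section 5.2, and the noise range follows the blind-denoising interval above.

```python
import tensorflow as tf

def make_noise2noise_pair(clean_kspace, sigma_range=(5e-3, 20e-3)):
    """Build (input, target) as two independent noisy realizations of the same scan.

    Neither realization needs to be cleaner than the other, and the noise level
    is drawn at random for blind denoising.
    """
    def corrupt(x):
        sigma = tf.random.uniform([], sigma_range[0], sigma_range[1])
        return x + tf.random.normal(tf.shape(x), stddev=sigma)

    return corrupt(clean_kspace), corrupt(clean_kspace)

# dataset of clean k-space tensors mapped to noisy/noisy training pairs:
# n2n_ds = kspace_ds.map(make_noise2noise_pair)
```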
In contrast to the supervised approach, the aim here is not to reproduce the target exactly. Figure 6.2 shows the loss function for the training and the validation sets: in the supervised case the training and validation losses decrease together and differ by a small amount, while in the N2N case the training loss is almost flat and the validation loss rapidly decreases. It is flat because the training task (mapping one noise realization to another) is effectively impossible to solve exactly, and rapidly decreasing because the weight gradients still point in the correct direction, so the network learns the clean signal.
The network is trained with the Adam [47] optimizer for 300 epochs with a learning rate of 3 · 10⁻³; afterward, the learning rate is reduced to 3 · 10⁻⁴, and the network is trained for another 100 epochs. Then the learning rate is reduced further.
7. Experimental results
The Dn-CNN is trained for a blind denoising task, so to test its performance different levels of image corruption are used. The test was performed on the final reconstructed image from the denoised k-space, without any post-processing. PSNR and SSI are the metrics used to evaluate the restored images and, to avoid that the large background area contributes too much to the calculation, each image is processed entirely but the metrics are computed only on a large central region of the signal, as shown for example in Figure 3.4 for the whole image and in Figure 3.10 for the central patch. This is crucial, since the background is very easy to treat: its presence would give exceedingly high scores that are not representative of the real performance.
The test was held on a portion of the data consisting of 100 patients and 2959 slices. In Table 7.1 the results for the supervised and unsupervised training are reported at high and low levels of noise; remarkably, at a high noise level, their performance matches.
Table 7.1: PSNR and SSI for the Supervised and Noise2Noise trainings at high and low noise levels.
Figures 7.1 to 7.4 show whole knee images, with and without fat suppression, illustrating the noise effect and the denoising with K-DnCNN for supervised and unsupervised training. In the restored images flat areas look smooth, edges are not blurred, and textures are preserved; moreover, new details absent in the original image (artifacts) are not generated.
Figure 7.1: Noise effect and denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at σ = 8 · 10⁻³ (low) and σ = 16 · 10⁻³ (high) noise levels.
Figure 7.2: Noise effect and denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at σ = 8 · 10⁻³ (low) and σ = 16 · 10⁻³ (high) noise levels.
Figure 7.3: Noise effect and denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at σ = 8 · 10⁻³ (low) and σ = 16 · 10⁻³ (high) noise levels.
Figure 7.4: Noise effect and denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at σ = 8 · 10⁻³ (low) and σ = 16 · 10⁻³ (high) noise levels.
Residual and SSIM maps for image patches are reported in figures 7.5-7.8, both at high and low noise levels. In the residual maps, each pixel is the squared difference between the pixel intensities of the ground truth and the corrupted or filtered image. In figure 7.5, for example, where a high level of noise was applied, most of the pixels in the GT-prediction residual map are close to black, which in the map refers to an identity mapping; this shows that the images filtered with K-DnCNN are close to the ground truth. Given the average value of the residual map in figure 7.5 (4.25 · 10⁻³ ≃ 0.004), it can be seen that, on average, this particular restored image at a high noise level is well recovered. Conversely, the GT-noised map has overall red pixels, which implies that the pixel intensities of the ground truth and of the noised image are not similar.
The SSIM maps measure the structural similarity between the GT and the recovered image: the higher the SSIM, the better, and values close to 1 mean identical data. In figure 7.6, for example, the pixels are mostly red and yellow, which means that the similarity is high. Even better results are obtained when denoising images at low levels of noise with supervised learning, as can be seen in figures 7.7 and 7.8: the average residual value is (7.54 · 10⁻⁴ ≃ 0.0007 ≃ 0), while the SS index is close to 1 (0.871 ≃ 1).
The distributions of the PSNR and SSI computed at slice level on the noisy and recovered images, along with the slice-wise gain, are also reported in figures 7.9 and 7.10. The distributions show that at a high noise level (16 · 10⁻³) the average PSNR of the noisy images is 18.9 ± 2.5 dB, with around 1200 slices in this range; after denoising, the PSNR for most of the slices falls into the range 25.6 ± 2.6 dB, which implies that the average PSNR gain for denoising at high levels of noise is 6.7 ± 1.8 dB. This gain shows that the supervised denoiser performs extremely well at high levels of noise. For the SSI index, values close to 1 are reported at slice level after denoising.
At low levels of noise (8 · 10⁻³), the K-DnCNN has a slightly lower, but still remarkable, overall gain.
FIGURE 7.9: Distribution of the a) PSNR and b) SSI computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the supervised training at noise level σ = 16 · 10⁻³.
FIGURE 7.10: Distribution of the a) PSNR and b) SSI computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the supervised training at noise level σ = 8 · 10⁻³.
Residual and SSIM maps for the image patches denoised with the unsupervised (Noise2Noise) approach are reported in figures 7.11-7.12, both at high and low noise levels. The unsupervised method reaches almost the same results as the supervised one. In figure 7.11, for instance, the average residual map value is (5.03 · 10⁻³ ≃ 0.005), while for the same image patch filtered by the supervised learning method an average value of (7.54 · 10⁻⁴ ≃ 0.0007) is obtained; it can be said that for high levels of noise the two methods are comparable. The same holds in figure 7.12 at low noise levels.
The distributions of the PSNR and SSI computed at slice level on the noisy and the recovered images for Noise2Noise, along with the slice-wise gain, are reported in figures 7.13 and 7.14.
In figure 7.13, where a high noise level is applied, the average PSNR gain for the restored images is (6.7 ± 2.4) dB, which is very similar to the average gain obtained with supervised learning; the average SSI gain obtained is also similar (0.3 ± 0.1).
However, in figure 7.14, where a low noise level is applied, less favourable results were reached in terms of PSNR gain: the structural similarity gain is the same as for the supervised approach, but the average PSNR gain reported is (4.0 ± 3.1) dB < (5.1 ± 2.1) dB.
FIGURE 7.13: Distribution of the a) PSNR and b) SSI computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the unsupervised (Noise2Noise) training at noise level σ = 16 · 10⁻³.
FIGURE 7.14: Distribution of the a) PSNR and b) SSI computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the unsupervised (Noise2Noise) training at noise level σ = 8 · 10⁻³.
The possibility of performing the denoising task on data of the same kind but with essential differences with respect to the data used during training is a valuable property in practice. For this reason, the Dn-CNN trained on the knee dataset was applied to the denoising of brain scans, which were never used for training. Also, the shapes, contrasts, and average intensities of a brain scan are visually very different from those commonly found in knee scans.
To test the Dn-CNN denoiser previously trained for the blind denoising task with supervised learning, 637 slices of brain scans from 255 patients present in the validation set of the FastMRI brain dataset were selected, and noise with σ = 16 · 10⁻³ and with the correlation between coils defined in section 5.2 was applied, simulating a highly noisy acquisition. Since the initial quality of the scans is very variable, because it depends on the acquisition used and on the noise already present, this noise injection produces a wide range of starting image quality.
The range of PSNR and SSI of the corrupted versions of the images and of the predictions is shown in Figure 7.16. The mean PSNR of the corrupted images is (23.6 ± 4.5) dB. After applying the denoiser, the average gain of image quality is 4.6 ± 2.7 dB for the PSNR and 0.2 ± 0.1 for the SSI, which signifies that, on average, the image is improved both in its intensity restoration and in its pixel correlations. A visual example is shown in figures 7.15d-f.
Figure 7.16: Results for the brain dataset. Distribution of the a) PSNR and b) SSI computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the supervised training at noise level σ = 16 · 10⁻³.
As a classical comparison, the Non-Local Means (NLM) algorithm [48] was applied. In natural and medical images, similar patterns repeatedly appear; the noise can be removed by taking advantage of this redundancy. The definition of similarity between the patch around a noisy pixel and its spatially local neighborhood patches in NLM is not strict, it is simply calculated from the patch intensities. The basic principle of non-local means denoising is to replace the noisy value I(i) of pixel i with a weighted average of all the pixels in the image:

Î(i) = Σ_{j ∈ N_i^s} w_ij I(j)   (7.1)

where N_i^s is the search window of size (2n + 1) × (2n + 1) centered at i, and w_ij is the weight of the two pixels i and j, which is calculated depending on the similarity of their patches and is defined as:
w_ij = (1/Z_i) · exp( − |N^d(i) − N^d(j)|² / h² )   (7.2)

where Z_i is a normalising term, Z_i = Σ_j w_ij, and h acts as a filtering parameter.
The nlmeans function was imported from the Dipy library². A few of its parameters can be cited (a usage sketch is given after the footnote below):
• sigma: the standard deviation of the noise, which can be estimated from the data; for this estimate it is also possible to specify the number of used coils of the receiver array.
• patch_radius: the similar patches in the non-local means are searched for locally, inside a cube of side 2·v + 1 centered at the voxel of interest.
• rician: whether to apply the Rician bias correction of the non-local means implementation.
2https://dipy.org/documentation/1.4.1./examples_built/denoise_nlmeans/#example-denoise-
nlmeans
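The following sketch illustrates how the Dipy functions can be called; the file name and parameter values are placeholders and may differ from the tuning actually used in this work.

```python
import numpy as np
from dipy.denoise.nlmeans import nlmeans
from dipy.denoise.noise_estimate import estimate_sigma

# 3D magnitude volume reconstructed from the brain k-space, stored as a NumPy array
volume = np.load("brain_volume.npy")

# estimate the noise standard deviation; N is the number of receiver coils
sigma = estimate_sigma(volume, N=8)

denoised = nlmeans(
    volume,
    sigma=sigma,
    patch_radius=1,   # patches of side 2*1 + 1 = 3 voxels
    block_radius=5,   # search window of side 2*5 + 1 = 11 voxels
    rician=True,      # apply the Rician bias correction
)
```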
The NLM denoising result on a single slice from the brain dataset is shown in figure 7.17.
The scatter plot of the three denoising algorithms used on the brain dataset shows the PSNR of the denoised slices for the three different algorithms. The blue color refers to K-DnCNN, green to K-N2N, and orange to NLM; it can be seen from the plot that most of the images processed by K-DnCNN and K-N2N have higher PSNR values than the ones processed with NLM.
In figures 7.19 and 7.20, the PSNR and SSI gains at multiple noise levels are also compared for the three algorithms. It is clear that (with the chosen tuning for the NLM) DnCNN and N2N show superior performance with respect to NLM, both in PSNR and in SSI.
8. Conclusion
The denoising task was performed on the k-space raw data of the FastMRI dataset, the largest and most complete dataset of this type. In addition, a method that does not require clean targets (Noise2Noise) was also tested.
Advantage was taken of the power of the residual learning framework, performing the denoising task with residual learning over the frequency data rather than directly on the image data. The method was applied to knee scans containing different anatomical parts such as the muscle, the bones, and the cartilage.
Results were quantified using PSNR and SSI: the first measures the pixel-wise restoration of the true intensity, and the second the correct reproduction of the structures.
In the tests, the metrics improved after the action of the denoiser in a blind denoising task, both at high and low levels of noise. This is important, since our denoiser seems to generalize the task of denoising to both low and high noise levels; this matters also because the starting quality of MRI scans is not homogeneous, so a blind denoiser can be put to good use.
The supervised and the Noise2Noise approaches achieved comparable performance on the task; the metrics showed a clear similarity between the two methods. The N2N training can be performed on a dataset even of small size, since it was decided to work in k-space, where the noise is additive. The applicability of N2N is also very practical, since it is often impossible to obtain a clean ground truth in a clinical setting. Details were usually restored in the low-noise case, but one still has to be careful about maintaining the vital aspect of the solution, which is retaining the diagnostic information: for clinical analysis, quantitative results are the ones that matter the most.
Another good point to consider would be to make use of all the coils instead of only 8; this would surely improve the results, since more signal from the same acquisition would be available and the network could exploit this information. 8 coils were taken in the first place instead of 15 to keep the number of input channels suitable for taking advantage of the Tensor Cores. One solution would be to add a dummy coil to reach a convenient number of channels.
Another scenario is when the number of available noisy copies of the same subject is relatively small (a dozen copies). In this case, it is still possible to try N2N with the finite availability of noisy examples, and it may still be a viable approach, although it is constrained by the number of noisy examples if the noise is not generated from a model.
The chosen network works in a partial context, since its receptive field is smaller than the image. Nevertheless, there are still many ways to enhance the chosen architecture while keeping the original shape and number of parameters, for example by choosing the correct building blocks for denoising.
A new idea could be implementing a layer block derived from ResNet [49], perhaps as a replacement for our convolutional layer; this would keep the same overall structure. The present results serve as a benchmark and a simple proof, for this thesis, that denoising in k-space is an effective approach: both on the level of PSNR and SSI, the image quality and the similarity to the original image proved to be consistent. In general, the goal of this work was to show that taking advantage of the additivity of the noise in k-space is preferable to denoising directly in magnitude space.
References
1. Mohan, J., Krishnaveni, V., and Guo, Y. (2014). A Survey on the Magnetic
2. Tomasi, C., and Manduchi, R. (1998). “Bilateral Filtering for gray and Color
4. Phophalia A, Rajwade A, Mitra SK. Rough set based image de-noising for
6. Xu, J., Huang, Y., Cheng, M.-M., Liu, L., Zhu, F., Xu, Z., et al. (2020).
2015;1(1):60-84
resonance imaging of bone and soft tissue tumors: early experience in 31 patients
[13] D. L. Collins, A.P. Zijdenbos, V. Kollokian, J.G. Sled, N.J. Kabani, C.J.
Holmes, and A.C. Evans. Design and construction of a realistic digital brain
[14] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John
C Morris, and Randy L Buckner. Open access series of imaging studies (oasis):
cross-sectional mri data in young, middle aged, nondemented, and demented older
[17] Hakon Gudbjartsson and Samuel Patz. The rician distribution of noisy mri
[19] Pierre Gravel, Gilles Beaudoin, and Jacques A De Guise. A method for
[20] Alessandro Foi. Noise estimation and removal in mr imaging: The variance
Montserrat Robles, D Louis Collins, et al. Robust rician noise estimation for mr
[23] Jose V Manjon, Pierrick Coupe, and Antonio Buades. Mri noise
estimation and denoising using non-local pca. Medical image analysis, 22(1):35–
47, 2015.
[24] Guido Gerig, Olaf Kubler, Ron Kikinis, and Ferenc A Jolesz. Nonlinear
11(2):221–232, 1992.
[26] Jan Sijbers, Arnold J. den Dekker, Paul Scheunders, and Dirk Van Dyck.
Robles. Mri denoising using non local means. Medical Image Analysis, 12:514–
523, 2008.
Ryoi Goto, Ryuta Kawashima, and Hiroshi Fukuda. Correlations among brain
gray matter volumes, age, gender, and hemisphere in healthy individuals. PloS
NMR imaging”. In: Magn. Reson. Med. 3.4, pp. 604–618. ISSN: 0740-3194.
DOI: 10.1002/mrm.1910030413.
[30] Raya, José G. et al. (Jan. 2010). “T2 measurement in articular cartilage:
Impact of the fitting method on accuracy and precision at low SNR”. In: Magn.
URL: https://doi.org/10.1002/mrm.22178.
[31] Dietrich, Olaf, Sabine Heiland, and Klaus Sartor (Mar. 2001). “Noise
SNR”. In: Magn. Reson. Med. 45.3, pp. 448–453. ISSN: 0740-3194. URL:
https://doi.org/10.1002/1522-2594(200103)45:3<448::AID-
MRM1059>3.0.CO;2-W.
[32] Glenn, G. Russell, Ali Tabesh, and Jens H. Jensen (2015). “A simple noise
Estimate Sensitivity and Detection Limits for 19F Magnetic Resonance Imaging”.
[34] Fan, Linwei et al. (Dec. 2019). “Brief review of image denoising techniques”.
Deep CNN for Image Denoising”. In: IEEE Transactions on Image Processing
[36] Arbelaez, Pablo et al. (May 2011). “Contour Detection and Hierarchical
Image Segmentation”. In: IEEE Trans. Pattern Anal. Mach. Intell. 33.5, pp.
parameters for Wiener Hunt deconvolution”. In: Journal of the Optical Society
[38] Liu, F. et al. (2017). “Fast Realistic MRI Simulations Based on Generalized
[40] Lu, Le et al. (2017). Deep Learning and Convolutional Neural Networks for
Datasets.
[42] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep
[44] LeCun, Haffner, Bottou and Bengio (1998). Object recognition with
Gradient-Based Learning.
[45] Zbontar, Jure et al. (2019). fastMRI: An Open Dataset and Benchmarks for
Stochastic Optimization.
[48] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image
[49] He, Kaiming et al. (2016). Identity Mappings in Deep Residual Networks.
convolutional neural network, IEEE Signal Process. Lett. 24 (12) (2017) 1763–
1767.