Deep Learning

Using Deep Learning to de-noise MRI images
Article · October 2021
1 author:
Hamza Bouzidi
Sapienza University of Rome
All content following this page was uploaded by Hamza Bouzidi on 21 October 2021.

Deep Neural Networks for N-MRI image processing
Candidate: Hamza Bouzidi
Advisor: Christian Napoli
Facoltà di Ingegneria dell'informazione, informatica e statistica
Dipartimento di Ingegneria Informatica, Automatica E Gestionale

Corso di laurea in Engineering in Computer Science
Hamza Bouzidi
Matricola 1915027
Advisor: Christian Napoli
Co-Advisor: Dr. Stefano Giagu
A.A. 2020-2021
Table of Contents
1. Introduction
2. Magnetic Resonance Imaging
3. Denoising and Rician noise
3.1 The problem and approach
3.2 Proof of concept
3.2.1 Training and noise generation
3.2.2 Loss function in k-space
3.2.3 Testing
3.2.4 Results
4. Deep Learning and Convolutional Neural Networks
4.1 Neural Networks
4.1.1 Activation functions and hidden units
4.1.2 Learning from examples, loss function and training
4.2 Convolutional Neural Networks
4.2.1 Convolutional Operation
4.2.2 Pooling Layer
4.2.3 CNN architecture
4.2.4 The fully convolutional architecture
4.4 Denoising using Residual Learning
4.5 Batch Normalization
5. Dataset and Preprocessing
5.1 The MRI dataset
5.2 Noise and signal model for multiple correlated coils
5.3 Data preprocessing
5.3.1 Low-pass filtering and reduced number of coils
5.3.2 Mixed precision
6. Implemented Methodology
6.1 Optimizing the input pipeline
6.2 Metrics
6.3 Training the K-DnCNN to the FastMRI
6.3.1 K-DnCNN using Supervised Learning
6.3.2 K-DnCNN using Unsupervised Learning
7. Experimental Results
7.1 Qualitative results
7.2 Quantitative Results
7.2.1 Supervised Learning
7.2.2 Unsupervised Learning
7.2.3 Application on the Brain Dataset
7.3 Comparison with a state-of-the-art method
7.3.1 Non Local Means
7.3.2 Denoising the fastMRI and results
8. Conclusions
1. Introduction
In current clinical practice, medical images have become very
prominent in diagnosing and treating several diseases. Medical images
assist practitioners in identifying a disease, locating
abnormal sites, monitoring tumor size, etc. Among the different medical
imaging modalities, Computed Tomography (CT), Magnetic Resonance
Imaging (MRI), Positron Emission Tomography (PET), and Ultrasound are
widely used by physicians.
More sophisticated machines and non-invasive techniques have made
medical imaging popular, thereby making diagnosis more accurate.
However, accurate diagnosis presupposes noise-free images,
which remain elusive.
Magnetic Resonance Imaging (MRI), which plays a vital role in clinical
diagnosis by producing high-quality 2D and 3D images of the body, is also
degraded by noise at acquisition time due to imperfections in the
radio-frequency coils or movements of the patient. Noise in MRI scans not
only affects the performance of computerized diagnosis systems but also
makes the manual inspection of a disease difficult.
Hence, estimation and removal of noise from MR images are essential for
proper interpretation, analysis, accurate parameter estimation, and further
preprocessing. Noise remains one of the principal causes of quality
deterioration in MRI, causing artifacts and blurring, and is the subject of a
large number of papers in the MRI literature. Many denoising and
enhancement techniques have been applied [1–8], and a number of them are
based on deep learning.
Recently, deep learning methods have been proposed to denoise natural
images using different architectures [35,50]. Most of these methods use
supervised learning, training different architectures with pairs of noisy
inputs and noise-free target outputs. Such learning-based
methods try to infer the clean image from the noisy input. One of the main
benefits of these techniques is that, after training, denoising can be applied
extremely fast. Convolutional Neural Networks (CNNs), for example, have
obtained remarkable performance on image denoising [35]. However, the
denoising of MRI images using CNNs has not been extensively studied in
the literature.
Moreover, it has been shown that the noise model present in MR images is
very different from that of natural images [16]. This is due to several
unique sources of noise and their combination. Thus,
techniques used in natural image denoising may not work
correctly for medical images. Therefore, another motivation is to investigate
such noise and explore techniques for its removal.
Motivated by recent advancements in CNNs and by the particular noise model
of MRI, this thesis proposes a novel CNN-based method that denoises MRI
raw data in k-space (the frequency domain) rather than in magnitude space.
The reason for applying the denoiser network to this type of data is that the
noise model is simpler in the frequency domain: it is additive and Gaussian.

2. Magnetic Resonance Imaging

Magnetic Resonance Imaging (MRI) is one of the most prevalent clinical
imaging methods, developed alongside Computed Tomography (CT) and X-ray
technology. It is free of ionizing radiation; hence it is a
non-invasive technique and safer than CT, X-ray, and other techniques. It
also provides better soft-tissue contrast and image resolution for
diagnostic purposes [9], [10]. The MRI modality is built upon the phenomenon
of Nuclear Magnetic Resonance (NMR), which was discovered by F. Bloch and E.
Purcell independently in 1946 (both were awarded the Nobel Prize in 1952).
Further investigation of the NMR phenomenon made it useful for human society
through notable works by Richard Ernst, Paul C. Lauterbur, and Sir Peter
Mansfield, who won the Nobel Prize in 1991, 2003, and 2003, respectively.
The principle of NMR draws on both quantum and classical mechanics and
concerns the precession of protons (present in the human body) under an
external magnetic field. The crux of the MRI modality is the use of the
abundance of hydrogen nuclei present in the human body in
the form of water. The protons of the H-atoms are aligned by the external
field. When excited by a radio frequency (RF) pulse, the protons absorb
energy; as they relax, they release it, generating an electromagnetic signal
that is recorded by the receiver coils of the MRI scanner. These
electromagnetic signals are encoded in phase and frequency components.
The inverse Fourier transform of the raw data, also known as k-space,
generates an image, either 2- or 3-dimensional.
The reconstruction process from raw signal to image space provides the
added choice of generating any particular slice in 2D form or a complete
volumetric (3D) representation [11], [12]. In addition, MRI offers various
modalities, namely T1, T2, and PD (Proton Density), shown in Figure 2.1.
Figure 2.1: Sample images from the simulated BrainWeb Database [13]
Although a versatile technique, the quality of the image is often affected
during the image acquisition process. The artifacts can mainly be classified
as:
• Hardware Related: such as power supply instability, thermal noise etc.
• Software Related: such as error in decoding pulse sequence, intensity
inhomogeneity etc.
• Patient Related: such as body movement, holding breath for a long time,
blood flow etc.
• Physics Related: such as magnetic susceptibility, Gibbs ringing artifacts
etc.
Many of the artifacts mentioned above are taken care of by the MR scanner
itself. Some noise/artifacts still remain in the scan and need to be
removed. Otherwise, they may affect the post-processing steps, which involve
tissue identification, tissue segmentation, and other diagnostic decisions.
In the reconstruction step, there is always uncertainty involved due to
the sampling involved in going from the Fourier domain to the spatial
domain, the interpolation techniques used, etc. This uncertainty concerns
whether a spatial location represents actual tissue information of the
subject, or whether the proper signal has been affected by the encoding
scheme or by the effect of its neighbourhood. This uncertainty leads to
undesirable visual effects, commonly referred to as noise, which need to be
overcome by software/mathematical modelling (referred to as the Image
Denoising Problem). Figure 2.2 shows two real sample images of different
(human) subjects from benchmark databases [14], [15] where noise is clearly
visible. The image denoising problem is, in fact, an inverse problem that
tries to reconstruct the true noise-free image [9]; hence, it can ease image
segmentation, disease identification, etc.
Figure 2.2: Sample images from real databases. Left: Oasis [14]; right: BRATS [15].
The acquisition process of medical images is sensitive to noise and undesired
signals. Since noise is an inherent part of MRI data, denoising becomes a
crucial ingredient of the medical image analysis process. Hence, there are
two sets of problems: (a) estimation and analysis of the noise model and its
parameters, and of other artifacts such as intensity inhomogeneity, bias
correction, etc., and (b) construction of adaptive models for denoising
purposes. These can be considered independent problems, or the first can be
used as guided input for the second. An inaccurate noise model casts doubt on
the reliability of the denoising method. Traditionally, the Gaussian model is
preferred at high-SNR locations in MRI [16]. A lot of effort has been put
into building a statistical noise model for MRI [16], [17], [18], [19].
Similarly, efforts have been made to estimate the parameters of these models
[20], [21], [22], [23].
On the other hand, denoising methods developed according to the noise model
of MRI are highly sought after. In this regard, many conventional methods
have been modified to suit the nature of MRI data [24], [25], [26],
[27]. However, one needs to take care of the tissue and
boundary information in the image and keep it intact at the end of the
denoising process. In fact, Cerebrospinal Fluid (CSF), Gray Matter (GM),
and White Matter (WM) play a significant role in differentiating a healthy
brain from an abnormal one, and also in clinical examinations [28]. Even a
small change in them may lead to a wrong clinical decision. Hence, any
preprocessing step must preserve the structure and properties of the tissue
as in the human subject. An extensive review of denoising methods in MRI can
be found in [1].
3. Denoising and Rician noise

Figure 3.1 explains the process of acquiring an MR image in the frequency
domain (k-space); k-space is the 2D or 3D Fourier transform of the measured
MR image. Its complex values are sampled during an MR
measurement in a scheme controlled by a pulse sequence, i.e., an accurately
timed sequence of radiofrequency and gradient pulses. In practice, k-space
often refers to the temporary image space, usually a matrix, in which data
from digitized MR signals are stored during data acquisition; the data are
provided by a quadrature detector that yields the real and imaginary parts of
the signal. Each part of the signal is assumed to be affected by white noise,
whose main source is the RF coil resistance [29]. The final effect
on the quality of the images depends on a variety of factors such as the
pixel dimension, the duration of the acquisition, and the receiver
bandwidth.
Figure 3.1: MRI data are acquired in the frequency domain first, and then
inverse Fourier transformed to geometric space. The
magnitude image is then obtained by a pixel-by-pixel calculation on the
complex x-space image.
The real and imaginary parts from k-space are transformed to x-space through
the complex Fourier transform. The noise in x-space is still
Gaussian, and the real and imaginary parts can be assumed to be
uncorrelated since the Fourier transform is a linear and orthogonal
transform [17]. The magnitude image is then computed pixel by pixel
from the complex image. The noise in magnitude
space is no longer additive and no longer Gaussian.
The images usually obtained in MRI are magnitude images. Images
derived from the phase of the complex image can also be
found, but magnitude images remain the most common, and this
study is based on them, since discarding the phase information
avoids phase artifacts.
Magnitude images cannot be divided into a signal part and a
noise part since, as said earlier, the noise is no longer additive. The
probability distribution of the intensities in a noisy magnitude image M,
reconstructed from a signal image I and Gaussian noise of standard
deviation σ, is given by:
p(M) = \frac{M}{\sigma^2} \, e^{-\frac{I^2 + M^2}{2\sigma^2}} \, I_0\left(\frac{I M}{\sigma^2}\right)    (3.1)
where I_0 is the modified Bessel function of the first kind of order zero.
This is also called the Rice distribution. A Gaussian approximation of this
distribution can be made only if I/σ >> 1 (I/σ being the signal-to-noise
ratio in x-space). A magnitude image with a high level of noise will
therefore be far from this Gaussian approximation, and it will suffer from
what is called the “Rician bias”.
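The Rice distribution and the resulting bias can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `rice_pdf` and `simulate_magnitude`, the sample sizes, and the two signal levels are illustrative choices, not part of the thesis.

```python
import numpy as np

def rice_pdf(m, signal, sigma):
    """Rice density p(M) of Eq. (3.1) for true signal I = `signal` and noise std `sigma`."""
    m = np.asarray(m, dtype=float)
    # np.i0 is the modified Bessel function of the first kind, order 0.
    return (m / sigma**2) * np.exp(-(m**2 + signal**2) / (2 * sigma**2)) * np.i0(m * signal / sigma**2)

def simulate_magnitude(signal, sigma, n, rng):
    """Magnitude of a complex sample whose real and imaginary parts carry Gaussian noise."""
    real = signal + rng.normal(0.0, sigma, n)
    imag = rng.normal(0.0, sigma, n)
    return np.hypot(real, imag)

rng = np.random.default_rng(0)
sigma = 1.0
for signal in (0.5, 10.0):  # low and high I/sigma (illustrative values)
    mags = simulate_magnitude(signal, sigma, 200_000, rng)
    print(f"I/sigma = {signal / sigma:4.1f}   mean(M) - I = {mags.mean() - signal:.3f}")
```

At low I/σ the mean of the magnitude is noticeably larger than the true signal, while at high I/σ the difference becomes small, which is the behaviour described above as the Rician bias.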
As a consequence, clinical MRI with low SNR, besides being hard to read
and interpret, can lead to erroneous quantification of physical
quantities. According to [30], in T2 relaxation images, the accuracy and
precision of the measured T2 may be substantially impaired by the low
signal-to-noise ratio of images available from clinical examinations. In
diffusion-weighted images, the decreasing SNR at increasing diffusion
weighting causes systematic errors when calculating apparent diffusion
coefficients [31]. In morphological scans, the effect can be less important
since new-generation scanners can achieve excellent imaging quality, but
Rician noise still causes problems in many new acquisition
modalities [32], or when the signal is low because of the low concentration of
the excited nuclei, as in the case of fluorine magnetic resonance imaging [33].
3.1 The problem and approach
Image denoising is the task of removing the effect of noise from an image:
denoising should restore the noisy image to
the condition it was in before the noise was applied (the original image).
The performance of a denoiser is evaluated by how
close the restored image is to the original one. Still, the denoised
image can inevitably lose some details in the process, since
noise, edges, and texture are all high-frequency components [34]. For this
reason, image denoising is considered a classic problem, and many solutions,
as described in the introduction, have been proposed. The features
of a good denoiser can be defined as follows: flat areas should be smooth,
edges should not be blurred, textures should be preserved, and new details
not present in the original image (called artifacts) should not be introduced.
In the past few years, several deep learning methods have been proposed to
denoise MR images by training different architectures with pairs of noisy
and noise-free training patterns in a supervised learning setting.
Our experiment seeks to evaluate the performance of a feed-forward
denoising convolutional neural network (DnCNN) applied to the
denoising of magnitude images, operating both on the image itself and on the
raw data in k-space.
The method is inspired by the work of Zhang and collaborators
[35], which uses a strategy based on a deep architecture (bringing deep
learning and regularisation methods into image denoising). Residual Learning
(RL) and Batch Normalization are both used to speed up the training process
and to boost the denoising performance. RL aims to gradually remove the latent
clean image in the hidden layers, so as to separate the noise from the
original image.
As an initial test, the network consists only of convolutional operations,
detector stages, and batch normalization, as explained in the next section. It
is composed as follows:
• First layer: 2D convolution with ReLU activation, 64 filters of size
3x3x1.
• Layers 2 to (D-1): 2D convolution with Batch Normalization and ReLU
activation, 64 filters of size 3x3x64.
• Last layer (D): 2D convolution with linear activation, with a residual layer.
More details about neural networks and residual learning are given
in the next section, and a schematic representation is given in
Section 6.3.
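The layer list above can be written down as a small specification to make the structure and the size of each convolution explicit. This is only a sketch in plain Python: the depth D is not fixed in this section, so `depth=5` is an arbitrary illustrative value, the single output channel of the last layer is an assumption, and the helper names are hypothetical.

```python
def dncnn_layer_specs(depth, filters=64, kernel=3, in_channels=1):
    """Describe the layers of the test network as a list of dictionaries."""
    if depth < 3:
        raise ValueError("depth must be at least 3")
    # First layer: convolution + ReLU, 64 filters of size 3x3x1.
    specs = [dict(name="conv_1", kernel=kernel, in_ch=in_channels, out_ch=filters,
                  batch_norm=False, activation="relu")]
    # Layers 2 to D-1: convolution + batch normalization + ReLU, 64 filters of size 3x3x64.
    for i in range(2, depth):
        specs.append(dict(name=f"conv_{i}", kernel=kernel, in_ch=filters, out_ch=filters,
                          batch_norm=True, activation="relu"))
    # Last layer: linear convolution; one output channel is assumed here.
    specs.append(dict(name=f"conv_{depth}", kernel=kernel, in_ch=filters, out_ch=in_channels,
                      batch_norm=False, activation="linear"))
    return specs

def conv_weight_count(spec):
    """Number of convolution weights (biases and batch-norm parameters excluded)."""
    return spec["kernel"] * spec["kernel"] * spec["in_ch"] * spec["out_ch"]

for spec in dncnn_layer_specs(depth=5):
    print(spec["name"], spec["activation"], conv_weight_count(spec))
```

The count makes visible that almost all weights sit in the intermediate 3x3x64 layers, so the depth D largely determines the size of the model.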
This test compares the performance of the denoiser on additive Gaussian
noise with its performance on Rician-distributed noise, for the same image
quality in terms of PSNR.
The images affected by Gaussian and Rician noise are given,
respectively, by:
M_{gauss} = I + \epsilon    (3.2)
M_{rician} = \sqrt{(I + \epsilon_1)^2 + \epsilon_2^2}    (3.3)

where the left-hand sides are the images after adding Gaussian and
Rician noise, I is the original image, and ϵ (like ϵ1 and ϵ2) is zero-mean
Gaussian noise with standard deviation σ, drawn at random from [35,60). The
model is trained on 400 images from the training set of the "BSDS500"
dataset, which is described in [36] and is often used as a
benchmark for denoising tasks. Results for denoising additive
Gaussian noise are reported in Figures 3.2 and 3.4a, and for Rician noise
in Figures 3.3 and 3.4b; the metrics used to evaluate the performance are
defined in Section 6.2. The results are compared to a standard image
filtering technique, the Wiener filter, which is highly effective for white
noise removal [37].
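The two noise models of Eqs. (3.2) and (3.3) can be sketched as follows, assuming NumPy. The synthetic ramp image stands in for a BSDS500 image, and the `psnr` helper uses the common definition 10·log10(peak²/MSE) with a peak of 255, which is assumed here to match the metric of Section 6.2.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Eq. (3.2): additive zero-mean Gaussian noise."""
    return image + rng.normal(0.0, sigma, image.shape)

def add_rician_noise(image, sigma, rng):
    """Eq. (3.3): magnitude of the image with independent noise on two channels."""
    e1 = rng.normal(0.0, sigma, image.shape)
    e2 = rng.normal(0.0, sigma, image.shape)
    return np.sqrt((image + e1) ** 2 + e2 ** 2)

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (assumed definition, peak = 255)."""
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
image = np.tile(np.linspace(0.0, 255.0, 64), (64, 1))  # synthetic stand-in image
sigma = rng.uniform(35.0, 60.0)                        # noise level drawn from [35, 60)

print("sigma:", round(sigma, 1))
print("PSNR Gaussian:", round(psnr(image, add_gaussian_noise(image, sigma, rng)), 2))
print("PSNR Rician:  ", round(psnr(image, add_rician_noise(image, sigma, rng)), 2))
```

Unlike the Gaussian case, the Rician image is never negative and its mean is shifted upwards in dark regions, so the same σ does not produce the same kind of degradation.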
The results show that, when comparing the restoration of images at the same
level of degradation (expressed as PSNR with respect to the original image),
Gaussian denoising is more effective than Rician denoising, and the
DnCNN always performs better than a Wiener filter applied to the same image
affected by Gaussian noise.
Figure 3.2: Performance of the DnCNN on Gaussian blind denoising. Clockwise from left:
the noisy version of the image, the image processed by DnCNN, the original image, the
image processed with the Wiener filter. Each point is an image in the test set, points of the
same color refer to the same level of applied noise, and the dotted red line marks no
improvement in PSNR after the application of the denoiser [PSNR(processed) =
PSNR(noisy)].
Figure 3.3: Performance of the DnCNN on Rician blind denoising. Clockwise from left: the
noisy version of the image, the image processed by DnCNN, the original image, the image
processed with the Wiener filter. Each point is an image in the test set, points of the same
color refer to the same level of applied noise, and the dotted red line marks no improvement
in PSNR after the application of the denoiser [PSNR(processed) = PSNR(noisy)].
Figure 3.4: Average PSNR on the test dataset after the application
of the DnCNN (blue) and of the Wiener filter (orange) as a function of
the standard deviation of the noise, for the Gaussian and Rician models.
For blind Gaussian denoising, DnCNN is always better than the Wiener filter.
In the case of Rician noise, the Wiener filter outperforms DnCNN on PSNR at high noise levels.
In the next section, the theory of neural networks is explained, with a
deep focus on the architecture of CNNs. Then, the
performance of the newly proposed denoising method is tested on
simulated data that is easily accessible, and later validated on already
collected data from an open dataset.
3.2 Proof of concept
In order to perform supervised learning, the network needs to be provided
with pairs of corrupted and noiseless images, the latter serving as ground
truth. The ground truth provides an example of the expected
output. Supervised learning is discussed in more detail in
Section 4.1.2.
First, the methods were validated on simulated data. This means that the work
should be considered a "proof of concept": it may not be directly applicable
to real data, but it makes it possible to control every step of the pipeline.
To generate the dataset, an MRI simulation was performed in MRiLab [38],
a comprehensive simulator for large-scale realistic MRI simulations.
MRiLab combines realistic tissue modeling with numerical virtualization of
an MRI system and scanning experiment to assess a broad range of MRI
modalities.
Realistic simulation can be performed with plausible biological phantoms
modeled as large 3D objects with biologically relevant tissue models. The
computational power needed for the simulation is gained using parallelized
execution on GPU.
Shape, position, rotation, and dimensions of organs can vary in a
predefined interval; an example can be seen in Figure 3.4.

Figure 3.4 : Examples of phantom generated.

Each phantom is a 3D model with organs of different sizes, shape, orientation and position.
(a) and (b) are two examples of phantoms used in training
3.2.1 Training and noise generation

Once the noise-free dataset is formed, noisy data is generated by adding
complex white noise in the frequency domain. The noise standard deviation
in k-space, σk, is chosen to generate images with an SNR value between 5
and 2.
Figure 3.5 shows the effect on the magnitude images of adding noise in
k-space.
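The noise generation step can be sketched with NumPy's FFT routines. This is an illustration only: the orthonormal FFT convention (`norm="ortho"`), the centring with `fftshift`, the random stand-in image, and the value of `sigma_k` are assumptions, and the SNR definition used in the thesis to select σk is not reproduced here.

```python
import numpy as np

def to_kspace(image):
    """Centred 2D Fourier transform of an image (orthonormal convention assumed)."""
    return np.fft.fftshift(np.fft.fft2(image, norm="ortho"))

def reco(kspace):
    """Magnitude image: modulus of the 2D inverse Fourier transform."""
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace), norm="ortho"))

def add_kspace_noise(kspace, sigma_k, rng):
    """Add complex white Gaussian noise with std sigma_k on real and imaginary parts."""
    noise = rng.normal(0.0, sigma_k, kspace.shape) + 1j * rng.normal(0.0, sigma_k, kspace.shape)
    return kspace + noise

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, (64, 64))      # stand-in for a noise-free simulated slice
noisy_kspace = add_kspace_noise(to_kspace(clean), sigma_k=0.2, rng=rng)
noisy_magnitude = reco(noisy_kspace)
print("mean absolute error in magnitude space:", np.abs(noisy_magnitude - clean).mean())
```

With the orthonormal convention the noise keeps the same standard deviation in x-space, and taking the modulus then turns the additive Gaussian noise into Rician-distributed noise in the magnitude image.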
The DnCNN was trained for the task of denoising directly in k-space; this
network is referred to as Kspace-Dn. The training is performed on pairs of
clean and noisy images by minimizing the loss defined in Section 3.2.2.
To compare the results with a network of the same complexity, the
DnCNN was also trained on the noisy images in magnitude space; this
network is called M-Dn.
Both networks are trained for 300 epochs with Adam [47] and a
learning rate lr = 10^-3; the learning rate is then reduced to lr = 10^-4 and
the networks are trained for another 100 epochs.
Figure 3.5: Noise effect on simulated MRI data. Magnitude image reconstruction
after the addition of noise.
3.2.2 Loss Function in k-space

The network that processes data in k-space has to map the corrupted
version to the clean one. The mapping should be accurate both in k-space
and in the final reconstructed magnitude image, which is why a term is added
to the loss function to help it concentrate on the final image.
The loss used for the k-space denoiser is:
L_K = MSE(S_Y, S) + \beta \cdot MSE(reco(S_Y), M)    (3.4)

where S_Y is the two-channel output of the network, S = (S_R, S_I) is the
ground truth in k-space (real and imaginary parts of the signal), M is the
ground truth in magnitude space, β is a weighting factor, reco(S_Y) is the
reconstruction of the two-channel output obtained by taking the modulus of
the 2D Inverse Discrete Fourier Transform (iDFT), and MSE is the mean
squared error.
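The loss of Eq. (3.4) can be written out in a few lines of NumPy. This is a minimal sketch, not the training code of the thesis: the channel-first layout (2, H, W), the default FFT normalization without shifting, and the value of `beta` are all assumptions made for illustration.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def reco(two_channel):
    """Modulus of the 2D inverse DFT of a (2, H, W) array holding real and imaginary parts."""
    kspace = two_channel[0] + 1j * two_channel[1]
    return np.abs(np.fft.ifft2(kspace))

def kspace_loss(s_y, s, m, beta):
    """Eq. (3.4): k-space MSE plus beta times the MSE on the reconstructed magnitude image."""
    return mse(s_y, s) + beta * mse(reco(s_y), m)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, (32, 32))              # stand-in ground-truth image
k = np.fft.fft2(image)
s = np.stack([k.real, k.imag])                       # ground truth in k-space, (S_R, S_I)
m = reco(s)                                          # ground truth in magnitude space
s_y = s + rng.normal(0.0, 0.5, s.shape)              # stand-in for an imperfect network output
print("loss:", kspace_loss(s_y, s, m, beta=0.5))
```

The second term only sees the network output through the reconstruction, so errors in k-space that have little effect on the magnitude image are penalized less than those that visibly change it.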
3.2.3 Testing
To test the networks, data from a realistic brain phantom available
in the simulation software [38] is used, in order to check the noise removal
capabilities on fine details never seen during training. This test is a
measure of the generalization capability of the networks.
3.2.4 Results
Figure 3.6 : Denoising results for (a) axial and (b) coronal views of
the brain phantom both on k-space and magnitude space.
Denoising in magnitude seems to create smoother surfaces at the
cost of losing details.
In this test, as described in Section 3.2.1, two networks were
trained, Kspace-Dn and M-Dn, with pairs of original and noisy simulated
MRIs of the simple phantom. Figure 3.6 shows a few images
denoised with both Kspace-Dn and M-Dn for visual inspection. M-Dn favors
smoother surfaces at the cost of losing details; the reason
may be an incorrect noise estimation during the blind denoising task. The
network that works in k-space always outperforms the
network trained with magnitude images in terms of image quality. This was
expected from the experiments carried out in Section 3.1.

4. Deep learning and Convolutional Neural Networks
The aim of this chapter is to shed light on the basic concepts of Deep
Learning, and particularly on Convolutional Neural Networks (CNNs).
CNNs have become increasingly important in recent years for their huge
capacity to create hierarchical representations and have been used in many
computer vision problems like classification, super-resolution, and object
localisation.
CNNs are now successfully applied to many tasks in the common
computer vision domain, following the breakthrough in image classification
in the ImageNet LSVRC-2010 classification challenge [39]. CNNs are also
applied in the medical field, nowadays often to diagnostic imaging, and
they are one of the most used models in computer-aided diagnosis
(CAD).
This field is in continuous development, and every month the best models for
a given task change, so the focus in what follows is on the fundamentals
of the concept.
A good overview of the application of CNN in the medical field can be
found in [40].
CNNs owe their success to several properties that make them
superior to traditional machine learning algorithms, for example a
scalable feature-learning architecture that, for a given task, optimises the
model parameters with little reliance on feature engineering.
The technical details needed to understand the
implementation of the CNN-based denoiser proposed in this work are provided
below.
4.1 Neural Networks

Neural networks, more properly called Artificial Neural Networks
(ANNs), are computing systems whose architecture is modeled after the brain.
They typically consist of hundreds of simple processing units
wired together in a complex communication network. Each unit, or node, is a
simplified model of a real neuron, which sends off a new signal, or fires, if
it receives a sufficiently strong input signal from the other nodes to which
it is connected.
The feedforward neural network (FNN) is the simplest form of artificial
neural network: input data travels in one direction only, passing through
artificial neural nodes and exiting through output nodes. The aim of this
section is to give an overview of this widely used model. The goal of an FNN,
or of an ANN in general, is to approximate a function f*. The simplest
example is a classifier that maps an element x to its label y, which can
be described as y = f*(x). An FNN defines a mapping between inputs and
outputs, y = f(x; W), which models the original function f*; the
parameters, or weights, W are learned in order to give the best
function approximation.
They are called feedforward networks because the information flows from the
input x to the output y, without feedback between layers, and they are an
ensemble of simple functions built in a chain structure. For example, let
there be N functions f_n with n = 1,...,N; these functions are applied in
sequence to the input x and chained together as

y = f_N(f_{N-1}(...f_2(f_1(x))...)).
f_1 is called the first layer of the network, f_2 the second, and so on up
to the last layer f_N. The length of the chain is called the depth of the
model; since the architecture of the network can be complex and composed of
many layers, such networks are referred to as deep.
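The chain structure can be illustrated with ordinary function composition. In this small sketch the three "layers" are arbitrary toy functions chosen only to show the order of application; `chain` is a hypothetical helper, not part of any library.

```python
from functools import reduce

def chain(layers):
    """Build y = f_N(...f_2(f_1(x))...) from the ordered list [f_1, ..., f_N]."""
    def network(x):
        # Apply f_1 first, then feed each result to the next layer.
        return reduce(lambda value, layer: layer(value), layers, x)
    return network

layers = [lambda x: 2 * x,   # f_1
          lambda x: x + 1,   # f_2
          lambda x: x ** 2]  # f_3
network = chain(layers)
depth = len(layers)          # the depth of the model is the length of the chain

print("depth:", depth, " network(3):", network(3))
```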
The objective of the training is to match our function f to the original f*.
The layers can be shaped freely to better approximate f*, and the role of the
training algorithm is to select the best parameters, or weights, for these
layers. The intermediate layers are called hidden because the training data
does not specify their desired output. They are composed of
several hidden units that perform the basic computation in a neural network;
the number of hidden units in a layer is called its width.
One way to build an intuition of how a neural network works is to imagine
that the final layer is a simple linear model that operates not on the input
itself x but a transformation of it φ(x) created through the other layers. It is
possible to describe the action of the hidden layers as the formation of a
synthetic description of the input shaped by the training algorithm to be
beneficial to represent it correctly. For this reason, these structured
intermediate descriptions of the data are called representations or
(complex) features.
In Figure 4.1 there is a schematic representation of a network with two hidden layers: an input layer x of width 12 is connected to the first hidden layer of width 8, which is connected to the second hidden layer of width 6, which is finally connected to the output layer. The direction of the connections is always from input (top) to output (bottom) without loops between elements: connections come only from the previous layer, and elements in the same layer are not directly connected. This architecture is called a multilayer fully connected neural network since it is formed by multiple layers in which every element is connected to all the elements of the previous layer. This is the most basic example of a neural network, but it still finds applications.

Figure 4.1: Schematic representation of a fully connected neural network with two hidden layers. The first layer on the top is the input layer. Connections go from the input layer to the bottom; the arrows represent the weights.
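As an illustration only, the chained structure of the network in Figure 4.1 can be sketched in a few lines of NumPy. The layer widths 12, 8 and 6 come from the figure; the output width of 1, the random weights and the ReLU non-linearity on the hidden layers are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths from Figure 4.1; the output width of 1 is an assumption.
widths = [12, 8, 6, 1]

# One weight matrix and one bias vector per connection between layers.
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n_out) for n_out in widths[1:]]

def forward(x):
    """Apply the chained layers f_N(... f_2(f_1(x)) ...)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = W @ h + b                      # weighted sum of the previous layer
        if i < len(weights) - 1:           # hidden layers only (assumed ReLU)
            h = np.maximum(h, 0.0)
    return h

x = rng.normal(size=12)
y = forward(x)
```

Each matrix has one row per unit of the receiving layer, so every unit is connected to all units of the previous layer, which is what "fully connected" means here.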
From a biological point of view, the structure of these networks resembles a biological neural network: the elements of the layers can be seen as neurons and the parameters of the chained functions as synapses. Like a biological neuron, whose activity is modulated by many signals coming from synapses with other neurons, the value of an element of a layer is given by the many inputs that it receives from the previous layer.
4.1.1 Activation Functions and Hidden Units
To finish the description of the network architecture (i.e., the overall structure of the network) in the hidden layers, the concept of activation functions is introduced. In an artificial neural network, each unit computes the sum of the products of its inputs and their corresponding weights, and an activation function is then applied to this sum to obtain the output of the layer, which is supplied as the input to the next layer. The purpose of the activation function is to introduce non-linearity into the output of a neuron. Since values in the input layer are generally centered around zero and have already been appropriately scaled, they do not require transformation. However, these values, once multiplied by weights and summed, quickly move beyond the range of their original scale; this is where the activation functions come into play, forcing values back within an acceptable range and making them useful.

One of the most used activation functions is the Rectified Linear Unit (ReLU), which outputs 0 if the input is negative and is linear when the input is positive:
$$\mathrm{relu}(x) = \max(x, 0) \qquad (2.2)$$
The major benefit of ReLU is the reduced likelihood of a vanishing gradient. This arises when x > 0: in this regime the gradient has a constant value. In contrast, the gradient of sigmoids becomes increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.
The other benefit of ReLUs is sparsity. Sparsity arises when x ≤ 0. The more such units exist in a layer, the sparser the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations.
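The two properties above can be checked numerically. The following is a minimal sketch; the sample inputs are arbitrary values chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Derivative is 1 where the unit is active (x > 0) and 0 elsewhere.
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # arbitrary sample inputs
relu_out = relu(x)
sparsity = np.mean(relu_out == 0.0)           # fraction of units that output 0
```

For large |x| the sigmoid gradient shrinks towards zero while the ReLU gradient stays at 1 for every active unit, and every input with x ≤ 0 contributes an exact zero to the ReLU output.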
4.1.2 Learning from examples, loss function and training
The goal of machine learning is to gain experience from data. Usually a dataset is available: a collection of features composed of many data points. In image recognition tasks, for example, this dataset may be a collection of images, each represented as a matrix of pixel intensity values.
In general, machine learning can be divided into two classes, supervised and unsupervised; the approach in this thesis uses both.
Unsupervised learning algorithms allow users to perform more complex processing tasks compared to supervised learning. However, unsupervised learning can be more unpredictable than other learning methods. Unsupervised learning algorithms include clustering, anomaly detection, neural networks, etc.

Today, most practical machine learning models utilize Supervised Learning,
which applies an algorithm to map one input to one output. For supervised
learning to work, a labeled set of data that the model can learn from to
make correct decisions is needed. Data labeling typically starts by asking
humans to make judgments about a given piece of unlabeled data. For
example, labelers may be asked to tag all the images in a dataset. The
tagging can be as rough as a simple yes/no or as granular as identifying the
specific pixels in a specific image.
In machine learning, a properly labeled dataset that you use as the objective
standard to train and assess a given model is often called “ground truth.”
The accuracy of our trained model will depend on the accuracy of our
ground truth, so spending the time and resources to ensure highly accurate
data labeling is essential.
Then, in order to evaluate how well our algorithm performs the task, a performance metric is needed. The choice of the metric is crucial, since it drives the learning phase and determines the ability to generalize the task to other, unlabelled data. The metric should be generic enough to be well defined for all our data examples, and it should give a precise and straightforward criterion to check whether our objectives are reached. Our learning goal should also include evaluating the performance of our model on data that is not present in the training set, to check whether it has learned adequately. A different data set, called the test set, is used for this purpose, and the performance measured on this set is the indicator of how good our model is.
In conclusion, machine learning methods are evaluated by their ability to:
• Reach a good level of performance on the training examples.
• Generalize the learning in order to perform well on the test set too, making the gap in performance between the training and test sets as small as possible.
One interesting task that can be solved with machine learning is denoising, the subject of our thesis. In denoising tasks the algorithm is given as input a corrupted instance x together with its clean version y, and the model tries to restore x to a state similar to y. In this case the metric has to measure the similarity between the two images. Since images are usually handled as pixels, the metric could be a pixel-wise distance between the two images.
The process of optimisation is now discussed. Learning is cast as the optimisation of an objective function; this optimisation is most often a minimisation, and the objective function is then usually called the error function or the loss function. The model's level of performance depends on the behaviour of the loss function, which means that minimising the loss value induces a performance improvement.

The loss function is referred to as L(x). The exact form of L(x) is problem dependent, but in general a low value results in high performance for the algorithm. The argument that minimises the loss function is denoted by x*, such that:
$$x^* = \arg\min_x L(x) \qquad (4.1)$$
The method generally adopted to solve the equation ∇x L(x) = 0 is gradient descent; a more detailed discussion is available in [41].
Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, defined by the negative of the gradient:
$$x' = x - \epsilon \nabla_x L(x) \qquad (4.2)$$
Here ϵ is the learning rate, a positive constant that defines the size of the step. Choosing the best learning rate is itself a matter of discussion; in this work it is chosen as a small value. Gradient descent converges when all the components of the gradient are 0 or close to 0.
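The update rule of equation 4.2 can be sketched on a one-dimensional problem. The quadratic loss, the starting point and the learning rate below are hypothetical values chosen only to make the iteration easy to follow.

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_steps=10_000):
    """Iterate x' = x - lr * grad(x) until the gradient is close to 0."""
    x = x0
    for _ in range(max_steps):
        g = grad(x)
        if abs(g) < tol:       # convergence: gradient is (nearly) zero
            break
        x = x - lr * g         # step against the gradient
    return x

# Hypothetical loss L(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With this learning rate the distance to the minimum shrinks by a constant factor at every step, so the iteration stops close to x = 3; a learning rate that is too large would instead make the iterates oscillate or diverge.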
In particular, ReLU units help the learning phase and the gradient computation, since they have a large, constant derivative at every point where they are active; their activation also has a low computational cost, as described in equation 2.2.
After giving this overview about Neural Networks, a specific and very
popular type of neural networks and the main subject of this thesis,
convolutional neural networks, is examined.
4.2 Convolutional Neural Networks
In the machine learning taxonomy, convolutional neural networks (CNNs) are a subset of deep learning algorithms. CNNs can be both supervised and unsupervised. CNNs are a type of neural network used to analyze data with a grid-like structure. A good example of this is image data, represented as a two-dimensional grid of RGB values.
Thus, CNNs solve the problem of understanding images using networks of
more manageable complexity. The special neural network considers that
physical closeness between pixels bears a meaning and that elements of
interest can appear anywhere in a picture. This is accomplished by using a
linear convolution operation, which will be discussed in the next section.
The use of this operation in one or more layers is what defines a CNN.
4.2.1 Convolutional Operation
In a theoretical sense, the convolution operator in its most primitive form can be considered an operation on two functions of a real-valued input [42]. The convolution operation is defined as:
$$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \qquad (4.3)$$
where x is the function mapping to a specific value in the input data, and w represents the kernel. The output function s(t) is usually called a feature map of x with respect to the kernel w.
If the input values are discrete, the convolution operation can be rewritten using a summation:
$$s(t) = (x * w)(t) = \sum_a x(a)\, w(t - a) \qquad (4.4)$$
The input is commonly multidimensional. In that case, the functions can be replaced with multivariable functions operating on tensors. Consider an example of applying convolution to a two-dimensional image I as input. One can then use a two-dimensional kernel K, and the operation can be written as follows:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n) \qquad (4.5)$$
That is, for a given pixel in the input, positioned in row i and column j, the
convolution is computed by ”placing” the centre of the kernel over the
input pixel, and summing over the product of overlapping kernel
parameters and input pixels to produce the output value for i and j.
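A direct, unoptimised implementation of the two-dimensional sum in equation 4.5 is sketched below. It returns the "full" output, in which every position with at least one overlapping pixel is kept; this choice of boundary handling is an assumption made for the sketch, and practical libraries offer several alternatives.

```python
import numpy as np

def conv2d_full(image, kernel):
    """Discrete 2D convolution S(i,j) = sum_m sum_n I(m,n) K(i-m, j-n)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for m in range(H):
        for n in range(W):
            # Each input pixel I(m, n) adds a scaled copy of the kernel
            # to the output positions (m + a, n + b), i.e. K(i - m, j - n).
            out[m:m + kh, n:n + kw] += image[m, n] * kernel
    return out
```

Convolving an image that contains a single non-zero pixel returns a copy of the kernel at that position, which is a quick way to check the indexing.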
The convolutional operator can be seen as a matrix operation between the
kernel and a small portion of a larger image. Usually, the kernels adopted
for image processing in CNNs are significantly smaller than the image they
are applied to because it is assumed that in images the information is local,
so an object (or a part of an object) will be made of spatially close pixels. In
a convolutional layer the same kernel will be applied to all elements of the
input, meaning that the same operation is repeated in the image space
connecting groups of close hidden units.
The effect of a linear kernel can be seen in Figure 4.2: a kernel K of 2x2 pixels is applied to an image I. The kernel slides over the image in steps of S pixels, where S is called the stride. If the stride is changed, the number of output units of the convolution changes as well: with a stride of 1 the kernel processes every element of the input, whereas with a larger stride the input is subsampled proportionally to the stride length. The output of the convolution is a linear combination of the input elements, and the coefficients of this combination are learned during training.

Figure 4.2: Example of an application of a convolutional kernel that performs an affine transformation. It is applied to the input image, and a linear combination of input elements with coefficients given by the kernel parameters is stored in the output matrix. The stride indicated is 1, so the filter is moved by one element at a time.
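The dependence of the number of output units on the stride can be made concrete with a small helper. This is a sketch that assumes no padding is added around the input; the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Number of kernel positions along one axis when no padding is used."""
    return (input_size - kernel_size) // stride + 1

# A 2-pixel-wide kernel sliding over a 6-pixel-wide input (example values):
size_stride_1 = conv_output_size(6, 2, 1)   # kernel visits every position
size_stride_2 = conv_output_size(6, 2, 2)   # every second position is skipped
```

Doubling the stride roughly halves the number of output units along each axis, which is the subsampling effect described above.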
4.2.2 Pooling Layer

A limitation of the feature map output of convolutional layers is that they
record the precise position of features in the input. This means that small
movements in the position of the feature in the input image will result in a
different feature map. This can happen with re-cropping, rotation, shifting,
and other minor changes to the input image.
A common approach to addressing this problem is called pooling. This operation achieves this goal by replacing the output of the net at a certain location with aggregate information over all the nearby input units. Max pooling is one of the most used pooling operations [43].
Maximum pooling, or max pooling, is a pooling operation that calculates the maximum, or largest, value in each patch of each feature map. The results are downsampled, or pooled, feature maps that highlight the most present feature in the patch, rather than the average presence of the feature as in the case of average pooling.
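The difference between max and average pooling can be seen on a small feature map. The sketch below assumes non-overlapping windows (stride equal to the window size) and uses made-up values.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    # Group pixels into (H2, size, W2, size) patches, dropping any remainder.
    patches = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    if mode == "max":
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 8.],
                 [0., 0., 5., 5.],
                 [0., 1., 5., 5.]])
pooled_max = pool2d(fmap, mode="max")    # [[4, 8], [1, 5]]
pooled_avg = pool2d(fmap, mode="avg")    # [[2.5, 2.0], [0.25, 5.0]]
```

In the top-right patch a single strong response (8) survives max pooling unchanged, while average pooling dilutes it to 2.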
4.2.3 CNN architecture
CNNs have been used for several tasks, so there is no typical architecture that defines the best CNN model for every task. But when processing images, there are some guidelines that are valid most of the time.
One of the best vision model architectures to date is VGG (Simonyan and Zisserman, 2015), which was a popular solution for image classification, and many successive approaches to this task took inspiration from it. The distinctive thing about VGG16 is that, instead of having a large number of hyper-parameters, its authors focused on convolution layers with 3x3 filters and stride 1, always with the same padding, and max pool layers with 2x2 filters and stride 2. This made small kernels the base idea of most modern implementations.
A convolutional layer is usually composed of a set of M convolution operators, each performing a linear combination of K×K elements of the input (with K the size of the kernel). As described earlier, the number of transformations that a network can learn is related to the number of kernels. The size of the kernel, on the other hand, controls how many elements are combined together. Thus, the number of parameters of the layer is:
$$P_W = N \cdot M \cdot K \cdot K \qquad (4.6)$$
where N denotes the size of the input.
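Equation 4.6 can be evaluated directly. In the sketch below N is interpreted as the number of input channels, which is an assumption about what "size of the input" means here; bias terms are not counted, as in the equation, and the example values are hypothetical.

```python
def conv_layer_parameters(n_in, n_kernels, kernel_size):
    """P_W = N * M * K * K, without bias terms (equation 4.6)."""
    return n_in * n_kernels * kernel_size * kernel_size

# Example values: 3 input channels (assumed meaning of N), 8 kernels of 3x3.
p_small_kernel = conv_layer_parameters(3, 8, 3)
p_large_kernel = conv_layer_parameters(3, 8, 7)
```

The count grows quadratically with the kernel size, which is one reason small kernels are attractive.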
After the convolutional operations, the output elements are passed to an activation function to introduce nonlinearity in the learned transformations. Finally, a pooling operation is performed and, as already mentioned, the layer size is reduced by a factor given by the pooling operator's size.
A convolutional block is the ensemble of the operations described above: a combination of convolutions, activations and poolings. The CNN is then composed of multiple blocks, each of them with an input size reduced by the pooling stage of the previous one. This sequence is called the subsampling path, as the input dimension is compressed continuously, allowing more convolutional layers to fit in the same amount of memory.
The output of the last convolutional block is flattened in order to produce the output of the network and, after some fully connected layers, is passed to a classifier. The flattened vector of the last convolutional block is usually called the feature vector, and it is the input to the classification task.
As an example of the architecture explained, Figure 4.3 shows a network with the same structure. From left to right: an input matrix of 128 by 128 units, which can represent the analysed data or the output of the previous convolutional block, is processed by a series of 8 convolutional operators. The convolutional maps are pooled with a value of 2, so the dimensions are divided by 2 and become 64 by 64. Next, a convolutional stage with 24 filters is applied with a stride of 2, and after the pooling the output is reshaped as a vector and used as input for a two-layer fully connected classifier.

Figure 4.3: Schematic description of a convolutional network with convolution and pooling operations. After the max pooling operation the output is flattened to an array that is passed to a fully connected network.
The CNN described here is one of the most basic implementations of the
concept, and it is probably not used anymore in real applications. However,
this is not a problem since our objective is not to survey CNN architectures
but to highlight the motivation behind their success.
4.2.4 The fully convolutional architecture

This is a particular class of CNNs where the output is structured as a map that has the same shape as the input. It is called the Fully Convolutional Neural Network (FCNN), and it was first introduced for the task of image segmentation (Long, Shelhamer, and Darrell, 2015). The segmentation task consists in assigning each pixel of the input to the class of the object it is a part of. So it is possible to say that an FCNN transforms pixels into pixel categories.
In a brain image, the task could be the separation of a white matter zone
from a grey matter.
The difference between the FCNN and the CNN architecture previously introduced is that an FCNN transforms the height and width of the intermediate-layer feature map in the subsampling path back to the size of the input image through the transposed convolution operation, so that the predictions have a one-to-one correspondence in shape with the input image. As a consequence, every pixel in the original input is associated with an output unit.
Unlike the CNN with a fully contractive path, in an FCNN it is possible that each output unit depends only on a part of the input. This area is usually called the receptive field; every pixel outside of this area does not contribute to determining the unit output.
The case in which there are no pooling operations is interesting: the ability to process larger areas then depends only on the depth of the network. The receptive field grows from the initial size of the kernel K by the stride S of the operator in each direction for each consecutive layer, so after D layers the patch size is:
$$PS = K + 2S(D - 1) \qquad (4.7)$$
Figure 4.5 shows an example for a network with a kernel of size 3 by 3, a stride of 1 and depth D, which has an effective patch width of 2D+1. This kind of architecture has a small receptive field, but it holds the spatial information at the same level of detail that is present in the input. This is possible because of the lack of the pooling operation, which helps to improve translation invariance but at the same time makes the output less dependent on the exact spatial position. Such a network performs extremely well on tasks in which the definition and sharpness of the output are important.

Figure 4.5: Receptive field in a CNN without a pooling operation, with a kernel size of 3 and a stride of 1. The layer operations are performed from left to right. A unit on the output layer depends on 7 by 7 pixels in the first layer. The size of the receptive field is given by 2D+1, with D the depth of the network.
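Equation 4.7 can be checked against the example of Figure 4.5 with a one-line helper; the function name is hypothetical.

```python
def patch_size(kernel_size, stride, depth):
    """Receptive field width PS = K + 2 * S * (D - 1) from equation 4.7."""
    return kernel_size + 2 * stride * (depth - 1)

# Kernel 3, stride 1, three layers: the 7-pixel-wide field of Figure 4.5.
ps_three_layers = patch_size(3, 1, 3)
```

For K = 3 and S = 1 the expression reduces to 2D + 1, so the receptive field grows by only two pixels per added layer.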
4.4 Denoising using Residual Learning
Residual learning is adopted along with batch normalization as an image denoising strategy in the work of Zhang and collaborators [35]. When used together, they speed up the training process and boost the denoising performance.
Particular attention is given to residual learning, which aims to remove the latent clean image gradually in the hidden layers in order to separate the noise contribution from the original image.
Residual learning can be schematized as follows. Consider an image and its noisy approximation (y, X), where the noise model is assumed to be additive, X = y + ϵ. The exact form of ϵ may vary, but in this example it may be thought of as Gaussian with zero mean.
In discriminative denoising, a neural network can learn to map the noisy example to the original by matching its output $y_{pred} = NN_W(X)$ to the original image:
$$NN_W(X) \sim y \qquad (4.8)$$
The residual learning formulation instead aims to map the output of the network to the noise part of the input. This is performed by subtracting the output from the noisy input:
$$NN_W(X) = X - y_{pred} \qquad (4.11)$$
Using a pixel-wise mean squared error for simplicity, the loss function of a denoising neural network with a fully convolutional architecture like the one described in section 4.2.4 can be written as:
$$L(y_{pred}, y) = \frac{1}{n} \sum_i \left(y_i^{pred} - y_i\right)^2 \qquad (4.9)$$
In order to simplify the notation, the sum over the pixels will not be written explicitly in the next steps. In normal learning, for which the output is $y_{pred} = NN_W(X)$, the loss function is:
$$L_W = |y_{pred} - y|^2 \qquad (4.10)$$
while for the residual learning approach, $X - y_{pred} = NN_W(X)$, the loss becomes:
$$L(y_{pred}, y) = |X - y_{pred} - y|^2 \qquad (4.11)$$
For the additive noise model, where it is possible to write X = y + ϵ, the loss reduces to:
$$L(y_{pred}, y) = |y + \epsilon - y_{pred} - y|^2 = |\epsilon - y_{pred}|^2 \qquad (4.12)$$
The loss is therefore minimised when $y_{pred} \sim \epsilon$. This slight modification in the loss function helps the neural network to find a solution that focuses on the noise part of the problem instead of learning features that depend on the image.
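The reduction of the residual loss to a comparison with the noise can be verified numerically. The sketch below uses a synthetic random "image" and Gaussian noise as stand-ins; no network is trained, and the two candidate predictions are chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

y = rng.uniform(0.0, 1.0, size=(8, 8))      # clean image (synthetic stand-in)
eps = rng.normal(0.0, 0.1, size=y.shape)    # zero-mean Gaussian noise
X = y + eps                                 # additive noise model

def residual_loss(y_pred, X, y):
    """Mean over pixels of |X - y_pred - y|^2, as in equation 4.11."""
    return np.mean((X - y_pred - y) ** 2)

loss_noise = residual_loss(eps, X, y)               # prediction equals the noise
loss_zero = residual_loss(np.zeros_like(y), X, y)   # prediction is all zeros
```

The loss vanishes (up to rounding) when the prediction equals the noise, and equals the mean squared noise when the prediction is zero, so the target of the residual output is the noise and not the image content.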
The original paper's authors show that a simple neural network that applies this strategy can decrease training time and has a better generalization ability (the training transfers to related tasks). Its training converges with a relatively small dataset, is more stable, and outperforms many traditional algorithms in blind denoising, a task where the noise level is unknown.
This is not surprising: adding a residual layer means using prior information, in this case our knowledge of the additivity of the noise model, as an assumption that simplifies the network task.
4.5 Batch Normalization
Normalizing the input data of neural networks to zero mean and constant standard deviation has long been known [44] to be beneficial to neural network training. Batch Normalization (BN) naturally extends this idea across the intermediate layers within a deep network [Szegedy et al., 2015]. The activations and gradients in deep neural networks without BN tend to be heavy-tailed. In particular, during an early onset of divergence, a small subset of activations (typically in the deep layers) “explode”. The typical practice to avoid such divergence is to set the learning rate to be sufficiently small that no steep gradient direction can lead to divergence. However, small learning rates yield little progress along flat directions of the optimization landscape and may be more prone to convergence to sharp local minima, with possibly worse generalization performance.
BN avoids the activation explosion by repeatedly correcting all activations to be zero-mean and of unit standard deviation. This “safety precaution” makes it possible to train the networks with large learning rates, as activations cannot grow uncontrollably since their means and variances are normalized. As a result, SGD with large learning rates yields faster convergence along the flat directions of the optimization landscape and is less likely to get stuck in sharp minima.
The Batch Normalization algorithm is described by the following equation:
$$O_{b,c,x,y} \leftarrow \gamma_c \frac{I_{b,c,x,y} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c \quad \forall\, b, c, x, y. \qquad (4.13)$$
Here $I_{b,c,x,y}$ and $O_{b,c,x,y}$ are the four-dimensional input and output tensors of a BN layer, with dimensions corresponding to the examples within a batch b, the channel c, and the two spatial dimensions x, y respectively. For input images the channels correspond to the RGB channels. BN applies the same normalization to all activations in a given channel.
BN subtracts the mean activation
$$\mu_c = \frac{1}{|\mathcal{B}|} \sum_{b,x,y} I_{b,c,x,y}$$
from all input activations in channel c, where $\mathcal{B}$ contains all activations in channel c across all examples b in the entire mini-batch and all spatial locations x, y. Subsequently, BN divides the centered activation by the standard deviation $\sigma_c$ (plus ϵ for numerical stability). Normalization is followed by a channel-wise affine transformation parametrized through $\gamma_c$, $\beta_c$, which are learned during training.
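A minimal NumPy sketch of equation 4.13 is given below. It only covers the training-time normalization with batch statistics; the running averages that frameworks keep for inference are left out, and the tensor sizes are arbitrary.

```python
import numpy as np

def batch_norm(I, gamma, beta, eps=1e-5):
    """Normalize a (batch, channel, x, y) tensor per channel (equation 4.13)."""
    # Statistics are taken over the batch and both spatial axes, per channel.
    mu = I.mean(axis=(0, 2, 3), keepdims=True)
    var = I.var(axis=(0, 2, 3), keepdims=True)
    I_hat = (I - mu) / np.sqrt(var + eps)
    # Channel-wise affine transformation with learned gamma_c and beta_c.
    return gamma.reshape(1, -1, 1, 1) * I_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
I = rng.normal(5.0, 3.0, size=(4, 3, 8, 8))      # arbitrary activations
O = batch_norm(I, gamma=np.ones(3), beta=np.zeros(3))
```

With gamma = 1 and beta = 0 every channel of the output has approximately zero mean and unit standard deviation, regardless of the scale of the input.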
5. Dataset and Preprocessing

5.1 The MRI dataset
In this section, light is shed on the original MRI dataset, called the FastMRI dataset [45]. This dataset was not originally conceived for denoising tasks: it is used to test reconstruction algorithms in parallel acquisitions with subsampling of the k-space.
Still, it is a very good candidate for the denoising test because it consists of fully sampled, HD k-space data of images with generally high SNR, to which simulated noise can be added to train our DnCNN. Some of the acquisitions can be noisy, but they form just a minority, and usually the resulting images are of great quality.
In any case, the noise present in the acquisition is negligible compared to the artificial noise added for training purposes. The presence of noise in real data is even a perk, since it requires any denoising method to be robust to labels that are not perfect.
To acquire our images parallel MR imaging was used, which relies on multiple receiver coils. This instrument is usually placed in proximity to the area to be imaged (brain or knee), and during imaging a sequence of spatially and temporally varying magnetic fields, called a “pulse sequence”, is applied by the MRI machine. Using multiple receiver coils implies that each of them produces a separate k-space measurement matrix, and each of these matrices is different (see figure 5.1), because each coil provides a different view of the imaged volume, modulated by the differential sensitivity that the coil exhibits to the MR signal arising from different regions.
Figure 5.1 : Multiple coil acquisition (8 coils) of MR data.
Among the diverse datasets available in FastMRI, the one used in this thesis for denoising with the DnCNN is the knee k-space data. It consists of multi-coil raw data stored for 1,594 scans acquired for the purpose of diagnostic knee MRI. A single fully sampled MRI volume was acquired for each scan on one of three clinical 3T systems (Siemens Magnetom Skyra, Prisma, and Biograph mMR), or one clinical 1.5T system (Siemens Magnetom Aera). Data acquisition used a 15-channel knee coil array and the conventional Cartesian 2D TSE protocol employed clinically at NYU School of Medicine. The dataset includes data from two pulse sequences, yielding coronal proton-density weighting with (PDFS, 798 scans) and without (PD, 796 scans) fat suppression (see figure 5.2). Sequence parameters are, as per standard clinical protocol, matched as closely as possible between the systems. The following sequence parameters were used: echo train length 4, matrix size 320 × 320, in-plane resolution 0.5 mm × 0.5 mm, slice thickness 3 mm, no gap between slices. The timing varied between systems, with a repetition time (TR) ranging between 2200 and 3000 milliseconds and an echo time (TE) between 27 and 34 milliseconds.

Figure 5.2 : Proton density weighted image (a) with fat suppression (PDFS), (b) without fat
suppression (PD). [45]
The total number of patients used for training and testing is shown in Table 5.1.
The dataset also provides 6970 fully sampled brain scans. A portion of 255 of them was later used for additional testing of the solution, but no training was performed on it. A future version of our denoiser will be trained on the whole dataset. This dataset provides examples from multiple sequences and acquisition modalities. Both T1 and T2 weighted images are present, and there are also contrast-medium-enhanced acquisitions.

FastMRI patients             Patients                  Slices
FastMRI Train Set: 973       Used in Train: 350        10236
                             Used in Test: 100         2959
FastMRI Val. Set: 197        Used in Validation:       2380

Table 5.1: Description of the FastMRI dataset. The left column reports the total size of the dataset. The right column reports the number of volumes (patients) used to train, validate, and test the model. The results shown for the supervised and unsupervised training are based on the same data split.
$S_l(x)$ denotes the complex signal at the l-th coil in the x-space, which corresponds to the inverse Fourier transform of $s_l(k)$:
$$S_l(x) = \mathcal{F}^{-1}\{s_l(k)\} \qquad (5.1)$$
with $s_l$ the acquired signal at coil l in k-space.
It is important to recall that, in a single-coil system, the final magnitude image is obtained by simply taking the absolute value of the complex signal, while in the multiple-coil case one complex image is available per coil and, in order to get the real image, it is necessary to combine all that information. In this case the final reconstructed image is called the Composite Magnitude Signal (CMS) (see figure 5.3).
Figure 5.3 : Test of reconstruction using a simple unweighted SoS, the green looking
pictures show k-space data up to 15 coils, below them the Individual coil spatial
images from fully sampled data, and on the right the reconstructed image from the
total coils.
The most popular approach for the reconstruction of the CMS in multiple-coil acquisitions, and the one adopted here, is the Sum of Squares (SoS); it has been proven to be one of the most efficient methods, together with the Spatial Matched Filter (SMF). The advantage of using SoS is that it does not require a prior estimation of the coil sensitivity, and thus the CMS is directly reconstructed from the signal in each coil:

$$M_T(x) = \sqrt{\sum_{l=1}^{L} |S_l(x)|^2} \qquad (5.2)$$
It is important to note that there are other techniques proposed to
reconstruct the CMS from multiple signals, but for the sake of simplicity the
most efficient and straightforward would be used.
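The reconstruction chain of equations 5.1 and 5.2 can be sketched with NumPy's FFT routines. The sketch assumes that the zero frequency is stored at the centre of each k-space matrix, and the toy data with two coils of constant sensitivity is purely hypothetical.

```python
import numpy as np

AXES = (-2, -1)   # the two spatial axes of a (coils, rows, cols) array

def sos_reconstruction(kspace):
    """Sum-of-Squares image from fully sampled multi-coil k-space data."""
    # Equation 5.1: inverse 2D Fourier transform, coil by coil. The shifts
    # assume a centred zero frequency (an assumption of this sketch).
    coil_images = np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(kspace, axes=AXES), axes=AXES),
        axes=AXES)
    # Equation 5.2: root of the sum of the squared coil magnitudes.
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))

# Toy check: two hypothetical coils with constant sensitivities 0.6 and 0.8.
img = np.arange(16.0).reshape(4, 4)
coil_imgs = np.array([0.6, 0.8]).reshape(2, 1, 1) * img
kspace = np.fft.fftshift(
    np.fft.fft2(np.fft.ifftshift(coil_imgs, axes=AXES), axes=AXES), axes=AXES)
cms = sos_reconstruction(kspace)
```

Because 0.6² + 0.8² = 1, the composite magnitude image of this toy example coincides with the original image; with real coils the sensitivities vary in space and the SoS image is modulated accordingly.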
5.2 Noise and signal model for multiple correlated coils
As explained in the previous sections, the acquisition of MRI happens in k-space. Given that the noise affects all frequencies (all the samples in k-space) equally, it is concluded that it is signal and source independent, and it can be modeled as complex Additive White Gaussian Noise (AWGN). In this case, the acquired signal in the l-th coil in k-space can be modeled as:
$$s_l(k) = a_l(k) + n_l(k;\, 0, \sigma_{K_l}^2(k)) \qquad (5.3)$$
with $a_l(k)$ the noise-free signal in the l-th coil of a total of L, and $s_l(k)$ the received noisy signal at coil l. This is the usual noise assumption in MRI: the noise in each coil is considered to be stationary in k-space. In order to obtain the complex image domain, the inverse Fourier transform of $s_l(k)$ is applied to each slice and each coil.
In modern acquisition systems comprising up to 32 or 64 coils, the receivers show a certain coupling [16], which means that the noisy samples at each k-space location are correlated from coil to coil. Under the assumption that the correlation is not frequency-dependent (i.e., it is the same for all k-space samples), the correlation between coils extends to the complex image domain and is described by a covariance matrix Σ that is non-diagonal, symmetric, and positive definite, where the non-diagonal elements are the correlations between each pair of coils.
Given the assumption that each coil initially has Gaussian noise with the same variance $\sigma_0^2$, the covariance matrix Σ is defined as:
$$\Sigma = \sigma_0^2 \begin{pmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,L} \\ \rho_{2,1} & 1 & \cdots & \rho_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{L,1} & \rho_{L,2} & \cdots & 1 \end{pmatrix} \qquad (5.4)$$
Usually, ρi,j is significant in multi-coil systems, but its value is determined by how the antenna is built, and it is a specific characteristic of the particular model of MRI scanner.
The exact value of the correlation matrix for the scanners present in the dataset is unknown. Although it would be possible to estimate it from the few background scans present, a dummy coil geometry and a real correlation matrix are proposed instead.
A circular geometry is chosen, where the correlation between coils is ρi,j = 0.3 if the coils are first neighbours and ρi,j = 0.15 if they are second neighbours. In all other cases ρi,j = 0.05. Figure 5.4 shows a representation of the chosen geometry together with the correlation matrix. As mentioned before, the system has 15 coils, which are subsampled to 8 coils. In the dataset, 3 different scanners are present, but they are assumed to have the same correlation matrix in this approximation.

Figure 5.4: Covariance matrix for the correlated acquisition. The non-diagonal elements are the correlations between each pair of coils; the correlation between neighbouring pairs is bigger than between distant ones.
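One possible way to build the circular correlation matrix and to draw coil-correlated complex Gaussian noise from it is sketched below. The use of a Cholesky factor, the function names, the noise level and the convention that σ0² is the variance of each real and imaginary component are assumptions of this sketch, not a description of the implementation used in this work.

```python
import numpy as np

def circular_correlation(n_coils=8, rho1=0.3, rho2=0.15, rho_far=0.05):
    """Correlation matrix for coils placed on a ring."""
    R = np.full((n_coils, n_coils), rho_far)
    for i in range(n_coils):
        for j in range(n_coils):
            d = min(abs(i - j), n_coils - abs(i - j))   # distance on the ring
            if d == 0:
                R[i, j] = 1.0
            elif d == 1:
                R[i, j] = rho1      # first neighbours
            elif d == 2:
                R[i, j] = rho2      # second neighbours
    return R

def correlated_complex_noise(shape, sigma0, R, rng):
    """Complex Gaussian noise (coils, *shape) with covariance sigma0^2 * R."""
    chol = np.linalg.cholesky(sigma0 ** 2 * R)
    n_coils = R.shape[0]
    white = rng.normal(size=(2, n_coils) + tuple(shape))
    real = np.tensordot(chol, white[0], axes=1)   # mix independent coils
    imag = np.tensordot(chol, white[1], axes=1)
    return real + 1j * imag

rng = np.random.default_rng(0)
R = circular_correlation()
noise = correlated_complex_noise((200, 200), sigma0=0.1, R=R, rng=rng)
```

The empirical correlation between the noise of two neighbouring coils approaches 0.3 as the number of samples grows, which provides a simple sanity check of the generated noise.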
5.3 Data preprocessing
5.3.1 Low-pass filtering and reduced number of coils
One of the advantages of the method employed is that it can be applied
with minimal data preprocessing. In order to decrease the computational
load during training that turns out to be heavy without preprocessing, the
following operations are performed: A low-pass filter is applied to the
frequency data, and the number of coils is reduced.
Low pass filtering (aka smoothing) removes high spatial frequency noise
from a digital image. The low-pass filters usually employ a moving window
operator, which affects one pixel of the image at a time, changing its value
by some function of a local region (window) of pixels. Thus, the operator
moves over the image to affect all the pixels in the image.
When low-pass filtering is applied in the implementation, the acquisition size is reduced from 640x372 or 640x368 to 320x184. The motivation for this filtering is that k-space is sampled at a higher frequency than is needed for the noisy acquisitions to be simulated; the reduction gives cleaner ground truth images and smaller examples, which require less memory during training.
This filtering does not produce any significant effect on the final image.
The complex k-space signal of each coil is normalized to its maximum modulus value on a slice-by-slice basis. This normalization helps convergence during training and standardizes the examples. Since not every coil has the same sensitivity at each point of the volume, a good alternative would be to normalize by the maximum value over the whole volume; the resulting difference in the reconstructed image is minor.
Moreover, not all the coils are used: only 8 of the 15 are selected, and the selection is fixed for all examples. This reduces the overall quality of the final reconstructed image, since less data is used, but it is needed to perform the training in a more reasonable time.
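The preprocessing steps above can be sketched as follows. This is only a sketch under stated assumptions: the low-pass step is modelled as keeping the central block of a centred k-space array, the normalization is taken per coil within the slice, and the retained coils are simply the first 8 (the actual selection is not specified here). All function names are hypothetical.

```python
import numpy as np

def center_crop(kspace, out_shape=(320, 184)):
    """Keep the central, low-frequency block of a centred k-space (..., H, W)."""
    h, w = kspace.shape[-2:]
    oh, ow = out_shape
    top, left = (h - oh) // 2, (w - ow) // 2
    return kspace[..., top:top + oh, left:left + ow]

def normalize_per_coil(kspace, eps=1e-12):
    """Divide each coil by its maximum modulus within the slice."""
    peak = np.abs(kspace).max(axis=(-2, -1), keepdims=True)
    return kspace / np.maximum(peak, eps)         # eps guards against empty coils

def preprocess_slice(kspace, coil_ids=tuple(range(8))):
    """Coil selection (assumed: first 8 coils), cropping and normalization."""
    selected = kspace[list(coil_ids)]
    return normalize_per_coil(center_crop(selected))

rng = np.random.default_rng(0)
raw = rng.standard_normal((15, 640, 372)) + 1j * rng.standard_normal((15, 640, 372))
out = preprocess_slice(raw)                       # shape (8, 320, 184)
```

Normalizing over the whole volume instead would only change the axes over which the maximum is taken.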
5.3.2 Mixed precision
In order to speed up training during the development phase of this study, training is performed with TensorFlow mixed precision.
The benefits of using Mixed Precision are as follows:
• Speeds up math-intensive operations, such as linear and convolution
layers, by using Tensor Cores.
• Speeds up memory-limited operations by accessing half the bytes
compared to single-precision.
• Reduces memory requirements for training models, enabling larger
models or larger mini-batches.

Mixed precision uses both 16-bit and 32-bit variables during training. Lower-precision data types use less memory and exploit specialized hardware in the GPU: modern accelerators run 16-bit computations faster, and 16-bit data can be read from memory faster.
A major advantage of this method is that the mini-batch size can be doubled within the same memory budget, which doubles the number of examples processed at each training step.
The decrease in model performance is very small, and the training time is reduced.
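The memory argument can be checked with a few lines of NumPy. The sketch below is illustrative only and does not use TensorFlow: it compares the size of a hypothetical mini-batch in 32-bit and 16-bit floats, and shows why 32-bit values are still kept alongside the 16-bit ones, since very small numbers underflow to zero in 16 bits unless they are rescaled first. The rescaling factor of 1024 is an arbitrary assumption used for illustration.

```python
import numpy as np

# Hypothetical mini-batch: 4 examples, 16 channels, 320 x 184 samples each.
batch32 = np.zeros((4, 16, 320, 184), dtype=np.float32)
batch16 = batch32.astype(np.float16)
memory_ratio = batch32.nbytes / batch16.nbytes    # half the bytes in 16 bits

# A tiny value (for example a small gradient) is lost in float16 ...
tiny = np.float32(1e-8)
underflow = np.float16(tiny)                      # rounds to zero

# ... but survives if it is scaled up before the cast and scaled back in float32.
scale = np.float32(1024.0)                        # illustrative scaling factor
scaled = np.float16(tiny * scale)
recovered = np.float32(scaled) / scale
```

This is the reason mixed precision combines the two types rather than using 16-bit values everywhere.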

6. Implemented Methodology
6.1 Optimizing the input pipeline
In order to fully utilise the GPU capacity, it is crucial that the input pipeline be as efficient as possible.
The tf.data API (see footnote 1) makes it possible to build complex input pipelines from simple, reusable pieces. It also makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.
In a naive approach, a training step includes opening a file, fetching a data entry from the file, and then using the data for training. This leads to clear inefficiencies: while the model is training, the input pipeline is idle, and while the input pipeline is fetching data, the model is idle.
Prior to training the DnCNN, several data pipeline techniques were used, such as prefetching, parallelising data extraction, parallelising data transformation, caching, and vectorized mapping. The concepts that were included in the implementation are listed below:
• Interleave : used to process many input files concurrently, which parallelises data extraction. The level of parallelism is specified through num_parallel_calls; setting it to tf.data.AUTOTUNE prompts the tf.data runtime to tune the value dynamically at run time.
• Caching : caches a large dataset in local storage. This saves operations such as file opening and data reading from being executed during each epoch; subsequent epochs reuse the data cached by the cache transformation.
• Shuffle : the dataset is shuffled using a buffer whose size is set by the buffer size parameter; this parameter affects the randomness of the transformation. Poorly shuffled data can result in lower training accuracy.
1 https://www.tensorflow.org/guide/data
• Prefetching : overlaps the preprocessing and the model execution of a training step, which removes the inefficiencies described above. For example, while the model is executing training step n, the input pipeline is reading the data for step n+1. Again, AUTOTUNE is used to tune the value dynamically at run time.
• Batch : groups consecutive entries into batches; the batch size was set to 64.
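The behaviour of the shuffle, batch and prefetch steps listed above can be illustrated without TensorFlow. The following standard-library sketch mimics what these transformations do conceptually; it is not the tf.data implementation, and the function names, buffer sizes and the order in which the steps are chained are assumptions made for illustration.

```python
import queue
import random
import threading
from itertools import islice

def shuffle_buffer(items, buffer_size, seed=0):
    """Buffer-based shuffle: a small buffer gives a poorly shuffled stream."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)                   # fill the buffer first
            continue
        k = rng.randrange(buffer_size)
        yield buffer[k]                           # emit a random buffered element
        buffer[k] = item                          # and replace it with the new one
    rng.shuffle(buffer)
    yield from buffer                             # flush what is left

def batch(items, batch_size):
    """Group consecutive entries into lists of at most batch_size elements."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def prefetch(items, depth=1):
    """Prepare the next elements in a background thread while the consumer works."""
    q = queue.Queue(maxsize=depth)
    done = object()

    def producer():
        for item in items:
            q.put(item)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

# 130 dummy entries, shuffled with a 32-element buffer, in batches of 64.
batches = list(prefetch(batch(shuffle_buffer(range(130), 32), 64)))
```

A buffer size of 1 leaves the order unchanged, which illustrates why the buffer size controls the randomness of the shuffle.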
6.2 Metrics
The following metrics will be used to assess the performance of the DnCNN on the fastMRI dataset :
• SNR : in general, the metric used to assess the quality of magnitude images with respect to the noise in the acquisition is the Signal-to-Noise Ratio (SNR). For MR images it is usually defined as the mean intensity of the pixels in a signal region divided by the standard deviation of the background:

$$\mathrm{SNR} = \frac{\mu_{signal}}{\sigma_{background}} \tag{6.1}$$
• PSNR : for MRI the Peak Signal-to-Noise Ratio (PSNR) will be used. PSNR is defined from the mean squared error (MSE), which for a pair of N x M real images I and I* can be written as :

$$\mathrm{MSE} = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ I(i,j) - I^*(i,j) \right]^2 \tag{6.2}$$

PSNR is then defined as :

$$\mathrm{PSNR} = 20 \cdot \log_{10}\left( \frac{MAX_I}{\sqrt{\mathrm{MSE}}} \right) \tag{6.3}$$

PSNR is expressed in decibels (dB). It is a logarithmic measure of the Euclidean distance between the two images, normalized to the maximum value $MAX_I$ that a pixel can assume, which for an 8-bit image is 255.
• SSIM : the Structural Similarity Index is a perceptual metric that quantifies image quality degradation caused by processing such as data compression or by losses in data transmission. Unlike PSNR, SSIM is based on visible structures in the image. The SSIM between the patch m of the original image and the patch n of the denoised image is defined as :

$$\mathrm{SSIM}(m,n) = \frac{(2\mu_m\mu_n + c_1)(2\sigma_{mn} + c_2)}{(\mu_n^2 + \mu_m^2 + c_1)(\sigma_n^2 + \sigma_m^2 + c_2)} \tag{6.4}$$

where $\mu$ is the mean intensity of the patch, $\sigma$ is the standard deviation, $\sigma_{mn}$ is the covariance, and $c_1$, $c_2$ are small regularising constants set to 0.01 and 0.03.
• Residual maps : they are computed as the squared differences between the pixel intensities of the original and restored images, so each value $D_{i,j}$ of the residual map is:

$$D_{i,j} = (I_{i,j} - J_{i,j})^2 \tag{6.7}$$

Both I and J are normalised between 0 and 1.
The average of the residual map over the image is also reported. This average is related to PSNR, which is proportional to the logarithm of its inverse.
SSI and residual maps are two complementary metrics, since SSI is computed over small regions and gives information about the relation between pixels, while the residual maps depend only on the difference between two single pixels.
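The metrics listed above can be written compactly in NumPy. The following is a minimal sketch, assuming images normalised to [0, 1] (so that the maximum pixel value is 1), the square-root form of Eq. 6.3, and a single-patch SSIM with the constants quoted above; the function names are illustrative. It also checks numerically the stated relation between the mean of the residual map and PSNR.

```python
import numpy as np

def snr(image, signal_mask, background_mask):
    """Eq. 6.1: mean of a signal region over the std of a background region."""
    return float(image[signal_mask].mean() / image[background_mask].std())

def mse(ref, img):
    """Eq. 6.2: mean squared error between two real images."""
    return float(np.mean((ref - img) ** 2))

def psnr(ref, img, max_val=1.0):
    """Eq. 6.3, in dB, for images whose maximum pixel value is max_val."""
    return float(20.0 * np.log10(max_val / np.sqrt(mse(ref, img))))

def ssim_patch(m, n, c1=0.01, c2=0.03):
    """Eq. 6.4 evaluated on a single pair of patches."""
    mu_m, mu_n = m.mean(), n.mean()
    cov_mn = np.mean((m - mu_m) * (n - mu_n))
    num = (2 * mu_m * mu_n + c1) * (2 * cov_mn + c2)
    den = (mu_m**2 + mu_n**2 + c1) * (m.var() + n.var() + c2)
    return float(num / den)

def residual_map(ref, img):
    """Eq. 6.7: pixel-wise squared difference of two normalised images."""
    return (ref - img) ** 2

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(32, 32))
noisy = np.clip(clean + 0.05 * rng.standard_normal(clean.shape), 0.0, 1.0)

d = residual_map(clean, noisy)
psnr_direct = psnr(clean, noisy)
psnr_from_map = float(-10.0 * np.log10(d.mean()))  # same value, from the map average
```

For normalised images the mean of the residual map is exactly the MSE, which is why the two PSNR values coincide.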
6.3 Training the K-DnCNN to the FastMRI

The same model described in the second chapter is used: a FCNN (fully convolutional neural network), as described in 4.2.4, with 18 convolutional layers, detector stages, and regularisation with batch normalization, without any downsampling path (no pooling), and with a residual learning connection implemented in the output layer as in Eq. 4.14.
It is essential to notice that this network has no subsampling operation, so
its output units have a small receptive field, which means that only partial
local information and not the entire input is used to compute (and train) the
value of each output unit.
The network has a receptive field of 41 x 41 pixels, and the input is formed by the real and imaginary parts of the complex k-space data of every single coil. Since 8 coils are used, the number of input channels is 16. The diagram of the network is shown in figure 6.1.
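The receptive field and the channel count can be checked with simple arithmetic. The sketch below assumes 3 x 3 kernels with stride 1 and no dilation, which is an assumption and is not stated in this section. Under that assumption a 41 x 41 field corresponds to 20 stacked convolutions, which would be consistent with 18 intermediate layers plus an input and an output convolution; how the layers are counted here is likewise an assumption.

```python
def receptive_field(n_layers, kernel_size=3):
    """Receptive field of stacked stride-1, undilated convolutions."""
    return 1 + n_layers * (kernel_size - 1)

n_coils = 8
n_channels = 2 * n_coils          # real and imaginary part of each coil

field_20 = receptive_field(20)    # 20 stacked 3x3 layers
field_18 = receptive_field(18)    # 18 stacked 3x3 layers
```

Each additional 3 x 3 layer widens the field by 2 pixels, so the field grows linearly with depth when no pooling is used.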
Thus, the Dn-CNN effectively denoises areas of the image based only on partial context. This helps to reduce the number of parameters of the model and to limit overfitting even when the network is trained on a small dataset, and it also helps to bypass the learning of complex anatomical structures, since only a patch of the input is seen by each output unit.
Figure 6.1 : K-DnCNN model for multichannel MRI denoising
The network was trained for blind denoising, with the standard deviation of the noise varying in the range $\sigma \in [5, 20] \cdot 10^{-3}$ in k-space.
The starting quality of the original images is not homogeneous, and the dataset contains scans acquired both with and without fat suppression, which generate brighter or darker images; it is therefore difficult to measure the effect of this noise on the whole dataset in terms of SNR. Instead, PSNR was used to quantify the degradation of the noised images with respect to the original ones, and the PSNR gain was used as a measure of improvement.
Overall, adding random noise in this range of standard deviations generates corrupted images with a PSNR between 10 and 35 dB, which means that their visual quality ranges from reasonable to deeply damaged. The network then learns to estimate the noise content and remove it. In a multi-coil scan with a strong correlation between the noise in the coils, this task tends to be particularly difficult at the image level, because the noise is not stationary and depends on the intensity of the image.
The supervised and unsupervised training are discussed in the following sections, and the results obtained are reported afterward.

6.3.1 K-DnCNN using Supervised Learning
The supervised learning approach requires ground truth images, and the learning process is used to find the set of weights that minimizes the loss function. The network that processes data in k-space maps a noisy version of an input to its ground truth.
The ground truth is the clean version of the k-space, so minimizing the loss should reproduce it. Since the metrics are based only on the reconstructed version of the image, a term is added to the loss function to help the network bring the reconstruction of its output close to the ground truth image. The loss function used in the training is defined as :
$$L_k = \mathrm{MSE}(S_y, S) + \beta \cdot \mathrm{MSE}(\mathrm{SoS}(S_y), M) \tag{6.8}$$
where $S_y$ is the 16-channel output of the network, $S$ is the ground truth signal in k-space, and $M$ is the ground truth magnitude image. The two terms compare, by means of the MSE, the restored output with the clean k-space data and the reconstructed restored image with the original magnitude image.
The parameter $\beta$ balances the two terms: it controls how much the reconstruction in magnitude space is weighted in the loss function. It is not a constant; it starts at zero and slowly increases to its maximum value.
The network is trained with the Adam [47] optimiser for 300 epochs with a learning rate of $3 \cdot 10^{-3}$; the learning rate is then reduced to $3 \cdot 10^{-4}$ and the network is trained for as long as the validation loss keeps decreasing.
The loss coefficient $\beta$ is initialised at zero and then linearly increased to $5 \cdot 10^{3}$ between epochs 50 and 200. The final value is chosen so that the reconstruction term has the same order of magnitude as the loss in k-space.
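The loss of Eq. 6.8 and the schedule of its coefficient can be sketched in NumPy as follows. This is an illustration under several assumptions that are not taken from the implementation: the 16 channels are ordered as 8 real parts followed by 8 imaginary parts, the k-space is centred, the image is reconstructed with an orthonormal inverse FFT followed by a root sum of squares over the coils, and no further normalisation is applied. All function names are hypothetical.

```python
import numpy as np

def beta_schedule(epoch, beta_max=5e3, start=50, end=200):
    """Linear ramp of the loss coefficient from 0 to beta_max."""
    ramp = (epoch - start) / (end - start)
    return beta_max * float(np.clip(ramp, 0.0, 1.0))

def to_complex(channels):
    """(2C, H, W) real array -> (C, H, W) complex (assumed channel layout)."""
    half = channels.shape[0] // 2
    return channels[:half] + 1j * channels[half:]

def sum_of_squares(kspace):
    """Root-sum-of-squares magnitude image from centred multi-coil k-space."""
    axes = (-2, -1)
    shifted = np.fft.ifftshift(kspace, axes=axes)
    coil_images = np.fft.fftshift(np.fft.ifft2(shifted, norm="ortho"), axes=axes)
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))

def k_loss(s_y, s, m, beta):
    """Eq. 6.8: k-space MSE plus beta times the magnitude-image MSE."""
    k_term = np.mean((s_y - s) ** 2)
    image_term = np.mean((sum_of_squares(to_complex(s_y)) - m) ** 2)
    return float(k_term + beta * image_term)

rng = np.random.default_rng(0)
s = rng.standard_normal((16, 32, 32))             # stand-in for clean k-space
m = sum_of_squares(to_complex(s))                 # its magnitude image
s_y = s + 0.01 * rng.standard_normal(s.shape)     # stand-in for a network output

loss_early = k_loss(s_y, s, m, beta_schedule(10))   # beta = 0, k-space term only
loss_late = k_loss(s_y, s, m, beta_schedule(250))   # beta at its maximum
```

Before epoch 50 only the k-space term contributes, so the network first learns to match the frequency data and is then gradually pushed towards a faithful magnitude image.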
6.3.2 K-DnCNN using Unsupervised Learning
For unsupervised learning, the Noise2Noise framework is used. In this
case, there is no ground truth, so the loss function is modified to contain
only the corrupted version of the images.
In general, the denoising task with deep learning is performed by
constructing a map between a corrupted version of an image and the
corresponding clean one in the supervised learning approach.

However, when a clean image is not available, it is usually more
challenging to train a neural network with acceptable performance for the
task.
One possible strategy to overcome this limit is to use an unsupervised learning approach. The approach proposed in [46] offers the possibility to train a neural network using only multiple corrupted examples of the same target, without explicitly using the target itself. This method makes it possible to apply general-purpose deep CNN models to unsupervised denoising and to reach a performance very close to that of supervised learning, without the problem of collecting ground truth images.
The idea is based on the fact that the loss function L can be thought of as a generalized point estimator, that is, an operator that uses sample data to calculate a single value which serves as the best guess of an unknown distribution parameter. The training objective can then be written as an expected value over the distribution of the pairs of noisy and clean images (X, y) :

$$\underset{W}{\mathrm{argmin}} \; E_{(X,y)} \{ L(NN_W(X), y) \} \tag{6.9}$$

for which a possible solution is the set of weights W that gives:

$$NN_W(X) = E_{y}\{y\} \tag{6.10}$$

Using the conditional distribution over all the possible clean images and the average over the sample of noisy images, the objective can be written as :

$$\underset{W}{\mathrm{argmin}} \; E_{X} \{ E_{y|X} \{ L(NN_W(X), y) \} \} \tag{6.11}$$

In this case, P(y | X) denotes the distribution of all the possible correct observations that can be matched to a noisy example. An example is the set of all the possible positions of an edge when the borders are noisy, or the exact position of a boundary between two noisy surfaces.

It can be noticed that if the target is replaced with another noisy observation that has the same expectation value (for example, the clean target corrupted by additive white noise), the solution still holds. For example, in the case of the L2 loss it is possible to train using corrupted target samples if their noise has zero mean.
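This property can be verified on a toy problem. The sketch below replaces the neural network with a single least-squares weight, which is a deliberate simplification and not the model used in this work: the weight fitted against noisy targets is almost identical to the one fitted against clean targets, even though the loss it reaches is higher. The signal and noise levels are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
clean = rng.standard_normal(n_samples)
noisy_input = clean + 0.5 * rng.standard_normal(n_samples)
noisy_target = clean + 0.5 * rng.standard_normal(n_samples)   # independent, zero-mean noise

def fit_scale(x, target):
    """Least-squares weight w minimising mean((w * x - target)**2)."""
    return float(np.dot(x, target) / np.dot(x, x))

w_clean = fit_scale(noisy_input, clean)           # supervised: clean targets
w_noisy = fit_scale(noisy_input, noisy_target)    # Noise2Noise-style: noisy targets

loss_clean = float(np.mean((w_clean * noisy_input - clean) ** 2))
loss_noisy = float(np.mean((w_noisy * noisy_input - noisy_target) ** 2))
```

The two weights agree up to sampling error because the target noise is uncorrelated with the input and averages out in the L2 minimiser, while the remaining gap between the two losses is the variance of the target noise.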
The loss function can be written explicitly, recalling the output of the network in residual learning in Eq. 4.14, as:

$$L_k = \mathrm{MSE}(S_n - N_W(S_n), S_n^*) \tag{6.12}$$

which, with the additive noise described in 5.6, becomes :

$$L_k = | S + n - N_W(S_n) - S - n^* |^2 \tag{6.13}$$

where $S_n = S + n$ and $S_n^* = S + n^*$ are two independently corrupted versions of the same original k-space data $S$. The purpose of this task is to create a map between two independent samplings of the noise.
For the blind denoising task, there is no need to use the same standard deviation of the noise for input and target, and the input to the network does not need to be of better quality than the target. It is therefore possible to ask the network to map an image with higher PSNR to another with lower PSNR without damaging the training. This is extremely important for true unsupervised blind denoising, since the noise level can be unknown at every step.
In contrast to the supervised approach, the aim here is not to drive the training loss to its minimum, since the training task is impossible to complete: in residual learning, this unsupervised task consists in transforming one instance of the noise into another one, independently sampled.
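The reason why the training task cannot be completed can be seen numerically from Eq. 6.13. In the sketch below an idealised residual network is assumed to output exactly the noise n of its input, which is a hypothetical best case: the loss against the second noisy copy then stays at the variance of n*, while the loss against the clean data vanishes. Array sizes and the noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 16e-3
s = rng.standard_normal((16, 32, 32))             # stand-in for clean k-space S
n = sigma * rng.standard_normal(s.shape)          # noise of the input
n_star = sigma * rng.standard_normal(s.shape)     # independent noise of the target

s_n = s + n
s_n_star = s + n_star

ideal_residual = n                                # hypothetical perfect prediction N_W(S_n)
train_loss = float(np.mean((s_n - ideal_residual - s_n_star) ** 2))   # Eq. 6.12
clean_loss = float(np.mean((s_n - ideal_residual - s) ** 2))          # against clean S
```

Even for this ideal predictor the training loss remains close to sigma squared, whereas the loss measured against the clean data is zero up to rounding.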
Figure 6.2 shows the loss function for the training and the validation of the supervised and unsupervised training. In supervised training, the train and validation losses decrease together and differ by a small amount, called the training bias. In unsupervised learning, instead, the training loss is almost flat and the validation loss decreases rapidly: flat because the training task is effectively impossible, and rapidly decreasing because the weight gradients are the correct ones, so when the network is validated on the original task the loss correctly decreases.
Figure 6.2 : Example of the qualitative differences in training for the (a) supervised and (b) unsupervised approaches. Train loss (blue) and validation loss (orange) during live training.
The network is trained with the Adam [47] optimizer for 300 epochs with a learning rate of $3 \cdot 10^{-3}$; afterward, the learning rate is reduced to $3 \cdot 10^{-4}$ and the network is trained for another 100 epochs. Then the learning rate is exponentially decreased to zero.

7. Experimental results
The Dn-CNN is trained for a blind denoising task, so different levels of image corruption are used to test its performance. The test was performed using noise standard deviations of $\sigma = 8 \cdot 10^{-3}$ and $\sigma = 16 \cdot 10^{-3}$ which, since the initial image quality is not homogeneous, produce an average PSNR of the noisy images of (24.3 ± 2.9) dB and (18.9 ± 2.5) dB respectively.
The experimental results are measured only on the reconstructed images. The denoising is performed in k-space, but since k-space is never shown to the operator, it would be of little use to base the results on it.
The final image reconstructed from the denoised k-space, without post-processing (except for a 0 to 1 normalization of pixel intensities), is used in all the tests performed.
PSNR and SSI are the metrics used to evaluate the restored images. To prevent the large background area from contributing too much to the calculation, each image is processed entirely but the metrics are computed only on a large central region of the signal, as shown for example in Figure 3.4 for the whole image and in Figure 3.10 for the central patch.
This is crucial because the background is very easy to treat: its presence would give exceedingly high scores that are not representative of the real performance.
The test was carried out on a portion of the data comprising 100 patients and 2959 slices, including acquisitions both with and without fat suppression.
Table 7.1 reports the results for the supervised and unsupervised training. Both approaches consistently improve the image quality at both high and low noise levels and, remarkably, at the high noise level their performance matches.
Training      Noise std                        PSNR (dB)     ΔPSNR (dB)   SSI            ΔSSI
Supervised    $\sigma = 8 \cdot 10^{-3}$       29.3 ± 3.1    5.1 ± 2.1    0.80 ± 0.06    0.2 ± 0.1
Supervised    $\sigma = 16 \cdot 10^{-3}$      25.6 ± 2.6    6.7 ± 1.8    0.60 ± 0.10    0.3 ± 0.1
Noise2Noise   $\sigma = 8 \cdot 10^{-3}$       28.3 ± 3.4    4.0 ± 3.1    0.80 ± 0.06    0.2 ± 0.1
Noise2Noise   $\sigma = 16 \cdot 10^{-3}$      25.6 ± 2.8    6.7 ± 2.4    0.61 ± 0.10    0.3 ± 0.1
Table 7.1 : Average results on the test dataset, composed of 100 patients with 2959 slices that were processed separately. The results were obtained at two levels of noise, $\sigma = 8 \cdot 10^{-3}$ and $\sigma = 16 \cdot 10^{-3}$, which produce an average PSNR of the noisy images of (24.3 ± 2.9) dB and (18.9 ± 2.5) dB respectively. PSNR and SSI are reported for the K-DnCNN trained with both the supervised and the unsupervised (Noise2Noise) approach, together with the gain Δ. At the higher noise level the two methods perform similarly, while at the lower level supervised training seems to work better.
7.1 Qualitative results
Figures 7.1 to 7.4 show whole knee images, with and without fat suppression, illustrating the noise effect and the denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at the $\sigma = 8 \cdot 10^{-3}$ (Low) and $\sigma = 16 \cdot 10^{-3}$ (High) noise levels. Both methods removed the noise successfully while preserving the fine details of the image: flat areas look smooth, edges are not blurred, and textures are preserved. Moreover, no new details absent from the original image (artifacts) are generated.
Figure 7.1 : Noise effect and denoising with K-DnCNN for supervised and unsupervised (Noise2Noise) training at $\sigma = 8 \cdot 10^{-3}$ (Low) and $\sigma = 16 \cdot 10^{-3}$ (High) noise level.
7.2 Quantitative Results
7.2.1 Supervised Learning
Examples of supervised denoising results on central patches of knee images are reported in figures 7.5 to 7.8, at both high and low noise levels. Residual and structural similarity maps were computed between the ground truth and the predictions, and between the ground truth and the noised images. In the residual maps, each pixel is the squared difference between the pixel intensities of the ground truth and of the corrupted or filtered image. In figure 7.5, for example, where a high level of noise was applied, most of the pixels in the GT-prediction residual map are close to black, which in the map corresponds to an identity mapping; this indicates that the images filtered with K-DnCNN in supervised learning are very close to the original ones. Furthermore, the average value of the residual map in figure 7.5 ($4.25 \cdot 10^{-3} \simeq 0.004$) shows that this particular image, restored at a high noise level, is on average very similar to the original magnitude image.
Conversely, the GT-noised map has mostly red pixels, which implies that the pixel intensities of the ground truth and of the noised image are not similar.
The SS map compares the luminance, contrast, and structure factors of the GT and the recovered image. The higher the SSIM, the better; values close to 1 mean nearly identical sets of data. In figure 7.6, for example, the pixels are mostly red and yellow, which means that the similarity is high.
Even better results are obtained when denoising images at the low noise level with supervised learning, as can be seen in figures 7.7 and 7.8: the average value of the residual map is even lower, as shown in figure 7.7 ($7.54 \cdot 10^{-4} \simeq 0.0007$), and the SS index is close to 1 (0.871).
Figure 7.5 : a-c) Example of denoising on the central image patch. d-e) Residual maps. f-g) Structural similarity maps. Results for the supervised training at noise level $\sigma = 16 \cdot 10^{-3}$.
The distributions of the PSNR and SSI values computed at slice level on the noisy and recovered images, along with the slice-wise gain, are reported in figures 7.9 and 7.10. At the high noise level ($16 \cdot 10^{-3}$) the average PSNR of the noisy images is 18.9 ± 2.5 dB, with around 1200 slices falling in this range, while the PSNR of most restored slices falls in the range 25.6 ± 2.6 dB; the average PSNR gain for denoising at the high noise level is 6.7 ± 1.8 dB. This gain indicates that the supervised denoiser performs very well at high levels of noise. For the SSI, values close to 1 are reported at slice level for the predictions, and a remarkable gain of 0.3 can be noticed.
At the low noise level ($8 \cdot 10^{-3}$), the K-DnCNN performs slightly less well, but remarkable overall gains can be seen as well.
Figure 7.9 : Distribution of the a) PSNR and b) SSI values computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the supervised training at noise level $\sigma = 16 \cdot 10^{-3}$.
7.2.2 Unsupervised Learning
Examples of unsupervised (Noise2Noise) denoising results on central patches of knee images are reported in figures 7.11 and 7.12, at both high and low noise levels.
At high levels of noise, the unsupervised denoising method gives results comparable to those of the supervised one. In figure 7.11, for instance, the average residual map value is $5.03 \cdot 10^{-3} \simeq 0.005$, while for the same image patch filtered by the supervised method an average value of $7.54 \cdot 10^{-4} \simeq 0.0007$ is obtained; for high levels of noise, the two methods can be considered comparable. The same holds in figure 7.12.
The distributions of the PSNR and SSI values computed at slice level on the noisy images and on the images recovered by Noise2Noise, along with the slice-wise gain, are reported in figures 7.13 and 7.14.
In figure 7.13, where the high noise level is applied, the average PSNR gain for the restored images is (6.7 ± 2.4) dB, which is very similar to the average gain obtained with supervised learning; the average SSI gain is also similar (0.3 ± 0.1).
However, in figure 7.14, where the low noise level is applied, less favourable results were reached for the PSNR gain: the structural similarity gain is the same as for the supervised approach, but the average PSNR gain is (4.0 ± 3.1) dB, lower than the (5.1 ± 2.1) dB of the supervised approach.
7.2.3 Application on the Brain Dataset
The possibility of performing the denoising task on data of the same kind as, but with essential differences from, the data used during training is a sought-after feature for a method that may be implemented in actual practice.
In order to test whether a good performance is reached on k-space data derived from different NMR acquisition sequences and from a different body location, the Dn-CNN trained on the knee dataset was applied to the denoising of brain data present in the FastMRI dataset.
Compared to the knee dataset, which is mostly homogeneous in acquisition parameters, the brain dataset is more challenging for denoising, especially when not directly used for training. Also, the shapes, contrasts, and average intensities of a brain scan are visually very different from those commonly found in an arthroscopic acquisition, such as the one on the knee.
For these reasons, this preliminary task of denoising based on training on a different dataset is a difficult one. Nevertheless, the results are an important test of the generalizability of the method.
To test the Dn-CNN denoiser previously trained for the blind denoising task with supervised learning, 637 slices of brain scans from 255 patients present in the validation set of the Brain FastMRI dataset were selected. Noise with $\sigma = 16 \cdot 10^{-3}$ and with the correlation between coils defined in section 5.2 was applied to simulate a highly noisy acquisition.
Since the initial quality of the scans varies widely, depending on the acquisition used and on the noise already present, this noise injection affects the quality of the images in different ways.
The ranges of PSNR and SSI of the corrupted images and of the predictions are shown in Figure 7.16. The mean PSNR is (23.6 ± 4.5) dB, which usually corresponds to highly noisy images with clearly visible intensity variations in both signal areas and background, as shown in Figure 7.15c.
After applying the denoiser, the average gain in image quality is 4.6 ± 2.7 dB for the PSNR and 0.2 ± 0.1 for the SSI, which means that, on average, the image is improved both in the restoration of its original intensity and in its pixel correlation.
An example of processed images with residual and SS maps can be found in Figures 7.15d-f.
Figure 7.15 : Brain dataset denoising on the central image patch, with noise standard deviation $\sigma = 16 \cdot 10^{-3}$.
Figure 7.16 : Results for the brain dataset. Distribution of the a) PSNR and b) SSI values computed at slice level on the noisy image (left), on the restored image (center), and the slice-wise gain. Results for the supervised training at noise level $\sigma = 16 \cdot 10^{-3}$.
7.3 Comparison with a state-of-the-art method
7.3.1 Non Local Means
Non-local means (NLM) has been shown to be an effective image denoising technique. An image often contains many small patches that appear repeatedly. Noise can be removed by exploiting the redundant information in these patches while simultaneously preserving the small structures of the image. By taking advantage of the redundant patches, the NLM image denoising method [48] achieves effective performance and is regarded as one of the most popular denoising methods.
The key principle of non-local means is to denoise a pixel by averaging the pixels in its neighborhood, weighted by the similarity of the patches around them. However, the definition of similarity between the patch of the noisy pixel and the patches in its spatial neighborhood is not strict in NLM; it is simply calculated by a block matching process.
The basic principle of non-local means denoising is to replace the noisy value I(i) of pixel i with a weighted average of all the pixels in the image. Because this requires too much computation, it is more practical to average the pixels in a smaller region. The pixel to be denoised is indicated by i, and the pixels in the neighborhood of i that are used to denoise it are indicated by j. The estimated value $\hat{I}(i)$ for a pixel i is computed as the weighted average of the neighborhood pixels j around pixel i:

$$\hat{I}(i) = \sum_{j \in N_i^s} w_{ij} I(j) \tag{7.1}$$

where $N_i^s$ is the search window of size (2n + 1) x (2n + 1) centered at i, and $w_{ij}$ is the weight of the pair of pixels i and j, which depends on the similarity of their patches $N^d(i)$ and $N^d(j)$ and is defined as :

$$w_{ij} = \frac{1}{Z_i} \exp\left( -\frac{| N^d(i) - N^d(j) |^2}{h^2} \right) \tag{7.2}$$

where $Z_i = \sum_j \exp\left( -| N^d(i) - N^d(j) |^2 / h^2 \right)$ is a normalising term, so that the weights sum to one, and h acts as a filtering parameter.
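Equations 7.1 and 7.2 translate almost directly into code. The following is a deliberately simple, unoptimised 2D sketch written for illustration; it is not the DIPY implementation used in the comparison, and the reflective padding at the borders, the default radii and the value of h are assumptions.

```python
import numpy as np

def nlm_denoise(img, patch_radius=1, search_radius=2, h=0.5):
    """Non-local means on a 2D image, following Eq. 7.1 and 7.2."""
    p, s = patch_radius, search_radius
    pad = p + s
    padded = np.pad(img, pad, mode="reflect")     # assumed border handling
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            ci, cj = i + pad, j + pad
            ref = padded[ci - p:ci + p + 1, cj - p:cj + p + 1]   # patch around pixel i
            weighted_sum, z = 0.0, 0.0
            for a in range(ci - s, ci + s + 1):   # search window around the pixel
                for b in range(cj - s, cj + s + 1):
                    cand = padded[a - p:a + p + 1, b - p:b + p + 1]
                    w = np.exp(-np.sum((ref - cand) ** 2) / h**2)  # unnormalised weight
                    weighted_sum += w * padded[a, b]
                    z += w                        # normalising term Z_i
            out[i, j] = weighted_sum / z
    return out

rng = np.random.default_rng(0)
flat = np.full((16, 16), 0.5)
noisy = flat + 0.05 * rng.standard_normal(flat.shape)
denoised = nlm_denoise(noisy)

mse_before = float(np.mean((noisy - flat) ** 2))
mse_after = float(np.mean((denoised - flat) ** 2))
```

A larger h makes the weights more uniform and the smoothing stronger, while a smaller h restricts the average to patches that are very similar to the reference patch.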
7.3.2 Denoising the fastMRI and results

For the comparison with NLM, the reconstructed magnitude images were corrupted with 4 levels of noise, $\sigma \in \{5 \cdot 10^{-3}, 1 \cdot 10^{-2}, 2 \cdot 10^{-2}, 5 \cdot 10^{-2}\}$, in k-space. The resulting SNR depends critically on the initial image quality, so it may vary considerably at the same $\sigma$ level.
The nlmeans function was imported from the DIPY library (see footnote 2); it applies non-local means denoising to 3D or 4D images and boosts the SNR of datasets. It is also possible to choose between modeling the noise as Gaussian or Rician.
A few functions and parameters are worth citing :
• estimate_sigma : estimates the standard deviation of the noise from local patches. It is also possible to specify the number of coils of the receiver array; in this case 8 was chosen, in order to get 16 channels.
• patch_radius : similar patches in non-local means are searched for locally, inside a cube of side 2 * v + 1 centered at the voxel of interest.
• block_radius : the size of the block to be used in the blockwise non-local means implementation.
• rician : if True, the noise is modelled as Rician; otherwise Gaussian noise is assumed. In this case Gaussian noise is chosen.
These values were chosen by brute-force search for optimal performance, which led to patch_radius = 1 and block_radius = 2.
2 https://dipy.org/documentation/1.4.1./examples_built/denoise_nlmeans/#example-denoise-nlmeans
The NLM denoising result on a single slice from the brain dataset is shown in figure 7.17.
Figure 7.17 : NLM denoising on a whole image from the brain dataset, with noise standard deviation sigma = 8.
The scatter plot, grouped by the three denoising algorithms used on the knee dataset (K-DnCNN, K-N2N, NLM), is reported in figure 7.18. It shows the PSNR of the slices denoised by the three algorithms as a function of the starting PSNR of the noisy images.
Figure 7.18 : PSNR of the denoised images as a function of the starting PSNR of the noisy images. Each point is a slice in the knee dataset. The colors represent the three algorithms used for denoising. Blue : K-DnCNN, Green : K-N2N, Orange : NLM.
The plot shows that, for multiple levels of noise, most of the images processed by K-DnCNN and K-N2N have higher PSNR values than the ones processed with NLM.
Figures 7.19 and 7.20 compare the PSNR and SSI gains of the three algorithms at multiple noise levels. It is clear that (with the chosen tuning for the NLM) DnCNN and N2N perform better than NLM in terms of both PSNR and SSI. It is also interesting to notice that the two DL algorithms behave similarly with respect to the PSNR/SSI level of the noisy image.
Figure 7.19 : PSNR gain of the three algorithms as a function of noise level.
Figure 7.20 : SSI gain of the three algorithms as a function of noise level.
8. Conclusion
The denoising task was performed on the k-space raw data of the FastMRI
dataset, the largest and most complete type. In addition, a method was
proposed based on residual learning denoising, applied to the frequency
data instead of denoising directly the reconstructed images.
The advantage was taken of the power of the residual learning framework
given the presence of additive noise which allows the network to
concentrate on building a high-level representation of the noise component
instead of the clean image.
In MRI, particularly in multi-coil acquisition, the noise is not additive
in the reconstructed image. Chapter 3 showed, in a simple simulation, that
performing the denoising task with residual learning on frequency data
produces better results than applying a network of the same complexity
directly to the image data. The same method was then applied to the
denoising of multi-coil data for morphological imaging. The most
important result in this case is the correct reproduction of the
anatomical parts of the knee, such as the muscle, the bones, and the
cartilage.
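The difference between additive noise in k-space and non-additive noise in
the magnitude image can be illustrated with a minimal NumPy sketch. It
assumes a single-coil acquisition, a synthetic square phantom, and an
orthonormal FFT; the function name `add_kspace_noise`, the phantom, and the
value of `sigma` are illustrative assumptions and are not taken from the
pipeline used in this work.

```python
import numpy as np

def add_kspace_noise(kspace, sigma, rng):
    """Add i.i.d. complex Gaussian noise (std sigma per component) to k-space."""
    noise = (rng.normal(0.0, sigma, kspace.shape)
             + 1j * rng.normal(0.0, sigma, kspace.shape))
    return kspace + noise

rng = np.random.default_rng(0)

# Hypothetical phantom: a bright square on an empty (zero-signal) background.
image = np.zeros((128, 128))
image[48:80, 48:80] = 1.0

sigma = 0.05
kspace = np.fft.fft2(image, norm="ortho")
noisy_kspace = add_kspace_noise(kspace, sigma, rng)

# In k-space the corruption is exactly additive and zero-mean.
kspace_residual = noisy_kspace - kspace

# After reconstruction, taking the magnitude makes the noise signal-dependent:
# in the zero-signal background it has a positive mean (Rayleigh regime).
magnitude = np.abs(np.fft.ifft2(noisy_kspace, norm="ortho"))
background = magnitude[:32, :32]   # the true signal is zero here
bias = background.mean()           # close to sigma * sqrt(pi / 2), not 0
```

In this sketch the k-space residual averages to zero, while the background
of the magnitude image is biased upward by roughly `sigma * sqrt(pi / 2)`.
A multi-coil reconstruction would change the exact distribution, but the
loss of additivity is the same effect.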
Results were quantified using PSNR and SSI: the first measures the
pixel-wise restoration of the true intensity, and the second measures, in
a small window, how well the correlation and contrast between the original
and predicted images are reproduced.
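As a reminder of how the first metric is computed, the following is a
minimal sketch of PSNR and of the PSNR gain plotted in Figure 7.19. The
function names and the default choice of `data_range` (the dynamic range of
the reference image) are assumptions made for illustration; the exact
normalization used in the experiments is not restated here.

```python
import numpy as np

def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if data_range is None:
        # Assumption: use the dynamic range of the reference image.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def psnr_gain(reference, noisy, denoised, data_range=None):
    """Improvement in PSNR (dB) obtained by denoising."""
    return (psnr(reference, denoised, data_range)
            - psnr(reference, noisy, data_range))
```

A positive `psnr_gain` means the denoised slice is closer to the reference,
in the mean squared error sense, than the noisy slice was.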
In the tests, the metrics improved after denoising in a blind denoising
task at both high and low noise levels. This is important because the
denoiser appears to generalize across noise levels. Since the starting
quality of an MRI acquisition is usually not known in advance, an
algorithm that performs well in blind denoising is of practical use.
The obtained results were encouraging. The ability to generalize to
different sequences and anatomical subjects is one of the most important
properties to develop for application in a clinical setting.
Another important note is that supervised and unsupervised learning
achieved comparable performance on the task, as the metrics showed. One
reason is the flexibility of the Noise2Noise algorithm, which works even
with a small dataset; the choice to work in k-space helps, since the noise
there is simpler than the noise present in images.
N2N is widely applicable, since it is often impossible to obtain the
ground-truth images needed for training; this framework makes it possible
to train with only the corrupted images.
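The intuition behind training on corrupted targets can be sketched
numerically. The example below is not the training code of this work: it
uses a random array as a stand-in for clean k-space, a hypothetical helper
`noisy_pair`, and an arbitrary `sigma`. It only illustrates that, for
zero-mean noise, the prediction that minimizes the L2 loss against many
noisy targets is their mean, which approaches the clean signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_pair(clean_kspace, sigma, rng):
    """Return two independent noisy realizations of the same clean k-space."""
    def realization():
        noise = (rng.normal(0.0, sigma, clean_kspace.shape)
                 + 1j * rng.normal(0.0, sigma, clean_kspace.shape))
        return clean_kspace + noise
    return realization(), realization()

# Stand-in for a clean k-space slice (assumption: random image content).
clean = np.fft.fft2(rng.random((32, 32)), norm="ortho")
sigma = 0.1

# Collect the noisy targets of 500 generated pairs.
targets = np.stack([noisy_pair(clean, sigma, rng)[1] for _ in range(500)])

# The L2-optimal prediction for these targets is their mean.
l2_minimizer = targets.mean(axis=0)

error_single = np.abs(targets[0] - clean).mean()   # one noisy target
error_mean = np.abs(l2_minimizer - clean).mean()   # L2 minimizer
```

Here `error_mean` is much smaller than `error_single`, even though no clean
example was used to compute `l2_minimizer`.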
In the results, details were usually restored in the low-noise case,
although they could still be improved. At the same time, it was necessary
to preserve the vital aspect of the solution, namely retaining the
quantitative information of the original image. This should not be
sacrificed for a better visual effect, since in our analysis the
quantitative results are the ones that matter most.
Another point to consider is using all coils instead of only 8; this is
expected to improve the results, since each coil acquisition carries
additional signal from the same volume, which a neural network could
exploit. 8 coils were chosen instead of 15 so that the number of channels
in the input images would be a multiple of eight, to take advantage of the
Tensor Cores. One solution would be to add a dummy coil to the 15 in order
to reach 16, which would use all the acquired information while keeping a
suitable number of channels.
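The dummy-coil idea can be sketched as zero-padding along the coil axis. The
helper name `pad_coils_to_multiple`, the array layout (coils first), and the
slice size are assumptions made for this example only.

```python
import numpy as np

def pad_coils_to_multiple(kspace, multiple=8):
    """Zero-pad the coil axis (axis 0) up to the next multiple of `multiple`."""
    n_coils = kspace.shape[0]
    n_pad = (-n_coils) % multiple          # 15 coils -> 1 dummy coil
    pad_width = [(0, n_pad)] + [(0, 0)] * (kspace.ndim - 1)
    return np.pad(kspace, pad_width, mode="constant")

# Hypothetical 15-coil k-space slice with layout (coil, height, width).
kspace = np.ones((15, 64, 64), dtype=np.complex64)
padded = pad_coils_to_multiple(kspace)     # shape becomes (16, 64, 64)
```

The padded coil contains only zeros, so it adds no signal; its only purpose
is to bring the channel count to a multiple of eight while the 15 real coils
are kept.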
One unrealistic aspect of the unsupervised learning approach may be the
assumption of an unlimited number of noisy examples of a real exam, since
the pairs of examples are generated automatically. In practice, the number
of noisy acquisitions of a subject is usually finite and relatively small
(on the order of a dozen copies). In this case, it is possible to try N2N
with the finite number of noisy examples available, and it may still be a
viable approach. Furthermore, in this context, it is essential to remember
that SL is even more constrained by the number of noisy examples if the
noise is not generated from a model.
Coming to the choice of the neural network's architecture, it has several
advantages, as discussed in depth earlier in chapter 2:
It works in partial context, since its receptive field is smaller than the
image and it effectively processes patches of the input; consequently, it
reduces the leakage of information from training into the denoised example.
It uses a residual learning approach that, in addition to improving
performance, helps reduce unwanted artifacts in the final image.
Moreover, the network has a small number of parameters and a small
architecture, which makes it very robust to overfitting.
Nevertheless, there are still many ways to enhance the chosen architecture
while keeping its original shape and parameters; choosing the correct
parameters makes it possible to obtain the best denoising result.
A new idea could be to implement a layer block derived from ResNet [49],
which is often employed with success in denoising, as a possible
replacement for our convolutional layer; this would keep the same strategy
but deploy a deeper solution.

Comparing K-DnCNN with the state-of-the-art NLM provided a suitable
benchmark and direct support for the claim of this thesis: that denoising
in k-space is an original idea that can outperform even well-established
denoising algorithms. Both PSNR and SSI showed consistent image quality
and similarity to the original image. Overall, the goal of this work was
to show that taking advantage of the additivity of the noise in k-space is
indeed a better approach than learning the noise in magnitude space.
References
[1] Mohan, J., Krishnaveni, V., and Guo, Y. (2014). A Survey on the Magnetic
Resonance Image Denoising Methods. Biomed. Signal Process. Control. 9, 56–69.
doi:10.1016/j.bspc.2013.10.007 (Accessed November 19, 2020)
[2] Tomasi, C., and Manduchi, R. (1998). “Bilateral Filtering for Gray and Color
Images,” in Proceedings of the Sixth International Conference on Computer
Vision (IEEE Computer Society, ICCV ’98), 839.
[3] Lakshmi Devasena C, Hemalatha M. Noise removal in magnetic resonance
images using hybrid KSL filtering technique.
[4] Phophalia A, Rajwade A, Mitra SK. Rough set based image de-noising for
brain MR images. Signal Processing.
[5] Rajeesh J, Moni RS, Palanikumar S, Gopalakrishnan T. Noise reduction in
magnetic resonance images using wave atom shrinkage. International Journal of
Image Processing (IJIP).
[6] Xu, J., Huang, Y., Cheng, M.-M., Liu, L., Zhu, F., Xu, Z., et al. (2020).
Noisy-as-clean: Learning Self-Supervised Denoising from Corrupted Image. IEEE
Trans. Image Process. 29, 9316–9329.
[7] Akar SA. Determination of optimal parameters for bilateral filter in brain
MR image denoising. Applied Soft Computing. 2016;43:87-96
[8] Dey N, Ashour AS, Beagum S, Sifaki Pistola D, Gospodinov M,
Gospodinova E, Tavares RS. Parameter optimization for local polynomial
approximation based intersection confidence interval filter using genetic algorithm:
An application for brain MRI image denoising. Journal of Imaging.
2015;1(1):60-84
[9] TM Hudson, DJ Hamlin, WF Enneking, and H Pettersson. Magnetic
resonance imaging of bone and soft tissue tumors: early experience in 31 patients
compared with computed tomography. Skeletal radiology, 13(2):134–146, 1985.
[10] WD Zimmer, TH Berquist, RA McLeod, FH Sim, DJ Pritchard, TC
Shives, LE Wold, and GR May. Bone tumors: Magnetic resonance imaging
versus computed tomography. Radiology, 155(3):709–718, 1985.
[11] Timothy G Feeman. The mathematics of medical imaging: a beginners guide.
Springer Science & Business Media, 2010.
[12] K Kirk Shung, Michael Smith, and Benjamin MW Tsui. Principles of
medical imaging. Academic Press, 2012.
[13] D. L. Collins, A.P. Zijdenbos, V. Kollokian, J.G. Sled, N.J. Kabani, C.J.
Holmes, and A.C. Evans. Design and construction of a realistic digital brain
phantom. IEEE Trans. on Medical Imaging, 17(3):463–468, June 1998.
[14] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John
C Morris, and Randy L Buckner. Open access series of imaging studies (oasis):
cross-sectional mri data in young, middle aged, nondemented, and demented older
adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.
[15] B. Menze, A. Jakab, S. Bauer, M. Reyes, M. Prastawa, and K. V.
Leemput. Multimodal brain tumor segmentation challenge. In MICCAI
Conference, October 2012.
[16] Santiago Aja-Fernandez and Antonio Tristan-Vega. A review on statistical
noise models for magnetic resonance imaging. LPI, ETSI Telecomunicacion,
Universidad de Valladolid, Spain, Tech. Rep, 2013.
[17] Hakon Gudbjartsson and Samuel Patz. The rician distribution of noisy mri
data. Magnetic Resonance in Medicine, 34(6):910–914, 1995.

[18] ER McVeigh, RM Henkelman, and MJ Bronskill. Noise and filtration in
magnetic resonance imaging. Medical physics, 12:586, 1985.
[19] Pierre Gravel, Gilles Beaudoin, and Jacques A De Guise. A method for
modeling noise in medical images. IEEE Transactions on Medical Imaging,
23(10):1221–1232, 2004.
[20] Alessandro Foi. Noise estimation and removal in mr imaging: The variance
stabilization approach. In ISBI, pages 1809–1814, 2011.
[21] Ranjan Maitra and David Faden. Noise estimation in magnitude mr
datasets. Medical Imaging, IEEE Transactions on, 28(10):1615–1622, 2009.
[22] Pierrick Coupe, Jose V Manjon, Elias Gedamu, Douglas L Arnold,
Montserrat Robles, D Louis Collins, et al. Robust rician noise estimation for mr
images. Medical image analysis, 14(4):483–493, 2010.
[23] Jose V Manjon, Pierrick Coupe, and Antonio Buades. Mri noise
estimation and denoising using non-local pca. Medical image analysis, 22(1):35–
47, 2015.
[24] Guido Gerig, Olaf Kubler, Ron Kikinis, and Ferenc A Jolesz. Nonlinear
anisotropic filtering of mri data. Medical Imaging, IEEE Transactions on,
11(2):221–232, 1992.
[25] Tim McInerney and Demetri Terzopoulos. Deformable models in medical
image analysis: a survey. Medical image analysis, 1(2):91–108, 1996.
[26] Jan Sijbers, Arnold J. den Dekker, Paul Scheunders, and Dirk Van Dyck.
Maximum likelihood estimation of rician distribution parameters. IEEE
Transaction on Medical Imaging, 17(3):357–361, 1998.
[27] J. V. Manjon, J. C. Caballero, G. G. Marti, L. Marti-Baonmati, and M.
Robles. Mri denoising using non local means. Medical Image Analysis, 12:514–
523, 2008.
[28] Yasuyuki Taki, Benjamin Thyreau, Shigeo Kinomura, Kazunori Sato,
Ryoi Goto, Ryuta Kawashima, and Hiroshi Fukuda. Correlations among brain
gray matter volumes, age, gender, and hemisphere in healthy individuals. PloS
one, 6(7):e22734, 2011.
[29] Edelstein, W. A. et al. (Aug. 1986). “The intrinsic signal-to-noise ratio in
NMR imaging”. In: Magn. Reson. Med. 3.4, pp. 604–618. ISSN: 0740-3194.
DOI: 10.1002/mrm.1910030413.
[30] Raya, José G. et al. (Jan. 2010). “T2 measurement in articular cartilage:
Impact of the fitting method on accuracy and precision at low SNR”. In: Magn.
Reson. Med. 63.1, pp. 181–193. ISSN: 0740-3194. DOI: 10.1002/mrm.22178.
URL: https://doi.org/10.1002/mrm.22178.
[31] Dietrich, Olaf, Sabine Heiland, and Klaus Sartor (Mar. 2001). “Noise
correction for the exact determination of apparent diffusion coefficients at low
SNR”. In: Magn. Reson. Med. 45.3, pp. 448–453. ISSN: 0740-3194. URL:
https://doi.org/10.1002/1522-2594(200103)45:3<448::AID-
MRM1059>3.0.CO;2-W.
[32] Glenn, G. Russell, Ali Tabesh, and Jens H. Jensen (2015). “A simple noise
correction scheme for diffusional kurtosis imaging”. In: Magnetic Resonance
Imaging 33.1, pp. 124–133. ISSN: 0730-725X.
[33] Taylor, Alexander J. et al. (Oct. 2016). “Probe-Specific Procedure to
Estimate Sensitivity and Detection Limits for 19F Magnetic Resonance Imaging”.
In: PLOS ONE 11.10, e0163704. DOI: 10.1371/journal.pone.0163704.
[34] Fan, Linwei et al. (Dec. 2019). “Brief review of image denoising techniques”.
In: Visual Computing for Industry, Biomedicine, and Art 2.

[35] Zhang, K. et al. (2017). “Beyond a Gaussian Denoiser: Residual Learning of
Deep CNN for Image Denoising”. In: IEEE Transactions on Image Processing
26.7, pp. 3142–3155. ISSN: 1941-0042.
[36] Arbelaez, Pablo et al. (May 2011). “Contour Detection and Hierarchical
Image Segmentation”. In: IEEE Trans. Pattern Anal. Mach. Intell. 33.5, pp.
898–916. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2010.161.
[37] Orieux, François, Jean-François Giovannelli, and Thomas Rodet (June
2010). “Bayesian estimation of regularization and point spread function
parameters for Wiener Hunt deconvolution”. In: Journal of the Optical Society
of America. A Optics, Image Science, and Vision, p. 1593.
[38] Liu, F. et al. (2017). “Fast Realistic MRI Simulations Based on Generalized
Multi-Pool Exchange Tissue Model”. In: IEEE Transactions on Medical
Imaging 36.2, pp. 527–537. ISSN: 1558-254X.
[39] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton (2012).
“ImageNet Classification with Deep Convolutional Neural Networks”. In:
Proceedings of the 25th International Conference on Neural Information
Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran
Associates Inc., 1097–1105.
[40] Lu, Le et al. (2017). Deep Learning and Convolutional Neural Networks for
Medical Image Computing: Precision Medicine, High Performance and Large-Scale
Datasets.
[41] Chong, Edwin and Stanislaw Zak (2001). “An Introduction to
Optimization”. In: 2nd. SERIES IN DISCRETE MATHEMATICS AND
OPTIMIZATION. WILEY-INTERSCIENCE. Chap. 8th.
[42] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep
Learning. http://www.deeplearningbook.org. MIT Press.

[43] Zhou and Chellappa (24-2). “Computation of optical flow using a
neural network”. In: IEEE 1988 International Conference on Neural
Networks, 71–78 vol.2.
[44] LeCun, Haffner, Bottou and Bengio (1998). Object recognition with
Gradient-Based Learning.
[45] Zbontar, Jure et al. (2019). fastMRI: An Open Dataset and Benchmarks for
Accelerated MRI. arXiv: 1811.08839 [cs.CV].
[46] Lehtinen, Jaakko et al. (2018). “Noise2Noise: Learning Image Restoration
without Clean Data”. In: Proceedings of the 35th International Conference on
Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80.
Proceedings of Machine Learning Research. PMLR, pp. 2965–2974.
[47] Kingma, Diederik P. and Jimmy Ba (2017). Adam: A Method for
Stochastic Optimization.
[48] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image
denoising,” in IEEE Conference on Computer Vision and Pattern
Recognition, vol. 2, 2005, pp. 60–65
[49] He, Kaiming et al. (2016). Identity Mappings in Deep Residual Networks.
arXiv: 1603.05027 [cs.CV].
[50] P. Wang, H. Zhang, V.M. Patel, SAR Image despeckling using a
convolutional neural network, IEEE Signal Process. Lett. 24 (12) (2017) 1763–
1767.
