
Deep Learning

This document discusses using a deep learning approach to denoise MRI images. Specifically, it proposes using a convolutional neural network (CNN) trained on raw MRI data in the k-space domain rather than the magnitude space. The goal is that noise is simpler and Gaussian in k-space. The document provides background on MRI and noise, an overview of deep learning and CNNs, details on the MRI dataset and preprocessing steps, the implemented methodology using a CNN called K-DnCNN, and results showing the approach can effectively reduce noise in MRI images.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/355467879

Using Deep Learning to de-noise MRI images

Article · October 2021


1 author:

Hamza Bouzidi
Sapienza University of Rome

All content following this page was uploaded by Hamza Bouzidi on 21 October 2021.



Deep Neural Networks for N-MRI image processing

Candidate: Hamza Bouzidi
Advisor: Christian Napoli

Facoltà di Ingegneria dell'informazione, informatica e statistica

Dipartimento di Ingegneria Informatica, Automatica E Gestionale


Corso di laurea in Engineering in Computer Science

Hamza Bouzidi

Matricola 1915027

Advisor: Christian Napoli
Co-Advisor: Dr. Stefano Giagu

A.A. 2020-2021
Table of Contents

1. Introduction
2. Magnetic Resonance Imaging
3. Denoising and Rician noise
   3.1 The problem and approach
   3.2 Proof of concept
      3.2.1 Training and noise generation
      3.2.2 Loss function in k-space
      3.2.3 Testing
      3.2.4 Results
4. Deep Learning and Convolutional Neural Networks
   4.1 Neural Networks
      4.1.1 Activation functions and hidden units
      4.1.2 Learning from examples, loss function and training
   4.2 Convolutional Neural Networks
      4.2.1 Convolutional Operation
      4.2.2 Pooling Layer
      4.2.3 CNN architecture
      4.2.4 The fully convolutional architecture
   4.4 Denoising using Residual Learning
   4.5 Batch Normalization
5. Dataset and Preprocessing
   5.1 The MRI dataset
   5.2 Noise and signal model for multiple correlated coils
   5.3 Data preprocessing
      5.3.1 Low-pass filtering and reduced number of coils
      5.3.2 Mixed precision
6. Implemented Methodology
   6.1 Optimizing the input pipeline
   6.2 Metrics
   6.3 Training the K-DnCNN to the FastMRI
      6.3.1 K-DnCNN using Supervised Learning
      6.3.2 K-DnCNN using Unsupervised Learning
7. Experimental Results
   7.1 Qualitative results
   7.2 Quantitative Results
      7.2.1 Supervised Learning
      7.2.2 Unsupervised Learning
      7.2.3 Application on the Brain Dataset
   7.3 Comparison with a state-of-the-art method
      7.3.1 Non Local Means
      7.3.2 Denoising the fastMRI and results
8. Conclusions

1. Introduction
In current clinical practice, medical images play a prominent role in diagnosing and treating many diseases. They help medical practitioners identify a disease, locate abnormal sites, monitor tumor size, and so on. Among the different medical imaging modalities, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Ultrasound are widely used by physicians.

More sophisticated machines and non-invasive techniques have made medical imaging popular, thereby making diagnosis more accurate. However, accurate diagnosis presupposes noise-free images, which remain elusive.

Magnetic Resonance Imaging (MRI), which plays a vital role in clinical diagnosis by producing high-quality 2D and 3D images of the body, is also degraded by noise at acquisition time, due to imperfections in the radio-frequency coils or movements of the patient. Noise in MRI scans not only degrades the performance of computerized diagnosis systems but also makes manual inspection for disease more difficult.

Hence, estimation and removal of noise from MR images are essential for proper interpretation, analysis, accurate parameter estimation, and further processing. Noise remains one of the principal causes of quality deterioration in MRI, causing artifacts and blurring, and is the subject of a large number of papers in the MRI literature. Many denoising and enhancement techniques have been applied [1–8], and a number of them are based on deep learning.

Recently, deep learning methods have been proposed to denoise natural images using different architectures [35,50]. Most of these methods use supervised learning, training different architectures with pairs of noisy inputs and noise-free outputs. Such learning-based methods try to infer the clean image from the noisy input. One of their main benefits is that, after training, denoising can be applied extremely fast. Convolutional Neural Networks (CNNs), for example, have achieved remarkable performance on image denoising [35]. However, the denoising of MRI images using CNNs has not been extensively studied in the literature.

Moreover, the noise model present in MR images has been shown to be very different from that of natural images [16]. This is due to several unique sources of noise and to their combination. Thus, a technique used for natural image denoising may not work correctly for medical images. A further motivation is therefore to investigate this noise and explore techniques for its removal.

Motivated by recent advances in CNNs and by the particular noise model of MRI, this thesis proposes a novel CNN-based method for denoising raw MRI data in k-space (the frequency domain) rather than in magnitude space. The rationale for applying the denoiser network to this type of data is that the noise model is simpler in the frequency domain: it is additive and Gaussian.

2. Magnetic Resonance Imaging


Magnetic Resonance Imaging (MRI) is one of the most prevalent clinical imaging methods, developed alongside Computed Tomography (CT) and X-ray technology. It involves no ionizing radiation; hence it is a non-invasive technique and safer than CT, X-ray, and other techniques. It also provides better soft-tissue contrast and image resolution for diagnostic purposes [9], [10]. MRI is built upon the phenomenon of Nuclear Magnetic Resonance (NMR), discovered independently by F. Bloch and E. Purcell in 1946 (both were awarded the Nobel prize in 1952). Further investigation of NMR made it useful for human society through the notable works of Richard Ernst, Paul C. Lauterbur, and Sir Peter Mansfield, who won the Nobel prize in 1991, 2003, and 2003 respectively.

The principle of NMR involves both quantum and classical mechanics and concerns the precession of protons (present in the human body) under an external magnetic field. The crux of the MRI modality is its use of the abundance of hydrogen nuclei present in the human body in the form of water. The protons of the hydrogen atoms are aligned by the external field. After excitation by a radio-frequency (RF) pulse, the protons release their energy, generating an electromagnetic signal that is recorded by the receiver coils of the MRI scanner. These signals are encoded in phase and frequency components. The raw data is also known as k-space, and its inverse Fourier transform generates an image, either 2- or 3-dimensional. The reconstruction process from raw signal to image space makes it possible to generate any particular slice in 2D form or a complete volumetric (3D) representation [11], [12]. In addition, MRI offers various contrast modalities, namely T1, T2, and PD (Proton Density), shown in Figure 2.1.

Figure 2.1: Sample images from the simulated BrainWeb Database [13]

Although MRI is a versatile technique, the quality of the image is often affected during the acquisition process. The artifacts can mainly be classified as:

• Hardware related: such as power supply instability, thermal noise, etc.
• Software related: such as errors in decoding the pulse sequence, intensity inhomogeneity, etc.
• Patient related: such as body movement, holding breath for a long time, blood flow, etc.
• Physics related: such as magnetic susceptibility, Gibbs ringing artifacts, etc.

Many of the artifacts mentioned above are taken care of by the MR scanner itself. Some noise and artifacts still remain in the scan and need to be removed. Otherwise, they may affect the post-processing steps, which involve tissue identification, tissue segmentation, and other diagnostic decisions.

In the reconstruction step, there is always uncertainty involved due to the sampling of the Fourier domain, its transformation to the spatial domain, the interpolation techniques used, etc. This uncertainty concerns whether a spatial location represents actual tissue information of the subject, or whether the proper signal has been affected by the encoding scheme or by its neighbourhood. It leads to undesirable visual effects, commonly referred to as noise, which need to be overcome by software or mathematical modelling (referred to as the image denoising problem). Figure 2.2 shows two real sample images of different human subjects from benchmark databases [14], [15] where noise is clearly visible. The image denoising problem is, in fact, an inverse problem that tries to reconstruct the true noise-free image [9]; solving it can ease image segmentation, disease identification, etc.

Figure 2.2: Sample images from real databases. Left: Oasis [14]; right: BRATS [15].

The acquisition process of medical images is sensitive to noise or undesired signals. Since noise is an inherent part of MRI data, denoising becomes a crucial ingredient of the medical image analysis process. Hence, there are two sets of problems: (a) estimation and analysis of the noise model and its parameters, and of other artifacts such as intensity inhomogeneity and bias, and (b) construction of adaptive models for denoising. These can be treated as independent problems, or the first can be used as guiding input for the second. An inaccurate noise model casts doubt on the reliability of the denoising method. Traditionally, the Gaussian model is preferred at high-SNR locations in MRI [16]. Considerable effort has been put into building a statistical noise model for MRI [16], [17], [18], [19]. Similarly, efforts have been made to estimate the parameters of these models in [20], [21], [22], [23].

On the other hand, denoising methods designed according to the MRI noise model are highly sought after. In this regard, many conventional methods have been modified to suit the nature of MRI data [24], [25], [26], [27]. However, tissue and boundary information in the image must be kept intact by the denoising process. In fact, Cerebrospinal Fluid (CSF), Gray Matter (GM), and White Matter (WM) play a significant role in differentiating a healthy brain from an abnormal one, and also in clinical examinations [28], so even a small change in them may lead to a wrong clinical decision. Hence, any preprocessing step must preserve the structure and properties of the tissue as they are in the human subject. An extensive review of denoising methods in MRI can be found in [1].

3. Denoising and Rician noise


Figure 3.1 illustrates the process of acquiring an MR image in the frequency domain (k-space); k-space is the 2D or 3D Fourier transform of the measured MR image. Its complex values are sampled during an MR measurement in a scheme controlled by a pulse sequence, i.e., an accurately timed sequence of radiofrequency and gradient pulses. In practice, k-space often refers to the temporary image space, usually a matrix, in which data from digitized MR signals are stored during data acquisition; the data are provided by a quadrature detector that outputs the real and imaginary parts of the signal. Each part of the signal is assumed to be affected by white noise, whose main source is the RF coil resistance [29]; the final effect on image quality depends on a variety of factors such as the pixel dimension, the duration of the acquisition, and the receiver bandwidth.

Figure 3.1: MRI data is first acquired in the frequency domain and then mapped to geometric space (x-space) through the inverse Fourier transform. The magnitude image is then obtained by computing, pixel by pixel, the magnitude of the complex x-space image.

The real and imaginary parts from k-space are mapped to x-space through the complex Fourier transform. The noise in x-space is still Gaussian, and its real and imaginary parts can be assumed to be uncorrelated, since the Fourier transform is a linear and orthogonal transform [17]. The magnitude image is then obtained by computing, pixel by pixel, the magnitude of the complex image. The noise in magnitude space is no longer additive and no longer Gaussian.
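As a hedged numerical illustration of this claim, the sketch below adds complex white Gaussian noise to a one-dimensional k-space line and checks that, after an inverse DFT, the real and imaginary noise components in x-space have the same spread and are approximately uncorrelated. The function names, the signal length, the number of trials, and the 1/N normalization of the inverse DFT are assumptions made for this example, not details taken from the thesis.

```python
import cmath
import math
import random


def idft(spectrum):
    """Naive inverse DFT with 1/N normalization (an assumed convention)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def complex_white_noise(n, sigma_k, rng):
    """Independent Gaussian noise on the real and imaginary channels."""
    return [complex(rng.gauss(0.0, sigma_k), rng.gauss(0.0, sigma_k)) for _ in range(n)]


def noise_statistics(n=32, sigma_k=1.0, trials=300, seed=0):
    """Empirical std of real/imag noise in x-space and their correlation."""
    rng = random.Random(seed)
    re, im = [], []
    for _ in range(trials):
        noise_x = idft(complex_white_noise(n, sigma_k, rng))
        re.extend(z.real for z in noise_x)
        im.extend(z.imag for z in noise_x)
    std_re = math.sqrt(sum(v * v for v in re) / len(re))
    std_im = math.sqrt(sum(v * v for v in im) / len(im))
    corr = sum(a * b for a, b in zip(re, im)) / (len(re) * std_re * std_im)
    return std_re, std_im, corr


std_re, std_im, corr = noise_statistics()
expected = 1.0 / math.sqrt(32)  # sigma_k / sqrt(N) under the 1/N convention
print(std_re, std_im, expected, corr)
```

Under the assumed normalization, both standard deviations should come out close to σk/√N and the correlation close to zero; a different DFT convention would only rescale the noise level.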

The images usually obtained in MRI are magnitude images, although images derived from the phase of the complex image can also be found. This study is based on magnitude images, the most common kind, since discarding the phase information avoids phase artifacts.

Magnitude images cannot be split into a signal part and a noise part since, as said earlier, the noise is no longer additive. The probability distribution of the intensities in a noisy magnitude image M, reconstructed from a signal image I and Gaussian noise with standard deviation σ, is given by:

$$p(M) = \frac{M}{\sigma^2} \exp\left(-\frac{I^2 + M^2}{2\sigma^2}\right) I_0\left(\frac{I M}{\sigma^2}\right) \tag{3.1}$$

where I0 is the modified Bessel function of the first kind of order zero. This is called the Rice distribution. A Gaussian approximation of this distribution can be made only if I/σ ≫ 1, where I/σ is the signal-to-noise ratio in x-space. A magnitude image with a high level of noise will therefore be far from this Gaussian approximation, and it will suffer from what is called the "Rician bias".
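The Rician bias can be made concrete with a small simulation. The sketch below is only an illustration under assumed values (the signal levels, σ = 1, the sample count, and the function names are all hypothetical): it draws magnitude samples from a signal with Gaussian noise on the real and imaginary channels and measures how far the mean magnitude lies above the true signal.

```python
import math
import random


def rician_samples(signal, sigma, n, rng):
    """Magnitude of a complex value whose real and imaginary parts carry Gaussian noise."""
    return [
        math.hypot(signal + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        for _ in range(n)
    ]


def mean_bias(signal, sigma, n=20000, seed=0):
    """Average overestimation of the true signal by the noisy magnitude."""
    rng = random.Random(seed)
    samples = rician_samples(signal, sigma, n, rng)
    return sum(samples) / n - signal


low_snr_bias = mean_bias(signal=0.0, sigma=1.0)    # pure noise: Rayleigh regime
high_snr_bias = mean_bias(signal=20.0, sigma=1.0)  # I/sigma = 20: near-Gaussian regime
print(low_snr_bias, high_snr_bias)
```

With no signal at all, the mean magnitude approaches σ√(π/2), the mean of a Rayleigh distribution, while at I/σ = 20 the bias shrinks to roughly σ²/(2I), which is why the Gaussian approximation is acceptable only at high SNR.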

As a consequence, clinical MRI with low SNR, besides being hard to read and interpret, can lead to erroneous quantification of physical quantities. According to [30], in T2 relaxation images the accuracy and precision of the measured T2 may be substantially impaired by the low signal-to-noise ratio of images available from clinical examinations, and in diffusion-weighted images the decreasing SNR at increasing diffusion weighting causes systematic errors when calculating apparent diffusion coefficients [31]. In morphological scans, the effect can be less important since new-generation scanners achieve excellent imaging quality, but Rician noise still causes problems in many new acquisition modalities [32], or when the signal is low because of the low concentration of the excited nuclei, as in fluorine magnetic resonance imaging [33].

3.1 The problem and approach

Image denoising is the task of removing the effect of noise from an image: denoising should restore the noisy image to the condition it was in before the noise was applied (the original image). The performance of a denoiser is evaluated by how close the restored image is to the original one. Still, the denoised image inevitably loses some details in the process, since noise, edges, and texture are all high-frequency components [34]. Image denoising is therefore considered a classic problem, and many solutions have been proposed, as described in the introduction. The features of a good denoiser can be defined as follows: flat areas should be smooth, edges should not be blurred, textures should be preserved, and new details not present in the original image (called artifacts) should not be introduced.

In the past few years, several deep learning methods have been proposed to denoise MR images by training different architectures, in a supervised fashion, with pairs of noisy and noise-free training patterns.

Our experiment seeks to evaluate the performance of a feed-forward denoising convolutional neural network (DnCNN) applied to the denoising of magnitude images, operating both on the image and on the raw data in k-space.

The method is inspired by the work of Zhang and collaborators [35], which brings advances in deep architectures, learning algorithms, and regularisation methods into image denoising. Residual Learning (RL) and Batch Normalization are both used to speed up the training process and to boost the denoising performance; RL aims to gradually remove the latent clean image in the hidden layers, so as to separate the noise from the original image.

As a first test, the network consists only of convolutional operations, detector stages, and batch normalization, as explained in the next section. It is composed as follows:

• First layer: 2D convolution with ReLU activation, 64 filters of size 3x3x1.
• Layers 2 to (D-1): 2D convolution with Batch Normalization and ReLU activation, 64 filters of size 3x3x64.
• Last layer: 2D convolution with linear activation, with a residual layer.
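To give a sense of the size of such a network, the following sketch counts trainable parameters and the receptive field for the layer layout listed above. It is a back-of-the-envelope calculation under stated assumptions, not the author's implementation: the depth D = 17 is a hypothetical choice, every convolution is assumed to have a bias, and batch normalization is assumed to contribute a scale and a shift per channel.

```python
def conv_params(kernel, in_ch, out_ch, bias=True):
    """Number of trainable parameters of a 2D convolution."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def dncnn_summary(depth, channels=64, kernel=3, in_ch=1):
    """Parameter count and receptive field for the layer layout listed above.

    Assumptions: every convolution has a bias, and batch normalization
    contributes two trainable parameters (scale and shift) per channel.
    """
    first = conv_params(kernel, in_ch, channels)
    middle = (depth - 2) * (conv_params(kernel, channels, channels) + 2 * channels)
    last = conv_params(kernel, channels, in_ch)
    # Each 3x3 convolution with stride 1 widens the receptive field by 2 pixels.
    receptive_field = 1 + depth * (kernel - 1)
    return first + middle + last, receptive_field


params, field = dncnn_summary(depth=17)  # depth=17 is a hypothetical choice
print(params, field)
```

Under these assumptions the count is dominated by the middle layers, and each output pixel depends on a (2D+1) x (2D+1) neighbourhood of the input.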

More details about neural networks and residual learning are given in the next section, and a schematic representation is given in Section 6.3.

This test compares the performance of the denoiser on additive Gaussian noise with its performance on Rician-distributed noise, for the same image quality in terms of PSNR.

The images affected by Gaussian and by Rician noise are modeled, respectively, as:

$$M_{gauss} = I + \epsilon \tag{3.2}$$

$$M_{rician} = \sqrt{(I + \epsilon_1)^2 + \epsilon_2^2} \tag{3.3}$$


Here the left-hand sides are the images after adding Gaussian and Rician noise, I is the original image, and ϵ, ϵ1, ϵ2 are zero-mean Gaussian noise with standard deviation σ, where σ is drawn at random from [35,60). The model is trained on 400 images from the training set of the "BSDS500" dataset, which is described in [36] and is often used as a benchmark for denoising tasks. Results for additive Gaussian noise are reported in Figures 3.2 and 3.4a, and for Rician noise in Figures 3.3 and 3.4b; the metrics used to evaluate the performance are defined in Section 6.2. The results are compared to a standard image filtering technique, the Wiener filter, which is highly effective for white noise removal [37].
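The two noise models and the PSNR comparison can be sketched in a few lines. This is a minimal illustration, assuming the standard PSNR definition with a peak value of 255 (8-bit intensities) and a small synthetic image; the helper names and the test image are hypothetical, and the thesis's own metrics are those defined in Section 6.2.

```python
import math
import random


def add_gaussian_noise(image, sigma, rng):
    """Eq. (3.2): additive zero-mean Gaussian noise."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


def add_rician_noise(image, sigma, rng):
    """Eq. (3.3): magnitude of the signal with noise on two orthogonal channels."""
    return [
        [math.hypot(p + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for p in row]
        for row in image
    ]


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak=255 assumes 8-bit intensities)."""
    diffs = [(a - b) ** 2 for r, t in zip(reference, test) for a, b in zip(r, t)]
    mse = sum(diffs) / len(diffs)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
# Hypothetical 32x32 test image: a dark background with a brighter square.
image = [[200.0 if 8 <= r < 24 and 8 <= c < 24 else 20.0 for c in range(32)]
         for r in range(32)]
sigma = rng.uniform(35.0, 60.0)  # noise level drawn from the range quoted above
psnr_gauss = psnr(image, add_gaussian_noise(image, sigma, rng))
psnr_rician = psnr(image, add_rician_noise(image, sigma, rng))
print(sigma, psnr_gauss, psnr_rician)
```

The same σ yields different PSNR values under the two models, which is why the comparison in the text is made at equal PSNR rather than at equal σ.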

The results show that, when comparing the restoration of images at the same level of degradation, expressed as PSNR with respect to the original image, Gaussian denoising is more effective than Rician denoising, and that the DnCNN always performs better than a Wiener filter applied to the same image affected by Gaussian noise.

Figure 3.2: Performance of the DnCNN on Gaussian blind denoising. Clockwise from left: the noisy version of the image, the image processed by the DnCNN, the original image, the image processed with the Wiener filter. Each point is an image in the test set, the same color refers to the same level of noise applied, and the dotted red line marks no improvement in PSNR after the application of the denoiser [PSNR(processed) = PSNR(noisy)].

Figure 3.3: Performance of the DnCNN on Rician blind denoising. Clockwise from left: the noisy version of the image, the image processed by the DnCNN, the original image, the image processed with the Wiener filter. Each point is an image in the test set, the same color refers to the same level of noise applied, and the dotted red line marks no improvement in PSNR after the application of the denoiser [PSNR(processed) = PSNR(noisy)].

Figure 3.4: Average PSNR on the test dataset after the application of the DnCNN (blue) and the Wiener filter (orange) as a function of the standard deviation of the noise, for the Gaussian and Rician models. For blind Gaussian denoising, the DnCNN is always better than the Wiener filter; in the case of Rician noise, the Wiener filter achieves a higher PSNR at high noise levels.


In the next section, the theory of neural networks is explained, with a deeper focus on the architecture of CNNs. The performance of the proposed denoising method is then tested on simulated data that is easily accessible, and later validated on already collected data from an open dataset.

3.2 Proof of concept

To perform supervised learning, the network must be provided with pairs of corrupted and noiseless images, the latter serving as ground truth. The ground truth provides an example of the expected output. Supervised learning is discussed in more detail in Section 4.1.2.

First, the methods were validated on simulated data. This means that the work should be considered a "proof of concept": it may not be directly applicable to real data, but it makes it possible to control every step of the pipeline.

To generate the dataset, an MRI simulation was performed in MRiLab [38],

a comprehensive simulator for large-scale realistic MRI simulations.

MRiLab combines realistic tissue modeling with numerical virtualization of

an MRI system and scanning experiment to assess a broad range of MRI

modalities.

Realistic simulations can be performed with plausible biological phantoms modeled as large 3D objects with biologically relevant tissue models. The computational power needed for the simulation is obtained through parallelized execution on GPU.

The shape, position, rotation, and dimensions of the organs can vary within a predefined interval; an example can be seen in Figure 3.4.

Figure 3.4: Examples of generated phantoms. Each phantom is a 3D model with organs of different size, shape, orientation, and position. (a) and (b) are two examples of phantoms used in training.

3.2.1 Training and noise generation


Once the noise-free dataset is formed, noisy data is generated by adding complex white noise in the frequency domain. The noise standard deviation in k-space, σk, is chosen so as to generate images with an SNR value between 5 and 2.

Figure 3.5 shows the effect of the addition of noise in k-space in the

magnitude images.

The DnCNN was trained for the task of denoising directly in k-space; this network is referred to as Kspace-Dn. Training is performed on pairs of clean and noisy images by minimizing the loss defined in Section 3.2.2. To compare the results with a network of the same complexity, the DnCNN was also trained on the noisy images in magnitude space; this network is called M-Dn.

Both networks are trained for 300 epochs with Adam [47] and a learning rate lr = 10^-3; the learning rate is then reduced to lr = 10^-4 and the networks are trained for another 100 epochs.
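The two-stage schedule described here can be written as a small function of the epoch index. This is a sketch only: the function and parameter names are hypothetical, and how the value is handed to the optimizer depends on the training framework, which this section does not specify.

```python
def learning_rate(epoch, switch_epoch=300, initial_lr=1e-3, reduced_lr=1e-4):
    """Piecewise-constant schedule: initial_lr before switch_epoch, reduced_lr after."""
    return initial_lr if epoch < switch_epoch else reduced_lr


TOTAL_EPOCHS = 400  # 300 epochs at 1e-3 followed by 100 epochs at 1e-4
schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
print(schedule[0], schedule[299], schedule[300], schedule[-1])
```

Epochs are counted from zero here, so the reduced rate applies from epoch index 300 onward.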

Figure 3.5: Noise effect on simulated MRI data. Magnitude image reconstruction after the addition of noise.

3.2.2 Loss Function in k-space


The network that processes data in k-space has to map the corrupted version to the clean one. The mapping should be accurate both in k-space and in the final reconstructed magnitude image; for this reason, a term is added to the loss function to help the network concentrate on the final image.

The loss used for the k-space denoiser is:

$$L_K = \mathrm{MSE}(S_Y, S) + \beta \cdot \mathrm{MSE}(\mathrm{reco}(S_Y), M) \tag{3.4}$$

Here SY is the two-channel output of the network, S = (SR, SI) is the ground truth in k-space, given by the real and imaginary parts of the signal, M is the ground truth in magnitude space, reco(SY) is the reconstruction of the two-channel output obtained by taking the modulus of its 2D inverse Discrete Fourier Transform (iDFT), β weights the second term, and MSE is the mean squared error.
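A direct, unoptimized reading of Eq. (3.4) is sketched below using only the Python standard library. It is an illustration rather than the thesis's training code: the naive inverse DFT, its normalization, the helper names, the 4x4 example data, and the value β = 1.0 are all assumptions, since the section does not state them.

```python
import cmath
import math


def idft2(kspace):
    """Naive 2D inverse DFT of a complex matrix (1/(rows*cols) normalization assumed)."""
    rows, cols = len(kspace), len(kspace[0])
    return [
        [
            sum(
                kspace[u][v] * cmath.exp(2j * math.pi * (u * r / rows + v * c / cols))
                for u in range(rows)
                for v in range(cols)
            ) / (rows * cols)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def reco(real, imag):
    """Magnitude image from the two-channel (real, imaginary) k-space signal."""
    kspace = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(real, imag)]
    return [[abs(z) for z in row] for row in idft2(kspace)]


def mse(a, b):
    """Mean squared error between two equally sized matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def kspace_loss(pred, target, magnitude, beta=1.0):
    """Eq. (3.4); pred and target are (real, imag) pairs, beta=1.0 is a placeholder."""
    k_term = 0.5 * (mse(pred[0], target[0]) + mse(pred[1], target[1]))
    return k_term + beta * mse(reco(*pred), magnitude)


# Tiny hypothetical 4x4 example.
target = ([[float(r + c) for c in range(4)] for r in range(4)],
          [[float(r - c) for c in range(4)] for r in range(4)])
noisy = ([[v + 0.5 for v in row] for row in target[0]], target[1])
magnitude = reco(*target)
print(kspace_loss(target, target, magnitude), kspace_loss(noisy, target, magnitude))
```

The first term penalizes errors in the real and imaginary channels directly, while the second compares reconstructed magnitude images, so β trades fidelity in k-space against fidelity in the image that is finally inspected.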

3.2.3 Testing
To test the networks, data from a realistic brain phantom, available in the simulation software [38], is used to check the noise removal capabilities on fine details never seen during training. This test measures the generalization capability of the networks.

3.2.4 Results

Figure 3.6: Denoising results for (a) axial and (b) coronal views of the brain phantom, both in k-space and in magnitude space. Denoising in magnitude space seems to create smoother surfaces at the cost of losing details.

In this test, as described in Section 3.2.1, two networks, Kspace-Dn and M-Dn, were trained with pairs of original and noisy simulated MRIs of the simple phantom. Figure 3.6 shows a few images denoised with both Kspace-Dn and M-Dn for visual comparison. M-Dn favors smoother surfaces at the cost of losing details, possibly because of an incorrect noise estimation during the blind denoising task. The network that works in k-space consistently outperforms the network trained on magnitude images in terms of image quality. This was expected from the experiments of Section 3.1.

4. Deep learning and Convolutional Neural Networks

The aim of this chapter is to shed light on the basic concepts of Deep Learning, and particularly on Convolutional Neural Networks (CNNs). CNNs have become increasingly important in recent years for their huge capacity to create hierarchical representations, and they have been used in many computer vision problems such as classification, super-resolution, and object localisation.

CNNs are now successfully applied to many tasks in the general computer vision domain, following the breakthrough in image classification in the ImageNet LSVRC-2010 classification challenge [39]. They are also applied in the medical field, often to diagnostic imaging, and are among the most used models in computer-aided diagnosis (CAD).

This field is in continuous development, and the best models for a given task change every month, so the focus in what follows is on the fundamentals.

A good overview of the application of CNN in the medical field can be

found in [40].

Several properties make CNNs superior to traditional machine learning algorithms and explain their success, for example a scalable feature-learning architecture that, for a given task, optimises the model parameters with little reliance on feature engineering.

The technical details needed to understand the implementation of the CNN-based denoiser proposed in this approach are provided below.

4.1 Neural Networks


Neural networks, more precisely called Artificial Neural Networks (ANNs), are computing systems whose architecture is modeled after the brain. They typically consist of hundreds of simple processing units wired together in a complex communication network. Each unit or node is a simplified model of a real neuron, which sends off a new signal, or fires, if it receives a sufficiently strong input signal from the other nodes to which it is connected.

A feedforward neural network (FNN) is the simplest form of artificial neural network: input data travels in one direction only, passing through artificial neural nodes and exiting through output nodes. The aim of this section is to give an overview of this widely used model. The goal of an FNN, or of an ANN in general, is to approximate a function f*. The simplest example is a classifier that maps an element x to its label y, described as y = f*(x). An FNN defines a mapping between inputs and outputs, y = f(x; W), which models the original function f*; the parameters or weights W are learned so as to give the best function approximation.

They are called feedforward networks because the information flows from the input x to the output y, without feedback between layers. They are an ensemble of simple functions built in a chain structure. For example, given N functions f^n with n = 1,...,N, these functions are applied in sequence to the input x and chained together as

$$y = f^{N}(f^{N-1}(\dots f^{2}(f^{1}(x)) \dots))$$

Here f^1 is called the first layer of the network, f^2 the second, and so on until the last layer f^N. The length of the chain is called the depth of the model; since the architecture of the network can be complex and composed of many layers, such networks are referred to as deep.

The objective of training is to match our function f to the original f *.
The training data specify directly what the output layer must produce, but
not what the intermediate layers should do: the training algorithm is free
to shape them, by selecting their parameters or weights, so as to best
approximate f *. Because the training data do not show the desired output
of these intermediate layers, they are called hidden layers. Each of them is
composed of several hidden units that perform the basic computation in a
neural network, and the number of hidden units in a layer is called its
width.

One way to build an intuition of how a neural network works is to imagine

that the final layer is a simple linear model that operates not on the input

itself x but a transformation of it φ(x) created through the other layers. It is

possible to describe the action of the hidden layers as the formation of a

synthetic description of the input shaped by the training algorithm to be

beneficial to represent it correctly. For this reason, these structured

intermediate descriptions of the data are called representations or

(complex) features.

In Figure 4.1 there is a schematic representation of a network with two

hidden layers: An input layer x of width 12 is connected to the first hidden

layer of width 8 connected to the second layer of width 6 that, finally, is

connected to the output layer.

Figure 4.1 : Schematic representation of a fully connected neural
network with two hidden layers. The first layer on the top is the input
layer. Connections go from the input layer to the bottom; the arrows
represent the weights.

The direction of the connections is always from input (top) to output
(bottom), without loops between elements:

Connections come only from the previous layer, and elements in the same

layer are not directly connected. This architecture is called a multilayer fully

connected neural network since it is formed by multiple layers in which

every element is connected to all the elements of the previous layer. This is

the most basic example of a neural network, but it still finds applications.

From a biological point of view, the structure of these networks resembles a

biological neural network: The elements of the layers can be seen as

neurons and the parameters of the chained functions as synapses. Like a
biological neuron, whose activity is modulated by many signals coming
from synapses with other neurons, the value of an element of a layer is
given by the many inputs that it receives from the previous layer.

4.1.1 Activation Functions and Hidden Units

To finish the description of the network architecture (i.e., the overall
structure of the network) in the hidden layers, the concept of activation
functions is introduced. In an artificial neural network, the sum of the
products of the inputs and their corresponding weights is calculated, and an
activation function is then applied to it to obtain the output of that
particular layer, which is supplied as the input to the next layer. The
purpose of the activation

function is to introduce non-linearity into the output of a neuron. Since

values in the input layers are generally centered around zero and have

already been appropriately scaled, they do not require transformation.

However, these values, once multiplied by weights and summed, quickly

get beyond the range of their original scale, which is where the activation

functions come into play, forcing values back within this acceptable range

and making them useful.
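As an illustration of the chain structure and of the role of the activation function, the following is a minimal NumPy sketch of a forward pass through a small fully connected network. The widths 12, 8 and 6 follow Figure 4.1; the output width of 1, the random weights, the rectifier max(z, 0) used as activation, and the names `dense` and `forward` are assumptions made only for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, W, b, activation=None):
    """One layer: weighted sum of the inputs plus a bias, then an optional activation."""
    z = W @ x + b
    return z if activation is None else activation(z)

rng = np.random.default_rng(0)
widths = [12, 8, 6, 1]  # input, two hidden layers, output (output width is assumed)
params = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(x):
    # y = f3(f2(f1(x))): hidden layers apply a non-linearity, the last layer is linear
    for W, b in params[:-1]:
        x = dense(x, W, b, relu)
    W, b = params[-1]
    return dense(x, W, b)

x = rng.normal(size=12)
y = forward(x)
```

Without the non-linearity in the hidden layers, the whole chain would collapse into a single linear map of the input.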



One of the most used activation functions is the Rectified Linear Unit
(ReLU), which outputs 0 if the input is negative and the input itself if it
is positive:

$$ \mathrm{relu}(x) = \max(x, 0) \tag{2.2} $$

The major benefit of ReLU is the reduced likelihood of the gradient to

vanish. This arises when x > 0. In this regime the gradient has a constant

value. In contrast, the gradient of sigmoids becomes increasingly small as

the absolute value of x increases. The constant gradient of ReLUs results in

faster learning.

The other benefit of ReLUs is sparsity. Sparsity arises when x ≤ 0. The more

such units that exist in a layer, the more sparse the resulting representation.

Sigmoids, on the other hand, are always likely to generate some non-zero

value resulting in dense representations.
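Both properties can be checked numerically. The sketch below, with arbitrarily chosen sample inputs, compares the gradients of ReLU and of the sigmoid and measures the fraction of exactly-zero ReLU outputs; the function names are chosen for this example only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 where the unit is active, 0 elsewhere

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # shrinks towards 0 as |x| grows

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
relu_out = relu(x)
sparsity = np.mean(relu_out == 0)  # fraction of exactly-zero ReLU outputs
```

For these inputs the ReLU gradient is either 0 or 1, while the sigmoid gradient at x = 10 is already several orders of magnitude smaller than at x = 0, and the sigmoid never outputs an exact zero.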

4.1.2 Learning from examples, loss function and training

The goal of machine learning is to gain experience from data. Usually a
dataset is available, composed of many data points, each described by a
collection of features. In image recognition tasks, for example, the dataset
may be a collection of images, each represented as a matrix of pixel
intensity values.

In general, machine learning can be divided into two classes: supervised

and unsupervised; the approach in this thesis will be using both.

Unsupervised learning algorithms allow users to perform more complex
processing tasks than supervised learning, although unsupervised learning
can be more unpredictable than other learning methods. Unsupervised
learning algorithms include clustering, anomaly detection, neural
networks, etc.



Today, most practical machine learning models utilize Supervised Learning,

which applies an algorithm to map one input to one output. For supervised

learning to work, a labeled set of data that the model can learn from to

make correct decisions is needed. Data labeling typically starts by asking

humans to make judgments about a given piece of unlabeled data. For

example, labelers may be asked to tag all the images in a dataset. The

tagging can be as rough as a simple yes/no or as granular as identifying the

specific pixels in a specific image.

In machine learning, a properly labeled dataset that you use as the objective

standard to train and assess a given model is often called “ground truth.”

The accuracy of our trained model will depend on the accuracy of our

ground truth, so spending the time and resources to ensure highly accurate

data labeling is essential.

Then, in order to evaluate how well our algorithm performs the task, a
performance metric is needed. The choice of the metric is crucial, since it
drives the learning phase and determines the ability to generalize the task
to other data, which in supervised learning will be unlabelled. The metric
should be generic enough to be well defined for all our data examples, and
should provide precise and straightforward criteria to check whether our
objectives are reached.

Our learning goal should also include evaluating the performance of the
model on data that is not present in the training set, to check whether it
has learned adequately. A different data set, called the test set, is used
for this purpose, and the performance measured on this set is the indicator
of how good our model is.

In conclusion, machine learning methods are evaluated by their ability to:

• Achieve a good level of performance on the training examples.

• Generalize the learning so as to perform well on the test set too, making
the gap in performance between the training and test sets as small as
possible.

One interesting task that can be solved with machine learning is denoising,
the subject of this thesis. In denoising tasks the algorithm is given a
corrupted instance x together with its clean version y, and the model tries
to restore x into a state similar to y. The metric therefore has to measure
the similarity between the two images; since images are usually handled as
arrays of pixels, the metric could be a pixel-wise distance between the two
images.
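One common choice of such a pixel-wise distance is the mean squared error. The following sketch computes it on a synthetic image pair; the random "clean" image, the noise level of 0.1 and the function name `mse` are assumptions used only for illustration.

```python
import numpy as np

def mse(a, b):
    """Mean squared pixel-wise distance between two images of equal shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
y = rng.uniform(size=(64, 64))                 # clean image (synthetic stand-in)
x = y + rng.normal(scale=0.1, size=y.shape)    # corrupted version of y
err_noisy = mse(x, y)      # close to the noise variance, 0.1 ** 2
err_perfect = mse(y, y)    # 0.0 for a perfect restoration
```

A denoiser is doing useful work when the distance between its output and y is smaller than `err_noisy`.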

The process of optimisation is now discussed. Learning is cast as the
optimisation of an objective function; this is most often done by
minimisation, and the objective function is then usually called the error
function or the loss function. The model's level of performance depends on
the behaviour of the loss function, which means that if the loss value can
be minimised, an improvement in performance follows.


The loss function is denoted by L(x). Its exact form is problem dependent,
but in general a low value of it results in high performance for the
algorithm. The argument that minimises the loss function is denoted by x*,
such that:

$$ x^* = \arg\min_x L(x) \tag{4.1} $$

At a minimum the gradient vanishes, $\nabla_x L(x) = 0$. The method
generally adopted to solve this equation is gradient descent; it is
discussed further in [41].

Gradient descent is an optimization algorithm used to minimize some

function by iteratively moving in the direction of steepest descent as

defined by the negative of the gradient :


$$ x' = x - \epsilon \nabla_x L(x) \tag{4.2} $$

where ϵ is the learning rate, a positive constant that defines the size of
the step. Choosing the best learning rate is itself a matter of discussion;
in the rest of this work it is chosen as a small value. Gradient descent
converges when all the components of the gradient are 0 or close to 0.
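The update rule can be sketched in a few lines. The example below minimises a toy quadratic loss, chosen here only because its minimum is known; the learning rate 0.1, the tolerance and the function name `gradient_descent` are assumptions of the example.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # all gradient components close to 0
            break
        x = x - lr * g                # x' = x - eps * grad L(x)
    return x

# toy loss L(x) = ||x - target||^2, whose gradient is 2 (x - target)
target = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: 2.0 * (x - target), x0=[0.0, 0.0])
```

With this loss every step shrinks the distance to the minimum by a constant factor, so the iteration stops long before `max_iter`; a learning rate that is too large would instead make the iterates diverge.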

In particular, ReLU units help the learning phase and the gradient
computation, since their derivative is large at every point where they are
active; their activation is also cheap to compute, as described in
equation 2.2.

After giving this overview about Neural Networks, a specific and very

popular type of neural networks and the main subject of this thesis,

convolutional neural networks, is examined.

4.2 Convolutional Neural Networks

In the Machine Learning taxonomy, Convolutional neural networks (CNNs)

are a subset of deep learning algorithms. CNNs can be trained in both
supervised and unsupervised settings. They are a type of neural network
used to analyze data with a grid-like structure. A good example is image
data, represented as a two-dimensional grid of pixel values (e.g., RGB).

Thus, CNNs solve the problem of understanding images using networks of

more manageable complexity. The special neural network considers that

physical closeness between pixels bears a meaning and that elements of

interest can appear anywhere in a picture. This is accomplished by using a

linear convolution operation, which will be discussed in the next section.

The use of this operation in one or more layers is what defines a CNN.

4.2.1 Convolutional Operation

In a theoretical sense, the convolutional operator in its most primitive form

can be considered an operation on two functions of a real-valued input [42].

The convolution operation is defined as:

$$ s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \tag{4.3} $$

where x is the function mapping to a specific value in the input data, and w
represents the kernel. The output function s(t) is usually called a feature
map of x with respect to the kernel w.

If the input values are discrete, the convolution operation can be rewritten
using a summation:

$$ s(t) = (x * w)(t) = \sum_{a} x(a)\, w(t - a) \tag{4.4} $$

The input is commonly multidimensional. In that case, the functions can be

replaced with multivariable functions, operating on tensors. Consider an

example of applying convolution to a two dimensional image I as input.

One can then use a two-dimensional kernel K, and the operation can be

written as follows:

$$ S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) \tag{4.5} $$

That is, for a given pixel in the input, positioned in row i and column j, the

convolution is computed by "placing" the centre of the kernel over the

input pixel, and summing over the product of overlapping kernel

parameters and input pixels to produce the output value for i and j.
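The double sum of equation 4.5 can be evaluated directly. The sketch below is a deliberately simple (and slow) NumPy version that returns the full convolution, with indices running over every position where image and kernel overlap; the function name `conv2d_full` and the small test arrays are assumptions of this example.

```python
import numpy as np

def conv2d_full(I, K):
    """Direct evaluation of S(i, j) = sum_m sum_n I(m, n) K(i - m, j - n)."""
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for m in range(H):
        for n in range(W):
            # pixel I(m, n) contributes a copy of the kernel shifted to (m, n)
            S[m:m + kh, n:n + kw] += I[m, n] * K
    return S

I = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])
S = conv2d_full(I, K)
```

Convolution defined this way is commutative, so `conv2d_full(I, K)` and `conv2d_full(K, I)` agree. Many deep learning libraries implement the closely related cross-correlation, which slides the kernel without flipping it; since the kernel weights are learned, the distinction does not matter in practice.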

The convolutional operator can be seen as a matrix operation between the

kernel and a small portion of a larger image. Usually, the kernels adopted

for image processing in CNNs are significantly smaller than the image they

are applied to because it is assumed that in images the information is local,

so an object (or a part of an object) will be made of spatially close pixels. In

a convolutional layer the same kernel will be applied to all elements of the

input, meaning that the same operation is repeated in the image space

connecting groups of close hidden units.

The effect of a linear kernel can be seen in Figure 4.2: a kernel K of 2x2
pixels is applied to an image I. The kernel slides over the image in steps
of S pixels; S is called the stride. If the value of the stride is changed,
the number of output units of the convolutional operation changes as well.
For example, if the stride is 1, the kernel processes every element of the
input, whereas with a bigger stride the input is subsampled proportionally
to the stride. The output of the convolution is a linear combination of the
input elements, and the coefficients of this combination are learned during
training.

Figure 4.2 : Example of an application of a convolutional kernel that
performs an affine transformation. It is applied on the input image, and a
linear combination of input elements with coefficients given by the kernel
parameters is stored in the output matrix. The stride indicated is 1, so the
filter is moved by one element.
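The effect of the stride on the output size can be made concrete with a short sketch. It slides a 2x2 kernel over a 6x6 input without padding and without flipping the kernel; the input values, the all-ones kernel and the name `slide_kernel` are assumptions chosen for illustration.

```python
import numpy as np

def slide_kernel(I, K, stride=1):
    """Slide K over I (no padding, no kernel flip) with the given stride."""
    H, W = I.shape
    kh, kw = K.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = I[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * K)   # linear combination of the patch
    return out

I = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((2, 2))
out_s1 = slide_kernel(I, K, stride=1)   # shape (5, 5)
out_s2 = slide_kernel(I, K, stride=2)   # shape (3, 3)
```

Under these assumptions the output has (N - K) // S + 1 units per side, so doubling the stride roughly halves each output dimension.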

4.2.2 Pooling Layer


A limitation of the feature map output of convolutional layers is that they

record the precise position of features in the input. This means that small

movements in the position of the feature in the input image will result in a

different feature map. This can happen with re-cropping, rotation, shifting,

and other minor changes to the input image.

A common approach to addressing this problem is called pooling. This
operation achieves its purpose by replacing the output of the net at a
certain location with aggregate information over all the nearby units. Max
pooling is one of the most used pooling operations [43].

Maximum pooling, or max pooling, is a pooling operation that calculates
the maximum, or largest, value in each patch of each feature map. The
results are downsampled, or pooled, feature maps that highlight the most
present feature in the patch, rather than the average presence of the
feature as in the case of average pooling.
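A minimal sketch of non-overlapping pooling, assuming a 2x2 patch and a small hand-written feature map, shows the difference between the two variants. The function name `pool2d` is hypothetical.

```python
import numpy as np

def pool2d(fmap, size=2, reduce=np.max):
    """Non-overlapping pooling: one value per size x size patch."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size        # drop incomplete border patches
    patches = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return reduce(patches, axis=(1, 3))

fmap = np.array([[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 8.0],
                 [0.0, 0.0, 5.0, 5.0],
                 [0.0, 0.0, 5.0, 5.0]])
max_pooled = pool2d(fmap)                    # [[4, 8], [0, 5]]
avg_pooled = pool2d(fmap, reduce=np.mean)    # [[2.5, 2], [0, 5]]
```

In the top-right patch a single strong response (8) survives max pooling unchanged, while average pooling dilutes it to 2.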

4.2.3 CNN architecture

CNNs have been used for several tasks, so there is no single architecture
that defines the best CNN model for every task. When processing images,
however, some guidelines hold most of the time.

One of the best vision model architectures to date is VGG (Simonyan and
Zisserman, 2015), which was a popular solution for image classification,
and many successive approaches to this task took inspiration from it. The
distinctive feature of VGG16 is that, instead of having a large number of
hyper-parameters, its authors focused on convolution layers with 3x3
filters and stride 1, always using the same padding, and max-pool layers
with 2x2 filters and stride 2. This made small kernels the base idea of
most modern implementations.

A convolutional layer is usually composed of a set of M convolution
operators, each performing a linear combination of K×K elements of the
input, with K the size of the kernel. As described earlier, the number of
transformations that a network can learn is related to the number of
kernels. The size of the kernel, on the other hand, controls how many
elements are combined together. Thus, the number of parameters of the layer
is:

$$ P_W = N \cdot M \cdot K \cdot K \tag{4.6} $$

where N denotes the size of the input, i.e. its number of channels.
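As a quick worked example of equation 4.6, the sketch below evaluates the count for two layers, reading N as the number of input channels. The layer sizes (8 and 24 kernels, as in the example network of Figure 4.3) and the 3x3 kernel size are assumptions of this example, and bias terms are not counted.

```python
def conv_layer_weights(n_in, n_kernels, k):
    """P_W = N * M * K * K (bias terms not counted, as in equation 4.6)."""
    return n_in * n_kernels * k * k

p_first = conv_layer_weights(n_in=1, n_kernels=8, k=3)     # 1 * 8 * 3 * 3 = 72
p_second = conv_layer_weights(n_in=8, n_kernels=24, k=3)   # 8 * 24 * 3 * 3 = 1728
```

The count does not depend on the height or width of the image, which is why convolutional layers need far fewer parameters than fully connected layers applied to the same input.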

After the convolutional operations, the output elements are passed to an
activation function, which introduces nonlinearity in the learned
transformations. Finally, a pooling operation is performed and, as already
mentioned, the layer size is reduced by a factor equal to the pooling
operator's size.

A convolutional block is the ensemble of the operations described above: a
combination of convolutions, activations and poolings. The CNN is then
composed of multiple blocks, each with an input size reduced by the pooling
stage of the previous one. This sequence is called the subsampling path, as
the input dimension is compressed continuously, allowing more convolutional
layers to fit in the same amount of memory.

To obtain the output of the network, the output of the last convolutional
block is flattened and, after some fully connected layers, passed to a
classifier. The flattened vector of the last convolutional block is usually
called the feature vector, and it is the input to the classification task.

As an example of the architecture explained, Figure 4.3 shows a network
with the same structure. From left to right: an input matrix of 128 by 128
units, which can represent the analysed data or the output of a previous
convolutional block, is processed by a series of 8 convolutional operators.
The convolutional maps are pooled with a value of 2, so the dimensions are
divided by 2 and become 64 by 64. Next, a convolutional stage with 24
filters of stride 2 is applied and, after pooling, the output is reshaped
as a vector and used as input for a two-layer fully connected classifier.

Figure 4.3 : Schematic description of a convolutional network with
convolution and pooling operations. After the max pooling operation
the output is flattened to an array that is passed to a fully connected
network.

The CNN described here is one of the most basic implementations of the

concept, and it is probably not used anymore in real applications. However,

this is not a problem since our objective is not to survey CNN architectures

but to highlight the motivation behind their success.

4.2.4 The fully convolutional architecture


This is a particular class of CNNs where the output is structured as a map
that has the same shape as the input. It is called a Fully Convolutional
Neural Network (FCNN) and was first introduced for the task of image
segmentation (Long, Shelhamer, and Darrell, 2015). Segmentation is the
assignment of each pixel of the input to the class of the object it is a
part of, so it is possible to say that an FCNN transforms pixels into pixel
categories.

In a brain image, the task could be the separation of a white matter zone

from a grey matter.

The difference between the FCNN and the architecture of CNN previously

introduced is that a FCNN transforms the height and width of the

intermediate layer feature map in the subsampling path back to the size of

the input image through the transposed convolution operation, so that the

predictions have a one-to-one correspondence in shape with the input

image. Consequently, every pixel in the original input is associated with
an output unit.

Unlike the CNN with a fully contractive path, in FCNN it is possible that

each output unit depends only on a part of the input. This area is usually

called the receptive field, every pixel outside of this area will not contribute

to determine the unit output.

The case in which there is no pooling operation is interesting for the
receptive field: the ability to process larger areas then depends only on
the depth of the network. The receptive field grows from the initial size
of the kernel K by the stride S of the operator in each direction for each
consecutive layer, so after D layers the patch size is:

$$ PS = K + 2S(D - 1) \tag{4.7} $$

Figure 4.5 shows an example for a network with a kernel of size 3 by 3 with

a stride of 1 and depth d, which will have an effective patch width of 2d+1.
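Equation 4.7 is easy to tabulate. The short sketch below evaluates it for a 3x3 kernel with stride 1 at the first few depths; the function name `receptive_field` is chosen for this example.

```python
def receptive_field(k, s, depth):
    """Patch size after `depth` layers without pooling: PS = K + 2 S (D - 1)."""
    return k + 2 * s * (depth - 1)

sizes = [receptive_field(3, 1, d) for d in range(1, 5)]   # [3, 5, 7, 9]
```

For K = 3 and S = 1 the expression simplifies to 2D + 1, so a three-layer network sees a 7 by 7 patch of the input.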

This kind of architecture has a small receptive field, but it holds the
spatial information at the same level of detail that is present in the
input. This is possible because of the lack of the pooling operation, which
helps to improve translation invariance but at the same time makes the
output less dependent on the exact spatial position. Such a network
performs extremely well on tasks in which the definition and sharpness of
the output are important.

Figure 4.5 : Receptive field in a CNN without a pooling operation, with a
kernel size of 3 and a stride of 1. The layer operations are performed from
left to right. A unit on the output layer depends on 7 by 7 pixels in the
first layer. The size of the receptive field is given by 2D+1, with D the
depth of the network.

4.4 Denoising using Residual Learning

Residual learning is adopted along with batch normalization as an image

denoising strategy in the work of Zhang and collaborators [35]. When used
together, they speed up the training process and boost the denoising
performance.

Particular interest is given to residual learning, which aims to remove the
latent clean image gradually in the hidden layers, so as to separate the
noise contribution from the original image.

Residual learning can be schematized as follows. Consider an image and its
noisy version, (y, X), where the noise model is assumed to be additive:
$X = y + \epsilon$. The exact form of ϵ may vary, but in this example it
may be thought of as Gaussian with zero mean.

In discriminative denoising, a neural network can learn to map the noisy
example to the original by matching its output $y_{pred} = NN_W(X)$ to the
original image:

$$ NN_W(X) \sim y \tag{4.8} $$

The residual learning formulation instead aims to map the output of the
network to the noise part of the input. The denoised image is then obtained
by subtracting the output from the noisy input:

$$ NN_W(X) = X - y_{pred} $$

Using a pixel-wise mean squared error for simplicity, the loss function of a
denoising neural network with a fully convolutional architecture, as
described in section 4.2.4, can be written as:

$$ L(y_{pred}, y) = \frac{1}{n} \sum_{i} (y_i^{pred} - y_i)^2 \tag{4.9} $$

In order to simplify the notation, the sum over the pixels will not be
written explicitly in the next steps. In normal learning, for which the
output is $y_{pred} = NN_W(X)$, the loss function is:

$$ L_W = | y_{pred} - y |^2 \tag{4.10} $$

For the residual learning approach, in which $NN_W(X) = X - y_{pred}$, the
loss becomes:

$$ L(y_{pred}, y) = | X - NN_W(X) - y |^2 \tag{4.11} $$

For the additive noise model, where it is possible to write
$X = y + \epsilon$, the loss reduces to:

$$ L(y_{pred}, y) = | y + \epsilon - NN_W(X) - y |^2 = | \epsilon - NN_W(X) |^2 \tag{4.12} $$

This leads to $NN_W(X) \sim \epsilon$. This slight modification in the loss
function helps the neural network to find a solution that focuses on the
noise part of the problem instead of learning features that depend on the
image.
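The equivalence behind this reduction can be checked numerically. In the sketch below the network output is written NN_W(X) and the denoised image is X minus that output; the synthetic image, the noise level and the crude `toy_network` (which is not a trained model) are assumptions used only to show that the loss measured on the image equals the loss measured on the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(size=(32, 32))                # clean image (synthetic)
eps = rng.normal(scale=0.1, size=y.shape)     # zero-mean Gaussian noise
X = y + eps                                   # additive noise model

def toy_network(X):
    # stand-in for NN_W: a crude noise estimate (input minus its mean)
    return X - X.mean()

residual = toy_network(X)     # network output, meant to approximate eps
y_pred = X - residual         # denoised estimate

loss_on_image = np.mean((y_pred - y) ** 2)      # |X - NN_W(X) - y|^2
loss_on_noise = np.mean((residual - eps) ** 2)  # |NN_W(X) - eps|^2
```

The two losses coincide for any choice of network, which is why minimising the residual loss drives the network output towards the noise.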

The original paper's authors show that a simple neural network that applies
this strategy can decrease training time and has a better generalization
ability (the training can be transferred to related tasks). Its training
converges with a relatively small dataset, is more stable, and outperforms
many traditional algorithms in blind denoising, a task where the noise
level is unknown.

It is not surprising that adding a residual layer means using prior

information, in this case, our knowledge of the additivity of the noise

model, as an assumption to simplify the network task.

4.5 Batch Normalization

Normalizing the input data of neural networks to zero mean and constant
standard deviation has long been known [44] to be beneficial to neural
network training. Batch Normalization (BN) naturally extends this idea
across the intermediate layers within a deep network [Szegedy et al.,
2015]. Unfortunately, the activations and gradients in deep neural networks

without BN tend to be heavy-tailed. In particular, during an early on-set of



divergence, a small subset of activations (typically in the deep layer)

“explode.” The typical practice to avoid such divergence is to set the

learning rate to be sufficiently small such that no steep gradient direction

can lead to divergence. However, small learning rates yield little progress
along flat directions of the optimization landscape and may be more prone
to convergence to sharp local minima with possibly worse generalization
performance.

BN avoids the activation explosion by repeatedly correcting all activations

to be zero-mean and of unit standard deviation. This “safety precaution”

makes it possible to train the networks with large learning rates, as
activations cannot grow uncontrollably since their means and variances are
normalized. As a result, SGD with large learning rates yields faster
convergence along the flat directions of the optimization landscape and is
less likely to get stuck in sharp minima.

The Batch Normalization algorithm is described by the following equation:

$$ O_{b,c,x,y} \leftarrow \gamma_c \, \frac{I_{b,c,x,y} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c \quad \forall\, b, c, x, y \tag{4.13} $$

Here $I_{b,c,x,y}$ and $O_{b,c,x,y}$ are the four-dimensional input and
output tensors of a BN layer, the dimensions corresponding to the examples
within a batch b, the channel c, and the two spatial dimensions x, y
respectively. For input images the channels correspond to the RGB channels.
BN applies the same normalization to all activations in a given channel.

BN subtracts the mean activation
$\mu_c = \frac{1}{|\mathcal{B}|} \sum_{b,x,y} I_{b,c,x,y}$ from all input
activations in channel c, where $\mathcal{B}$ contains all activations in
channel c across all examples b in the entire mini-batch and all spatial
locations x, y. Subsequently, BN divides the centered activation by the
standard deviation $\sigma_c$ (plus ϵ for numerical stability).

Normalization is followed by a channel-wise affine transformation
parametrized through $\gamma_c$, $\beta_c$, which are learned during
training.
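The normalization of equation 4.13 can be sketched in NumPy for a tensor with axes (batch, channel, x, y). This is a training-time sketch only: it uses the statistics of the current mini-batch and omits the running averages that practical implementations keep for inference. The function name `batch_norm`, the value of eps and the random input are assumptions of the example.

```python
import numpy as np

def batch_norm(I, gamma, beta, eps=1e-5):
    """Normalise a (batch, channel, x, y) tensor per channel, then scale and shift."""
    mu = I.mean(axis=(0, 2, 3), keepdims=True)    # mu_c over b, x, y
    var = I.var(axis=(0, 2, 3), keepdims=True)    # sigma_c^2 over b, x, y
    I_hat = (I - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * I_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
I = rng.normal(loc=5.0, scale=3.0, size=(16, 3, 8, 8))
O = batch_norm(I, gamma=np.ones(3), beta=np.zeros(3))
```

With gamma set to 1 and beta to 0, each output channel has approximately zero mean and unit standard deviation, whatever the scale of the input.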

5. Dataset and Preprocessing


5.1 The MRI dataset

This section describes the original MRI dataset, called the fastMRI
dataset [45]. The dataset was not originally conceived for denoising: it is
used to test reconstruction algorithms in parallel acquisitions with
subsampling of the k-space.

Still, it is a very good candidate for the denoising test because it
consists of fully sampled, HD k-space data of images with generally high
SNR, to which simulated noise can be added to train our DnCNN. Some of the
acquisitions can be noisy, but they form just a minority, and the resulting
images are usually of great quality.

In any case, the noise present in the acquisitions is negligible compared
to the artificial noise added for training purposes. The presence of noise
in real data is even an advantage, since it pushes any denoising method to
be robust to labels that are not perfect.

The images were acquired with parallel MR imaging, which means that
multiple receiver coils were used. This instrument is usually placed in
proximity to the area to be imaged (brain or knee), and during imaging a
sequence of spatially and temporally varying magnetic fields, called a
“pulse sequence”, is applied by the MRI machine. Using multiple receiver
coils implies that each of them produces a separate k-space measurement
matrix, and each of these matrices is different (see figure 5.1), because
each coil provides a different view of the imaged volume, modulated by the
differential sensitivity that the coil exhibits to the MR signal arising
from different regions.

Figure 5.1 : Multiple coil acquisition (8 coils) of MR data.

Among the datasets available in fastMRI, the one used in this thesis to
train the DnCNN denoiser is the knee k-space data. It consists of
multi-coil raw data stored for 1,594 scans acquired for the purpose of
diagnostic knee MRI. A single fully sampled MRI volume was acquired for
each scan on one of three clinical 3T systems (Siemens Magnetom Skyra,
Prisma, and Biograph mMR), or one clinical 1.5T system (Siemens Magnetom
Aera). Data acquisition used a 15-channel knee coil array and a
conventional Cartesian 2D TSE protocol employed clinically at NYU School of
Medicine. The dataset includes data from two pulse sequences, yielding
coronal proton-density weighting with (PDFS, 798 scans) and without
(PD, 796 scans) fat suppression (see figure 5.2). Sequence parameters are, as

per standard clinical protocol, matched as closely as possible between the

systems. The following sequence parameters were used: Echo train length 4,

matrix size 320 × 320, in-plane resolution 0.5mm×0.5mm, slice thickness

3mm, no gap between slices. The timing varied between systems, with a

repetition time (TR) ranging between 2200 and 3000 milliseconds and echo

time (TE) between 27 and 34 milliseconds.



Figure 5.2 : Proton density weighted image (a) with fat suppression (PDFS), (b) without fat
suppression (PD). [45]

The total number of patients used for training and testing is shown in
Table 5.1.

FastMRI patients           Patients                 Slices
FastMRI Train Set: 973     Used in Train: 350       10236
                           Used in Test: 100        2959
FastMRI Val. Set: 197      Used in Validation:      2380

Table 5.1: Description of the FastMRI dataset. In the left column is
reported the total size of the dataset. In the right column is reported
the number of volumes (patients) used to train, validate, and test the
model. The results shown for the supervised and unsupervised training
are based on the same data split.

The dataset also provides 6970 fully sampled brain scans. A portion of 255
of them was later used for additional testing of the solution, but no
training was performed on it. A future version of our denoiser will be
trained on the whole dataset. This dataset provides examples from multiple
sequences and acquisition modalities. Both T1 and T2 weighted images are
present, and there are also contrast medium enhanced acquisitions.

$S_l(x)$ denotes the complex signal at the l-th coil in the x-space, which
corresponds to the inverse Fourier transform of $s_l(k)$:

$$ S_l(x) = \mathcal{F}^{-1}\{s_l(k)\} \tag{5.1} $$

with $s_l$ the acquired signal at coil l in k-space.

It is important to recall that, in a single-coil system, the final
magnitude image is obtained by simply taking the absolute value of the
complex signal, while in the multiple-coil case one complex image is
available per coil, and in order to get the final image it is necessary to
combine all that information. In this case the final reconstructed image is
called the Composite Magnitude Signal (CMS) (see figure 5.3).

Figure 5.3 : Test of reconstruction using a simple unweighted SoS, the green looking
pictures show k-space data up to 15 coils, below them the Individual coil spatial
images from fully sampled data, and on the right the reconstructed image from the
total coils.

The approach adopted for the reconstruction of the CMS in the multiple-coil
acquisition is the Sum of Squares (SoS), the most popular one, which has
been proven to be one of the most efficient, along with Spatial Matched
Filters (SMF). The advantage of using SoS is that it does not require a
prior estimation of the coil sensitivity, and thus the CMS is directly
reconstructed from the signal in each coil:

$$ M_T(x) = \sqrt{\sum_{l=1}^{L} | S_l(x) |^2} \tag{5.2} $$

It is important to note that other techniques have been proposed to
reconstruct the CMS from multiple signals, but for the sake of simplicity
the most efficient and straightforward one is used.
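The two reconstruction steps, the per-coil inverse Fourier transform and the SoS combination, can be sketched as follows. The sketch assumes that the k-space centre is stored at the centre of each array, which is why the shifts are applied; the random k-space data, the 8 coils of 32 by 32 samples and the function names `kspace_to_image` and `sum_of_squares` are placeholders for this example, not part of the dataset tooling.

```python
import numpy as np

def kspace_to_image(s):
    """Centred 2D inverse FFT over the last two axes: S_l(x) = F^{-1}{s_l(k)}."""
    return np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(s, axes=(-2, -1)), norm="ortho"),
        axes=(-2, -1))

def sum_of_squares(S):
    """Combine coil images of shape (L, H, W) into one magnitude image."""
    return np.sqrt(np.sum(np.abs(S) ** 2, axis=0))

rng = np.random.default_rng(0)
kspace = rng.normal(size=(8, 32, 32)) + 1j * rng.normal(size=(8, 32, 32))
coil_images = kspace_to_image(kspace)    # shape (8, 32, 32), complex
cms = sum_of_squares(coil_images)        # shape (32, 32), real and non-negative
```

With a single coil the combination reduces to the absolute value of the complex image, which matches the single-coil case described above.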

5.2 Noise and signal model for multiple correlated coils

As explained in the previous sections, MRI data are acquired in k-space.
Since the noise affects all frequencies (all the samples in k-space)
equally, it is signal- and source-independent and can be modeled as complex
Additive White Gaussian Noise (AWGN). The signal acquired by the l-th coil
in k-space can then be modeled as:

$$s_l(k) = a_l(k) + n_l(k; 0, \sigma_{K_l}^2(k)) \qquad (5.3)$$

where $a_l(k)$ is the noise-free signal in the l-th of a total of L coils,
$s_l(k)$ is the noisy signal received at coil l, and $n_l$ is the zero-mean
complex Gaussian noise with variance $\sigma_{K_l}^2$. This is the usual noise
assumption in MRI: the noise in each coil is considered to be stationary in
k-space. To obtain the complex image domain, the inverse Fourier transform
of $s_l(k)$ is applied to each slice and each coil.
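As an illustration of Eq. 5.3, the following sketch adds uncorrelated complex AWGN to a toy k-space array. The helper `add_complex_awgn` is hypothetical, and applying the standard deviation `sigma` separately to the real and imaginary parts is an assumption of this example.

```python
import numpy as np

def add_complex_awgn(kspace, sigma, rng):
    """Add zero-mean complex Gaussian noise to k-space.

    sigma is applied to the real and imaginary parts independently
    (an assumption of this sketch).
    """
    noise = rng.normal(0.0, sigma, kspace.shape) + 1j * rng.normal(0.0, sigma, kspace.shape)
    return kspace + noise

rng = np.random.default_rng(0)
clean = np.zeros((8, 64, 64), dtype=np.complex64)  # toy (coils, H, W) k-space
noisy = add_complex_awgn(clean, sigma=8e-3, rng=rng)
```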

In modern acquisition systems comprising up to 32 or 64 coils, the receivers
are coupled [16], which means that the noise samples at each k-space
location are correlated from coil to coil. Under the assumption that the
correlation is not frequency-dependent (i.e., it is the same for all k-space
samples), the correlation between coils carries over to the complex image
domain, where it is described by a covariance matrix that is non-diagonal,
symmetric, and positive definite. Its off-diagonal elements are the
correlations between each pair of coils.

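Given the assumption that each coil initially has Gaussian noise with
the same variance $\sigma_0^2$, the covariance matrix $\Sigma$ is defined as:

$$\Sigma = \sigma_0^2 \begin{pmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,L} \\ \rho_{2,1} & 1 & \cdots & \rho_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{L,1} & \rho_{L,2} & \cdots & 1 \end{pmatrix} \qquad (5.4)$$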

Usually, the correlations $\rho_{i,j}$ are significant in multi-coil systems.
Their values are determined by how the antenna is built and are therefore a
specific characteristic of the particular model of MRI scanner.

The exact correlation matrix of the scanners present in the dataset is
unknown. Although it could be estimated from the few background scans
present, a dummy coil geometry with a real-valued correlation matrix is
assumed here instead.

A circular geometry is chosen in which the correlation between coils is
$\rho_{i,j} = 0.3$ if the coils are first neighbours, $\rho_{i,j} = 0.15$ if they
are second neighbours, and $\rho_{i,j} = 0.05$ in all other cases. Figure 5.4
shows the chosen geometry and reports the correlation matrix. As mentioned
before, the system has 15 coils, which are subsampled to 8 coils. Three
different scanners are present in the dataset, but in this approximation
they all have the same correlation matrix.

Figure 5.4 : Covariance matrix for the correlated acquisition. The off-diagonal elements
are the correlations between coils; the correlation between neighbouring pairs is larger
than between distant ones.
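The circular geometry described above can be written down explicitly. The sketch below builds the 8-coil correlation matrix with the stated values (0.3, 0.15, 0.05) and draws noise with that coil-to-coil correlation through a Cholesky factor. The function names and the use of a Cholesky decomposition are assumptions of this example; the text does not state how the correlated noise was generated.

```python
import numpy as np

def circular_correlation(n_coils=8, rho1=0.3, rho2=0.15, rho_far=0.05):
    """Correlation matrix for coils placed on a ring."""
    idx = np.arange(n_coils)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n_coils - d)          # circular distance between coils
    corr = np.full((n_coils, n_coils), rho_far)
    corr[d == 1] = rho1                     # first neighbours
    corr[d == 2] = rho2                     # second neighbours
    corr[d == 0] = 1.0                      # diagonal
    return corr

def correlated_noise(corr, sigma0, shape, rng):
    """Complex noise of shape (coils, *shape) with covariance sigma0^2 * corr."""
    n_coils = corr.shape[0]
    chol = np.linalg.cholesky(corr)         # requires corr to be positive definite
    white = rng.normal(size=(n_coils, *shape)) + 1j * rng.normal(size=(n_coils, *shape))
    # Mixing white noise with the Cholesky factor imposes the coil correlation.
    return sigma0 * np.tensordot(chol, white, axes=(1, 0))
```

For 8 coils this matrix is positive definite, so the Cholesky factorisation is well defined.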

5.3 Data preprocessing

5.3.1 Low-pass filtering and reduced number of coils

One of the advantages of the method employed is that it can be applied
with minimal data preprocessing. However, training on the unprocessed data
turns out to be computationally heavy, so two operations are performed to
decrease the load: a low-pass filter is applied to the frequency data, and
the number of coils is reduced.

Low pass filtering (aka smoothing) removes high spatial frequency noise

from a digital image. The low-pass filters usually employ a moving window

operator, which affects one pixel of the image at a time, changing its value

by some function of a local region (window) of pixels. Thus, the operator

moves over the image to affect all the pixels in the image.

When low-pass filtering is applied in the implementation, the acquisition
size is reduced from 640x372 or 640x368 to 320x184. The motivation for this
filtering is that k-space is sampled at a higher frequency than is needed
for the noisy acquisitions to be simulated; the filtering gives cleaner
ground truth images and smaller examples, which reduces the memory needed
for training.

This filtering does not produce any significant effect on the final image.
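One simple way to obtain this size reduction is to keep only the central block of k-space, which discards the highest spatial frequencies. The sketch below assumes that this is how the low-pass step works and that k-space is stored with the zero frequency at the centre; the text does not specify the exact filter, and `center_crop_kspace` is a hypothetical helper.

```python
import numpy as np

def center_crop_kspace(kspace, out_shape=(320, 184)):
    """Keep the central out_shape block of centred k-space (last two axes)."""
    h, w = kspace.shape[-2:]
    oh, ow = out_shape
    top, left = (h - oh) // 2, (w - ow) // 2
    return kspace[..., top:top + oh, left:left + ow]
```

Both acquisition sizes mentioned above (640x372 and 640x368) are mapped to 320x184 by this crop.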

The complex k-space signal for each coil is normalized to the maximum
modulus value on a slice-by-slice basis. This normalization helps
convergence during training and standardizes the examples. Since not every
coil has the same sensitivity at each point of the volume, an alternative
would be to normalize by the maximum value over the whole volume; the
resulting difference in the reconstructed image is minor.

Moreover, not all the coils are used: only 8 of the 15 are selected, and the
selection is fixed for all examples. This reduces the overall quality of the
final reconstructed image, since less data is used, but it is needed to
perform the training in a more reasonable time.
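The two steps above can be sketched as a single preprocessing function. Which 8 coils are kept is not stated in the text, so the indices used here are purely illustrative, and normalizing by one maximum taken over all selected coils of the slice is an assumption of this sketch.

```python
import numpy as np

def preprocess_slice(kspace, coil_idx):
    """Select a fixed subset of coils and scale the slice to unit maximum modulus.

    kspace: complex array (coils, H, W) for one slice.
    coil_idx: indices of the coils to keep (hypothetical choice).
    """
    selected = kspace[coil_idx]
    return selected / np.abs(selected).max()

rng = np.random.default_rng(0)
slice_k = rng.normal(size=(15, 16, 12)) + 1j * rng.normal(size=(15, 16, 12))
out = preprocess_slice(slice_k, np.arange(8))  # first 8 coils, for illustration only
```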

5.3.2 Mixed precision

In order to speed up training during the development phase of this study,
learning is performed with TensorFlow mixed precision.

The benefits of using Mixed Precision are as follows:

• Speeds up math-intensive operations, such as linear and convolution

layers, by using Tensor Cores.

• Speeds up memory-limited operations by accessing half the bytes

compared to single-precision.

• Reduces memory requirements for training models, enabling larger

models or larger mini-batches.



Mixed precision uses both 16-bit and 32-bit variables during training.
Lower-precision data types use less memory and exploit the specialized
hardware present in GPUs for faster operations: modern accelerators have
dedicated hardware for 16-bit computations, and 16-bit data types can be
read from memory faster.

A powerful consequence is that the size of the mini-batch can be doubled
within the same memory, which doubles the number of examples processed at
each training step.

The decrease in model performance is very small, and the training time is
also reduced.
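The memory argument can be checked with simple arithmetic. The sketch below computes the size of one mini-batch of 320x184 examples with 16 channels in single and half precision; the helper `batch_bytes` is hypothetical and only counts the input tensor, not weights or activations. In TensorFlow itself, mixed precision is usually enabled with `tf.keras.mixed_precision.set_global_policy("mixed_float16")`, which is not exercised here.

```python
import numpy as np

def batch_bytes(batch_size, dtype, shape=(320, 184, 16)):
    """Bytes needed to store one mini-batch of inputs of the given dtype."""
    return batch_size * int(np.prod(shape)) * np.dtype(dtype).itemsize

fp32 = batch_bytes(64, np.float32)
fp16 = batch_bytes(64, np.float16)
print(f"float32: {fp32 / 2**20:.0f} MiB, float16: {fp16 / 2**20:.0f} MiB")
```

A batch of 128 examples in float16 occupies the same memory as a batch of 64 in float32, which is the doubling mentioned above.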



6. Implemented Methodology

6.1 Optimizing the input pipeline

In order to fully utilise the GPU capacity, the input pipeline must be as
efficient as possible. The tf.data¹ API makes it possible to build complex
input pipelines from simple, reusable pieces. It also makes it possible to
handle large amounts of data, read from different data formats, and perform
complex transformations.

In a naive approach, a training step includes opening a file, fetching a data
entry from the file, and then using the data for training. This is clearly
inefficient: while the model is training, the input pipeline is idle, and
while the input pipeline is fetching the data, the model is idle.

Prior to training the DnCNN, several input pipeline optimizations were
considered, such as prefetching, parallelising data extraction, parallelising
data transformation, caching, and vectorized mapping. The concepts that were
included in the implementation are listed below:

• Interleave : Used to process many input files concurrently, which is a
good way to parallelise data extraction. The level of parallelism is set
through the num_parallel_calls argument, which also accepts
tf.data.AUTOTUNE to let the tf.data runtime tune the value dynamically at
run time.

• Caching : Caches a large dataset in local storage. This saves operations
such as file opening and data reading from being executed during each epoch;
the next epochs reuse the data cached by the cache transformation.

• Shuffle : The dataset is shuffled using a buffer whose size is set by the
buffer size parameter, which affects the randomness of the transformation.
Poorly shuffled data can result in lower training accuracy.

1 https://www.tensorflow.org/guide/data

• Prefetching : Addresses the idle-time inefficiency described above by
overlapping the preprocessing and the model execution of a training step.
For example, while the model is executing training step n, the input
pipeline is reading the data for step n+1. Again, AUTOTUNE is used to tune
the value dynamically at run time.

• Batch : Groups consecutive entries into batches; the batch size was set
to 64.
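The transformations listed above can be chained as in the following sketch. It is an assumption-laden illustration rather than the pipeline used in this work: the data are taken from in-memory arrays instead of files (so interleave is not shown), and the function name `make_pipeline`, the shuffle buffer size, and the order of the stages are choices made for the example.

```python
import numpy as np
import tensorflow as tf

def make_pipeline(noisy, clean, batch_size=64, shuffle_buffer=256):
    """Cache, shuffle, batch and prefetch (noisy, clean) example pairs."""
    ds = tf.data.Dataset.from_tensor_slices((noisy, clean))
    ds = ds.cache()                          # reuse loaded data after the first epoch
    ds = ds.shuffle(shuffle_buffer)          # randomness depends on the buffer size
    ds = ds.batch(batch_size)                # groups of 64 examples by default
    ds = ds.prefetch(tf.data.AUTOTUNE)       # overlap data loading with training
    return ds

# Toy data standing in for 16-channel k-space examples.
noisy = np.zeros((130, 8, 8, 16), dtype=np.float32)
dataset = make_pipeline(noisy, noisy)
```

When reading from many files, a call such as `files.interleave(..., num_parallel_calls=tf.data.AUTOTUNE)` would typically come first in the chain.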

6.2 Metrics
The following metrics will be used to assess the performance of the DnCNN
on the fastMRI dataset :

• SNR : In general, the metric used to assess the quality of magnitude
images with respect to noise in the acquisition is the Signal-to-Noise
Ratio (SNR). For MR images it is usually defined as the mean intensity of
the pixels in a signal region divided by the standard deviation of the
background:

$$\mathrm{SNR} = \frac{\mu_{signal}}{\sigma_{background}} \qquad (6.1)$$

• PSNR : For MRI the Peak Signal-to-Noise Ratio (PSNR) will be used. PSNR is
defined from the mean squared error (MSE), which for a pair of NxM real
images I and I* can be written as :

$$\mathrm{MSE} = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} [I(i,j) - I^*(i,j)]^2 \qquad (6.2)$$

PSNR is then defined as :

$$\mathrm{PSNR} = 20 \cdot \log_{10}\left(\frac{MAX_I}{\sqrt{\mathrm{MSE}}}\right) \qquad (6.3)$$

PSNR is expressed in decibels (dB). It is the logarithm of the Euclidean
distance between the two images, normalized to the maximum value $MAX_I$
that a pixel can assume, which for an 8-bit image is 255.

• SSIM : The Structural Similarity Index is a perceptual metric that
quantifies the image quality degradation caused by processing such as data
compression or by losses in data transmission. Unlike PSNR, SSIM is based on
visible structures in the image. The SSIM between a patch m of the original
image and the corresponding patch n of the denoised image is defined as :

$$\mathrm{SSIM}(m,n) = \frac{(2\mu_m \mu_n + c_1)(2\sigma_{mn} + c_2)}{(\mu_m^2 + \mu_n^2 + c_1)(\sigma_m^2 + \sigma_n^2 + c_2)} \qquad (6.4)$$

where $\mu$ is the mean intensity of a patch, $\sigma$ is its standard deviation,
$\sigma_{mn}$ is the covariance between the two patches, and $c_1$, $c_2$ are small
regularising constants set to 0.01 and 0.03.

• Residual Maps : They are computed as the squared differences between the
pixel intensities of the original and restored images, so each value
$D_{i,j}$ of the residual map is:

$$D_{i,j} = (I_{i,j} - J_{i,j})^2 \qquad (6.7)$$

Both I and J are normalised between 0 and 1.

The average of the residual map over the image is also reported. This metric
is related to PSNR, which is proportional to the logarithm of its inverse.
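The relation between the residual map and PSNR can be made concrete with a short sketch. The function names are illustrative, and the images are assumed to be normalised to [0, 1], so that the maximum pixel value is 1.

```python
import numpy as np

def residual_map(ref, img):
    """Pixel-wise squared difference between two images (Eq. 6.7)."""
    return (ref - img) ** 2

def psnr(ref, img, max_val=1.0):
    """PSNR in dB from Eq. 6.2 and Eq. 6.3."""
    mse = residual_map(ref, img).mean()      # the MSE is the mean of the residual map
    return 20.0 * np.log10(max_val / np.sqrt(mse))
```

With images in [0, 1], the PSNR equals minus ten times the base-10 logarithm of the average residual, which is the proportionality mentioned above.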

SSI and residual maps are two complementary metrics: SSI is computed over
small regions and gives information about the relation between pixels,
while the residual maps depend only on the difference between two single
pixels.
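Eq. 6.4 can be evaluated on a single pair of patches as sketched below. The sketch follows the common convention $c_1 = (k_1 R)^2$ and $c_2 = (k_2 R)^2$ with $k_1 = 0.01$, $k_2 = 0.03$ and $R$ the data range, which is an assumption about how the constants quoted above enter the formula. A full SSIM map would apply the same computation in a sliding window, which is not shown.

```python
import numpy as np

def ssim_patch(m, n, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM between two patches of equal shape (Eq. 6.4)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_m, mu_n = m.mean(), n.mean()
    var_m, var_n = m.var(), n.var()
    cov = ((m - mu_m) * (n - mu_n)).mean()   # covariance between the patches
    num = (2 * mu_m * mu_n + c1) * (2 * cov + c2)
    den = (mu_m ** 2 + mu_n ** 2 + c1) * (var_m + var_n + c2)
    return num / den
```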

6.3 Training the K-DnCNN on the FastMRI dataset


The same model described in the second chapter is used: an FCNN (fully
convolutional neural network) as described in 4.2.4, with 18 convolutional
layers, detector stages, and regularisation with batch normalization,
without any downsampling path (no pooling), and with a residual learning
connection implemented in the output layer as in Eq. 4.14.

It is essential to notice that this network has no subsampling operation, so
its output units have a small receptive field. This means that only partial,
local information, and not the entire input, is used to compute (and train)
the value of each output unit.

The receptive field of the network is 41 x 41 pixels, and the input is
formed by the real and imaginary parts of the complex k-space data of every
single coil. Since 8 coils are used, the number of input channels is 16.
The diagram of the network is shown in figure 6.1.
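The mapping from 8 complex coils to 16 real channels can be sketched as follows. The channel ordering (all real parts first, then all imaginary parts) and the channels-last layout are assumptions of this example; the text does not specify them.

```python
import numpy as np

def to_channels(kspace):
    """(coils, H, W) complex -> (H, W, 2*coils) real, real parts first."""
    stacked = np.concatenate([kspace.real, kspace.imag], axis=0)
    return np.moveaxis(stacked, 0, -1)

def to_complex(channels):
    """Inverse of to_channels: (H, W, 2*coils) real -> (coils, H, W) complex."""
    c = channels.shape[-1] // 2
    return np.moveaxis(channels[..., :c] + 1j * channels[..., c:], -1, 0)
```

The inverse mapping is needed after the network, since the image is reconstructed from the complex coil data.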

Thus, the Dn-CNN effectively denoises each area of the image based only on
partial context. This reduces the number of parameters of the model, which
limits overfitting even when the network is trained on a small dataset. It
also helps to bypass the learning of complex anatomical structures, since
only a patch of the input is seen by each output unit.

Figure 6.1 : K-DnCNN model for multichannel MRI denoising

The network was trained for blind denoising, with the standard deviation of
the noise in k-space varied in the range $\sigma \in [5, 20] \cdot 10^{-3}$.
The starting quality of the original images is not homogeneous, and the
dataset contains scans acquired both with and without fat suppression, which
generate brighter or darker images. It is therefore difficult to measure the
effect of this noise on the whole dataset in terms of SNR. Instead, PSNR was
used to quantify the degradation of the noisy images with respect to the
original ones, and the PSNR gain was used as a measure of improvement.

Overall, random noise with this range of standard deviations generates
corrupted images with a PSNR between 10 and 35 dB, which means that their
visual quality ranges from reasonable to deeply damaged. The network then
learns to estimate the noise content and remove it. In a multi-coil scan
with a strong correlation between the noise in the coils, this task tends to
be particularly difficult at the image level, because the noise is not
stationary and depends on the intensity of the image.

The supervised and unsupervised training will be discussed in the

following sections. The obtained results will be reported afterward.


6.3.1 K-DnCNN using Supervised Learning

The supervised learning approach requires ground truth images, and the
learning process finds the set of weights that minimizes the loss function.
The network that processes data in k-space maps a noisy version of an input
to its ground truth.

The ground truth is the clean version of the k-space data, so minimizing the
loss should reproduce it. Since the metrics are computed only on the
reconstructed version of the image, a second term is added to the loss
function to help the network concentrate on bringing the reconstruction of
its output close to the ground truth magnitude image. The loss function used
in the training is defined as :

$$L_k = \mathrm{MSE}(S_y, S) + \beta \cdot \mathrm{MSE}(\mathrm{SoS}(S_y), M) \qquad (6.8)$$

where $S_y$ is the 16-channel output of the network, $S$ is the ground truth
signal in k-space, and $M$ is the ground truth magnitude image. The two terms
compare, by means of the MSE, the restored output to the clean signal in
k-space, and the reconstructed restored image to the original magnitude
image.

The parameter β balances the two terms: it controls how much the
reconstruction in magnitude space is weighted in the loss function. It is
not a constant; it starts at zero and slowly increases to its maximum value.

The network is trained with the Adam [47] optimiser for 300 epochs with a
learning rate of $3 \cdot 10^{-3}$; the learning rate is then reduced to
$3 \cdot 10^{-4}$ and the network is trained until the validation loss stops
decreasing.

The loss coefficient β is initialised at zero and then linearly increased to
$5 \cdot 10^{3}$ between epochs 50 and 200. The final value is chosen so that
the reconstruction term is of the same order of magnitude as the loss in
k-space.
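The schedule of β and its role in Eq. 6.8 can be sketched as below. The function names are illustrative, and the two MSE terms are passed in as precomputed numbers so that the example stays independent of any particular framework.

```python
def beta_schedule(epoch, beta_max=5e3, start=50, end=200):
    """Zero before `start`, linear ramp up to beta_max at `end`, constant afterwards."""
    frac = (epoch - start) / (end - start)
    return beta_max * min(max(frac, 0.0), 1.0)

def supervised_loss(mse_kspace, mse_magnitude, epoch):
    """Eq. 6.8 with the epoch-dependent weight of the magnitude term."""
    return mse_kspace + beta_schedule(epoch) * mse_magnitude
```

During the first 50 epochs the loss therefore reduces to the k-space MSE alone, and the magnitude term is phased in gradually afterwards.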

6.3.2 K-DnCNN using Unsupervised Learning

For unsupervised learning, the Noise2Noise framework is used. In this

case, there is no ground truth, so the loss function is modified to contain

only the corrupted version of the images.

In general, the denoising task with deep learning is performed by

constructing a map between a corrupted version of an image and the

corresponding clean one in the supervised learning approach.



However, when a clean image is not available, it is usually more

challenging to train a neural network with acceptable performance for the

task.

One possible strategy to overcome this limit is the unsupervised learning
approach proposed in [46]. It makes it possible to train a neural network
using only multiple corrupted examples of the same target, without
explicitly using the target itself. This method applies a general-purpose
deep CNN model to unsupervised denoising and reaches a performance very
close to that of supervised learning, but without the problem of collecting
ground truth images.

The idea is based on viewing the loss function L as a generalized point
estimator, that is, an operator that uses sample data to calculate a single
value serving as the best guess of an unknown distribution parameter. The
training objective can then be written as an expected value over the
distribution of the pairs of noisy/clean images (X, y):

$$\underset{W}{\mathrm{argmin}} \; E_{(X,y)}\{L(NN_W(X), y)\} \qquad (6.9)$$

for which a possible solution is the set of weights W that gives:

$$NN_W(X) = E_{y}\{y\} \qquad (6.10)$$

Splitting the expectation into an average over the sample of the noisy
images and a conditional expectation over all the possible clean images, the
objective can be written as :

$$\underset{W}{\mathrm{argmin}} \; E_{X}\{E_{y|X}\{L(NN_W(X), y)\}\} \qquad (6.11)$$

In this case, P(y | X) denotes the distribution of all the possible correct
observations that can be matched to a noisy example. Examples are all the
possible positions of an edge when the borders are noisy, or the exact
position of a boundary between two noisy surfaces.



It can be noticed that if the target is replaced with another noisy
observation that has the same expectation value (for example, the clean
target corrupted by additive white noise), then the solution still holds.
For example, in the case of the $L_2$ loss it is possible to train using
corrupted target samples if their noise has zero mean.
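A toy numerical check makes this property tangible. In the sketch below, a single unknown clean value is observed only through zero-mean noisy targets, and the value that minimises the average $L_2$ loss against those targets is found by a grid search. The clean value, the noise level, and the grid are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = 0.7                                           # unknown clean value
targets = clean + rng.normal(0.0, 0.2, size=2000)     # zero-mean corrupted targets

# Average L2 loss of each candidate estimate against the noisy targets only.
candidates = np.linspace(0.0, 1.5, 301)
l2 = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(l2)]
print(best, targets.mean())
```

The minimiser coincides, up to the grid resolution, with the mean of the noisy targets, which approaches the clean value although the clean value is never used in the loss.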

Recalling the output of the network in residual learning from Eq. 4.14, the
loss function can be written explicitly as:

$$L_k = \mathrm{MSE}(S_n - N_W(S_n), S_n^*) \qquad (6.12)$$

which, with the additive noise described in 5.6, becomes :

$$L_k = |S + n - N_W(S_n) - S - n^*|^2 \qquad (6.13)$$

where $S_n = S + n$ and $S_n^* = S + n^*$ are both independently corrupted
versions of the same original k-space data S. The purpose of this task is to
create a map between two independent samplings of the noise.

For the blind denoising task, there is no need to use the same standard

deviation of the noise and the input to the network does not need to be of

better quality with respect to the target. So it is possible to ask the network

to map an image with higher PSNR to another with lower PSNR without

damaging the training. This is extremely important for true unsupervised

blind denoising since the noise level can also be unknown at every step.

In contrast to the training in the supervised approach, the aim here is not to

minimize the training loss since the task is impossible to complete. In

residual learning, this unsupervised task consists in transforming one

instance of the noise to another one independently sampled.

Figure 6.2 shows the loss function for the training and the validation of
the supervised and unsupervised training. In supervised training, the
training and validation losses decrease together and differ by a small
amount, which is called the training bias. In unsupervised learning,
instead, the training loss is almost flat and the validation loss decreases
rapidly: the training loss is flat because the training task is effectively
impossible, and the validation loss decreases rapidly because the weight
gradients are the correct ones, so when the network is validated on the
original task the loss correctly decreases.

Figure 6.2 : Example of the qualitative differences in training for
the (a) supervised and (b) unsupervised approaches.
Train loss (blue) and validation loss (orange) during live
training.

The network is trained with the Adam [47] optimizer for 300 epochs with a
learning rate of $3 \cdot 10^{-3}$; afterward, the learning rate is reduced to
$3 \cdot 10^{-4}$, and the network is trained for another 100 epochs. Then the
learning rate is exponentially decreased to zero.
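This three-phase schedule can be written as a small function. The per-epoch decay factor of the final exponential phase is not given in the text, so the value used here is a hypothetical placeholder.

```python
def learning_rate(epoch, decay=0.95):
    """Piecewise schedule: 3e-3 for 300 epochs, 3e-4 for 100 epochs, then decay.

    `decay` is an assumed per-epoch factor for the exponential phase.
    """
    if epoch < 300:
        return 3e-3
    if epoch < 400:
        return 3e-4
    return 3e-4 * decay ** (epoch - 400)
```

A function of this form can be passed to a per-epoch learning-rate callback in most training frameworks.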



7. Experimental results

The Dn-CNN is trained for a blind denoising task, so different levels of
image corruption are used to test its performance. The test was performed
with noise standard deviations of $\sigma = 8 \cdot 10^{-3}$ and
$16 \cdot 10^{-3}$ which, since the initial image quality is not homogeneous,
produce noisy images with an average PSNR of (24.3 ± 2.9) dB and
(18.9 ± 2.5) dB respectively.

The experimental results obtained are measured only on the reconstructed

images. The denoising is performed in k-space, but since it is never shown

to the operator, it would be useless to base the results on it.

The final reconstructed image from the denoised k-space, without post-

processing (except for a 0 to 1 normalization of pixel intensities), is used in

all performed tests.

PSNR and SSI are the metrics used to evaluate the restored images. To
prevent large background areas from contributing too much to the
calculation, each image is processed entirely, but the metrics are computed
only on a large central region of the signal, as shown for example in Figure
3.4 for the whole image and in Figure 3.10 for the central patch.

This is crucial since the background is very easy to treat. Its presence will

give exceedingly high scores that are not representative of the real

performances.

The test was held on a portion of data of 100 patients and 2959 scans

comprising both acquisitions with and without fat suppression.

In Table 7.1 the results for the supervised and unsupervised training are

reported. Both approaches consistently improve the image quality both at a

high and low level of noise and remarkably, at a high noise level, their

performance matches.

Noise std           PSNR (dB)     ΔPSNR (dB)    SSI            ΔSSI

Supervised
σ = 8 ⋅ 10^-3       29.3 ± 3.1    5.1 ± 2.1     0.80 ± 0.06    0.2 ± 0.1
σ = 16 ⋅ 10^-3      25.6 ± 2.6    6.7 ± 1.8     0.60 ± 0.10    0.3 ± 0.1

Noise2Noise
σ = 8 ⋅ 10^-3       28.3 ± 3.4    4.0 ± 3.1     0.80 ± 0.06    0.2 ± 0.1
σ = 16 ⋅ 10^-3      25.6 ± 2.8    6.7 ± 2.4     0.61 ± 0.10    0.3 ± 0.1

Table 7.1 : Average results on the test dataset, composed of 100 patients with 2959
slices that were processed separately. The results were obtained at two levels of
noise, σ = 8 ⋅ 10^-3 and 16 ⋅ 10^-3, which produce noisy images with an average PSNR
of (24.3 ± 2.9) dB and (18.9 ± 2.5) dB respectively. PSNR and SSI are reported for
the K-DnCNN trained with both supervised and unsupervised (Noise2Noise) learning,
together with the gain Δ. At the higher noise level the two methods perform
similarly, while at the lower level supervised training seems to work better.

7.1 Qualitative results

Figures 7.1 to 7.4 show whole knee images, with and without fat
suppression, illustrating the effect of the noise and the denoising with
K-DnCNN for supervised and unsupervised (Noise2Noise) training at the
$\sigma = 8 \cdot 10^{-3}$ (low) and $\sigma = 16 \cdot 10^{-3}$ (high) noise
levels. Both methods removed the noise successfully while preserving the
fine details of the image: flat areas look smooth, edges are not blurred,
and textures are preserved. Moreover, new details absent from the original
image (artifacts) are not generated.

Figure 7.1 : Noise effect and denoising with K-DnCNN for supervised and
unsupervised (Noise2Noise) training at σ = 8 ⋅ 10^-3 (Low) and σ = 16 ⋅ 10^-3
(High) noise level.

Figure 7.2 : Noise effect and denoising with K-DnCNN for supervised and
unsupervised (Noise2Noise) training at σ = 8 ⋅ 10^-3 (Low) and σ = 16 ⋅ 10^-3
(High) noise level.

Figure 7.3 : Noise effect and denoising with K-DnCNN for supervised and
unsupervised (Noise2Noise) training at σ = 8 ⋅ 10^-3 (Low) and σ = 16 ⋅ 10^-3
(High) noise level.

Figure 7.4 : Noise effect and denoising with K-DnCNN for supervised and
unsupervised (Noise2Noise) training at σ = 8 ⋅ 10^-3 (Low) and σ = 16 ⋅ 10^-3
(High) noise level.

7.2 Quantitative Results

7.2.1 Supervised Learning

Examples of supervised denoising results on central patches of knee images
are reported in figures 7.5-7.8, at both high and low noise levels. Both
residual and structural similarity maps were computed between the ground
truth and the prediction (GT-prediction) and between the ground truth and
the noisy image (GT-noised). In residual maps, each pixel is the squared
difference between the pixel intensities of the ground truth and of the
corrupted or filtered image. In figure 7.5, for example, where a high level
of noise was applied, most of the pixels in the GT-prediction residual map
are close to black, which in the map corresponds to an identity mapping.
This indicates that the images filtered using K-DnCNN with supervised
learning are very close to the original ones. Furthermore, the average value
of the residual map in figure 7.5 ($4.25 \cdot 10^{-3} \simeq 0.004$) shows
that this particular restored image at a high noise level is, on average,
very similar to the original magnitude image.

Conversely, the GT-noised map has predominantly red pixels, which implies
that the pixel intensities of the ground truth and of the noisy image are
not similar.

The SS map compares the luminance, contrast, and structure factors between
the GT and the recovered image. The higher the SSIM, the better; values
close to 1 mean nearly identical data. In figure 7.6, for example, the
pixels are mostly red and yellow, which means that the similarity is high.

Even better results are obtained when denoising images at a low noise level
with supervised learning, as can be seen in figures 7.7 and 7.8. The average
value of the residual map is even lower, as shown in figure 7.7
($7.54 \cdot 10^{-4}$), and the SS index is close to 1 (0.871).

FIGURE 7.5 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the supervised training at noise level σ = 16 ⋅ 10^-3

FIGURE 7.6 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the supervised training at noise level σ = 16 ⋅ 10^-3

FIGURE 7.7 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the supervised training at noise level σ = 8 ⋅ 10^-3

FIGURE 7.8 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the supervised training at noise level σ = 8 ⋅ 10^-3

The distributions of the PSNR and SSI values computed at slice level on the
noisy and recovered images, along with the slice-wise gain, are also
reported in figures 7.9 and 7.10. At the high noise level
($16 \cdot 10^{-3}$), the average PSNR of the noisy images is 18.9 ± 2.5 dB,
with around 1200 slices belonging to this range, while the PSNR of most of
the restored slices falls in the range 25.6 ± 2.6 dB. The average PSNR gain
for denoising at high levels of noise is therefore 6.7 ± 1.8 dB, which
indicates that the supervised denoiser performs very well at high levels of
noise. For the SSI, values closer to 1 are reported at slice level for the
predictions, and a remarkable gain of 0.3 can be noticed.

At the low noise level ($8 \cdot 10^{-3}$), the K-DnCNN has a slightly lower
performance, but remarkable overall gains can be seen as well.

FIGURE 7.9: Distribution of the a) PSNR and b) SSI values computed at slice
level on the noisy image (left), on the restored image (center), and the
slice-wise gain. Results for the supervised training at noise level
σ = 16 ⋅ 10^-3

FIGURE 7.10: Distribution of the a) PSNR and b) SSI values computed at slice
level on the noisy image (left), on the restored image (center), and the
slice-wise gain. Results for the supervised training at noise level
σ = 8 ⋅ 10^-3

7.2.2 Unsupervised Learning

Examples of unsupervised denoising (Noise2Noise) results on central patches
of knee images are reported in figures 7.11-7.12, at both high and low
noise levels.

At high levels of noise, the unsupervised denoising method gives results
comparable to those of the supervised one. In figure 7.11, for instance, the
average residual map value is $5.03 \cdot 10^{-3} \simeq 0.005$, while for
the same image patch filtered by the supervised method an average value of
$7.54 \cdot 10^{-4}$ is obtained. The same holds for figure 7.12.

FIGURE 7.11 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the unsupervised (Noise2Noise) training at noise level σ = 16 ⋅ 10^-3

FIGURE 7.12 : a-c) Example of denoising on the central image patch.
d-e) Residual maps.
f-g) Structural similarity maps.
Results for the unsupervised (Noise2Noise) training at noise level σ = 16 ⋅ 10^-3

The distributions of the PSNR and SSI values computed at slice level on the
noisy images and on the images recovered by Noise2Noise, along with the
slice-wise gain, are also reported in figures 7.13 and 7.14.

In figure 7.13, where the high noise level is applied, the average PSNR gain
for the restored images is (6.7 ± 2.4) dB, which is very similar to the gain
obtained with supervised learning; the average SSI gain is also similar
(0.3 ± 0.1).

However, in figure 7.14, where the low noise level is applied, less
favourable results were reached for the PSNR gain: the structural similarity
gain is the same as for the supervised approach, but the average PSNR gain
is (4.0 ± 3.1) dB, lower than the (5.1 ± 2.1) dB of the supervised approach.

FIGURE 7.13: Distribution of the a) PSNR and b) SSI values computed at slice
level on the noisy image (left), on the restored image (center), and the
slice-wise gain. Results for the unsupervised (Noise2Noise) training at
noise level σ = 16 ⋅ 10^-3

FIGURE 7.14: Distribution of the a) PSNR and b) SSI values computed at slice
level on the noisy image (left), on the restored image (center), and the
slice-wise gain. Results for the unsupervised (Noise2Noise) training at
noise level σ = 8 ⋅ 10^-3

7.2.3 Application on the Brain Dataset

The possibility to perform the denoising task on data of the same kind but

with essential differences with respect to the one used during training is a

sought-after feature for a method that may be implemented in actual

practice.

In order to test if a good performance is reached on k-space data derived

from different NMR acquisition sequences from a different body location,

the Dn-CNN trained on the knee dataset was applied to the denoising of

brain data present in the FastMRI dataset.

Compared to the knee dataset, which is mostly homogeneous in acquisition
parameters, the brain dataset is more challenging for denoising, especially
when it is not directly used for training. Also, the shapes, contrasts, and
average intensities of a brain scan are visually very different from those
commonly found in an arthroscopic acquisition, such as the one of the knee.

For these reasons, this preliminary task of denoising with a network trained
on a different dataset is difficult. Nevertheless, the results are an
important test of the generalizability of the method.

To test the Dn-CNN denoiser previously trained for the blind denoising task
with supervised learning, 637 slices of brain scans from 255 patients in the
validation set of the Brain FastMRI dataset were selected. Noise with
$\sigma = 16 \cdot 10^{-3}$ and with the correlation between coils defined in
section 5.2 was applied to simulate a highly noisy acquisition.

Since the initial quality of the scans is very different because it depends on

the acquisition used and the noise already present, this noise injection

produces a diverse effect on the images in terms of quality.

The range of PSNR and SSI of the corrupted images and of the predictions is
shown in Figure 7.16. The mean PSNR of the corrupted images is
(23.6 ± 4.5) dB, which usually corresponds to highly noisy images with
clearly visible variations in the intensity areas and in the background, as
shown in Figure 7.15c.

After applying the denoiser, the average gain in image quality is 4.6 ± 2.7
dB for the PSNR and 0.2 ± 0.1 for the SSI, which signifies that, on average,
the image is improved both in the restoration of its original intensity and
in its pixel correlation.

An example of processed images with residual and SS maps can be found in
figures 7.15d-f.

Figure 7.15: Brain dataset denoising on the central image patch at noise
level σ = 16 ⋅ 10^-3

Figure 7.16: Results for the brain dataset: distribution of the a) PSNR
and b) SSI values computed at slice level on the noisy image (left), on the
restored image (center), and the slice-wise gain. Results for the supervised
training at noise level σ = 16 ⋅ 10^-3

7.3 Comparison with a state-of-the-art method

7.3.1 Non Local Means

Non-local means (NLM) has been shown to be an effective image denoising
technique. An image can often be divided into multiple small patches that
appear repeatedly. The noise can be removed by exploiting the information in
these redundant patches while simultaneously preserving the small structures
of the image. By taking advantage of the redundant patches, the NLM image
denoising method [48] achieves effective performance and is regarded as one
of the most popular denoising methods.

The key principle of non-local means is to denoise a pixel by averaging the
pixels in its neighbourhood, weighted by the similarity of their surrounding
patches. However, the definition of similarity between the patch of the
noisy pixel and the patches in its spatial neighbourhood is not strict in
NLM; it is simply computed by a block matching process.

The basic principle of non-local means denoising is to replace the noisy value $I(i)$ of pixel $i$ with a weighted average of all the pixels in the image. Because this requires too much computation, it is more practical to average the pixels within a smaller search window. The pixel to be denoised is denoted by $i$, and the pixels in the neighborhood of $i$ that are used to denoise it are denoted by $j$. The estimated value $\hat{I}(i)$ for a pixel $i$ is computed as the weighted average of the neighborhood pixels $j$ around pixel $i$:

$$\hat{I}(i) = \sum_{j \in N_i^s} w_{ij}\, I(j) \qquad (7.1)$$

where $N_i^s$ is the search window of size $(2n + 1) \times (2n + 1)$ centered at $i$, and $w_{ij}$ is the weight between the two pixels $i$ and $j$, which is calculated from the similarity of their patches and is defined as:

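$$w_{ij} = \frac{1}{Z_i} \exp\left(-\frac{\lVert N^d(i) - N^d(j) \rVert^2}{h^2}\right) \qquad (7.2)$$

where $N^d(i)$ is the patch around pixel $i$, $Z_i$ is a normalising term, $Z_i = \sum_{j} \exp\left(-\lVert N^d(i) - N^d(j) \rVert^2 / h^2\right)$, so that the weights sum to one, and $h$ acts as a filtering parameter.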
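Equations (7.1) and (7.2) can be sketched in NumPy as follows. This is a minimal, unoptimized illustration and not the Dipy implementation used in this work: the function names `nlm_pixel` and `nlm_denoise`, the default radii and the value of `h` are assumptions made for the example, and reflect padding at the borders is only one possible choice.

```python
import numpy as np


def nlm_pixel(image, i, search_radius=2, patch_radius=1, h=0.1):
    """Estimate the denoised value of pixel i = (row, col), as in (7.1)-(7.2)."""
    pad = search_radius + patch_radius
    # Reflect-pad so that every patch in the search window is fully defined.
    padded = np.pad(image, pad, mode="reflect")
    r, c = i[0] + pad, i[1] + pad

    def patch(rr, cc):
        # Square patch of side 2 * patch_radius + 1 centered at (rr, cc).
        return padded[rr - patch_radius: rr + patch_radius + 1,
                      cc - patch_radius: cc + patch_radius + 1]

    ref = patch(r, c)
    weights, values = [], []
    # Loop over the search window of side 2 * search_radius + 1.
    for dr in range(-search_radius, search_radius + 1):
        for dc in range(-search_radius, search_radius + 1):
            dist2 = np.sum((ref - patch(r + dr, c + dc)) ** 2)
            weights.append(np.exp(-dist2 / h ** 2))  # unnormalized w_ij
            values.append(padded[r + dr, c + dc])
    weights = np.array(weights)
    # Dividing by the sum of the weights applies the 1 / Z_i normalization.
    return float(np.sum(weights * np.array(values)) / np.sum(weights))


def nlm_denoise(image, search_radius=2, patch_radius=1, h=0.1):
    """Apply nlm_pixel to every pixel of a 2D image (slow reference loop)."""
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = nlm_pixel(image, (r, c), search_radius, patch_radius, h)
    return out
```

A larger `h` makes the weights more uniform and the smoothing stronger, while a smaller `h` keeps only the most similar patches; production implementations vectorize these loops.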

7.3.2 Denoising the fastMRI dataset and results


For the test comparison with NLM, the reconstructed magnitude images were corrupted with four levels of noise, [5e-3, 1e-2, 2e-2, 5e-2], added in K-space. The resulting SNR depends critically on the initial image quality, so it may vary considerably at the same sigma level.
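A minimal sketch of this kind of corruption is given below. It assumes a single-coil image, a unitary FFT and independent Gaussian noise on the real and imaginary parts of k-space; the actual procedure is the one defined in section 5.2 and may differ in normalization and in its multi-coil handling. The function name `add_kspace_noise` and the square test object are hypothetical.

```python
import numpy as np


def add_kspace_noise(image, sigma, rng):
    """Corrupt a single-coil image with complex Gaussian noise in k-space."""
    kspace = np.fft.fft2(image, norm="ortho")
    # Independent zero-mean Gaussian noise on the real and imaginary parts.
    noise = sigma * (rng.standard_normal(kspace.shape)
                     + 1j * rng.standard_normal(kspace.shape))
    noisy_kspace = kspace + noise
    # Magnitude reconstruction: after np.abs the noise is no longer additive.
    return np.abs(np.fft.ifft2(noisy_kspace, norm="ortho"))


rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0  # hypothetical test object, not fastMRI data
noise_levels = [5e-3, 1e-2, 2e-2, 5e-2]
noisy_by_level = {s: add_kspace_noise(clean, s, rng) for s in noise_levels}
```

With the unitary FFT the noise standard deviation is the same in k-space and in the complex image, which makes the four sigma values directly comparable to the image intensity in this toy setting.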

The nlmeans function was imported from the Dipy library²; it applies non-local means denoising to 3D or 4D images and boosts the SNR of datasets. It is also possible to choose between modeling the noise as Gaussian or Rician.

A few of the available parameters are worth citing:

• estimate_sigma: estimates the standard deviation of the noise from local patches. It is also possible to specify the number of coils of the receiver array. In this case 8 was chosen in order to get 16 channels.

• patch_radius: the similar patches in the non-local means are searched for locally, inside a cube of side 2 * v + 1 centered at the voxel of interest.

• block_radius: the size of the block to be used in the blockwise non-local means implementation.

• rician: if True, the noise is modeled as Rician; otherwise Gaussian noise is assumed. In this case Gaussian noise was chosen.

These values were chosen by brute-force search for the best performance, which led to patch_radius = 1 and block_radius = 2.

² https://dipy.org/documentation/1.4.1./examples_built/denoise_nlmeans/#example-denoise-nlmeans

The NLM denoising result on a single slice from the brain dataset is shown in Figure 7.17.

Figure 7.17: NLM denoising on a whole brain dataset image (noise standard deviation σ = 8).

The grouped scatter plot for the three denoising algorithms used on the knee dataset (K-DnCNN, K-N2N and NLM) is reported in Figure 7.18. It shows the PSNR of the denoised slices for the three algorithms as a function of the starting PSNR of the noisy images.

Figure 7.18: PSNR of the denoised images as a function of the starting PSNR of the noisy images. Each point is a slice in the knee dataset. The colors represent the three algorithms used for denoising. Blue: K-DnCNN, Green: K-N2N, Orange: NLM.

Blue refers to K-DnCNN, green to K-N2N, and orange to NLM. The plot shows that most of the images processed by K-DnCNN and K-N2N have higher PSNR values than those processed with NLM, across multiple noise levels.

Figures 7.19 and 7.20 compare the PSNR and SSI gains obtained by the three algorithms at multiple noise levels. With the tuning chosen for NLM, DnCNN and N2N perform better than NLM in terms of both PSNR and SSI. It is also interesting to note that the two DL algorithms behave similarly with respect to the PSNR/SSI level of the noisy image.

Figure 7.19: PSNR gain of the three algorithms as a function of noise level.

Figure 7.20: SSI gain of the three algorithms as a function of noise level.

8. Conclusion

The denoising task was performed on the raw k-space data of the fastMRI dataset, the largest and most complete dataset of its type. In addition, a method based on residual learning was proposed and applied to the frequency data instead of directly denoising the reconstructed images.

The method takes advantage of the power of the residual learning framework: because the noise is additive, the network can concentrate on building a high-level representation of the noise component instead of the clean image.
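The residual learning idea can be written compactly as follows. This follows the general DnCNN formulation [35]; the symbols are assumptions introduced here, with $y$ the noisy k-space data, $x$ the clean data, $n$ the additive noise and $R_\theta$ the network.

```latex
% Additive noise model and residual prediction
y = x + n, \qquad \hat{n} = R_\theta(y), \qquad \hat{x} = y - R_\theta(y)

% Training loss over N pairs (y_k, x_k): the target is the noise y_k - x_k
\mathcal{L}(\theta) = \frac{1}{2N} \sum_{k=1}^{N}
    \left\lVert R_\theta(y_k) - (y_k - x_k) \right\rVert^2
```

The network is trained to output the noise, and the clean estimate is obtained by subtracting this prediction from the input, which is only meaningful when the noise is additive.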

In MRI, particularly in multi-coil acquisition, the noise is not additive in the reconstructed image. Chapter 3 discussed how, in a simple simulation, performing the denoising task with residual learning on frequency data produces better results than applying a network of the same complexity directly to the image data. The same method was then applied to the denoising of multi-coil data for morphological imaging. The most important result in this case is the correct reproduction of the anatomical parts of the knee, such as the muscle, the bones and the cartilage.

Results were quantified using PSNR and SSI: the first measures the pixel-wise restoration of the true intensity, and the second measures, within a small window, how well the correlation and contrast between the original and predicted images are reproduced.
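As a reminder of what the first metric measures, the PSNR can be sketched with the standard definition below. The function name `psnr` and the choice of `data_range` (the peak-to-peak range of the reference when not given) are assumptions; the normalization used in the experiments of this work may differ.

```python
import numpy as np


def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if data_range is None:
        # Assumption: use the dynamic range of the reference image as the peak.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    # Note: mse == 0 (identical images) gives an infinite PSNR.
    return 10.0 * np.log10(data_range ** 2 / mse)
```

The gain reported for a slice is then `psnr(clean, denoised) - psnr(clean, noisy)`, computed with the same `data_range` for both terms.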

In the tests, the metrics improved after applying the denoiser in a blind denoising task at both high and low noise levels. This is important because our denoiser appears to generalize the denoising task to both low and high noise levels. Since the starting quality of an MRI scan is usually not known, an algorithm that performs well in blind denoising would be of good use.

The results obtained were encouraging. The ability to generalize to different sequences and anatomical subjects is one of the most important properties to develop for application in a clinical setting.

Another important point is that supervised and unsupervised learning achieved comparable performance on the task, as the metrics showed. One reason is the flexibility of the Noise2Noise algorithm, which can work even with a small dataset, since we chose to work in k-space, where the noise is simpler than the noise present in the images.

N2N is widely applicable, since it is often impossible to obtain the ground-truth images needed for training; this framework makes it possible to train using only the corrupted images.
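The reason training on corrupted images alone can work is summarized below, following the general Noise2Noise argument [46]. The symbols are assumptions introduced here: $y_1$ and $y_2$ are two noisy versions of the same clean data $x$, with independent zero-mean noise $n_1$ and $n_2$, and $f_\theta$ is the network.

```latex
% Two independent noisy observations of the same clean data
y_1 = x + n_1, \qquad y_2 = x + n_2, \qquad \mathbb{E}[n_2] = 0

% Noise2Noise objective: a noisy input is mapped to a noisy target
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}\,\lVert f_\theta(y_1) - y_2 \rVert^2

% Expanding the square: the cross term vanishes because n_2 is zero-mean
% and independent of y_1 and x, so only a constant is added to the
% supervised loss
\mathbb{E}\,\lVert f_\theta(y_1) - y_2 \rVert^2
  = \mathbb{E}\,\lVert f_\theta(y_1) - x \rVert^2 + \mathbb{E}\,\lVert n_2 \rVert^2
```

Since the last term does not depend on $\theta$, the minimizer is the same as for training with clean targets, provided the noise is zero-mean and independent between the two copies. This holds for additive Gaussian noise in k-space but not for the noise in magnitude images.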

In the results, details were usually restored in the low-noise case, although they could still be improved. At the same time, it was necessary to preserve the vital aspect of the solution, namely retaining the quantitative information of the original image. This should not be sacrificed to provide a better visual effect since, in our analysis, the quantitative results are the ones that matter most.

Another point to consider would be to use all the coils instead of only 8; this would likely improve the results, since each coil acquisition carries additional signal from the same volume that a neural network could exploit. Eight coils were used instead of 15 so that the number of channels in the input images would be a multiple of eight, in order to take advantage of the Tensor Cores. One solution would be to add a dummy coil to reach 16 coils, which would bring in the additional information while keeping a suitable number of channels.
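The dummy-coil idea can be sketched as follows. The sketch assumes that each complex coil is split into a real and an imaginary channel (consistent with 8 coils giving 16 channels) and that the dummy coil is filled with zeros; the function name `coils_to_channels` and the channel ordering are hypothetical.

```python
import numpy as np


def coils_to_channels(kspace, multiple=8):
    """Stack real/imaginary parts of complex coil data, zero-padding the coils.

    kspace: complex array of shape (n_coils, height, width).
    Returns a real array of shape (2 * n_padded_coils, height, width).
    """
    n_coils = kspace.shape[0]
    # Round the number of coils up to the next multiple (15 -> 16 for 8).
    n_padded = -(-n_coils // multiple) * multiple
    dummy = np.zeros((n_padded - n_coils,) + kspace.shape[1:], dtype=kspace.dtype)
    padded = np.concatenate([kspace, dummy], axis=0)
    # Channel layout (an assumption): all real parts, then all imaginary parts.
    return np.concatenate([padded.real, padded.imag], axis=0)


# 15 complex coils become 16 coils, i.e. 32 real-valued channels.
example = np.ones((15, 4, 4)) + 1j * np.ones((15, 4, 4))
channels = coils_to_channels(example)
```

The zero-filled coil carries no signal; its only role is to bring the channel count to a multiple of eight so that all 15 real coils can be used.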

One unrealistic aspect of the unsupervised learning approach may be the assumed availability of an unlimited number of noisy examples of a real exam, since the pairs of examples are generated automatically. In practice, the number of noisy acquisitions of a subject is usually finite and relatively small (about a dozen copies). In this case, it is possible to try N2N with the finite set of available noisy examples, and it may still be a viable approach. Furthermore, in this context, it is essential to remember that supervised learning (SL) is even more constrained by the number of noisy examples if the noise is not generated from a model.

Regarding the choice of the neural network architecture, it has several advantages, as discussed in depth in Chapter 2:

• It works in partial context, since its receptive field is smaller than the image and it effectively processes patches of the input; as a consequence, it reduces the leakage of information from training into the denoised example.

• It uses a residual learning approach that, in addition to improving performance, helps reduce unwanted artifacts in the final image.

• The network is small, both in its number of parameters and in its architecture, which makes it very robust to overfitting.

Nevertheless, there are still many ways to enhance the chosen architecture while keeping its original shape and parameters; a careful choice of these parameters makes it possible to approach the best result for denoising.

A new idea could be to implement a layer block derived from ResNet [49], which is often employed with success in denoising and could replace our convolutional layers; this would keep the same strategy while deploying a deeper solution.



Comparing K-DnCNN with the well-established NLM method provided a useful benchmark and a straightforward supporting result for this thesis: denoising in K-space outperformed a well-known denoising algorithm, with the tuning chosen for NLM, in terms of both PSNR and SSI, and the image quality and the similarity to the original image were consistent. Overall, the goal of this work was to show that taking advantage of the additivity of the noise in K-space is a better approach than learning from the noise in magnitude space.

References
[1] Mohan, J., Krishnaveni, V., and Guo, Y. (2014). A Survey on the Magnetic Resonance Image Denoising Methods. Biomed. Signal Process. Control. 9, 56–69. doi:10.1016/j.bspc.2013.10.007 (accessed November 19, 2020).

[2] Tomasi, C., and Manduchi, R. (1998). “Bilateral Filtering for Gray and Color Images,” in Proceedings of the Sixth International Conference on Computer Vision (IEEE Computer Society, ICCV ’98), 839.

[3] Lakshmi Devasena C, Hemalatha M. Noise removal in magnetic resonance images using hybrid KSL filtering technique.

[4] Phophalia A, Rajwade A, Mitra SK. Rough set based image de-noising for brain MR images. Signal Processing.

[5] Rajeesh J, Moni RS, Palanikumar S, Gopalakrishnan T. Noise reduction in magnetic resonance images using wave atom shrinkage. International Journal of Image Processing (IJIP).

[6] Xu, J., Huang, Y., Cheng, M.-M., Liu, L., Zhu, F., Xu, Z., et al. (2020). Noisy-as-clean: Learning Self-Supervised Denoising from Corrupted Image. IEEE Trans. Image Process. 29, 9316–9329.

[7] Akar SA. Determination of optimal parameters for bilateral filter in brain MR image denoising. Applied Soft Computing. 2016;43:87-96.

[8] Dey N, Ashour AS, Beagum S, Sifaki Pistola D, Gospodinov M, Gospodinova E, Tavares RS. Parameter optimization for local polynomial approximation based intersection confidence interval filter using genetic algorithm: An application for brain MRI image denoising. Journal of Imaging. 2015;1(1):60-84.

[9] TM Hudson, DJ Hamlin, WF Enneking, and H Pettersson. Magnetic

resonance imaging of bone and soft tissue tumors: early experience in 31 patients

compared with computed tomography. Skeletal radiology, 13(2):134–146, 1985.

[10] WD Zimmer, TH Berquist, RA McLeod, FH Sim, DJ Pritchard, TC

Shives, LE Wold, and GR May. Bone tumors: Magnetic resonance imaging

versus computed tomography. Radiology, 155(3):709–718, 1985.

[11] Timothy G Feeman. The mathematics of medical imaging: a beginners guide.

Springer Science & Business Media, 2010.

[12] K Kirk Shung, Michael Smith, and Benjamin MW Tsui. Principles of

medical imaging. Academic Press, 2012.

[13] D. L. Collins, A.P. Zijdenbos, V. Kollokian, J.G. Sled, N.J. Kabani, C.J.

Holmes, and A.C. Evans. Design and construction of a realistic digital brain

phantom. IEEE Trans. on Medical Imaging, 17(3):463–468, June 1998.

[14] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John

C Morris, and Randy L Buckner. Open access series of imaging studies (oasis):

cross-sectional mri data in young, middle aged, nondemented, and demented older

adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.

[15] B. Menze, A. Jakab, S. Bauer, M. Reyes, M. Prastawa, and K. V.

Leemput. Multimodal brain tumor segmentation challenge. In MICCAI

Conference, October 2012.

[16] Santiago Aja-Fernandez and Antonio Tristan-Vega. A review on statistical

noise models for magnetic resonance imaging. LPI, ETSI Telecomunicacion,

Universidad de Valladolid, Spain, Tech. Rep, 2013.

[17] Hakon Gudbjartsson and Samuel Patz. The rician distribution of noisy mri

data. Magnetic Resonance in Medicine, 34(6):910–914, 1995.



[18] ER McVeigh, RM Henkelman, and MJ Bronskill. Noise and filtration in

magnetic resonance imaging. Medical physics, 12:586, 1985.

[19] Pierre Gravel, Gilles Beaudoin, and Jacques A De Guise. A method for

modeling noise in medical images. IEEE Transactions on Medical Imaging,

23(10):1221– 1232, 2004.

[20] Alessandro Foi. Noise estimation and removal in mr imaging: The variance

stabilization approach. In ISBI, pages 1809–1814, 2011.

[21] Ranjan Maitra and David Faden. Noise estimation in magnitude mr

datasets. Medical Imaging, IEEE Transactions on, 28(10):1615–1622, 2009.

[22] Pierrick Coupe, Jose V Manjon, Elias Gedamu, Douglas L Arnold,

Montserrat Robles, D Louis Collins, et al. Robust rician noise estimation for mr

images. Medical image analysis, 14(4):483–493, 2010.

[23] Jose V Manjon, Pierrick Coupe, and Antonio Buades. Mri noise

estimation and denoising using non-local pca. Medical image analysis, 22(1):35–

47, 2015.

[24] Guido Gerig, Olaf Kubler, Ron Kikinis, and Ferenc A Jolesz. Nonlinear

anisotropic filtering of mri data. Medical Imaging, IEEE Transactions on,

11(2):221–232, 1992.

[25] Tim McInerney and Demetri Terzopoulos. Deformable models in medical

image analysis: a survey. Medical image analysis, 1(2):91–108, 1996.

[26] Jan Sijbers, Arnold J. den Dekker, Paul Scheunders, and Dirk Van Dyck.

Maximum likelihood estimation of rician distribution parameters. IEEE

Transaction on Medical Imaging, 17(3):357–361, 1998.

[27] J. V. Manjon, J. C. Caballero, G. G. Marti, L. Marti-Baonmati, and M.

Robles. Mri denoising using non local means. Medical Image Analysis, 12:514–

523, 2008.

[28] Yasuyuki Taki, Benjamin Thyreau, Shigeo Kinomura, Kazunori Sato,

Ryoi Goto, Ryuta Kawashima, and Hiroshi Fukuda. Correlations among brain

gray matter volumes, age, gender, and hemisphere in healthy individuals. PloS

one, 6(7):e22734, 2011.

[29] Edelstein, W. A. et al. (Aug. 1986). “The intrinsic signal-to-noise ratio in

NMR imaging”. In: Magn. Reson. Med. 3.4, pp. 604–618. ISSN: 0740-3194.

DOI: 10.1002/mrm.1910030413.

[30] Raya, José G. et al. (Jan. 2010). “T2 measurement in articular cartilage:

Impact of the fitting method on accuracy and precision at low SNR”. In: Magn.

Reson. Med. 63.1, pp. 181–193. ISSN: 0740-3194. DOI: 10.1002/mrm.22178.

URL: https://doi.org/10.1002/mrm.22178.

[31] Dietrich, Olaf, Sabine Heiland, and Klaus Sartor (Mar. 2001). “Noise

correction for the exact determination of apparent diffusion coefficients at low

SNR”. In: Magn. Reson. Med. 45.3, pp. 448–453. ISSN: 0740-3194. URL:

https://doi.org/10.1002/1522-2594(200103)45:3<448::AID-MRM1059>3.0.CO;2-W.

[32] Glenn, G. Russell, Ali Tabesh, and Jens H. Jensen (2015). “A simple noise

correction scheme for diffusional kurtosis imaging”. In: Magnetic Resonance

Imaging 33.1, pp. 124–133. ISSN: 0730-725X.

[33] Taylor, Alexander J. et al. (Oct. 2016). “Probe-Specific Procedure to

Estimate Sensitivity and Detection Limits for 19F Magnetic Resonance Imaging”.

In: PLOS ONE 11.10, e0163704. DOI: 10.1371/journal.pone.0163704.

[34] Fan, Linwei et al. (Dec. 2019). “Brief review of image denoising techniques”.

In: Visual Computing for Industry, Biomedicine, and Art 2.



[35] Zhang, K. et al. (2017). “Beyond a Gaussian Denoiser: Residual Learning of

Deep CNN for Image Denoising”. In: IEEE Transactions on Image Processing

26.7, pp. 3142–3155. ISSN: 1941-0042.

[36] Arbelaez, Pablo et al. (May 2011). “Contour Detection and Hierarchical

Image Segmentation”. In: IEEE Trans. Pattern Anal. Mach. Intell. 33.5, pp.

898–916. ISSN: 0162-8828. DOI: 10.1109/TPAMI.2010.161.

[37] Orieux, François, Jean-François Giovannelli, and Thomas Rodet (June

2010). “Bayesian estimation of regularization and point spread function

parameters for Wiener Hunt deconvolution”. In: Journal of the Optical Society

of America. A Optics, Image Science, and Vision, p. 1593.

[38] Liu, F. et al. (2017). “Fast Realistic MRI Simulations Based on Generalized

Multi-Pool Exchange Tissue Model”. In: IEEE Transactions on Medical

Imaging 36.2, pp. 527–537. ISSN: 1558-254X.

[39] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton (2012).

“ImageNet Classification with Deep Convolutional Neural Networks”. In:

Proceedings of the 25th International Conference on Neural Information

Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran

Associates Inc., 1097–1105.

[40] Lu, Le et al. (2017). Deep Learning and Convolutional Neural Networks for

Medical Image Computing: Precision Medicine, High Performance and Large-Scale

Datasets.

[41] Chong, Edwin and Stanislaw Zak (2001). “An Introduction to

Optimization”. In: 2nd. SERIES IN DISCRETE MATHEMATICS AND

OPTIMIZATION. WILEY-INTERSCIENCE. Chap. 8th.

[42] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep

Learning. http://www.deeplearningbook.org. MIT Press.



[43] Zhou and Chellappa (24-2). “Computation of optical flow using a

neural network”. In: IEEE 1988 International Conference on Neural

Networks, 71–78 vol.2.

[44] LeCun, Haffner, Bottou and Bengio (1998). Object recognition with

Gradient-Based Learning.

[45] Zbontar, Jure et al. (2019). fastMRI: An Open Dataset and Benchmarks for

Accelerated MRI. arXiv: 1811.08839 [cs.CV].

[46] Lehtinen, Jaakko et al. (2018). “Noise2Noise: Learning Image Restoration

without Clean Data”. In: Proceedings of the 35th International Conference on

Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80.

Proceedings of Machine Learning Research. PMLR, pp. 2965–2974.

[47] Kingma, Diederik P. and Jimmy Ba (2017). Adam: A Method for

Stochastic Optimization.

[48] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image

denoising,” in IEEE Conference on Computer Vision and Pattern

Recognition, vol. 2, 2005, pp. 60–65

[49] He, Kaiming et al. (2016). Identity Mappings in Deep Residual Networks.

arXiv: 1603.05027 [cs.CV].

[50] P. Wang, H. Zhang, V.M. Patel, SAR Image despeckling using a

convolutional neural network, IEEE Signal Process. Lett. 24 (12) (2017) 1763–

1767.