Master's Thesis on Dental Imaging AI

Republic of Tunisia

Ministry of Higher Education and Scientific Research University of Kairouan


Higher Institute of Applied Sciences and Technology of Kairouan

Reference: IPR-FOR-08
Final project internship report
Approval date: 2/10/2024
Version: Final

N° ……..
END OF STUDIES THESIS
In order to obtain the Master of Research Diploma
in Computer Science
Speciality: Data Science
By
JAOUADI Yasmine

Defended on: …………..…………

Jury members:
…………………………. President

…………………………. Examiner

Dr. Imen JDEY Supervisor

…………………………. Reviewer

Project in collaboration with ReGIM-Lab (Research Groups in Intelligent Machines, LR11ES48)

Academic Year: 2023-2024

ISSAT of Kairouan - Address: University Campus Ring Road Dar El Amen Kairouan 3100
Tel: +216 77 27 38 04 / +216 77 27 37 96 - Fax: +216 77 27 38 06 - Website: https://issatkr.rnu.tn
Acknowledgments

First and foremost, I wish to express my deepest gratitude to Allah, the Almighty, for granting
me the strength, patience, and perseverance to complete this work. Without His guidance, this
achievement would not have been possible.

I would like to extend my sincere thanks to my supervisor, Ms. Imen Jdey, whose unwavering
support and expertise have been a pillar throughout this journey. Her insightful feedback, critical
perspectives, and constant encouragement helped shape this work into its final form. Her avail-
ability and willingness to guide me through every challenge were invaluable and instrumental in
achieving this result.

My deepest appreciation goes to my family, particularly my parents and my brother, who
have been my rock during this entire process. Their unconditional support, both emotionally
and financially, has been a source of strength and motivation throughout my academic journey.
Without their belief in me, none of this would have been possible.

I would like to extend a special thank you to the esteemed members of the jury. Your time and
effort in reviewing and evaluating this work are greatly appreciated. The honor of your presence
is truly humbling, and I am deeply grateful for your valuable insights and feedback, which will
contribute significantly to my continued growth.

A heartfelt thank you also goes out to the entire pedagogical and administrative team at
ISSAT. Your dedication and commitment to providing a high-quality education have been essen-
tial in shaping the foundations of our knowledge and professional competence. Your efforts do
not go unnoticed, and I am deeply grateful for the environment of learning and development that
you have fostered.

Lastly, I would like to extend my gratitude to all those who, in one way or another, contributed
to the completion of this project. Whether through direct involvement or offering a kind word
of encouragement, your support has played an important role in the successful realization of this
work. Thank you!

Contents

List of Abbreviations

General Introduction
0.1 Context and Motivations
0.2 Objectives and Contributions
0.3 Structure of the Report

1 State of the Art
1.1 Introduction
1.2 Dental Problem Diagnosis through Medical Imaging
1.2.1 Structural anatomical imaging
1.2.2 Functional anatomical imaging
1.2.3 X-ray imaging
1.2.4 Computed Tomography (CT)
1.2.5 Magnetic Resonance Imaging (MRI)
1.2.6 Ultrasound imaging
1.2.7 Nuclear medicine imaging
1.2.8 Fluoroscopy
1.3 Deep Learning in Dental Imaging
1.4 Annotation
1.4.1 Definition
1.4.2 Categories of Data Annotation Based on Format
1.4.3 Main Types of Image Annotation
1.4.4 Annotation modalities
1.4.5 Importance and Objectives of Image Annotation
1.4.6 Challenges of Image Annotation
1.5 Dental X-Ray Imaging and Teeth Anatomy
1.5.1 Overview of Dental X-Ray Imaging
1.5.2 Anatomical Structure of Teeth and Naming Convention
1.6 Implications for Dental Annotation
1.6.1 Diagnostic Accuracy
1.6.2 Treatment Planning
1.6.3 Challenges in Dental Annotation
1.7 Literature Review
1.7.1 Research Methodology
1.8 Related Work
1.9 Synthesis
1.10 Conclusion

2 Methodology
2.1 Introduction
2.2 Instance Segmentation
2.2.1 Definition
2.2.2 Technical Components of Instance Segmentation
2.2.3 Key Challenges in Instance Segmentation
2.3 Proposed Model
2.3.1 Historical Context
2.3.2 Mask RCNN
2.4 Proposed Automatic Image Annotation System Overview
2.4.1 Image acquisition and preprocessing
2.4.2 Automatic image segmentation
2.4.3 Image attribute extraction and calculation
2.4.4 Annotation by automatic image classification
2.5 Performance of the automatic image annotation system
2.5.1 Confusion Matrix
2.5.2 Accuracy
2.5.3 Precision
2.5.4 Recall
2.5.5 F1-Score
2.5.6 Specific performance metrics
2.6 Conclusion

3 Experiments and Results
3.1 Introduction
3.2 Working environment
3.2.1 Kaggle
3.2.2 Roboflow
3.3 Dataset
3.4 Proposed automatic teeth segmentation workflow
3.4.1 Data Acquisition
3.4.2 Data Preprocessing
3.4.3 Data Annotation
3.4.4 Segmentation
3.4.5 Validation
3.4.6 Model Optimization with Attention
3.5 Results and Discussion
3.5.1 Comparison between both models
3.6 Conclusion

Conclusion

References
List of Figures

1.1 Dental X-ray imaging
1.2 Computed Tomography (CT)
1.3 Deep Learning
1.4 Manual annotation using LabelMe
1.5 Dental imaging modalities
1.6 Keywords Used

2.1 Instance Segmentation
2.2 History
2.3 Faster RCNN's Architecture
2.4 Global Architecture of Mask RCNN
2.5 Architecture of ResNet 50
2.6 Customized Architecture of Mask RCNN
2.7 Automatic Image Annotation System overview
2.8 Confusion matrix

3.1 Working Platform: Kaggle
3.2 Image Annotation Platform: Roboflow
3.3 Dataset X-ray teeth image examples
3.4 Children X-ray image example
3.5 Automatic teeth annotation - proposed pipeline
3.6 Image annotation using the Roboflow tool
3.7 Mask RCNN architecture
3.8 Output 1
3.9 Output 2
3.10 Output 3
3.11 Output 4
3.12 Output images
3.13 Results before adding attention
3.14 Results after adding attention
List of Abbreviations

3D Three-Dimensional
AI Artificial Intelligence
ANNs Artificial Neural Networks
BERT Bidirectional Encoder Representations from Transformers
CBCT Cone Beam Computed Tomography
CCTA Coronary Computed Tomography Angiography
CT Computed Tomography
DNNs Deep Neural Networks
DSCT Dual-Source Computed Tomography
fMRI functional Magnetic Resonance Imaging
GPT Generative Pre-trained Transformer
HIPAA Health Insurance Portability and Accountability Act
LSTM Long Short-Term Memory
MRI Magnetic Resonance Imaging
MRS Magnetic Resonance Spectroscopy
NLP Natural Language Processing
NER Named Entity Recognition
PET Positron Emission Tomography
RNNs Recurrent Neural Networks
SPECT Single-Photon Emission Computed Tomography
X-rays X-radiation

List of Tables

1.1 Performance Metrics Description
1.2 Benchmarking of Related Works

3.1 Mask RCNN Hyperparameters Settings
General Introduction

0.1 Context and Motivations

The field of medical imaging has advanced significantly over the last few decades, especially
concerning automated image annotation systems (Zhou et al., 2021). Experts have historically
been needed to mark or label a wide range of anatomical structures and pathological charac-
teristics for dental radiography, which is crucial for dental diagnostics (Yazdanian et al., 2022).
However, manual annotation is labor-intensive, prone to inaccuracies and human error, and
becomes more challenging when dealing with large datasets (Shafi et al., 2023; Schwendicke &
Krois, 2022). As imaging examinations grow in number and precision, there is a rising need for
automated systems that can accurately and efficiently support clinicians in making decisions.

In the realm of dental radiography in particular, automatic image annotation offers a poten-
tially viable solution to these issues. Through the use of convolutional neural networks (CNNs),
which are powerful tools for deep learning, these systems can be trained to accurately iden-
tify and classify features found on radiographs. The aim of this research is to create a system
that will expedite and standardize the radiological diagnosis procedure in dentistry. Ultimately,
the initiative aims to give dento-medical professionals straightforward and efficient help, freeing
up more of their time to concentrate on patient care and other professional matters.


0.2 Objectives and Contributions

The main goal of this research is to create a dependable and robust system that can automatically
identify and label dental abnormalities and structures in panoramic radiographs. The foundation
of this system is sophisticated deep learning methods, specifically CNNs, which have proven
highly successful in image recognition tasks. The contributions made by this work include:

• The creation of a multi-phase pipeline for instance segmentation that incorporates cutting-
edge deep learning models.

• Enrichment of the dataset through the collection of real data from radiologists.

• The use of transfer learning to leverage the knowledge of pre-trained models, which
reduces the requirement for large amounts of labeled data and enhances the generalization
capabilities of the model.

• A comprehensive assessment of the system's sensitivity, specificity, and accuracy that
provides insight into its potential application in clinical settings.

• An optimization of the pre-trained model through the addition of an attention mechanism.

Overall, the goal of this research is to develop a workflow that can assist dental practitioners in
their diagnostic processes and to add to the expanding body of knowledge in the field of medical
image analysis.

0.3 Structure of the Report

The report is organized as follows:

• General Introduction — presents the context, motivations, and objectives of the research,
along with its scientific contributions, and outlines the contents of this report.

• Chapter 1: State of the Art — reviews the literature on image annotation, particularly for
dental radiography, with a focus on automating this process.

• Chapter 2: Methodology — explains the proposed approach, including the Mask R-CNN
model architecture, its parameters and hyperparameters, and the performance optimization
techniques employed.

• Chapter 3: Experiments and Results — describes the experiments conducted, including
data preparation, model training, and evaluation, and discusses and interprets the results
in comparison with existing methodologies.

• Conclusion — summarizes the findings, considers their implications, and offers suggestions
for further study.

Chapter 1
State of the Art

1.1 Introduction

Dental health is an essential part of overall well-being, yet it often doesn’t receive the attention
it deserves until serious issues arise (Volgenant, Persoon, de Ruijter, & de Soet, 2021). One
of the most common dental problems is caries, or cavities, which are caused by bacteria that
erode tooth enamel and lead to decay (Warreth, 2023). If not addressed, caries can cause pain,
and infections, and eventually result in tooth loss. Another major concern is wisdom teeth, or
third molars, which frequently emerge in improper positions. These teeth can grow sideways
beneath the gums, leading to discomfort and often necessitating surgical removal. Misaligned
wisdom teeth can exert pressure on neighboring teeth, resulting in overcrowding, impaction, or
infections. Furthermore, a serious risk, particularly for elderly individuals, is periodontal disease,
which affects the gums and the underlying bone structure. It can result in tooth loss and other
systemic health issues.

1.2 Dental Problem Diagnosis through Medical Imaging

Accurate diagnosis is the essential first step in treating these disorders. To visualize tooth
structures, medical personnel frequently rely on imaging methods such as cone beam computed
tomography (CBCT), CT scans, and X-rays. While CT scans and CBCT provide three-dimensional
images of teeth and bones, X-rays provide only a two-dimensional view, which can make it
difficult for dentists to fully comprehend the problem. In modern dental diagnostics, imaging is
essential for determining the presence of cavities, assessing the location of wisdom teeth, and
detecting bone loss.


Since Wilhelm Röntgen's discovery of X-rays in 1895, medical imaging has advanced sig-
nificantly. A new era in diagnostic medicine began with this finding. Medical imaging refers to
a group of methods that make it possible to see organs or bodily components without requiring
surgery. Among these methods are:

1.2.1 Structural anatomical imaging

The term "structural anatomical imaging" refers to techniques used to visualize the structure,
orientation, and configuration of the body's organs, tissues, and anatomical structures. These
techniques include magnetic resonance imaging (MRI), computed tomography (CT), ultrasound,
and radiography. They produce finely detailed images that aid medical personnel in the diagnosis
of illnesses, the creation of treatment programs, and the tracking of their progression (Burrowes
et al., 2021).

1.2.2 Functional anatomical imaging

An important area of medical imaging is called functional anatomical imaging, which combines
the assessment of an organ's or bodily region's function with the depiction of its anatomical struc-
ture. This functional method combines techniques like functional MRI (fMRI), positron emission
tomography (PET), and magnetic resonance spectroscopy (MRS) to map metabolic, neurological,
or physiological activity. Conventional anatomical imaging, on the other hand, concentrates only
on form. Functional imaging therefore provides a deeper comprehension of biological processes
and illnesses linked to a range of medical conditions, particularly in the fields of neurology,
cancer, and cardiology (Burrowes et al., 2021).


There are numerous modalities for medical imaging (Guo et al., 2022), each with specific
uses and guiding concepts:

1.2.3 X-ray imaging

X-ray imaging is a potent and affordable technique for medical diagnostics and industrial
nondestructive inspection, frequently used to see inside objects, including the human body. The
method uses X-rays, a type of electromagnetic radiation that can pass through bodily tissues
and provide noninvasive imaging of internal structures, which makes the technique useful for a
variety of applications (Aghayev, Murphy, Keraliya, & Steigner, 2016).

Significant advancements have been made in the creation of high-performance X-ray detec-
tors and associated imaging technologies since the discovery of X-rays in the 1890s (Wu, Zheng,
He, Chen, & Yang, 2023). Applications for X-ray imaging systems range from industrial and
safety inspections to medical diagnostics including diagnosing bone fractures. These systems
normally consist of an X-ray source and a detector.

In a traditional X-ray examination, the patient is positioned between an X-ray source and an
X-ray-sensitive detector placed on the opposite side of the body; the detector records the X-rays
traveling through the tissues. Dense regions, like bone, absorb more X-rays and appear brighter
on the image; soft tissues, on the other hand, permit more X-ray passage and look darker.

Numerous medical disorders, including infections, cancers, lung ailments, bone fractures,
and many more, can be diagnosed and evaluated via X-rays. However, the resolution and soft-
tissue differentiation capabilities of X-rays are limited, which may occasionally require the use
of more sophisticated imaging methods like computed tomography (CT) or magnetic resonance
imaging (MRI).


Figure 1.1: Dental X-ray imaging.

1.2.4 Computed Tomography (CT)

Advanced imaging technology known as computed tomography (CT) has become a vital tool in
contemporary diagnostic medicine. It utilizes X-rays to produce comprehensive cross-sectional
images of the body's internal anatomy, allowing healthcare experts to study tissues, organs,
and bones with remarkable clarity. In contrast to conventional X-rays, which produce two-
dimensional, flat images, CT scans combine a number of X-ray images obtained of the body
from various angles. A computer then processes these images to produce a thorough, three-
dimensional view of the area under investigation. Because CT can produce extremely precise
and detailed images, it is an invaluable tool for identifying a variety of disorders, including
vascular diseases, malignancies, and fractures (Hsieh & Flohr, 2021).

Since its beginnings in the 1970s, CT technology has advanced remarkably, especially in
the last several years. At first, Magnetic Resonance Imaging (MRI) presented competition for
Computed Tomography (CT), with many speculating that MRI might eventually surpass CT in
clinical applications. However, CT has not only remained relevant but has also developed into
one of the most widely used diagnostic tools in hospitals across the globe.

CT technology has continued to improve and succeed due to a number of important elements.
Among the most important advancements is coronary CT angiography (CCTA), a technique that
needs wide coverage, fast data gathering, and high spatial resolution to obtain finely detailed
images of the heart. In addition to facilitating CCTA, technology breakthroughs have improved
CT performance more broadly.


Three main areas of CT improvements can be distinguished: higher temporal resolution,
isotropic volume coverage, and spectral information for material classification. The original goal
was to use multi-slice CT and helical/spiral data gathering to achieve complete organ coverage
in a single breath-hold. Efforts to enhance temporal resolution—which enable dynamic imaging
and the freezing of patient motion—came next. With the advent of spectral CT more recently, it
is now possible to distinguish between different materials inside the body by obtaining dual- or
multi-energy data.

Innovations in five major areas have also contributed to the advancement of CT technology:
spectral CT, wide-cone CT, multi-slice CT, dual-source CT (DSCT), and helical/spiral acquisi-
tion. CT has become a more flexible and potent diagnostic tool as a result of these advancements,
which have also increased image quality and speed and broadened the spectrum of clinical
applications.

Looking ahead, deep learning and artificial intelligence (AI) will likely influence CT tech-
nology. These technologies are increasing the precision of disease detection and quantification,
optimizing CT operations, and automating image interpretation. Furthermore, radiologists’ in-
teractions with patients and other medical professionals are being completely transformed by the
amalgamation of 3D printing, virtual reality, and augmented reality.

The third-generation rotate-rotate geometry is still the foundation of modern CT systems,
despite all these developments. But novel X-ray tube designs and photon-counting CT are ex-
amples of upcoming technologies that could usher in the next generation of CT scanners, which
would offer even more advanced imaging capabilities and keep pushing the envelope in medical
imaging.

1.2.5 Magnetic Resonance Imaging (MRI)

Nuclear magnetic resonance is the basis for the advanced medical imaging method known as
magnetic resonance imaging (MRI). The process begins when a strong magnetic field is applied
to the hydrogen atoms in the body's water molecules. The hydrogen atoms align with the
magnetic field as a result of this exposure. These atoms gradually revert to their initial orientation
when the magnetic field is removed, producing tiny electrical signals along the way. The density
and makeup of the surrounding tissues and organs affect how quickly these atoms realign (Bruno
et al., 2022).

These small electrical signals are picked up by specialized detectors and processed to produce
finely detailed images of the body. These images provide a thorough perspective of the interior
structures of the body and can be viewed in two dimensions (as slices) or three dimensions (by
combining multiple slices).


Figure 1.2: Computed Tomography (CT).

With the addition of a contrast agent, usually based on the element gadolinium, MRI images
can be made even clearer and more detailed. In contrast to contrast agents used in X-rays or CT
scans, this agent is specifically useful for highlighting soft tissues, including the brain, spinal
cord, internal organs, muscles, and tendons. When compared to other medical imaging methods,
MRI frequently yields better image quality for these kinds of tissues.

Owing to the high expense and intricacy of MRI equipment, it is typically reserved for situ-
ations in which images obtained using less expensive imaging techniques, such as CT scans,
ultrasounds, or X-rays, are insufficiently detailed. It is particularly suited to the study of soft
tissues, including the heart, liver, and muscles, as well as the brain and spinal cord.

1.2.6 Ultrasound imaging

Ultrasound imaging, also referred to as sonography, uses high-frequency sound waves to provide
real-time images of the internal organs and structures of the body. It is commonly used to
examine the abdomen, pelvis, blood vessels, and heart, and to monitor fetal development during
pregnancy.


1.2.7 Nuclear medicine imaging

Using minuscule quantities of radioactive materials, or "radiopharmaceuticals," injected into the
body, nuclear medicine imaging creates images that are then recorded by specialized cameras.
Nuclear medicine imaging techniques including single-photon emission computed tomography
(SPECT) and positron emission tomography (PET) are widely used to diagnose and track a
number of illnesses, such as neurological diseases, cancer, and cardiac issues.

1.2.8 Fluoroscopy

Fluoroscopy is a real-time imaging procedure that employs X-rays to reveal moving parts of the
body, such as arteries, joints, and the digestive system. It is commonly used to guide surgical
procedures and interventions such as angiography and orthopedic surgery.

Although X-ray imaging provides a good visual understanding of dental anatomy, correct and
effective interpretation of these images necessitates sophisticated technologies. This is where
computer vision and deep learning are useful. By utilizing artificial intelligence, we can improve
X-ray image interpretation, automate the diagnostic procedure, and guarantee accurate
identification of dental conditions. The combination of deep learning and computer vision makes
it possible to extract important information from images with astonishing accuracy, enabling
innovations in dental care. On the other hand, manually analyzing dental images can be
labor-intensive and subject to human error. Because of the complexity of dental structures and
the variations in patient anatomy, the need for automated deep learning systems that can
precisely identify and diagnose dental disorders is growing.

1.3 Deep Learning in Dental Imaging

The development of artificial intelligence (AI), especially in the domains of deep learning and
computer vision, offers a substantial opportunity to improve the accuracy and efficiency of medi-
cal image processing (Elyan et al., 2022). Deep learning is a branch of artificial intelligence that allows
models to learn from large datasets, exposing characteristics and patterns that might not be im-
mediately visible to the human eye. It is possible to train deep learning algorithms to recognize
dental imaging issues such as cavities, misplaced wisdom teeth, and bone loss.

One notable application of DL in this domain is teeth segmentation, which involves identify-
ing and delineating each tooth in an image. This process is necessary for more comprehensive
inspection, such as locating decayed areas or pinpointing the exact position of a wisdom
tooth. In tooth segmentation, DL can drastically cut down on the time needed for manual label-
ing, allowing for quicker diagnosis and more precise treatment planning.

DL, a subset of machine learning, has completely changed how we tackle challenging compu-
tational issues by allowing computers to learn from massive datasets without the need for explicit
programming in every situation. It draws inspiration from the composition and operation of the
human brain, namely from the idea of neural networks, which are made up of layers of connected
nodes, or neurons. In the field of deep learning, models learn to automatically discover patterns,
make predictions, and classify information through numerous levels of change. Deep learning
models learn directly from the raw data, in contrast to typical machine learning techniques, which
call for hand-crafted features. DL is incredibly effective for a variety of tasks, especially those
involving audio, pictures, and natural language because of its capacity to automatically extract
features and gradually abstract information through hidden layers. Artificial neural networks
(ANNs) are the main architecture that powers deep learning; these networks are referred to as
deep neural networks (DNNs) when they include multiple layers. Using a method known as
backpropagation, these networks are trained on big datasets. The model iteratively modifies its
internal parameters, or weights, in response to prediction errors.
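To make this training procedure concrete, the following minimal PyTorch sketch shows one form the loop can take; the two-layer network, layer sizes, and synthetic data are illustrative assumptions, not the model used in this work.

```python
import torch
import torch.nn as nn

# Illustrative two-layer network; the layer sizes are arbitrary assumptions.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Synthetic stand-in data: 100 samples, 16 features, 2 classes.
inputs = torch.randn(100, 16)
labels = torch.randint(0, 2, (100,))

for epoch in range(10):
    optimizer.zero_grad()            # reset accumulated gradients
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)  # measure prediction error
    loss.backward()                  # backpropagation: compute gradients of the loss
    optimizer.step()                 # adjust the weights in response to the error
```

Each pass through the loop performs exactly the cycle described above: predict, measure the error, backpropagate, and update the weights.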

Deep learning’s appeal arises from its ability to surmount the conventional constraints that
machine learning algorithms encounter, especially in domains like computer vision, natural lan-
guage processing, speech recognition, and medical imaging. Deep learning has many more po-
tential uses now that specialized neural network topologies like Transformers, Recurrent Neural
Networks (RNNs), and Convolutional Neural Networks (CNNs) have been introduced. These
applications go well beyond simple categorization tasks. For example, CNNs are built to handle
spatial data, such as photographs, where pattern recognition depends on the interaction between
pixels. Because they can handle sequential data well, RNNs and their variations, such as Long
Short-Term Memory (LSTM) networks, are well suited to applications like language modeling and
time series forecasting. Meanwhile, because of their capacity to manage long-range dependencies
in textual data, transformers have revolutionized natural language processing (NLP), as demon-
strated by models like BERT and GPT.

Deep learning has shown to be revolutionary in the field of medical imaging, especially for
problems involving image segmentation, classification, and detection. Medical images such as
X-rays, MRIs, and CT scans contain complex patterns that are frequently challenging for
conventional image processing algorithms to decipher. Deep learning models, particularly CNNs,
have proven to be effective tools for evaluating these images with previously unheard-of accuracy.
Deep neural networks can be trained to recognize minute patterns that can be signs of illness or
anomalies by using enormous databases of annotated medical images. This is especially helpful
in the dental sector, which requires precise tasks such as diagnosing cavities, misaligned teeth,
and periodontal disease.


Within the domain of dental image analysis, deep learning models have the potential to signif-
icantly improve diagnostic precision and efficacy. For instance, dentists typically rely on medical
imaging, such as dental X-rays and cone beam computed tomography (CBCT), to see the interior
architecture of teeth and surrounding bone. But manually evaluating these images can be labori-
ous and error-prone, especially when handling complicated situations like advanced periodontal
disease or wisdom tooth impaction, a condition in which teeth grow at irregular angles and
frequently beneath the gums. In these situations, deep learning models, more specifically CNNs,
can help by automatically identifying and classifying teeth and other anatomical features in
medical images. CNNs operate by applying filters to images, allowing the model to learn
hierarchical features, from basic edges and textures to more intricate patterns such as tooth
borders or decayed areas.
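As an illustration of this filtering idea, here is a minimal PyTorch sketch of a small convolutional stack; the channel counts, kernel size, and input resolution are arbitrary assumptions, chosen only to show how stacked, learned filters progressively condense an image into feature maps.

```python
import torch
import torch.nn as nn

# Minimal convolutional stack: each Conv2d layer applies a bank of learned filters,
# and stacking layers lets later filters respond to increasingly complex patterns.
features = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # early filters: edges, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # deeper filters: composite shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
)

xray = torch.randn(1, 1, 128, 128)  # one grayscale image; the size is an assumption
print(features(xray).shape)          # torch.Size([1, 16, 32, 32])
```

The output is a grid of feature maps rather than a prediction; in a full network such as the ones discussed here, further layers turn these maps into class scores, boxes, or masks.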

Furthermore, deep learning can be used for tasks beyond detection, such as segmenting and
annotating teeth in dental images, which is crucial for precise diagnosis. Segmenting the teeth
in an image is a prerequisite to identifying problems such as cavities or misalignments.
CNN-based models that separate the teeth in X-rays automate these labor-intensive procedures,
letting dentists concentrate more on clinical decision-making. These models
can also be modified to identify additional abnormalities, like wisdom tooth location for surgical
planning or bone resorption in the event of periodontal disease. Therefore, incorporating deep
learning in this situation not only speeds up the workflow but also increases diagnostic accuracy,
which lowers the possibility of human error.

The capacity of deep learning to generalize across various data types is one of its main advan-
tages. With minor adjustments, a model developed for X-rays in dental imaging may potentially
translate well to CBCT images, enabling multi-modal diagnostic capabilities. Furthermore, as
additional data becomes available, deep learning models can continue to improve, making them
a dynamic tool that advances along with the body of medical knowledge. In dental imaging,
where it can be difficult to gather a large amount of labeled data, data augmentation techniques
such as flipping or rotating images help overcome the constraints of small datasets. This is
especially important in highly specialized cases that may not have many annotated datasets,
such as atypically oriented wisdom teeth or complex oral surgeries.
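The following sketch shows what such an augmentation pipeline might look like with torchvision; the transform parameters and the file name are hypothetical.

```python
import torchvision.transforms as T
from PIL import Image

# Augmentation pipeline mirroring the transformations mentioned above
# (flips and small rotations); the parameter values are illustrative.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),   # small random rotation
])

image = Image.open("panoramic_xray.png")  # hypothetical file name
augmented = augment(image)                # a new, randomly transformed copy
```

Two caveats apply in a segmentation setting: the ground-truth masks must undergo exactly the same geometric transformations as the image, and a horizontal flip swaps left and right teeth, so flips must be applied with care in tooth-numbering tasks.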

To sum up, deep learning, and more specifically CNNs, offers a reliable way to improve and au-
tomate diagnosis in dental image analysis. Deep learning models can greatly enhance the detection,
segmentation, and classification of dental problems by utilizing big datasets and sophisticated
neural network architectures. This can help doctors make choices more quickly and accurately.
We may anticipate even more AI integration into dental diagnostics as these technologies ad-
vance, resulting in more effective and individualized patient care (Alzubaidi et al., 2021).

Figure 1.3: Deep Learning.

The capacity to analyze dental images using advanced techniques such as deep learning and
computer vision is contingent upon the availability of annotated data. Accurate annotation is
the basis for deep learning models' ability to identify particular dental traits, segment structures,
and even generate diagnostic predictions. The following section discusses the importance of
annotation in developing reliable and accurate dental imaging systems, and how to establish
annotations that guarantee high-quality output from the models.

1.4 Annotation

1.4.1 Definition

An essential step in the creation of deep learning and artificial intelligence models, data annota-
tion provides the framework for teaching these systems to provide precise predictions. Funda-
mentally, data annotation is the process of explicitly identifying the aspects of interest in many
types of data, including text, photos, audio, and video, by labeling or tagging them. The model
uses these labels as guidelines to determine what to concentrate on while processing new, unseen
data. In essence, data annotation gives a model the framework and context it needs to
identify trends, derive significant conclusions, and carry out operations that would normally need
human interpretation.

Consider teaching a young child how to identify items in images. You would repeatedly
name and point out each object until the child started to connect the word with its picture.
Similarly, annotations in machine learning act as those recurring cues that enable the model to
"learn" from the given samples. Inadequate annotations would make it difficult for the model
to discern between pertinent and unimportant data, which would result in subpar performance
and inaccurate predictions (Hanbury, 2008).

1.4.2 Categories of Data Annotation Based on Format

When discussing data annotation, it is crucial to acknowledge that the nature of the data being
annotated has a major impact on the annotation process itself. Annotation methods for different
data formats must be customized to meet the unique requirements and features of each format
(Hanbury, 2008).

• Image Annotation:
The process of annotating visual data, such as pictures or illustrations, to draw attention
to certain objects or features is known as image annotation. This kind of annotation is
essential for tasks like object detection, where a model must recognize and categorize
different elements within an image. Annotated images can be custom-generated,
particularly in specialist domains like autonomous driving, or they can be sourced from
public databases, which are widely available and pre-labeled. The choice between creating
new datasets and using public ones depends on the individual needs of the project, such as
the demand for certain objects or environments that might not be well represented in
existing datasets.

• Text Annotation:
Text annotation involves labeling text data so that models can comprehend and interpret
natural language. This can involve adding labels to particular words, phrases, or sentences
that indicate properties such as sentiment, named entities (such as people or locations), or
part of speech. For example, in document-heavy fields like law or medicine, text annotation
allows the automation of operations that would normally need considerable manual review,
including recognizing essential legal phrases or extracting pertinent patient information
from medical records (Erdmann, Maedche, Schnurr, & Staab, 2000).

• Audio Annotation:
Audio annotation is the technique of labeling sound recordings to make them
comprehensible for machine learning models, especially in the context of natural language
processing (NLP). This could entail annotating particular sounds or intonations,
designating the various speakers in a conversation, or transcribing spoken phrases. These
annotations are critical for developing increasingly complex applications where precise
speech recognition is required, such as voice-activated assistants or chatbots for customer
support.

• Video Annotation :
The concepts of image annotation are extended to a succession of frames in video annota-
tion, necessitating the labeling of objects and actions in a moving sequence. The intricacy
and amount of data required for this kind of annotation make it especially difficult because
every frame in a video has to be precisely marked. For applications ranging from entertain-
ment to monitoring, video annotation is crucial in fields like computer vision for tasks like
action identification, behavior analysis, and scene interpretation (Gaur, Saxena, & Singh,
2018).

1.4.3 Main Types of Image Annotation

Because data comes in many forms, the methods used to annotate it also differ. The fundamentals
of annotation (labeling data to train models) remain the same, but the approaches vary based on
the kind of data and the project's particular goals (Murrugarra-Llerena, Kirsten, & Jung, 2022).

• Bounding Boxes:
Drawing rectangles around objects of interest is a straightforward yet effective method of
image annotation. This technique works especially well for object detection, since it
enables the model to learn the position and size of objects in an image.
• Polygons:
Polygons are more flexible than bounding boxes for things with irregular shapes. Polygons
may precisely represent the outlines of intricate objects, such as natural landscapes, build-
ings, or oddly shaped products, by tracing the borders of an object with a set of connected
points.

• Polylines:
Roads, wires, and boundaries are examples of linear characteristics that can be annotated
inside photographs using polylines. In applications like mapping and navigation, where
comprehending the arrangement and interconnectivity of various elements is crucial, this
kind of annotation is quite helpful.


• Key-Points:
Marking particular points of interest on an object is known as key-point annotation; it is
frequently employed in tracking small distinguishing features, human pose estimation, and
facial recognition. When the emphasis is on specific details rather than the object as a
whole, this strategy is essential.

• 3D Cuboids:
3D cuboids extend annotation into the third dimension, giving details on the height, breadth,
and depth of an object, much like bounding boxes do. This approach is especially help-
ful in fields where comprehending an object’s spatial characteristics is essential, such as
robotics or autonomous cars.

• Semantic Segmentation:
By assigning a class label to every pixel, semantic segmentation partitions an image into
labeled regions rather than just identifying objects. This technique provides a finer-grained
understanding of an image, although it does not distinguish between separate objects of
the same category.

• Instance Segmentation:
In contrast to semantic segmentation, instance segmentation both labels objects and
distinguishes between different instances of the same type of object. This enables the model
to separate and count objects within an image, even when they belong to the same category
(Murrugarra-Llerena et al., 2022). A concrete annotation example follows this list.
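To make these formats concrete, the snippet below sketches a hypothetical COCO-style record combining a bounding box with a polygon for each tooth instance; all ids, class indices, and coordinates are invented for illustration. COCO is one common interchange format for such labels, and the per-instance ids are what make instance segmentation possible: the two teeth below share a category but remain distinct objects.

```python
# Hypothetical COCO-style annotation record for one image.
annotation = {
    "image_id": 1,
    "annotations": [
        {
            "id": 1,
            "category_id": 3,                    # e.g. "molar" (invented class index)
            "bbox": [210.0, 95.0, 64.0, 110.0],  # bounding box: [x, y, width, height]
            "segmentation": [                    # polygon: x1, y1, x2, y2, ... pairs
                [212.0, 98.0, 268.0, 101.0, 270.0, 200.0, 215.0, 198.0]
            ],
        },
        {
            "id": 2,                             # same class, different instance
            "category_id": 3,
            "bbox": [280.0, 97.0, 60.0, 108.0],
            "segmentation": [
                [282.0, 99.0, 336.0, 102.0, 338.0, 201.0, 284.0, 199.0]
            ],
        },
    ],
}
```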

The goal of data annotation goes beyond simply labeling data: it creates a structured foundation
that allows deep learning models to learn and produce highly accurate predictions. The type of
data and the particular requirements of the model determine which annotation strategy is best.
By carefully evaluating the format and type of data, along with the desired results, you can make
sure that your annotations are accurate, consistent, and ultimately successful in enhancing the
performance of your AI systems. High-quality annotations result in better-trained models, and
better-trained models produce more accurate predictions and greater overall success when
applying AI technologies.

1.4.4 Annotation modalities

There are various methods available for annotating data, each with pros and cons of their own.
The three categories of annotation—manual, automatic, and semi-automatic—are explained be-
low.


a) Manual Annotation

Humans manually annotate data by looking over it and making annotations following pre-established
rules. This could include identifying particular passages in a text, creating outlines around items
in a picture, or classifying data based on predetermined standards. Although precise, manual
annotation can be labor-intensive and slow at times.

Consider the online manual annotation application LabelMe, which allows users to manually
divide photographs into regions and then annotate these regions by choosing keywords. The
suggested selection of keywords differs from image to image. It is built based on the keywords
that other users have already chosen for that image. This reduces the number of errors brought
on by word ambiguity. In a similar vein, you may notice that certain photographs already have
designated zones. This indicates that this image has already been divided by another user, and
you can view the zones they have designated.

Additionally, clicking on a region displays the keywords that have already been indexed for
it. Consider the illustration in Figure 1.4, where the LabelMe tool is being used to annotate an
image. Areas that have previously been annotated by other users are outlined in different colors
or highlighted. The keywords that have previously been used to annotate this image are listed
to its right. The user can then select specific regions or objects in the image by outlining them
and naming them with either new or existing keywords (Lu & Young, 2020).


Figure 1.4: Manual annotation using LabelMe

The accurate classification and labeling of medical data, including texts, photos, and record-
ings, is a critical and challenging activity in the field of medicine that requires manual annotation.
The following paragraphs describe the various facets and difficulties associated with manual an-
notation in this field:

Limits The medical field presents special obstacles for manual annotation. First of all, annotation
can be challenging and subjective due to the complexity and unpredictability of medical data.
Data protection and adherence to privacy laws, such as the GDPR in Europe and HIPAA in the
US, add another layer of complication. High expenses and delays may also result from the
shortage of skilled medical specialists available to perform annotation.

Solutions To tackle these issues, new technologies such as machine learning and artificial in-
telligence are increasingly being employed to automate or assist manual annotation. Annotation is
made easier by specialized software solutions that include quality control, online collaboration
features, and user-friendly interfaces. Nevertheless, human experience is still necessary to verify,
edit, and ensure the quality and accuracy of automated annotations, even with recent advance-
ments.

b) Semi-automatic annotation

Annotation that is partially automated blends aspects of both automatic and manual processes.
Usually, an automatic annotation is made first, and then a human reviewer makes any necessary
corrections or adds more annotations. In natural language processing, for instance, a trained
model can be used to automatically identify entities in Named Entity Recognition (NER). A
human annotator can then review and edit the annotations as needed. The accuracy of manual
annotation combined with the efficiency of automatic annotation is known as semi-automatic
annotation.

Each style of annotation has its own uses and advantages, and the choice typically depends
on project needs such as resource availability, required accuracy, and dataset size. Different
annotation modalities can often be combined to increase the effectiveness and quality of
annotations.

c) Automatic Annotation

Automatic medical annotation is the practice of automatically labeling medical images, such as
X-rays, MRIs, or CT scans, with pertinent clinical data. This entails analyzing the visual content
of medical images using machine learning algorithms, especially from the field of computer
vision, and assigning predefined labels, such as recognizing particular organs, tissues,
abnormalities, or pathological changes.

Reducing the need for manual annotation by radiologists or other medical professionals is
the goal of medical automatic annotation, which aims to improve the efficiency and accuracy of
medical image interpretation. This procedure is essential for large-scale medical image databases,
where manual annotation would be laborious and prone to error. Healthcare providers can
enhance clinical decision-making, streamline diagnostic workflows, and promote research by
automating this task and providing structured, consistent annotations across large datasets (Yao,
Zhang, Antani, Long, & Thoma, 2008).

• Automatic Annotation and Natural Language Processing
Named Entity Recognition (NER) is a vital technique for automatically annotating text in
Natural Language Processing. Consider the task of organizing thousands of documents
based on dates, locations, or people. This would have needed a significant amount of
manual labor in the past. NER, on the other hand, automates this difficult operation by
recognizing and classifying these entities in the text. To begin, the text is divided into
smaller units called tokens. These tokens are then examined for characteristics that aid in
the identification of the entities. The result is an intelligent system that, without the need
for human intervention, can process enormous volumes of text and precisely identify and
classify key pieces of information. This not only saves time but also enhances the accuracy
of data processing in various applications like customer service, medical records, and even
legal documents (Jehangir, Radhakrishnan, & Agarwal, 2023).
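As a minimal illustration of this pipeline, the following sketch uses the open-source spaCy library and its small English model to tokenize a sentence and extract entities; the example text is invented, and the exact labels returned depend on the model used.

```python
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr. Smith examined the patient in Tunis on 12 March 2024.")

for ent in doc.ents:
    # Prints each recognized entity with its category,
    # e.g. "Tunis" as a location and "12 March 2024" as a date.
    print(ent.text, ent.label_)
```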

• Automatic Annotation and Computer Vision
One of the best uses of automatic annotation in computer vision is object detection. The
primary objective is to autonomously recognize and find particular items in pictures or
videos without requiring human input. Deep learning or advanced machine learning algo-
rithms that learn from pre-labeled data are used to do this.
Object detection involves several essential procedures that are closely related to automatic
annotation techniques. First, the images are pre-processed to make sure they are in the ideal
format for analysis. Creating a labeled dataset then requires some initial manual annotation;
this is the first step towards training object detection algorithms. These labels act as a guide,
helping the system learn to detect comparable items in new, unseen images. The resulting
models, which are frequently driven by convolutional neural networks (CNNs), use the
manually annotated data they were trained on to identify and locate objects in fresh images.
After being trained, these models can be used to automatically identify objects in fresh
images, allowing for accurate and effective annotation (see the sketch after this list).
Nevertheless, several obstacles persist, including the requirement for high-quality training
data, computational capacity, and ongoing verification to guarantee the precision of the
annotations, especially in critical applications.
Consider the time and effort saved when thousands of images or video frames can be
automatically labeled by a machine with little human assistance. However, automation alone
is not enough: the machine's accuracy depends heavily on the strength of the employed
algorithms and the quality of the original labels. Object detection models, which are
frequently intricate and resource-intensive, need to be adjusted and tested regularly to
guarantee that the results they provide are trustworthy. Despite these difficulties, the use
of automatic annotation in object detection has transformed domains where it is critical to
comprehend visual input rapidly and precisely, such as autonomous driving, security, and
medical imaging (Agarwal, Terrail, & Jurie, 2018).
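As a hedged sketch of this idea, the snippet below runs a Mask R-CNN pre-trained on COCO, as packaged by torchvision, over an image tensor and keeps only confident detections as candidate annotations. The confidence threshold and input are illustrative, and a model pre-trained on everyday objects would of course need fine-tuning before producing dental labels.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pre-trained Mask R-CNN (COCO weights) used as an automatic annotator.
# On older torchvision versions the argument is pretrained=True instead.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 512, 512)      # stand-in for a real image tensor in [0, 1]
with torch.no_grad():
    pred = model([image])[0]         # dict with boxes, labels, scores, masks

keep = pred["scores"] > 0.8          # keep only confident detections
auto_labels = {
    "boxes": pred["boxes"][keep],    # [x1, y1, x2, y2] per detection
    "labels": pred["labels"][keep],  # COCO category indices
    "masks": pred["masks"][keep],    # per-instance soft masks
}
```

In practice, only the labels a human reviewer confirms would be stored, which corresponds to the semi-automatic workflow described earlier.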

Advantages and Limits The utilization of automatic annotation presents numerous benefits
over manual annotation, such as substantial time and financial savings and increased scalability
when managing substantial data quantities. It does, however, come with certain drawbacks, such
as the requirement for superior training data and the complexity of deep learning models, which
demand substantial computer power. Furthermore, human validation may be necessary for even
the best models to ensure optimal accuracy.

1.4.5 Importance and Objectives of Image Annotation

Especially when it comes to medical image analysis, image annotation is an essential step in the
creation of deep learning models. Providing a thorough and precise labeling of visual data is the
main objective of image annotation since it forms the basis for training models to identify, cate-
gorize, and segment different structures and anomalies in images. Precise annotation in medical
imaging is essential to the development of trustworthy diagnostic tools that help medical practi-
tioners diagnose and treat patients. The effectiveness of these models depends primarily on how
well curated and complete the annotated training data is.

In medical imaging, image annotation serves a variety of purposes. First and foremost, the
goal is to produce a uniform dataset that can be utilized for model training in a variety of settings
and applications. Second, by giving models the data they need to draw sound conclusions, it aims
to increase the precision and effectiveness of diagnostic procedures. Lastly, by automating
repetitive processes, it seeks to lighten the strain on medical personnel and free them up to
concentrate on more intricate and important facets of patient care (Rebinth & Kumar, 2019).

1.4.6 Challenges of Image Annotation

Despite its significance, image annotation presents a number of difficulties that need to be resolved to guarantee the creation of trustworthy and accurate models. One of the most critical issues is the sensor gap, which refers to the loss or distortion of information during the image capture process. It may result from the limits of the image sensors, whose resolution, sensitivity, and noise levels can make the captured data inaccurate. Another difficulty is the ”digital gap,” which appears when the selected descriptor is unable to adequately represent the pertinent visual elements required for the task. Inappropriate descriptor selection or sub-optimal parameter settings can cause this gap and distort the representation of the image's content.

Perhaps the most difficult aspect of image annotation is the semantic gap. It denotes the discrepancy between the information that a human interprets from an image and what a machine learning model can extract from it. This disparity emphasizes how challenging it is to convert abstract, high-level ideas into the low-level image characteristics used for annotation. For automated annotation systems to perform better, especially in complicated domains like medical imaging, the semantic gap must be closed. Although promising, current approaches in the lit-
erature still face difficulties with this problem, especially when working with large and diverse
datasets.

1.5 Dental X-Ray Imaging and Teeth Anatomy

1.5.1 Overview of Dental X-Ray Imaging

X-ray imaging is one of the most frequently used modalities in dental diagnostics because it can reveal both soft and hard tissues. X-rays penetrate the body and produce a shadow image of the internal structures, including the bones and teeth. In dentistry, X-rays help practitioners identify a variety of problems, such as infections and cavities as well as improper tooth alignment and bone loss.

a) Dental X-Ray Types


The figure 1.5, from (Kumar, Bhadauria, & Singh, 2021), illustrates various dental imaging
modalities, categorized into intra-oral and extra-oral types. Intra-oral imaging includes
periapical, bitewing, and occlusal X-rays, while extra-oral imaging consists of panoramic
X-rays, cephalometric X-rays, CBCT/CT scans, and sialograms.

Figure 1.5: Dental imaging modalities.


In dental care, there are various X-ray imaging techniques, each with a distinct function:

• Bitewing X-rays:
Frequently used to evaluate bone density and find cavities between teeth. These X-
rays are essential for identifying gum disease and tooth decay in their early stages
because they provide a view of both the upper and lower dental crowns.
• Panoramic X-Rays:
Take a single picture of the complete mouth, including the upper and lower jaws. This
is crucial for detecting issues like tumors, jaw abnormalities, and impacted teeth, par-
ticularly wisdom teeth. When it comes to orthodontics and oral surgery in particular,
panoramic X-rays are an indispensable tool for treatment planning since they offer a
thorough image of the tooth arch.
• Periapical X-rays:
Concentrate on individual teeth, imaging them from crown to root. They are essential for diagnosing conditions such as cysts and abscesses that affect the root and the surrounding bone structures. This kind of imaging is valuable for endodontic treatments such as root canals.
• Occlusal X-rays:
These radiographs of the oral cavity are used to identify teeth that are impacted or that have not yet erupted. Occlusal X-rays are also useful for detecting anomalies in the palate or the floor of the mouth.

b) X-Ray Imaging for Specific Dental Issues


A crucial use of X-ray imaging is to detect intricate dental diseases like:

• Caries:
While early-stage tooth decay is easy to miss during visual examinations, it is readily identified on X-ray images.
• Wisdom Teeth:
Panoramic X-rays provide a clear image of wisdom teeth that are misaligned or impacted. Teeth impacted under the gum are especially problematic, since they can cause discomfort, infection, and misalignment of surrounding teeth.
• Bone Loss:
By identifying the degree of bone loss brought on by periodontal disease, X-rays facilitate early management.

Because of its significance in dental care, X-ray imaging is a fundamental diagnostic and
treatment planning tool. However, because patient placement, imaging device quality, and
the intricacy of the oral anatomy can all alter X-ray quality, interpreting these pictures
needs skill.


1.5.2 Anatomical Structure of Teeth and Naming Convention

Dental X-ray analysis requires careful study of the architecture of the teeth. For an effective
diagnosis and treatment plan, it is essential to recognize that every tooth is different in terms of
its shape, function, and location inside the mouth.

There are four primary types of teeth:

1. Incisors: These teeth are at the front of the mouth and are mostly utilized for chopping
food. The incisors, which are made up of four upper and four lower teeth, are typically the
first teeth to come into contact with food.

2. Canines: Often referred to as cuspids, these teeth are employed in food tearing. Nestled
next to the incisors, canines are some of the strongest teeth and frequently have a pointed,
sharp appearance.

3. Premolars: Used for both ripping and crushing food, premolars are transitional teeth.
Premolars are flat-surfaced teeth that fall between canines and molars.

4. Molars: Used for crushing food, molars are found near the back of the mouth. These teeth
are the biggest and frequently have several roots. Wisdom teeth are a kind of molar that
can get impacted or misplaced, which can lead to serious problems.

Understanding these anatomical characteristics is necessary to read dental X-rays. Specific teeth are identified and documented in clinical records using various dental numbering systems, such as the FDI World Dental Federation notation and the Universal Numbering System (Santosh & Jones, 2024).
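To make the numbering conventions concrete, the short sketch below converts a permanent-tooth FDI code (quadrant digit followed by position digit) into the corresponding Universal Numbering System number. It is an illustrative helper written for this discussion, and it covers permanent teeth only.

    def fdi_to_universal(fdi: str) -> int:
        # FDI: first digit = quadrant (1-4 for permanent teeth), second = position
        # (1-8, counted outward from the midline). Universal numbers run 1-32,
        # starting at the upper-right third molar and ending at the lower-right one.
        quadrant, position = int(fdi[0]), int(fdi[1])
        if quadrant == 1:                 # upper right: FDI 18..11 -> Universal 1..8
            return 9 - position
        if quadrant == 2:                 # upper left:  FDI 21..28 -> Universal 9..16
            return 8 + position
        if quadrant == 3:                 # lower left:  FDI 38..31 -> Universal 17..24
            return 25 - position
        if quadrant == 4:                 # lower right: FDI 41..48 -> Universal 25..32
            return 24 + position
        raise ValueError(f"not a permanent-tooth FDI code: {fdi}")

    assert fdi_to_universal("18") == 1 and fdi_to_universal("48") == 32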

Automating the detection and numbering of teeth in X-rays is a difficult but essential task, because dental anatomy is complicated and varies considerably with age, dental work, and pathology. Annotation precision is essential, since misidentifying a tooth or failing to recognize a dental problem may result in incorrect diagnosis or treatment.

A lot of the difficulties involved in manual dental X-ray interpretation can be avoided by
using deep learning models for automated detection and annotation, like Mask R-CNN. Even in
complicated scenarios with overlapping teeth or anomalies, these models can identify particular
anatomical features and give precise segmentation and labeling.


1.6 Implications for Dental Annotation

Precise annotation of dental images is essential for research, diagnosis, treatment, and teaching in the field of dentistry. In contemporary dental care, medical imaging modalities like cone beam computed tomography (CBCT), computed tomography (CT) scans, and X-rays are essential instruments. But their full potential can only be achieved when these images are painstakingly annotated, allowing accurate study of dental architecture and conditions. The significance of this procedure cannot be overestimated, because it affects not only the treatment of individual patients but also broader areas like dental research and teaching. The importance of dental annotation for diagnostic accuracy, treatment planning, and research, along with human and automated annotation methods, is discussed in the sections that follow.

1.6.1 Diagnostic Accuracy

Accurate annotation is required to improve diagnostic accuracy in dental imaging. By designating essential elements like teeth, bones, and soft tissues, annotations give dental experts a reference framework that helps them spot anomalies more efficiently. Accurately annotated X-ray or CT scan images allow the identification of specific pathologies that could otherwise go unnoticed in the context of diseases such as oral malignancies, periodontal disease, and dental caries. For example, identifying impacted teeth or diagnosing alveolar bone loss requires accurate identification of bone components in dental CBCT images. In dental imaging, a misdiagnosis might result in delayed or needless treatments, which is why accurate annotation is essential. Several studies report that annotated images improve clinical decision-making by increasing the visibility and comprehension of anatomical structures.

1.6.2 Treatment Planning

Annotations on images are important not just for diagnosis but also for treatment planning in
dentistry. A thorough grasp of the dental and surrounding anatomical structures is required for
sophisticated procedures such as root canal therapy, orthodontic surgery, and dental implants.
Image annotations help dentists and surgeons during the planning stage by emphasizing impor-
tant locations that need to be taken into account while implanting dental implants, such as sinus
cavities, bone density, and nerves. Furthermore, precise annotation of images makes it possible to map tooth positions correctly throughout orthodontic treatments, which makes it easier to develop efficient realignment treatment plans. During surgical treatments, there is a greater risk of injuring neighboring structures if precise annotation is not performed. Research on computer-assisted surgery has shown that pre-surgical planning with annotated images improves patient outcomes by lowering complications and increasing the precision of surgical interventions.

1.6.3 Challenges in Dental Annotation

Dental image annotation is a crucial task, but it also poses a number of practical and accuracy-related obstacles. The main challenge is the variability and often poor quality of dental images, which depend on the imaging modality, the patient's position, and the operator's proficiency. For example, noise, distortions, or low contrast frequently affect X-ray images, making it challenging to distinguish dental structures accurately. The complex and highly varied structure of teeth, with variations in size, shape, and location, adds even more complexity to the annotation process. This diversity makes it difficult to create uniform annotation techniques, especially when teeth are impacted, missing, or overlapping.

The manual annotation process itself poses a serious obstacle as well, because it is laborious and demands high skill. Dental image annotation requires specialist knowledge and close attention to detail, particularly in complex cases. In clinical settings this can translate into higher expenses and resource allocation, since dental practitioners must spend time on annotation that they could devote to other important duties. These difficulties make automated annotation systems increasingly necessary in dental imaging.

1.7 Literature Review

1.7.1 Research Methodology

We made significant use of scientific databases such as IEEE Xplore, ScienceDirect, and Google Scholar to investigate current approaches and literature that closely matched the goals of our project. These platforms are a valuable resource, giving access to a massive archive of research articles in domains like medical imaging, computer vision, and artificial intelligence. By reading publications from these trustworthy sources, we obtained a thorough understanding of the most recent developments and recommended practices for automatic annotation in medical imaging. The methodology employed in this project was greatly influenced by research on models such as Mask R-CNN and U-Net, which provided insightful information and performance standards against which our results could be evaluated. These resources made it possible to thoroughly examine relevant works, supporting the methodological decisions and guaranteeing that our strategy adheres to the most reputable and advanced practices in the field.


Keywords

We searched for pertinent literature and scientific publications using a range of keywords across databases including IEEE Xplore, ScienceDirect, Springer, and Google Scholar. The purpose was to investigate approaches that support the goals of the project, specifically in the areas of medical imaging and artificial intelligence. Phrases such as Mask R-CNN, X-ray imaging, Deep Learning, and Feature Processing led us to sophisticated neural network techniques, especially CNN-based methods for instance segmentation and classification. Other significant keywords, such as tooth detection and numbering, dental diagnostics, and automatic annotation, allowed us to narrow the focus to dental applications, where recognizing and labeling teeth in panoramic images and radiographs is crucial. We also covered a wide spectrum of image processing techniques by including terms like Feature Extraction, Numbering, and Neural Networks, making sure to include both traditional and cutting-edge techniques (figure 1.6).

Figure 1.6: Keywords Used

Performance Metrics

In the following section, we present the different performance metrics used to assess the performance of the related works.


Table 1.1: Performance Metrics Description

• Accuracy = (TP + TN) / (TP + TN + FP + FN): the proportion of correct predictions among all predictions, i.e. the closeness of a set of observed measurements to their true values.

• Recall = TP / (TP + FN): measures the proportion of correctly identified positive elements among all actual positive elements.

• F1 Score = (2 × Precision × Recall) / (Precision + Recall): the harmonic mean of precision and recall, used to balance the two when classes are unbalanced.

• Precision = TP / (TP + FP): measures the proportion of positive elements correctly identified among all elements identified as positive by the model.

• Specificity = TN / (TN + FP): measures the proportion of negative instances correctly identified among all actual negative instances.

• IoU = Area of Intersection / Area of Union: calculates the spatial overlap between the area predicted by the algorithm and the ground-truth area, by dividing the area of the intersection of the two regions by the area of their union.

• Average Precision: AP = ∫₀¹ p(r) dr; quantifies the accuracy of model predictions, taking the precision-recall trade-off into account.

• Mean Average Precision: mAP = (1/N) Σᵢ₌₁ᴺ APᵢ; calculated by averaging the average precision (AP) scores obtained for each class.

• Receiver Operating Characteristic (ROC): a curve illustrating the performance of a classification model at different discrimination thresholds (no single formula).
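As a worked illustration of the metrics above, the sketch below computes them from raw TP/TN/FP/FN counts, and computes IoU for axis-aligned boxes given as (x1, y1, x2, y2); the function names are our own.

    def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)
    def precision(tp, fp): return tp / (tp + fp)
    def recall(tp, fn): return tp / (tp + fn)
    def specificity(tn, fp): return tn / (tn + fp)

    def f1(tp, fp, fn):
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r)          # harmonic mean of precision and recall

    def iou(box_a, box_b):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the overlap
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0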

1.8 Related Work

Farook et al., in a study published in Cureus (2023), used the Clinical Annotation and Segmentation Tool (CAST) to improve the automatic annotation and segmentation of dental images. The study used YOLOv8 for object detection, together with Meta's Segment Anything Model for semantic segmentation and X-AnyLabeling for manual corrections. With this method, dental features were detected and segmented with a mean Average Precision (mAP) of 77.4%, demonstrating the tool's potential to enhance dental diagnostics. However, issues such as class imbalance and image distinction were identified, and reinforcement learning was suggested as an avenue for improvement (Farook, Saad, Ahmed, & Dudley, 2023).

Pal et al. developed the Attention UW-Net, a fully connected deep learning model. Their research aims to increase the precision of automatic segmentation and annotation of chest X-rays. The model combines attention mechanisms with U-Net to capture local and global characteristics efficiently, improving the segmentation and annotation of chest radiographs. The model outperformed traditional models such as U-Net in lung segmentation, achieving a mean F1 score of 95.7% on the NIH Chest X-ray Dataset, which consists of 112,120 images across 14 disease classes. With a Dice Similarity Coefficient (DSC) of 0.915 and an average AUROC of 0.866 for disease annotation, the Attention UW-Net likewise performed better than the other models; nevertheless, further testing on larger datasets and investigation into the model's interpretability is advised (Pal, Reddy, & Roy, 2022).

Hosntalab et al. developed a comprehensive multi-stage system for the automatic numbering
and classification of teeth in multi-slice CT (MSCT) images. This is an important task for foren-
sic medicine and quantitative dentistry. The three primary phases of the suggested algorithm are
segmentation, feature extraction, and classification. To precisely isolate individual teeth during
the segmentation step, a variety of methods including Otsu thresholding, morphological proce-
dures, panoramic re-sampling, and variational level set were used. The feature extraction phase
focused on the tooth slice with the largest amount of tooth tissue, employing a multi-resolution technique to compute the feature vector for each tooth using a wavelet-Fourier descriptor (WFD) in conjunction with a centroid distance signature for additional refinement. During the classification phase, distinct teeth were identified and categorized using a feed-forward neural net-
work classifier. The approach was successful in accurately categorizing teeth after undergo-
ing extensive testing on 30 MSCT datasets totaling 804 teeth. Specifically, the WFD approach
demonstrated significant invariance properties and the capacity to precisely number teeth even
in the presence of missing ones, outperforming other descriptors such as Fourier and wavelet
descriptors. The study represents a substantial breakthrough in dental imaging by offering an in-
tegrated and anatomically independent method for classifying teeth in MSCT images (Hosntalab,
Aghaeizadeh Zoroofi, Abbaspour Tehrani-Fard, & Shirani, 2010).

Bilgir et al. introduced an innovative approach of using artificial intelligence (AI) to count
and identify teeth automatically in panoramic radiographs, a vital dental imaging tool that records
mandibular and maxillary teeth as well as their supporting structures. The Faster R-CNN Incep-
tion v2 model, a deep convolutional neural network, was used in the study to create the AI system
CranioCatch. A dataset of 2,482 anonymized panoramic radiographs from the archives of Es-
kisehir Osmangazi University was used to train and evaluate this AI system. A subset of 249 radiographs was used for rigorous testing of the algorithm, and its performance was compared to human observation. According to the data, the AI achieved a sensitivity of 95.59%, a precision of 96.52%, and an F-measure of 96.06% across all quadrants, indicating a high degree of accuracy. These results demonstrate how AI systems can effectively and precisely identify and count
teeth in panoramic radiographs, implying that in the future, this technology may supplant con-
ventional human assessment and greatly facilitate clinical decision-making (Bilgir et al., 2021).

Estai et al. developed a sophisticated deep learning method to automatically identify and label permanent teeth in orthopantomogram (OPG) images, a critical task in both forensic and routine dentistry. To complete this task with high accuracy, the study used a three-
step procedure with Convolutional Neural Networks (CNNs). The first step in the segmentation
process involved using a U-Net CNN to find regions of interest (ROIs) in panoramic pictures
that contained teeth. Next, individual teeth inside these ROIs were successfully identified using
a Faster R-CNN model, which is renowned for its effectiveness in object detection. The results
showed that the performance was strong: the tooth detection module obtained an F1 score of 0.98 and a recall and precision of 0.99, while the ROI detection module achieved an Intersection
over Union (IoU) of 0.70. These findings highlight deep learning’s potential for automating
dental charting, which could have major advantages in both forensic and clinical settings (Estai
et al., 2022).

Li et al. introduced an inventive method for using convolutional neural networks (CNNs) for the automatic annotation of medical radiological images. The creation of an image gradient
information model, multi-resolution feature extraction, and the incorporation of CNNs to im-
prove annotation accuracy and efficiency are among the study’s principal accomplishments. The
suggested algorithm showed notable gains in radiological image segmentation and key point an-
notation by utilizing a diseased region detection based on texture regularity and a segmentation
pattern matching method. The system was tested on a heterogeneous dataset that combined the
INbreast dataset with COPD machine learning data. It obtained an impressive multi-resolution
feature extraction accuracy of 98.7%, demonstrating its potential to improve diagnostic precision
and develop medical technology in the future (Li, Wang, & Cai, 2021).

Gao et al. explored the automatic annotation of Cone Beam Computed Tomography (CBCT) images using the Grouped Bottleneck Transformer, a hybrid model that combines the advantages of Transformers and Convolutional Neural Networks (CNNs). This technique is especially useful for medical image analysis, since it makes use of the Transformer's capacity to capture global dependencies and the CNN's skill at extracting local characteristics. The MedMNIST3D dataset was used to test the model after it had been trained and assessed on a dataset of CBCT images
from 26 patients. The model’s resilience was proven by the findings, which showed an astound-
ing accuracy of 91.3 % and an AUC score of 99.7%. These metrics demonstrate how the Grouped
Bottleneck Transformer can advance medical imaging research and enhance diagnostic accuracy (Gao, Li, Li, Li, & Deng, 2022).

Xu et al. presented a sophisticated method that uses a U-Net-based deep learning model to automatically segment dental features in panoramic radiographs. The exact demarcation of teeth and the
anatomical structures around them was the main focus of the study because it is essential to ap-
propriate diagnosis and treatment planning in dentistry. Using a large dataset of 6,046 panoramic
radiographs, the U-Net model was trained to segment complex dental images with great accu-
racy. With a mean Intersection over Union (mIoU) of 92% and precision and recall metrics of
97%, the model showed remarkable performance. These outcomes demonstrate the U-Net archi-
tecture’s efficacy in dental imaging and demonstrate its potential to greatly improve automated
dental diagnostics’ accuracy and aid in the creation of more productive clinical workflows (Xu
et al., 2023).

Table 1.2: Benchmarking of Related Works.

• (Bilgir et al., 2021). Dataset: 2,482 anonymized panoramic radiographs. Model: Faster R-CNN Inception v2 (CranioCatch). Performance: sensitivity of 95.59%, precision of 96.52%, and an F-measure of 96.06%.

• (Tekin, Ozcan, Pekince, & Yasa, 2022). Dataset: 1,200 images. Model: Mask R-CNN. Performance: 94.35% precision and 91.51% mAP.

• (Estai et al., 2022). Dataset: 17,135 teeth. Model: U-Net CNN. Performance: detection: recall and precision of 0.99; numbering module: recall, precision, and F1 score of 0.98.

• (Kılıc et al., 2021). Dataset: 421 images, 7,999 labels. Model: Faster R-CNN Inception v2. Performance: sensitivity = 0.9804, precision = 0.9571, and F1 score = 0.9686.

• (Xu et al., 2023). Dataset: 6,046 panoramic radiographs. Model: U-Net. Performance: precision and recall > 97%, mIoU = 92%.

• (Kim, Kim, Jeong, Yoon, & Youm, 2020). Dataset: 6,046 panoramic radiographs. Model: SSD and R-CNN for tooth detection + Faster R-CNN and Inception V3 for classification. Performance: tooth detection mAP of 96.7 at an IoU of 0.5 and 75.4 at an IoU of 0.7.

• (Gao et al., 2022). Dataset: CBCT images from 26 patients, tested on MedMNIST3D. Model: Grouped Bottleneck Transformer (Transformer + CNN). Performance: accuracy of 91.3% and an AUC score of 99.7%.

• (Li et al., 2021). Dataset: 14,037 images. Model: YOLOv7. Performance: mAP = 0.986.

• (Kong, Yoo, Lee, Eom, & Kim, 2023). Dataset: BCPO (train) + INbreast (test). Model: CNN. Performance: not mentioned.

• (Li et al., 2021). Dataset: COPD dataset (train), INbreast dataset (test). Model: CNN. Performance: multi-resolution feature extraction accuracy as high as 98%.

• (Cui, Li, & Wang, 2019). Dataset: private. Model: extension of the Mask R-CNN pipeline to 3D: a deeply supervised network for edge map extraction + a 3D RPN followed by four branches for segmentation, classification, 3D bounding box regression, and identification. Performance: detection accuracy of 98.20%, identification accuracy of 93.24%.

• (Tuzoff et al., 2019). Dataset: 6,046 panoramic radiographs. Model: Faster R-CNN for detection + VGG-16 for numbering. Performance: detection: sensitivity = 0.9941, precision = 0.9945; numbering: sensitivity = 0.9800, specificity = 0.9994.

• (Farook et al., 2023). Dataset: 569 radiographs + intraoral photographs. Model: YOLOv8 + SAM. Performance: mAP = 77.4%, precision = 75.0%, recall = 72.1%.

• (Pal et al., 2022). Dataset: NIH Chest X-ray Dataset (112,120 images). Model: U-Net + attention mechanism. Performance: average AUROC = 0.866, mean Dice = 0.915.

1.9 Synthesis

Even though they are quite sophisticated, the state-of-the-art models for dental image annotation
and tooth recognition now in use have a number of drawbacks that impair their usefulness in
clinical settings. The prevalence of mismatches, which include incorrect categorization, missed
detections, and incorrect predictions, is a significant problem. These mismatches are mostly
caused by elements such as missing, overlapping, or severely broken teeth. The intricacy of
tooth structures is frequently the cause of these mismatches, particularly when there are several
anomalies or damaged regions. Furthermore, dataset dependence is still a major problem; although some models show promise for generalization, the quantity and diversity of the training datasets frequently constrain how well they function. When used on more complicated, real-world data with a variety of cases and imaging settings, models developed on more homogeneous datasets perform poorly.

The application of bounding boxes in congested areas presents another difficulty since it
may result in imprecise segmentation. When this happens, models might have trouble telling
apart teeth that are overlapping or closely spaced, which would lead to inaccurate detection
or segmentation. Similar to this, while very useful for segmenting tooth morphology, models
such as SAM (Segment Anything Model) struggle to analyze radiographs that show overlapping
obturated root canals or intraoral pictures that show shadows and blocked areas, like the buccal
folds. These shaded regions are frequently mislabeled and incorrectly classified because they are
mistaken for anatomical features.

Moreover, class imbalance between radiographic and oral images presents further difficulties, since it distorts the model's learning process, making it more adept at handling some image types while struggling with others. Specifically, intraoral images present challenges in precisely assessing the depth of carious lesions, which can result in misinterpretations, such as mistaking cavitated lesions for corroded amalgam restorations. To address these problems, manual adjustments are frequently necessary. Additionally, researchers propose employing reinforcement learning to continuously enhance the model's decision-making abilities.

In light of these drawbacks, future research should concentrate on improving the robustness of models, especially when dealing with severe anomalies, severe damage, or poor image quality. For models to be applicable in real clinical settings, they must be validated on a wider range of demographics and imaging systems. Integration with current clinical procedures would also help close the gap between research and practice by giving clinicians immediate support and increasing the effectiveness of dental diagnostics as a whole. To address these issues going forward, our proposed method makes use of Mask R-CNN with attention mechanisms to enhance segmentation accuracy and robustness in difficult scenarios.

1.10 Conclusion

This chapter has covered the theoretical and technological underpinnings necessary to comprehend developments in the automatic annotation of dental images. We discussed the several medical imaging modalities employed in dental diagnostics, emphasizing the value of methods like CT scans and X-rays for the precise investigation of oral disorders. We then examined current developments in the integration of computer vision and deep learning with dental imaging; in particular, convolutional neural networks (CNNs) have proven to be effective instruments for the analysis and categorization of medical images. However, these methods still have limits, especially when it comes to precise, automated annotation, even with their promising performance. The difficulties in annotating dental images were finally covered, with special attention to the intricate anatomical structure of teeth and surrounding tissues. Additionally, we emphasized the expanding significance of automatic annotation techniques, which can enhance training and research applications while also improving diagnostic accuracy and reducing treatment time. Thus, the state of the art offers a strong basis for comprehending current practices and lays the groundwork for the introduction of the novel techniques that will be covered in detail in the upcoming chapters, which seek to increase the precision of automatic annotation systems for dental images and close some of the gaps that have been identified.

Chapter 2
Methodology

2.1 Introduction

This chapter delves into the methods used for the project, primarily emphasizing the use of the Mask R-CNN model for instance segmentation and classification. In computer vision, instance segmentation is an essential operation, especially when accurate object detection and delineation are required, as in our dental image annotation task. Mask R-CNN is an excellent solution for this problem because it is a powerful extension of the Faster R-CNN framework that can handle both object detection and pixel-level segmentation.

To understand how Mask R-CNN became a state-of-the-art model, we will briefly review the evolution of the region-based convolutional network (R-CNN) family, which provided the foundation for object detection and segmentation. Beginning with the original
R-CNN model, developments resulted in Faster R-CNN, which presented a method for produc-
ing region proposals that was more effective. Lastly, Mask R-CNN expanded this architecture
by adding a segmentation mask prediction branch, which allowed it to perform segmentation,
classification, and detection tasks at the same time.

The main features of the Mask R-CNN model, such as its architecture, the function of the
Region Proposal Network (RPN), and its utilization of backbone networks, will be covered in
this chapter. We will also go over the preprocessing procedures used to get the data ready, as
well as how hyperparameters were adjusted for best results in our dataset. We will also review
the wide range of applications in which Mask R-CNN has been used effectively, emphasizing its versatility in various contexts (He, Gkioxari, Dollár, & Girshick, 2017).


2.2 Instance Segmentation

2.2.1 Definition

In computer vision, instance segmentation is an intricate and difficult process that requires precise object recognition and delineation within an image. In contrast to classical object detection, which only identifies the existence and location of objects, instance segmentation necessitates the production of exact pixel-wise masks for each detected object. This allows for a more nuanced and comprehensive understanding of the visual scene (He et al., 2017).

Instance segmentation addresses the need for a more detailed understanding of the objects within an image. In object detection, two cars that belong to the same class are treated as the same category; instance segmentation, on the other hand, distinguishes between different instances of the same object. For example, in an image containing multiple vehicles, instance segmentation not only classifies them all as ”cars” but also assigns a distinct mask to each individual car (Hafiz & Bhat, 2020).

Figure 2.1: Instance Segmentation.

2.2.2 Technical Components of Instance Segmentation

Since instance segmentation necessitates exact pixel-by-pixel localization of the object’s bound-
aries, it is a more complicated process than ordinary object detection. Three main goals can be
included into the task: segmentation, classification, and object detection.


• Object Detection: Finding the objects in an image is the initial stage of the instance segmentation process. This requires generating region proposals for potential object locations. Region proposal networks (RPNs) are a popular technique for generating candidate bounding boxes for possible objects in the image.

• Classification: The object within each proposed region must be classified after the regions have been proposed. Is the detected object, for instance, a tree, an automobile, or a pedestrian? This classification task assigns each object to a specific category.

• Segmentation: At this point, each object's pixel-level boundaries are identified, enabling the creation of the exact masks that define the contours of every recognized object. This stage adds the capacity to fully comprehend the shape and structure of the object, setting instance segmentation apart from plain object detection.

A common instance segmentation model is usually based on Convolutional Neural Networks (CNNs), one of the deep learning techniques. Architectures such as Mask R-CNN, which add a parallel branch for pixel-level segmentation to object detection frameworks, are used in modern instance segmentation models. This allows the model to recognize and classify the object and also generate a mask for each instance (Hafiz & Bhat, 2020).
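A minimal sketch of what such a model returns per image is shown below, using torchvision's Mask R-CNN implementation: one bounding box, class label, confidence score, and soft pixel mask per detected instance, so that two objects of the same class receive separate masks. The random tensor is only a placeholder for a real image.

    import torch
    from torchvision.models.detection import maskrcnn_resnet50_fpn

    model = maskrcnn_resnet50_fpn(pretrained=True).eval()
    image = torch.rand(3, 512, 512)     # placeholder for a real image tensor in [0, 1]
    with torch.no_grad():
        out = model([image])[0]         # boxes, labels, scores, masks: one row per instance

    for box, label, mask in zip(out["boxes"], out["labels"], out["masks"]):
        binary_mask = mask[0] > 0.5     # threshold the soft mask into a per-instance mask
        print(int(label), box.tolist(), int(binary_mask.sum()), "mask pixels")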

2.2.3 Key Challenges in Instance Segmentation

We cannot deny that instance segmentation is particularly challenging as it deals with a number
of problems that are not encountered in conventional object detection tasks.

• Occlusion:
It might be challenging to accurately detect and segment individual instances when objects partially or totally overlap, which happens in many real-world circumstances. This is a typical problem in busy scenes where objects obscure one another, such as traffic scenes or group photos. An instance segmentation model needs to be resilient enough to accurately distinguish each object even in the presence of occlusions.

• Varying Object Sizes:
Handling objects of different sizes presents another difficulty. Many images contain objects at various scales, ranging from small objects (such as distant pedestrians) to large ones (such as foreground cars). The model needs to be able to accurately identify and segment objects at these various scales.


• Complicated Object Shapes:
Segmenting objects in an image can be challenging if they have complicated or irregular shapes. Bounding boxes cannot always capture the complex boundaries of objects such as trees or animals. The pixel-level accuracy of instance segmentation permits a more detailed depiction of these objects; however, reaching this level of precision is extremely challenging.

2.3 Proposed Model

2.3.1 Historical Context

Evolution from R-CNN to Faster R-CNN to Mask R-CNN The progression of object recog-
nition and instance segmentation techniques from R-CNN to Faster R-CNN and, finally, Mask
R-CNN is a noteworthy development that reflects the ongoing improvement of deep learning
methods in computer vision.

R-CNN: The Foundation of Modern Object Detection The narrative starts with Ross Gir-
shick’s 2014 introduction of R-CNNs (Region-based Convolutional Neural Networks). One of
the first methods that helped establish region-based object detection was R-CNN. The technique
works by first applying an external algorithm called selective search to produce roughly 2,000 re-
gion proposals for each image. After that, each proposal is fixed in size and runs through a CNN
to extract features. These attributes are then used to fine-tune the bounding box coordinates and
identify the object inside the region using a series of classifiers (such as SVMs). Although R-
CNN produced notable improvements in accuracy, it was slow and computationally costly: each region proposal had to be processed separately by the CNN, resulting in redundant computations and lengthy processing times that made the model impractical for real-time applications (Virasova, Klimov, Khromov, Gubaidullin, & Oreshko, 2021).

Fast R-CNN: Improving Efficiency and Speed To address the inefficiencies of R-CNN, the same group introduced Fast R-CNN in 2015. By offering a single-stage method in which the complete image is run through a CNN to create a feature map, Fast R-CNN improved on R-CNN.
The regions of interest (ROIs) are directly highlighted on the feature map, eliminating the need
to distort each area proposal and resulting in a large reduction in redundant computations. From
these ROIs, the RoI Pooling layer subsequently extracts fixed-size feature maps that are utilized
in bounding box regression and classification.

Fast R-CNN significantly reduced training and inference times without sacrificing accuracy
(Huang et al., 2022). However, it continued to rely on external region proposal techniques like selective search which, even though faster than processing each proposal separately, still created a performance and efficiency bottleneck.

Faster R-CNN: A Fully Integrated Approach Faster R-CNN, which completely incorpo-
rated the region proposal procedure into the neural network design in 2015, was the next sig-
nificant advancement. The Region Proposal Network (RPN), a lightweight network that shares
convolutional layers with the primary detection network, was introduced by Faster R-CNN. By
sliding a tiny network across the common feature map, the RPN effectively creates region pro-
posals, producing bounding boxes and objectness ratings. The object detection process became
faster and more efficient as a result of this integration, which removed the requirement for exter-
nal region proposal techniques.

Because the region proposals were generated in a way better tailored to the downstream detection task, Faster R-CNN increased both the speed and accuracy of object detection. Thanks to this breakthrough, Faster R-CNN has become one of the most popular object detection frameworks and is used in a wide range of applications (Ren, He, Girshick, & Sun, 2016).

Mask R-CNN: Extending to Instance Segmentation Faster R-CNN performed incredibly well in object detection, but it was not designed for pixel-level applications like instance segmen-
tation. This gap prompted the creation of Mask R-CNN in 2017, which added a branch for
segmentation mask prediction for every item to Faster R-CNN, along with bounding box de-
tection and classification. By avoiding quantization, Mask R-CNN introduced RoI Align, an
advance over RoI Pooling that allowed for more precise localization by preserving the precise
spatial locations of areas. The model can produce object masks with pixel accuracy thanks to
the mask branch, which is added in parallel to the other branches and predicts a binary mask for
each ROI. Mask R-CNN’s performance in instance segmentation tasks, where it is essential to
differentiate between overlapping objects and capture their exact boundaries, has improved as a
result of this development (Yang, Dong, Xu, & Gu, 2020).

The progression of object identification models, from R-CNN to Faster R-CNN to Mask
R-CNN, shows a step-by-step improvement, with each iteration resolving the shortcomings of
the previous one. Region-based object detection was made possible by R-CNN, but its process-
ing inefficiency was a drawback. Fast R-CNN enhanced speed by sharing computation across
regions, but still relied on external proposal methods. Faster R-CNN, which completely integrated the proposal generation process into the network, produced a quicker and more precise detection framework. Lastly, Mask R-CNN added the capacity to generate pixel-level
object masks, extending these features to instance segmentation. The field of computer vision
has greatly evolved as a result of these developments, allowing for more accurate and effective
object detection and segmentation (Yang et al., 2020).


Figure 2.2: History.

Faster R-CNN: Model Overview and Architecture

One of the main contributions to the development of object identification models is Faster R-
CNN, which offers a mechanism to produce region recommendations directly within the network
instead of depending on external algorithms. This is a major improvement over its predecessors.
A faster and more accurate object detection framework was produced as a result of this break-
through, and it was widely used in many different computer vision applications.

There are multiple important components that make up Faster R-CNN’s architecture. Its
fundamental component is a backbone network, which functions as a feature extractor and cre-
ates rich feature maps from input images; this backbone is typically a convolutional neural network (CNN) such as ResNet or VGG. The key innovation in Faster R-CNN is the Region Proposal Network (RPN), a fully convolutional network that slides over these feature maps to suggest possible object regions, or Regions of Interest (RoIs). Bounding boxes and object-
ness scores, which represent the probability that each region contains an object, are produced by
the RPN for these regions.

The suggestions produced by the RPN are passed into a layer called ROI Pooling, which takes
the variable-sized proposals and extracts fixed-size feature maps from them. Following that,
these feature maps are run through two concurrent branches: one for bounding box regression, which refines the coordinates of the proposed bounding boxes to better fit the detected objects, and another for object classification, which labels each proposed region.
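In torchvision's implementation these design choices are explicit constructor arguments, as the sketch below shows: the backbone produces the feature maps, and the anchor scales and aspect ratios parameterize the RPN. The specific sizes, ratios, and the 33-class setting (32 tooth classes plus background) are illustrative assumptions, not values from the cited works.

    from torchvision.models.detection import FasterRCNN
    from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
    from torchvision.models.detection.rpn import AnchorGenerator

    # One anchor scale per feature-pyramid level, three aspect ratios everywhere.
    anchor_generator = AnchorGenerator(
        sizes=((32,), (64,), (128,), (256,), (512,)),
        aspect_ratios=((0.5, 1.0, 2.0),) * 5,
    )

    backbone = resnet_fpn_backbone("resnet50", pretrained=False)  # CNN feature extractor
    model = FasterRCNN(backbone, num_classes=33,                  # e.g. 32 teeth + background
                       rpn_anchor_generator=anchor_generator)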


Figure 2.3: Faster RCNN’s Architecture.

Limitations of Faster R-CNN

Faster R-CNN is successful, but it has several drawbacks, especially when used for more difficult
tasks like instance segmentation. Its incapacity to execute segmentation at the pixel level is its
main drawback. Faster R-CNN can recognize things with accuracy and draw bounding boxes
around them, but it is unable to segment out the precise pixels that correspond to each object
within the bounding box or distinguish between objects that overlap. This lack of granularity makes it inappropriate for tasks where exact object boundaries are essential.

The ROI Pooling layer presents another drawback. ROI Pooling can produce fixed-size fea-
ture maps from proposals, but it does so by quantizing the coordinates of the feature map, which
may result in a loss of spatial accuracy. Even though this quantization technique is computa-
tionally efficient, it introduces errors that may impair the model’s capacity to accurately pinpoint
objects, particularly those that are smaller.

Furthermore, managing objects of different sizes is a little difficult due to the architecture
of Faster R-CNN. Although the RPN is good at suggesting regions, it occasionally has trouble
coming up with the best suggestions for things that are much bigger or smaller than the anchors
are intended to support due to their predetermined aspect ratios and anchor sizes.

How Mask R-CNN Addresses These Limitations

In order to overcome the shortcomings of Faster R-CNN, Mask R-CNN was created, especially
by increasing its capacity to carry out instance segmentation. One of the most important advances is the replacement of the RoI Pooling layer with the RoI Align layer, which eliminates the quantization of feature map coordinates. By maintaining the precise spatial coordinates of the proposals, RoI Align improves object localization and performance on tasks requiring precise boundary identification.

Additionally, Mask R-CNN creates a new mask branch concurrently with the bounding box
regression and classification branches already in place. For every item, this branch creates pixel-
level segmentation masks, which allow the model to distinguish between overlapping instances
and produce comprehensive segmentation outputs. The mask branch successfully overcomes the
segmentation restrictions of Faster R-CNN by operating at a lower resolution to control compu-
tational costs while still providing accurate masks.

Through the integration of these improvements, Mask R-CNN not only preserves the advan-
tages of Faster R-CNN but also expands its capabilities to manage increasingly intricate assign-
ments, transforming it into a more adaptable and potent instrument for an extensive array of
computer vision uses (He et al., 2017).

2.3.2 Mask RCNN

He et al.’s Mask R-CNN, which tackles this challenging problem, expands upon Faster R-CNN’s
strong architecture for object detection. By including a branch for segmentation mask predic-
tion, Mask R-CNN expands on Faster R-CNN and successfully unifies object identification and
semantic segmentation into a single architecture. Because of this dual ability, Mask R-CNN can
precisely capture the shape and outlines of each object in an image by creating high-resolution
masks for each one in addition to identifying and classifying them (He et al., 2017).

Architecture of Mask R-CNN

Mask R-CNN’s architecture is strong and adaptable, enabling it to handle a wide range of object
recognition and segmentation applications (He et al., 2017). To extract rich feature representa-
tions from the input image, a backbone network—typically a deep convolutional neural network
(CNN) like ResNet or ResNeXt—is utilized first. Following that, these characteristics are sent
via a crucial component that was carried over from Faster R-CNN: a Region Proposal Network
(RPN), which produces a set of candidate object proposals, or areas of the picture that are prob-
ably going to contain objects.

Following the identification of these regions, Mask R-CNN adds a branch dedicated to predicting segmentation masks. This branch uses a small fully convolutional network (FCN) to create a binary mask for each object based on the features of the region of interest (RoI). Another crucial Mask R-CNN innovation, the RoIAlign layer, is essential to this procedure. To maintain spatial accuracy and produce superior mask predictions, it makes sure that the features retrieved for each region are aligned with the original input image. In comparison, Faster R-CNN's RoIPool may lead to misaligned features and, as a result, less precise segmentation.

Figure 2.4: Global Architecture of Mask RCNN.

More about the backbone : ResNet

Residual Network, or ResNet, is a deep neural network architecture first presented in the 2015 publication ”Deep Residual Learning for Image Recognition” by Kaiming He and colleagues. It gained a lot of traction and recognition for its ground-breaking capabilities in deep learning applications, particularly image classification. With a top-5 error rate of 3.57%, ResNet won the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), marking a turning point for deep learning. Deeper neural networks are capable of capturing more complicated features, but training them becomes more difficult.

The ”vanishing gradient” issue is one of the primary problems with deeper networks: the gradients used in backpropagation to update weights become extremely small, making it difficult to update the earlier layers effectively. This causes very deep networks to perform poorly. Adding extra layers may not always improve the model's performance; this is referred to as ”degradation.”

The introduction of skip connections, also known as residual connections, is the main innovation of ResNet. These connections send the input directly to a later layer, skipping one or more layers in the process. The essential concept is that layers learn the residual of the desired mapping rather than the mapping itself. Put simply, ResNet expects each layer to learn the residual, i.e. the difference between the layer's input and output, rather than expecting it to fit the full transformation. This facilitates better convergence of deeper networks and streamlines the learning process (KaimingHe, 2015).

Architecture of ResNet-50 ResNet-50 is a 50-layer deep neural network architecture constructed from residual blocks. In accordance with the ResNet design principle, its architecture mitigates the vanishing gradient problem by adopting skip connections that let the network learn residual mappings rather than direct mappings. ResNet-50 begins with conv1 (an initial convolution and max pooling). Four stages follow, each containing a number of residual blocks. Each residual block in ResNet-50 uses a bottleneck architecture consisting of three convolutional layers: a 1x1 convolution, a 3x3 convolution, and an additional 1x1 convolution. The first 1x1 convolution lowers dimensionality, the 3x3 convolution carries out the real transformation, and the second 1x1 convolution raises dimensionality again. A global average pooling layer and a fully connected classification layer mark the network's conclusion. Together with the skip connections, these residual blocks enable ResNet-50 to retain accuracy while lowering the training challenges related to deep networks.

Figure 2.5: Architecture of ResNet 50.
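The bottleneck block just described can be written down directly; the sketch below (channel counts illustrative) shows the 1x1-reduce / 3x3-transform / 1x1-expand body and the skip connection that adds the input back, so the stacked layers only have to learn the residual.

    import torch.nn as nn

    class Bottleneck(nn.Module):
        def __init__(self, in_ch, mid_ch, stride=1):
            super().__init__()
            out_ch = mid_ch * 4                     # ResNet-50 expands channels fourfold
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 1, bias=False),                            # 1x1 reduce
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False), # 3x3
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, out_ch, 1, bias=False),                           # 1x1 expand
                nn.BatchNorm2d(out_ch),
            )
            # Project the shortcut when the spatial size or channel count changes.
            self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                             nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                           nn.BatchNorm2d(out_ch)))
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + self.shortcut(x))   # F(x) + x: learn the residual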

Advantages of Resnet

• Facilitates Deep Network Training:


ResNet’s residual learning allows it to train networks with hundreds or thousands of layers
efficiently without experiencing performance deterioration or vanishing gradients.


• State-of-the-Art Performance:
For many computer vision tasks, such as object detection, segmentation, and image classi-
fication, ResNet has set the standard. Numerous contemporary neural network topologies
have their roots in this architecture.

• Modular Design:
The architecture is adaptable to various applications and datasets since the residual blocks
may be simply layered to construct networks of varying depths (e.g., ResNet-50, ResNet-
101, etc.).

• Transfer Learning:
ResNet models that have already been trained on massive datasets such as ImageNet can
be refined on smaller datasets, which makes them ideal for transfer learning in a range of
contexts. (KaimingHe, 2015)

Mask R-CNN Parameters and Hyperparameters

Mask R-CNN’s parameter and hyperparameter selection has a significant impact on its perfor-
mance. The richness of the retrieved features, for example, depends on the backbone network’s
depth. Commonly used models that combine good performance and computational efficiency
are ResNet-50 and ResNet-101. The RPN’s usage of anchor scales and aspect ratios, which
specify the dimensions and forms of the areas the network views as possible objects, is also
essential.

The output size, which determines the spatial resolution of the features used for both mask prediction and classification, is one of the parameters introduced by the RoIAlign layer. Hyperparameters in the mask branch itself, such as the filter sizes and the number of convolutional layers, affect the resolution and quality of the final masks.

There are more factors to take into account when training Mask R-CNN, such as the choice of loss functions. Usually a multi-task loss is applied, combining the classification, bounding box regression, and mask prediction losses. Weight parameters regulate the ratio of these losses, and they must be carefully adjusted to make sure the model learns every task efficiently and does not overfit to any one at the expense of the others.
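As a sketch of this multi-task objective: assuming the per-task losses arrive as a dictionary, as they do from torchvision's Mask R-CNN in training mode (entries such as loss_classifier, loss_box_reg, and loss_mask), the weighted total is a simple sum. The weight values below are illustrative, not tuned.

    import torch

    def weighted_multitask_loss(loss_dict, weights=None):
        # Sum the per-task losses; any task not listed keeps weight 1.0.
        weights = weights or {}
        return sum(weights.get(name, 1.0) * value for name, value in loss_dict.items())

    # Illustrative values; in practice loss_dict = model(images, targets) in train mode.
    demo = {"loss_classifier": torch.tensor(0.7),
            "loss_box_reg": torch.tensor(0.4),
            "loss_mask": torch.tensor(0.9)}
    total = weighted_multitask_loss(demo, {"loss_mask": 2.0})   # emphasize the mask task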

Strengths of Mask R-CNN

Mask R-CNN’s coupling of instance segmentation and object detection makes it very reliable
for use in medical applications. It simultaneously classifies individual teeth and accurately locates their borders. This model uses a Feature Pyramid Network (FPN) to recognize objects at
many scales, making it well-suited for jobs requiring great precision. This is especially useful in
medical imaging since anatomical structures like teeth vary widely in size. Mask R-CNN’s seg-
mentation masks provide accurate depictions of tooth borders, and the classification part makes
sure the right label is applied to every tooth.

Limitations of Mask R-CNN

Mask R-CNN has limits despite its advantages, particularly when processing the noisy data and complicated anatomical structures that are prevalent in medical imaging. For example, the model may segment incorrectly or incompletely in cases of occlusion or in situations where neighboring teeth overlap. Furthermore, Mask R-CNN's reliance on lower-level convolutional features makes it challenging to capture delicate details in medical images, which may lessen its efficacy when exact border demarcation is required. Additionally, Mask R-CNN's architecture, which mainly uses convolutional neural networks (CNNs), may fall short of capturing the global context required for medical image analysis.

Because of its limited ability to handle long-range connections between pixels in an image,
the model is less suitable for segmenting small objects or complex structures where the surround-
ing context is crucial to the classification process.

Optimization with attention mechanism

Self-attention is a mechanism that lets deep learning models, especially neural networks, dynamically focus on different portions of the input data (here, the pixels of the panoramic X-ray) when making predictions. In self-attention, each input component, such as a pixel in an image, considers every other component and assigns weights according to how significant they are to one another. As a result, the model can represent linkages and long-range dependencies between remote input components. Finally, self-attention dynamically shifts its emphasis to the most pertinent regions, as opposed to the fixed filters of convolutional networks, providing more flexibility and accuracy for tasks like segmentation. It facilitates the model's ability to recognize and classify objects, especially in intricate situations including overlapping or small objects, by giving priority to specific regions in an image (Vaswani, 2017).
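The computation behind this description is scaled dot-product attention: every position is compared with every other position via query-key similarity, and the output at each position is a similarity-weighted mix of all positions. A minimal sketch follows, with dimensions chosen to match a flattened 7x7 ROI feature map; all values are illustrative.

    import torch
    import torch.nn.functional as F

    def self_attention(x, wq, wk, wv):
        # x: (n, d) with one row per position (e.g. per ROI-feature pixel).
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise scaled similarities
        return F.softmax(scores, dim=-1) @ v      # weighted mix over all positions

    n, d = 49, 256                                # a flattened 7x7 ROI map, 256 channels
    x = torch.randn(n, d)
    wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
    out = self_attention(x, wq, wk, wv)           # same shape as x: (49, 256)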

To overcome the model’s present drawbacks, I propose improving the Mask R-CNN archi-
tecture by adding attention mechanisms, especially after the ROI Align layer and before the Box
Head, allowing it to enhance feature representation for better classification and bounding box
predictions. It has been demonstrated that attention mechanism significantly enhance model per-

46
Chapter 2 : Methodology

Figure 2.6: Customized Architecture of Mask RCNN.

formance by enabling the network to concentrate on important regions of the image and compre-
hend spatial relationships more fully. We may enhance the model’s ability to focus on pertinent
characteristics, improving segmentation accuracy and object classification even in difficult con-
texts like dental X-rays, by incorporating layers into the segmentation and classification heads of
the model.

The emphasis of the model is constantly shifted towards the most significant portions of the image via attention layers, in contrast to standard convolutional filters, which are fixed and static. This flexibility is especially helpful when handling problems such as occlusions and overlapping features, like teeth in dental images. With the addition of attention, the model will be able to identify minute details and long-range relationships, which is essential for more accurate tooth classification and segmentation boundaries. Furthermore, attention adds useful contextual information, which improves the model's ability to distinguish between visually similar teeth. This enhanced contextual awareness ensures higher annotation quality, which ultimately improves the model's robustness and accuracy in challenging, real-world dental imaging tasks.

In particular, in challenging settings like dental imaging, where complex anatomical structures and variability in image quality make correct segmentation and classification difficult, we hope to address the gaps left by Mask R-CNN by introducing attention. When attention is combined with Mask R-CNN's existing capabilities, the outcome should be a more comprehensive annotation tool that can yield more accurate and dependable results in the medical domain.
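
As a rough sketch of this idea, assuming a PyTorch-style implementation with illustrative names and sizes (this is not the thesis code), multi-head self-attention could be inserted between the ROI Align output and the box head as follows:

```python
import torch
import torch.nn as nn

class AttentionRoIBoxHead(nn.Module):
    """Hypothetical sketch: multi-head self-attention applied to ROI-Align
    features before they reach the box head."""
    def __init__(self, in_channels: int = 256, pool_size: int = 7, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=in_channels,
                                          num_heads=num_heads,
                                          batch_first=True)
        flat = in_channels * pool_size * pool_size
        self.box_head = nn.Sequential(           # standard two-layer FC box head
            nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, C, H, W) pooled by ROI Align
        n, c, h, w = roi_feats.shape
        tokens = roi_feats.flatten(2).transpose(1, 2)    # (num_rois, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)  # each cell attends to all others
        refined = (tokens + attended).transpose(1, 2).reshape(n, c, h, w)  # residual add
        return self.box_head(refined)  # features fed to the classification/box predictors

feats = torch.randn(4, 256, 7, 7)          # 4 dummy proposals
print(AttentionRoIBoxHead()(feats).shape)  # torch.Size([4, 1024])
```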

Applications and Techniques

Because of Mask R-CNN’s adaptability, it can be used in a variety of contexts, such as medical
image analysis, where accurate anatomical structure segmentation is crucial, and autonomous
driving, where safe navigation depends on knowing the precise location and shape of objects like
cars and pedestrians.

Several methods can be used to improve Mask R-CNN's performance before deployment. For example, data augmentation is frequently employed to diversify the training set, which improves the model's ability to generalize to previously unseen images. By adding variability, methods like flipping, random cropping, and color jittering can lessen overfitting and increase robustness, as sketched below.
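
The following is an illustrative torchvision sketch of the image-level augmentations just named (not thesis code; note that a full detection/segmentation pipeline would use transforms that update boxes and masks jointly):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # flipping
    T.RandomCrop(512, pad_if_needed=True),         # random cropping
    T.ColorJitter(brightness=0.2, contrast=0.2),   # color jittering
])
```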

Furthermore, transfer learning is often used, in which a target dataset is used to refine a
backbone network that has already been trained. This method makes use of the enormous amount
of knowledge that is already present in models such as ResNet, which speeds up training and
frequently improves performance, particularly in situations where labeled data is hard to come
by.

In conclusion, Mask R-CNN offers a strong and adaptable framework that can precisely rec-
ognize objects and segment them at the pixel level, marking a substantial breakthrough in the field
of instance segmentation. Its architecture, which combines cutting-edge elements like RoIAlign
and a special mask branch with well-considered parameters and hyperparameters, enables it to
perform exceptionally well in a variety of visual tasks, making it a useful tool for both academic
and commercial applications.

2.4 Proposed Automatic Image Annotation System Overview

The proposed system consists of three main steps: preprocessing, processing, and evaluation. As shown in Figure 2.7, the preprocessing phase involves gathering the necessary data, followed by cleaning to remove any unwanted or duplicate entries. The data is then annotated, categorized, and labeled. This results in two datasets: Dataset 1 and Dataset 2.

Figure 2.7: Automatic Image Annotation System overview

In the processing phase, training is performed using Mask R-CNN with ResNet-101 as the backbone model, generating initial results. These results are then optimized through the integration of an attention mechanism. The final step involves evaluating the performance of the models.

2.4.1 Image acquisition and preprocessing


a) Image acquisition
The first crucial stage in developing any automatic annotation system is acquiring the images the model will learn from. In a medical context such as dental radiology, images are typically obtained via specialized equipment, for example digital dental X-ray machines. To ensure patient privacy, these images must be acquired in compliance with ethical and regulatory requirements. In addition to direct capture by imaging instruments, image acquisition may also involve retrieval from existing medical databases.

b) Image pre-processing
Pre-processing images is a crucial step in enhancing their quality and making it easier for the segmentation model to analyze them. Pre-processing typically includes grayscale normalization, noise reduction with filters (such as the Gaussian filter), contrast enhancement with histogram equalization, and cropping images to uniform sizes.


Reducing unnecessary variation in images is one of the primary goals of pre-processing, which also helps machine learning algorithms perform better. Pre-processing can, for instance, involve adjusting brightness and contrast in dental X-rays to highlight dental features.
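
A minimal sketch of these pre-processing steps, assuming OpenCV and an illustrative target size (this is not the thesis code), might look like this:

```python
import cv2
import numpy as np

def preprocess_xray(path: str, size: tuple = (1024, 512)) -> np.ndarray:
    """Apply the pre-processing steps described above to one radiograph."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # load as grayscale
    img = cv2.GaussianBlur(img, (5, 5), 0)        # noise reduction (Gaussian filter)
    img = cv2.equalizeHist(img)                   # contrast enhancement (histogram equalization)
    img = cv2.resize(img, size)                   # scale to a uniform size
    return img.astype(np.float32) / 255.0         # grayscale normalization to [0, 1]
```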

2.4.2 Automatic image segmentation

The goal of automatic image segmentation is to separate an image into multiple regions or segments that correspond to distinct objects or areas within the image. This stage is essential for identifying specific objects in an X-ray, like teeth, and for giving a more thorough comprehension of the image. Segmentation allows for the isolation of individual teeth in dental X-rays, which facilitates automatic labeling and identification of the teeth.

In the Mask R-CNN architecture, image segmentation is accomplished by including a mask branch in the Faster R-CNN object detection network. The task of this branch is to predict a binary mask for every object found in the image. This mask represents the region the object occupies, allowing it to be properly segmented. A dental X-ray, for instance, can be divided into multiple zones, each of which corresponds to a different tooth. The contours of the segmented teeth are indicated on the images by applying the masks predicted by Mask R-CNN. A deep learning model trained on manually annotated samples automates this process.

2.4.3 Image attribute extraction and calculation

After the images have been divided into segments, the pertinent characteristics or attributes of each segment must be extracted. These attributes, which include shape, size, texture, and pixel intensity, are numerical data that describe different aspects of the segmented object. In the context of dental X-rays, these features may include tooth-specific details like the curvature of the dental crown or the distance between neighboring teeth. Extracting characteristics is a vital step, as this information is then utilized to classify the segmented objects.

With Mask R-CNN, deep features are generated by a backbone (e.g., ResNeXt) that extracts characteristics at various levels of the image. The Feature Pyramid Network (FPN) is additionally integrated to capture information at multiple scales, which is very helpful for recognizing objects of varying sizes. Thus, each segmented object can be richly and thoroughly represented thanks to the extracted features, which also supply the information required for the annotation and classification phases.


2.4.4 Annotation by automatic image classification

Automatic annotation labels objects or regions in images using the previously extracted attributes. In the case of dental radiographs, this means that every segment recognized as a tooth needs to be categorized based on its kind (molar, incisor, premolar, etc.). This phase is essential for the system to produce accurate annotations that medical experts can use for diagnosis or therapy. Mask R-CNN accomplishes this task with its classification branch, which predicts each segmented object's class. Features retrieved by the network's backbone drive this process, and extra neural network layers are used to refine the class predictions. An annotated dataset with distinct labels for each tooth is used to train the model. During inference, the model determines the labels for the segments it detects by comparing the features derived from the new images with those learned during training.

2.5 Performance of the automatic image annotation system

2.5.1 Confusion Matrix

Figure 2.8: Confusion matrix.

• True Positives (TP): The number of cases that the model correctly classifies as positive and that are actually positive.

• False Positives (FP): The number of cases that are actually negative but are mistakenly classified as positive by the model.

• False Negatives (FN): The number of cases that the model classifies as negative even though they are actually positive.

• True Negatives (TN): The number of cases that the model correctly classifies as negative and that are actually negative.

2.5.2 Accuracy

Accuracy gives an overall evaluation of model performance by measuring the proportion of correctly classified cases among all cases in a dataset. It summarizes the model's total performance.

2.5.3 Precision

The precision of a model indicates its capacity to avoid false positives and reflects the accuracy of its positive predictions. It is measured as the proportion of true positive predictions among all positive predictions generated by the model.

2.5.4 Recall

Recall, or sensitivity, is the proportion of true positive predictions among all actual positive cases in the dataset. This number indicates how well the model captures all positive instances, i.e., how completely the real positive samples are covered.

2.5.5 F1-Score

The F1 score is a useful metric in situations where there is an uneven distribution of classes or where the costs of false positives and false negatives differ. It is calculated as the harmonic mean of precision and recall.
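
In terms of the confusion-matrix counts defined above, these four metrics are computed as:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP},
\]
\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]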


2.5.6 Specific performance metrics

The following are common performance metrics, especially in computer vision, that are used to
assess object detection and recognition systems.

Intersection over Union

Intersection over Union (IoU) is a metric frequently used to assess how well object detection algorithms work, especially for segmentation and bounding-box detection. IoU measures the overlap between two bounding boxes or regions of interest. The degree of alignment between predicted and ground-truth bounding boxes is measured using this metric, offering a numerical indicator of how accurate object localization is for tasks like segmentation and object recognition.
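
For a predicted region \(A\) and a ground-truth region \(B\):

\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]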

Average Precision

Average Precision (AP) takes the precision-recall trade-off into account when calculating the model's prediction accuracy.

Mean Average Precision

Mean Average Precision (mAP) is an expansion of the Average Precision (AP) metric, which is frequently used to assess the effectiveness of object detection and recognition algorithms. It offers a thorough evaluation of object detection systems' accuracy across a variety of object classes.

The Average Precision (AP) scores for each class are averaged to determine the Mean Average Precision (mAP). The AP for each class is the area under that class's Precision-Recall curve (PR curve), which measures the accuracy of object detection models for that class.
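
For \(N\) object classes with per-class average precisions \(\mathrm{AP}_i\):

\[
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i
\]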

2.6 Conclusion

This chapter describes the approach we used for the project, with an emphasis on segmenting and classifying dental X-ray images using Mask R-CNN. The comprehensive exposition of the model's design, encompassing the Region Proposal Network (RPN) and its multitasking functionalities, underscores its appropriateness for tackling the obstacles linked to dental image annotation. Furthermore, we investigated a range of preprocessing methods, including data augmentation and normalization, which were utilized to improve the model's robustness and performance. Moreover, the incorporation of attention mechanisms was presented as a crucial advancement in enhancing the identification of small structures and managing overlapping areas. This methodology's mix of sophisticated computer vision models and deep learning approaches offers a strong basis for achieving high accuracy in challenging medical imaging tasks like dental annotation. The results of the applied strategy will be covered in detail in the next chapter, where we will assess the model's effectiveness using the suggested metrics.

Chapter 3
Experiments and Results


3.1 Introduction

In this chapter, we present and evaluate the results of the approaches and strategies covered in the earlier sections. The objective is to assess the Mask R-CNN model's performance in segmenting and classifying dental X-ray images. Various measures, including precision, recall, and F1-score, will be used to evaluate the results and shed light on the model's strengths and weaknesses. Furthermore, the effect of integrating attention layers will be evaluated by contrasting the model's performance with and without attention mechanisms. The difficulties encountered during deployment will also be highlighted in this chapter, along with areas for improvement that could result in a more reliable and accurate dental image annotation system.

3.2 Working environment

Several platforms and technologies were employed in this project to guarantee the effective and organized development of the various phases of research and experimentation. These resources gave us access to scientific databases, sophisticated annotation tools, and robust computing environments, allowing us to refine and organize our results.

3.2.1 Kaggle

Kaggle is a crucial platform that provides a highly efficient workspace for implementing complex models. T4 GPUs were used in this project to speed up experimentation and processing, especially during the deep neural network training phases. This platform's primary benefit is its integration with libraries like TensorFlow and PyTorch, which makes environment setup quick.

However, one of Kaggle's main drawbacks is its restriction of GPU usage to 30 hours per week. This necessitates strict execution-time management and effective scheduling of training and computing sessions to make the most of the available resources. Despite this limitation, the T4 GPU allows for much faster computation than conventional CPUs, executing complex models up to ten times faster, which is a major benefit for resource-intensive research on networks like Mask R-CNN.


Figure 3.1: Working Platform: Kaggle.

3.2.2 Roboflow

In supervised learning applications, image annotation is an essential first step, especially for segmentation and object recognition. Roboflow is well suited to this task. It is distinctive in that it offers annotations with polygonal boundaries, allowing more accurate object delimitation, especially in dental X-rays. For medical images, the ability to generate custom annotations from polygonal shapes is particularly advantageous because bounding boxes are sometimes insufficient to accurately delineate complicated anatomical structures. This Roboflow feature ensures a more faithful portrayal of the tooth shapes and anomalies evident in radiographs. In addition, the platform provides image pre-processing techniques such as normalization, data augmentation, and format conversion, enabling datasets to be easily adapted to multiple algorithms.

Figure 3.2: Image Annotation Platform: Roboflow.

3.3 Dataset

To ensure a representative and diversified collection of dental structures, we used a real-world dataset of dental X-ray images collected from several sources for this research. Meticulous selection of the data was a crucial step in this procedure. First, we received a dataset of 100 dental X-ray images from a radiologist, which we added to our collection. This dataset served as the foundation for our training set, along with extra X-ray images obtained via the Kaggle platform. Kaggle, which is renowned for offering open-access datasets, was a great help in growing our dataset and ensuring we had enough images to properly build and test the model. After refinement, the entire dataset used for the study included about 100 images.

Figure 3.3: Dataset X-ray teeth image examples

An important stage in the dataset preparation was the deletion of images containing children's dental X-rays. Since children's teeth are still developing, these images were removed from consideration, as their inclusion could introduce significant variability and inaccuracies into the model (see Figure 3.4). Children's teeth often have very different structures from adult teeth. This variability could lead to erroneous model training, which would negatively impact segmentation and classification results. By concentrating only on fully formed adult teeth, we secured a more consistent dataset that would result in improved precision in the model's predictions.

Figure 3.4: Children X-ray image example

The next important step, after the dataset had been refined, was to manually annotate the dental X-ray images. This procedure was completed with the Roboflow annotation tool, an efficient platform that enables advanced image annotation with features such as polygonal borders, enabling exact labeling of dental structures. Each tooth's shape and location within the X-ray images were meticulously marked using polygonal boundaries. Compared to standard rectangular bounding boxes, this method offered more thorough annotations, since accurate labeling was necessary for training due to the irregular shapes of teeth. Additionally, each tooth was assigned a particular class during annotation, guaranteeing that the model would be able to distinguish between different types of teeth.

3.4 Proposed automatic Teeth segmentation Workflow

We followed a step-by-step strategy, starting with data gathering and progressing through preprocessing, annotation, segmentation, and validation, to make sure the methodology was thorough and organized. This systematic strategy allowed us to address every important project component in depth and logically. We prepared the raw data for the machine learning pipeline by first gathering and enhancing it, and we guaranteed a high degree of accuracy for our segmentation model through manual annotation. The results' dependability was further enhanced by the segmentation process, which relied on sophisticated deep learning techniques and underwent stringent validation. Every stage was crucial in creating a strong model that could manage the intricacies of dental X-ray imaging and provide accurate findings for treatment planning and diagnostics.

3.4.1 Data Acquisition

In this step, we opted for a real-world dataset from which the dental X-ray images were taken. It comprises seventy real-world dental X-ray images with a resolution of 1024x512. An extra sixty X-ray images with a higher resolution of 2939x1512 were gathered from a radiologist (from Kairouan, Tunisia) to ensure the variety needed for good training. As a result, there were 130 images in all. However, images with immature teeth, especially those of young children, were later removed because they could affect the model's accuracy. X-ray scans made up the majority of the retained images because they capture minute details of the teeth that are essential for diagnosis. The dataset can be accessed via this link: https://data.mendeley.com/datasets/73n3kz2k4k/2.

3.4.2 Data Preprocessing

Several preprocessing steps were carried out to improve the quality of the images and standardize the input format before feeding them into the model:


Figure 3.5: Automatic Teeth annotation - proposed pipeline.

• Resizing: All images were resized to match the input size that the neural network expects. This improves computational performance and guarantees consistency across all inputs.

• Cleaning: All grainy or noisy images that could have affected the model's behavior were removed. Artifacts or occlusions in the images were also removed in this step to avoid biasing the training process.

• Normalization: To lessen fluctuations in brightness and contrast, the image pixel values were standardized to a common range, usually [0, 1]. This improves the uniformity of the input distributions, which speeds up the neural network's convergence.

3.4.3 Data Annotation

The dataset was manually annotated in order to create a reliable model. Every tooth in the X-ray images was identified and segmented with polygonal boundaries using the Roboflow tool. Every tooth was assigned a class according to its kind or place in the mouth (e.g., canines, molars, etc.). This stage was crucial since it supplied the ground-truth data required to train the segmentation and classification model.

Figure 3.6: Image Annotation using Roboflow tool

3.4.4 Segmentation

Now that we have our labeled dataset, we are ready to set up our Mask R-CNN model for segmentation. The Mask R-CNN model is the central component of our research, used to segment the individual teeth in the X-ray images. As Figure 3.7 shows, Mask R-CNN was selected because it can simultaneously conduct pixel-wise segmentation and classification. Precise segmentation of the dental features is made possible by processing each region of interest (RoI) that the model proposes to produce a binary mask for each detected tooth. Each tooth is segmented separately to guarantee precise identification even of teeth with overlapping boundaries. The training parameters can be found in Table 3.1 below.


Figure 3.7: Mask RCNN architecture

Table 3.1: Mask RCNN Hyperparameters Settings


Hyperparameter Value
Optimizer AdamW
Backbone network ResNet-101
Number of epochs 240
Batch Size 2
Batch Size per Image for ROI Head 128
Steps per Epoch 1000
Learning Rate 0.001
Number of classes 32
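
The thesis text does not name a training framework, but the hyperparameter names in Table 3.1 (for example, the batch size per image for the ROI head) mirror Detectron2-style configuration keys. The following is therefore only an illustrative sketch under that assumption: the dataset names are placeholders, and the AdamW optimizer from the table would require overriding Detectron2's default SGD optimizer builder.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))  # ResNet-101 + FPN backbone
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")   # transfer learning from COCO
cfg.DATASETS.TRAIN = ("teeth_train",)       # placeholder dataset names
cfg.DATASETS.TEST = ("teeth_val",)
cfg.SOLVER.IMS_PER_BATCH = 2                # batch size
cfg.SOLVER.BASE_LR = 0.001                  # learning rate
cfg.SOLVER.MAX_ITER = 240 * 1000            # 240 epochs x 1000 steps per epoch
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 32        # 32 tooth classes

trainer = DefaultTrainer(cfg)   # DefaultTrainer uses SGD by default; AdamW would
trainer.resume_or_load(resume=False)  # need a custom build_optimizer override
trainer.train()
```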

3.4.5 Validation

A validation phase was carried out to assess the model's performance after it had been trained. The model's capacity for generalization was evaluated using a separate validation set, which measured important metrics such as accuracy, precision, recall, and intersection over union (IoU). This phase was crucial for fine-tuning the model to ensure it works effectively on both the training set and unseen data. It revealed potential areas for model improvement, such as managing smaller objects or fixing classification mistakes for specific tooth types.
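
Since the metrics reported later (AP and AR over IoU 0.50:0.95, broken down by object size and MaxDets) follow the COCO evaluation protocol, the validation step can be illustrated by continuing the hypothetical Detectron2 sketch above; dataset names remain placeholders:

```python
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

# Continues the training sketch: COCO-style AP/AR on the held-out validation split.
evaluator = COCOEvaluator("teeth_val", output_dir="./eval")
val_loader = build_detection_test_loader(cfg, "teeth_val")
print(inference_on_dataset(trainer.model, val_loader, evaluator))
```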

3.4.6 Model Optimization with Attention

Attention mechanisms in deep learning have become effective tools to improve model perfor-
mance by enabling the network to concentrate on the most important portions of the input data.


Our Mask R-CNN architecture incorporates attention methods in an effort to increase the accuracy of object recognition and segmentation, especially for small and overlapping objects in dental X-ray images. Mask R-CNN has limits when handling small objects or objects in cluttered settings, despite its effectiveness in object detection and segmentation. The overlapping structures in dental X-rays, such as teeth or other anatomical features, present a major challenge to conventional instance segmentation models. Herein lies the role of attention mechanisms: the model can recognize and classify objects more accurately by prioritizing the more relevant portions of the image, assigning distinct "weights" to them.

In this instance, the model's capacity to segment and identify teeth in intricate X-ray images is enhanced by the addition of multi-head attention within each ROI (Region of Interest) head. This optimization tackles important issues such as:

• Small Object Detection: The model can zoom in on small, intricate areas of the image
thanks to multi-head attention, which makes it simpler to recognize smaller teeth or dental
structures.

• Handling Overlapping Objects: By concentrating on important characteristics particular to each instance, the attention mechanism aids the model in distinguishing between overlapping teeth.

In the end, attention mechanisms provide more accurate and dependable dental image analy-
sis, which is a major advancement over conventional Mask R-CNN techniques.

3.5 Results and Discussion

In this section we compare the results using Mask R-CNN alone and Mask R-CNN with attention. We also take a look at the output images resulting from the predictions made by the model; examples are shown below.

The result, obtained in just 6000 steps, indicates a significant specialization of the model to our task and an improvement in the model's performance, particularly in the segmentation and classification of segmented teeth. This progress is mostly attributable to the integration of attention mechanisms, which enable the model to focus on the most important details in the image. Even so, while the current results are encouraging and promising, the model may still benefit from further fine-tuning. By fine-tuning the hyperparameters and further adjusting the attention mechanisms, we expect ongoing improvements in prediction accuracy and in the ability to generalize to even more complicated data.


Figures 3.8–3.11: Example model outputs 1–4 (grouped as Figure 3.12: Output images)

3.5.1 Comparison between both models

To compare the two sets of results, with and without attention, for our dental X-ray segmentation model, we dissect each important measure and offer a comprehensive analysis. The following explains how the two key metrics, Average Precision and Average Recall, reflect performance on small, medium, and large objects in our dataset.


Figure 3.13: Results before Adding attention.


Figure 3.14: Results after Adding attention.


Key Metrics Overview

• Average Precision (AP):

- AP@[IoU=0.50:0.95]: This metric reflects the model's ability to correctly detect objects over a wide range of intersection over union (IoU) thresholds, which shows the overall precision. A higher AP indicates that the model can detect more objects with fewer false positives.
- AP@[IoU=0.50]: Measures precision when the IoU threshold is 0.50, commonly known as the PASCAL VOC metric. It is less strict and evaluates how well the model identifies objects without requiring exact overlap.
- AP@[IoU=0.75]: A stricter evaluation that requires more precise bounding boxes. This metric assesses the model's ability to capture the exact shapes and sizes of objects.
- AP by object size (small, medium, large): These metrics provide insights into how well the model performs when detecting objects of different sizes.

• Average Recall (AR):

- AR@[IoU=0.50:0.95]: Measures the model's recall over a range of IoU thresholds, showing its ability to correctly find objects while minimizing missed detections. A higher AR indicates a better ability to detect objects.
- AR@[MaxDets=1, 10, 100]: These metrics show recall when considering different numbers of allowed detections. MaxDets=1 evaluates recall based on the top single detection; MaxDets=10 and 100 increase the number of detections considered, revealing how well the model generalizes with more predictions.
- AR by object size (small, medium, large): Provides recall performance specifically for objects of varying sizes, which is critical when dealing with detailed structures like teeth.
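
As a concrete illustration (not code from the thesis), the IoU that decides whether a prediction counts as a true positive at a given threshold can be computed for two axis-aligned boxes as follows:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# A prediction matching ground truth with IoU >= 0.50 counts as a true
# positive under AP@[IoU=0.50]; AP@[IoU=0.75] demands tighter overlap.
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```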

Analyzing the Results (With Attention vs Without Attention)

- AP @ IoU=0.50:0.95 (Overall): With a score of 0.560, the non-attention model slightly outperforms the attention model, which scores 0.555. This is a very small difference, but it indicates that the model without attention performs somewhat better overall across all IoU thresholds in terms of precision. This could be because the non-attention model is better at capturing broader, more general traits across images, or because the attention layers were still optimizing at this stage of training.

- AP @ IoU=0.50 (Overall): In this case, the attention model achieves 0.829, while the non-attention model achieves 0.835. When the IoU threshold is made more flexible, the non-attention model appears slightly better at object detection. This may suggest that attention makes the model more cautious, which lowers its raw detection performance at lower IoU.

- AP @ IoU=0.75 (Overall): With attention scoring 0.674 and without attention scoring 0.683, both models exhibit comparable performance. The slight variation indicates that the attention mechanism may not greatly impact the model's precision in situations where stricter localization is required, even though the non-attention model performs better in this scenario.

- AP (small objects): The performance on small objects is one of the most important observations. The model scores 0.211 with attention and 0.256 without. This disparity suggests that the attention mechanism is not a major aid in the detection of small objects, which may represent a drawback. The attention layers may need further adjustment to effectively capture the fine details of small structures.

- AP (medium and large objects): When it comes to medium and large objects, the two models' performance is more similar. For medium objects, attention produces a score of 0.562 and for large objects 0.800; without attention, the values are 0.571 and 0.750, respectively. Because attention layers aid in focusing on global structures within the image, the attention model performs well at identifying large objects.

- AR @ IoU=0.50:0.95 (Overall): Across thresholds, both models' recall is rather similar, with the non-attention model scoring 0.683 and the attention model scoring 0.681. Over a variety of IoU thresholds, this shows that both models recognize objects with a similar level of completeness.

- AR @ MaxDets=1: The model without attention beats the attention model by a small margin, scoring 0.176 against 0.170. This statistic is important because it assesses the model's performance when just one detection is permitted per image. The modest decline could be explained by the attention mechanism still learning to focus on the most prominent objects.

- AR (small, medium, large objects):

- Small objects: With a score of 0.230 in the non-attention model against 0.211 with attention, the recall for small objects is higher in the non-attention model. This confirms the previous finding that the attention layers in the current arrangement might not yet be optimal for recognizing objects with finer details.
- Medium objects: With objects of moderate size, both models perform quite similarly. The attention model exhibits a marginal enhancement in AR (0.697 vs. 0.693), indicating that attention contributes to the model's ability to stay focused on medium-sized features.
- Large objects: Remarkably, the attention model performs better in this instance, yielding 0.800 AR as opposed to 0.750 without attention. This suggests that attention layers probably aid in capturing the whole context of larger objects, such as complete jaw sections or fully grown teeth.

It is evident from this comparison that the attention model struggles with small-object identification and precision, while it works well for larger objects and offers marginally improved focus on medium objects. Perhaps because of its simplicity, which spares it the extra complexity of attention layers, the non-attention model is better at identifying minute features.

The attention model's shortcomings are mostly related to how poorly it works with small objects, which is problematic for tasks like dental X-ray analysis where accuracy is vital. By experimenting with different numbers of attention heads or scaling factors, the attention mechanisms can be further improved to solve this problem. Furthermore, multi-scale feature incorporation via methods such as Feature Pyramid Networks (FPN) might assist the model in focusing on global and local details concurrently.

The objective is to enable a more granular emphasis on small elements (such as individual teeth) while still capturing the larger context of the complete dental structure by integrating multi-head attention layers in subsequent model iterations. This should help overcome the drawbacks of the existing method, especially for difficult segmentation tasks that call for accurate object-boundary delineation in addition to classification.

3.6 Conclusion

The results obtained show how well the Mask R-CNN model performs for segmenting and classifying dental X-ray images, with significant gains made possible by the addition of attention mechanisms. In spite of several difficulties, especially when managing intricate dental structures and overlapping areas, the model demonstrated good accuracy and segmentation quality. Nonetheless, a few limitations were noted, such as managing small or hidden objects. The discussion lays out a clear plan for future improvements, such as adjusting hyperparameters and adding more varied datasets, which would increase the model's resilience and suitability for use in actual clinical settings.

Conclusion

This study presents a sophisticated strategy that blends cutting-edge computer vision techniques with deep learning methodologies in order to automate the annotation and classification of dental X-ray images. Through the use of the Mask R-CNN model, a state-of-the-art architecture renowned for its remarkable abilities in object identification and segmentation, we implemented instance segmentation and classification over dental structures. Multi-head attention mechanisms were integrated to improve the model's performance in scenarios involving small-object detection, overlapping structures, and the noise commonly seen in medical imaging; this was a critical improvement to the model. When attention was included, the model's predictions became much more accurate and reliable, strengthening its overall robustness in the context of dental X-ray analysis.

Real-world dental X-ray images that had been manually annotated to provide high-quality data for training and validation made up the dataset used in this study. These annotations were crucial to the model's performance because they enabled it to accurately capture the complex spatial and anatomical information needed for segmentation and classification. The dataset contained cases of impacted teeth, misaligned structures, and variable image quality. Despite this complexity, the hybrid strategy of Mask R-CNN with attention mechanisms showed notable success in recognizing and labeling individual teeth. The model demonstrated an impressive capacity for generalization, providing precise annotations and classifications even in complex circumstances. The efficient application of deep learning with a small dataset is one of this work's primary contributions. With just a small amount of training data, the hybrid model was able to achieve good results by optimizing the utilization of contextual and spatial information encoded in dental X-ray images. This was achieved by integrating Mask R-CNN with multi-head attention. In particular, the attention mechanisms made it possible for the model to concentrate on the most important regions and minimize errors brought on by complicated or occluded regions, which helped it overcome obstacles like the overlapping structures frequently seen in dental images. The end results were higher classification performance, fewer false positives, and more precise segmentation, all of which are crucial for clinical relevance.

Even with these successes, a number of issues remain with the suggested strategy. The model's computational complexity poses issues with processing time and resource demands, especially because attention mechanisms are included. Furthermore, the model's performance depends on a few hyperparameters, which need to be carefully adjusted for the best outcomes. These elements show that the model needs further improvement in subsequent iterations. Future investigations will concentrate on mitigating these constraints by refining the computational efficiency of the model and investigating more compact architectures that retain high precision. Expanding the dataset is another essential next step: a larger and more varied collection of dental X-rays will allow the model to generalize better across various dental conditions and imaging setups.

In summary, incorporating deep learning and attention mechanisms into the Mask R-CNN framework for dental image annotation has proven to be a very successful approach. The model's capacity to precisely identify, categorize, and segment dental structures from X-ray images holds great promise for enhancing diagnostic precision and optimizing dental office procedures. Although issues with computational efficiency and model calibration remain, the outcomes thus far show this strategy's promise. With additional optimization and dataset expansion, this technology has the potential to become a powerful tool for automating dental diagnostics and supporting clinical decision-making.

In a future version, we aim to support:

• Future studies in the field of image segmentation could concentrate on automatic or semi-automatic annotation. Medical image annotation is the process of marking medical imaging data, such as X-ray, CT, MRI, mammography, or ultrasound images. It is used to train DL algorithms for medical image processing and diagnosis, helping clinicians save time, improve patient outcomes, and make more informed decisions.

• The application of our approach to further computer vision applications, including multi-class segmentation or missing-tooth identification.

• The refinement of instance segmentation techniques, which may yield faster and more efficient object detection.

References

Agarwal, S., Terrail, J. O. D., & Jurie, F. (2018). Recent advances in object detection in the age
of deep convolutional neural networks. arXiv preprint arXiv:1809.03193.
Aghayev, A., Murphy, D. J., Keraliya, A. R., & Steigner, M. L. (2016). Recent developments
in the use of computed tomography scanners in coronary artery imaging. Expert review of
medical devices, 13(6), 545–553.
Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., . . . Farhan,
L. (2021). Review of deep learning: concepts, cnn architectures, challenges, applications,
future directions. Journal of big Data, 8, 1–74.
Bilgir, E., Bayrakdar, İ. Ş., Çelik, Ö., Orhan, K., Akkoca, F., Sağlam, H., . . . others (2021). An
artifıcial ıntelligence approach to automatic tooth detection and numbering in panoramic
radiographs. BMC medical imaging, 21, 1–9.
Bruno, F., Granata, V., Cobianchi Bellisari, F., Sgalambro, F., Tommasino, E., Palumbo, P.,
. . . others (2022). Advanced magnetic resonance imaging (mri) techniques: technical
principles and applications in nanomedicine. Cancers, 14(7), 1626.
Burrowes, D. P., Medellin, A., Harris, A. C., Milot, L., Lethebe, B. C., & Wilson, S. R. (2021).
Characterization of focal liver masses: a multicenter comparison of contrast-enhanced ul-
trasound, computed tomography, and magnetic resonance imaging. Journal of Ultrasound
in Medicine, 40(12), 2581–2593.
Cui, Z., Li, C., & Wang, W. (2019). Toothnet: automatic tooth instance segmentation and
identification from cone beam ct images. In Proceedings of the ieee/cvf conference on
computer vision and pattern recognition (pp. 6368–6377).
Elyan, E., Vuttipittayamongkol, P., Johnston, P., Martin, K., McPherson, K., Moreno-Garcı́a,
C. F., . . . Sarker, M. M. K. (2022). Computer vision and machine learning for medi-
cal image analysis: recent advances, challenges, and way forward. Artificial Intelligence
Surgery, 2(1), 24–45.
Erdmann, M., Maedche, A., Schnurr, H.-P., & Staab, S. (2000). From manual to semi-automatic semantic annotation: About ontology-based text annotation tools. In Proceedings of the coling-2000 workshop on semantic annotation and intelligent content (pp. 79–85).
Estai, M., Tennant, M., Gebauer, D., Brostek, A., Vignarajan, J., Mehdizadeh, M., & Saha,
S. (2022). Deep learning for automated detection and numbering of permanent teeth on
panoramic images. Dentomaxillofacial Radiology, 51(2), 20210296.
Farook, T. H., Saad, F. H., Ahmed, S., & Dudley, J. (2023). Clinical annotation and segmentation
tool (cast) implementation for dental diagnostics. Cureus, 15(11).
Gao, S., Li, X., Li, X., Li, Z., & Deng, Y. (2022). Transformer based tooth classification
from cone-beam computed tomography for dental charting. Computers in Biology and
Medicine, 148, 105880.
Gaur, E., Saxena, V., & Singh, S. K. (2018). Video annotation tools: A review. In 2018 in-
ternational conference on advances in computing, communication control and networking
(icacccn) (pp. 911–914).
Guo, J., Wu, Y., Chen, L., Long, S., Chen, D., Ouyang, H., . . . Wang, W. (2022). A perspective on
the diagnosis of cracked tooth: imaging modalities evolve to ai-based analysis. Biomedical
Engineering Online, 21(1), 36.
Hafiz, A. M., & Bhat, G. M. (2020). A survey on instance segmentation: state of the art.
International journal of multimedia information retrieval, 9(3), 171–189.
Hanbury, A. (2008). A survey of methods for image annotation. Journal of Visual Languages &
Computing, 19(5), 617–627.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the ieee
international conference on computer vision (pp. 2961–2969).
Hosntalab, M., Aghaeizadeh Zoroofi, R., Abbaspour Tehrani-Fard, A., & Shirani, G. (2010).
Classification and numbering of teeth in multi-slice ct images using wavelet-fourier de-
scriptor. International journal of computer assisted radiology and surgery, 5, 237–249.
Hsieh, J., & Flohr, T. (2021). Computed tomography recent history and future perspectives.
Journal of Medical Imaging, 8(5), 052109–052109.
Huang, H.-N., Zhang, T., Yang, C.-T., Sheen, Y.-J., Chen, H.-M., Chen, C.-J., & Tseng, M.-W.
(2022). Image segmentation using transfer learning and fast r-cnn for diabetic foot wound
treatments. Frontiers in Public Health, 10, 969846.
Jehangir, B., Radhakrishnan, S., & Agarwal, R. (2023). A survey on named entity recog-
nition—datasets, tools, and methodologies. Natural Language Processing Journal, 3,
100017.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Kılıc, M. C., Bayrakdar, I. S., Çelik, Ö., Bilgir, E., Orhan, K., Aydın, O. B., . . . others (2021).
Artificial intelligence system for automatic deciduous tooth detection and numbering in
panoramic radiographs. Dentomaxillofacial Radiology, 50(6), 20200172.
Kim, C., Kim, D., Jeong, H., Yoon, S.-J., & Youm, S. (2020). Automatic tooth detection and numbering using a combination of a cnn and heuristic algorithm. Applied Sciences, 10(16), 5624.
Kong, H.-J., Yoo, J.-Y., Lee, J.-H., Eom, S.-H., & Kim, J.-H. (2023). Performance evaluation
of deep learning models for the classification and identification of dental implants. The
Journal of Prosthetic Dentistry.
Kumar, A., Bhadauria, H. S., & Singh, A. (2021). Descriptive analysis of dental x-ray images
using various practical methods: A review. PeerJ Computer Science, 7, e620.
Li, X., Wang, Y., & Cai, Y. (2021). Automatic annotation algorithm of medical radiological
images using convolutional neural network. Pattern Recognition Letters, 152, 158–165.
Lu, Y., & Young, S. (2020). A survey of public datasets for computer vision tasks in precision
agriculture. Computers and Electronics in Agriculture, 178, 105760.
Murrugarra-Llerena, J., Kirsten, L. N., & Jung, C. R. (2022). Can we trust bounding box
annotations for object detection? In Proceedings of the ieee/cvf conference on computer
vision and pattern recognition (pp. 4813–4822).
Pal, D., Reddy, P. B., & Roy, S. (2022). Attention uw-net: A fully connected model for automatic
segmentation and annotation of chest x-ray. Computers in Biology and Medicine, 150,
106083.
Rebinth, A., & Kumar, S. M. (2019). Importance of manual image annotation tools and free
datasets for medical research. Journal of Advanced Research in Dynamical and Control
Systems, 10(5), 1880–1885.
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. IEEE transactions on pattern analysis and machine
intelligence, 39(6), 1137–1149.
Santosh, A. B. R., & Jones, T. (2024). Enhancing precision: Proposed revision of fdi’s 2-digit
dental numbering system. international dental journal, 74(2), 359.
Schwendicke, F., & Krois, J. (2022). Data dentistry: how data are changing clinical care and
research. Journal of dental research, 101(1), 21–29.
Shafi, I., Fatima, A., Afzal, H., Dı́ez, I. d. l. T., Lipari, V., Breñosa, J., & Ashraf, I. (2023). A
comprehensive review of recent advances in artificial intelligence for dentistry e-health.
Diagnostics, 13(13), 2196.
Tekin, B. Y., Ozcan, C., Pekince, A., & Yasa, Y. (2022). An enhanced tooth segmentation and
numbering according to fdi notation in bitewing radiographs. Computers in Biology and
Medicine, 146, 105547.
Tuzoff, D. V., Tuzova, L. N., Bornstein, M. M., Krasnov, A. S., Kharchenko, M. A., Nikolenko,
S. I., . . . Bednenko, G. B. (2019). Tooth detection and numbering in panoramic ra-
diographs using convolutional neural networks. Dentomaxillofacial Radiology, 48(4),
20180051.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Virasova, A., Klimov, D. I., Khromov, O., Gubaidullin, I. R., & Oreshko, V. V. (2021). Rich feature hierarchies for accurate object detection and semantic segmentation. Radioengineering. Retrieved from https://api.semanticscholar.org/CorpusID:239533166
Volgenant, C. M., Persoon, I. F., de Ruijter, R. A., & de Soet, J. (2021). Infection control in
dental health care during and after the sars-cov-2 outbreak. Oral Diseases, 27, 674–683.
Warreth, A. (2023). Dental caries and its management. International Journal of Dentistry,
2023(1), 9365845.
Wu, Q., Zheng, Q., He, Y., Chen, Q., & Yang, H. (2023). Emerging nanoagents for medical
x-ray imaging. Analytical Chemistry, 95(1), 33–48.
Xu, M., Wu, Y., Xu, Z., Ding, P., Bai, H., & Deng, X. (2023). Robust automated teeth identifi-
cation from dental radiographs using deep learning. Journal of Dentistry, 136, 104607.
Yang, Z., Dong, R., Xu, H., & Gu, J. (2020). Instance segmentation method based on improved
mask r-cnn for the stacked electronic components. Electronics, 9(6), 886.
Yao, J., Zhang, Z. M., Antani, S., Long, R., & Thoma, G. (2008). Automatic medical image
annotation and retrieval. Neurocomputing, 71(10-12), 2012–2022.
Yazdanian, M., Karami, S., Tahmasebi, E., Alam, M., Abbasi, K., Rahbar, M., . . . Yazdanian, A.
(2022). Dental radiographic/digital radiography technology along with biological agents
in human identification. Scanning, 2022(1), 5265912.
Zhou, S. K., Greenspan, H., Davatzikos, C., Duncan, J. S., Van Ginneken, B., Madabhushi, A.,
. . . Summers, R. M. (2021). A review of deep learning in medical imaging: Imaging traits,
technology trends, case studies with progress highlights, and future promises. Proceedings
of the IEEE, 109(5), 820–838.

Abstract

Abstract: The evolution of medical imaging technologies has transformed dentistry by improv-
ing the efficiency of manual radiograph annotation. This research develops an automated image
annotation system for dental radiographs using Convolutional Neural Networks (CNNs). The
goal is to create a reliable system for automatically identifying and labeling dental abnormali-
ties in panoramic radiographs. Key contributions include an innovative instance segmentation
pipeline, enhancement of the database with real radiological data, and the integration of an atten-
tion mechanism to optimize the model. This study offers a practical tool for dental practitioners
and advances automated annotation methodologies in medical imaging.

Keywords: Deep Learning, Convolutional Neural Networks, Dental radiographs, Automatic annotation, Attention, Real dataset

Résumé: L'évolution des technologies d'imagerie médicale a transformé la dentisterie en rendant l'annotation manuelle des radiographies plus efficace. Cette recherche développe un
système automatisé d’annotation d’images pour radiographies dentaires utilisant des Réseaux
de Neurones Convolutionnels (CNN). L’objectif est de créer un système fiable pour identifier
et labelliser automatiquement les anomalies dentaires sur des radiographies panoramiques. Les
contributions clés incluent un pipeline de segmentation d’instances innovant, l’enrichissement
de la base de données avec des données radiologiques réelles, et l’intégration d’un mécanisme
d’attention pour optimiser le modèle. Cette étude propose un outil pratique pour les praticiens
dentaires et fait avancer les méthodologies d’annotation automatique en imagerie médicale.

Mots-clés : Apprentissage profond, Réseaux de Neurones Convolutionnels, Radiographies dentaires, Annotation automatique, Mécanisme d'attention, Ensemble de données réel