
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 44, NO. 7, JULY 2022

Image Segmentation Using Deep Learning: A Survey

Shervin Minaee, Member, IEEE, Yuri Boykov, Member, IEEE, Fatih Porikli, Fellow, IEEE,
Antonio Plaza, Fellow, IEEE, Nasser Kehtarnavaz, Fellow, IEEE,
and Demetri Terzopoulos, Fellow, IEEE

Abstract—Image segmentation is a key task in computer vision and image processing with important applications such as scene
understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among
others, and numerous segmentation algorithms are found in the literature. Against this backdrop, the broad success of deep learning
(DL) has prompted the development of new image segmentation approaches leveraging DL models. We provide a comprehensive
review of this recent literature, covering the spectrum of pioneering efforts in semantic and instance segmentation, including
convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks,
visual attention models, and generative models in adversarial settings. We investigate the relationships, strengths, and challenges of
these DL-based segmentation models, examine the widely used datasets, compare performances, and discuss promising research
directions.

Index Terms—Image segmentation, deep learning, convolutional neural networks, encoder-decoder models, recurrent models, generative
models, semantic segmentation, instance segmentation, panoptic segmentation, medical image segmentation

Shervin Minaee is with the Snapchat Machine Learning Research, Venice, CA 90405 USA. E-mail: shervin.minaee@nyu.edu.
Yuri Boykov is with the University of Waterloo, Waterloo, ON N2L 3G1, Canada. E-mail: yboykov@uwaterloo.ca.
Fatih Porikli is with the Australian National University, Canberra, ACT 0200, Australia, and also with Huawei, San Diego, CA 92121 USA. E-mail: fatih.porikli@anu.edu.au.
Antonio Plaza is with the University of Extremadura, 06006 Badajoz, Spain. E-mail: aplaza@unex.es.
Nasser Kehtarnavaz is with the University of Texas at Dallas, Richardson, TX 75080 USA. E-mail: kehtar@utdallas.edu.
Demetri Terzopoulos is with the University of California, Los Angeles, Los Angeles, CA 90095 USA. E-mail: dt@cs.ucla.edu.

Manuscript received 18 Jan. 2020; revised 28 Jan. 2021; accepted 7 Feb. 2021. Date of publication 17 Feb. 2021; date of current version 3 June 2022. (Corresponding author: Shervin Minaee.) Recommended for acceptance by L. Wang. Digital Object Identifier no. 10.1109/TPAMI.2021.3059968

1 INTRODUCTION

Image segmentation has been a fundamental problem in computer vision since the early days of the field [1] (Chapter 8). An essential component of many visual understanding systems, it involves partitioning images (or video frames) into multiple segments and objects [2] (Chapter 5) and plays a central role in a broad range of applications [3] (Part VI), including medical image analysis (e.g., tumor boundary extraction and measurement of tissue volumes), autonomous vehicles (e.g., navigable surface and pedestrian detection), video surveillance, and augmented reality, to name a few.

Image segmentation can be formulated as the problem of classifying pixels with semantic labels (semantic segmentation), or partitioning of individual objects (instance segmentation), or both (panoptic segmentation). Semantic segmentation performs pixel-level labeling with a set of object categories (e.g., human, car, tree, sky) for all image pixels; thus, it is generally a more demanding undertaking than whole-image classification, which predicts a single label for the entire image. Instance segmentation extends the scope of semantic segmentation by detecting and delineating each object of interest in the image (e.g., individual people).

Numerous image segmentation algorithms have been developed in the literature, from the earliest methods, such as thresholding [4], histogram-based bundling, region-growing [5], k-means clustering [6], and watershed methods [7], to more advanced algorithms such as active contours [8], graph cuts [9], conditional and Markov random fields [10], and sparsity-based [11], [12] methods. In recent years, however, deep learning (DL) models have yielded a new generation of image segmentation models with remarkable performance improvements, often achieving the highest accuracy rates on popular benchmarks (e.g., Fig. 1). This has caused a paradigm shift in the field.

This survey, a revised version of [14], covers the recent literature in deep-learning-based image segmentation, including more than 100 such segmentation methods proposed to date. It provides a comprehensive review with insights into different aspects of these methods, including the training data, the choice of network architectures, loss functions, training strategies, and their key contributions. The target literature is organized into the following categories:

1) Fully convolutional networks
2) Convolutional models with graphical models
3) Encoder-decoder based models
4) Multiscale and pyramid network based models
5) R-CNN based models (for instance segmentation)
6) Dilated convolutional models and DeepLab family
7) Recurrent neural network based models

Fig. 1. Segmentation results of DeepLabV3 [13] on sample images.

8) Attention-based models
9) Generative models and adversarial training
10) Convolutional models with active contour models
11) Other models

Within this taxonomy,

- we provide a comprehensive review and analysis of deep-learning-based image segmentation algorithms;
- we overview popular image segmentation datasets, grouped into 2D and 2.5D (RGB-D) images;
- we summarize the performances of the reviewed segmentation methods on popular benchmarks;
- we discuss several challenges and future research directions for deep-learning-based image segmentation.

The remainder of this survey is organized as follows: Section 2 overviews popular Deep Neural Network (DNN) architectures that serve as the backbones of many modern segmentation algorithms. Section 3 reviews the most significant state-of-the-art deep learning based segmentation models. Section 4 overviews some of the most popular image segmentation datasets and their characteristics. Section 5 lists popular metrics for evaluating deep-learning-based segmentation models and tabulates model performances. Section 6 discusses the main challenges and opportunities of deep learning-based segmentation methods. Section 7 presents our conclusions.

2 DEEP NEURAL NETWORK ARCHITECTURES

This section provides an overview of prominent DNN architectures used by the computer vision community, including convolutional neural networks, recurrent neural networks and long short-term memory, encoder-decoder and autoencoder models, and generative adversarial networks. Due to space limitations, several other DNN architectures that have been proposed, among them transformers, capsule networks, gated recurrent units, and spatial transformer networks, will not be covered.

2.1 Convolutional Neural Networks (CNNs)

Fig. 2. Architecture of CNNs. From [15].

CNNs are among the most successful and widely used architectures in the deep learning community, especially for computer vision tasks. CNNs were initially proposed by Fukushima [16] in his seminal paper on the "Neocognitron", which was based on Hubel and Wiesel's hierarchical receptive field model of the visual cortex. Subsequently, Waibel et al. [17] introduced CNNs with weights shared among temporal receptive fields and backpropagation training for phoneme recognition, and LeCun et al. [15] developed a practical CNN architecture for document recognition (Fig. 2). CNNs usually include three types of layers: i) convolutional layers, where a kernel (or filter) of weights is convolved to extract features; ii) nonlinear layers, which apply (usually element-wise) an activation function to feature maps, thus enabling the network to model nonlinear functions; and iii) pooling layers, which reduce spatial resolution by replacing small neighborhoods in a feature map with some statistical information about those neighborhoods (mean, max, etc.). The neuronal units in layers are locally connected; that is, each unit receives weighted inputs from a small neighborhood, known as the receptive field, of units in the previous layer. By stacking layers to form multi-resolution pyramids, the higher-level layers learn features from increasingly wider receptive fields. The main computational advantage of CNNs is that all the receptive fields in a layer share weights, resulting in a significantly smaller number of parameters than fully-connected neural networks. Some of the most well known CNN architectures include AlexNet [18], VGGNet [19], and ResNet [20].
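As a minimal sketch of the three layer types just described, the following PyTorch fragment stacks a convolutional layer, an element-wise nonlinearity, and a max-pooling layer; the channel and image sizes are arbitrary and not taken from any specific model in this survey.

```python
import torch
import torch.nn as nn

# A minimal CNN block illustrating the three layer types:
# convolution (learned filters), nonlinearity (ReLU), and pooling.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                                            # element-wise nonlinearity
    nn.MaxPool2d(kernel_size=2, stride=2),                                # pooling halves spatial resolution
)

x = torch.randn(1, 3, 64, 64)   # a dummy 64x64 RGB image
y = block(x)
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```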
2.2 Recurrent Neural Networks (RNNs) and the LSTM

Fig. 3. Architecture of a simple RNN. Courtesy of Christopher Olah [21].

Fig. 4. Architecture of a standard LSTM module. Courtesy of Olah [21].

RNNs [22] are commonly used to process sequential data, such as speech, text, videos, and time-series. Referring to Fig. 3, at each time step t the model collects the input x_t and the hidden state h_{t-1} from the previous step, and outputs a target value o_t and the next hidden state h_{t+1}. RNNs are typically problematic for long sequences as they cannot capture long-term dependencies in many real-world applications and often suffer from gradient vanishing or exploding problems. However, a type of RNN known as the Long Short-Term Memory (LSTM) [23] is designed to avoid these issues. The LSTM architecture (Fig. 4) includes three gates (input gate, output gate, and forget gate) that regulate the flow of information into and out of a memory cell that stores values over arbitrary time intervals.
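To make the recurrence concrete, the following NumPy sketch implements a single step of a vanilla RNN cell, h_t = tanh(x_t W_x + h_{t-1} W_h + b); the weight shapes are illustrative only and are not tied to any model discussed here.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: combine the current input with the
    previous hidden state and squash through a tanh nonlinearity."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))   # a toy sequence of 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)        # the hidden state carries context forward
print(h.shape)                                # (16,)
```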


2.3 Encoder-Decoder and Auto-Encoder Models

Fig. 5. Architecture of a simple encoder-decoder model.

Encoder-decoders [24], [25] are a family of models that learn to map data-points from an input domain to an output domain via a two-stage network (Fig. 5): The encoder, performing an encoding function z = g(x), compresses the input x into a latent-space representation z, while the decoder y = f(z) predicts the output y from z. The latent, or feature (vector), representation captures the semantic information of the input useful in predicting the output. Such models are popular for sequence-to-sequence modeling in Natural Language Processing (NLP) applications as well as in image-to-image translation, where the output could be an enhanced version of the image (such as in image de-blurring, or super-resolution) or a segmentation map. Auto-encoders are a special case of encoder-decoder models in which the input and output are the same.
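The two-stage structure can be sketched in a few lines of PyTorch. This toy convolutional auto-encoder (where input and output share the same domain) is purely illustrative and does not correspond to any particular model reviewed here.

```python
import torch
import torch.nn as nn

class TinyAutoEncoder(nn.Module):
    """Toy encoder-decoder: the encoder computes z = g(x) at reduced
    resolution, and the decoder computes y = f(z) back at full resolution."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # downsample 2x
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # downsample 4x total
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # upsample 2x
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2),     # back to input resolution
        )

    def forward(self, x):
        z = self.encoder(x)        # latent representation
        return self.decoder(z)

x = torch.randn(1, 3, 64, 64)
y = TinyAutoEncoder()(x)
print(y.shape)                     # torch.Size([1, 3, 64, 64])
```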
2.4 Generative Adversarial Networks (GANs)

Fig. 6. Architecture of a GAN. Courtesy of Ian Goodfellow.

GANs [26] are a newer family of deep learning models. They consist of two networks—a generator and a discriminator (Fig. 6). In the conventional GAN, the generator network G learns a mapping from noise z (with a prior distribution) to a target distribution y, which is similar to the "real" samples. The discriminator network D attempts to distinguish the generated "fake" samples from the real ones. The GAN may be characterized as a minimax game between G and D, where D tries to minimize its classification error in distinguishing fake samples from real ones, hence maximizing a loss function, and G tries to maximize the discriminator network's error, hence minimizing the loss function. GAN variants include Convolutional-GANs [27], conditional-GANs [28], and Wasserstein-GANs [29].
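The minimax objective can be written out as two binary cross-entropy terms that the discriminator and generator optimize in opposite directions. The sketch below is a generic formulation and is not tied to any particular GAN variant cited above.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    """D wants real samples scored as 1 and generated (fake) samples as 0."""
    real_loss = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real_loss + fake_loss

def generator_loss(d_fake_logits):
    """G wants the discriminator to score its samples as real (label 1)."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

# Toy usage with random logits standing in for network outputs.
d_real = torch.randn(8, 1)
d_fake = torch.randn(8, 1)
print(discriminator_loss(d_real, d_fake).item(), generator_loss(d_fake).item())
```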
3 DL-BASED IMAGE SEGMENTATION MODELS

This section is a survey of numerous learning-based segmentation methods, grouped into the categories listed above based on their model architectures. Several architectural features are common among many of these methods, such as encoders and decoders, skip-connections, multiscale architectures, and more recently the use of dilated convolutions. It is convenient to group models based on their architectural contributions over prior models.

3.1 Fully Convolutional Models

Fig. 7. The FCN learns to make pixel-accurate predictions. From [30].

Fig. 8. Skip connections combine coarse and fine information. From [30].

Long et al. [30] proposed Fully Convolutional Networks (FCNs), a milestone in DL-based semantic image segmentation models. An FCN (Fig. 7) includes only convolutional layers, which enables it to output a segmentation map whose size is the same as that of the input image. To handle arbitrarily-sized images, the authors modified existing CNN architectures, such as VGG16 and GoogLeNet, by removing all fully-connected layers such that the model outputs a spatial segmentation map instead of classification scores.

Through the use of skip connections (Fig. 8), in which feature maps from the final layers of the model are up-sampled and fused with feature maps of earlier layers, the model combines semantic information (from deep, coarse layers) and appearance information (from shallow, fine layers) in order to produce accurate and detailed segmentations. Tested on PASCAL VOC, NYUDv2, and SIFT Flow, the model achieved state-of-the-art segmentation performance.
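A minimal sketch of this upsample-and-fuse idea is shown below: a coarse, semantically strong score map is bilinearly upsampled and summed with a finer score map computed from an earlier layer, loosely in the spirit of FCN skip connections. The layer names, strides, and channel sizes here are hypothetical, not the FCN paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21                                           # e.g., PASCAL VOC classes + background
score_fine = nn.Conv2d(256, num_classes, kernel_size=1)    # 1x1 scoring of an earlier (finer) feature map
score_coarse = nn.Conv2d(512, num_classes, kernel_size=1)  # 1x1 scoring of a deeper (coarser) feature map

feat_fine = torch.randn(1, 256, 64, 64)    # stride-8 features (hypothetical)
feat_coarse = torch.randn(1, 512, 32, 32)  # stride-16 features (hypothetical)

coarse = score_coarse(feat_coarse)
coarse_up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
fused = coarse_up + score_fine(feat_fine)  # skip connection: fuse coarse semantics with fine detail

# Upsample the fused scores to the input resolution for pixel-wise prediction.
logits = F.interpolate(fused, scale_factor=8, mode="bilinear", align_corners=False)
print(logits.shape)                        # torch.Size([1, 21, 512, 512])
```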


FCNs have been applied to a variety of segmentation problems, such as brain tumor segmentation [31], instance-aware semantic segmentation [32], skin lesion segmentation [33], and iris segmentation [34]. While demonstrating that DNNs can be trained to perform semantic segmentation in an end-to-end manner on variable-sized images, the conventional FCN model has some limitations—it is too computationally expensive for real-time inference, it does not account for global context information in an efficient manner, and it is not easily generalizable to 3D images. Several researchers have attempted to overcome some of the limitations of the FCN. For example, Liu et al. [35] proposed ParseNet (Fig. 9), which adds global context to FCNs by using the average feature for a layer to augment the features at each location. The feature map for a layer is pooled over the whole image, resulting in a context vector. The context vector is normalized and unpooled to produce new feature maps of the same size as the initial ones, which are then concatenated, which amounts to an FCN whose convolutional layers are replaced by the described module (Fig. 9e).

Fig. 9. The ParseNet (e) uses extra global context to produce a segmentation (d) smoother than that of an FCN (c). From [35].

3.2 CNNs With Graphical Models

Fig. 10. A CNN+CRF model. From [36].

As discussed, the FCN ignores potentially useful scene-level semantic context. To exploit more context, several approaches incorporate into DL architectures probabilistic graphical models, such as Conditional Random Fields (CRFs) and Markov Random Fields (MRFs).

Due to the invariance properties that make CNNs good for high level tasks such as classification, responses from the later layers of deep CNNs are not sufficiently well localized for accurate object segmentation. To address this drawback, Chen et al. [36] proposed a semantic segmentation algorithm that combines CNNs and fully-connected CRFs (Fig. 10). They showed that their model can localize segment boundaries with higher accuracy than was possible with previous methods.

Schwing and Urtasun [37] proposed a fully-connected deep structured network for image segmentation. They jointly trained CNNs and fully-connected CRFs for semantic image segmentation, and achieved encouraging results on the challenging PASCAL VOC 2012 dataset. Zheng et al. [38] proposed a similar semantic segmentation approach. In related work, Lin et al. [39] proposed an efficient semantic segmentation model based on contextual deep CRFs. They explored "patch-patch" context (between image regions) and "patch-background" context to improve semantic segmentation through the use of contextual information.

Liu et al. [40] proposed a semantic segmentation algorithm that incorporates rich information into MRFs, including high-order relations and mixture of label contexts. Unlike previous efforts that optimized MRFs using iterative algorithms, they proposed a CNN model, namely a Parsing Network, which enables deterministic end-to-end computation in one pass.

3.3 Encoder-Decoder Based Models

Most of the popular DL-based segmentation models use some kind of encoder-decoder architecture. We group these models into two categories: those for general image segmentation, and those for medical image segmentation.

3.3.1 General Image Segmentation

Noh et al. [41] introduced semantic segmentation based on deconvolution (a.k.a. transposed convolution). Their model, DeConvNet (Fig. 11), consists of two parts, an encoder using convolutional layers adopted from the VGG 16-layer network and a multilayer deconvolutional network that inputs the feature vector and generates a map of pixel-accurate class probabilities. The latter comprises deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks.

Fig. 11. Deconvolutional semantic segmentation. From [41].

Badrinarayanan et al. [25] proposed SegNet, a fully convolutional encoder-decoder architecture for image segmentation (Fig. 12). Similar to the deconvolution network, the core trainable segmentation engine of SegNet consists of an encoder network, which is topologically identical to the 13 convolutional layers of the VGG16 network, and a corresponding decoder network followed by a pixel-wise classification layer. The main novelty of SegNet is in the way the decoder upsamples its lower-resolution input feature map(s); specifically, using pooling indices computed in the max-pooling step of the corresponding encoder to perform nonlinear up-sampling.

Fig. 12. The SegNet model. From [25].
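The index-based unpooling that SegNet's decoder relies on can be illustrated directly with PyTorch's paired max-pool/max-unpool operators; this fragment only demonstrates the mechanism and is not SegNet itself.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 8, 32, 32)           # an encoder feature map
pooled, indices = pool(x)                # encoder: downsample and remember argmax locations
restored = unpool(pooled, indices)       # decoder: place values back at the remembered locations
print(pooled.shape, restored.shape)      # torch.Size([1, 8, 16, 16]) torch.Size([1, 8, 32, 32])
```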
Fig. 13. The HRNet architecture. From [42].


A limitation of encoder-decoder based models is the loss of fine-grained image information, due to the loss of resolution through the encoding process. HRNet [42] (Fig. 13) addresses this shortcoming. Other than recovering high-resolution representations as is done in DeConvNet, SegNet, and other models, HRNet maintains high-resolution representations through the encoding process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging the information across resolutions. There are four stages: the 1st stage consists of high-resolution convolutions, while the 2nd/3rd/4th stage repeats 2-resolution/3-resolution/4-resolution blocks. Several recent semantic segmentation models use HRNet as a backbone.

Several other works adopt transposed convolutions, or encoder-decoders for image segmentation, such as Stacked Deconvolutional Network (SDN) [43], Linknet [44], W-Net [45], and locality-sensitive deconvolution networks for RGB-D segmentation [46].

3.3.2 Medical and Biomedical Image Segmentation

Several models inspired by FCNs and encoder-decoder networks were initially developed for medical/biomedical image segmentation, but are now also being used outside the medical domain.

Fig. 14. The U-Net model. From [47].

Ronneberger et al. [47] proposed the U-Net (Fig. 14) for efficiently segmenting biological microscopy images. The U-Net architecture comprises two parts, a contracting path to capture context, and a symmetric expanding path that enables precise localization. The U-Net training strategy relies on the use of data augmentation to learn effectively from very few annotated images. It was trained on 30 transmitted light microscopy images, and it won the ISBI cell tracking challenge 2015 by a large margin.

Various extensions of U-Net have been developed for different kinds of images and problem domains; for example, Zhou et al. [48] developed a nested U-Net architecture, Zhang et al. [49] developed a road segmentation algorithm based on U-Net, and Cicek et al. [50] proposed a U-Net architecture for 3D images.

Fig. 15. The V-Net model for 3D image segmentation. From [51].

V-Net (Fig. 15), proposed by Milletari et al. [51] for 3D medical image segmentation, is another well known FCN-based model. The authors introduced a new loss function based on the Dice coefficient, enabling the model to deal with situations in which there is a strong imbalance between the number of voxels in the foreground and background. The network was trained end-to-end on MRI images of the prostate and learns to predict segmentation for the whole volume at once. Other relevant works on medical image segmentation include the Progressive Dense V-Net for automatic segmentation of pulmonary lobes from chest CT images, and the 3D-CNN encoder for lesion segmentation [52].
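A soft (differentiable) version of the Dice coefficient is commonly used as a loss for such imbalanced foreground/background problems; the sketch below is a generic binary formulation, not the exact V-Net implementation.

```python
import torch

def soft_dice_loss(pred_probs, target, eps=1e-6):
    """Binary soft Dice loss: 1 - 2|A∩B| / (|A|+|B|), computed on
    predicted foreground probabilities and a {0,1} ground-truth mask."""
    pred = pred_probs.reshape(pred_probs.shape[0], -1)
    gt = target.reshape(target.shape[0], -1).float()
    intersection = (pred * gt).sum(dim=1)
    denom = pred.sum(dim=1) + gt.sum(dim=1)
    dice = (2 * intersection + eps) / (denom + eps)
    return 1 - dice.mean()

# Toy usage: random predictions against a random binary mask.
probs = torch.rand(2, 1, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5)
print(soft_dice_loss(probs, mask).item())
```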
3.4 Multiscale and Pyramid Network Based Models

Multiscale analysis, a well established idea in image processing, has been deployed in various neural network architectures. One of the most prominent models of this sort is the Feature Pyramid Network (FPN) proposed by Lin et al. [53], which was developed for object detection but was also applied to segmentation. The inherent multiscale, pyramidal hierarchy of deep CNNs was used to construct feature pyramids with marginal extra cost. To merge low and high resolution features, the FPN is composed of a bottom-up pathway, a top-down pathway and lateral connections. The concatenated feature maps are then processed by a 3 × 3 convolution to produce the output of each stage. Finally, each stage of the top-down pathway generates a prediction to detect an object. For image segmentation, the authors use two multilayer perceptrons (MLPs) to generate the masks.
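One top-down step of such a pyramid can be sketched as follows: the coarser pyramid level is upsampled and merged with a laterally connected, 1 × 1-projected bottom-up feature map, then smoothed with a 3 × 3 convolution. The channel sizes here are illustrative, not the FPN paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)             # 1x1 lateral projection of a bottom-up map
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # 3x3 convolution after merging

c4 = torch.randn(1, 512, 32, 32)   # bottom-up feature map at stride 16 (hypothetical)
p5 = torch.randn(1, 256, 16, 16)   # previous (coarser) top-down pyramid level

p4 = smooth(lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest"))
print(p4.shape)                     # torch.Size([1, 256, 32, 32])
```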
Zhao et al. [54] developed the Pyramid Scene Parsing Network (PSPN), a multiscale network to better learn the global context representation of a scene (Fig. 16). Multiple patterns are extracted from the input image using a residual network (ResNet) as a feature extractor, with a dilated network. These feature maps are then fed into a pyramid pooling module to distinguish patterns of different scales. They are pooled at four different scales, each one corresponding to a pyramid level, and processed by a 1 × 1 convolutional layer to reduce their dimensions. The outputs of the pyramid levels are up-sampled and concatenated with the initial feature maps to capture both local and global context information. Finally, a convolutional layer is used to generate the pixel-wise predictions.

Fig. 16. The PSPN architecture. From [54].


Ghiasi and Fowlkes [55] developed a multiresolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. They showed that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization information.

Other models use multiscale analysis for segmentation, among them Dynamic Multiscale Filters Network (DM-Net) [56], Context Contrasted Network and gated multiscale aggregation (CCN) [57], Adaptive Pyramid Context Network (APC-Net) [58], MultiScale Context Intertwining (MSCI) [59], and salient object segmentation [60].

3.5 R-CNN Based Models

Fig. 17. Faster R-CNN architecture. Each image is processed by convolutional layers to extract its features; a sliding window is used in the RPN at each location over the feature map; at each location, k (k = 9) anchor boxes (3 scales of 128, 256, and 512, and 3 aspect ratios of 1:1, 1:2, 2:1) are used to generate region proposals; a cls layer outputs 2k scores indicating whether or not there is an object for the k boxes; a reg layer outputs 4k values for the coordinates (box center coordinates, width, and height) of the k boxes. From [61].

The Regional CNN (R-CNN) and its extensions have proven successful in object detection applications. In particular, the Faster R-CNN [61] architecture (Fig. 17) uses a region proposal network (RPN) that proposes bounding box candidates. The RPN extracts a Region of Interest (RoI), and an RoIPool layer computes features from these proposals to infer the bounding box coordinates and class of the object. Some extensions of R-CNN have been used to address the instance segmentation problem; i.e., the task of simultaneously performing object detection and semantic segmentation.

Fig. 18. Mask R-CNN architecture. From [62].

Fig. 19. Mask R-CNN instance segmentation results. From [62].

He et al. [62] proposed Mask R-CNN (Fig. 18), which outperformed previous benchmarks on many COCO object instance segmentation challenges (Fig. 19), efficiently detecting objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Essentially, it is a Faster R-CNN with 3 output branches—the first computes the bounding box coordinates, the second computes the associated classes, and the third computes the binary mask to segment the object. The Mask R-CNN loss function combines the losses of the bounding box coordinates, the predicted class, and the segmentation mask, and trains all of them jointly.
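The joint objective can be written as a simple sum of the three branch losses. The sketch below is schematic (standard classification, box-regression, and per-pixel mask terms) rather than the exact Mask R-CNN implementation; all tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_targets, box_preds, box_targets, mask_logits, mask_targets):
    """Schematic Mask R-CNN-style objective: L = L_cls + L_box + L_mask."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)                        # classification branch
    l_box = F.smooth_l1_loss(box_preds, box_targets)                        # box regression branch
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # per-pixel mask branch
    return l_cls + l_box + l_mask

# Toy usage with random tensors standing in for the three branch outputs (8 RoIs, 5 classes, 28x28 masks).
loss = multitask_loss(
    torch.randn(8, 5), torch.randint(0, 5, (8,)),
    torch.randn(8, 4), torch.randn(8, 4),
    torch.randn(8, 28, 28), torch.rand(8, 28, 28),
)
print(loss.item())
```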
Fig. 20. The Path Aggregation Network. (a) FPN backbone. (b) Bottom-up path augmentation. (c) Adaptive feature pooling. (d) Box branch. (e) Fully-connected fusion. From [63].

The Path Aggregation Network (PANet) proposed by Liu et al. [63] is based on the Mask R-CNN and FPN models (Fig. 20). The feature extractor of the network uses an FPN backbone with a new augmented bottom-up pathway improving the propagation of lower-layer features. Each stage of this third pathway takes as input the feature maps of the previous stage and processes them with a 3 × 3 convolutional layer. A lateral connection adds the output to the same-stage feature maps of the top-down pathway and these feed the next stage.

Dai et al. [64] developed a multitask network for instance-aware semantic segmentation that consists of three networks for differentiating instances, estimating masks, and categorizing objects. These networks form a cascaded structure and are designed to share their convolutional features. Hu et al. [65] proposed a new partially-supervised training paradigm together with a novel weight transfer function, which enables training instance segmentation models on a large set of categories, all of which have box annotations, but only a small fraction of which have mask annotations.

Fig. 21. The MaskLab model. From [66].

Chen et al. [66] developed an instance segmentation model, MaskLab, by refining object detection with semantic and direction features based on Faster R-CNN. This model produces three outputs (Fig. 21), box detection, semantic segmentation logits for pixel-wise classification, and direction prediction logits for predicting each pixel's direction toward its instance center.


Building on the Faster R-CNN object detector, the predicted boxes provide accurate localization of object instances. Within each region of interest, MaskLab performs foreground/background segmentation by combining semantic and direction prediction.

Tensormask, proposed by Chen et al. [67], is based on dense sliding window instance segmentation. The authors treat dense instance segmentation as a prediction task over 4D tensors and present a general framework that enables novel operators on 4D tensors. They demonstrate that the tensor approach yields large gains over baselines, with results comparable to Mask R-CNN.

Other instance segmentation models have been developed based on R-CNN, such as those developed for mask proposals, including R-FCN [68], DeepMask [69], PolarMask [70], boundary-aware instance segmentation [71], and CenterMask [72]. Another promising approach is to tackle the instance segmentation problem by learning grouping cues for bottom-up segmentation, such as deep watershed transform [73], real-time instance segmentation [74], and semantic instance segmentation via deep metric learning [75].

3.6 Dilated Convolutional Models

Fig. 22. Dilated convolution. A 3 × 3 kernel at different dilation rates.

Dilated (a.k.a. "atrous") convolution introduces to convolutional layers another parameter, the dilation rate. For example, a 3 × 3 kernel (Fig. 22) with a dilation rate of 2 will have the same size receptive field as a 5 × 5 kernel while using only 9 parameters, thus enlarging the receptive field with no increase in computational cost.
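This can be checked directly: a dilated 3 × 3 convolution keeps 9 weights while its effective kernel footprint grows to 5 × 5, as the PyTorch fragment below illustrates (layer sizes are arbitrary).

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation 2 covers a 5x5 footprint: effective size = k + (k-1)*(d-1) = 5.
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
dense5 = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)

print(sum(p.numel() for p in dilated.parameters()))  # 9 parameters
print(sum(p.numel() for p in dense5.parameters()))   # 25 parameters

x = torch.randn(1, 1, 64, 64)
print(dilated(x).shape, dense5(x).shape)             # both preserve the 64x64 spatial size
```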
Dilated convolutions have been popular in the field of real-time segmentation, and many recent publications report the use of this technique. Some of the most important include the DeepLab family [76], multiscale context aggregation [77], Dense Upsampling Convolution and Hybrid Dilated Convolution (DUC-HDC) [78], densely connected Atrous Spatial Pyramid Pooling (DenseASPP) [79], and the Efficient Network (ENet) [80].

Fig. 23. The DeepLab model. From [76].

DeepLabv1 [36] and DeepLabv2 [76], developed by Chen et al., are among the most popular image segmentation models. The latter has three key features (Fig. 23). First is the use of dilated convolution to address the decreasing resolution in the network caused by max-pooling and striding. Second is Atrous Spatial Pyramid Pooling (ASPP), which probes an incoming convolutional feature layer with filters at multiple sampling rates, thus capturing objects as well as multiscale image context to robustly segment objects at multiple scales. Third is improved localization of object boundaries by combining methods from deep CNNs, such as fully convolutional VGG-16 or ResNet 101, and probabilistic graphical models, specifically fully-connected CRFs.

Fig. 24. The DeepLab-v3+ model. From [81].

Subsequently, Chen et al. [13] proposed DeepLabv3, which combines cascaded and parallel modules of dilated convolutions. The parallel convolution modules are grouped in the ASPP. A 1 × 1 convolution and batch normalization are added in the ASPP. All the outputs are concatenated and processed by another 1 × 1 convolution to create the final output with logits for each pixel. Next, Chen et al. [81] released DeepLabv3+ (Fig. 24), which uses an encoder-decoder architecture including dilated separable convolution composed of a depthwise convolution (spatial convolution for each channel of the input) and pointwise convolution (1 × 1 convolution with the depthwise convolution as input). They used the DeepLabv3 framework as the encoder. The most relevant model has a modified Xception backbone with more layers, dilated depthwise separable convolutions instead of max pooling, and batch normalization.

3.7 RNN Based Models

While CNNs are a natural fit for computer vision problems, they are not the only possibility. RNNs are useful in modeling the short/long term dependencies among pixels to (potentially) improve the estimation of the segmentation map. Using RNNs, pixels may be linked together and processed sequentially to model global contexts and improve semantic segmentation. However, the natural 2D structure of images poses a challenge.

Fig. 25. The ReSeg model (without the pre-trained VGG-16 feature extractor). From [82].

Visin et al. [82] proposed an RNN-based model for semantic segmentation called ReSeg (Fig. 25). This model is mainly based on ReNet [83], which was developed for image classification. Each ReNet layer is composed of four RNNs that sweep the image horizontally and vertically in both directions, encoding patches/activations, and providing relevant global information.


To perform image segmentation with the ReSeg model, ReNet layers are stacked atop pre-trained VGG-16 convolutional layers, which extract generic local features, and are then followed by up-sampling layers to recover the original image resolution in the final predictions.

Byeon et al. [84] performed per-pixel segmentation and classification of images of natural scenes using 2D LSTM networks, which learn textures and the complex spatial dependencies of labels in a single model that carries out classification, segmentation, and context integration.

Fig. 26. The graph-LSTM model for semantic segmentation. From [85].

Fig. 27. Comparison of conventional RNN models and the graph-LSTM. From [85].

Liang et al. [85] proposed a semantic segmentation model based on a graph-LSTM network (Fig. 26) in which convolutional layers are augmented by graph-LSTM layers built on super-pixel maps, which provide a more global structural context. These layers generalize the LSTM for uniform, array-structured data (i.e., row, grid, or diagonal LSTMs) to nonuniform, graph-structured data, where arbitrary-shaped superpixels are semantically consistent nodes and the adjacency relations between superpixels correspond to edges, thus forming an undirected graph (Fig. 27).

Xiang and Fox [86] proposed Data Associated Recurrent Neural Networks (DA-RNNs) for joint 3D scene mapping and semantic labeling. DA-RNNs use a new recurrent neural network architecture for semantic labeling on RGB-D videos. The output of the network is integrated with mapping techniques such as Kinect-Fusion in order to inject semantic information into the reconstructed 3D scene.

Fig. 28. The CNN+LSTM architecture for semantic segmentation from natural language expressions. From [87].

Fig. 29. CNN+LSTM segmentation masks generated for the query "people in blue coat". From [87].

Hu et al. [87] developed a semantic segmentation algorithm that combines a CNN to encode the image and an LSTM to encode its linguistic description. To produce pixel-wise image segmentations from language inputs, they propose an end-to-end trainable recurrent and convolutional model that jointly learns to process visual and linguistic information (Fig. 28). This differs from traditional semantic segmentation over a predefined set of semantic classes; i.e., the phrase "two men sitting on the right bench" requires segmenting only the two people on the right bench and no others sitting on another bench or standing. Fig. 29 shows an example segmentation result by the model.

A drawback of RNN-based models is that they will generally be slower than their CNN counterparts as their sequential nature is not amenable to parallelization.

3.8 Attention-Based Models

Attention mechanisms have been persistently explored in computer vision over the years, and it is not surprising to find publications that apply them to semantic segmentation.

Fig. 30. Attention-based semantic segmentation model. From [88].

Chen et al. [88] proposed an attention mechanism that learns to softly weight multiscale features at each pixel location. They adapt a powerful semantic segmentation model and jointly train it with multiscale images and the attention model. In Fig. 30, the model assigns large weights to the person (green dashed circle) in the background for features from scale 1.0 as well as on the large child (magenta dashed circle) for features from scale 0.5. The attention mechanism enables the model to assess the importance of features at different positions and scales, and it outperforms average and max pooling.
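The core operation — per-pixel softmax weights over score maps computed at several scales — can be sketched generically as follows (two scales, arbitrary channel sizes; this is not the exact model of [88]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, num_scales = 21, 2
attention_head = nn.Conv2d(256, num_scales, kernel_size=1)  # predicts one weight map per scale

feats = torch.randn(1, 256, 64, 64)                 # shared features used to predict attention
scores_s10 = torch.randn(1, num_classes, 64, 64)    # class scores from the scale-1.0 branch
scores_s05 = torch.randn(1, num_classes, 64, 64)    # class scores from the scale-0.5 branch (upsampled)

weights = F.softmax(attention_head(feats), dim=1)   # per-pixel weights over the two scales, summing to 1
fused = weights[:, 0:1] * scores_s10 + weights[:, 1:2] * scores_s05
print(fused.shape)                                   # torch.Size([1, 21, 64, 64])
```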
Fig. 31. The RAN architecture. From [89].

Unlike approaches in which convolutional classifiers are trained to learn the representative semantic features of labeled objects, Huang et al. [89] proposed a Reverse Attention Network (RAN) architecture (Fig. 31) for semantic segmentation that also applies reverse attention mechanisms, thereby training the model to capture the opposite concept—features that are not associated with a target class. The RAN network performs the direct and reverse-attention learning processes simultaneously.


Li et al. [90] developed a Pyramid Attention Network for semantic segmentation, which exploits global contextual information for semantic segmentation. Eschewing complicated dilated convolutions and decoder networks, they combined attention mechanisms and spatial pyramids to extract precise dense features for pixel labeling. Fu et al. [91] proposed a dual attention network for scene segmentation that can capture rich contextual dependencies based on the self-attention mechanism. Specifically, they append two types of attention modules on top of a dilated FCN that models the semantic inter-dependencies in spatial and channel dimensions, respectively. The position attention module selectively aggregates the features at each position via weighted sums.

Other applications of attention mechanisms to semantic segmentation include OCNet [92], which employs an object context pooling inspired by the self-attention mechanism, ResNeSt: Split-Attention Networks [93], Height-driven Attention Networks [94], Expectation-Maximization Attention (EMANet) [95], Criss-Cross Attention Network (CCNet) [96], end-to-end instance segmentation with recurrent attention [97], a point-wise spatial attention network for scene parsing [98], and Discriminative Feature Network (DFN) [99].

3.9 Generative Models and Adversarial Training

GANs have been applied to a wide range of tasks in computer vision, not excluding image segmentation.

Fig. 32. The GAN for semantic segmentation. From [100].

Luc et al. [100] proposed an adversarial training approach for semantic segmentation in which they trained a convolutional semantic segmentation network (Fig. 32), along with an adversarial network that discriminates between ground-truth segmentation maps and those generated by the segmentation network. They showed that the adversarial training approach yields improved accuracy on the Stanford Background and PASCAL VOC 2012 datasets.

Souly et al. [101] proposed semi-weakly supervised semantic segmentation using GANs. Their model consists of a generator network providing extra training examples to a multiclass classifier, acting as discriminator in the GAN framework, that assigns each sample a label from the possible label classes or marks it as a fake sample (extra class).

Hung et al. [102] developed a framework for semi-supervised semantic segmentation using an adversarial network. They designed an FCN discriminator to differentiate the predicted probability maps from the ground truth segmentation distribution, considering the spatial resolution. The loss function of this model has three terms: cross-entropy loss on the segmentation ground truth, adversarial loss of the discriminator network, and semi-supervised loss based on the confidence map output of the discriminator.

Xue et al. [103] proposed an adversarial network with multiscale L1 loss for medical image segmentation. They used an FCN as the segmentor to generate segmentation label maps, and proposed a novel adversarial critic network with a multi-scale L1 loss function to force the critic and segmentor to learn both global and local features that capture long and short range spatial relationships between pixels.

Other approaches based on adversarial training include cell image segmentation using GANs [104], and segmentation and generation of the invisible parts of objects [105].

3.10 CNN Models With Active Contour Models

The exploration of synergies between FCNs and Active Contour Models (ACMs) [8] has recently attracted research interest.

One approach is to formulate new loss functions that are inspired by ACM principles. For example, inspired by the global energy formulation of [106], Chen et al. [107] proposed a supervised loss layer that incorporated area and size information of the predicted masks during training of an FCN and tackled the problem of ventricle segmentation in cardiac MRI. Similarly, Gur et al. [108] presented an unsupervised loss function based on morphological active contours without edges [109] for microvascular image segmentation.

A different approach initially sought to utilize the ACM merely as a post-processor of the output of an FCN, and several efforts attempted modest co-learning by pre-training the FCN. One example of an ACM post-processor for the task of semantic segmentation of natural images is the work by Le et al. [110], in which level-set ACMs are implemented as RNNs. Deep Active Contours by Rupprecht et al. [111] is another example. For medical image segmentation, Hatamizadeh et al. [112] proposed an integrated Deep Active Lesion Segmentation (DALS) model that trains the FCN backbone to predict the parameter functions of a novel, locally-parameterized level-set energy functional. In another relevant effort, Marcos et al. [113] proposed Deep Structured Active Contours (DSAC), which combines ACMs and pre-trained FCNs in a structured prediction framework for building instance segmentation (albeit with manual initialization) in aerial images. For the same application, Cheng et al. [114] proposed the Deep Active Ray Network (DarNet), which is similar to DSAC, but with a different explicit ACM formulation based on polar coordinates to prevent contour self-intersection.


Fig. 33. Timeline of representative DL-based image segmentation algorithms. Orange, green, and yellow blocks indicate semantic, instance, and panoptic segmentation algorithms, respectively.

A truly end-to-end backpropagation trainable, fully-integrated FCN-ACM combination was recently introduced by Hatamizadeh et al. [115], dubbed Trainable Deep Active Contours (TDAC). Going beyond [112], they implemented the locally-parameterized level-set ACM in the form of additional convolutional layers following the layers of the backbone FCN, exploiting TensorFlow's automatic differentiation mechanism to backpropagate training error gradients throughout the entire TDAC framework. The fully-automated model requires no intervention either during training or segmentation, can naturally segment multiple instances of objects of interest, and deal with arbitrary object shape, including sharp corners.

3.11 Other Models

Other popular DL architectures for image segmentation include the following: Context Encoding Network (EncNet) [116] uses a basic feature extractor and feeds the feature maps into a context encoding module. RefineNet [117] is a multipath refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. Seednet [118] introduced an automatic seed generation technique with deep reinforcement learning that learns to solve the interactive segmentation problem. Object-Contextual Representations (OCR) [42] learns object regions and the relation between each pixel and each object region, augmenting the representation pixels with the object-contextual representation. Additional models and methods include BoxSup [119], Graph Convolutional Networks (GCN) [120], Wide ResNet [121], Exfuse [122] (enhancing low-level and high-level features fusion), Feedforward-Net [123], saliency-aware models for geodesic video segmentation [124], Dual Image Segmentation (DIS) [125], FoveaNet [126] (perspective-aware scene parsing), Ladder DenseNet [127], Bilateral Segmentation Network (BiSeNet) [128], Semantic Prediction Guidance for Scene Parsing (SPGNet) [129], gated shape CNNs [130], Adaptive Context Network (AC-Net) [131], Dynamic-Structured Semantic Propagation Network (DSSPN) [132], Symbolic Graph Reasoning (SGR) [133], CascadeNet [134], Scale-Adaptive Convolutions (SAC) [135], Unified Perceptual Parsing Network (UperNet) [136], segmentation by re-training and self-training [137], densely connected neural architecture search [138], hierarchical multiscale attention [139], Efficient RGB-D Semantic Segmentation (ESA-Net) [140], Iterative Pyramid Contexts [141], and Learning Dynamic Routing for Semantic Segmentation [142].

Panoptic segmentation [143] is growing in popularity. Efforts in this direction include Panoptic Feature Pyramid Network (PFPN) [144], attention-guided network for panoptic segmentation [145], seamless scene segmentation [146], panoptic DeepLab [147], unified panoptic segmentation network [148], and efficient panoptic segmentation [149].

Fig. 33 provides a timeline of some of the most representative DL image segmentation models since 2014.

4 DATASETS

In this section we survey the image datasets most commonly used to train and test DL image segmentation models, grouping them into 3 categories—2D (pixel) images, 2.5D RGB-D (color+depth) images, and 3D (voxel) images—and provide details about the characteristics of each dataset.

Data augmentation is often used to increase the number of labeled samples, especially for small datasets such as those in the medical imaging domain, thus improving the performance of DL segmentation models. A set of transformations is applied either in the data space, or feature space, or both (i.e., both the image and the segmentation map). Typical transformations include translation, reflection, rotation, warping, scaling, color space shifting, cropping, and projections onto principal components. Data augmentation can also benefit by yielding faster convergence, decreasing the chance of over-fitting, and enhancing generalization. For some small datasets, data augmentation has been shown to boost model performance by more than 20 percent.
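For segmentation, geometric augmentations must be applied identically to the image and its label map; the sketch below shows this for random horizontal flips and rotations using NumPy only (the transformation set is illustrative, not a prescription).

```python
import numpy as np

def augment(image, mask, rng):
    """Apply the same random flip/rotation to an image and its segmentation mask."""
    if rng.random() < 0.5:                      # random horizontal reflection
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = rng.integers(0, 4)                      # random rotation by 0/90/180/270 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image.copy(), mask.copy()

rng = np.random.default_rng(0)
img = np.zeros((128, 128, 3), dtype=np.uint8)   # dummy RGB image
seg = np.zeros((128, 128), dtype=np.int64)      # dummy label map
aug_img, aug_seg = augment(img, seg, rng)
print(aug_img.shape, aug_seg.shape)
```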
4.1 2D Image Datasets

The bulk of image segmentation research has focused on 2D images; therefore, many 2D image segmentation datasets are available. The following are some of the most popular:

PASCAL Visual Object Classes (VOC) [150] is a highly popular dataset in computer vision, with annotated images available for 5 tasks—classification, segmentation, detection, action recognition, and person layout. For the segmentation task, there are 21 labeled object classes and pixels are labeled as background if they do not belong to any of these classes.


Fig. 34. An example image from the PASCAL VOC dataset. From [151].

The dataset is divided into two sets, training and validation, with 1,464 and 1,449 images, respectively, and a private test set for the actual challenge. Fig. 34 shows an example image and its pixel-wise label.

PASCAL Context [152] is an extension of the PASCAL VOC 2010 detection challenge. It includes pixel-wise labels for all the training images. It contains more than 400 classes (including the original 20 classes plus backgrounds from PASCAL VOC segmentation), in three categories (objects, stuff, and hybrids). Many of the object categories of this dataset are too sparse; therefore, a subset of 59 classes is usually selected for use.

Fig. 35. A sample image and segmentation map in COCO. From [153].

Microsoft Common Objects in Context (MS COCO) [153] is a large-scale object detection, segmentation, and captioning dataset. COCO includes images of complex everyday scenes, containing common objects in their natural contexts. This dataset contains photos of 91 object types, with a total of 2.5 million labeled instances in 328K images. Fig. 35 compares MS-COCO labels with those of previous datasets for a sample image.

Fig. 36. Segmentation maps from the Cityscapes dataset. From [154].

Cityscapes [154] is a large database with a focus on semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 cities, with high quality pixel-level annotation of 5K frames, in addition to a set of 20K weakly annotated frames. It includes semantic and dense pixel annotations of 30 classes, grouped into 8 categories—flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void. Fig. 36 shows sample segmentation maps from this dataset.

ADE20K / MIT Scene Parsing (SceneParse150) offers a training and evaluation platform for scene parsing algorithms. The data for this benchmark comes from the ADE20K dataset [134], which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. The benchmark is divided into 20K images for training, 2K images for validation, and another batch of images for testing. There are 150 semantic categories in this dataset.

SiftFlow [155] includes 2,688 annotated images, from a subset of the LabelMe database, of 8 different outdoor scenes, among them streets, mountains, fields, beaches, and buildings, and in one of 33 semantic classes.

Stanford Background [156] comprises outdoor images of scenes from existing datasets, such as LabelMe, MSRC, and PASCAL VOC. It includes 715 images with at least one foreground object. The dataset is pixel-wise annotated, and can be used for semantic scene understanding.

Berkeley Segmentation Dataset (BSD) [157] contains 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. It aims to provide an empirical basis for research on image segmentation and boundary detection. Half of the segmentations were obtained from presenting the subject a color image and the other half from presenting a grayscale image. The public benchmark based on this data consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images and a test set of 100 images.

Youtube-Objects [158] contains videos collected from YouTube, which include objects from ten PASCAL VOC classes (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train). The original dataset did not contain pixel-wise annotations (as it was originally developed for object detection, with weak annotations). However, Jain et al. [159] manually annotated a subset of 126 sequences, and then extracted a subset of frames to further generate semantic labels. In total, there are about 10,167 annotated 480 × 360 pixel frames available in this dataset.

CamVid is another scene understanding database (with a focus on road/driving scenes) which was originally captured as five video sequences via a camera mounted on the dashboard of a car. A total of 701 frames were provided by sampling from the sequences. These frames were manually annotated into 32 classes.

KITTI [160] is one of the most popular datasets for autonomous driving, containing videos of traffic scenarios, recorded with a variety of sensor modalities (including high-resolution RGB, grayscale stereo cameras, and 3D laser scanners). The original dataset does not contain ground truth for semantic segmentation, but researchers have manually annotated parts of the dataset; e.g., Alvarez et al. [161] generated ground truth for 323 images from the road detection challenge with 3 classes—road, vertical, and sky.

Other datasets for image segmentation purposes include Semantic Boundaries Dataset (SBD) [162], PASCAL Part [163], SYNTHIA [164], and Adobe's Portrait Segmentation [165].

4.2 2.5D Datasets

With the availability of affordable range scanners, RGB-D images have become increasingly widespread. The following RGB-D datasets are among the most popular:

NYU-Depth V2 [166] consists of video sequences from a variety of indoor scenes, recorded by the RGB and depth cameras of the Microsoft Kinect. It includes 1,449 densely labeled RGB and depth image pairs of more than 450 scenes taken from 3 cities.


taken from 3 cities. Each object is labeled with a class and instance number (e.g., cup1, cup2, cup3, etc.). It also contains 407,024 unlabeled frames. Fig. 37 shows an RGB-D sample and its label map.

Fig. 37. A sample from the NYU V2 dataset. From left: RGB image, preprocessed depth image, class labels map. From [166].

SUN-3D [167] is a large RGB-D video dataset that contains 415 sequences captured from 254 different spaces in 41 different buildings; 8 sequences are annotated and more will be annotated in the future. Each annotated frame provides the semantic segmentation of the objects in the scene as well as information about the camera pose.
SUN RGB-D [168] provides an RGB-D benchmark for advancing the state-of-the-art of all major scene understanding tasks. It is captured by four different sensors and contains 10,000 RGB-D images at a scale similar to PASCAL VOC.
ScanNet [169] is an RGB-D video dataset containing 2.5 million views in more than 1,500 scans, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations. To collect these data, an easy-to-use and scalable RGB-D capture system was designed that includes automated surface reconstruction, and the semantic annotation was crowd-sourced. Using this data helped achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
Stanford 2D-3D [170] provides a variety of mutually registered 2D, 2.5D, and 3D modalities, with instance-level semantic and geometric annotations, acquired from 6 indoor areas. It contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, as well as global XYZ images, camera information, and registered raw and semantically annotated 3D meshes and point clouds.
Another popular 2.5D dataset is the UW RGB-D Object Dataset [171], which contains 300 common household objects recorded using a Kinect-style sensor.

5 DL SEGMENTATION MODEL PERFORMANCE
In this section, we summarize the metrics commonly used in evaluating the performance of segmentation models and report the performance of DL-based segmentation models on benchmark datasets.

5.1 Metrics for Image Segmentation Models
Ideally, an image segmentation model should be evaluated in multiple respects, such as quantitative accuracy, visual quality, speed (inference time), and storage requirements (memory footprint). However, most researchers to date have focused on metrics for quantifying model accuracy. The following metrics are the most popular:
Pixel accuracy (PA) is the ratio of properly classified pixels divided by the total number of pixels. For K + 1 classes (K foreground classes and the background), pixel accuracy is defined as

    PA = \frac{\sum_{i=0}^{K} p_{ii}}{\sum_{i=0}^{K} \sum_{j=0}^{K} p_{ij}},    (1)

where p_{ij} is the number of pixels of class i predicted as belonging to class j.
Mean Pixel Accuracy (MPA) is an extension of PA, in which the ratio of correct pixels is computed in a per-class manner and then averaged over the total number of classes:

    MPA = \frac{1}{K+1} \sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij}}.    (2)

Intersection over Union (IoU), or the Jaccard Index, is defined as the area of intersection between the predicted segmentation map A and the ground truth map B, divided by the area of the union between the two maps, and ranges between 0 and 1:

    IoU = J(A, B) = \frac{|A \cap B|}{|A \cup B|}.    (3)

Mean-IoU is defined as the average IoU over all classes.
Precision / Recall / F1 score can be defined for each class, as well as at the aggregate level, as follows:

    Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},    (4)

where TP refers to the true positive fraction, FP refers to the false positive fraction, and FN refers to the false negative fraction. Usually one is interested in a combined version of precision and recall; the F1 score is defined as their harmonic mean:

    F1 = \frac{2 \, Precision \cdot Recall}{Precision + Recall}.    (5)

The Dice coefficient, commonly used in medical image analysis, can be defined as twice the overlap area of the predicted and ground-truth maps divided by the total number of pixels in both maps:

    Dice = \frac{2 |A \cap B|}{|A| + |B|}.    (6)

It is very similar to the IoU (3), and when applied to binary maps, with the foreground as the positive class, the Dice coefficient is identical to the F1 score:

    Dice = \frac{2TP}{2TP + FP + FN} = F1.    (7)
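To make the relationship between these metrics concrete, the sketch below (our own illustrative NumPy code, not part of the original survey; all function and variable names are ours) accumulates a confusion matrix from a ground-truth and a predicted label map and derives PA, MPA, mean-IoU, and the mean Dice/F1 score of Eqs. (1)-(7).

    import numpy as np

    def confusion_matrix(gt, pred, num_classes):
        # Entry (i, j) counts pixels whose ground-truth class is i and predicted class is j.
        valid = (gt >= 0) & (gt < num_classes)
        idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
        return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    def segmentation_metrics(conf):
        conf = conf.astype(float)
        tp = np.diag(conf)                      # true positives per class
        gt_total = conf.sum(axis=1)             # TP + FN per class
        pred_total = conf.sum(axis=0)           # TP + FP per class
        with np.errstate(divide="ignore", invalid="ignore"):
            pa = tp.sum() / conf.sum()                       # Eq. (1)
            mpa = np.nanmean(tp / gt_total)                  # Eq. (2)
            iou = tp / (gt_total + pred_total - tp)          # per-class IoU, Eq. (3)
            dice = 2 * tp / (gt_total + pred_total)          # per-class Dice = F1, Eqs. (6)-(7)
        return pa, mpa, np.nanmean(iou), np.nanmean(dice)

    # Toy example with K + 1 = 3 classes (0 = background).
    gt = np.array([[0, 0, 1, 1],
                   [0, 2, 1, 1],
                   [2, 2, 2, 0],
                   [0, 0, 2, 2]])
    pred = np.array([[0, 1, 1, 1],
                     [0, 2, 1, 0],
                     [2, 2, 2, 0],
                     [0, 0, 2, 2]])
    print(segmentation_metrics(confusion_matrix(gt, pred, num_classes=3)))

For binary foreground/background maps, the Dice value computed here coincides with the F1 score of Eq. (5), which is exactly the identity stated in Eq. (7).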
5.2 Quantitative Performance of DL-Based Models
In this section, we tabulate the performance of several of the previously discussed algorithms on popular segmentation benchmarks. Although most publications report model performance on standard datasets and use standard metrics, some of them fail to do so, making across-the-board comparisons difficult. Furthermore, only a few publications provide additional information, such as execution time and
memory footprint, in a reproducible way, which is important to industrial applications (such as drones, self-driving cars, and robotics) that may run on embedded systems with limited computational power and storage, thus requiring light-weight models.
The following tables summarize the performances of several of the prominent DL-based segmentation models on different datasets:

TABLE 1
Accuracies of Segmentation Models on the PASCAL VOC Test Set

Method | Backbone | mIoU
FCN [30] | VGG-16 | 62.2
CRF-RNN [38] | - | 72.0
CRF-RNN [38] | - | 74.7
BoxSup* [119] | - | 75.1
Piecewise [39] | - | 78.0
DPN [40] | - | 77.5
DeepLab-CRF [76] | ResNet-101 | 79.7
GCN [120] | ResNet-152 | 82.2
Dynamic Routing [142] | - | 84.0
RefineNet [117] | ResNet-152 | 84.2
Wide ResNet [121] | WideResNet-38 | 84.9
PSPNet [54] | ResNet-101 | 85.4
DeeplabV3 [13] | ResNet-101 | 85.7
PSANet [98] | ResNet-101 | 85.7
EncNet [116] | ResNet-101 | 85.9
DFN [99] | ResNet-101 | 86.2
Exfuse [122] | ResNet-101 | 86.2
SDN* [43] | DenseNet-161 | 86.6
DIS [125] | ResNet-101 | 86.8
APC-Net [58] | ResNet-101 | 87.1
EMANet [95] | ResNet-101 | 87.7
DeeplabV3+ [81] | Xception-71 | 87.8
Exfuse [122] | ResNeXt-131 | 87.9
MSCI [59] | ResNet-152 | 88.0
EMANet [95] | ResNet-152 | 88.2
DeeplabV3+ [81] | Xception-71 | 89.0
EfficientNet+NAS-FPN [137] | - | 90.5
* Models pre-trained on other datasets (MS-COCO, ImageNet, etc.).

TABLE 2
Accuracies of Segmentation Models on the Cityscapes Dataset

Method | Backbone | mIoU
SegNet [25] | - | 57.0
FCN-8s [30] | - | 65.3
DPN [40] | - | 66.8
Dilation10 [77] | - | 67.1
DeeplabV2 [76] | ResNet-101 | 70.4
RefineNet [117] | ResNet-101 | 73.6
FoveaNet [126] | ResNet-101 | 74.1
Ladder DenseNet [127] | Ladder DenseNet-169 | 73.7
GCN [120] | ResNet-101 | 76.9
DUC-HDC [78] | ResNet-101 | 77.6
Wide ResNet [121] | WideResNet-38 | 78.4
PSPNet [54] | ResNet-101 | 78.4
BiSeNet [128] | ResNet-101 | 78.9
DFN [99] | ResNet-101 | 79.3
PSANet [98] | ResNet-101 | 80.1
DenseASPP [79] | DenseNet-161 | 80.6
Dynamic Routing [142] | - | 80.7
SPGNet [129] | 2xResNet-50 | 81.1
DANet [91] | ResNet-101 | 81.5
CCNet [96] | ResNet-101 | 81.4
DeeplabV3 [13] | ResNet-101 | 81.3
IPC [141] | ResNet-101 | 81.8
AC-Net [131] | ResNet-101 | 82.3
OCR [42] | ResNet-101 | 82.4
ResNeSt200 [93] | ResNeSt-200 | 82.7
GS-CNN [130] | WideResNet | 82.8
HA-Net [94] | ResNext-101 | 83.2
HRNetV2+OCR [42] | HRNetV2-W48 | 83.7
Hierarchical MSA [139] | HRNet-OCR | 85.1

TABLE 3
Accuracies of Segmentation Models on the MS COCO Stuff Dataset

Method | Backbone | mIoU
RefineNet [117] | ResNet-101 | 33.6
CCN [57] | Ladder DenseNet-101 | 35.7
DANet [91] | ResNet-50 | 37.9
DSSPN [132] | ResNet-101 | 37.3
EMA-Net [95] | ResNet-50 | 37.5
SGR [133] | ResNet-101 | 39.1
OCR [42] | ResNet-101 | 39.5
DANet [91] | ResNet-101 | 39.7
EMA-Net [95] | ResNet-101 | 39.9
AC-Net [131] | ResNet-101 | 40.1
OCR [42] | HRNetV2-W48 | 40.5

Table 1 focuses on the PASCAL VOC test set. Clearly, there has been much improvement in the accuracy of the models since the introduction of the first DL-based image segmentation model, the FCN.
Table 2 focuses on the Cityscapes test dataset. The latest models feature about a 23 percent relative gain over the pioneering FCN model on this dataset.
Table 3 focuses on the MS COCO stuff test set. This dataset is more challenging than PASCAL VOC and Cityscapes, as the highest mIoU is approximately 40 percent.
Table 4 focuses on the ADE20k validation set. This dataset is also more challenging than the PASCAL VOC and Cityscapes datasets.
Table 5 provides the performance of prominent instance segmentation algorithms on the COCO test-dev 2017 dataset, in terms of average precision, and their speed.
Table 6 provides the performance of prominent panoptic segmentation algorithms on the MS-COCO val dataset, in terms of panoptic quality [143].
Finally, Table 7 summarizes the performance of several prominent models for RGB-D segmentation on the NYUD-v2 and SUN-RGBD datasets.
In summary, we have witnessed significant improvement in the performance of deep segmentation models over the past 5-6 years, with a relative improvement of 25-42 percent in mIoU on different datasets. However, some publications suffer from a lack of reproducibility for multiple reasons: they report performance on non-standard benchmarks/databases, or only on arbitrary subsets of the test set from a popular benchmark, or they do not adequately describe the experimental setup and sometimes evaluate model performance only on a subset of object classes. Most importantly, many publications do not provide the source code for their model implementations. Fortunately, with the increasing popularity of deep learning models, the trend has been
positive and many research groups are moving toward reproducible frameworks and open-sourcing their implementations.

TABLE 4
Accuracies of Segmentation Models on the ADE20k Validation Dataset

Method | Backbone | mIoU
FCN [30] | - | 29.39
DilatedNet [77] | - | 32.31
CascadeNet [134] | - | 34.90
RefineNet [117] | ResNet-152 | 40.7
PSPNet [54] | ResNet-101 | 43.29
PSPNet [54] | ResNet-269 | 44.94
EncNet [116] | ResNet-101 | 44.64
SAC [135] | ResNet-101 | 44.3
PSANet [98] | ResNet-101 | 43.70
UperNet [136] | ResNet-101 | 42.66
DSSPN [132] | ResNet-101 | 43.68
DM-Net [56] | ResNet-101 | 45.50
AC-Net [131] | ResNet-101 | 45.90
ResNeSt-101 [93] | ResNeSt-101 | 46.91
ResNeSt-200 [93] | ResNeSt-200 | 48.36

TABLE 5
Instance Segmentation Model Performance on COCO Test-Dev 2017

Method | Backbone | FPS | AP
YOLACT-550 [74] | R-101-FPN | 33.5 | 29.8
YOLACT-700 [74] | R-101-FPN | 23.8 | 31.2
RetinaMask [172] | R-101-FPN | 10.2 | 34.7
TensorMask [67] | R-101-FPN | 2.6 | 37.1
SharpMask [173] | R-101-FPN | 8.0 | 37.4
Mask-RCNN [62] | R-101-FPN | 10.6 | 37.9
CenterMask [72] | R-101-FPN | 13.2 | 38.3

TABLE 6
Panoptic Segmentation Model Performance on MS-COCO Val

Method | Backbone | PQ
Panoptic FPN [144] | ResNet-50 | 39.0
Panoptic FPN [144] | ResNet-101 | 40.3
AU-Net [145] | ResNet-50 | 39.6
Panoptic-DeepLab [147] | Xception-71 | 39.7
OANet [174] | ResNet-50 | 39.0
OANet [174] | ResNet-101 | 40.7
AdaptIS [175] | ResNet-50 | 35.9
AdaptIS [175] | ResNet-101 | 37.0
UPSNet [148] | ResNet-50 | 42.5
OCFusion [176] | ResNet-50 | 41.3
OCFusion [176] | ResNet-101 | 43.0
OCFusion [176] | ResNeXt-101 | 45.7
Use of deformable convolution.

TABLE 7
Segmentation Model Performance on the NYUD-v2 and SUN-RGBD Datasets

Method | NYUD-v2 m-Acc | NYUD-v2 m-IoU | SUN-RGBD m-Acc | SUN-RGBD m-IoU
Mutex [177] | - | 31.5 | - | -
MS-CNN [178] | 45.1 | 34.1 | - | -
FCN [30] | 46.1 | 34.0 | - | -
Joint-Seg [179] | 52.3 | 39.2 | - | -
SegNet [25] | - | - | 44.76 | 31.84
Structured Net [39] | 53.6 | 40.6 | 53.4 | 42.3
B-SegNet [180] | - | - | 45.9 | 30.7
3D-GNN [181] | 55.7 | 43.1 | 57.0 | 45.9
LSD-Net [46] | 60.7 | 45.9 | 58.0 | -
RefineNet [117] | 58.9 | 46.5 | 58.5 | 45.9
D-aware CNN [182] | 61.1 | 48.4 | 53.5 | 42.0
MTI-Net [183] | 62.9 | 49 | - | -
RDFNet [184] | 62.8 | 50.1 | 60.1 | 47.7
ESANet-R34-NBt1D [140] | - | 50.3 | - | 48.17
G-Aware Net [185] | 68.7 | 59.6 | 74.9 | 54.5

6 CHALLENGES AND OPPORTUNITIES
Without a doubt, image segmentation has benefited greatly from deep learning, but several challenges lie ahead. We will next discuss some of the promising research directions that we believe will help in further advancing image segmentation algorithms.

6.1 More Challenging Datasets
Several large-scale image datasets have been created for semantic segmentation and instance segmentation. However, there remains a need for more challenging datasets, as well as datasets of different kinds of images. For still images, datasets with a large number of objects and overlapping objects would be very valuable. This can enable the training of models that handle dense object scenarios better, as well as large overlaps among objects, as is common in real-world scenarios. With the rising popularity of 3D image segmentation, especially in medical image analysis, there is also a strong need for large-scale annotated 3D image datasets, which are more difficult to create than their lower dimensional counterparts.

6.2 Combining DL and Earlier Segmentation Models
There is now broad agreement that the performance of DL-based segmentation algorithms is plateauing, especially in certain application domains such as medical image analysis. To advance to the next level of performance, we must further explore the combination of CNN-based image segmentation models with prominent "classical" model-based image segmentation methods. The integration of CNNs with graphical models has been studied, but their integration with active contours, graph cuts, and other segmentation models is fairly recent and deserves further work.
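One concrete instance of such a combination is dense-CRF refinement of CNN outputs. The sketch below is illustrative only: it assumes the third-party pydensecrf package (a Python wrapper around the fully connected CRF widely used to post-process CNN segmentations), and the function name and hyperparameter values are ours rather than those of any specific method surveyed above.

    import numpy as np
    import pydensecrf.densecrf as dcrf
    from pydensecrf.utils import unary_from_softmax

    def crf_refine(image, probs, n_iters=5):
        # image: H x W x 3 uint8 RGB array; probs: (num_classes, H, W) softmax output of the CNN.
        n_classes, h, w = probs.shape
        d = dcrf.DenseCRF2D(w, h, n_classes)
        d.setUnaryEnergy(unary_from_softmax(probs.astype(np.float32)))   # unary = -log(p)
        d.addPairwiseGaussian(sxy=3, compat=3)                           # spatial smoothness kernel
        d.addPairwiseBilateral(sxy=80, srgb=13, compat=10,
                               rgbim=np.ascontiguousarray(image))        # appearance (color) kernel
        q = d.inference(n_iters)                                         # mean-field inference
        return np.argmax(np.array(q).reshape((n_classes, h, w)), axis=0)

Whether such refinement, or deeper couplings with active contours and graph cuts, is applied as post-processing or learned end-to-end is precisely the design space this direction leaves open.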

6.3 Interpretable Deep Models
While DL-based models have achieved promising performance on challenging benchmarks, there remain open questions about these models. For example, what exactly are deep models learning? How should we interpret the features learned by these models? What is a minimal neural architecture that can achieve a certain segmentation accuracy on a given dataset? Although some techniques are available to visualize the learned convolutional kernels of these models, a comprehensive study of the underlying behavior/dynamics of these models is lacking. A better understanding of the theoretical aspects of these models can enable the development of better models curated toward various segmentation scenarios.
Authorized licensed use limited to: GITAM University. Downloaded on January 25,2023 at 11:09:35 UTC from IEEE Xplore. Restrictions apply.
MINAEE ET AL.: IMAGE SEGMENTATION USING DEEP LEARNING: A SURVEY 3537

understanding of the theoretical aspects of these models can imagery (often collected by imaging spectrometers with
enable the development of better models curated toward hundreds or even thousands of spectral bands) and the
various segmentation scenarios. limited ground-truth information necessary to evaluate
the accuracy of the segmentation algorithms. Similarly,
6.4 Weakly-Supervised and Unsupervised Learning DL-based segmentation techniques in the evaluation of
Weakly-supervised (a.k.a. few shot) learning [186] and unsu- construction materials [194] face challenges related to the
pervised learning [187] are becoming very active research massive volume of the related image data and the limited
areas. These techniques promise to be specially valuable for reference information for validation purposes. Last but
image segmentation, as collecting pixel-accurately labeled not least, an important application field for DL-based seg-
training images is problematic in many application domains, mentation has been biomedical imaging [195]. Here, an
particularly so in medical image analysis. The transfer learn- opportunity is to design standardized image databases
ing approach is to train a generic image segmentation model useful in evaluating new infectious diseases and tracking
on a large set of labeled samples (perhaps from a public pandemics [196].
benchmark) and then fine-tune that model on a few samples
from some specific target application. Self-supervised learn-
ing is another promising direction that is attracting much 7 CONCLUSION
attraction in various fields. With the help of self-supervised We have surveyed image segmentation algorithms based on
learning, many details in images can be captured in order to deep learning models, which have achieved impressive per-
train segmentation models with far fewer training samples. formance in various image segmentation tasks and bench-
Models based on reinforcement learning could also be marks, grouped into architectural categories such as: CNN
another potential future direction, as they have scarcely and FCN, RNN, R-CNN, dilated CNN, attention-based
received attention for image segmentation. For example, models, generative and adversarial models, among others.
MOREL [188] introduced a deep reinforcement learning We have summarized the quantitative performance of these
approach for moving object segmentation in videos. models on some popular benchmarks, such as the PASCAL
VOC, MS COCO, Cityscapes, and ADE20k datasets. Finally,
6.5 Real-Time Models for Various Applications we discussed some of the open challenges and promising
In many applications, accuracy is the most important factor; research directions for deep-learning-based image segmen-
however, there are applications in which it is also critical to tation in the coming years.
have segmentation models that can run in near real-time, or
at common camera frame rates (at least 25 frames per sec- ACKNOWLEDGMENTS
ond). This is useful for computer vision systems that are, for
The authors would like to thank Tsung-Yi Lin of Google
example, deployed in autonomous vehicles. Most of the cur-
Brain, as well as Jingdong Wang and Yuhui Yuan of Micro-
rent models are far from this frame-rate; e.g., FCN-8 takes
soft Research Asia for providing helpful comments that
roughly 100 ms to process a low-resolution image. Models
improved the manuscript.
based on dilated convolution help to increase the speed of
segmentation models to some extent, but there is still plenty
of room for improvement. REFERENCES
[1] A. Rosenfeld and A. C. Kak, Digital Picture Processing.
Cambridge, MA, USA: Academic Press, 1976.
6.6 Memory Efficient Models [2] R. Szeliski, Computer Vision: Algorithms and Applications. Berlin,
Many modern segmentation models require a significant Germany: Springer, 2010.
amount of memory even during the inference stage. So far, [3] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach.
Upper Saddle River, NJ, USA: Prentice Hall, 2002.
much effort has been directed towards improving the accu- [4] N. Otsu, “A threshold selection method from gray-level histo-
racy of such models, but in order to fit them into specific grams,” IEEE Trans. Syst. Man Cybern., vol. SMC-9, no. 1, pp. 62–
devices, such as mobile phones, the networks must be sim- 66, Jan. 1979.
[5] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Trans.
plified. This can be done either by using simpler models, or Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1452–1458, Nov.
by using model compression techniques, or even by training 2004.
a complex model and using knowledge distillation techni- [6] N. Dhanachandra, K. Manglem, and Y. J. Chanu, “Image seg-
ques to compress it into a smaller, memory efficient network mentation using K-means clustering algorithm and subtractive
clustering algorithm,” Procedia Comput. Sci., vol. 54, pp. 764–771,
that mimics the complex model. 2015.
[7] L. Najman and M. Schmitt, “Watershed of a continuous
6.7 Applications function,” Signal Process., vol. 38, no. 1, pp. 99–112, 1994.
[8] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active contour
DL-based segmentation methods have been successfully models,” Int. J. Comput. Vis., vol. 1, no. 4, pp. 321–331, 1988.
applied to satellite images in remote sensing [189], such [9] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy
as to support urban planning [190] and precision agricul- minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.
ture [191]. Images collected by airborne platforms [192] [10] N. Plath, M. Toussaint, and S. Nakajima, “Multi-class image seg-
and drones [193] have also been segmented using DL- mentation using conditional random fields and global classi-
based segmentation methods in order to address impor- fication,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 817–824.
tant environmental problems including ones related to [11] J.-L. Starck, M. Elad, and D. L. Donoho, “Image decomposition
via the combination of sparse representations and a variational
climate change. The main challenges of the remote sens- approach,” IEEE Trans. Image Process., vol. 14, no. 10, pp. 1570–
ing domain stem from the typically formidable size of the 1582, Oct. 2005.


[12] S. Minaee and Y. Wang, “An ADMM approach to masked signal [39] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piece-
decomposition using subspace representation,” IEEE Trans. wise training of deep structured models for semantic
Image Process., vol. 28, no. 7, pp. 3192–3204, Jul. 2019. segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[13] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, 2016, pp. 3194–3203.
“Rethinking atrous convolution for semantic image [40] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image
segmentation,” 2017, arXiv: 1706.05587. segmentation via deep parsing network,” in Proc. IEEE Int. Conf.
[14] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and Comput. Vis., 2015, pp. 1377–1385.
D. Terzopoulos, “Image segmentation using deep learning: A [41] H. Noh, S. Hong, and B. Han, “Learning deconvolution network
survey,” 2020, arXiv: 2001.05566. for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis.,
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based 2015, pp. 1520–1528.
learning applied to document recognition,” Proc. IEEE, vol. 86, [42] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representa-
no. 11, pp. 2278–2324, Nov. 1998. tions for semantic segmentation,” 2019, arXiv: 1909.11065.
[16] K. Fukushima, “Neocognitron: A self-organizing neural net- [43] J. Fu, J. Liu, Y. Wang, J. Zhou, C. Wang, and H. Lu, “Stacked
work model for a mechanism of pattern recognition unaf- deconvolutional network for semantic segmentation,” IEEE
fected by shift in position,” Biol. Cybern., vol. 36, no. 4, Trans. Image Process., early access, Jan. 25, 2019, doi: 10.1109/
pp. 193–202, 1980. TIP.2019.2895460.
[17] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, [44] A. Chaurasia and E. Culurciello, “LinkNet: Exploiting encoder
“Phoneme recognition using time-delay neural networks,” IEEE representations for efficient semantic segmentation,” in Proc.
Trans. Acoust. Speech Signal Process., vol. 37, no. 3, pp. 328–339, IEEE Int. Conf. Visual Commun. Image Process., 2017, pp. 1–4.
Mar. 1989. [45] X. Xia and B. Kulis, “W-Net: A deep model for fully unsuper-
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classi- vised image segmentation,” 2017, arXiv: 1711.08506.
fication with deep convolutional neural networks,” in Proc. 25th [46] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensi-
Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105. tive deconvolution networks with gated fusion for RGB-D indoor
[19] K. Simonyan and A. Zisserman, “Very deep convolutional net- semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
works for large-scale image recognition,” 2014, arXiv:1409.1556. Recognit., 2017, pp. 3029–3037.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [47] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- networks for biomedical image segmentation,” in Proc. Int. Conf.
ognit., 2016, pp. 770–778. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–
[21] 2015. [Online]. Available: https://colah.github.io/posts/2015– 241.
08-Understanding-LSTMs/ [48] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++:
[22] D. E. Rumelhart et al., “Learning representations by back-propa- A nested U-Net architecture for medical image segmentation,” in
gating errors,” Cogn. Model., vol. 5, no. 3, 1988, Art. no. 1. Proc. Int. Workshop Deep Learn. Med. Image Anal. Multimodal Learn.
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Clin. Decis. Support, 2018, pp. 3–11.
Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [49] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep resid-
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. ual U-Net,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 749–
Cambridge, MA, USA: MIT Press, 2016. 753, May 2018.
[25] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep [50] € Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronne-
O.
convolutional encoder-decoder architecture for image berger, “3D U-Net: Learning dense volumetric segmentation
segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, from sparse annotation,” in Proc. Int. Conf. Med. Image Comput.
no. 12, pp. 2481–2495, Dec. 2017. Comput.-Assisted Intervention, 2016, pp. 424–432.
[26] I. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th [51] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolu-
Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680. tional neural networks for volumetric medical image
[27] A. Radford, L. Metz, and S. Chintala, “Unsupervised representa- segmentation,” in Proc. 4th Int. Conf. 3D Vis., 2016, pp. 565–571.
tion learning with deep convolutional generative adversarial [52] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam,
networks,” 2015, arXiv:1511.06434. “Deep 3D convolutional encoder networks with shortcuts for
[28] M. Mirza and S. Osindero, “Conditional generative adversarial multiscale feature integration applied to multiple sclerosis lesion
nets,” 2014, arXiv:1411.1784. segmentation,” IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1229–
[29] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” 1239, May 2016.
2017, arXiv: 1701.07875. [53] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S.
[30] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net- Belongie, “Feature pyramid networks for object detection,” in
works for semantic segmentation,” in Proc. IEEE Conf. Comput. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–
Vis. Pattern Recognit., 2015, pp. 3431–3440. 2125.
[31] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic [54] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
brain tumor segmentation using cascaded anisotropic convolu- network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
tional neural networks,” in Proc. Int. MICCAI Brainlesion Work- pp. 2881–2890.
shop, 2017, pp. 178–190. [55] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction
[32] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional and refinement for semantic segmentation,” in Proc. Eur. Conf.
instance-aware semantic segmentation,” in Proc. IEEE Conf. Com- Comput. Vis., 2016, pp. 519–534.
put. Vis. Pattern Recognit., 2017, pp. 2359–2367. [56] J. He, Z. Deng, and Y. Qiao, “Dynamic multi-scale filters for
[33] Y. Yuan, M. Chao, and Y.-C. Lo, “Automatic skin lesion segmen- semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis.,
tation using deep fully convolutional networks with Jaccard dis- 2019, pp. 3562–3572.
tance,” IEEE Trans. Med. Imag., vol. 36, no. 9, pp. 1876–1886, Sep. [57] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang, “Context
2017. contrasted feature and gated multi-scale aggregation for scene
[34] N. Liu, H. Li, M. Zhang, J. Liu, Z. Sun, and T. Tan, “Accurate segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
iris segmentation in non-cooperative environments using fully 2018, pp. 2393–2402.
convolutional networks,” in Proc. Int. Conf. Biometrics, 2016, [58] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyra-
pp. 1–8. mid context network for semantic segmentation,” in Proc. Conf.
[35] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider Comput. Vis. Pattern Recognit., 2019, pp. 7519–7528.
to see better,” 2015, arXiv:1506.04579. [59] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, “Multi-
[36] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. scale context intertwining for semantic segmentation,” in Proc.
Yuille, “Semantic image segmentation with deep convolutional Eur. Conf. Comput. Vis., 2018, pp. 603–619.
nets and fully connected CRFs,” 2014, arXiv:1412.7062. [60] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object
[37] A. G. Schwing and R. Urtasun, “Fully connected deep structured segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
networks,” 2015, arXiv:1503.02351. 2017, pp. 2386–2395.
[38] S. Zheng et al., “Conditional random fields as recurrent neural [61] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1529– real-time object detection with region proposal networks,” in
1537. Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.


[62] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in [88] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention
Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969. to scale: Scale-aware semantic image segmentation,” in Proc.
[63] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3640–3649.
for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pat- [89] Q. Huang et al., “Semantic segmentation with reverse attention,”
tern Recognit., 2018, pp. 8759–8768. 2017, arXiv: 1707.06426.
[64] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation [90] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network
via multi-task network cascades,” in Proc. IEEE Conf. Comput. for semantic segmentation,” 2018, arXiv: 1805.10180.
Vis. Pattern Recognit., 2016, pp. 3150–3158. [91] J. Fu et al., “Dual attention network for scene segmentation,” in
[65] R. Hu, P. Doll ar, K. He, T. Darrell, and R. Girshick, “Learning to Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3146–
segment every thing,” in Proc. IEEE Conf. Comput. Vis. Pattern 3154.
Recognit., 2018, pp. 4233–4241. [92] Y. Yuan and J. Wang, “OCNet: Object context network for scene
[66] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, parsing,” 2018, arXiv: 1809.00916.
and H. Adam, “MaskLab: Instance segmentation by refining [93] H. Zhang et al., “ResNeSt: Split-attention networks,” 2020, arXiv:
object detection with semantic and direction features,” in Proc. 2004.08955.
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4013–4022. [94] S. Choi, J. T. Kim, and J. Choo, “Cars can’t fly up in the sky:
[67] X. Chen, R. Girshick, K. He, and P. Dollar, “TensorMask: A foun- Improving urban-scene segmentation via height-driven attention
dation for dense object segmentation,” 2019, arXiv: 1903.12174. networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
[68] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via 2020, pp. 9373–9383.
region-based fully convolutional networks,” in Proc. 30th Int. [95] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, “Expectation-
Conf. Neural Inf. Process. Syst., 2016, pp. 379–387. maximization attention networks for semantic segmentation,” in
[69] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9167–9176.
object candidates,” in Proc. 28th Int. Conf. Neural Inf. Process. [96] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu,
Syst., 2015, pp. 1990–1998. “CCNet: Criss-cross attention for semantic segmentation,” in
[70] E. Xie et al., “PolarMask: Single shot instance segmentation with Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 603–612.
polar representation,” 2019, arXiv: 1909.13226. [97] M. Ren and R. S. Zemel, “End-to-end instance segmentation with
[71] Z. Hayder, X. He, and M. Salzmann, “Boundary-aware instance recurrent attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., ognit., 2017, pp. 6656–6664.
2017, pp. 5696–5704. [98] H. Zhao et al., “PSANet: Point-wise spatial attention network
[72] Y. Lee and J. Park, “CenterMask: Real-time anchor-free instance for scene parsing,” in Proc. Eur. Conf. Comput. Vis., 2018,
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 267–283.
2020, pp. 13 906–13 915. [99] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Learning a
[73] M. Bai and R. Urtasun, “Deep watershed transform for instance discriminative feature network for semantic segmentation,” in
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1857–
2017, pp. 5221–5229. 1866.
[74] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: Real-time [100] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic seg-
instance segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., mentation using adversarial networks,” 2016, arXiv:1611.08408.
2019, pp. 9157–9166. [101] N. Souly, C. Spampinato, and M. Shah, “Semi supervised seman-
[75] A. Fathi et al., “Semantic instance segmentation via deep metric tic segmentation using generative adversarial network,” in Proc.
learning,” 2017, arXiv: 1703.10277. IEEE Int. Conf. Comput. Vis., 2017, pp. 5688–5696.
[76] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. [102] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang,
Yuille, “DeepLab: Semantic image segmentation with deep con- “Adversarial learning for semi-supervised semantic
volutional nets, atrous convolution, and fully connected CRFs,” segmentation,” 2018, arXiv: 1802.07934.
IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, [103] Y. Xue, T. Xu, H. Zhang, L. R. Long, and X. Huang, “SegAN:
Apr. 2018. Adversarial network with multi-scale L1 loss for medical image
[77] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated segmentation,” Neuroinformatics, vol. 16, no. 3/4, pp. 383–392,
convolutions,” 2015, arXiv:1511.07122. 2018.
[78] P. Wang et al., “Understanding convolution for semantic [104] M. Majurski et al., “Cell image segmentation using generative
segmentation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., adversarial networks, transfer learning, and augmentations,” in
2018, pp. 1451–1460. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019,
[79] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “DenseASPP for pp. 1114–1122.
semantic segmentation in street scenes,” in Proc. IEEE Conf. Com- [105] K. Ehsani, R. Mottaghi, and A. Farhadi, “SegAN: Segmenting
put. Vis. Pattern Recognit., 2018, pp. 3684–3692. and generating the invisible,” in Proc. IEEE Conf. Comput. Vis.
[80] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A Pattern Recognit., 2018, pp. 6144–6153.
deep neural network architecture for real-time semantic [106] T. F. Chan and L. A. Vese, “Active contours without edges,” IEEE
segmentation,” 2016, arXiv:1606.02147. Trans. Image Process., vol. 10, no. 2, pp. 266–277, Feb. 2001.
[81] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, [107] X. Chen, B. M. Williams, S. R. Vallabhaneni, G. Czanner, R.
“Encoder-decoder with atrous separable convolution for seman- Williams, and Y. Zheng, “Learning active contour models for
tic image segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, medical image segmentation,” in Proc. IEEE Conf. Comput. Vis.
pp. 801–818. Pattern Recognit., 2019, pp. 11 632–11 640.
[82] F. Visin et al., “ReSeg: A recurrent neural network-based model [108] S. Gur, L. Wolf, L. Golgher, and P. Blinder, “Unsupervised
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pat- microvascular image segmentation using an active contours
tern Recognit. Workshops, 2016, pp. 41–48. mimicking neural network,” in Proc. IEEE Int. Conf. Comput. Vis.,
[83] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. 2019, pp. 10 722–10 731.
Bengio, “ReNet: A recurrent neural network based alternative to [109] P. Marquez-Neila, L. Baumela, and L. Alvarez, “A morphological
convolutional networks,” 2015, arXiv:1505.00393. approach to curvature-based evolution of curves and surfaces,”
[84] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 2–17, Jan.
with LSTM recurrent neural networks,” in Proc. IEEE Conf. Com- 2014.
put. Vis. Pattern Recognit., 2015, pp. 3547–3555. [110] T. H. N. Le, K. G. Quach, K. Luu, C. N. Duong, and M. Savvides,
[85] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object “Reformulating level sets as deep recurrent neural network
parsing with graph LSTM,” in Proc. Eur. Conf. Comput. Vis., 2016, approach to semantic segmentation,” IEEE Trans. Image Process.,
pp. 125–143. vol. 27, no. 5, pp. 2393–2407, May 2018.
[86] Y. Xiang and D. Fox, “DA-RNN: Semantic mapping with data [111] C. Rupprecht, E. Huaroc, M. Baust, and N. Navab, “Deep active
associated recurrent neural networks,” 2017, arXiv: 1703.03098. contours,” 2016, arXiv:1607.05074.
[87] R. Hu, M. Rohrbach, and T. Darrell, “Segmentation from natural [112] A. Hatamizadeh et al., “Deep active lesion segmentation,” in
language expressions,” in Proc. Eur. Conf. Comput. Vis., 2016, Proc. Int. Workshop Mach. Learn. Med. Imag., 2019, vol. 11861,
pp. 108–124. pp. 98–105.


[113] D. Marcos et al., “Learning deep structured active contours end- [138] X. Zhang, H. Xu, H. Mo, J. Tan, C. Yang, and W. Ren, “DCNAS:
to-end,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, Densely connected neural architecture search for semantic image
pp. 8877–8885. segmentation,” 2020, arXiv: 2003.11883.
[114] D. Cheng, R. Liao, S. Fidler, and R. Urtasun, “DARNet: Deep [139] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale
active ray network for building segmentation,” in Proc. IEEE attention for semantic segmentation,” 2020, arXiv: 2005.10821.
Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7431–7439. [140] D. Seichter, M. K€ ohler, B. Lewandowski, T. Wengefeld, and
[115] A. Hatamizadeh, D. Sengupta, and D. Terzopoulos, “End-to-end H.-M. Gross, “Efficient RGB-D semantic segmentation for
trainable deep active contour models for automated image seg- indoor scene analysis,” 2020, arXiv: 2011.06961.
mentation: Delineating buildings in aerial imagery,” in Proc. Eur. [141] M. Zhen et al., “Joint semantic segmentation and boundary detec-
Conf. Comput. Vis., 2020, pp. 730–746. tion using iterative pyramid contexts,” in Proc. IEEE/CVF Conf.
[116] H. Zhang et al., “Context encoding for semantic segmentation,” Comput. Vis. Pattern Recognit., 2020, pp. 13 666–13 675.
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7151– [142] Y. Li et al., “Learning dynamic routing for semantic
7160. segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Rec-
[117] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path ognit., 2020, pp. 8553–8562.
refinement networks for high-resolution semantic segmentation,” [143] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1925– “Panoptic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
1934. Recognit., 2019, pp. 9404–9413.
[118] G. Song, H. Myeong, and K. M. Lee, “SeedNet: Automatic seed [144] A. Kirillov, R. Girshick, K. He, and P. Dollar, “Panoptic feature
generation with deep reinforcement learning for robust interac- pyramid networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
tive segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- ognit., 2019, pp. 6399–6408.
ognit., 2018, pp. 1760–1768. [145] Y. Li et al., “Attention-guided unified network for panoptic
[119] J. Dai, K. He, and J. Sun, “BoxSup: Exploiting bounding boxes to segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
supervise convolutional networks for semantic segmentation,” 2019, pp. 7019–7028.
in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1635–1643. [146] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder, “Seamless
[120] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel mat- scene segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
ters — Improve semantic segmentation by global convolutional ognit., 2019, pp. 8277–8286.
network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, [147] B. Cheng et al., “Panoptic-DeepLab,” 2019, arXiv: 1910.04751.
pp. 4353–4361. [148] Y. Xiong et al., “UPSNet: A unified panoptic segmentation
[121] Z. Wu, C. Shen, and A. Van Den Hengel, “Wider or deeper: network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019,
Revisiting the ResNet model for visual recognition,” Pattern Rec- pp. 8818–8826.
ognit., vol. 90, pp. 119–133, 2019. [149] R. Mohan and A. Valada, “EfficientPS: Efficient panoptic
[122] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “ExFuse: segmentation,” 2020, arXiv: 2004.02307.
Enhancing feature fusion for semantic segmentation,” in Proc. [150] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A.
Eur. Conf. Comput. Vis., 2018, pp. 269–284. Zisserman, “The PASCAL visual object classes (VOC) challenge,”
[123] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
“Feedforward semantic segmentation with zoom-out features,” [151] 2012. [Online]. Available: http://host.robots.ox.ac.uk/pascal/
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3376– VOC/voc2012/
3385. [152] R. Mottaghi et al., “The role of context for object detection and
[124] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video semantic segmentation in the wild,” in Proc. IEEE Conf. Comput.
object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- Vis. Pattern Recognit., 2014, pp. 891–898.
ognit., 2015, pp. 3395–3402. [153] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in
[125] P. Luo, G. Wang, L. Lin, and X. Wang, “Deep dual learning for Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
semantic image segmentation,” in Proc. IEEE Int. Conf. Comput. [154] M. Cordts et al., “The cityscapes dataset for semantic urban scene
Vis., 2017, pp. 2718–2726. understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
[126] X. Li et al., “FoveaNet: Perspective-aware urban scene parsing,” nit., 2016, pp. 3213–3223.
in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 784–792. [155] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing:
[127] I. Kreso, S. Segvic, and J. Krapac, “Ladder-style densenets for Label transfer via dense scene alignment,” in Proc. IEEE Conf.
semantic segmentation of large natural images,” in Proc. IEEE Comput. Vis. Pattern Recognit., 2009, pp. 1972–1979.
Int. Conf. Comput. Vis., 2017, pp. 238–245. [156] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene into
[128] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: geometric and semantically consistent regions,” in Proc. Int. Conf.
Bilateral segmentation network for real-time semantic Comput. Vis., 2009, pp. 1–8.
segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 325–341. [157] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of
[129] B. Cheng et al., “SPGNet: Semantic prediction guidance for scene human segmented natural images and its application to evaluat-
parsing,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5218–5228. ing segmentation algorithms and measuring ecological
[130] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-SCNN: statistics,” in Proc. Int. Conf. Comput. Vis., 2001, vol. 2, pp. 416–
Gated shape CNNs for semantic segmentation,” in Proc. IEEE Int. 423.
Conf. Comput. Vis., 2019, pp. 5229–5238. [158] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari,
[131] J. Fu et al., “Adaptive context network for scene parsing,” in Proc. “Learning object class detectors from weakly annotated video,”
IEEE Int. Conf. Comput. Vis., 2019, pp. 6748–6757. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3282–
[132] X. Liang, H. Zhou, and E. Xing, “Dynamic-structured semantic 3289.
propagation network,” in Proc. IEEE Conf. Comput. Vis. Pattern [159] S. D. Jain and K. Grauman, “Supervoxel-consistent foreground
Recognit., 2018, pp. 752–761. propagation in video,” in Proc. Eur. Conf. Comput. Vis., 2014,
[133] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing, “Symbolic pp. 656–671.
graph reasoning meets convolutions,” in Proc. 32nd Int. Conf. [160] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets
Neural Inf. Process. Syst., 2018, pp. 1853–1863. robotics: The KITTI dataset,” The Int. J. Robot. Res., vol. 32, no. 11,
[134] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, pp. 1231–1237, 2013.
“Scene parsing through ADE20K dataset,” in Proc. IEEE Conf. [161] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road
Comput. Vis. Pattern Recognit., 2017, pp. 1858–1868. scene segmentation from a single image,” in Proc. Eur. Conf. Com-
[135] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan, “Scale-adaptive put. Vis., 2012, pp. 376–389.
convolutions for scene parsing,” in Proc. IEEE Int. Conf. Comput. [162] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik,
Vis., 2017, pp. 2031–2039. “Semantic contours from inverse detectors,” in Proc. Int. Conf.
[136] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual Comput. Vis., 2011, pp. 991–998.
parsing for scene understanding,” in Proc. Eur. Conf. Comput. [163] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille,
Vis., 2018, pp. 418–434. “Detect what you can: Detecting and representing objects using
[137] B. Zoph et al., “Rethinking pre-training and self-training,” 2020, holistic models and body parts,” in Proc. IEEE Conf. Comput. Vis.
arXiv: 2006.06882. Pattern Recognit., 2014, pp. 1971–1978.


[164] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, [188] V. Goel, J. Weng, and P. Poupart, “Unsupervised video object
“The SYNTHIA dataset: A large collection of synthetic images segmentation for deep reinforcement learning,” in Proc. 32nd Int.
for semantic segmentation of urban scenes,” in Proc. IEEE Conf. Conf. Neural Inf. Process. Syst., 2018, pp. 5683–5694.
Comput. Vis. Pattern Recognit., 2016, pp. 3234–3243. [189] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, “Deep
[165] X. Shen et al., “Automatic portrait segmentation for image styl- learning in remote sensing applications: A meta-analysis and
ization,” Comput. Graph. Forum, vol. 35, no. 2, pp. 93–102, 2016. review,” ISPRS J. Photogrammetry Remote Sens., vol. 152, pp. 166–
[166] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg- 177, 2019.
mentation and support inference from RGBD images,” in Proc. [190] L. Gao, Y. Zhang, F. Zou, J. Shao, and J. Lai, “Unsupervised
Eur. Conf. Comput. Vis., 2012, pp. 746–760. urban scene segmentation via domain adaptation,” Neurocomput-
[167] J. Xiao, A. Owens, and A. Torralba, “Sun3D: A database of big ing, vol. 406, pp. 295–301, 2020.
spaces reconstructed using SFM and object labels,” in Proc. IEEE [191] M. Paoletti, J. Haut, J. Plaza, and A. Plaza, “Deep learning classi-
Int. Conf. Comput. Vis., 2013, pp. 1625–1632. fiers for hyperspectral imaging: A review,” ISPRS J. Photogram-
[168] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun RGB-D: A RGB-D metry Remote Sens., vol. 158, pp. 279–317, 2019.
scene understanding benchmark suite,” in Proc. IEEE Conf. Com- [192] J. F. Abrams et al., “Habitat-Net: Segmentation of habitat images
put. Vis. Pattern Recognit., 2015, pp. 567–576. using deep learning,” Ecological Informat., vol. 51, pp. 121–128,
[169] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. 2019.
Nießner, “ScanNet: Richly-annotated 3D reconstructions of [193] M. Kerkech, A. Hafiane, and R. Canals, “Vine disease detection
indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., in UAV multispectral images using optimized image registration
2017, pp. 5828–5839. and deep learning segmentation approach,” Comput. Electron.
[170] I. Armeni, A. Sax, A. Zamir, and S. Savarese, “Joint 2D-3D- Agriculture, vol. 174, 2020, Art. no. 105446.
semantic data for indoor scene understanding,” Feb. 2017, ArXiv [194] Y. Song, Z. Huang, C. Shen, H. Shi, and D. A. Lange, “Deep
e-prints. learning-based automated image segmentation for concrete pet-
[171] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical rographic analysis,” Cement Concrete Res., vol. 135, 2020, Art.
multi-view RGB-D object dataset,” in Proc. IEEE Int. Conf. Robot. no. 106118.
Autom., 2011, pp. 1817–1824. [195] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X.
[172] C.-Y. Fu, M. Shvets, and A. C. Berg, “RetinaMask: Learning to Ding, “Embracing imperfect datasets: A review of deep learning
predict masks improves state-of-the-art single-shot detection for solutions for medical image segmentation,” Med. Image Anal.,
free,” 2019, arXiv: 1901.03353. vol. 63, 2020, Art. no. 101693.
[173] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar, “Learning to [196] A. Amyar, R. Modzelewski, H. Li, and S. Ruan, “Multi-task
refine object segments,” in Proc. Eur. Conf. Comput. Vis., 2016, deep learning based CT imaging analysis for COVID-19 pneu-
pp. 75–91. monia: Classification and segmentation,” Comput. Biol. Med.,
[174] H. Liu et al., “An end-to-end network for panoptic vol. 126, 2020, Art. no. 104037.
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2019, pp. 6172–6181.
[175] K. Sofiiuk, O. Barinova, and A. Konushin, “AdaptIS: Adaptive
instance selection network,” in Proc. IEEE Int. Conf. Comput. Vis., Shervin Minaee (Member, IEEE) received the
2019, pp. 7355–7363. PhD degree in electrical engineering and com-
[176] J. Lazarow, K. Lee, K. Shi, and Z. Tu, “Learning instance occlu- puter science from New York University, New
sion for panoptic segmentation,” in Proc. IEEE Conf. Comput. Vis. York, in 2018. He is currently a machine learning
Pattern Recognit., 2020, pp. 10 720–10 729. lead in the computer vision team at Snapchat,
[177] Z. Deng, S. Todorovic, and L. J. Latecki, “Semantic segmentation Inc. His research interests include computer
of RGBD images with mutex constraints,” in Proc. IEEE Int. Conf. vision, image segmentation, biometric recogni-
Comput. Vis., 2015, pp. 1733–1741. tion, and applied deep learning. He has published
[178] D. Eigen and R. Fergus, “Predicting depth, surface normals and more than 40 papers and patents during his PhD.
semantic labels with a common multi-scale convolutional He previously worked as a research scientist at
architecture,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, Samsung Research, AT&T Labs, Huawei Labs,
pp. 2650–2658. and as a data scientist at Expedia group. He has been a reviewer
[179] A. Mousavian, H. Pirsiavash, and J. Kosecka, “Joint semantic seg- for more than 20 computer vision related journals from IEEE, ACM,
mentation and depth estimation with deep convolutional Elsevier, and Springer. He has won several awards, including the Best
networks,” in Proc. 4th Int. Conf. 3D Vis., 2016, pp. 611–619. Research Presentation at Samsung Research America, in 2017 and the
[180] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian Seg- Verizon Open Innovation Challenge Award, in 2016.
Net: Model uncertainty in deep convolutional encoder-decoder
architectures for scene understanding,” 2015, arXiv:1511.02680.
[181] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3D graph neural
networks for RGBD semantic segmentation,” in Proc. IEEE Int. Yuri Boykov (Member, IEEE) is currently a pro-
Conf. Comput. Vis., 2017, pp. 5199–5208. fessor at the Cheriton School of Computer Sci-
[182] W. Wang and U. Neumann, “Depth-aware CNN for RGB-D ence, University of Waterloo, Canada. His
segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 135–150. research interests include the area of com-
[183] S. Vandenhende, S. Georgoulis, and L. Van Gool, “MTI-net: puter vision and biomedical image analysis
Multi-scale task interaction networks for multi-task learning,” with focus on modeling and optimization for
2020, arXiv: 2001.06902. structured segmentation, restoration, registra-
[184] S.-J. Park, K.-S. Hong, and S. Lee, “RDFNet: RGB-D multi-level tion, stereo, motion, model fitting, recognition,
residual feature fusion for indoor semantic segmentation,” in photo-video editing and other data analysis
Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4980–4989. problems. He is an editor for the International
[185] J. Jiao, Y. Wei, Z. Jie, H. Shi, R. W. Lau, and T. S. Huang, “Geometry- Journal of Computer Vision (IJCV). His work
aware distillation for indoor semantic segmentation,” in Proc. IEEE was listed among the 10 most influential papers in the IEEE Trans-
Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2869–2878. actions on Pattern Analysis and Machine Intelligence (Top Picks for
[186] Z.-H. Zhou, “A brief introduction to weakly supervised 30 years). In 2017 Google Scholar listed his work on segmentation
learning,” Nat. Sci. Rev., vol. 5, no. 1, pp. 44–53, 2018. as a “classic paper in computer vision and pattern recognition” (from
[187] L. Jing and Y. Tian, “Self-supervised visual feature learning with 2006). In 2011, he received the Helmholtz Prize from the IEEE and
deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. the Test of Time Award from the International Conference on Com-
Intell., early access, May 4, 2020, doi: 10.1109/TPAMI.2020.2992393. puter Vision.


Fatih Porikli (Fellow, IEEE) received the PhD Nasser Kehtarnavaz (Fellow, IEEE) is currently
degree from New York University, New York, in an Erik Jonsson distinguished professor at the
2002. He is currently a senior director at Qual- Department of Electrical and Computer Engineer-
comm, San Diego. He was a full professor with ing and the director of the Embedded Machine
the Research School of Engineering, Australian Learning Laboratory at the University of Texas at
National University, Australia and, until recently, a Dallas, Richardson, Texas. His research interests
vice president at Huawei CBG Device; Hardware, include signal and image processing, machine
San Diego. He led the Computer Vision Research learning, deep learning, and real-time implemen-
Group at NICTA, Australia, and was tation on embedded processors. He has authored
a distinguished research scientist at Mitsubishi or coauthored ten books and more than 400 jour-
Electric Research Laboratories, Cambridge, Mas- nal papers, conference papers, patents, manuals,
sachusetts. He was the recipient of the R&D 100 Scientist of the Year and editorials in these areas. He is a fellow of SPIE, a licensed profes-
Award, in 2006. He won six best paper awards, authored more than 250 sional engineer, and editor-in-chief of the Journal of Real-Time Image
papers, co-edited two books, and invented more than 100 patents. He has Processing.
served as the general chair and technical program chair of many IEEE
conferences and as an associate editor of premier IEEE and Springer
journals for the past 15 years. Demetri Terzopoulos (Fellow, IEEE) received
the PhD degree in artificial intelligence from
the Massachusetts Institute of Technology (MIT),
Antonio Plaza (Fellow, IEEE) received the MSc Cambridge, Massachusetts, in 1984. He is cur-
and PhD degrees from the Department of Tech- rently a distinguished professor and chancellor’s
nology of Computers and Communications, Uni- professor of computer science at the University of
versity of Extremadura, Spain,1999 and 2002, California, Los Angeles, Los Angeles, California,
respectively, both in computer engineering. He is where he directs the UCLA Computer Graphics &
currently a professor at the Department of Tech- Vision Laboratory, and is co-founder and chief
nology of Computers and Communications, Uni- scientist of VoxelCloud, Inc. He is or was a
versity of Extremadura, Spain. He has authored Guggenheim fellow, a fellow of the ACM, IETI,
more than 600 publications, including 300 JCR Royal Society of Canada, and Royal Society of London, and a member of
journal papers (more than 170 in IEEE journals), the European Academy of Sciences, the New York Academy of Sciences,
24 book chapters, and more than 300 peer- and Sigma Xi. Among his many awards are an Academy Award from the
reviewed conference proceedings papers. He is a recipient of the Best Academy of Motion Picture Arts and Sciences for his pioneering work
Column Award of the IEEE Signal Processing Magazine, in 2015, the on physics-based computer animation, and the Computer Pioneer
2013 Best Paper Award of the IEEE Journal of Selected Topics in Award, Helmholtz Prize, and inaugural Computer Vision Distinguished
Applied Earth Observations and Remote Sensing journal, and the most Researcher Award from the IEEE for his pioneering and sustained
highly cited paper (2005–2010) in the Journal of Parallel and Distributed research on deformable models and their applications. Deformable
Computing. He is included in the 2018, 2019, and 2020 Highly Cited models, a term he coined, is listed in the IEEE Taxonomy.
Researchers List.

" For more information on this or any other computing topic,


please visit our Digital Library at www.computer.org/csdl.

