Image Segmentation Using Deep Learning: A Survey
Abstract—Image segmentation is a key task in computer vision and image processing with important applications such as scene
understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among
others, and numerous segmentation algorithms are found in the literature. Against this backdrop, the broad success of deep learning
(DL) has prompted the development of new image segmentation approaches leveraging DL models. We provide a comprehensive
review of this recent literature, covering the spectrum of pioneering efforts in semantic and instance segmentation, including
convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks,
visual attention models, and generative models in adversarial settings. We investigate the relationships, strengths, and challenges of
these DL-based segmentation models, examine the widely used datasets, compare performances, and discuss promising research
directions.
Index Terms—Image segmentation, deep learning, convolutional neural networks, encoder-decoder models, recurrent models, generative
models, semantic segmentation, instance segmentation, panoptic segmentation, medical image segmentation
Shervin Minaee is with Snapchat Machine Learning Research, Venice, CA 90405 USA. E-mail: shervin.minaee@nyu.edu.
Yuri Boykov is with the University of Waterloo, Waterloo, ON N2L 3G1, Canada. E-mail: yboykov@uwaterloo.ca.
Fatih Porikli is with the Australian National University, Canberra, ACT 0200, Australia, and also with Huawei, San Diego, CA 92121 USA. E-mail: fatih.porikli@anu.edu.au.
Antonio Plaza is with the University of Extremadura, 06006 Badajoz, Spain. E-mail: aplaza@unex.es.
Nasser Kehtarnavaz is with the University of Texas at Dallas, Richardson, TX 75080 USA. E-mail: kehtar@utdallas.edu.
Demetri Terzopoulos is with the University of California, Los Angeles, Los Angeles, CA 90095 USA. E-mail: dt@cs.ucla.edu.

1 INTRODUCTION
Image segmentation has been a fundamental problem in computer vision since the early days of the field [1] (Chapter 8). An essential component of many visual understanding systems, it involves partitioning images (or video frames) into multiple segments and objects [2] (Chapter 5) and plays a central role in a broad range of applications [3] (Part VI), including medical image analysis (e.g., tumor boundary extraction and measurement of tissue volumes), autonomous vehicles (e.g., navigable surface and pedestrian detection), video surveillance, and augmented reality, to name a few.
Image segmentation can be formulated as the problem of classifying pixels with semantic labels (semantic segmentation), or partitioning of individual objects (instance segmentation), or both (panoptic segmentation). Semantic segmentation performs pixel-level labeling with a set of object categories (e.g., human, car, tree, sky) for all image pixels; thus, it is generally a more demanding undertaking than whole-image classification, which predicts a single label for the entire image. Instance segmentation extends the scope of semantic segmentation by detecting and delineating each object of interest in the image (e.g., individual people).
Numerous image segmentation algorithms have been developed in the literature, from the earliest methods, such as thresholding [4], histogram-based bundling, region-growing [5], k-means clustering [6], and watershed methods [7], to more advanced algorithms such as active contours [8], graph cuts [9], conditional and Markov random fields [10], and sparsity-based [11], [12] methods. In recent years, however, deep learning (DL) models have yielded a new generation of image segmentation models with remarkable performance improvements, often achieving the highest accuracy rates on popular benchmarks (e.g., Fig. 1). This has caused a paradigm shift in the field.
This survey, a revised version of [14], covers the recent literature in deep-learning-based image segmentation, including more than 100 such segmentation methods proposed to date. It provides a comprehensive review with insights into different aspects of these methods, including the training data, the choice of network architectures, loss functions, training strategies, and their key contributions. The target literature is organized into the following categories:
1) Fully convolutional networks
2) Convolutional models with graphical models
3) Encoder-decoder based models
4) Multiscale and pyramid network based models
5) R-CNN based models (for instance segmentation)
6) Dilated convolutional models and DeepLab family
7) Recurrent neural network based models
Several architectural features are common among many of these methods, such as encoders and decoders, skip-connections, multiscale architectures, and, more recently, the use of dilated convolutions. It is convenient to group models based on their architectural contributions over prior models.
Fig. 9. The ParseNet (e) uses extra global context to produce a segmentation (d) smoother than that of an FCN (c). From [35].

The resulting global and local feature maps are concatenated, which amounts to an FCN whose convolutional layers are replaced by the described module (Fig. 9e).

3.2 CNNs With Graphical Models
As discussed, the FCN ignores potentially useful scene-level semantic context. To exploit more context, several approaches incorporate into DL architectures probabilistic graphical models, such as Conditional Random Fields (CRFs) and Markov Random Fields (MRFs).
Due to the invariance properties that make CNNs good for high-level tasks such as classification, responses from the later layers of deep CNNs are not sufficiently well localized for accurate object segmentation. To address this drawback, Chen et al. [36] proposed a semantic segmentation algorithm that combines CNNs and fully-connected CRFs (Fig. 10). They showed that their model can localize segment boundaries with higher accuracy than was possible with previous methods.

Fig. 10. A CNN+CRF model. From [36].

Schwing and Urtasun [37] proposed a fully-connected deep structured network for image segmentation. They jointly trained CNNs and fully-connected CRFs for semantic image segmentation, and achieved encouraging results on the challenging PASCAL VOC 2012 dataset. Zheng et al. [38] proposed a similar semantic segmentation approach. In related work, Lin et al. [39] proposed an efficient semantic segmentation model based on contextual deep CRFs. They explored "patch-patch" context (between image regions) and "patch-background" context to improve semantic segmentation through the use of contextual information.
Liu et al. [40] proposed a semantic segmentation algorithm that incorporates rich information into MRFs, including high-order relations and mixture of label contexts. Unlike previous efforts that optimized MRFs using iterative algorithms, they proposed a CNN model, namely a Parsing Network, which enables deterministic end-to-end computation in one pass.
3.3 Encoder-Decoder Based Models
Most of the popular DL-based segmentation models use some kind of encoder-decoder architecture. We group these models into two categories: those for general image segmentation, and those for medical image segmentation.

3.3.1 General Image Segmentation
Noh et al. [41] introduced semantic segmentation based on deconvolution (a.k.a. transposed convolution). Their model, DeConvNet (Fig. 11), consists of two parts, an encoder using convolutional layers adopted from the VGG 16-layer network and a multilayer deconvolutional network that inputs the feature vector and generates a map of pixel-accurate class probabilities. The latter comprises deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks.

Fig. 11. Deconvolutional semantic segmentation. From [41].

Badrinarayanan et al. [25] proposed SegNet, a fully convolutional encoder-decoder architecture for image segmentation (Fig. 12). Similar to the deconvolution network, the core trainable segmentation engine of SegNet consists of an encoder network, which is topologically identical to the 13 convolutional layers of the VGG16 network, and a corresponding decoder network followed by a pixel-wise classification layer. The main novelty of SegNet is in the way the decoder upsamples its lower-resolution input feature map(s); specifically, using pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear up-sampling.
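As an illustration of the pooling-indices idea, the following is a minimal PyTorch sketch (an illustrative toy, not SegNet's actual implementation) of an encoder-decoder block that stores max-pooling indices in the encoder and reuses them for non-linear upsampling in the decoder.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal SegNet-style encoder-decoder: max-pooling indices from the
    encoder drive MaxUnpool2d in the decoder (no learned upsampling)."""
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True),
                                 nn.Conv2d(64, num_classes, 3, padding=1))

    def forward(self, x):
        f = self.enc(x)
        p, idx = self.pool(f)        # remember the argmax locations
        u = self.unpool(p, idx)      # sparse, location-preserving upsampling
        return self.dec(u)           # per-pixel class scores

scores = TinySegNet()(torch.randn(1, 3, 64, 64))   # -> (1, 21, 64, 64)
```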
A limitation of encoder-decoder based models is the loss of fine-grained image information, due to the loss of resolution through the encoding process. HRNet [42] (Fig. 13) addresses this shortcoming. Rather than recovering high-resolution representations as is done in DeConvNet, SegNet, and other models, HRNet maintains high-resolution representations through the encoding process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging the information across resolutions. There are four stages: the 1st stage consists of high-resolution convolutions, while the 2nd/3rd/4th stage repeats 2-resolution/3-resolution/4-resolution blocks. Several recent semantic segmentation models use HRNet as a backbone.
Several other works adopt transposed convolutions or encoder-decoders for image segmentation, such as Stacked Deconvolutional Network (SDN) [43], LinkNet [44], W-Net [45], and locality-sensitive deconvolution networks for RGB-D segmentation [46].
3.3.2 Medical and Biomedical Image Segmentation
Several models inspired by FCNs and encoder-decoder networks were initially developed for medical/biomedical image segmentation, but are now also being used outside the medical domain.
Ronneberger et al. [47] proposed the U-Net (Fig. 14) for efficiently segmenting biological microscopy images. The U-Net architecture comprises two parts, a contracting path to capture context, and a symmetric expanding path that enables precise localization. The U-Net training strategy relies on the use of data augmentation to learn effectively from very few annotated images. It was trained on 30 transmitted light microscopy images, and it won the ISBI cell tracking challenge 2015 by a large margin.

Fig. 14. The U-Net model. From [47].

Various extensions of U-Net have been developed for different kinds of images and problem domains; for example, Zhou et al. [48] developed a nested U-Net architecture, Zhang et al. [49] developed a road segmentation algorithm based on U-Net, and Çiçek et al. [50] proposed a U-Net architecture for 3D images.
V-Net (Fig. 15), proposed by Milletari et al. [51] for 3D medical image segmentation, is another well-known FCN-based model. The authors introduced a new loss function based on the Dice coefficient, enabling the model to deal with situations in which there is a strong imbalance between the number of voxels in the foreground and background. The network was trained end-to-end on MRI images of the prostate and learns to predict segmentation for the whole volume at once. Other relevant works on medical image segmentation include the Progressive Dense V-Net for automatic segmentation of pulmonary lobes from chest CT images, and the 3D-CNN encoder for lesion segmentation [52].

Fig. 15. The V-Net model for 3D image segmentation. From [51].
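The Dice-based loss popularized by V-Net can be written compactly; the following is a hedged sketch of a standard soft-Dice formulation for binary volumetric masks, not necessarily the exact variant used in [51].

```python
import torch

def soft_dice_loss(pred_logits, target, eps=1e-6):
    """Soft Dice loss for a binary segmentation volume.

    pred_logits: (N, 1, D, H, W) raw network outputs.
    target:      (N, 1, D, H, W) binary ground-truth masks.
    Dice = 2|P∩G| / (|P|+|G|); the loss 1 - Dice remains well behaved
    when foreground voxels are rare.
    """
    p = torch.sigmoid(pred_logits)
    dims = (1, 2, 3, 4)
    intersection = (p * target).sum(dims)
    denom = p.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)
    return (1 - dice).mean()

loss = soft_dice_loss(torch.randn(2, 1, 8, 32, 32),
                      torch.randint(0, 2, (2, 1, 8, 32, 32)).float())
```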
3.4 Multiscale and Pyramid Network Based Models
Multiscale analysis, a well-established idea in image processing, has been deployed in various neural network architectures. One of the most prominent models of this sort is the Feature Pyramid Network (FPN) proposed by Lin et al. [53], which was developed for object detection but was also applied to segmentation. The inherent multiscale, pyramidal hierarchy of deep CNNs was used to construct feature pyramids with marginal extra cost. To merge low and high resolution features, the FPN is composed of a bottom-up pathway, a top-down pathway, and lateral connections. The concatenated feature maps are then processed by a 3×3 convolution to produce the output of each stage. Finally, each stage of the top-down pathway generates a prediction to detect an object. For image segmentation, the authors use two multilayer perceptrons (MLPs) to generate the masks.
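A minimal sketch of the top-down/lateral fusion just described is shown below; the two-level pyramid and channel sizes are illustrative assumptions, and element-wise addition is used for the merge (concatenation followed by a 3×3 convolution is an equally common variant).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelFPN(nn.Module):
    """Toy FPN head: 1x1 lateral convolutions align channels, the coarser map
    is upsampled and added, and a 3x3 convolution smooths the result."""
    def __init__(self, c_low=256, c_high=512, c_out=256):
        super().__init__()
        self.lat_low = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.lat_high = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, feat_low, feat_high):
        # feat_low: high-resolution shallow features; feat_high: coarse deep features
        top = self.lat_high(feat_high)
        top_up = F.interpolate(top, size=feat_low.shape[-2:], mode="nearest")
        fused = self.lat_low(feat_low) + top_up      # lateral + top-down
        return self.smooth(fused)

p2 = TwoLevelFPN()(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32))
```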
Zhao et al. [54] developed the Pyramid Scene Parsing Network (PSPN), a multiscale network to better learn the global context representation of a scene (Fig. 16). Multiple patterns are extracted from the input image using a residual network (ResNet) as a feature extractor, with a dilated network. These feature maps are then fed into a pyramid pooling module to distinguish patterns of different scales. They are pooled at four different scales, each one corresponding to a pyramid level, and processed by a 1×1 convolutional layer to reduce their dimensions. The outputs of the pyramid levels are up-sampled and concatenated with the initial feature maps to capture both local and global context information. Finally, a convolutional layer is used to generate the pixel-wise predictions.

Fig. 16. The PSPN architecture. From [54].

Ghiasi and Fowlkes [55] developed a multiresolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps.
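The pyramid pooling module of PSPN described above can be sketched as follows; the bin sizes (1, 2, 3, 6) match a common PSPNet configuration, while the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool features at several grid sizes, reduce channels with 1x1 convs,
    upsample, and concatenate with the input feature map."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, kernel_size=1))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramid = [x]
        for stage in self.stages:
            p = stage(x)                                  # (N, out_ch, b, b)
            pyramid.append(F.interpolate(p, size=(h, w),
                                         mode="bilinear", align_corners=False))
        return torch.cat(pyramid, dim=1)                  # local + global context

out = PyramidPooling()(torch.randn(1, 2048, 16, 16))      # -> (1, 4096, 16, 16)
```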
Fig. 33. Timeline of representative DL-based image segmentation algorithms. Orange, green, and yellow blocks indicate semantic, instance, and
panoptic segmentation algorithms, respectively.
Hatamizadeh et al. [115] proposed a framework dubbed Trainable Deep Active Contours (TDAC). Going beyond [112], they implemented the locally-parameterized level-set ACM in the form of additional convolutional layers following the layers of the backbone FCN, exploiting TensorFlow's automatic differentiation mechanism to backpropagate training error gradients throughout the entire TDAC framework. The fully-automated model requires no intervention either during training or segmentation, can naturally segment multiple instances of objects of interest, and deal with arbitrary object shapes including sharp corners.
3.11 Other Models
Other popular DL architectures for image segmentation include the following: Context Encoding Network (EncNet) [116] uses a basic feature extractor and feeds the feature maps into a context encoding module. RefineNet [117] is a multipath refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. Seednet [118] introduced an automatic seed generation technique with deep reinforcement learning that learns to solve the interactive segmentation problem. Object-Contextual Representations (OCR) [42] learns object regions and the relation between each pixel and each object region, augmenting the representation of each pixel with the object-contextual representation. Additional models and methods include BoxSup [119], Graph Convolutional Networks (GCN) [120], Wide ResNet [121], Exfuse [122] (enhancing low-level and high-level features fusion), Feedforward-Net [123], saliency-aware models for geodesic video segmentation [124], Dual Image Segmentation (DIS) [125], FoveaNet [126] (perspective-aware scene parsing), Ladder DenseNet [127], Bilateral Segmentation Network (BiSeNet) [128], Semantic Prediction Guidance for Scene Parsing (SPGNet) [129], gated shape CNNs [130], Adaptive Context Network (AC-Net) [131], Dynamic-Structured Semantic Propagation Network (DSSPN) [132], Symbolic Graph Reasoning (SGR) [133], CascadeNet [134], Scale-Adaptive Convolutions (SAC) [135], Unified Perceptual Parsing Network (UperNet) [136], segmentation by re-training and self-training [137], densely connected neural architecture search [138], hierarchical multiscale attention [139], Efficient RGB-D Semantic Segmentation (ESA-Net) [140], Iterative Pyramid Contexts [141], and Learning Dynamic Routing for Semantic Segmentation [142].
Panoptic segmentation [143] is growing in popularity. Efforts in this direction include Panoptic Feature Pyramid Network (PFPN) [144], attention-guided network for panoptic segmentation [145], seamless scene segmentation [146], panoptic DeepLab [147], unified panoptic segmentation network [148], and efficient panoptic segmentation [149].
Fig. 33 provides a timeline of some of the most representative DL image segmentation models since 2014.

4 DATASETS
In this section we survey the image datasets most commonly used to train and test DL image segmentation models, grouping them into 3 categories—2D (pixel) images, 2.5D RGB-D (color+depth) images, and 3D (voxel) images—and provide details about the characteristics of each dataset.
Data augmentation is often used to increase the number of labeled samples, especially for small datasets such as those in the medical imaging domain, thus improving the performance of DL segmentation models. A set of transformations is applied either in the data space, or feature space, or both (i.e., both the image and the segmentation map). Typical transformations include translation, reflection, rotation, warping, scaling, color space shifting, cropping, and projections onto principal components. Data augmentation can also benefit by yielding faster convergence, decreasing the chance of over-fitting, and enhancing generalization. For some small datasets, data augmentation has been shown to boost model performance by more than 20 percent.
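Because the same geometric transform must be applied to the image and its segmentation map, augmentation pipelines typically transform the pair jointly; a minimal sketch follows, in which the particular transforms and their ranges are illustrative assumptions.

```python
import numpy as np

def augment_pair(image, mask, rng=np.random.default_rng()):
    """Apply the same random flip/rotation to an image and its label map.

    image: (H, W, 3) float array in [0, 1]; mask: (H, W) integer label map.
    Geometric transforms are shared; photometric jitter touches only the image.
    """
    if rng.random() < 0.5:                       # horizontal reflection
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = rng.integers(0, 4)                       # rotation by k * 90 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    image = np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0)   # brightness shift
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

img_aug, msk_aug = augment_pair(np.random.rand(128, 128, 3),
                                np.random.randint(0, 21, (128, 128)))
```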
4.1 2D Image Datasets
The bulk of image segmentation research has focused on 2D images; therefore, many 2D image segmentation datasets are available. The following are some of the most popular:
PASCAL Visual Object Classes (VOC) [150] is a highly popular dataset in computer vision, with annotated images available for 5 tasks—classification, segmentation, detection, action recognition, and person layout. For the segmentation task, there are 21 labeled object classes and pixels are labeled as background if they do not belong to any of these
classes. The dataset is divided into two sets, training and validation, with 1,464 and 1,449 images, respectively, and a private test set for the actual challenge. Fig. 34 shows an example image and its pixel-wise label.

Fig. 34. An example image from the PASCAL VOC dataset. From [151].

PASCAL Context [152] is an extension of the PASCAL VOC 2010 detection challenge. It includes pixel-wise labels for all the training images. It contains more than 400 classes (including the original 20 classes plus backgrounds from PASCAL VOC segmentation), in three categories (objects, stuff, and hybrids). Many of the object categories of this dataset are too sparse; therefore, a subset of 59 classes is usually selected for use.
Microsoft Common Objects in Context (MS COCO) [153] is a large-scale object detection, segmentation, and captioning dataset. COCO includes images of complex everyday scenes, containing common objects in their natural contexts. This dataset contains photos of 91 object types, with a total of 2.5 million labeled instances in 328K images. Fig. 35 compares MS-COCO labels with those of previous datasets for a sample image.
Cityscapes [154] is a large database with a focus on semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 cities, with high quality pixel-level annotation of 5K frames, in addition to a set of 20K weakly annotated frames. It includes semantic and dense pixel annotations of 30 classes, grouped into 8 categories—flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void. Fig. 36 shows sample segmentation maps from this dataset.

Fig. 36. Segmentation maps from the Cityscapes dataset. From [154].

ADE20K / MIT Scene Parsing (SceneParse150) offers a training and evaluation platform for scene parsing algorithms. The data for this benchmark comes from the ADE20K dataset [134], which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. The benchmark is divided into 20K images for training, 2K images for validation, and another batch of images for testing. There are 150 semantic categories in this dataset.
SiftFlow [155] includes 2,688 annotated images, from a subset of the LabelMe database, of 8 different outdoor scenes, among them streets, mountains, fields, beaches, and buildings, and in one of 33 semantic classes.
Stanford Background [156] comprises outdoor images of scenes from existing datasets, such as LabelMe, MSRC, and PASCAL VOC. It includes 715 images with at least one foreground object. The dataset is pixel-wise annotated, and can be used for semantic scene understanding.
Berkeley Segmentation Dataset (BSD) [157] contains 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. It aims to provide an empirical basis for research on image segmentation and boundary detection. Half of the segmentations were obtained from presenting the subject a color image and the other half from presenting a grayscale image. The public benchmark based on this data consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images and a test set of 100 images.
Youtube-Objects [158] contains videos collected from YouTube, which include objects from ten PASCAL VOC classes (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train). The original dataset did not contain pixel-wise annotations (as it was originally developed for object detection, with weak annotations). However, Jain et al. [159] manually annotated a subset of 126 sequences, and then extracted a subset of frames to further generate semantic labels. In total, there are about 10,167 annotated 480×360 pixel frames available in this dataset.
CamVid is another scene understanding database (with a focus on road/driving scenes) which was originally captured as five video sequences via a camera mounted on the dashboard of a car. A total of 701 frames were provided by sampling from the sequences. These frames were manually annotated into 32 classes.
KITTI [160] is one of the most popular datasets for autonomous driving, containing videos of traffic scenarios, recorded with a variety of sensor modalities (including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner). The original dataset does not contain ground truth for semantic segmentation, but researchers have manually annotated parts of the dataset; e.g., Alvarez et al. [161] generated ground truth for 323 images from the road detection challenge with 3 classes—road, vertical, and sky.
Other datasets for image segmentation purposes include Semantic Boundaries Dataset (SBD) [162], PASCAL Part [163], SYNTHIA [164], and Adobe's Portrait Segmentation [165].
TABLE 1
Accuracies of Segmentation Models on the PASCAL VOC Test Set

Method                        Backbone          mIoU
FCN [30]                      VGG-16            62.2
CRF-RNN [38]                  -                 72.0
CRF-RNN [38]                  -                 74.7
BoxSup* [119]                 -                 75.1
Piecewise [39]                -                 78.0
DPN [40]                      -                 77.5
DeepLab-CRF [76]              ResNet-101        79.7
GCN [120]                     ResNet-152        82.2
Dynamic Routing [142]         -                 84.0
RefineNet [117]               ResNet-152        84.2
Wide ResNet [121]             WideResNet-38     84.9
PSPNet [54]                   ResNet-101        85.4
DeeplabV3 [13]                ResNet-101        85.7
PSANet [98]                   ResNet-101        85.7
EncNet [116]                  ResNet-101        85.9
DFN [99]                      ResNet-101        86.2
Exfuse [122]                  ResNet-101        86.2
SDN* [43]                     DenseNet-161      86.6
DIS [125]                     ResNet-101        86.8
APC-Net [58]                  ResNet-101        87.1
EMANet [95]                   ResNet-101        87.7
DeeplabV3+ [81]               Xception-71       87.8
Exfuse [122]                  ResNeXt-131       87.9
MSCI [59]                     ResNet-152        88.0
EMANet [95]                   ResNet-152        88.2
DeeplabV3+ [81]               Xception-71       89.0
EfficientNet+NAS-FPN [137]    -                 90.5

* Models pre-trained on other datasets (MS-COCO, ImageNet, etc.).

TABLE 2
Accuracies of Segmentation Models on the Cityscapes Dataset

Method                        Backbone               mIoU
SegNet [25]                   -                      57.0
FCN-8s [30]                   -                      65.3
DPN [40]                      -                      66.8
Dilation10 [77]               -                      67.1
DeeplabV2 [76]                ResNet-101             70.4
RefineNet [117]               ResNet-101             73.6
FoveaNet [126]                ResNet-101             74.1
Ladder DenseNet [127]         Ladder DenseNet-169    73.7
GCN [120]                     ResNet-101             76.9
DUC-HDC [78]                  ResNet-101             77.6
Wide ResNet [121]             WideResNet-38          78.4
PSPNet [54]                   ResNet-101             85.4
BiSeNet [128]                 ResNet-101             78.9
DFN [99]                      ResNet-101             79.3
PSANet [98]                   ResNet-101             80.1
DenseASPP [79]                DenseNet-161           80.6
Dynamic Routing [142]         -                      80.7
SPGNet [129]                  2xResNet-50            81.1
DANet [91]                    ResNet-101             81.5
CCNet [96]                    ResNet-101             81.4
DeeplabV3 [13]                ResNet-101             81.3
IPC [141]                     ResNet-101             81.8
AC-Net [131]                  ResNet-101             82.3
OCR [42]                      ResNet-101             82.4
ResNeSt200 [93]               ResNeSt-200            82.7
GS-CNN [130]                  WideResNet             82.8
HA-Net [94]                   ResNext-101            83.2
HRNetV2+OCR [42]              HRNetV2-W48            83.7
Hierarchical MSA [139]        HRNet-OCR              85.1
Reporting runtime and memory footprint, in a reproducible way, is important to industrial applications (such as drones, self-driving cars, robotics, etc.) that may run on embedded systems with limited computational power and storage, thus requiring light-weight models.
The following tables summarize the performances of several of the prominent DL-based segmentation models on different datasets:
Table 1 focuses on the PASCAL VOC test set. Clearly, there has been much improvement in the accuracy of the models since the introduction of the first DL-based image segmentation model, the FCN.
Table 2 focuses on the Cityscapes test dataset. The latest models feature about 23 percent relative gain over the pioneering FCN model on this dataset.
Table 3 focuses on the MS COCO stuff test set. This dataset is more challenging than PASCAL VOC and Cityscapes, as the highest mIoU is approximately 40 percent.
Table 4 focuses on the ADE20k validation set. This dataset is also more challenging than the PASCAL VOC and Cityscapes datasets.
Table 5 provides the performance of prominent instance segmentation algorithms on the COCO test-dev 2017 dataset, in terms of average precision, and their speed.
Table 6 provides the performance of prominent panoptic segmentation algorithms on the MS-COCO val dataset, in terms of panoptic quality [143].
Finally, Table 7 summarizes the performance of several prominent models for RGB-D segmentation on the NYUD-v2 and SUN-RGBD datasets.

TABLE 3
Accuracies of Segmentation Models on the MS COCO Stuff Dataset

Method            Backbone               mIoU
RefineNet [117]   ResNet-101             33.6
CCN [57]          Ladder DenseNet-101    35.7
DANet [91]        ResNet-50              37.9
DSSPN [132]       ResNet-101             37.3
EMA-Net [95]      ResNet-50              37.5
SGR [133]         ResNet-101             39.1
OCR [42]          ResNet-101             39.5
DANet [91]        ResNet-101             39.7
EMA-Net [95]      ResNet-50              39.9
AC-Net [131]      ResNet-101             40.1
OCR [42]          HRNetV2-W48            40.5

In summary, we have witnessed significant improvement in the performance of deep segmentation models over the past 5-6 years, with a relative improvement of 25-42 percent in mIoU on different datasets. However, some publications suffer from lack of reproducibility for multiple reasons—they report performance on non-standard benchmarks/databases, or only on arbitrary subsets of the test set from a popular benchmark, or they do not adequately describe the experimental setup and sometimes evaluate model performance only on a subset of object classes. Most importantly, many publications do not provide the source code for their model implementations. Fortunately, with the increasing popularity of deep learning models, the trend has been positive and many research groups are moving toward reproducible frameworks and open-sourcing their implementations.
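Since most of these tables report mean Intersection-over-Union (mIoU), the following is a small sketch of how mIoU is typically computed from predicted and ground-truth label maps via a confusion matrix; the class count and ignore label are illustrative assumptions.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore_index=255):
    """Compute mIoU from integer label maps of identical shape."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    # confusion[i, j] counts pixels of true class i predicted as class j
    confusion = np.bincount(num_classes * gt + pred,
                            minlength=num_classes ** 2
                            ).reshape(num_classes, num_classes)
    tp = np.diag(confusion)
    union = confusion.sum(0) + confusion.sum(1) - tp
    iou = tp / np.maximum(union, 1)        # avoid division by zero
    return iou[union > 0].mean()           # average over classes that appear

print(mean_iou(np.random.randint(0, 21, (256, 256)),
               np.random.randint(0, 21, (256, 256))))
```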
TABLE 4
Accuracies of Segmentation Models on the ADE20k Validation Dataset

TABLE 5
Instance Segmentation Model Performance on COCO Test-Dev 2017

Method              Backbone      FPS     AP
YOLACT-550 [74]     R-101-FPN     33.5    29.8
YOLACT-700 [74]     R-101-FPN     23.8    31.2
RetinaMask [172]    R-101-FPN     10.2    34.7
TensorMask [67]     R-101-FPN     2.6     37.1
SharpMask [173]     R-101-FPN     8.0     37.4
Mask-RCNN [62]      R-101-FPN     10.6    37.9
CenterMask [72]     R-101-FPN     13.2    38.3

TABLE 6
Panoptic Segmentation Model Performance on MS-COCO Val

Method                    Backbone       PQ
Panoptic FPN [144]        ResNet-50      39.0
Panoptic FPN [144]        ResNet-101     40.3
AU-Net [145]              ResNet-50      39.6
Panoptic-DeepLab [147]    Xception-71    39.7
OANet [174]               ResNet-50      39.0
OANet [174]               ResNet-101     40.7
AdaptIS [175]             ResNet-50      35.9
AdaptIS [175]             ResNet-101     37.0
UPSNet [148]              ResNet-50      42.5
OCFusion [176]            ResNet-50      41.3
OCFusion [176]            ResNet-101     43.0
OCFusion [176]            ResNeXt-101    45.7

* Use of deformable convolution.

TABLE 7
Segmentation Model Performance on the NYUD-v2 and SUN-RGBD

6 CHALLENGES AND OPPORTUNITIES
Without a doubt, image segmentation has benefited greatly from deep learning, but several challenges lie ahead. We will next discuss some of the promising research directions that we believe will help in further advancing image segmentation algorithms.

6.1 More Challenging Datasets
Several large-scale image datasets have been created for semantic segmentation and instance segmentation. However, there remains a need for more challenging datasets, as well as datasets of different kinds of images. For still images, datasets with a large number of objects and overlapping objects would be very valuable. This can enable the training of models that handle dense object scenarios better, as well as large overlaps among objects as is common in real-world scenarios. With the rising popularity of 3D image segmentation, especially in medical image analysis, there is also a strong need for large-scale annotated 3D image datasets, which are more difficult to create than their lower dimensional counterparts.

6.2 Combining DL and Earlier Segmentation Models
There is now broad agreement that the performance of DL-based segmentation algorithms is plateauing, especially in certain application domains such as medical image analysis. To advance to the next level of performance, we must further explore the combination of CNN-based image segmentation models with prominent "classical" model-based image segmentation methods. The integration of CNNs with graphical models has been studied, but their integration with active contours, graph cuts, and other classical segmentation models is more recent and merits further study.

6.3 Interpretable Deep Models
While DL-based models have achieved promising performance on challenging benchmarks, there remain open questions about these models. For example, what exactly are deep models learning? How should we interpret the features learned by these models? What is a minimal neural architecture that can achieve a certain segmentation accuracy on a given dataset? Although some techniques are available to visualize the learned convolutional kernels of these models, a comprehensive study of the underlying behavior/dynamics of these models is lacking. A better
understanding of the theoretical aspects of these models can enable the development of better models curated toward various segmentation scenarios.

6.4 Weakly-Supervised and Unsupervised Learning
Weakly-supervised (a.k.a. few-shot) learning [186] and unsupervised learning [187] are becoming very active research areas. These techniques promise to be especially valuable for image segmentation, as collecting pixel-accurately labeled training images is problematic in many application domains, particularly so in medical image analysis. The transfer learning approach is to train a generic image segmentation model on a large set of labeled samples (perhaps from a public benchmark) and then fine-tune that model on a few samples from some specific target application. Self-supervised learning is another promising direction that is attracting much attention in various fields. With the help of self-supervised learning, many details in images can be captured in order to train segmentation models with far fewer training samples. Models based on reinforcement learning could also be another potential future direction, as they have scarcely received attention for image segmentation. For example, MOREL [188] introduced a deep reinforcement learning approach for moving object segmentation in videos.
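The transfer-learning recipe mentioned above—pretrain on a large labeled benchmark, then adapt to a small target domain—often amounts to freezing most of a pretrained backbone and retraining only the prediction head; a minimal sketch follows, where the specific model, weights flag (recent torchvision), and 4-class target task are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

# start from a model pretrained on a large public benchmark
model = fcn_resnet50(weights="DEFAULT")

# freeze the backbone so the few target-domain samples only retrain the head
for p in model.backbone.parameters():
    p.requires_grad = False

# swap the final classification layer for a hypothetical 4-class target task
model.classifier[-1] = nn.Conv2d(512, 4, kernel_size=1)
trainable = [p for p in model.parameters() if p.requires_grad]
```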
VOC, MS COCO, Cityscapes, and ADE20k datasets. Finally,
6.5 Real-Time Models for Various Applications we discussed some of the open challenges and promising
In many applications, accuracy is the most important factor; research directions for deep-learning-based image segmen-
however, there are applications in which it is also critical to tation in the coming years.
have segmentation models that can run in near real-time, or
at common camera frame rates (at least 25 frames per sec- ACKNOWLEDGMENTS
ond). This is useful for computer vision systems that are, for
The authors would like to thank Tsung-Yi Lin of Google
example, deployed in autonomous vehicles. Most of the cur-
Brain, as well as Jingdong Wang and Yuhui Yuan of Micro-
rent models are far from this frame-rate; e.g., FCN-8 takes
soft Research Asia for providing helpful comments that
roughly 100 ms to process a low-resolution image. Models
improved the manuscript.
based on dilated convolution help to increase the speed of
segmentation models to some extent, but there is still plenty
of room for improvement. REFERENCES
[1] A. Rosenfeld and A. C. Kak, Digital Picture Processing.
Cambridge, MA, USA: Academic Press, 1976.
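Frame-rate claims like those above are straightforward to sanity-check; the following sketch (the model, batch size, resolution, and iteration counts are illustrative assumptions) times repeated forward passes and reports frames per second on whatever device runs it.

```python
import time
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights=None, num_classes=21).eval()
x = torch.randn(1, 3, 512, 512)           # one low-resolution frame

with torch.no_grad():
    for _ in range(3):                    # warm-up iterations
        model(x)
    start = time.perf_counter()
    n = 20
    for _ in range(n):
        model(x)
    elapsed = time.perf_counter() - start

print(f"{n / elapsed:.1f} frames per second on this device")
```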
6.6 Memory Efficient Models
Many modern segmentation models require a significant amount of memory even during the inference stage. So far, much effort has been directed towards improving the accuracy of such models, but in order to fit them into specific devices, such as mobile phones, the networks must be simplified. This can be done either by using simpler models, or by using model compression techniques, or even by training a complex model and using knowledge distillation techniques to compress it into a smaller, memory efficient network that mimics the complex model.
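A common way to realize the distillation idea mentioned above is to train the small network on a mixture of the ground-truth loss and a divergence between its per-pixel class distribution and the larger teacher's; a minimal sketch follows, in which the temperature and weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """Per-pixel knowledge distillation for segmentation.

    student_logits, teacher_logits: (N, C, H, W); target: (N, H, W) labels.
    Combines the usual cross-entropy with a KL term that pushes the student's
    softened class distribution toward the teacher's.
    """
    ce = F.cross_entropy(student_logits, target)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kl

loss = distillation_loss(torch.randn(2, 21, 64, 64),
                         torch.randn(2, 21, 64, 64),
                         torch.randint(0, 21, (2, 64, 64)))
```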
6.7 Applications
DL-based segmentation methods have been successfully applied to satellite images in remote sensing [189], such as to support urban planning [190] and precision agriculture [191]. Images collected by airborne platforms [192] and drones [193] have also been segmented using DL-based segmentation methods in order to address important environmental problems including ones related to climate change. The main challenges of the remote sensing domain stem from the typically formidable size of the imagery (often collected by imaging spectrometers with hundreds or even thousands of spectral bands) and the limited ground-truth information necessary to evaluate the accuracy of the segmentation algorithms. Similarly, DL-based segmentation techniques in the evaluation of construction materials [194] face challenges related to the massive volume of the related image data and the limited reference information for validation purposes. Last but not least, an important application field for DL-based segmentation has been biomedical imaging [195]. Here, an opportunity is to design standardized image databases useful in evaluating new infectious diseases and tracking pandemics [196].

7 CONCLUSION
We have surveyed image segmentation algorithms based on deep learning models, which have achieved impressive performance in various image segmentation tasks and benchmarks, grouped into architectural categories such as CNN and FCN, RNN, R-CNN, dilated CNN, attention-based models, and generative and adversarial models, among others. We have summarized the quantitative performance of these models on some popular benchmarks, such as the PASCAL VOC, MS COCO, Cityscapes, and ADE20k datasets. Finally, we discussed some of the open challenges and promising research directions for deep-learning-based image segmentation in the coming years.

ACKNOWLEDGMENTS
The authors would like to thank Tsung-Yi Lin of Google Brain, as well as Jingdong Wang and Yuhui Yuan of Microsoft Research Asia for providing helpful comments that improved the manuscript.

REFERENCES
[1] A. Rosenfeld and A. C. Kak, Digital Picture Processing. Cambridge, MA, USA: Academic Press, 1976.
[2] R. Szeliski, Computer Vision: Algorithms and Applications. Berlin, Germany: Springer, 2010.
[3] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Upper Saddle River, NJ, USA: Prentice Hall, 2002.
[4] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst. Man Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[5] R. Nock and F. Nielsen, "Statistical region merging," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1452–1458, Nov. 2004.
[6] N. Dhanachandra, K. Manglem, and Y. J. Chanu, "Image segmentation using K-means clustering algorithm and subtractive clustering algorithm," Procedia Comput. Sci., vol. 54, pp. 764–771, 2015.
[7] L. Najman and M. Schmitt, "Watershed of a continuous function," Signal Process., vol. 38, no. 1, pp. 99–112, 1994.
[8] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vis., vol. 1, no. 4, pp. 321–331, 1988.
[9] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.
[10] N. Plath, M. Toussaint, and S. Nakajima, "Multi-class image segmentation using conditional random fields and global classification," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 817–824.
[11] J.-L. Starck, M. Elad, and D. L. Donoho, "Image decomposition via the combination of sparse representations and a variational approach," IEEE Trans. Image Process., vol. 14, no. 10, pp. 1570–1582, Oct. 2005.
[12] S. Minaee and Y. Wang, “An ADMM approach to masked signal [39] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piece-
decomposition using subspace representation,” IEEE Trans. wise training of deep structured models for semantic
Image Process., vol. 28, no. 7, pp. 3192–3204, Jul. 2019. segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[13] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, 2016, pp. 3194–3203.
“Rethinking atrous convolution for semantic image [40] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image
segmentation,” 2017, arXiv: 1706.05587. segmentation via deep parsing network,” in Proc. IEEE Int. Conf.
[14] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and Comput. Vis., 2015, pp. 1377–1385.
D. Terzopoulos, “Image segmentation using deep learning: A [41] H. Noh, S. Hong, and B. Han, “Learning deconvolution network
survey,” 2020, arXiv: 2001.05566. for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis.,
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based 2015, pp. 1520–1528.
learning applied to document recognition,” Proc. IEEE, vol. 86, [42] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representa-
no. 11, pp. 2278–2324, Nov. 1998. tions for semantic segmentation,” 2019, arXiv: 1909.11065.
[16] K. Fukushima, “Neocognitron: A self-organizing neural net- [43] J. Fu, J. Liu, Y. Wang, J. Zhou, C. Wang, and H. Lu, “Stacked
work model for a mechanism of pattern recognition unaf- deconvolutional network for semantic segmentation,” IEEE
fected by shift in position,” Biol. Cybern., vol. 36, no. 4, Trans. Image Process., early access, Jan. 25, 2019, doi: 10.1109/
pp. 193–202, 1980. TIP.2019.2895460.
[17] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, [44] A. Chaurasia and E. Culurciello, “LinkNet: Exploiting encoder
“Phoneme recognition using time-delay neural networks,” IEEE representations for efficient semantic segmentation,” in Proc.
Trans. Acoust. Speech Signal Process., vol. 37, no. 3, pp. 328–339, IEEE Int. Conf. Visual Commun. Image Process., 2017, pp. 1–4.
Mar. 1989. [45] X. Xia and B. Kulis, “W-Net: A deep model for fully unsuper-
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classi- vised image segmentation,” 2017, arXiv: 1711.08506.
fication with deep convolutional neural networks,” in Proc. 25th [46] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang, “Locality-sensi-
Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105. tive deconvolution networks with gated fusion for RGB-D indoor
[19] K. Simonyan and A. Zisserman, “Very deep convolutional net- semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
works for large-scale image recognition,” 2014, arXiv:1409.1556. Recognit., 2017, pp. 3029–3037.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [47] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- networks for biomedical image segmentation,” in Proc. Int. Conf.
ognit., 2016, pp. 770–778. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–
[21] 2015. [Online]. Available: https://colah.github.io/posts/2015– 241.
08-Understanding-LSTMs/ [48] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++:
[22] D. E. Rumelhart et al., “Learning representations by back-propa- A nested U-Net architecture for medical image segmentation,” in
gating errors,” Cogn. Model., vol. 5, no. 3, 1988, Art. no. 1. Proc. Int. Workshop Deep Learn. Med. Image Anal. Multimodal Learn.
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Clin. Decis. Support, 2018, pp. 3–11.
Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [49] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep resid-
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. ual U-Net,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 749–
Cambridge, MA, USA: MIT Press, 2016. 753, May 2018.
[25] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep [50] € Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronne-
O.
convolutional encoder-decoder architecture for image berger, “3D U-Net: Learning dense volumetric segmentation
segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, from sparse annotation,” in Proc. Int. Conf. Med. Image Comput.
no. 12, pp. 2481–2495, Dec. 2017. Comput.-Assisted Intervention, 2016, pp. 424–432.
[26] I. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th [51] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolu-
Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680. tional neural networks for volumetric medical image
[27] A. Radford, L. Metz, and S. Chintala, “Unsupervised representa- segmentation,” in Proc. 4th Int. Conf. 3D Vis., 2016, pp. 565–571.
tion learning with deep convolutional generative adversarial [52] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam,
networks,” 2015, arXiv:1511.06434. “Deep 3D convolutional encoder networks with shortcuts for
[28] M. Mirza and S. Osindero, “Conditional generative adversarial multiscale feature integration applied to multiple sclerosis lesion
nets,” 2014, arXiv:1411.1784. segmentation,” IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1229–
[29] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” 1239, May 2016.
2017, arXiv: 1701.07875. [53] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S.
[30] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net- Belongie, “Feature pyramid networks for object detection,” in
works for semantic segmentation,” in Proc. IEEE Conf. Comput. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2117–
Vis. Pattern Recognit., 2015, pp. 3431–3440. 2125.
[31] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic [54] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
brain tumor segmentation using cascaded anisotropic convolu- network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
tional neural networks,” in Proc. Int. MICCAI Brainlesion Work- pp. 2881–2890.
shop, 2017, pp. 178–190. [55] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction
[32] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional and refinement for semantic segmentation,” in Proc. Eur. Conf.
instance-aware semantic segmentation,” in Proc. IEEE Conf. Com- Comput. Vis., 2016, pp. 519–534.
put. Vis. Pattern Recognit., 2017, pp. 2359–2367. [56] J. He, Z. Deng, and Y. Qiao, “Dynamic multi-scale filters for
[33] Y. Yuan, M. Chao, and Y.-C. Lo, “Automatic skin lesion segmen- semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis.,
tation using deep fully convolutional networks with Jaccard dis- 2019, pp. 3562–3572.
tance,” IEEE Trans. Med. Imag., vol. 36, no. 9, pp. 1876–1886, Sep. [57] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang, “Context
2017. contrasted feature and gated multi-scale aggregation for scene
[34] N. Liu, H. Li, M. Zhang, J. Liu, Z. Sun, and T. Tan, “Accurate segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
iris segmentation in non-cooperative environments using fully 2018, pp. 2393–2402.
convolutional networks,” in Proc. Int. Conf. Biometrics, 2016, [58] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyra-
pp. 1–8. mid context network for semantic segmentation,” in Proc. Conf.
[35] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider Comput. Vis. Pattern Recognit., 2019, pp. 7519–7528.
to see better,” 2015, arXiv:1506.04579. [59] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, “Multi-
[36] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. scale context intertwining for semantic segmentation,” in Proc.
Yuille, “Semantic image segmentation with deep convolutional Eur. Conf. Comput. Vis., 2018, pp. 603–619.
nets and fully connected CRFs,” 2014, arXiv:1412.7062. [60] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object
[37] A. G. Schwing and R. Urtasun, “Fully connected deep structured segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
networks,” 2015, arXiv:1503.02351. 2017, pp. 2386–2395.
[38] S. Zheng et al., “Conditional random fields as recurrent neural [61] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1529– real-time object detection with region proposal networks,” in
1537. Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
[62] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in [88] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention
Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969. to scale: Scale-aware semantic image segmentation,” in Proc.
[63] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3640–3649.
for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pat- [89] Q. Huang et al., “Semantic segmentation with reverse attention,”
tern Recognit., 2018, pp. 8759–8768. 2017, arXiv: 1707.06426.
[64] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation [90] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network
via multi-task network cascades,” in Proc. IEEE Conf. Comput. for semantic segmentation,” 2018, arXiv: 1805.10180.
Vis. Pattern Recognit., 2016, pp. 3150–3158. [91] J. Fu et al., “Dual attention network for scene segmentation,” in
[65] R. Hu, P. Doll ar, K. He, T. Darrell, and R. Girshick, “Learning to Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3146–
segment every thing,” in Proc. IEEE Conf. Comput. Vis. Pattern 3154.
Recognit., 2018, pp. 4233–4241. [92] Y. Yuan and J. Wang, “OCNet: Object context network for scene
[66] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, parsing,” 2018, arXiv: 1809.00916.
and H. Adam, “MaskLab: Instance segmentation by refining [93] H. Zhang et al., “ResNeSt: Split-attention networks,” 2020, arXiv:
object detection with semantic and direction features,” in Proc. 2004.08955.
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4013–4022. [94] S. Choi, J. T. Kim, and J. Choo, “Cars can’t fly up in the sky:
[67] X. Chen, R. Girshick, K. He, and P. Dollar, “TensorMask: A foun- Improving urban-scene segmentation via height-driven attention
dation for dense object segmentation,” 2019, arXiv: 1903.12174. networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
[68] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via 2020, pp. 9373–9383.
region-based fully convolutional networks,” in Proc. 30th Int. [95] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, “Expectation-
Conf. Neural Inf. Process. Syst., 2016, pp. 379–387. maximization attention networks for semantic segmentation,” in
[69] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9167–9176.
object candidates,” in Proc. 28th Int. Conf. Neural Inf. Process. [96] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu,
Syst., 2015, pp. 1990–1998. “CCNet: Criss-cross attention for semantic segmentation,” in
[70] E. Xie et al., “PolarMask: Single shot instance segmentation with Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 603–612.
polar representation,” 2019, arXiv: 1909.13226. [97] M. Ren and R. S. Zemel, “End-to-end instance segmentation with
[71] Z. Hayder, X. He, and M. Salzmann, “Boundary-aware instance recurrent attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., ognit., 2017, pp. 6656–6664.
2017, pp. 5696–5704. [98] H. Zhao et al., “PSANet: Point-wise spatial attention network
[72] Y. Lee and J. Park, “CenterMask: Real-time anchor-free instance for scene parsing,” in Proc. Eur. Conf. Comput. Vis., 2018,
Shervin Minaee (Member, IEEE) received the PhD degree in electrical engineering and computer science from New York University, New York, in 2018. He is currently a machine learning lead in the computer vision team at Snapchat, Inc. His research interests include computer vision, image segmentation, biometric recognition, and applied deep learning. He published more than 40 papers and patents during his PhD. He previously worked as a research scientist at Samsung Research, AT&T Labs, and Huawei Labs, and as a data scientist at Expedia Group. He has been a reviewer for more than 20 computer vision related journals from IEEE, ACM, Elsevier, and Springer. He has won several awards, including the Best Research Presentation at Samsung Research America in 2017 and the Verizon Open Innovation Challenge Award in 2016.

Yuri Boykov (Member, IEEE) is currently a professor at the Cheriton School of Computer Science, University of Waterloo, Canada. His research interests span computer vision and biomedical image analysis, with a focus on modeling and optimization for structured segmentation, restoration, registration, stereo, motion, model fitting, recognition, photo-video editing, and other data analysis problems. He is an editor for the International Journal of Computer Vision (IJCV). His work was listed among the 10 most influential papers in the IEEE Transactions on Pattern Analysis and Machine Intelligence (Top Picks for 30 years). In 2017, Google Scholar listed his work on segmentation as a "classic paper in computer vision and pattern recognition" (from 2006). In 2011, he received the Helmholtz Prize from the IEEE and the Test of Time Award from the International Conference on Computer Vision.
Fatih Porikli (Fellow, IEEE) received the PhD degree from New York University, New York, in 2002. He is currently a senior director at Qualcomm, San Diego. He was a full professor with the Research School of Engineering, Australian National University, Australia, and, until recently, a vice president at Huawei CBG Device Hardware, San Diego. He led the Computer Vision Research Group at NICTA, Australia, and was a distinguished research scientist at Mitsubishi Electric Research Laboratories, Cambridge, Massachusetts. He was the recipient of the R&D 100 Scientist of the Year Award in 2006. He has won six best paper awards, authored more than 250 papers, co-edited two books, and invented more than 100 patents. He has served as the general chair and technical program chair of many IEEE conferences and as an associate editor of premier IEEE and Springer journals for the past 15 years.

Antonio Plaza (Fellow, IEEE) received the MSc and PhD degrees from the Department of Technology of Computers and Communications, University of Extremadura, Spain, in 1999 and 2002, respectively, both in computer engineering. He is currently a professor at the Department of Technology of Computers and Communications, University of Extremadura, Spain. He has authored more than 600 publications, including 300 JCR journal papers (more than 170 in IEEE journals), 24 book chapters, and more than 300 peer-reviewed conference proceedings papers. He is a recipient of the Best Column Award of the IEEE Signal Processing Magazine in 2015, the 2013 Best Paper Award of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, and the most highly cited paper (2005–2010) in the Journal of Parallel and Distributed Computing. He is included in the 2018, 2019, and 2020 Highly Cited Researchers List.

Nasser Kehtarnavaz (Fellow, IEEE) is currently an Erik Jonsson distinguished professor at the Department of Electrical and Computer Engineering and the director of the Embedded Machine Learning Laboratory at the University of Texas at Dallas, Richardson, Texas. His research interests include signal and image processing, machine learning, deep learning, and real-time implementation on embedded processors. He has authored or coauthored ten books and more than 400 journal papers, conference papers, patents, manuals, and editorials in these areas. He is a fellow of SPIE, a licensed professional engineer, and editor-in-chief of the Journal of Real-Time Image Processing.

Demetri Terzopoulos (Fellow, IEEE) received the PhD degree in artificial intelligence from the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, in 1984. He is currently a distinguished professor and chancellor's professor of computer science at the University of California, Los Angeles, where he directs the UCLA Computer Graphics & Vision Laboratory, and is co-founder and chief scientist of VoxelCloud, Inc. He is or was a Guggenheim fellow; a fellow of the ACM, IETI, the Royal Society of Canada, and the Royal Society of London; and a member of the European Academy of Sciences, the New York Academy of Sciences, and Sigma Xi. Among his many awards are an Academy Award from the Academy of Motion Picture Arts and Sciences for his pioneering work on physics-based computer animation, and the Computer Pioneer Award, Helmholtz Prize, and inaugural Computer Vision Distinguished Researcher Award from the IEEE for his pioneering and sustained research on deformable models and their applications. Deformable models, a term he coined, is listed in the IEEE Taxonomy.