

Object Detection With Deep Learning: A Review


Zhong-Qiu Zhao, Member, IEEE, Peng Zheng, Shou-Tao Xu, and Xindong Wu, Fellow, IEEE
Abstract— Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles that combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely, the convolutional neural network. Then, we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.

Index Terms— Deep learning, neural network, object detection.

Manuscript received September 8, 2017; revised March 3, 2018 and July 12, 2018; accepted October 15, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61672203, Grant 61375047, and Grant 91746209, in part by the National Key Research and Development Program of China under Grant 2016YFB1000901, and in part by the Anhui Natural Science Funds for Distinguished Young Scholar under Grant 170808J08. (Corresponding author: Zhong-Qiu Zhao.) Z.-Q. Zhao, P. Zheng, and S.-T. Xu are with the College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China (e-mail: zhongqiuzhao@gmail.com). X. Wu is with the School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70504 USA. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. Digital Object Identifier 10.1109/TNNLS.2018.2876865

I. INTRODUCTION

TO GAIN a complete image understanding, we should not only concentrate on classifying different images but also try to precisely estimate the concepts and locations of objects contained in each image. This task is referred to as object detection [1], [S1], which usually consists of different subtasks such as face detection [2], [S2], pedestrian detection [3], [S2], and skeleton detection [4], [S3]. As one of the fundamental computer vision problems, object detection is able to provide valuable information for semantic understanding of images and videos and is related to many applications, including image classification [5], [6], human behavior analysis [7], [S4], face recognition [8], [S5], and autonomous driving [9], [10]. Meanwhile, inheriting from neural networks and related learning systems, the progress in these fields will develop neural network algorithms and will also have great impacts on object detection techniques that can be considered as learning systems [11]–[14], [S6]. However, due to large variations in viewpoints, poses, occlusions, and lighting conditions, it is difficult to perfectly accomplish object detection with an additional object localization task. Therefore, much attention has been attracted to this field in recent years [15]–[18].

The problem definition of object detection is to determine where objects are located in a given image (object localization) and which category each object belongs to (object classification). Therefore, the pipeline of traditional object detection models can mainly be divided into three stages: informative region selection, feature extraction, and classification.

A. Informative Region Selection

As different objects may appear in any position of the image and have different aspect ratios or sizes, it is a natural choice to scan the whole image with a multiscale sliding window. Although this exhaustive strategy can find out all possible positions of the objects, its shortcomings are also obvious. Due to a large number of candidate windows, it is computationally expensive and produces too many redundant windows. However, if only a fixed number of sliding window templates is applied, unsatisfactory regions may be produced.

B. Feature Extraction

To recognize different objects, we need to extract visual features that can provide a semantic and robust representation. Scale-invariant feature transform [19], histograms of oriented gradients (HOG) [20], and Haar-like [21] features are the representative ones. This is due to the fact that these features can produce representations associated with complex cells in the human brain [19]. However, due to the diversity of appearances, illumination conditions, and backgrounds, it is difficult to manually design a robust feature descriptor to perfectly describe all kinds of objects.

C. Classification

Besides, a classifier is needed to distinguish a target object from all the other categories and to make the representations more hierarchical, semantic, and informative for visual recognition. Usually, the support vector machine (SVM) [22], AdaBoost [23], and the deformable part-based model (DPM) [24] are good choices. Among these classifiers, the DPM is a flexible model that combines object parts with deformation costs to handle severe deformations. In DPM, with the aid of a graphical model, carefully designed low-level features and kinematically inspired part decompositions are combined.

Discriminative learning of graphical models allows for building high-precision part-based models for a variety of object classes.

Based on these discriminative local feature descriptors and shallow learnable architectures, state-of-the-art results have been obtained on the PASCAL visual object classes (VOC) object detection competition [25], and real-time embedded systems have been achieved with a low burden on hardware. However, only small gains were obtained during 2010–2012 by building ensemble systems and employing minor variants of successful methods [15]. This fact is due to the following reasons: 1) the generation of candidate bounding boxes (BBs) with a sliding window strategy is redundant, inefficient, and inaccurate and 2) the semantic gap cannot be bridged by the combination of manually engineered low-level descriptors and discriminatively trained shallow models.

Thanks to the emergence of deep neural networks (DNNs) [6], [26], [S7], a more significant gain was obtained with the introduction of regions with convolutional neural network (CNN) features (R-CNN) [15]. DNNs, or the most representative CNNs, act in a quite different way from traditional approaches. They have deeper architectures with the capacity to learn more complex features than the shallow ones. Also, their expressivity and robust training algorithms allow learning informative object representations without the need to design features manually [27].

Since the proposal of R-CNN, a great deal of improved models have been suggested, including Fast R-CNN, which jointly optimizes classification and bounding box regression tasks [16]; Faster R-CNN, which takes an additional subnetwork to generate region proposals [17]; and you only look once (YOLO), which accomplishes object detection via a fixed-grid regression [18]. All of them bring different degrees of detection performance improvement over the primary R-CNN and make real-time and accurate object detection more achievable.

In this paper, a systematic review is provided to summarize representative models and their different characteristics in several application domains, including generic object detection [15]–[17], salient object detection [28], [29], face detection [30]–[32], and pedestrian detection [33], [34]. Their relationships are depicted in Fig. 1. Based on basic CNN architectures, generic object detection is achieved with bounding box regression, while salient object detection is accomplished with local contrast enhancement and pixel-level segmentation. Face detection and pedestrian detection are closely related to generic object detection and mainly accomplished with multiscale adaption and multifeature fusion/boosting forest, respectively. The dotted lines indicate that the corresponding domains are associated with each other under certain conditions. It should be noticed that the covered domains are diversified. Pedestrian and face images have regular structures, while general objects and scene images have more complex variations in geometric structures and layouts. Therefore, different deep models are required by various images.

Fig. 1. Application domains of object detection.

There has been a relevant pioneer effort [35] which mainly focuses on relevant software tools to implement deep learning techniques for image classification and object detection but pays little attention to detailing specific algorithms. Different from it, our work not only reviews deep learning-based object detection models and algorithms covering different application domains in detail but also provides their corresponding experimental comparisons and meaningful analyses.

The rest of this paper is organized as follows. In Section II, a brief introduction on the history of deep learning and the basic architecture of CNN is provided. Generic object detection architectures are presented in Section III. Then, reviews of CNN applied in several specific tasks, including salient object detection, face detection, and pedestrian detection, are exhibited in Sections IV–VI, respectively. Several promising future directions are proposed in Section VII. At last, some concluding remarks are presented in Section VIII.

II. BRIEF OVERVIEW OF DEEP LEARNING

Prior to an overview on deep learning-based object detection approaches, we provide a review on the history of deep learning along with an introduction on the basic architecture and advantages of CNN.

A. History: Birth, Decline, and Prosperity

Deep models can be referred to as neural networks with deep structures. The history of neural networks dates back to the 1940s [36], and the original intention was to simulate the human brain system to solve general learning problems in a principled way. Neural networks were popular in the 1980s and 1990s with the proposal of the back-propagation algorithm by Rumelhart et al. [37]. However, due to the overfitting of training, lack of large-scale training data, limited computation power, and insignificance in performance compared with other machine learning tools, neural networks fell out of fashion in the early 2000s.

Deep learning has become popular since 2006 [26], [S7], with a breakthrough in speech recognition [38]. The recovery of deep learning can be attributed to the following factors.

1) The emergence of large-scale annotated training data, such as ImageNet [39], to fully exhibit its very large learning capacity.
2) Fast development of high-performance parallel computing systems, such as GPU clusters.
3) Significant advances in the design of network structures and training strategies. With unsupervised and layerwise pretraining guided by the autoencoder [40] or the restricted Boltzmann machine [41], a good initialization is provided. With dropout and data augmentation, the overfitting problem in training has been relieved [6], [42].

With batch normalization (BN), the training of very deep DNNs becomes quite efficient [43]. Meanwhile, various network structures, such as AlexNet [6], Overfeat [44], GoogLeNet [45], Visual Geometry Group (VGG) [46], and Residual Net (ResNet) [47], have been extensively studied to improve the performance.

What prompts deep learning to have a huge impact on the entire academic community? It may owe to the contribution of Hinton's group, whose continuous efforts have demonstrated that deep learning would bring a revolutionary breakthrough on grand challenges rather than just obvious improvements on small data sets. Their success results from training a large CNN on 1.2 million labeled images together with a few techniques [6] [e.g., the rectified linear unit (ReLU) operation [48] and "dropout" regularization].

Fig. 2. Two types of frameworks: region proposal based and regression/classification based. SPP: spatial pyramid pooling [64], FRCN: Fast R-CNN [16], RPN: region proposal network [17], FCN: fully convolutional network [65], BN: batch normalization [43], and Deconv layers: deconvolution layers [54].

B. Architecture and Advantages of CNN

CNN is the most representative model of deep learning [27]. A typical CNN architecture, which is referred to as VGG16, can be found in Fig. S1 in the supplementary material. Each layer of CNN is known as a feature map. The feature map of the input layer is a 3-D matrix of pixel intensities for different color channels (e.g., RGB). The feature map of any internal layer is an induced multichannel image, whose "pixel" can be viewed as a specific feature. Every neuron is connected with a small portion of adjacent neurons from the previous layer (receptive field). Different types of transformations [6], [49], [50] can be conducted on feature maps, such as filtering and pooling. The filtering (convolution) operation convolutes a filter matrix (learned weights) with the values of a receptive field of neurons and takes a nonlinear function (such as sigmoid [51] or ReLU) to obtain final responses. The pooling operation, such as max pooling, average pooling, L2-pooling, and local contrast normalization [52], summarizes the responses of a receptive field into one value to produce more robust feature descriptions.
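To make the filtering and pooling operations concrete, here is a minimal NumPy sketch (illustrative only: the 8 × 8 input, random weights, and single channel are assumptions, not taken from any particular network) of one 3 × 3 convolution followed by ReLU and 2 × 2 max pooling:

```python
import numpy as np

def conv2d_single(feature_map, kernel):
    """Valid convolution of one 2-D feature map with one 2-D kernel."""
    kh, kw = kernel.shape
    h, w = feature_map.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output "pixel" responds to one small receptive field.
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Nonlinearity applied to the filter responses."""
    return np.maximum(x, 0.0)

def max_pool2d(feature_map, size=2, stride=2):
    """Summarize each size x size receptive field by its maximum response."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy single-channel input feature map
k = rng.standard_normal((3, 3))   # stand-in for learned filter weights
y = max_pool2d(relu(conv2d_single(x, k)))
print(y.shape)                    # (3, 3): smaller, more robust description
```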
With an interleave between convolution and pooling, an initial feature hierarchy is constructed, which can be fine-tuned in a supervised manner by adding several fully connected (FC) layers to adapt to different visual tasks. According to the tasks involved, the final layer with different activation functions [6] is added to get a specific conditional probability for each output neuron. The whole network can be optimized on an objective function (e.g., mean squared error or cross-entropy loss) via the stochastic gradient descent (SGD) method. The typical VGG16 has in total 13 convolutional (conv) layers, 3 FC layers, 5 max-pooling layers, and a softmax classification layer. The conv feature maps are produced by convoluting 3 × 3 filter windows, and feature map resolutions are reduced with stride-2 max-pooling layers. An arbitrary test image of the same size as the training samples can be processed with the trained network. Rescaling or cropping operations may be needed if different sizes are provided [6].

The advantages of CNN against traditional methods can be summarized as follows.

1) Hierarchical feature representation, which is the multilevel representation from pixel to high-level semantic features learned by a hierarchical multistage structure [15], [53], can be learned from data automatically, and hidden factors of input data can be disentangled through multilevel nonlinear mappings.
2) Compared with traditional shallow models, a deeper architecture provides an exponentially increased expressive capability.
3) The architecture of CNN provides an opportunity to jointly optimize several related tasks together (e.g., Fast R-CNN combines classification and bounding box regression into a multitask learning manner).
4) Benefitting from the large learning capacity of deep CNNs, some classical computer vision challenges can be recast as high-dimensional data transform problems and solved from a different viewpoint.

Due to these advantages, CNN has been widely applied into many research fields, such as image superresolution reconstruction [54], [55], image classification [5], [56], image retrieval [57], [58], face recognition [8], [S5], pedestrian detection [59]–[61], and video analysis [62], [63].

III. GENERIC OBJECT DETECTION

Generic object detection aims at locating and classifying existing objects in any one image and labeling them with rectangular BBs to show the confidences of existence. The frameworks of generic object detection methods can mainly be categorized into two types (see Fig. 2). One follows the traditional object detection pipeline, generating region proposals at first and then classifying each proposal into different object categories. The other regards object detection as a regression or classification problem, adopting a unified framework to achieve final results (categories and locations) directly. The region proposal-based methods mainly include R-CNN [15], spatial pyramid pooling (SPP)-net [64], Fast R-CNN [16], Faster R-CNN [17], the region-based fully convolutional network (R-FCN) [65], feature pyramid networks (FPN) [66], and Mask R-CNN [67], some of which are correlated with each other (e.g., SPP-net modifies R-CNN with an SPP layer). The regression/classification-based methods mainly include MultiBox [68], AttentionNet [69], G-CNN [70], YOLO [18], the Single Shot MultiBox Detector (SSD) [71], YOLOv2 [72], the deconvolutional single shot detector (DSSD) [73], and deeply supervised object detectors (DSOD) [74]. The correlations between these two pipelines are bridged by the anchors introduced in Faster R-CNN. Details of these methods are as follows.

Fig. 3. Flowchart of R-CNN [15], which consists of three stages: 1) extracts BU region proposals, 2) computes features for each proposal using a CNN, and then 3) classifies each region with class-specific linear SVMs.

A. Region Proposal-Based Framework

The region proposal-based framework, a two-step process, matches the attentional mechanism of the human brain to some extent, which gives a coarse scan of the whole scenario first and then focuses on regions of interest (RoIs). Among the prerelated works [44], [75], [76], the most representative one is Overfeat [44]. This model inserts CNN into the sliding window method, which predicts BBs directly from locations of the topmost feature map after obtaining the confidences of underlying object categories.

1) R-CNN: It is of significance to improve the quality of candidate BBs and to take a deep architecture to extract high-level features. To solve these problems, R-CNN was proposed by Girshick et al. [15] and obtained a mean average precision (mAP) of 53.3% with more than 30% improvement over the previous best result (DPM histograms of sparse codes [77]) on PASCAL VOC 2012. Fig. 3 shows the flowchart of R-CNN, which can be divided into three stages as follows.

a) Region Proposal Generation: The R-CNN adopts selective search [78] to generate about 2000 region proposals for each image. The selective search method relies on simple bottom-up (BU) grouping and saliency cues to provide more accurate candidate boxes of arbitrary sizes quickly and to reduce the searching space in object detection [24], [39].

b) CNN-Based Deep Feature Extraction: In this stage, each region proposal is warped or cropped into a fixed resolution, and the CNN module in [6] is utilized to extract a 4096-dimensional feature as the final representation. Due to the large learning capacity, dominant expressive power, and hierarchical structure of CNNs, a high-level, semantic, and robust feature representation for each region proposal can be obtained.

c) Classification and Localization: With pretrained category-specific linear SVMs for multiple classes, different region proposals are scored on a set of positive regions and background (negative) regions. The scored regions are then adjusted with bounding box regression and filtered with a greedy nonmaximum suppression (NMS) to produce final BBs for preserved object locations.
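As an illustration of the greedy NMS step, the following sketch ranks boxes by score and suppresses overlapping ones (a generic version, not the authors' exact implementation; the 0.3 IoU threshold is an assumption in line with common R-CNN settings):

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring boxes, suppressing overlapping ones.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns indices of the preserved boxes.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop every box whose IoU with the kept box is too high.
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
                 dtype=float)
print(greedy_nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```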
When there are scarce or insufficient labeled data, pretraining is usually conducted. Instead of unsupervised pretraining [79], R-CNN first conducts supervised pretraining on the ImageNet Large-Scale Visual Recognition Competition, a very large auxiliary data set, and then takes a domain-specific fine-tuning. This scheme has been adopted by most of the subsequent approaches [16], [17].

In spite of its improvements over traditional methods and significance in bringing CNN into practical object detection, there are still some disadvantages.

1) Due to the existence of FC layers, the CNN requires a fixed-size (e.g., 227 × 227) input image, which directly leads to the recomputation of the whole CNN for each evaluated region, taking a great deal of time in the testing period.
2) Training of R-CNN is a multistage pipeline. At first, a convolutional network (ConvNet) on object proposals is fine-tuned. Then, the softmax classifier learned by fine-tuning is replaced by SVMs to fit in with ConvNet features. Finally, bounding-box regressors are trained.
3) Training is expensive in space and time. Features are extracted from different region proposals and stored on the disk. It will take a long time to process a relatively small training set with very deep networks, such as VGG16. At the same time, the storage memory required by these features should also be a matter of concern.
4) Although selective search can generate region proposals with relatively high recalls, the obtained region proposals are still redundant, and this procedure is time-consuming (around 2 s to extract 2000 region proposals).

To solve these problems, many methods have been proposed. Geodesic object proposals [80] takes a much faster geodesic-based segmentation to replace traditional graph cuts. Multiscale combinatorial grouping [81] searches different scales of the image for multiple hierarchical segmentations and combinatorially groups different regions to produce proposals. Instead of extracting visually distinct segments, the edge boxes method [82] adopts the idea that objects are more likely to exist in BBs with fewer contours straddling their boundaries. Also, some studies tried to rerank or refine preextracted region proposals to remove unnecessary ones and obtain a limited number of valuable ones, such as DeepBox [83] and SharpMask [84].

In addition, there are some improvements to solve the problem of inaccurate localization. Zhang et al. [85] utilized a Bayesian optimization-based search algorithm to guide the regressions of different BBs sequentially and trained class-specific CNN classifiers with a structured loss to penalize the localization inaccuracy explicitly. Gupta et al. [86] improved object detection for RGB-D images with semantically rich image and depth features and learned a new geocentric embedding for depth images to encode each pixel. The combination of object detectors and a superpixel classification framework gains a promising result on the semantic scene segmentation task. Ouyang et al. [87] proposed a deformable deep CNN (DeepID-Net) that introduces a novel deformation constrained pooling (def-pooling) layer to impose a geometric penalty on the deformation of various object parts and makes an ensemble of models with different settings. Lenc and Vedaldi [88] provided an analysis on the role of proposal generation in CNN-based detectors and tried to replace this stage with a constant and trivial region generation scheme. The goal is achieved by biasing sampling to match the statistics of the ground truth BBs with K-means clustering.

However, more candidate boxes are required to achieve results comparable to those of R-CNN.

Fig. 4. Architecture of SPP-net for object detection [64].

2) SPP-Net: FC layers must take a fixed-size input. That is why R-CNN chooses to warp or crop each region proposal into the same size. However, the object may exist partly in the cropped region, and unwanted geometric distortion may be produced due to the warping operation. These content losses or distortions will reduce recognition accuracy, especially when the scales of objects vary.

To solve this problem, He et al. [64] took the theory of spatial pyramid matching (SPM) [89], [90] into consideration and proposed a novel CNN architecture named SPP-net. SPM takes several finer to coarser scales to partition the image into a number of divisions and aggregates quantized local features into mid-level representations.

The architecture of SPP-net for object detection can be found in Fig. 4. Different from R-CNN, SPP-net reuses the feature maps of the fifth conv layer (conv5) to project region proposals of arbitrary sizes to fixed-length feature vectors. The feasibility of reusing these feature maps is due to the fact that the feature maps not only involve the strength of local responses but also have relationships with their spatial positions [64]. The layer after the final conv layer is referred to as the SPP layer. If the number of feature maps in conv5 is 256, taking a three-level pyramid, the final feature vector for each region proposal obtained after the SPP layer has a dimension of 256 × (1² + 2² + 4²) = 5376.

SPP-net not only gains better results with a correct estimation of different region proposals in their corresponding scales but also improves detection efficiency in the testing period with the sharing of computation cost before the SPP layer among different proposals.
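The SPP layer itself reduces to adaptive max pooling over a pyramid of grids. A simplified single-proposal sketch under the 256-channel, three-level {1 × 1, 2 × 2, 4 × 4} setting described above shows why the output length is fixed at 5376 whatever the proposal size (the 13 × 9 input size is an arbitrary assumption):

```python
import numpy as np

def spp_layer(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling over one region's conv features.

    feature_map: (C, H, W) conv5 responses cropped to the region;
    H and W may vary per proposal, the output length never does.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:                      # n x n grid at each pyramid level
        for i in range(n):
            for j in range(n):
                y0 = (i * h) // n
                y1 = max(((i + 1) * h) // n, y0 + 1)   # guard empty bins
                x0 = (j * w) // n
                x1 = max(((j + 1) * w) // n, x0 + 1)
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)         # length = C * sum(n * n)

region = np.random.rand(256, 13, 9)       # arbitrary-size proposal features
print(spp_layer(region).shape)            # (5376,) = 256 * (1 + 4 + 16)
```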
3) Fast R-CNN: Although SPP-net has achieved impressive improvements in both accuracy and efficiency over R-CNN, it still has some notable drawbacks. SPP-net takes almost the same multistage pipeline as R-CNN, including feature extraction, network fine-tuning, SVM training, and bounding-box regressor fitting. Therefore, an additional expense on storage space is still required. In addition, the conv layers preceding the SPP layer cannot be updated with the fine-tuning algorithm introduced in [64]. As a result, an accuracy drop of very deep networks is unsurprising. To this end, Girshick [16] introduced a multitask loss on classification and bounding box regression and proposed a novel CNN architecture named Fast R-CNN.

Fig. 5. Architecture of Fast R-CNN [16].

The architecture of Fast R-CNN is exhibited in Fig. 5. Similar to SPP-net, the whole image is processed with conv layers to produce feature maps. Then, a fixed-length feature vector is extracted from each region proposal with an RoI pooling layer. The RoI pooling layer is a special case of the SPP layer, which has only one pyramid level. Each feature vector is then fed into a sequence of FC layers before finally branching into two sibling output layers. One output layer is responsible for producing softmax probabilities for all C + 1 categories (C object classes plus one "background" class), and the other output layer encodes refined bounding-box positions with four real-valued numbers. All parameters in these procedures (except the generation of region proposals) are optimized via a multitask loss in an end-to-end way.

The multitask loss L is defined as follows to jointly train classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v) \quad (1)$$

where $L_{cls}(p, u) = -\log p_u$ calculates the log loss for the ground truth class $u$, and $p_u$ is derived from the discrete probability distribution $p = (p_0, \cdots, p_C)$ over the $C + 1$ outputs from the last FC layer. $L_{loc}(t^u, v)$ is defined over the predicted offsets $t^u = (t_x^u, t_y^u, t_w^u, t_h^u)$ and the ground-truth bounding-box regression targets $v = (v_x, v_y, v_w, v_h)$, where $x$, $y$, $w$, and $h$ denote the two coordinates of the box center, width, and height, respectively. Each $t^u$ adopts the parameter settings in [15] to specify an object proposal with a log-space height/width shift and a scale-invariant translation. The Iverson bracket indicator function $[u \geq 1]$ is employed to omit all background RoIs. To provide more robustness against outliers and eliminate the sensitivity to exploding gradients, a smooth $L_1$ loss is adopted to fit the bounding-box regressors as follows:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i) \quad (2)$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases} \quad (3)$$
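Equations (1)–(3) can be transcribed almost directly. The sketch below is a schematic NumPy version for a single RoI (λ = 1 as in [16]; the input arrays are placeholders rather than real network outputs):

```python
import numpy as np

def smooth_l1(x):
    """Eq. (3): quadratic near zero, linear for outliers."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Eq. (1): log loss for the true class plus box regression loss.

    p:   (C + 1,) softmax probabilities, index 0 = background
    u:   integer ground-truth class label
    t_u: (4,) predicted offsets (tx, ty, tw, th) for class u
    v:   (4,) ground-truth regression targets
    """
    l_cls = -np.log(p[u])
    l_loc = smooth_l1(t_u - v).sum()       # Eq. (2), summed over x, y, w, h
    return l_cls + lam * (u >= 1) * l_loc  # Iverson bracket [u >= 1]

p = np.array([0.1, 0.7, 0.2])              # background + two object classes
print(multitask_loss(p, 1, np.array([0.2, -0.1, 0.3, 0.05]), np.zeros(4)))
```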
To accelerate the pipeline of Fast R-CNN, another two tricks are of necessity. On the one hand, if training samples (i.e., RoIs) come from different images, backpropagation through the SPP layer becomes highly inefficient. Fast R-CNN samples minibatches hierarchically, namely, N images sampled randomly at first and then R/N RoIs sampled in each image, where R represents the number of RoIs. Critically, computation and memory are shared by RoIs from the same image in the forward and backward pass. On the other hand, much time is spent in computing the FC layers during the forward pass [16]. The truncated singular value decomposition (SVD) [91] can be utilized to compress large FC layers and to accelerate the testing procedure.
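A minimal sketch of the truncated-SVD compression (the 1024 × 1024 weight shape and the rank t = 128 are illustrative assumptions, not values from [16]): the weight matrix W is replaced by two thin factors, turning one large FC product into two much cheaper ones.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))   # hypothetical FC weight matrix
t = 128                                 # number of singular values kept

U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = Vt[:t]                    # (t, 1024): first, thinner FC layer
W2 = U[:, :t] * S[:t]          # (1024, t): second, thinner FC layer

x = rng.standard_normal(1024)
y_full = W @ x                 # ~1024 * 1024 multiply-adds
y_trunc = W2 @ (W1 @ x)        # ~2 * t * 1024 multiply-adds
print(np.abs(y_full - y_trunc).mean())   # approximation error introduced
```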

In the Fast R-CNN, regardless of region proposal generation, the training of all network layers can be processed in a single stage with a multitask loss. It saves the additional expense on storage space and improves both accuracy and efficiency with more reasonable training schemes.

Fig. 6. RPN in Faster R-CNN [17]. K predefined anchor boxes are convoluted with each sliding window to produce fixed-length vectors, which are taken by the cls and reg layers to obtain corresponding outputs.

4) Faster R-CNN: Despite the attempt to generate candidate boxes with biased sampling [88], state-of-the-art object detection networks mainly rely on additional methods, such as selective search and Edgebox, to generate a candidate pool of isolated region proposals. Region proposal computation is also a bottleneck in improving efficiency. To solve this problem, Ren et al. [17], [92] introduced an additional region proposal network (RPN), which acts in a nearly cost-free way by sharing full-image conv features with the detection network. RPN is achieved with an FCN, which has the ability to predict object bounds and scores at each position simultaneously. Similar to [78], RPN takes an image of arbitrary size to generate a set of rectangular object proposals. RPN operates on a specific conv layer, with the preceding layers shared with the object detection network.

The architecture of RPN is shown in Fig. 6. The network slides over the conv feature map and fully connects to an n × n spatial window. A low-dimensional vector (512-dimensional for VGG16) is obtained in each sliding window and fed into two sibling FC layers, namely, a box-classification layer (cls) and a box-regression layer (reg). This architecture is implemented with an n × n conv layer followed by two sibling 1 × 1 conv layers. To increase nonlinearity, ReLU is applied to the output of the n × n conv layer.

The regressions toward true BBs are achieved by comparing proposals relative to reference boxes (anchors). In the Faster R-CNN, anchors of three scales and three aspect ratios are adopted. The loss function is similar to (1)

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \quad (4)$$

where $p_i$ is the predicted probability of the $i$th anchor being an object. The ground truth label $p_i^*$ is 1 if the anchor is positive and 0 otherwise. $t_i$ stores the four parameterized coordinates of the predicted bounding box, while $t_i^*$ is related to the ground-truth box overlapping with a positive anchor. $L_{cls}$ is a binary log loss, and $L_{reg}$ is a smoothed $L_1$ loss similar to (2). These two terms are normalized with the minibatch size ($N_{cls}$) and the number of anchor locations ($N_{reg}$), respectively. In the form of FCNs, Faster R-CNN can be trained end-to-end by backpropagation and SGD in an alternate training manner.
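The reference anchors can be enumerated in a few lines. The following sketch assumes a 16-pixel feature stride, scales of 128/256/512 pixels, a 38 × 50 conv feature map, and 1:1, 1:2, and 2:1 aspect ratios (a common Faster R-CNN configuration, used here only for illustration), yielding k = 9 anchors per sliding-window position:

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """All scale/ratio anchors centered at one sliding-window position."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area near s * s while varying its shape.
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

stride = 16                     # assumed feature-map stride w.r.t. the image
feat_h, feat_w = 38, 50         # assumed conv feature-map size
anchors = np.concatenate([anchors_at(x * stride, y * stride)
                          for y in range(feat_h) for x in range(feat_w)])
print(anchors.shape)            # (17100, 4): 38 * 50 positions x 9 anchors
```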
With the proposal of Faster R-CNN, region proposal-based CNN architectures for object detection can really be trained in an end-to-end way. Also, a frame rate of 5 frames per second (fps) on a GPU is achieved with state-of-the-art object detection accuracy on PASCAL VOC 2007 and 2012. However, the alternate training algorithm is very time-consuming, and RPN produces objectlike regions (including backgrounds) instead of object instances and is not skilled in dealing with objects of extreme scales or shapes.

5) R-FCN: Divided by the RoI pooling layer, a prevalent family [16], [17] of deep networks for object detection is composed of two subnetworks: a shared fully convolutional subnetwork (independent of RoIs) and an unshared RoI-wise subnetwork. This decomposition originates from pioneering classification architectures (e.g., AlexNet [6] and VGG16 [46]), which consist of a convolutional subnetwork and several FC layers separated by a specific spatial pooling layer. Recent state-of-the-art image classification networks, such as ResNets [47] and GoogLeNets [45], [93], are fully convolutional. To adapt to these architectures, it is natural to construct a fully convolutional object detection network without an RoI-wise subnetwork. However, such a naive solution turns out to be inferior [47]. This inconsistency is due to the dilemma of respecting translation variance in object detection compared with increasing translation invariance in image classification. In other words, shifting an object inside an image should be indiscriminative in image classification, while any translation of an object in a bounding box may be meaningful in object detection. A manual insertion of the RoI pooling layer into convolutions can break down translation invariance at the expense of additional unshared regionwise layers. Therefore, Dai et al. [65] proposed the R-FCN (see Fig. S2 in the supplementary material).

Different from Faster R-CNN, for each category, the last conv layer of R-FCN first produces a total of k² position-sensitive score maps with a fixed grid of k × k, and a position-sensitive RoI pooling layer is then appended to aggregate the responses from these score maps. Finally, in each RoI, the k² position-sensitive scores are averaged to produce a (C + 1)-d vector, and softmax responses across categories are computed. Another 4k²-d conv layer is appended to obtain class-agnostic BBs.

With R-FCN, more powerful classification networks can be adopted to accomplish object detection in a fully convolutional architecture by sharing nearly all the layers, and state-of-the-art results are obtained on both the PASCAL VOC and Microsoft COCO [94] data sets at a test speed of 170 ms per image.

6) FPN: Feature pyramids built upon image pyramids (featurized image pyramids) have been widely applied in many object detection systems to improve scale invariance [24], [64] [Fig. 7(a)]. However, training time and memory consumption increase rapidly. To this end, some techniques take only a single input scale to represent high-level semantics and increase the robustness to scale changes [Fig. 7(b)], and image pyramids are built at test time, which results in an inconsistency between train/test-time inferences [16], [17]. The in-network feature hierarchy in a deep ConvNet produces feature maps of different spatial resolutions while introducing large semantic gaps caused by different depths [Fig. 7(c)]. To avoid using low-level features, pioneer works [71], [95] usually build the pyramid starting from middle layers or just sum transformed feature responses, missing the higher resolution maps of the feature hierarchy.

Fig. 7. Main concern of FPN [66]. (a) It is slow to use an image pyramid to build a feature pyramid. (b) Only single-scale features are adopted for faster detection. (c) An alternative to the featurized image pyramid is to reuse the pyramidal feature hierarchy computed by a ConvNet. (d) FPN integrates both (b) and (c). Blue outlines indicate feature maps and thicker outlines denote semantically stronger features.

Different from these approaches, FPN [66] holds an architecture with a BU pathway, a top-down (TD) pathway, and several lateral connections to combine low-resolution and semantically strong features with high-resolution and semantically weak features [Fig. 7(d)]. The BU pathway, which is the basic forward backbone ConvNet, produces a feature hierarchy by downsampling the corresponding feature maps with a stride of 2. The layers owning the same size of output maps are grouped into the same network stage, and the output of the last layer of each stage is chosen as the reference set of feature maps to build the following TD pathway.

To build the TD pathway, feature maps from higher network stages are upsampled at first and then enhanced with those of the same spatial size from the BU pathway via lateral connections. A 1 × 1 conv layer is appended to the corresponding BU map to reduce its channel dimensions, and the merging with the upsampled map is achieved by elementwise addition. Finally, a 3 × 3 convolution is also appended to each merged map to reduce the aliasing effect of upsampling, and the final feature map is generated. This process is iterated until the finest resolution map is generated. As the feature pyramid can extract rich semantics from all levels and be trained end to end with all scales, state-of-the-art representations can be obtained without sacrificing speed and memory. Meanwhile, FPN is independent of the backbone CNN architectures and can be applied to different stages of object detection (e.g., region proposal generation) and to many other computer vision tasks (e.g., instance segmentation).
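One TD merge step can be sketched as follows (a schematic NumPy version: the 1 × 1 convolution is reduced to a per-pixel channel-mixing matrix, upsampling to nearest-neighbor repetition, and the channel/spatial sizes are arbitrary assumptions):

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution = per-pixel channel mixing. x: (C_in, H, W)."""
    return np.tensordot(w, x, axes=([1], [0]))   # -> (C_out, H, W)

def upsample2x(x):
    """Nearest-neighbor 2x upsampling along both spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def td_merge(top, bottom_up, w_lateral):
    """One FPN merge: upsample the coarser TD map and add the
    channel-reduced BU map from the lateral connection elementwise."""
    return upsample2x(top) + conv1x1(bottom_up, w_lateral)

rng = np.random.default_rng(0)
c4 = rng.standard_normal((512, 14, 14))   # BU stage output (assumed sizes)
p5 = rng.standard_normal((256, 7, 7))     # coarser TD feature map
w = rng.standard_normal((256, 512))       # 1 x 1 conv reducing 512 -> 256
p4 = td_merge(p5, c4, w)                  # a 3 x 3 conv would follow here
print(p4.shape)                           # (256, 14, 14)  to reduce aliasing
```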
7) Mask R-CNN: Instance segmentation [96] is a challenging task, which requires detecting all objects in an image and segmenting each instance (semantic segmentation [97]). These two tasks are usually regarded as two independent processes, and a multitask scheme will create spurious edges and exhibit systematic errors on overlapping instances [98]. To solve this problem, parallel to the existing branches in Faster R-CNN for classification and bounding box regression, Mask R-CNN [67] adds a branch to predict segmentation masks in a pixel-to-pixel manner (Fig. 8).

Fig. 8. Mask R-CNN framework for instance segmentation [67].

Different from the other two branches, which are inevitably collapsed into short output vectors by FC layers, the segmentation mask branch encodes an m × m mask to maintain the explicit object spatial layout. This kind of fully convolutional representation requires fewer parameters but is more accurate than that in [97]. Formally, besides the two losses in (1) for classification and bounding box regression, an additional loss for the segmentation mask branch is defined to reach a multitask loss. This loss is only associated with the ground-truth class and relies on the classification branch to predict the category.

Because RoI pooling, the core operation in Faster R-CNN, performs a coarse spatial quantization for feature extraction, misalignment is introduced between the RoI and the features. It affects classification little because of its robustness to small translations. However, it has a large negative effect on pixel-to-pixel mask prediction. To solve this problem, Mask R-CNN adopts a simple and quantization-free layer, namely, RoIAlign, to preserve the explicit per-pixel spatial correspondence faithfully. RoIAlign is achieved by replacing the harsh quantization of RoI pooling with bilinear interpolation [99], computing the exact values of the input features at four regularly sampled locations in each RoI bin. In spite of its simplicity, this seemingly minor change improves mask accuracy greatly, especially under strict localization metrics.
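The essence of RoIAlign is to evaluate features at real-valued coordinates by bilinear interpolation instead of rounding them. A minimal single-channel sketch (one sampling point per bin for brevity, whereas the actual layer samples four regularly spaced points per bin):

```python
import numpy as np

def bilinear(feat, y, x):
    """Feature value at real-valued (y, x) via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align(feat, roi, out_size=7):
    """Pool an RoI given in continuous feature-map coordinates, with no
    quantization of the RoI boundaries or of the bin sampling points."""
    y1, x1, y2, x2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # One regularly spaced sample at the center of each bin.
            out[i, j] = bilinear(feat, y1 + (i + 0.5) * bin_h,
                                 x1 + (j + 0.5) * bin_w)
    return out

feat = np.random.rand(32, 32)
print(roi_align(feat, (3.7, 5.2, 18.9, 27.1)).shape)   # (7, 7)
```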
Given the Faster R-CNN framework, the mask branch only adds a small computational burden, and its cooperation with the other tasks provides complementary information for object detection. As a result, Mask R-CNN is simple to implement with promising instance segmentation and object detection results. In a word, Mask R-CNN is a flexible and efficient framework for instance-level recognition, which can be easily generalized to other tasks (e.g., human pose estimation [7], [S4]) with minimal modification.

8) Multitask Learning, Multiscale Representation, and Contextual Modeling: Although the Faster R-CNN gets promising results with several hundred proposals, it still struggles in small-size object detection and localization, mainly due to the coarseness of its feature maps and the limited information provided in particular candidate boxes. The phenomenon is more obvious on the Microsoft COCO data set, which consists of objects at a broad range of scales and less prototypical images and requires more precise localization. To tackle these problems, it is of necessity to accomplish object detection with multitask learning [100], multiscale representation [95], and context modeling [101] to combine complementary information from multiple sources.

Multitask learning learns a useful representation for multiple correlated tasks from the same input [102], [103]. Brahmbhatt et al. [100] introduced conv features trained for object segmentation and "stuff" (amorphous categories such as ground and water) to guide accurate object detection of small objects (StuffNet). Dai et al. [97] presented multitask network cascades of three networks, namely, class-agnostic region proposal generation, pixel-level instance segmentation, and regional instance classification. Li et al. [104] incorporated the

weakly supervised object segmentation cues and region-based object detection into a multistage architecture to fully exploit the learned segmentation features.

Multiscale representation combines activations from multiple layers with skip-layer connections to provide semantic information of different spatial resolutions [66]. Cai et al. [105] proposed the multiscale CNN (MS-CNN) to ease the inconsistency between the sizes of objects and receptive fields with multiple scale-independent output layers. Yang et al. [34] investigated two strategies, namely, scale-dependent pooling (SDP) and layerwise cascaded rejection classifiers (CRCs), to exploit appropriate scale-dependent conv features. Kong et al. [101] proposed the HyperNet to calculate the shared features between RPN and the object detection network by aggregating and compressing hierarchical feature maps from different resolutions into a uniform space.

Contextual modeling improves detection performance by exploiting features from or around RoIs of different support regions and resolutions to deal with occlusions and local similarities [95]. Zhu et al. [106] proposed the SegDeepM to exploit object segmentation, which reduces the dependency on initial candidate boxes with the Markov random field. Moysset et al. [108] took advantage of four directional 2-D long short-term memories (LSTMs) [107] to convey global context between different local regions and reduced trainable parameters with local parameter sharing. Zeng et al. [109] proposed a novel gated bidirectional-net (GBD-Net) by introducing gated functions to control message transmission between different support regions.

The combination incorporates the different components above into the same model to improve detection performance further. Gidaris and Komodakis [110] proposed the multiregion CNN (MR-CNN) model to capture different aspects of an object, the distinct appearances of various object parts, and semantic segmentation-aware features. To obtain contextual and multiscale representations, Bell et al. [95] proposed the inside–outside net (ION) by exploiting information both inside and outside the RoI with spatial recurrent neural networks [111] and skip pooling [101]. Zagoruyko et al. [112] proposed the MultiPath architecture by introducing three modifications to the Fast R-CNN, including multiscale skip connections [95], a modified foveal structure [110], and a novel loss function summing different intersection over union (IoU) losses.

9) Thinking in Deep Learning-Based Object Detection: Apart from the above-mentioned approaches, there are still many important factors for continued progress.

There is a large imbalance between the number of annotated objects and background examples. To address this problem, Shrivastava et al. [113] proposed an effective online hard example mining algorithm (OHEM) for automatic selection of the hard examples, which leads to a more effective and efficient training.

Instead of concentrating on feature extraction, Ren et al. [114] made a detailed analysis on object classifiers and found that it is of particular importance for object detection to construct a deep and convolutional per-region classifier carefully, especially for ResNets [47] and GoogLeNets [45].

The traditional CNN framework for object detection is not skilled in handling significant scale variation, occlusion, or truncation, especially when only 2-D object detection is involved. To address this problem, Xiang et al. [60] proposed a novel subcategory-aware RPN, which guides the generation of region proposals with subcategory information related to object poses and jointly optimizes object detection and subcategory classification.

Ouyang et al. [115] found that the samples from different classes follow a long-tailed distribution, which indicates that different classes with distinct numbers of samples have different degrees of impact on feature learning. To this end, objects are first clustered into visually similar class groups, and then a hierarchical feature learning scheme is adopted to learn deep representations for each group separately.

In order to minimize the computational cost and achieve state-of-the-art performance, with the "deep and thin" design principle and following the pipeline of Fast R-CNN, Hong et al. [116] proposed the architecture of PVANET, which adopts some building blocks including concatenated ReLU [117], Inception [45], and HyperNet [101] to reduce the expense on multiscale feature extraction and trains the network with BN [43], residual connections [47], and learning rate scheduling based on plateau detection [47]. The PVANET achieves state-of-the-art performance and can be processed in real time on a Titan X GPU (21 fps).

B. Regression/Classification-Based Framework

Region proposal-based frameworks are composed of several correlated stages, including region proposal generation, feature extraction with CNN, classification, and bounding box regression, which are usually trained separately. Even in the recent end-to-end module Faster R-CNN, an alternating training is still required to obtain shared convolution parameters between RPN and the detection network. As a result, the time spent in handling different components becomes the bottleneck in real-time applications.

One-step frameworks based on global regression/classification, mapping straightly from image pixels to bounding box coordinates and class probabilities, can reduce the time expense. We first review some pioneer CNN models and then focus on two significant frameworks, namely, YOLO [18] and SSD [71].

1) Pioneer Works: Previous to YOLO and SSD, many researchers had already tried to model object detection as a regression or classification task.

Szegedy et al. [118] formulated the object detection task as a DNN-based regression, generating a binary mask for the test image and extracting detections with a simple bounding box inference. However, the model has difficulty in handling overlapping objects, and BBs generated by direct upsampling are far from perfect.

Pinheiro et al. [119] proposed a CNN model with two branches: one generates class-agnostic segmentation masks and the other predicts the likelihood of a given patch being centered on an object. Inference is efficient since class scores and segmentation can be obtained in a single model with most of the CNN operations shared.

Erhan et al. [68] and Szegedy et al. [120] proposed the regression-based MultiBox to produce scored class-agnostic region proposals. A unified loss was introduced to bias both localization and confidences of multiple components to predict the coordinates of class-agnostic BBs. However, a large number of additional parameters are introduced to the final layer.

Yoo et al. [69] adopted an iterative classification approach to handle object detection and proposed an impressive end-to-end CNN architecture named AttentionNet. Starting from the top-left and bottom-right corners of an image, AttentionNet points to a target object by generating quantized weak directions and converges to an accurate object boundary box with an ensemble of iterative predictions. However, the model becomes quite inefficient when handling multiple categories with a progressive two-step procedure.

Najibi et al. [70] proposed a proposal-free iterative grid-based object detector (G-CNN), which models object detection as finding a path from a fixed grid to boxes tightly surrounding the objects [70]. Starting with a fixed multiscale bounding box grid, G-CNN trains a regressor to move and scale elements of the grid toward objects iteratively. However, G-CNN has difficulty in dealing with small or highly overlapping objects.

2) YOLO: Redmon et al. [18] proposed a novel framework called YOLO, which makes use of the whole topmost feature map to predict both confidences for multiple categories and BBs. The basic idea of YOLO is exhibited in Fig. 9.

Fig. 9. Main idea of YOLO [18].

YOLO divides the input image into an S × S grid, and each grid cell is responsible for predicting the object centered in that grid cell. Each grid cell predicts B BBs and their corresponding confidence scores. Formally, the confidence scores are defined as $\Pr(\mathrm{Object}) * \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$, which indicates how likely there exist objects ($\Pr(\mathrm{Object}) \geq 0$) and shows the confidence of the prediction ($\mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$). At the same time, regardless of the number of boxes, C conditional class probabilities ($\Pr(\mathrm{Class}_i \mid \mathrm{Object})$) should also be predicted in each grid cell. It should be noticed that only the contribution from the grid cell containing an object is calculated.

At test time, class-specific confidence scores for each box are achieved by multiplying the individual box confidence predictions with the conditional class probabilities as follows:

$$\Pr(\mathrm{Object}) * \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} * \Pr(\mathrm{Class}_i \mid \mathrm{Object}) = \Pr(\mathrm{Class}_i) * \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} \quad (5)$$

where both the existing probability of class-specific objects in the box and the fitness between the predicted box and the object are taken into consideration.

During training, the following loss function is optimized:

$$\begin{aligned}
& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} (p_i(c) - \hat{p}_i(c))^2. \quad (6)
\end{aligned}$$

In a certain cell $i$, $(x_i, y_i)$ denote the center of the box relative to the bounds of the grid cell, $(w_i, h_i)$ are the normalized width and height relative to the image size, $C_i$ represents the confidence scores, $\mathbb{1}_i^{\mathrm{obj}}$ indicates the existence of objects, and $\mathbb{1}_{ij}^{\mathrm{obj}}$ denotes that the prediction is conducted by the $j$th bounding box predictor. Note that the loss function penalizes classification errors only when an object is present in that grid cell. Similarly, bounding box coordinate errors are penalized only when the predictor is "responsible" for the ground truth box (i.e., the highest IoU of any predictor in that grid cell is achieved).
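At test time, (5) is an elementwise product over the predicted tensors. A sketch with S = 7, B = 2, and C = 20, the configuration reported for PASCAL VOC (the random arrays stand in for real network outputs):

```python
import numpy as np

S, B, C = 7, 2, 20                  # grid size, boxes per cell, classes
rng = np.random.default_rng(0)
box_conf = rng.random((S, S, B))    # Pr(Object) * IOU for every box
class_prob = rng.random((S, S, C))  # Pr(Class_i | Object) for every cell
class_prob /= class_prob.sum(-1, keepdims=True)

# Eq. (5): class-specific score = box confidence x conditional class prob.
scores = box_conf[..., None] * class_prob[:, :, None, :]   # (S, S, B, C)

# 98 boxes x 20 class-specific confidences, ready for thresholding + NMS.
print(scores.reshape(S * S * B, C).shape)   # (98, 20)
```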
The YOLO consists of 24 conv layers and 2 FC layers, of which some conv layers construct ensembles of inception modules with 1 × 1 reduction layers followed by 3 × 3 conv layers. The network can process images in real time at 45 fps, and a simplified version, Fast YOLO, can reach 155 fps with better results than other real-time detectors. Furthermore, YOLO produces fewer false positives on the background, which makes cooperation with Fast R-CNN possible. An improved version, YOLOv2, was later proposed in [72], which adopts several impressive strategies, such as BN, anchor boxes, dimension clusters, and multiscale training.

3) SSD: YOLO has a difficulty in dealing with small objects in groups, which is caused by the strong spatial constraints imposed on bounding box predictions [18]. Meanwhile, YOLO struggles to generalize to objects in new/unusual aspect ratios/configurations and produces relatively coarse features due to multiple downsampling operations.

Aiming at these problems, Liu et al. [71] proposed the SSD, which was inspired by the anchors adopted in MultiBox [68], RPN [17], and multiscale representation [95]. Given a specific feature map, instead of the fixed grids adopted in YOLO, the SSD takes advantage of a set of default anchor boxes with different aspect ratios and scales to discretize the output space of BBs. To handle objects of various sizes, the network fuses predictions from multiple feature maps with different resolutions.

The architecture of SSD is demonstrated in Fig. 10. Given the VGG16 backbone architecture, SSD adds several feature layers to the end of the network, which are responsible for predicting the offsets to default boxes with different scales and aspect ratios and their associated confidences. The network is trained with a weighted sum of localization loss (e.g., smooth L1) and confidence loss (e.g., softmax), which is similar to (1). Final detection results are obtained by conducting NMS on multiscale refined BBs.

TABLE I
OVERVIEW OF P ROMINENT G ENERIC O BJECT D ETECTION A RCHITECTURES

Fig. 10. Architecture of SSD 300 [71]. SSD adds several feature layers to the end of VGG16 backbone network to predict the offsets to default anchor
boxes and their associated confidences. Final detection results are obtained by conducting NMS on multiscale refined BBs.

Integrating hard negative mining, data augmentation, and a larger number of carefully chosen default anchors, SSD significantly outperforms Faster R-CNN in terms of accuracy on PASCAL VOC and COCO while being three times faster. The SSD300 (input image size of 300 × 300) runs at 59 fps, which is more accurate and efficient than YOLO. However, SSD is not skilled at dealing with small objects, which can be relieved by adopting a better feature extractor backbone (e.g., ResNet101), adding deconvolution layers with skip connections to introduce additional large-scale context [73], and designing a better network structure (e.g., stem block and dense block) [74].

C. Experimental Evaluation

We compare various object detection methods on three benchmark data sets, including PASCAL VOC 2007 [25], PASCAL VOC 2012 [121], and Microsoft COCO [94]. The evaluated approaches include R-CNN [15], SPP-net [64], Fast R-CNN [16], networks on convolutional feature maps (NOC) [114], Bayes [85], MR-CNN&S-CNN [105], Faster R-CNN [17], HyperNet [101], ION [95], MS-GR [104], StuffNet [100], SSD300 [71], SSD512 [71], OHEM [113], SDP+CRC [34], G-CNN [70], SubCNN [60], GBD-Net [109], PVANET [116], YOLO [18], YOLOv2 [72], R-FCN [65], FPN [66], Mask R-CNN [67], DSSD [73], and DSOD [74]. If no specific instructions for the adopted framework are provided, the utilized model is a VGG16 [46] pretrained on the 1000-way ImageNet classification task [39]. Due to the limitation of the paper length, we only provide an overview, including proposal, learning method, loss function, programming language, and platform, of the prominent architectures in Table I; detailed experimental settings are omitted here and can be found in the original papers. In addition to the comparisons of detection accuracy, another comparison is provided to evaluate test consumption on PASCAL VOC 2007.

TABLE I. Overview of Prominent Generic Object Detection Architectures.

1) PASCAL VOC 2007/2012: The PASCAL VOC 2007 and 2012 data sets consist of 20 categories. The evaluation terms are AP in each single category and mAP across all the 20 categories (a minimal sketch of the AP computation is given after the remarks below). Comparative results are exhibited in Tables II and III, from which the following remarks can be obtained.

TABLE II. Comparative Results on VOC 2007 Test Set (%).

TABLE III. Comparative Results on VOC 2012 Test Set (%).

1) If incorporated in a proper way, more powerful backbone CNN models can definitely improve object detection performance (compare R-CNN with AlexNet, R-CNN with VGG16, and SPP-net with ZF-Net [122]).
2) With the introduction of the SPP layer (SPP-net), end-to-end multitask architecture (FRCN), and RPN (Faster R-CNN), object detection performance is improved gradually and apparently.
3) Due to the large number of trainable parameters, data augmentation is very important for deep learning-based models to obtain multilevel robust features (Faster R-CNN with "07," "07 + 12," and "07 + 12 + coco").
4) Apart from basic models, there are still many other factors affecting object detection performance, such as multiscale and multiregion feature extraction (e.g., MR-CNN), modified classification networks (e.g., NOC), additional information from other correlated tasks (e.g., StuffNet, HyperNet), multiscale representation (e.g., ION), and mining of hard negative samples (e.g., OHEM).
5) As YOLO is not skilled in producing object localizations of high IoU, it obtains a very poor result on VOC 2012. However, with complementary information from Fast R-CNN (YOLO+FRCN) and the aid of other strategies, such as anchor boxes, BN, and fine-grained features, the localization errors are corrected (YOLOv2).
6) By combining many recent tricks and modeling the whole network as a fully convolutional one, R-FCN achieves a more obvious improvement of detection performance over other approaches.
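Both VOC benchmarks score a detector by AP per category and mAP over categories; as a reference, the sketch below computes a VOC-style all-point-interpolated AP from a ranked detection list. The inputs are hypothetical: is_tp marks detections that matched a previously unclaimed ground truth at the benchmark's IoU threshold, and mAP is simply the mean of these per-category APs.

import numpy as np

def voc_ap(scores, is_tp, num_gt):
    """AP for one category: rank detections by score, sweep the list to get
    a precision-recall curve, and integrate its upper envelope (VOC style)."""
    order = np.argsort(scores)[::-1]
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for k in range(p.size - 2, -1, -1):          # make precision monotone
        p[k] = max(p[k], p[k + 1])
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Hypothetical ranked detections for one category (1 = matched ground truth).
print(voc_ap(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 0, 1, 1], num_gt=4))  # 0.625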
2) Microsoft COCO: Microsoft COCO is composed of 300 000 fully segmented images, in which each image has an average of 7 object instances from a total of 80 categories. As there are a lot of less iconic objects with a broad range of scales and a stricter requirement on object localization, this data set is more challenging than PASCAL 2012. Object detection performance is evaluated by AP computed under different degrees of IoU and on different object sizes. The results are given in Table IV.

TABLE IV. Comparative Results on Microsoft COCO Test Dev Set (%).

Besides similar remarks to those of PASCAL VOC, some other conclusions can be drawn from Table IV as follows.
1) Multiscale training and test are beneficial in improving object detection performance, as they provide additional information in different resolutions (R-FCN). FPN and DSSD provide some better ways to build feature pyramids to achieve multiscale representation. The complementary information from other related tasks is also helpful for accurate object localization (Mask R-CNN with its instance segmentation task).
2) Overall, region proposal-based methods, such as Faster R-CNN and R-FCN, perform better than regression/classification-based approaches, namely, YOLO and SSD, due to the fact that quite a lot of localization errors are produced by regression/classification-based approaches.
3) Context modeling is helpful to locate small objects, as it provides additional information by consulting nearby objects and surroundings (GBD-Net and MultiPath).
4) Due to the existence of a large number of nonstandard small objects, the results on this data set are much worse than those of VOC 2007/2012. With the introduction of other powerful frameworks (e.g., ResNeXt [123]) and useful strategies (e.g., multitask learning [67], [124]), the performance can be improved.
5) The success of DSOD in training from scratch stresses the importance of the network design in releasing the requirements for perfect pretrained classifiers on relevant tasks and a large number of annotated samples.

3) Timing Analysis: Timing analysis (Table V) is conducted on an Intel i7-6700K CPU with a single core and an NVIDIA Titan X GPU. Except for "SS," which is processed with the CPU, the other procedures related to CNN are all evaluated on the GPU. From Table V, we can draw some conclusions as follows.

TABLE V. Comparison of Testing Consumption on VOC 07 Test Set.

1) By computing CNN features on shared feature maps (SPP-net), test consumption is reduced largely. Test time is further reduced with unified multitask learning (FRCN) and removal of the additional region proposal generation stage (Faster R-CNN). It is also helpful to compress the parameters of FC layers with SVD [91] (PVANET and FRCN).
2) It takes additional test time to extract multiscale features and contextual information (ION and MR-CNN&S-CNN).
3) It takes more time to train a more complex and deeper network (ResNet101 against VGG16), and this time consumption can be reduced by adding as many layers into the shared fully convolutional layers as possible (FRCN).
4) Regression-based models can usually be processed in real time at the cost of a drop in accuracy compared with region proposal-based models. Also, region proposal-based models can be modified into real-time systems with the introduction of other tricks [116] (PVANET), such as BN [43] and residual connections [123].

IV. SALIENT OBJECT DETECTION

Visual saliency detection, one of the most important and challenging tasks in computer vision, aims to highlight the most dominant object regions in an image. Numerous applications incorporate visual saliency to improve their performance, such as image cropping [125] and segmentation [126], image retrieval [57], and object detection [66].

Broadly, there are two branches of approaches in salient object detection, namely, BU [127] and TD [128]. Local feature contrast plays the central role in BU salient object detection, regardless of the semantic contents of the scene. To learn local feature contrast, various local and global features are extracted from pixels, e.g., edges [129] and spatial information [130]. However, high-level and multiscale semantic information cannot be explored with these low-level features. As a result, low-contrast salient maps instead of salient objects are obtained. TD salient object detection is task-oriented and takes prior knowledge about object categories to guide the generation of salient maps. Taking semantic segmentation as an example, a saliency map is generated in the segmentation to assign pixels to particular object categories via a TD approach [131]. In a word, TD saliency can be viewed as a focus-of-attention mechanism, which prunes BU salient points that are unlikely to be parts of the object [132].

A. Deep Learning in Salient Object Detection

Due to its significance for providing high-level and multiscale feature representation and its successful applications in many correlated computer vision tasks, such as semantic segmentation [131], edge detection [133], and generic object detection [16], it is feasible and necessary to extend CNN to salient object detection.

The early work by Vig et al. [29] follows a completely automatic data-driven approach to perform a large-scale search for optimal features, namely, an ensemble of deep networks with different layers and parameters. To address the problem of limited training data, Kummerer et al. [134] proposed Deep Gaze by transferring from AlexNet to generate a high-dimensional feature space and create a saliency map. A similar architecture was proposed by Huang et al. [135] to integrate saliency prediction into pretrained object recognition DNNs. The transfer is accomplished by fine-tuning the DNNs' weights with an objective function based on saliency evaluation metrics, such as similarity, KL-divergence, and normalized scanpath saliency.

Some works combined local and global visual clues to improve salient object detection performance. Wang et al. [136] trained two independent deep CNNs (DNN-L and DNN-G) to capture local information and global contrast and predicted saliency maps by integrating both local estimation and global search. Cholakkal et al. [137] proposed a weakly supervised saliency detection framework to combine visual saliency from BU and TD saliency maps and refined the results with multiscale superpixel averaging.
Zhao et al. [138] proposed a multicontext deep learning framework, which utilizes a unified learning framework to model global and local context jointly with the aid of superpixel segmentation. To predict saliency in videos, Bak et al. [139] fused two static saliency models, namely, a spatial stream net and a temporal stream net, into a two-stream framework with a novel, empirically grounded data augmentation technique.

Complementary information from semantic segmentation and context modeling is beneficial. To learn internal representations of saliency efficiently, He et al. [140] proposed a novel superpixelwise CNN approach called SuperCNN, in which salient object detection is formulated as a binary labeling problem. Based on a fully convolutional network, Li et al. [141] proposed a multitask deep saliency model, in which intrinsic correlations between saliency detection and semantic segmentation are set up. However, due to the conv layers with large receptive fields and the pooling layers, blurry object boundaries and coarse saliency maps are produced. Tang and Wu [142] proposed a novel saliency detection framework (CRPSD), which combines region-level saliency estimation and pixel-level saliency prediction with three closely related CNNs. Li and Yu [143] proposed a deep contrast network to combine segmentwise spatial pooling and pixel-level fully convolutional streams.

The proper integration of multiscale feature maps is also of significance for improving detection performance. Based on Fast R-CNN, Wang et al. [144] proposed the RegionNet by performing salient object detection with end-to-end edge preserving and multiscale contextual modeling. Liu et al. [28] proposed a multiresolution CNN (Mr-CNN) to predict eye fixations, which is achieved by learning both BU visual saliency and TD visual factors from raw image data simultaneously. Cornia et al. [145] proposed an architecture that combines features extracted at different levels of the CNN. Li and Yu [146] proposed a multiscale deep CNN framework to extract three scales of deep contrast features, namely, the mean-subtracted region, the bounding box of its immediate neighboring regions, and the masked entire image, from each candidate region.

It is efficient and accurate to train a direct pixelwise CNN architecture to predict salient objects with the aid of recurrent neural networks and deconvolution networks. Pan et al. [147] formulated saliency prediction as a minimization of the Euclidean distance between the predicted saliency map and the ground truth and proposed two kinds of architectures: a shallow one trained from scratch and a deeper one adapted from a deconvoluted VGG network. As convolutional–deconvolution networks are not expert in recognizing objects of multiple scales, Kuen et al. [148] proposed a recurrent attentional convolutional–deconvolution network with several spatial transformer and recurrent network units to conquer this problem. To fuse local, global, and contextual information of salient objects, Tang et al. [149] developed a deeply supervised recurrent CNN to perform full image-to-image saliency detection.

B. Experimental Evaluation

Four representative data sets, including the Evaluation on Complex Scene Saliency Dataset (ECSSD) [156], HKU-IS [146], PASCALS [157], and SOD [158], are used to evaluate several state-of-the-art methods. ECSSD consists of 1000 structurally complex but semantically meaningful natural images. HKU-IS is a large-scale data set containing over 4000 challenging images; most of these images have more than one salient object and low contrast. PASCALS is a subset chosen from the validation set of the PASCAL VOC 2010 segmentation data set and is composed of 850 natural images. The SOD data set possesses 300 images containing multiple salient objects. The training and validation sets for the different data sets are kept the same as those in [152].

Two standard metrics, namely, the F-measure and the mean absolute error (MAE), are utilized to evaluate the quality of a saliency map. Given precision and recall values precomputed on the union of the generated binary mask B and the ground truth Z, the F-measure is defined as follows:

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

where β² is set to 0.3 in order to stress the importance of the precision value.

The MAE score is computed with the following equation:

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| \hat{S}(i,j) - \hat{Z}(i,j) \right| \tag{8}$$

where Ẑ and Ŝ represent the ground truth and the continuous saliency map, respectively, and W and H are the width and height of the salient area. This score stresses the importance of successfully detected salient objects over detected nonsalient pixels [159].
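Both metrics are direct to implement. The sketch below follows (7) and (8); the random map and mask are hypothetical stand-ins for a real prediction and ground truth, and the continuous map is binarized at 0.5 before computing the F-measure.

import numpy as np

def f_measure(pred_mask, gt_mask, beta2=0.3):
    """Eq. (7) with beta^2 = 0.3, from a binary prediction and ground truth."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

def mae(saliency_map, gt_mask):
    """Eq. (8): mean absolute error over all H x W pixels."""
    return float(np.abs(saliency_map - gt_mask).mean())

# Hypothetical 64 x 64 example: a random continuous map thresholded at 0.5.
rng = np.random.default_rng(1)
s = rng.random((64, 64))            # continuous saliency map in [0, 1]
z = rng.random((64, 64)) > 0.7      # binary ground truth
print(f_measure(s > 0.5, z), mae(s, z.astype(float)))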
The following approaches are evaluated: contextual hypergraph modeling (CHM) [150], RC [151], discriminative regional feature integration (DRFI) [152], MC [138], multiscale deep CNN features (MDF) [146], local estimation and global search (LEGS) [136], DSR [149], the multitask deep neural network (MT) [141], CRPSD [142], deep contrast learning (DCL) [143], encoded low-level distance (ELD) [153], nonlocal deep features (NLDF) [154], and deep supervision with short connections (DSSC) [155]. Among these methods, CHM, RC, and DRFI are the classical ones with the best performance [159], while the other methods are all associated with CNN. F-measure and MAE scores are given in Table VI.

TABLE VI. Comparison Between State-of-the-Art Methods.

From Table VI, we can find that CNN-based methods perform better than classic methods. MC and MDF combine the information from local and global context to reach a more accurate saliency. ELD refers to low-level handcrafted features for complementary information. LEGS adopts generic region proposals to provide initial salient regions, which may be insufficient for salient detection. DSR and MT act in different ways by introducing a recurrent network and semantic segmentation, respectively, which provide insights for future improvements. CRPSD, DCL, NLDF, and DSSC are all based on multiscale representations and superpixel segmentation, which provide robust salient regions and smooth boundaries. DCL, NLDF, and DSSC perform the best on these four data sets. DSSC earns the best performance by modeling scale-to-scale short connections.

Overall, as CNN mainly provides salient information in local regions, most of the CNN-based methods need to model visual saliency along region boundaries with the aid of superpixel segmentation. Meanwhile, the extraction of multiscale deep CNN features is of significance for measuring local conspicuity. Finally, it is necessary to strengthen local connections between different CNN layers as well as to utilize complementary information from local and global context.

V. FACE DETECTION

Face detection is essential to many face applications and acts as an important preprocessing procedure for face recognition [160]–[162], face synthesis [163], [164], and facial expression analysis [165]. Different from generic object detection, this task is to recognize and locate face regions covering a very large range of scales (30–300 pts versus 10–1000 pts). At the same time, faces have their unique object structural configurations (e.g., the distribution of different face parts) and characteristics (e.g., skin color). All these differences lead to special attention to this task. However, large visual variations of faces, such as occlusions, pose variations, and illumination changes, impose great challenges for this task in real applications.

The most famous face detector, proposed by Viola and Jones [166], trains cascaded classifiers with Haar-like features and AdaBoost, achieving good performance with real-time efficiency. However, this detector may degrade significantly in real-world applications due to larger visual variations of human faces. Different from this cascade structure, Felzenszwalb et al. [24] proposed a deformable part model (DPM) for face detection. However, for these traditional face detection methods, high computational expenses and large quantities of annotations are required to achieve a reasonable result. In addition, their performance is greatly restricted by manually designed features and shallow architectures.

A. Deep Learning in Face Detection

Recently, some CNN-based face detection approaches have been proposed [167]–[169]. As less accurate localization results from independent regressions of object coordinates, Yu et al. [167] proposed a novel IoU loss function for predicting the four bounds of a box jointly. Farfade et al. [168] proposed a deep dense face detector (DDFD) to conduct multiview face detection, which is able to detect faces in a wide range of orientations without the requirement of pose/landmark annotations. Yang et al. [169] proposed a novel deep learning-based face detection framework, which collects the responses from local facial parts (e.g., eyes, nose, and mouths) to address face detection under severe occlusions and unconstrained pose variations. Yang et al. [170] proposed a scale-friendly detection network named ScaleFace, which splits a large range of target scales into smaller subranges; different specialized subnetworks are constructed on these subscales and combined into a single one to conduct end-to-end optimization. Hao et al. [171] designed an efficient CNN to predict the scale distribution histogram of the faces and took this histogram to guide the zoomed-in and zoomed-out views of the image. Since the faces are approximately in uniform scale after the zoom, better performance is achieved with less computational cost compared with other state-of-the-art baselines. In addition, some generic detection frameworks have been extended to face detection with different modifications, e.g., Faster R-CNN [30], [172], [173].

Some authors trained CNNs with other complementary tasks, such as 3-D modeling and face landmarks, in a multitask learning manner. Huang et al. [174] proposed a unified end-to-end FCN framework called DenseBox to jointly conduct face detection and landmark localization. Li et al. [175] proposed a multitask discriminative learning framework that integrates a ConvNet with a fixed 3-D mean face model in an end-to-end manner.
In the framework, two issues are addressed to transfer from generic object detection to face detection, namely, eliminating the predefined anchor boxes by using a 3-D mean face model and replacing the RoI pooling layer with a configuration pooling layer. Zhang et al. [176] proposed a deep cascaded multitask framework named multitask cascaded convolutional networks (MTCNN), which exploits the inherent correlations between face detection and alignment in the unconstrained environment to boost up detection performance in a coarse-to-fine manner.

Reducing computational expenses is of necessity in real applications. To achieve real-time detection on the mobile platform, Kalinovskii and Spitsyn [177] proposed a new solution for frontal face detection based on compact CNN cascades. This method takes a cascade of three simple CNNs to generate, classify, and refine candidate object positions progressively. To reduce the effects of large pose variations, Chen et al. [32] proposed a cascaded CNN denoted the supervised transformer network. This network takes a multitask RPN to predict candidate face regions along with associated facial landmarks simultaneously and adopts a generic R-CNN to verify the existence of valid faces. Yang and Nevatia [8] proposed a novel three-stage cascade structure based on FCNs, where in each stage a multiscale FCN is utilized to refine the positions of possible faces. Qin et al. [178] proposed a unified framework that achieves better results with the complementary information from different jointly trained CNNs.

B. Experimental Evaluation

The FDDB [179] data set has a total of 2845 pictures in which 5171 faces are annotated with an elliptical shape. Two types of evaluations are used: the discrete score and the continuous score. By varying the threshold of the decision rule, the receiver operating characteristic (ROC) curve for the discrete scores can reflect the dependence of the detected face fractions on the number of false alarms. Compared with the annotations, any detection with an IoU ratio exceeding 0.5 is treated as positive, and each annotation is associated with only one detection. The ROC curve for the continuous scores is the reflection of face localization quality.
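The IoU criterion behind the discrete score can be sketched as follows. Axis-aligned boxes are used for simplicity, although FDDB annotations are actually ellipses, so this is an approximation of the benchmark's matching rule.

def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# FDDB-style discrete matching: positive if IoU with an annotation > 0.5.
det, ann = [0, 0, 10, 10], [2, 2, 12, 12]
print(iou(det, ann) > 0.5)   # False: IoU = 64 / 136 is about 0.47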
The evaluated models cover DDFD [168], Cascade-CNN [180], aggregate channel features (ACF)-multiscale [181], Pico [182], HeadHunter [183], Joint Cascade [31], SURF-multiview [184], Viola–Jones [166], NPDFace [185], Faceness [169], convolutional channel features (CCF) [186], MTCNN [176], Conv3-D [175], Hyperface [187], UnitBox [167], locally decorrelated channel features (LDCF+) [S2], DeepIR [173], the hybrid-resolution model with elliptical regressor (HR-ER) [188], Face-R-CNN [172], and ScaleFace [170]. ACF-multiscale, Pico, HeadHunter, Joint Cascade, SURF-multiview, Viola–Jones, NPDFace, and LDCF+ are built on classic hand-crafted features, while the rest of the methods are based on deep CNN features. The ROC curves are shown in Fig. 11.

Fig. 11. ROC curves of state-of-the-art methods on FDDB. (a) Discrete ROC curves. (b) Continuous ROC curves.

In Fig. 11(a), in spite of the relatively competitive results produced by LDCF+, it can be observed that most of the classic methods perform with similar results and are outperformed by CNN-based methods by a significant margin. In Fig. 11(b), it can be observed that most of the CNN-based methods earn similar true positive rates between 60% and 70%, while DeepIR and HR-ER perform much better than them. Among the classic methods, Joint Cascade is still competitive. As earlier works, DDFD and CCF directly make use of generated feature maps and obtain relatively poor results. CascadeCNN builds cascaded CNNs to locate face regions, which is efficient but inaccurate. Faceness combines the decisions from different part detectors, resulting in precise face localizations while being time-consuming. The outstanding performance of MTCNN, Conv3-D, and Hyperface proves the effectiveness of multitask learning. HR-ER and ScaleFace adaptively detect faces of different scales and make a balance between accuracy and efficiency. DeepIR and Face-R-CNN are two extensions of the Faster R-CNN architecture to face detection, which validate the significance and effectiveness of Faster R-CNN. UnitBox provides an alternative choice for performance improvements by carefully designing the optimization loss.

From these results, we can draw the conclusion that CNN-based methods are in the leading position. The performance can be improved by the following strategies: designing novel optimization losses, modifying generic detection pipelines, building meaningful network cascades, adapting scale-aware detection, and learning multitask shared CNN features.

VI. PEDESTRIAN DETECTION

Recently, pedestrian detection has been intensively studied, as it has a close relationship to pedestrian tracking [189], [190], person reidentification [191], [192], and robot navigation [193], [194]. Prior to the recent progress in deep CNN (DCNN)-based methods [195], [196], some researchers combined boosted decision forests with hand-crafted features to obtain pedestrian detectors [197]–[199]. At the same time, to explicitly model deformation and occlusion, part-based models [200] and explicit occlusion handling [201], [202] are of concern.

As there are many pedestrian instances of small sizes in typical scenarios of pedestrian detection (e.g., automatic driving and intelligent surveillance), the application of the RoI pooling layer in the generic object detection pipeline may result in "plain" features due to collapsing bins. In the meantime, the main source of false predictions in pedestrian detection is the confusion with hard background instances, which is in contrast to the interference from multiple categories in generic object detection. As a result, different configurations and components are required to accomplish accurate pedestrian detection.

A. Deep Learning in Pedestrian Detection

Although DCNNs have obtained excellent performance on generic object detection [16], [72], none of these approaches achieved better results than the best hand-crafted feature-based method [198] for a long time, even when part-based information and occlusion handling were incorporated [202]. Thereby, some studies have been conducted to analyze the reasons. Zhang et al. [203] attempted to adapt the generic Faster R-CNN [17] to pedestrian detection. They modified the downstream classifier by adding boosted forests to shared, high-resolution conv feature maps and took an RPN to handle small instances and hard negative examples.
To deal with the complex occlusions existing in pedestrian images, inspired by DPM [24], Tian et al. [204] proposed a deep learning framework called DeepParts, which makes decisions based on an ensemble of extensive part detectors. DeepParts has advantages in dealing with weakly labeled data, low-IoU positive proposals, and partial occlusion.

Other researchers have also tried to combine complementary information from multiple data sources. CompACT-Deep adopts a complexity-aware cascade to combine hand-crafted features and fine-tuned DCNNs [195]. Based on Faster R-CNN, Liu et al. [205] proposed multispectral DNNs for pedestrian detection to combine complementary information from color and thermal images. Tian et al. [206] proposed a task-assistant CNN to jointly learn multiple tasks with multiple data sources and to combine pedestrian attributes with semantic scene attributes. Du et al. [207] proposed a DNN fusion architecture for fast and robust pedestrian detection; based on the candidate BBs generated with SSD detectors [71], multiple binary classifiers are processed in parallel to conduct soft-rejection-based network fusion by consulting their aggregated degrees of confidence.

However, most of these approaches are much more sophisticated than the standard R-CNN framework. CompACT-Deep consists of a variety of hand-crafted features, a small CNN model, and a large VGG16 model [195]. DeepParts contains 45 fine-tuned DCNN models, and a set of strategies, including bounding box shifting handling and part selection, is required to arrive at the reported results [204]. Therefore, modification and simplification are of significance to reduce the burden on both software and hardware and to satisfy real-time detection demands. Tome et al. [59] proposed a novel solution to adapt the generic object detection pipeline to pedestrian detection by optimizing most of its stages. Hu et al. [208] trained an ensemble of boosted decision models by reusing the conv feature maps, and a further improvement was gained with simple pixel labeling and additional complementary hand-crafted features. Tome et al. [209] proposed a reduced-memory region-based deep CNN architecture, which fuses regional responses from both ACF detectors and SVM classifiers into R-CNN. Ribeiro et al. [33] addressed the problem of human-aware navigation and proposed a vision-based person tracking system guided by multiple camera sensors.

B. Experimental Evaluation

The evaluation is conducted on the most popular Caltech Pedestrian data set [3]. The data set was collected from the videos of a vehicle driving through an urban environment and consists of 250 000 frames with about 2300 unique pedestrians and 350 000 annotated BBs. Three kinds of labels, namely, "Person (clear identifications)," "Person? (unclear identifications)," and "People (large group of individuals)," are assigned to different BBs. The performance is measured with the log-average miss rate (L-AMR), which is computed by averaging the miss rate at nine false positives per image (FPPI) rates evenly spaced in log-space in the range 10^-2 to 10^0 [3]. According to the differences in the height and visible part of the BBs, a total of nine popular settings are adopted to evaluate different properties of these models. Details of these settings are as in [3].
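Given a detector's miss-rate/FPPI curve, L-AMR can be reproduced as sketched below. The curve points are made-up illustrative values, and the geometric-mean form is an assumption that follows the common reading of "log-average."

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Sample the miss-rate curve at 9 FPPI points evenly spaced in log-space
    over [1e-2, 1e0] and return their log-average (geometric mean)."""
    refs = np.logspace(-2.0, 0.0, 9)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # Use the miss rate at the largest FPPI not exceeding the reference.
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))

# Hypothetical monotone curve: miss rate drops as false positives increase.
fppi = np.array([0.01, 0.03, 0.1, 0.3, 1.0])
mr = np.array([0.60, 0.45, 0.30, 0.20, 0.12])
print(log_average_miss_rate(fppi, mr))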
Evaluated methods include Checkerboards+ [198], LDCF++ [S2], SCF+AlexNet [210], SA-FastRCNN [211], MS-CNN [105], DeepParts [204], CompACT-Deep [195], RPN+BF [203], and F-DNN+SS [207]. The first two methods are based on hand-crafted features, while the rest rely on deep CNN features. All results are exhibited in Table VII.

TABLE VII. Detailed Breakdown Performance Comparisons of State-of-the-Art Models on the Caltech Pedestrian Data Set. All Numbers Are Reported in L-AMR.

From this table, we observe that, different from other tasks, classic handcrafted features can still earn competitive results with boosted decision forests [203], ACF [197], and HOG+LUV channels [S2]. As an early attempt to adapt CNN to pedestrian detection, the features generated by SCF+AlexNet are not so discriminant and produce relatively poor results. Based on multiple CNNs, DeepParts and CompACT-Deep accomplish detection via different strategies, namely, local part integration and a cascade network. The responses from different local part detectors make DeepParts robust to partial occlusions; however, due to its complexity, it is too time-consuming to achieve real-time detection. The multiscale representation of MS-CNN improves the accuracy of pedestrian locations. SA-FastRCNN extends Fast R-CNN to automatically detect pedestrians according to their different scales, but it has trouble when there are partial occlusions. RPN+BF combines the detectors produced by Faster R-CNN with a boosting decision forest to accurately locate different pedestrians. F-DNN+SS, which is composed of multiple parallel classifiers with soft rejections, performs the best, followed by RPN+BF, SA-FastRCNN, and MS-CNN.

In short, CNN-based methods can provide more accurate candidate boxes and multilevel semantic information for identifying and locating pedestrians. Meanwhile, handcrafted features are complementary and can be combined with CNN to achieve better results. Improvements over existing CNN methods can be obtained by carefully designing the framework and classifiers, extracting multiscale and part-based semantic information, and searching for complementary information from other related tasks, such as segmentation.

VII. PROMISING FUTURE DIRECTIONS AND TASKS

In spite of the rapid development and the promising progress of object detection, there are still many open issues for future work.

The first one is small object detection, such as occurring in the COCO data set and in the face detection task. To improve localization accuracy on small objects under partial occlusions, it is necessary to modify network architectures from the following aspects.
1) Multitask Joint Optimization and Multimodal Information Fusion: Due to the correlations between different tasks within and outside object detection, multitask joint optimization has already been studied by many researchers [16], [17]. However, apart from the tasks mentioned in Section III-A8, it is desirable to think over the characteristics of different subtasks of object detection (e.g., superpixel semantic segmentation in salient object detection) and extend multitask optimization to other applications such as instance segmentation [66], multiobject tracking [202], and multiperson pose estimation [S4]. In addition, given a specific application, information from different modalities, such as text [212], thermal data [205], and images [65], can be fused together to achieve a more discriminant network.
2) Scale Adaption: Objects usually exist in different scales, which is more apparent in face detection and pedestrian detection. To increase the robustness to scale changes, it is demanded to train scale-invariant, multiscale, or scale-adaptive detectors. For scale-invariant detectors, more powerful backbone architectures (e.g., ResNeXt [123]), negative sample mining [113], reverse connection [213], and subcategory modeling [60] are all beneficial. For multiscale detectors, both the FPN [66], which produces multiscale feature maps, and the generative adversarial network [214], which narrows representation differences between small objects and large ones with a low-cost architecture, provide insights into generating meaningful feature pyramids. For scale-adaptive detectors, it is useful to combine knowledge graphs [215], attentional mechanisms [216], cascade networks [180], and scale distribution estimation [171] to detect objects adaptively.
3) Spatial Correlations and Contextual Modeling: Spatial distribution plays an important role in object detection. Therefore, region proposal generation and grid regression are taken to obtain probable object locations. However, the correlations between multiple proposals and object categories are ignored. In addition, the global structure information is abandoned by the position-sensitive score maps in R-FCN. To solve these problems, we can refer to diverse subset selection [217] and sequential reasoning tasks [218] for possible solutions. It is also meaningful to mask salient parts and couple them with the global structure in a joint-learning manner [219].

The second one is to release the burden on manual labor and accomplish real-time object detection, with the emergence of large-scale image and video data. The following three aspects can be taken into account.
1) Cascade Network: In a cascade network, a cascade of detectors is built in different stages or layers [180], [220]. Easily distinguishable examples are rejected at shallow layers so that features and classifiers at later stages can handle more difficult samples with the aid of the decisions from previous stages. However, current cascades are built in a greedy manner, where previous stages in the cascade are fixed when training a new stage. Therefore, the optimizations of different CNNs are isolated, which stresses the necessity of end-to-end optimization for the CNN cascade. At the same time, it is also a matter of concern to build contextual associated cascade networks with existing layers.
2) Unsupervised and Weakly Supervised Learning: It is very time-consuming to manually draw large quantities of BBs. To release this burden, semantic priors [55], unsupervised object discovery [221], multiple instance learning [222], and DNN prediction [47] can be integrated to make the best use of image-level supervision to assign object category tags to corresponding object regions and refine object boundaries. Furthermore, weak annotations (e.g., center-click annotations [223]) are also helpful for achieving high-quality detectors with modest annotation efforts, especially aided by the mobile platform.
3) Network Optimization: Given specific applications and platforms, it is significant to make a balance among speed, memory, and accuracy by selecting an optimal detection architecture [116], [224]. However, even if detection accuracy is somewhat reduced, it is more meaningful to learn compact models with a smaller number of parameters [209]. This situation can be relieved by introducing better pretraining schemes [225], knowledge distillation [226], and hint learning [227]. DSOD also provides a promising guideline to train from scratch to bridge the gap between different image sources and tasks [74].

The third one is to extend typical methods for 2-D object detection to 3-D object detection and video object detection, with the requirements from autonomous driving, intelligent transportation, and intelligent surveillance.
1) 3-D Object Detection: With the applications of 3-D sensors (e.g., Light Detection and Ranging and cameras), additional depth information can be utilized to better understand the images in 2-D and extend the image-level knowledge to the real world. However, seldom do these 3-D-aware techniques aim to place correct 3-D BBs around detected objects. To achieve better bounding results, multiview representation [181] and the 3-D proposal network [228] may provide some guidelines to encode depth information with the aid of inertial sensors (accelerometer and gyrometer) [229].
2) Video Object Detection: Temporal information across different frames plays an important role in understanding the behaviors of different objects. However, the accuracy suffers from degenerated object appearances (e.g., motion blur and video defocus) in videos, and the network is usually not trained end to end. To this end, spatiotemporal tubelets [230], optical flow [199], and LSTM [107] should be considered to fundamentally model object associations between consecutive frames.

VIII. CONCLUSION

Due to its powerful learning ability and advantages in dealing with occlusion, scale transformation, and background switches, deep learning-based object detection has been a research hotspot in recent years. This paper provides a detailed review on deep learning-based object detection frameworks that handle different subproblems, such as occlusion, clutter, and low resolution, with different degrees of modifications on R-CNN. The review starts with generic object detection pipelines, which provide base architectures for other related tasks. Then, three other common tasks, namely, salient object detection, face detection, and pedestrian detection, are also briefly reviewed. Finally, we propose several promising future directions to gain a thorough understanding of the object detection landscape. This review is also meaningful for the developments in neural networks and related learning systems, which provides valuable insights and guidelines for future progress.

REFERENCES

[1] P. F. Felzenszwalb et al., "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[2] K.-K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39–51, Jan. 1998.
[3] C. Wojek et al., "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 743–761, Apr. 2012.
[4] H. Kobatake and Y. Yoshinaga, "Detection of spicules on mammogram based on skeleton analysis," IEEE Trans. Med. Imag., vol. 15, no. 3, pp. 235–245, Jun. 1996.
[5] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM MM, 2014, pp. 675–678.
[6] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1097–1105.
[7] Z. Cao et al., "Realtime multi-person 2D pose estimation using part affinity fields," in Proc. CVPR, 2017, pp. 1302–1310.
[8] Z. Yang and R. Nevatia, "A multi-scale cascade fully convolutional network face detector," in Proc. ICPR, 2016, pp. 633–638.
[9] C. Chen et al., "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proc. ICCV, 2015, pp. 2722–2730.
[10] X. Chen et al., "Multi-view 3D object detection network for autonomous driving," in Proc. CVPR, 2017, pp. 6526–6534.
[11] A. Dundar et al., "Embedded streaming deep neural networks accelerator with applications," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1572–1583, Jul. 2017.
[12] R. J. Cintra et al., "Low-complexity approximate convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 5981–5992, 2018.
[13] S. H. Khan et al., "Cost-sensitive learning of deep feature representations from imbalanced data," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3573–3587, Aug. 2018.
[14] A. Stuhlsatz et al., "Feature extraction with deep neural networks by a generalized discriminant analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 4, pp. 596–608, Apr. 2012.
[15] R. Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. CVPR, 2014, pp. 580–587.
[16] R. Girshick, "Fast R-CNN," in Proc. ICCV, 2015, pp. 1440–1448.
[17] S. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. NIPS, 2015, pp. 91–99.
[18] J. Redmon et al., "You only look once: Unified, real-time object detection," in Proc. CVPR, 2016, pp. 779–788.
[19] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005, pp. 886–893.
[21] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Proc. ICIP, 2002, p. 1.
[22] C. Cortes and V. Vapnik, "Support vector machine," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[24] P. F. Felzenszwalb et al., "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[25] M. Everingham et al., "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2008.
[26] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[27] Y. LeCun et al., "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[28] N. Liu et al., "Predicting eye fixations using convolutional neural networks," in Proc. CVPR, 2015, pp. 362–370.
[29] E. Vig et al., "Large-scale optimization of hierarchical features for saliency prediction in natural images," in Proc. CVPR, 2014, pp. 2798–2805.
[30] H. Jiang and E. Learned-Miller, "Face detection with the Faster R-CNN," in Proc. FG, 2017, pp. 650–657.
[31] D. Chen et al., "Joint cascade face detection and alignment," in Proc. ECCV, 2014, pp. 109–122.
[32] D. Chen et al., "Supervised transformer network for efficient face detection," in Proc. ECCV, 2016, pp. 122–138.
[33] A. Mateus et al. (2016). "Efficient and robust pedestrian detection using deep learning for human-aware navigation." [Online]. Available: https://arxiv.org/abs/1607.04441
[34] F. Yang et al., "Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers," in Proc. CVPR, 2016, pp. 2129–2137.
[35] P. N. Druzhkov and V. D. Kustikova, "A survey of deep learning methods and software tools for image classification and object detection," Pattern Recognit. Image Anal., vol. 26, no. 1, pp. 9–15, 2016.
[36] W. Pitts and W. S. McCulloch, "How we know universals: The perception of auditory and visual forms," Bull. Math. Biophys., vol. 9, no. 3, pp. 127–147, 1947.
[37] D. E. Rumelhart et al., "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.
[38] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[39] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, 2009, pp. 248–255.
[40] L. Deng et al., "Binary coding of speech spectrograms using a deep auto-encoder," in Proc. INTERSPEECH, 2010, pp. 1692–1695.
[41] G. Dahl et al., "Phone recognition with the mean-covariance restricted Boltzmann machine," in Proc. NIPS, 2010, pp. 469–477.
[42] G. E. Hinton et al. (2012). "Improving neural networks by preventing co-adaptation of feature detectors." [Online]. Available: https://arxiv.org/abs/1207.0580
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, 2015, pp. 448–456.
[44] P. Sermanet et al. (2013). "OverFeat: Integrated recognition, localization and detection using convolutional networks." [Online]. Available: https://arxiv.org/abs/1312.6229
[45] C. Szegedy et al., "Going deeper with convolutions," in Proc. CVPR, 2015, pp. 1–9.
[46] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[47] K. He et al., "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.
[48] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010, pp. 807–814.
[49] M. Oquab et al., "Weakly supervised object recognition with convolutional neural networks," in Proc. NIPS, 2014, pp. 1–10.
[50] M. Oquab et al., "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. CVPR, 2014, pp. 1717–1724.
[51] F. M. Wadley, "Probit analysis: A statistical treatment of the sigmoid response curve," Ann. Entomol. Soc. Amer., vol. 67, no. 4, pp. 549–553, 1947.
[52] K. Kavukcuoglu et al., "Learning invariant features through topographic filter maps," in Proc. CVPR, 2009, pp. 1605–1612.
[53] K. Kavukcuoglu et al., "Learning convolutional feature hierarchies for visual recognition," in Proc. NIPS, 2010, pp. 1090–1098.
[54] M. D. Zeiler et al., "Deconvolutional networks," in Proc. CVPR, 2010, pp. 2528–2535.
[55] H. Noh et al., "Learning deconvolution network for semantic segmentation," in Proc. ICCV, 2015, pp. 1520–1528.
[56] Z.-Q. Zhao et al., "Plant leaf identification via a growing convolution neural network with progressive sample learning," in Proc. ACCV, 2014, pp. 348–361.
[57] A. Babenko et al., "Neural codes for image retrieval," in Proc. ECCV, 2014, pp. 584–599.
[58] J. Wan et al., "Deep learning for content-based image retrieval: A comprehensive study," in Proc. ACM MM, 2014, pp. 157–166.
[59] D. Tomè et al., "Deep convolutional neural networks for pedestrian detection," Signal Process., Image Commun., vol. 47, pp. 482–489, Sep. 2016.
[60] Y. Xiang et al., "Subcategory-aware convolutional neural networks for object proposals and detection," in Proc. WACV, 2017, pp. 924–933.
[61] Z.-Q. Zhao et al., "Pedestrian detection based on fast R-CNN and batch normalization," in Proc. ICIC, 2017, pp. 735–746.
[62] J. Ngiam et al., "Multimodal deep learning," in Proc. ICML, 2011, pp. 689–696.
[63] Z. Wu et al., "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in Proc. ACM MM, 2015, pp. 461–470.
[64] K. He et al., "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[65] J. Dai et al., "R-FCN: Object detection via region-based fully convolutional networks," in Proc. NIPS, 2016, pp. 379–387.
[66] T.-Y. Lin et al., "Feature pyramid networks for object detection," in Proc. CVPR, 2017, pp. 936–944.
[67] K. He et al., "Mask R-CNN," in Proc. ICCV, 2017, pp. 2980–2988.
[68] D. Erhan et al., "Scalable object detection using deep neural networks," in Proc. CVPR, 2014, pp. 2155–2162.
[69] D. Yoo et al., "AttentionNet: Aggregating weak directions for accurate object detection," in Proc. CVPR, 2015, pp. 2659–2667.
[70] M. Najibi et al., "G-CNN: An iterative grid based object detector," in Proc. CVPR, 2016, pp. 2369–2377.
[71] W. Liu et al., "SSD: Single shot multibox detector," in Proc. ECCV, 2016, pp. 21–37.
[72] J. Redmon and A. Farhadi. (2016). "YOLO9000: Better, faster, stronger." [Online]. Available: https://arxiv.org/abs/1612.08242
[73] C.-Y. Fu et al. (2017). "DSSD: Deconvolutional single shot detector." [Online]. Available: https://arxiv.org/abs/1701.06659
[74] Z. Shen et al., "DSOD: Learning deeply supervised object detectors from scratch," in Proc. ICCV, 2017, p. 7.
[75] G. E. Hinton et al., "Transforming auto-encoders," in Proc. ICANN, 2011, pp. 44–51.
[76] G. W. Taylor et al., "Learning invariance through imitation," in Proc. CVPR, 2011, pp. 2729–2736.
[77] X. Ren and D. Ramanan, "Histograms of sparse codes for object detection," in Proc. CVPR, 2013, pp. 3246–3253.
[78] J. R. R. Uijlings et al., "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[79] P. Sermanet et al., "Pedestrian detection with unsupervised multi-stage feature learning," in Proc. CVPR, 2013, pp. 3626–3633.
[80] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Proc. ECCV, 2014, pp. 725–739.
[81] P. Arbeláez et al., "Multiscale combinatorial grouping," in Proc. CVPR, 2014, pp. 328–335.
[82] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. ECCV, 2014, pp. 391–405.
[83] W. Kuo et al., "DeepBox: Learning objectness with convolutional networks," in Proc. ICCV, 2015, pp. 2479–2487.
[84] P. O. Pinheiro et al., "Learning to refine object segments," in Proc. ECCV, 2016, pp. 75–91.
[85] Y. Zhang et al., "Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction," in Proc. CVPR, 2015, pp. 249–258.
[86] S. Gupta et al., "Learning rich features from RGB-D images for object detection and segmentation," in Proc. ECCV, 2014, pp. 345–360.
[87] W. Ouyang et al., "DeepID-Net: Deformable deep convolutional neural networks for object detection," in Proc. CVPR, 2015, pp. 2403–2412.
[88] K. Lenc and A. Vedaldi. (2015). "R-CNN minus R." [Online]. Available: https://arxiv.org/abs/1506.06981
[89] S. Lazebnik et al., "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006, pp. 2169–2178.
[90] F. Perronnin et al., "Improving the Fisher kernel for large-scale image classification," in Proc. ECCV, 2010, pp. 143–156.
[91] J. Xue et al., "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. INTERSPEECH, 2013, pp. 2365–2369.
[92] S. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[93] C. Szegedy et al., "Rethinking the inception architecture for computer vision," in Proc. CVPR, 2016, pp. 2818–2826.
[94] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. ECCV, 2014, pp. 740–755.
[95] S. Bell et al., "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proc. CVPR, 2016, pp. 2874–2883.
[96] A. Arnab and P. H. S. Torr, "Pixelwise instance segmentation with a dynamically instantiated network," in Proc. CVPR, 2017, pp. 879–888.
[97] J. Dai et al., "Instance-aware semantic segmentation via multi-task network cascades," in Proc. CVPR, 2016, pp. 3150–3158.
[98] Y. Li et al., "Fully convolutional instance-aware semantic segmentation," in Proc. CVPR, 2017, pp. 4438–4446.
[99] M. Jaderberg et al., "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[100] S. Brahmbhatt et al., "StuffNet: Using 'Stuff' to improve object detection," in Proc. WACV, 2017, pp. 934–943.
[101] T. Kong et al., "HyperNet: Towards accurate region proposal generation and joint object detection," in Proc. CVPR, 2016, pp. 845–853.
[102] A. Pentina et al., "Curriculum learning of multiple tasks," in Proc. CVPR, 2015, pp. 5492–5500.
[103] J. Yim et al., "Rotating your face using multi-task deep neural network," in Proc. CVPR, 2015, pp. 676–684.
[104] J. Li et al. (2016). "Multi-stage object detection with group recursive learning." [Online]. Available: https://arxiv.org/abs/1608.05159
[105] Z. Cai et al., "A unified multi-scale deep convolutional neural network for fast object detection," in Proc. ECCV, 2016, pp. 354–370.
[106] Y. Zhu et al., "segDeepM: Exploiting segmentation and context in deep neural networks for object detection," in Proc. CVPR, 2015, pp. 4703–4711.
[107] W. Byeon et al., "Scene labeling with LSTM recurrent neural networks," in Proc. CVPR, 2015, pp. 3547–3555.
[108] B. Moysset et al. (2016). "Learning to detect and localize many objects from few examples." [Online]. Available: https://arxiv.org/abs/1611.05664
[109] X. Zeng et al., "Gated bi-directional CNN for object detection," in Proc. ECCV, 2016, pp. 354–369.
[110] S. Gidaris and N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware CNN model," in Proc. CVPR, 2015, pp. 1134–1142.
[111] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[112] S. Zagoruyko et al. (2016). "A MultiPath network for object detection." [Online]. Available: https://arxiv.org/abs/1604.02135
[113] A. Shrivastava et al., "Training region-based object detectors with online hard example mining," in Proc. CVPR, 2016, pp. 761–769.
[114] S. Ren et al., "Object detection networks on convolutional feature maps," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1476–1481, Jul. 2017.
[115] W. Ouyang et al., "Factors in finetuning deep model for object detection with long-tail distribution," in Proc. CVPR, 2016, pp. 864–873.
[116] S. Hong et al. (2016). "PVANet: Lightweight deep neural networks for real-time object detection." [Online]. Available: https://arxiv.org/abs/1611.08588
[117] W. Shang et al., "Understanding and improving convolutional neural networks via concatenated rectified linear units," in Proc. ICML, 2016, pp. 1–9.
[118] C. Szegedy et al., "Deep neural networks for object detection," in Proc. NIPS, 2013, pp. 1–9.
[119] P. O. Pinheiro et al., "Learning to segment object candidates," in Proc. NIPS, 2015, pp. 1990–1998.
[120] C. Szegedy et al. (2014). "Scalable, high-quality object detection." [Online]. Available: https://arxiv.org/abs/1412.1441
[121] M. Everingham et al. (2011). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html
[122] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. ECCV, 2014, pp. 818–833.
[123] S. Xie et al., “Aggregated residual transformations for deep neural networks,” in Proc. CVPR, 2017, pp. 5987–5995.
[124] J. Dai et al. (2017). “Deformable convolutional networks.” [Online]. Available: https://arxiv.org/abs/1703.06211
[125] C. Rother et al., “AutoCollage,” ACM Trans. Graph., vol. 25, no. 3, pp. 847–852, 2006.
[126] C. Jung and C. Kim, “A unified spectral-domain approach for saliency detection and its application to automatic object segmentation,” IEEE Trans. Image Process., vol. 21, no. 3, pp. 1272–1283, Mar. 2012.
[127] W.-C. Tu et al., “Real-time salient object detection with a minimum spanning tree,” in Proc. CVPR, 2016, pp. 2334–2342.
[128] J. Yang and M.-H. Yang, “Top-down visual saliency via joint CRF and dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 576–588, Mar. 2017.
[129] P. L. Rosin, “A simple method for detecting salient regions,” Pattern Recognit., vol. 42, no. 11, pp. 2363–2371, Nov. 2009.
[130] T. Liu et al., “Learning to detect a salient object,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[131] J. Long et al., “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, 2015, pp. 3431–3440.
[132] D. Gao et al., “Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp. 989–1005, Jun. 2009.
[133] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. ICCV, 2015, pp. 1395–1403.
[134] M. Kümmerer et al. (2014). “Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet.” [Online]. Available: https://arxiv.org/abs/1411.1045
[135] X. Huang et al., “SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in Proc. ICCV, 2015, pp. 262–270.
[136] L. Wang et al., “Deep networks for saliency detection via local estimation and global search,” in Proc. CVPR, 2015, pp. 3183–3192.
[137] H. Cholakkal et al. (2016). “Backtracking spatial pyramid pooling (SPP)-based image classifier for weakly supervised top-down salient object detection.” [Online]. Available: https://arxiv.org/abs/1611.05345
[138] R. Zhao et al., “Saliency detection by multi-context deep learning,” in Proc. CVPR, 2015, pp. 1265–1274.
[139] C. Bak et al. (2016). “Spatio-temporal saliency networks for dynamic saliency prediction.” [Online]. Available: https://arxiv.org/abs/1607.04730
[140] S. He et al., “SuperCNN: A superpixelwise convolutional neural network for salient object detection,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 330–344, 2015.
[141] X. Li et al., “DeepSaliency: Multi-task deep neural network model for salient object detection,” IEEE Trans. Image Process., vol. 25, no. 8, pp. 3919–3930, Aug. 2016.
[142] Y. Tang and X. Wu, “Saliency detection via combining region-level and pixel-level predictions with CNNs,” in Proc. ECCV, 2016, pp. 809–825.
[143] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proc. CVPR, 2016, pp. 478–487.
[144] X. Wang et al. (2016). “Edge preserving and multi-scale contextual neural network for salient object detection.” [Online]. Available: https://arxiv.org/abs/1608.08029
[145] M. Cornia et al., “A deep multi-level network for saliency prediction,” in Proc. ICPR, 2016, pp. 3488–3493.
[146] G. Li and Y. Yu, “Visual saliency detection based on multiscale deep CNN features,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5012–5024, Nov. 2016.
[147] J. Pan et al., “Shallow and deep convolutional networks for saliency prediction,” in Proc. CVPR, 2016, pp. 598–606.
[148] J. Kuen et al., “Recurrent attentional networks for saliency detection,” in Proc. CVPR, 2016, pp. 3668–3677.
[149] Y. Tang et al., “Deeply-supervised recurrent convolutional neural network for saliency detection,” in Proc. ACM MM, 2016, pp. 397–401.
[150] X. Li et al., “Contextual hypergraph modeling for salient object detection,” in Proc. ICCV, 2013, pp. 3328–3335.
[151] M.-M. Cheng et al., “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
[152] H. Jiang et al., “Salient object detection: A discriminative regional feature integration approach,” in Proc. CVPR, 2013, pp. 2083–2090.
[153] G. Lee et al., “Deep saliency with encoded low level distance map and high level features,” in Proc. CVPR, 2016, pp. 660–668.
[154] Z. Luo et al., “Non-local deep features for salient object detection,” in Proc. CVPR, 2017, pp. 6593–6601.
[155] Q. Hou et al. (2016). “Deeply supervised salient object detection with short connections.” [Online]. Available: https://arxiv.org/abs/1611.04849
[156] Q. Yan et al., “Hierarchical saliency detection,” in Proc. CVPR, 2013, pp. 1155–1162.
[157] Y. Li et al., “The secrets of salient object segmentation,” in Proc. CVPR, 2014, pp. 280–287.
[158] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in Proc. CVPRW, 2010, pp. 49–56.
[159] A. Borji et al., “Salient object detection: A benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, Dec. 2015.
[160] C. Peng et al., “Graphical representation for heterogeneous face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 301–312, Feb. 2017.
[161] C. Peng et al., “Face recognition from multiple stylistic sketches: Scenarios, datasets, and evaluation,” in Proc. ECCV, 2016, pp. 3–18.
[162] X. Gao et al., “Face sketch–photo synthesis and retrieval using sparse representation,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 8, pp. 1213–1226, Aug. 2012.
[163] N. Wang et al., “A comprehensive survey to face hallucination,” Int. J. Comput. Vis., vol. 106, no. 1, pp. 9–30, 2014.
[164] C. Peng et al., “Multiple representations-based face sketch–photo synthesis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2201–2215, Nov. 2016.
[165] A. Majumder et al., “Automatic facial expression recognition system using deep network-based data fusion,” IEEE Trans. Cybern., vol. 48, no. 1, pp. 103–114, Jan. 2018.
[166] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
[167] J. Yu et al., “UnitBox: An advanced object detection network,” in Proc. ACM MM, 2016, pp. 516–520.
[168] S. S. Farfade et al., “Multi-view face detection using deep convolutional neural networks,” in Proc. ICMR, 2015, pp. 643–650.
[169] S. Yang et al., “From facial parts responses to face detection: A deep learning approach,” in Proc. ICCV, 2015, pp. 3676–3684.
[170] S. Yang et al., “Face detection through scale-friendly deep convolutional networks,” in Proc. CVPR, 2017, pp. 1–12.
[171] Z. Hao et al., “Scale-aware face detection,” in Proc. CVPR, 2017, pp. 1913–1922.
[172] H. Wang et al. (2017). “Face R-CNN.” [Online]. Available: https://arxiv.org/abs/1706.01061
[173] X. Sun et al. (2017). “Face detection using deep learning: An improved faster RCNN approach.” [Online]. Available: https://arxiv.org/abs/1701.08289
[174] L. Huang et al. (2015). “DenseBox: Unifying landmark localization with end to end object detection.” [Online]. Available: https://arxiv.org/abs/1509.04874
[175] Y. Li et al., “Face detection with end-to-end integration of a ConvNet and a 3D model,” in Proc. ECCV, 2016, pp. 420–436.
[176] K. Zhang et al., “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[177] I. A. Kalinovsky and V. G. Spitsyn, “Compact convolutional neural network cascade for face detection,” in Proc. CEUR Workshop, 2016, pp. 375–387.
[178] H. Qin et al., “Joint training of cascaded CNN for face detection,” in Proc. CVPR, 2016, pp. 3456–3465.
[179] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. UM-CS-2010-009, 2010.
[180] H. Li et al., “A convolutional neural network cascade for face detection,” in Proc. CVPR, 2015, pp. 5325–5334.
[181] B. Yang et al., “Aggregate channel features for multi-view face detection,” in Proc. IJCB, 2014, pp. 1–8.
[182] N. Markuš et al. (2013). “Object detection with pixel intensity comparisons organized in decision trees.” [Online]. Available: https://arxiv.org/abs/1305.4537
[183] M. Mathias et al., “Face detection without bells and whistles,” in Proc. ECCV, 2014.
[184] J. Li and Y. Zhang, “Learning SURF cascade for fast and accurate object detection,” in Proc. CVPR, 2013.
[185] S. Liao et al., “A fast and accurate unconstrained face detector,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 211–223, Feb. 2016.
[186] B. Yang et al., “Convolutional channel features,” in Proc. ICCV, 2015, pp. 82–90.
[187] R. Ranjan et al. (2016). “HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition.” [Online]. Available: https://arxiv.org/abs/1603.01249
[188] P. Hu and D. Ramanan, “Finding tiny faces,” in Proc. CVPR, 2017, pp. 1522–1530.
[189] Z. Jiang and D. Q. Huynh, “Multiple pedestrian tracking from monocular videos in an interacting multiple model framework,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1361–1375, Mar. 2018.
[190] D. M. Gavrila and S. Munder, “Multi-cue pedestrian detection and tracking from a moving vehicle,” Int. J. Comput. Vis., vol. 73, no. 1, pp. 41–59, Jun. 2007.
[191] S. Xu et al., “Jointly attentive spatial-temporal pooling networks for video-based person re-identification,” in Proc. ICCV, 2017, pp. 4743–4752.
[192] Z. Liu et al., “Stepwise metric promotion for unsupervised video person re-identification,” in Proc. ICCV, 2017, pp. 2448–2457.
[193] A. Khan et al., “Cooperative robots to observe moving targets: Review,” IEEE Trans. Cybern., vol. 48, no. 1, pp. 187–198, Jan. 2018.
[194] A. Geiger et al., “Vision meets robotics: The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[195] Z. Cai et al., “Learning complexity-aware cascades for deep pedestrian detection,” in Proc. ICCV, 2015, pp. 3361–3369.
[196] Y. Tian et al., “Deep learning strong parts for pedestrian detection,” in Proc. ICCV, 2015, pp. 1904–1912.
[197] P. Dollár et al., “Fast feature pyramids for object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[198] S. Zhang et al., “Filtered channel features for pedestrian detection,” in Proc. CVPR, 2015, pp. 1751–1760.
[199] S. Paisitkriangkrai et al., “Pedestrian detection with spatially pooled features and structured ensemble learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1243–1257, Jun. 2016.
[200] L. Lin et al., “Discriminatively trained And-Or graph models for object shape detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 5, pp. 959–972, May 2015.
[201] M. Mathias et al., “Handling occlusions with Franken-classifiers,” in Proc. ICCV, 2013, pp. 1505–1512.
[202] S. Tang et al., “Detection and tracking of occluded people,” Int. J. Comput. Vis., vol. 110, no. 1, pp. 58–69, 2014.
[203] L. Zhang et al., “Is faster R-CNN doing well for pedestrian detection?” in Proc. ECCV, 2016, pp. 443–457.
[204] Y. Tian et al., “Deep learning strong parts for pedestrian detection,” in Proc. ICCV, 2015.
[205] J. Liu et al. (2016). “Multispectral deep neural networks for pedestrian detection.” [Online]. Available: https://arxiv.org/abs/1611.02644
[206] Y. Tian et al., “Pedestrian detection aided by deep learning semantic tasks,” in Proc. CVPR, 2015, pp. 5079–5087.
[207] X. Du et al., “Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection,” in Proc. WACV, 2017, pp. 953–961.
[208] Q. Hu et al., “Pushing the limits of deep CNNs for pedestrian detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 6, pp. 1358–1368, Jun. 2018.
[209] D. Tomé et al., “Reduced memory region based deep convolutional neural network detection,” in Proc. ICCE, Berlin, Germany, 2016, pp. 15–19.
[210] J. Hosang et al., “Taking a deeper look at pedestrians,” in Proc. CVPR, 2015, pp. 4073–4082.
[211] J. Li et al. (2015). “Scale-aware fast R-CNN for pedestrian detection.” [Online]. Available: https://arxiv.org/abs/1510.08160
[212] Y. Gao et al., “Visual-textual joint relevance learning for tag-based social image search,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 363–376, Jan. 2013.
[213] T. Kong et al., “RON: Reverse connection with objectness prior networks for object detection,” in Proc. CVPR, 2017, pp. 5244–5252.
[214] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
[215] Y. Fang et al., “Object detection meets knowledge graphs,” in Proc. IJCAI, 2017, pp. 1661–1667.
[216] S. Welleck et al., “Saliency-based sequential image attention with multiset prediction,” in Proc. NIPS, 2017, pp. 5173–5183.
[217] S. Azadi et al., “Learning detection with diverse proposals,” in Proc. CVPR, 2017, pp. 7369–7377.
[218] S. Sukhbaatar et al., “End-to-end memory networks,” in Proc. NIPS, 2015, pp. 2440–2448.
[219] P. Dabkowski and Y. Gal, “Real time image saliency for black box classifiers,” in Proc. NIPS, 2017, pp. 6967–6976.
[220] B. Yang et al., “CRAFT objects from images,” in Proc. CVPR, 2016, pp. 6043–6051.
[221] I. Croitoru et al., “Unsupervised learning from video to detect foreground objects in single images,” in Proc. ICCV, 2017, pp. 4345–4353.
[222] C. Wang et al., “Weakly supervised object localization with latent category learning,” in Proc. ECCV, 2014, pp. 431–445.
[223] D. P. Papadopoulos et al., “Training object class detectors with click supervision,” in Proc. CVPR, 2017, pp. 180–189.
[224] J. Huang et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in Proc. CVPR, 2017, pp. 3296–3297.
[225] Q. Li et al., “Mimicking very efficient network for object detection,” in Proc. CVPR, 2017, pp. 7341–7349.
[226] G. Hinton et al., “Distilling the knowledge in a neural network,” Comput. Sci., vol. 14, no. 7, pp. 38–39, 2015.
[227] A. Romero et al., “FitNets: Hints for thin deep nets,” in Proc. ICLR, 2015, pp. 1–13.
[228] X. Chen et al., “3D object proposals for accurate object class detection,” in Proc. NIPS, 2015, pp. 424–432.
[229] J. Dong et al., “Visual-inertial-semantic scene representation for 3D object detection,” in Proc. CVPR, 2017, pp. 960–970.
[230] K. Kang et al., “Object detection in videos with tubelet proposal networks,” in Proc. CVPR, 2017, pp. 889–897.

Zhong-Qiu Zhao (M’10) received the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China, Hefei, China, in 2007.
From 2008 to 2009, he held a post-doctoral position in image processing with the CNRS UMR6168 Lab. Sciences de l’Information et des Systèmes, La Garde, France. From 2013 to 2014, he was a Research Fellow in image processing with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. He is currently a Professor with the Hefei University of Technology, Hefei. His current research interests include pattern recognition, image processing, and computer vision.

Peng Zheng received the bachelor’s degree from the Hefei University of Technology, Hefei, China, in 2010, where he is currently pursuing the Ph.D. degree.
His current research interests include pattern recognition, image processing, and computer vision.

Shou-Tao Xu is currently pursuing the master’s degree with the Hefei University of Technology, Hefei, China.
His current research interests include pattern recognition, image processing, deep learning, and computer vision.

Xindong Wu (F’11) received the Ph.D. degree in artificial intelligence from The University of Edinburgh, Edinburgh, U.K.
He is currently an Alfred and Helen Lamson Endowed Professor of computer science with the University of Louisiana at Lafayette, Lafayette, LA, USA. His current research interests include data mining, knowledge-based systems, and Web information exploration.
Dr. Wu is a Fellow of the AAAS. He is the Steering Committee Chair of the IEEE International Conference on Data Mining. He served as the Editor-in-Chief for the IEEE Transactions on Knowledge and Data Engineering (IEEE Computer Society) between 2005 and 2008.