weakly supervised object segmentation cues and region-based object detection into a multistage architecture to fully exploit the learned segmentation features.

Multiscale representation combines activations from multiple layers with skip-layer connections to provide semantic information of different spatial resolutions [66]. Cai et al. [105] proposed the multiscale CNN (MS-CNN) to ease the inconsistency between the sizes of objects and receptive fields with multiple scale-independent output layers. Yang et al. [34] investigated two strategies, namely, scale-dependent pooling (SDP) and layerwise cascaded rejection classifiers (CRCs), to exploit appropriate scale-dependent conv features. Kong et al. [101] proposed the HyperNet to calculate the shared features between the RPN and the object detection network by aggregating and compressing hierarchical feature maps from different resolutions into a uniform space.

Contextual modeling improves detection performance by exploiting features from or around RoIs of different support regions and resolutions to deal with occlusions and local similarities [95]. Zhu et al. [106] proposed the SegDeepM to exploit object segmentation, which reduces the dependency on initial candidate boxes with a Markov random field. Moysset et al. [108] took advantage of four directional 2-D long short-term memories (LSTMs) [107] to convey global context between different local regions and reduced trainable parameters with local parameter sharing. Zeng et al. [109] proposed a novel gated bidirectional-net (GBD-Net) by introducing gated functions to control message transmission between different support regions.

The combination incorporates the different components above into the same model to improve detection performance further. Gidaris and Komodakis [110] proposed the multiregion CNN (MR-CNN) model to capture different aspects of an object, the distinct appearances of various object parts, and semantic segmentation-aware features. To obtain contextual and multiscale representations, Bell et al. [95] proposed the inside–outside net (ION) by exploiting information both inside and outside the RoI with spatial recurrent neural networks [111] and skip pooling [101]. Zagoruyko et al. [112] proposed the MultiPath architecture by introducing three modifications to Fast R-CNN, including multiscale skip connections [95], a modified foveal structure [110], and a novel loss function summing different intersection over union (IoU) losses.

9) Thinking in Deep Learning-Based Object Detection: Apart from the above-mentioned approaches, there are still many important factors for continued progress.

There is a large imbalance between the number of annotated objects and background examples. To address this problem, Shrivastava et al. [113] proposed an effective online hard example mining algorithm (OHEM) for automatic selection of hard examples, which leads to more effective and efficient training.

Instead of concentrating on feature extraction, Ren et al. [114] made a detailed analysis of object classifiers and found that it is of particular importance for object detection to carefully construct a deep and convolutional per-region classifier, especially for ResNets [47] and GoogLeNets [45].

The traditional CNN framework for object detection is not skilled in handling significant scale variation, occlusion, or truncation, especially when only 2-D object detection is involved. To address this problem, Xiang et al. [60] proposed a novel subcategory-aware RPN, which guides the generation of region proposals with subcategory information related to object poses and jointly optimizes object detection and subcategory classification.

Ouyang et al. [115] found that the samples from different classes follow a long-tailed distribution, which indicates that different classes with distinct numbers of samples have different degrees of impact on feature learning. To this end, objects are first clustered into visually similar class groups, and then a hierarchical feature learning scheme is adopted to learn deep representations for each group separately.

In order to minimize the computational cost and achieve state-of-the-art performance, with the “deep and thin” design principle and following the pipeline of Fast R-CNN, Hong et al. [116] proposed the PVANET architecture, which adopts building blocks including concatenated ReLU [117], Inception [45], and HyperNet [101] to reduce the expense of multiscale feature extraction, and trains the network with BN [43], residual connections [47], and learning rate scheduling based on plateau detection [47]. PVANET achieves state-of-the-art performance and can be processed in real time on a Titan X GPU (21 fps).

B. Regression/Classification-Based Framework

Region proposal-based frameworks are composed of several correlated stages, including region proposal generation, feature extraction with CNN, classification, and bounding box regression, which are usually trained separately. Even in the recent end-to-end module Faster R-CNN, alternating training is still required to obtain shared convolution parameters between the RPN and the detection network. As a result, the time spent in handling the different components becomes the bottleneck in real-time applications.

One-step frameworks based on global regression/classification, mapping directly from image pixels to bounding box coordinates and class probabilities, can reduce this time expense. We first review some pioneer CNN models and then focus on two significant frameworks, namely, YOLO [18] and SSD [71].

1) Pioneer Works: Prior to YOLO and SSD, many researchers had already tried to model object detection as a regression or classification task.

Szegedy et al. [118] formulated the object detection task as a DNN-based regression, generating a binary mask for the test image and extracting detections with a simple bounding box inference. However, the model has difficulty in handling overlapping objects, and BBs generated by direct upsampling are far from perfect.

Pinheiro et al. [119] proposed a CNN model with two branches: one generates class-agnostic segmentation masks and the other predicts the likelihood of a given patch being centered on an object. Inference is efficient since class scores and segmentation can be obtained in a single model with most of the CNN operations shared.
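The shared-trunk, two-branch design described above lends itself to a compact illustration. The following PyTorch sketch is only a hedged approximation of that idea under assumed layer sizes and names (TwoBranchProposalNet, mask_size, and the trunk layout are illustrative choices, not taken from [119]); it shows how one forward pass can yield both a class-agnostic mask and an objectness score from the same features.

```python
import torch
import torch.nn as nn

class TwoBranchProposalNet(nn.Module):
    """Minimal sketch of a shared-trunk model with a class-agnostic mask
    branch and an objectness branch (all sizes are illustrative)."""
    def __init__(self, in_channels=3, mask_size=56):
        super().__init__()
        # Shared convolutional trunk; its activations feed both branches.
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((14, 14)),
        )
        # Branch 1: class-agnostic segmentation mask for the patch.
        self.mask_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, mask_size * mask_size),
        )
        # Branch 2: likelihood that the patch is centered on an object.
        self.score_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, 1),
        )
        self.mask_size = mask_size

    def forward(self, x):
        feat = self.trunk(x)                 # shared features
        mask = self.mask_head(feat)          # (N, mask_size*mask_size) logits
        mask = mask.view(-1, 1, self.mask_size, self.mask_size)
        score = self.score_head(feat)        # (N, 1) objectness logit
        return mask, score

if __name__ == "__main__":
    net = TwoBranchProposalNet()
    masks, scores = net(torch.randn(2, 3, 224, 224))
    print(masks.shape, scores.shape)  # torch.Size([2, 1, 56, 56]) torch.Size([2, 1])
```

Because both heads consume the same trunk activations, nearly all convolutional computation is shared, which is what makes this style of inference efficient.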
Erhan et al. [68] and Szegedy et al. [120] proposed the regression-based MultiBox to produce scored class-agnostic region proposals.

Fig. 9. Main idea of YOLO [18].

$$
\begin{aligned}
\text{Loss} = {}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 \qquad (1)
\end{aligned}
$$
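Under the assumption that (1) is the standard multipart squared-error objective reproduced above, its terms can be mirrored almost one to one in code. The sketch below is illustrative only: the tensor layout, the obj_mask argument, and the choice to apply the class term per box rather than per grid cell are simplifying assumptions, not a reference implementation.

```python
import torch

def yolo_style_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the multipart squared-error objective in (1).
    pred, target: (N, S*S, B, 5 + C) tensors holding x, y, w, h, confidence,
    and class probabilities; obj_mask: (N, S*S, B) with 1 where a box is
    responsible for an object. Shapes and names are illustrative."""
    noobj_mask = 1.0 - obj_mask
    # Localization terms (centers, plus square-rooted widths and heights).
    xy_err = ((pred[..., :2] - target[..., :2]) ** 2).sum(-1)
    wh_err = ((pred[..., 2:4].clamp(min=0).sqrt()
               - target[..., 2:4].clamp(min=0).sqrt()) ** 2).sum(-1)
    loc = lambda_coord * (obj_mask * (xy_err + wh_err)).sum()
    # Confidence terms for responsible and non-responsible boxes.
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    conf = (obj_mask * conf_err).sum() + lambda_noobj * (noobj_mask * conf_err).sum()
    # Classification term, here simplified to boxes that contain an object
    # (the original formulation sums it once per grid cell).
    cls_err = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(-1)
    cls = (obj_mask * cls_err).sum()
    return loc + conf + cls
```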
TABLE I
OVERVIEW OF PROMINENT GENERIC OBJECT DETECTION ARCHITECTURES
Fig. 10. Architecture of SSD 300 [71]. SSD adds several feature layers to the end of VGG16 backbone network to predict the offsets to default anchor
boxes and their associated confidences. Final detection results are obtained by conducting NMS on multiscale refined BBs.
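The post-processing summarized in the caption of Fig. 10, applying predicted offsets to default anchor boxes and then suppressing duplicates with NMS, can be sketched as follows. This snippet assumes a center-size (cx, cy, w, h) anchor parameterization and torchvision's nms; variance scaling, per-class handling, and top-k truncation used in full SSD pipelines are omitted for brevity.

```python
import torch
from torchvision.ops import nms

def decode_and_nms(anchors, loc_offsets, scores, iou_thresh=0.45, score_thresh=0.01):
    """Sketch: turn per-anchor offsets and confidences into final boxes.
    anchors, loc_offsets: (A, 4) tensors in (cx, cy, w, h) form;
    scores: (A,) confidences for a single class. Parameterization is an
    assumption following the common SSD-style convention."""
    # Apply predicted offsets to the default anchors (center-size form).
    cxcy = anchors[:, :2] + loc_offsets[:, :2] * anchors[:, 2:]
    wh = anchors[:, 2:] * torch.exp(loc_offsets[:, 2:])
    # Convert to corner form (x1, y1, x2, y2) for NMS.
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    # Suppress heavily overlapping boxes, keeping the highest-scoring ones.
    kept = nms(boxes, scores, iou_thresh)
    return boxes[kept], scores[kept]
```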
SSD is trained with a weighted sum of a localization loss (e.g., Smooth L1) and a confidence loss (e.g., Softmax), which is similar to (1). Final detection results are obtained by conducting NMS on multiscale refined BBs.

Integrating hard negative mining, data augmentation, and a larger number of carefully chosen default anchors, SSD significantly outperforms Faster R-CNN in terms of accuracy on PASCAL VOC and COCO while being three times faster. The SSD300 (input image size is 300 × 300) runs at 59 fps, which is more accurate and efficient than YOLO. However, SSD is not skilled at dealing with small objects, which can be relieved by adopting a better feature extractor backbone (e.g., ResNet101), adding deconvolution layers with skip connections to introduce additional large-scale context [73], and designing a better network structure (e.g., stem block and dense block) [74].

C. Experimental Evaluation

We compare various object detection methods on three benchmark data sets, including PASCAL VOC 2007 [25], PASCAL VOC 2012 [121], and Microsoft COCO [94]. The evaluated approaches include R-CNN [15], SPP-net [64], Fast R-CNN [16], networks on convolutional feature maps (NOC) [114], Bayes [85], MR-CNN&S-CNN [105], Faster R-CNN [17], HyperNet [101], ION [95], MS-GR [104], StuffNet [100], SSD300 [71], SSD512 [71], OHEM [113], SDP+CRC [34], G-CNN [70], SubCNN [60], GBD-Net [109], PVANET [116], YOLO [18], YOLOv2 [72], R-FCN [65], FPN [66], Mask R-CNN [67], DSSD [73], and DSOD [74]. If no specific instructions for the adopted framework are provided, the utilized model is a VGG16 [46] pretrained on the 1000-way ImageNet classification task [39]. Due to the limitation of the paper length, we only provide an overview, including proposal, learning method, loss function, programming language, and platform, of the prominent architectures in Table I. Detailed experimental settings, which can be found in the original papers, are omitted. In addition to the comparisons of detection accuracy, another comparison is provided to evaluate their test consumption on PASCAL VOC 2007.

TABLE II
COMPARATIVE RESULTS ON VOC 2007 TEST SET (%)

TABLE III
COMPARATIVE RESULTS ON VOC 2012 TEST SET (%)

1) PASCAL VOC 2007/2012: The PASCAL VOC 2007 and 2012 data sets consist of 20 categories. The evaluation terms are AP in each single category and mAP across all 20 categories. Comparative results are exhibited in Tables II and III, from which the following remarks can be obtained.

1) If incorporated in a proper way, more powerful backbone CNN models can definitely improve object detection performance (the comparison among R-CNN with AlexNet, R-CNN with VGG16, and SPP-net with ZF-Net [122]).
2) With the introduction of the SPP layer (SPP-net), end-to-end multitask architecture (FRCN), and RPN (Faster R-CNN), object detection performance is improved gradually and apparently.
3) Due to the large number of trainable parameters, in order to obtain multilevel robust features, data augmentation is very important for deep learning-based models (Faster R-CNN with “07,” “07 + 12,” and “07 + 12 + coco”).
4) Apart from basic models, there are still many other factors affecting object detection performance, such as multiscale and multiregion feature extraction (e.g., MR-CNN), modified classification networks (e.g., NOC), additional information from other correlated tasks (e.g., StuffNet, HyperNet), multiscale representation (e.g., ION), and mining of hard negative samples (e.g., OHEM).
5) As YOLO is not skilled in producing object localizations of high IoU, it obtains a very poor result on VOC 2012. However, with the complementary information from Fast R-CNN (YOLO+FRCN) and the aid of other strategies, such as anchor boxes, BN, and fine-grained features, the localization errors are corrected (YOLOv2).
6) By combining many recent tricks and modeling the whole network as a fully convolutional one, R-FCN
achieves a more obvious improvement of detection performance over other approaches.

TABLE IV
COMPARATIVE RESULTS ON MICROSOFT COCO TEST DEV SET (%)

2) Microsoft COCO: Microsoft COCO is composed of 300 000 fully segmented images, in which each image has an average of 7 object instances from a total of 80 categories. As there are a lot of less iconic objects with a broad range of scales and a stricter requirement on object localization, this data set is more challenging than PASCAL 2012. Object detection performance is evaluated by AP computed under different degrees of IoU and on different object sizes. The results are given in Table IV.

Besides similar remarks to those of PASCAL VOC, some other conclusions can be drawn as follows from Table IV.

1) Multiscale training and test are beneficial in improving object detection performance, as they provide additional information in different resolutions (R-FCN). FPN and DSSD provide some better ways to build feature pyramids to achieve multiscale representation. The complementary information from other related tasks is also helpful for accurate object localization (Mask R-CNN with instance segmentation task).
2) Overall, region proposal-based methods, such as Faster R-CNN and R-FCN, perform better than regression/classification-based approaches, namely, YOLO and SSD, due to the fact that quite a lot of localization errors are produced by regression/classification-based approaches.
3) Context modeling is helpful to locate small objects, which provides additional information by consulting nearby objects and surroundings (GBD-Net and MultiPath).
4) Due to the existence of a large number of nonstandard small objects, the results on this data set are much worse than those of VOC 2007/2012. With the introduction of other powerful frameworks (e.g., ResNeXt [123]) and useful strategies (e.g., multitask learning [67], [124]), the performance can be improved.
5) The success of DSOD in training from scratch stresses the importance of the network design to release the requirements for perfect pretrained classifiers on relevant tasks and a large number of annotated samples.

TABLE V
COMPARISON OF TESTING CONSUMPTION ON VOC 07 TEST SET

3) Timing Analysis: Timing analysis (Table V) is conducted on an Intel i7-6700K CPU with a single core and an NVIDIA Titan X GPU. Except for “SS,” which is processed with the CPU, the other procedures related to CNN are all evaluated on the GPU. From Table V, we can draw some conclusions as follows.

1) By computing CNN features on shared feature maps (SPP-net), test consumption is reduced largely. Test time is further reduced with the unified multitask learning (FRCN) and removal of the additional region proposal generation stage (Faster R-CNN). It is also helpful to compress the parameters of FC layers with SVD [91] (PVANET and FRCN).
2) It takes additional test time to extract multiscale features and contextual information (ION and MR-RCNN&S-RCNN).
3) It takes more time to train a more complex and deeper network (ResNet101 against VGG16), and this time consumption can be reduced by adding as many layers into shared fully convolutional layers as possible (FRCN).
4) Regression-based models can usually be processed in real time at the cost of a drop in accuracy compared with region proposal-based models. Also, region proposal-based models can be modified into real-time systems with the introduction of other tricks [116] (PVANET), such as BN [43] and residual connections [123].

IV. SALIENT OBJECT DETECTION

Visual saliency detection, one of the most important and challenging tasks in computer vision, aims to highlight the most dominant object regions in an image. Numerous applications incorporate visual saliency to improve their performance, such as image cropping [125] and segmentation [126], image retrieval [57], and object detection [66].

Broadly, there are two branches of approaches in salient object detection, namely, BU [127] and TD [128]. Local feature contrast plays the central role in BU salient object detection, regardless of the semantic contents of the scene. To learn local feature contrast, various local and global features are extracted from pixels, e.g., edges [129] and spatial information [130]. However, high-level and multiscale semantic information cannot be explored with these low-level features. As a result, low-contrast salient maps instead of salient objects are obtained. TD salient object detection is task-oriented and takes prior knowledge about object categories to guide the generation of salient maps. Taking semantic segmentation as an example, a saliency map is generated in the segmentation to assign pixels to particular object categories via a TD approach [131]. In a word, TD saliency can be viewed as a focus-of-attention mechanism, which prunes BU salient points that are unlikely to be parts of the object [132].

A. Deep Learning in Salient Object Detection

Due to its significance for providing high-level and multiscale feature representation and its successful applications in many correlated computer vision tasks, such as semantic segmentation [131], edge detection [133], and generic object detection [16], it is feasible and necessary to extend CNN to salient object detection.

The early work by Vig et al. [29] follows a completely automatic data-driven approach to perform a large-scale search for optimal features, namely, an ensemble of deep networks with different layers and parameters. To address the problem of limited training data, Kummerer et al. [134] proposed Deep Gaze by transferring from AlexNet to generate a high-dimensional feature space and create a saliency map. A similar architecture was proposed by Huang et al. [135] to integrate saliency prediction into pretrained object recognition DNNs. The transfer is accomplished by fine-tuning the DNNs’ weights with an objective function based on saliency evaluation metrics, such as similarity, KL-divergence, and normalized scanpath saliency.

Some works combined local and global visual clues to improve salient object detection performance. Wang et al. [136] trained two independent deep CNNs (DNN-L and DNN-G) to capture local information and global contrast and predicted saliency maps by integrating both local estimation and global search. Cholakkal et al. [137] proposed a weakly supervised saliency detection framework to combine visual saliency from BU and TD saliency maps and refined the results with multiscale superpixel-averaging. Zhao et al. [138] proposed a multicontext deep learning framework, which utilizes a unified learning framework to model global and local context jointly with the aid of superpixel segmentation. To predict saliency in videos, Bak et al. [139] fused two static saliency models, namely, spatial stream net and temporal stream net, into a two-stream framework with a novel empirically grounded data augmentation technique.

Complementary information from semantic segmentation and context modeling is beneficial. To learn internal representations of saliency efficiently, He et al. [140] proposed a novel superpixelwise CNN approach called SuperCNN, in which salient object detection is formulated as a binary labeling problem. Based on a fully CNN, Li et al. [141] proposed a multitask deep saliency model, in which intrinsic correlations between saliency detection and semantic segmentation are set up. However, due to the conv layers with large receptive fields and pooling layers, blurry object boundaries and coarse saliency maps are produced. Tang and Wu [142] proposed a novel saliency detection framework (CRPSD), which combines region-level saliency estimation and pixel-level saliency prediction together with three closely related CNNs. Li and Yu [143] proposed a deep contrast network to combine segmentwise spatial pooling and pixel-level fully convolutional streams.

The proper integration of multiscale feature maps is also of significance for improving detection performance. Based on Fast R-CNN, Wang et al. [144] proposed the RegionNet by performing salient object detection with end-to-end edge preserving and multiscale contextual modeling. Liu et al. [28] proposed a multiresolution CNN (Mr-CNN) to predict eye fixations, which is achieved by learning both BU visual saliency and TD visual factors from raw image data simultaneously. Cornia et al. [145] proposed an architecture that combines features extracted at different levels of the CNN. Li and Yu [146] proposed a multiscale deep CNN framework to extract three scales of deep contrast features, namely, the mean-subtracted region, the bounding box of its immediate neighboring regions, and the masked entire image, from each candidate region.

It is efficient and accurate to train a direct pixelwise CNN architecture to predict salient objects with the aid of recurrent neural networks and deconvolution networks. Pan et al. [147] formulated saliency prediction as a minimization of the Euclidean distance between the predicted saliency map and the ground truth and proposed two kinds of architectures: a shallow one trained from scratch and a deeper one adapted from a deconvoluted VGG network. As convolutional–deconvolution networks are not expert in recognizing objects of multiple scales, Kuen et al. [148] proposed a recurrent attentional convolutional–deconvolution network with several spatial transformer and recurrent network units to conquer this problem. To fuse local, global, and contextual information of salient objects, Tang et al. [149] developed a deeply supervised recurrent CNN to perform full image-to-image saliency detection.

B. Experimental Evaluation

Four representative data sets, including the Evaluation on Complex Scene Saliency Dataset (ECSSD) [156], HKU-IS [146], PASCALS [157], and SOD [158], are used to evaluate several state-of-the-art methods. ECSSD consists of 1000 structurally complex but semantically meaningful natural images. HKU-IS is a large-scale data set containing over 4000 challenging images. Most of these images have more than one salient object and own low contrast. PASCALS is a subset chosen from the validation set of the PASCAL VOC 2010 segmentation data set and is composed of 850 natural images. The SOD data set possesses 300 images containing multiple salient objects. The training and validation sets for the different data sets are kept the same as those in [152].

Two standard metrics, namely, F-measure and the mean absolute error (MAE), are utilized to evaluate the quality of a saliency map. Given precision and recall values precomputed on the union of the generated binary mask B and ground truth Z, F-measure is defined as follows:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \qquad (7)$$

where β² is set to 0.3 in order to stress the importance of the precision value.

The MAE score is computed with the following equation:

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| \hat{S}(i,j) - \hat{Z}(i,j) \right| \qquad (8)$$

where Ẑ and Ŝ represent the ground truth and the continuous saliency map, respectively. W and H are the width and height of the salient area, respectively. This score stresses the importance of successfully detected salient objects over detected nonsalient pixels [159].

TABLE VI
COMPARISON BETWEEN STATE-OF-THE-ART METHODS

The following approaches are evaluated: contextual hypergraph modeling (CHM) [150], RC [151], discriminative
manner. In the framework, two issues are addressed to transfer from generic object detection to face detection, namely, eliminating predefined anchor boxes by a 3-D mean face model and replacing the RoI pooling layer with a configuration pooling layer. Zhang et al. [176] proposed a deep cascaded multitask framework named multitask cascaded convolutional networks (MTCNN), which exploits the inherent correlations between face detection and alignment in the unconstrained environment to boost detection performance in a coarse-to-fine manner.

Reducing computational expenses is of necessity in real applications. To achieve real-time detection on the mobile platform, Kalinovskii and Spitsyn [177] proposed a new solution for frontal face detection based on compact CNN cascades. This method takes a cascade of three simple CNNs to generate, classify, and refine candidate object positions progressively. To reduce the effects of large pose variations, Chen et al. [32] proposed a cascaded CNN denoted supervised transformer network. This network takes a multitask RPN to predict candidate face regions along with associated facial landmarks simultaneously and adopts a generic R-CNN to verify the existence of valid faces. Yang and Nevatia [8] proposed a novel three-stage cascade structure based on FCNs, while in each stage, a multiscale FCN is utilized to refine the positions of possible faces. Qin et al. [178] proposed a unified framework that achieves better results with the complementary information from different jointly trained CNNs.

B. Experimental Evaluation

The FDDB [179] data set has a total of 2845 pictures in which 5171 faces are annotated with an elliptical shape. Two types of evaluations are used: the discrete score and the continuous score. By varying the threshold of the decision rule, the receiver operating characteristic (ROC) curve for the discrete scores can reflect the dependence of the detected face fractions on the number of false alarms. Compared with annotations, any detection with an IoU ratio exceeding 0.5 is treated as positive. Each annotation is only associated with one detection. The ROC curve for the continuous scores is the reflection of face localization quality.

The evaluated models cover DDFD [168], CascadeCNN [180], aggregate channel features (ACF)-multiscale [181], Pico [182], HeadHunter [183], Joint Cascade [31], SURF-multiview [184], Viola–Jones [166], NPDFace [185], Faceness [169], convolutional channel features (CCF) [186], MTCNN [176], Conv3-D [175], Hyperface [187], UnitBox [167], locally decorrelated channel features (LDCF+) [S2], DeepIR [173], hybrid-resolution model with elliptical regressor (HR-ER) [188], Face-R-CNN [172], and ScaleFace [170]. ACF-multiscale, Pico, HeadHunter, Joint Cascade, SURF-multiview, Viola–Jones, NPDFace, and LDCF+ are built on classic hand-crafted features while the rest of the methods are based on deep CNN features. The ROC curves are shown in Fig. 11.

In Fig. 11(a), in spite of relatively competitive results produced by LDCF+, it can be observed that most of the classic methods perform with similar results and are outperformed by CNN-based methods by a significant margin. In Fig. 11(b), it can be observed that most of the CNN-based methods earn similar true positive rates between 60% and 70% while DeepIR and HR-ER perform much better than them. Among classic methods, Joint Cascade is still competitive. As earlier works, DDFD and CCF directly make use of generated feature maps and obtain relatively poor results. CascadeCNN builds cascaded CNNs to locate face regions, which is efficient but inaccurate. Faceness combines the decisions from different part detectors, resulting in precise face localizations while being time-consuming. The outstanding performance of MTCNN, Conv3-D, and Hyperface proves the effectiveness of multitask learning. HR-ER and ScaleFace adaptively detect faces of different scales and make a balance between accuracy and efficiency. DeepIR and Face-R-CNN are two extensions of the Faster R-CNN architecture to face detection, which validate the significance and effectiveness of Faster R-CNN. UnitBox provides an alternative choice for performance improvements by carefully designing the optimization loss.

From these results, we can draw the conclusion that CNN-based methods are in the leading position. The performance can be improved by the following strategies: designing the optimization loss, modifying generic detection pipelines, building meaningful network cascades, adapting scale-aware detection, and learning multitask shared CNN features.

VI. PEDESTRIAN DETECTION

Recently, pedestrian detection has been intensively studied, which has a close relationship to pedestrian tracking [189], [190], person reidentification [191], [192], and robot navigation [193], [194]. Prior to the recent progress in deep CNN (DCNN)-based methods [195], [196], some researchers combined boosted decision forests with hand-crafted features to obtain pedestrian detectors [197]–[199]. At the same time, to explicitly model the deformation and occlusion, part-based models [200] and explicit occlusion handling [201], [202] are of concern.

As there are many pedestrian instances of small sizes in typical scenarios of pedestrian detection (e.g., automatic driving and intelligent surveillance), the application of the RoI pooling layer in the generic object detection pipeline may result in “plain” features due to collapsing bins. In the meantime, the main source of false predictions in pedestrian detection is the confusion with hard background instances, which is in contrast to the interference from multiple categories in generic object detection. As a result, different configurations and components are required to accomplish accurate pedestrian detection.

A. Deep Learning in Pedestrian Detection

Although DCNNs have obtained excellent performance on generic object detection [16], [72], none of these approaches had achieved better results than the best hand-crafted feature-based method [198] for a long time, even when part-based information and occlusion handling are incorporated [202]. Thereby, some studies have been conducted to analyze the reasons. Zhang et al. [203] attempted to adapt generic Faster R-CNN [17] to pedestrian detection. They modified the downstream classifier by adding boosted forests to shared, high-resolution conv feature maps and taking an RPN to handle small instances and hard negative examples. To deal with
complex occlusions existing in pedestrian images, inspired by DPM [24], Tian et al. [204] proposed a deep learning framework called DeepParts, which makes decisions based on an ensemble of extensive part detectors. DeepParts has advantages in dealing with weakly labeled data, low IoU positive proposals, and partial occlusion.

Other researchers also tried to combine complementary information from multiple data sources. CompACT-Deep adopts a complexity-aware cascade to combine hand-crafted features and fine-tuned DCNNs [195]. Based on Faster R-CNN, Liu et al. [205] proposed multispectral DNNs for pedestrian detection to combine complementary information from color and thermal images. Tian et al. [206] proposed a task-assistant CNN to jointly learn multiple tasks with multiple data sources and to combine pedestrian attributes with semantic scene attributes together. Du et al. [207] proposed a DNN fusion architecture for fast and robust pedestrian detection. Based on the candidate BBs generated with SSD detectors [71], multiple binary classifiers are processed in parallel to conduct soft-rejection-based network fusion by consulting their aggregated degree of confidences.

However, most of these approaches are much more sophisticated than the standard R-CNN framework. CompACT-Deep consists of a variety of hand-crafted features, a small CNN model, and a large VGG16 model [195]. DeepParts contains 45 fine-tuned DCNN models, and a set of strategies, including bounding box shifting handling and part selection, are required to arrive at the reported results [204]. Therefore, modification and simplification are of significance to reduce the burden on both software and hardware to satisfy real-time detection demands. Tome et al. [59] proposed a novel solution to adapt the generic object detection pipeline to pedestrian detection by optimizing most of its stages. Hu et al. [208] trained an ensemble of boosted decision models by reusing the conv feature maps, and a further improvement was gained with simple pixel labeling and additional complementary hand-crafted features. Tome et al. [209] proposed a reduced memory region-based deep CNN architecture, which fuses regional responses from both ACF detectors and SVM classifiers into R-CNN. Ribeiro et al. [33] addressed the problem of human-aware navigation and proposed a vision-based person tracking system guided by multiple camera sensors.

B. Experimental Evaluation

The evaluation is conducted on the most popular Caltech Pedestrian data set [3]. The data set was collected from the videos of a vehicle driving through an urban environment and consists of 250 000 frames with about 2300 unique pedestrians and 350 000 annotated BBs. Three kinds of labels, namely, “Person (clear identifications),” “Person? (unclear identifications),” and “People (large group of individuals),” are assigned to different BBs. The performance is measured with the log-average miss rate (L-AMR), which is computed evenly spaced in log-space in the range 10⁻² to 1 by averaging the miss rate at the rate of nine false positives per image [3]. According to the differences in the height and visible part of the BBs, a total of nine popular settings are adopted to evaluate different properties of these models. Details of these settings are as in [3].

TABLE VII
DETAILED BREAKDOWN PERFORMANCE COMPARISONS OF STATE-OF-THE-ART MODELS ON CALTECH PEDESTRIAN DATA SET. ALL NUMBERS ARE REPORTED IN L-AMR

Evaluated methods include Checkerboards+ [198], LDCF++ [S2], SCF+AlexNet [210], SA-FastRCNN [211], MS-CNN [105], DeepParts [204], CompACT-Deep [195], RPN+BF [203], and F-DNN+SS [207]. The first two methods are based on hand-crafted features while the rest rely on deep CNN features. All results are exhibited in Table VII. From this table, we observe that, different from other tasks, classic handcrafted features can still earn competitive results with boosted decision forests [203], ACF [197], and HOG+LUV channels [S2]. As an early attempt to adapt CNN to pedestrian detection, the features generated by SCF+AlexNet are not so discriminant and produce relatively poor results. Based on multiple CNNs, DeepParts and CompACT-Deep accomplish detection tasks via different strategies, namely, local part integration and cascade network. The responses from different local part detectors make DeepParts robust to partial occlusions. However, due to its complexity, it is too time-consuming to achieve real-time detection. The multiscale representation of MS-CNN improves the accuracy of pedestrian locations. SA-FastRCNN extends Fast R-CNN to automatically detect pedestrians according to their different scales, but has trouble when there are partial occlusions. RPN+BF combines the detectors produced by Faster R-CNN with a boosted decision forest to accurately locate different pedestrians. F-DNN+SS, which is composed of multiple parallel classifiers with soft rejections, performs the best, followed by RPN+BF, SA-FastRCNN, and MS-CNN.

In short, CNN-based methods can provide more accurate candidate boxes and multilevel semantic information for identifying and locating pedestrians. Meanwhile, handcrafted features are complementary and can be combined with CNN to achieve better results. Improvements over existing CNN methods can be obtained by carefully designing the framework and classifiers, extracting multiscale and part-based semantic information, and searching for complementary information from other related tasks, such as segmentation.

VII. PROMISING FUTURE DIRECTIONS AND TASKS

In spite of the rapid development and promising progress of object detection, there are still many open issues for future work.

The first one is small object detection, such as occurs in the COCO data set and in the face detection task. To improve localization accuracy on small objects under partial occlusions, it is necessary to modify network architectures from the following aspects.

1) Multitask Joint Optimization and Multimodal Information Fusion: Due to the correlations between different tasks within and outside object detection, multitask joint optimization has already been studied by many researchers [16], [17]. However, apart from the tasks mentioned in Section III-A8, it is desirable to think over the characteristics of different subtasks of object detection (e.g., superpixel semantic segmentation in salient object detection) and extend multitask optimization to other applications such as instance segmentation [66], multiobject tracking [202], and multiperson pose estimation [S4]. In addition, given a specific application, the information from different modalities, such as text [212], thermal data [205], and images [65], can be fused together to achieve a more discriminant network.

2) Scale Adaption: Objects usually exist in different scales, which is more apparent in face detection and pedestrian detection. To increase the robustness to scale changes, it is demanded to train scale-invariant, multiscale, or scale-adaptive detectors. For scale-invariant detectors, more powerful backbone architectures (e.g., ResNeXt [123]), negative sample mining [113], reverse connection [213], and subcategory modeling [60] are all beneficial. For multiscale detectors, both the FPN [66], which produces multiscale feature maps, and the generative adversarial network [214], which narrows representation differences between small objects and large ones with a low-cost architecture, provide insights into generating a meaningful feature pyramid. For scale-adaptive detectors, it is useful to combine a knowledge graph [215], attentional mechanism [216], cascade network [180], and scale distribution estimation [171] to detect objects adaptively.

3) Spatial Correlations and Contextual Modeling: Spatial distribution plays an important role in object detection. Therefore, region proposal generation and grid regression are taken to obtain probable object locations. However, the correlations between multiple proposals and object categories are ignored. In addition, the global structure information is abandoned by the position-sensitive score maps in R-FCN. To solve these problems, we can refer to diverse subset selection [217] and sequential reasoning tasks [218] for possible solutions. It is also meaningful to mask salient parts and couple them with the global structure in a joint-learning manner [219].

The second one is to release the burden on manual labor and accomplish real-time object detection, with the emergence of large-scale image and video data. The following three aspects can be taken into account.

1) Cascade Network: In a cascade network, a cascade of detectors is built in different stages or layers [180], [220]. Easily distinguishable examples are rejected at shallow layers so that features and classifiers at later stages can handle more difficult samples with the aid of the decisions from previous stages. However, current cascades are built in a greedy manner, where previous stages in the cascade are fixed when training a new stage. Therefore, the optimizations of different CNNs are isolated, which stresses the necessity of end-to-end optimization for the CNN cascade. At the same time, it is also a matter of concern to build contextual associated cascade networks with existing layers.

2) Unsupervised and Weakly Supervised Learning: It is very time-consuming to manually draw large quantities of BBs. To release this burden, semantic prior [55], unsupervised object discovery [221], multiple instance learning [222], and DNN prediction [47] can be integrated to make the best use of image-level supervision to assign object category tags to corresponding object regions and refine object boundaries. Furthermore, weak annotations (e.g., center-click annotations [223]) are also helpful for achieving high-quality detectors with modest annotation efforts, especially aided by the mobile platform.

3) Network Optimization: Given specific applications and platforms, it is significant to make a balance among speed, memory, and accuracy by selecting an optimal detection architecture [116], [224]. However, even if detection accuracy is somewhat reduced, it is more meaningful to learn compact models with fewer parameters [209]. This situation can be relieved by introducing better pretraining schemes [225], knowledge distillation [226], and hint learning [227]. DSOD also provides a promising guideline to train from scratch to bridge the gap between different image sources and tasks [74].

The third one is to extend typical methods for 2-D object detection to adapt to 3-D object detection and video object detection, with the requirements from autonomous driving, intelligent transportation, and intelligent surveillance.

1) 3-D Object Detection: With the applications of 3-D sensors (e.g., Light Detection and Ranging and cameras), additional depth information can be utilized to better understand the images in 2-D and extend the image-level knowledge to the real world. However, few of these 3-D-aware techniques aim to place correct 3-D BBs around detected objects. To achieve better bounding results, multiview representation [181] and 3-D proposal networks [228] may provide some guidelines to encode depth information with the aid of inertial sensors (accelerometer and gyrometer) [229].
2) Video Object Detection: Temporal information across different frames plays an important role in understanding the behaviors of different objects. However, the accuracy suffers from degenerated object appearances (e.g., motion blur and video defocus) in videos, and the network is usually not trained end to end. To this end, spatiotemporal tubelets [230], optical flow [199], and LSTM [107] should be considered to fundamentally model object associations between consecutive frames.

VIII. CONCLUSION

Due to its powerful learning ability and advantages in dealing with occlusion, scale transformation, and background switches, deep learning-based object detection has been a research hotspot in recent years. This paper provides a detailed review on deep learning-based object detection frameworks that handle different subproblems, such as occlusion, clutter, and low resolution, with different degrees of modifications on R-CNN. The review starts on generic object detection pipelines which provide base architectures for other related tasks. Then, three other common tasks, namely, salient object detection, face detection, and pedestrian detection, are also briefly reviewed. Finally, we propose several promising future directions to gain a thorough understanding of the object detection landscape. This review is also meaningful for the developments in neural networks and related learning systems, which provides valuable insights and guidelines for future progress.

REFERENCES

[1] P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[2] K.-K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39–51, Jan. 1998.
[3] C. Wojek et al., “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 743–761, Apr. 2012.
[4] H. Kobatake and Y. Yoshinaga, “Detection of spicules on mammogram based on skeleton analysis,” IEEE Trans. Med. Imag., vol. 15, no. 3, pp. 235–245, Jun. 1996.
[5] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM MM, 2014, pp. 675–678.
[6] A. Krizhevsky et al., “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[7] Z. Cao et al., “Realtime multi-person 2D pose estimation using part affinity fields,” in Proc. CVPR, 2017, pp. 1302–1310.
[8] Z. Yang and R. Nevatia, “A multi-scale cascade fully convolutional network face detector,” in Proc. ICPR, 2016, pp. 633–638.
[9] C. Chen et al., “DeepDriving: Learning affordance for direct perception in autonomous driving,” in Proc. ICCV, 2015, pp. 2722–2730.
[10] X. Chen et al., “Multi-view 3D object detection network for autonomous driving,” in Proc. CVPR, 2017, pp. 6526–6534.
[11] A. Dundar et al., “Embedded streaming deep neural networks accelerator with applications,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1572–1583, Jul. 2017.
[12] R. J. Cintra et al., “Low-complexity approximate convolutional neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 5981–5992, 2018.
[13] S. H. Khan et al., “Cost-sensitive learning of deep feature representations from imbalanced data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3573–3587, Aug. 2018.
[14] A. Stuhlsatz et al., “Feature extraction with deep neural networks by a generalized discriminant analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 4, pp. 596–608, Apr. 2012.
[15] R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. CVPR, 2014, pp. 580–587.
[16] R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015, pp. 1440–1448.
[17] S. Ren et al., “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. NIPS, 2015, pp. 91–99.
[18] J. Redmon et al., “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788.
[19] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. CVPR, 2005, pp. 886–893.
[21] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proc. ICIP, 2002, p. 1.
[22] C. Cortes and V. Vapnik, “Support vector machine,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[23] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[24] P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[25] M. Everingham et al., “The pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2008.
[26] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[27] Y. LeCun et al., “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[28] N. Liu et al., “Predicting eye fixations using convolutional neural networks,” in Proc. CVPR, 2015, pp. 362–370.
[29] E. Vig et al., “Large-scale optimization of hierarchical features for saliency prediction in natural images,” in Proc. CVPR, 2014, pp. 2798–2805.
[30] H. Jiang and E. Learned-Miller, “Face detection with the faster R-CNN,” in Proc. FG, 2017, pp. 650–657.
[31] D. Chen et al., “Joint cascade face detection and alignment,” in Proc. ECCV, 2014, pp. 109–122.
[32] D. Chen et al., “Supervised transformer network for efficient face detection,” in Proc. ECCV, 2016, pp. 122–138.
[33] A. Mateus et al. (2016). “Efficient and robust pedestrian detection using deep learning for human-aware navigation.” [Online]. Available: https://arxiv.org/abs/1607.04441
[34] F. Yang et al., “Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers,” in Proc. CVPR, 2016, pp. 2129–2137.
[35] P. N. Druzhkov and V. D. Kustikova, “A survey of deep learning methods and software tools for image classification and object detection,” Pattern Recognit. Image Anal., vol. 26, no. 1, pp. 9–15, 2016.
[36] W. Pitts and W. S. McCulloch, “How we know universals the perception of auditory and visual forms,” Bull. Math. Biophys., vol. 9, no. 3, pp. 127–147, 1947.
[37] D. E. Rumelhart et al., “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, Oct. 1986.
[38] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[39] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, 2009, pp. 248–255.
[40] L. Deng et al., “Binary coding of speech spectrograms using a deep auto-encoder,” in Proc. INTERSPEECH, 2010, pp. 1692–1695.
[41] G. Dahl et al., “Phone recognition with the mean-covariance restricted Boltzmann machine,” in Proc. NIPS, 2010, pp. 469–477.
[42] G. E. Hinton et al. (2012). “Improving neural networks by preventing co-adaptation of feature detectors.” [Online]. Available: https://arxiv.org/abs/1207.0580
[43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015, pp. 448–456.
[44] P. Sermanet et al. (2013). “OverFeat: Integrated recognition, localization and detection using convolutional networks.” [Online]. Available: https://arxiv.org/abs/1312.6229
[45] C. Szegedy et al., “Going deeper with convolutions,” in Proc. CVPR, 2015, pp. 1–9.
[46] K. Simonyan and A. Zisserman. (2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
[47] K. He et al., “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
[48] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[49] M. Oquab et al., “Weakly supervised object recognition with convolutional neural networks,” in Proc. NIPS, 2014, pp. 1–10.
[50] M. Oquab et al., “Learning and transferring mid-level image representations using convolutional neural networks,” in Proc. CVPR, 2014, pp. 1717–1724.
[51] F. M. Wadley, “Probit analysis: A statistical treatment of the sigmoid response curve,” Ann. Entomol. Soc. Amer., vol. 67, no. 4, pp. 549–553, 1947.
[52] K. Kavukcuoglu et al., “Learning invariant features through topographic filter maps,” in Proc. CVPR, 2009, pp. 1605–1612.
[53] K. Kavukcuoglu et al., “Learning convolutional feature hierarchies for visual recognition,” in Proc. NIPS, 2010, pp. 1090–1098.
[54] M. D. Zeiler et al., “Deconvolutional networks,” in Proc. CVPR, 2010, pp. 2528–2535.
[55] H. Noh et al., “Learning deconvolution network for semantic segmentation,” in Proc. ICCV, 2015, pp. 1520–1528.
[56] Z.-Q. Zhao et al., “Plant leaf identification via a growing convolution neural network with progressive sample learning,” in Proc. ACCV, 2014, pp. 348–361.
[57] A. Babenko et al., “Neural codes for image retrieval,” in Proc. ECCV, 2014, pp. 584–599.
[58] J. Wan et al., “Deep learning for content-based image retrieval: A comprehensive study,” in Proc. ACM MM, 2014, pp. 157–166.
[59] D. Tomè et al., “Deep convolutional neural networks for pedestrian detection,” Signal Process., Image Commun., vol. 47, pp. 482–489, Sep. 2016.
[60] Y. Xiang et al., “Subcategory-aware convolutional neural networks for object proposals and detection,” in Proc. WACV, 2017, pp. 924–933.
[61] Z.-Q. Zhao et al., “Pedestrian detection based on fast R-CNN and batch normalization,” in Proc. ICIC, 2017, pp. 735–746.
[62] J. Ngiam et al., “Multimodal deep learning,” in Proc. ICML, 2011, pp. 689–696.
[63] Z. Wu et al., “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in Proc. ACM MM, 2015, pp. 461–470.
[64] K. He et al., “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[65] J. Dai et al., “R-FCN: Object detection via region-based fully convolutional networks,” in Proc. NIPS, 2016, pp. 379–387.
[66] T.-Y. Lin et al., “Feature pyramid networks for object detection,” in Proc. CVPR, 2017, pp. 936–944.
[67] K. He et al., “Mask R-CNN,” in Proc. ICCV, 2017, pp. 2980–2988.
[68] D. Erhan et al., “Scalable object detection using deep neural networks,” in Proc. CVPR, 2014, pp. 2155–2162.
[69] D. Yoo et al., “AttentionNet: Aggregating weak directions for accurate object detection,” in Proc. CVPR, 2015, pp. 2659–2667.
[70] M. Najibi et al., “G-CNN: An iterative grid based object detector,” in Proc. CVPR, 2016, pp. 2369–2377.
[71] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. ECCV, 2016, pp. 21–37.
[72] J. Redmon and A. Farhadi. (2016). “YOLO9000: Better, faster, stronger.” [Online]. Available: https://arxiv.org/abs/1612.08242
[73] C.-Y. Fu et al. (2017). “DSSD: Deconvolutional single shot detector.” [Online]. Available: https://arxiv.org/abs/1701.06659
[74] Z. Shen et al., “DSOD: Learning deeply supervised object detectors from scratch,” in Proc. ICCV, 2017, p. 7.
[75] G. E. Hinton et al., “Transforming auto-encoders,” in Proc. ICANN, 2011, pp. 44–51.
[76] G. W. Taylor et al., “Learning invariance through imitation,” in Proc. CVPR, 2011, pp. 2729–2736.
[77] X. Ren and D. Ramanan, “Histograms of sparse codes for object detection,” in Proc. CVPR, 2013, pp. 3246–3253.
[78] J. R. R. Uijlings et al., “Selective search for object recognition,” Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[79] P. Sermanet et al., “Pedestrian detection with unsupervised multi-stage feature learning,” in Proc. CVPR, 2013, pp. 3626–3633.
[80] P. Krähenbühl and V. Koltun, “Geodesic object proposals,” in Proc. ECCV, 2014, pp. 725–739.
[81] P. Arbeláez et al., “Multiscale combinatorial grouping,” in Proc. CVPR, 2014, pp. 328–335.
[82] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in Proc. ECCV, 2014, pp. 391–405.
[83] W. Kuo et al., “Deepbox: Learning objectness with convolutional networks,” in Proc. ICCV, 2015, pp. 2479–2487.
[84] P. O. Pinheiro et al., “Learning to refine object segments,” in Proc. ECCV, 2016, pp. 75–91.
[85] Y. Zhang et al., “Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction,” in Proc. CVPR, 2015, pp. 249–258.
[86] S. Gupta et al., “Learning rich features from RGB-D images for object detection and segmentation,” in Proc. ECCV, 2014, pp. 345–360.
[87] W. Ouyang et al., “DeepID-Net: Deformable deep convolutional neural networks for object detection,” in Proc. CVPR, 2015, pp. 2403–2412.
[88] K. Lenc and A. Vedaldi. (2015). “R-CNN minus R.” [Online]. Available: https://arxiv.org/abs/1506.06981
[89] S. Lazebnik et al., “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. CVPR, 2006, pp. 2169–2178.
[90] F. Perronnin et al., “Improving the Fisher kernel for large-scale image classification,” in Proc. ECCV, 2010, pp. 143–156.
[91] J. Xue et al., “Restructuring of deep neural network acoustic models with singular value decomposition,” in Proc. INTERSPEECH, 2013, pp. 2365–2369.
[92] S. Ren et al., “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[93] C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in Proc. CVPR, 2016, pp. 2818–2826.
[94] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
[95] S. Bell et al., “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proc. CVPR, 2016, pp. 2874–2883.
[96] A. Arnab and P. H. S. Torr, “Pixelwise instance segmentation with a dynamically instantiated network,” in Proc. CVPR, 2017, pp. 879–888.
[97] J. Dai et al., “Instance-aware semantic segmentation via multi-task network cascades,” in Proc. CVPR, 2016, pp. 3150–3158.
[98] Y. Li et al., “Fully convolutional instance-aware semantic segmentation,” in Proc. CVPR, 2017, pp. 4438–4446.
[99] M. Jaderberg et al., “Spatial transformer networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[100] S. Brahmbhatt et al., “StuffNet: Using ‘Stuff’ to improve object detection,” in Proc. WACV, 2017, pp. 934–943.
[101] T. Kong et al., “HyperNet: Towards accurate region proposal generation and joint object detection,” in Proc. CVPR, 2016, pp. 845–853.
[102] A. Pentina et al., “Curriculum learning of multiple tasks,” in Proc. CVPR, 2015, pp. 5492–5500.
[103] J. Yim et al., “Rotating your face using multi-task deep neural network,” in Proc. CVPR, 2015, pp. 676–684.
[104] J. Li et al. (2016). “Multi-stage object detection with group recursive learning.” [Online]. Available: https://arxiv.org/abs/1608.05159
[105] Z. Cai et al., “A unified multi-scale deep convolutional neural network for fast object detection,” in Proc. ECCV, 2016, pp. 354–370.
[106] Y. Zhu et al., “segDeepM: Exploiting segmentation and context in deep neural networks for object detection,” in Proc. CVPR, 2015, pp. 4703–4711.
[107] W. Byeon et al., “Scene labeling with LSTM recurrent neural networks,” in Proc. CVPR, 2015, pp. 3547–3555.
[108] B. Moysset et al. (2016). “Learning to detect and localize many objects from few examples.” [Online]. Available: https://arxiv.org/abs/1611.05664
[109] X. Zeng et al., “Gated bi-directional CNN for object detection,” in Proc. ECCV, 2016, pp. 354–369.
[110] S. Gidaris and N. Komodakis, “Object detection via a multi-region and semantic segmentation-aware CNN model,” in Proc. CVPR, 2015, pp. 1134–1142.
[111] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[112] S. Zagoruyko et al. (2016). “A multiPath network for object detection.” [Online]. Available: https://arxiv.org/abs/1604.02135
[113] A. Shrivastava et al., “Training region-based object detectors with online hard example mining,” in Proc. CVPR, 2016, pp. 761–769.
[114] S. Ren et al., “Object detection networks on convolutional feature maps,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1476–1481, Jul. 2017.
[115] W. Ouyang et al., “Factors in finetuning deep model for object detection with long-tail distribution,” in Proc. CVPR, 2016, pp. 864–873.
[116] S. Hong et al. (2016). “PVANet: Lightweight deep neural networks for real-time object detection.” [Online]. Available: https://arxiv.org/abs/1611.08588
[117] W. Shang et al., “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in Proc. ICML, 2016, pp. 1–9.
[118] C. Szegedy et al., “Deep neural networks for object detection,” in Proc. NIPS, 2013, pp. 1–9.
[119] P. O. Pinheiro et al., “Learning to segment object candidates,” in Proc. NIPS, 2015, pp. 1990–1998.
[120] C. Szegedy et al. (2014). “Scalable, high-quality object detection.” [Online]. Available: https://arxiv.org/abs/1412.1441
[121] M. Everingham et al. (2011). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results (2012). [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html
[122] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. ECCV, 2014, pp. 818–833.
[123] S. Xie et al., "Aggregated residual transformations for deep neural networks," in Proc. CVPR, 2017, pp. 5987–5995.
[124] J. Dai et al. (2017). "Deformable convolutional networks." [Online]. Available: https://arxiv.org/abs/1703.06211
[125] C. Rother et al., "AutoCollage," ACM Trans. Graph., vol. 25, no. 3, pp. 847–852, 2006.
[126] C. Jung and C. Kim, "A unified spectral-domain approach for saliency detection and its application to automatic object segmentation," IEEE Trans. Image Process., vol. 21, no. 3, pp. 1272–1283, Mar. 2012.
[127] W.-C. Tu et al., "Real-time salient object detection with a minimum spanning tree," in Proc. CVPR, 2016, pp. 2334–2342.
[128] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 576–588, Mar. 2017.
[129] P. L. Rosin, "A simple method for detecting salient regions," Pattern Recognit., vol. 42, no. 11, pp. 2363–2371, Nov. 2009.
[130] T. Liu et al., "Learning to detect a salient object," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[131] J. Long et al., "Fully convolutional networks for semantic segmentation," in Proc. CVPR, 2015, pp. 3431–3440.
[132] D. Gao et al., "Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp. 989–1005, Jun. 2009.
[133] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proc. ICCV, 2015, pp. 1395–1403.
[134] M. Kümmerer et al. (2014). "Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet." [Online]. Available: https://arxiv.org/abs/1411.1045
[135] X. Huang et al., "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proc. ICCV, 2015, pp. 262–270.
[136] L. Wang et al., "Deep networks for saliency detection via local estimation and global search," in Proc. CVPR, 2015, pp. 3183–3192.
[137] H. Cholakkal et al. (2016). "Backtracking spatial pyramid pooling (SPP)-based image classifier for weakly supervised top-down salient object detection." [Online]. Available: https://arxiv.org/abs/1611.05345
[138] R. Zhao et al., "Saliency detection by multi-context deep learning," in Proc. CVPR, 2015, pp. 1265–1274.
[139] C. Bak et al. (2016). "Spatio-temporal saliency networks for dynamic saliency prediction." [Online]. Available: https://arxiv.org/abs/1607.04730
[140] S. He et al., "SuperCNN: A superpixelwise convolutional neural network for salient object detection," Int. J. Comput. Vis., vol. 115, no. 3, pp. 330–344, 2015.
[141] X. Li et al., "DeepSaliency: Multi-task deep neural network model for salient object detection," IEEE Trans. Image Process., vol. 25, no. 8, pp. 3919–3930, Aug. 2016.
[142] Y. Tang and X. Wu, "Saliency detection via combining region-level and pixel-level predictions with CNNs," in Proc. ECCV, 2016, pp. 809–825.
[143] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in Proc. CVPR, 2016, pp. 478–487.
[144] X. Wang et al. (2016). "Edge preserving and multi-scale contextual neural network for salient object detection." [Online]. Available: https://arxiv.org/abs/1608.08029
[145] M. Cornia et al., "A deep multi-level network for saliency prediction," in Proc. ICPR, 2016, pp. 3488–3493.
[146] G. Li and Y. Yu, "Visual saliency detection based on multiscale deep CNN features," IEEE Trans. Image Process., vol. 25, no. 11, pp. 5012–5024, Nov. 2016.
[147] J. Pan et al., "Shallow and deep convolutional networks for saliency prediction," in Proc. CVPR, 2016, pp. 598–606.
[148] J. Kuen et al., "Recurrent attentional networks for saliency detection," in Proc. CVPR, 2016, pp. 3668–3677.
[149] Y. Tang et al., "Deeply-supervised recurrent convolutional neural network for saliency detection," in Proc. ACM MM, 2016, pp. 397–401.
[150] X. Li et al., "Contextual hypergraph modeling for salient object detection," in Proc. ICCV, 2013, pp. 3328–3335.
[151] M.-M. Cheng et al., "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
[152] H. Jiang et al., "Salient object detection: A discriminative regional feature integration approach," in Proc. CVPR, 2013, pp. 2083–2090.
[153] G. Lee et al., "Deep saliency with encoded low level distance map and high level features," in Proc. CVPR, 2016, pp. 660–668.
[154] Z. Luo et al., "Non-local deep features for salient object detection," in Proc. CVPR, 2017, pp. 6593–6601.
[155] Q. Hou et al. (2016). "Deeply supervised salient object detection with short connections." [Online]. Available: https://arxiv.org/abs/1611.04849
[156] Q. Yan et al., "Hierarchical saliency detection," in Proc. CVPR, 2013, pp. 1155–1162.
[157] Y. Li et al., "The secrets of salient object segmentation," in Proc. CVPR, 2014, pp. 280–287.
[158] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in Proc. CVPRW, 2010, pp. 49–56.
[159] A. Borji et al., "Salient object detection: A benchmark," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, Dec. 2015.
[160] C. Peng et al., "Graphical representation for heterogeneous face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 301–312, Feb. 2017.
[161] C. Peng et al., "Face recognition from multiple stylistic sketches: Scenarios, datasets, and evaluation," in Proc. ECCV, 2016, pp. 3–18.
[162] X. Gao et al., "Face sketch–photo synthesis and retrieval using sparse representation," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 8, pp. 1213–1226, Aug. 2012.
[163] N. Wang et al., "A comprehensive survey to face hallucination," Int. J. Comput. Vis., vol. 106, no. 1, pp. 9–30, 2014.
[164] C. Peng et al., "Multiple representations-based face sketch–photo synthesis," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2201–2215, Nov. 2016.
[165] A. Majumder et al., "Automatic facial expression recognition system using deep network-based data fusion," IEEE Trans. Cybern., vol. 48, no. 1, pp. 103–114, Jan. 2018.
[166] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
[167] J. Yu et al., "UnitBox: An advanced object detection network," in Proc. ACM MM, 2016, pp. 516–520.
[168] S. S. Farfade et al., "Multi-view face detection using deep convolutional neural networks," in Proc. ICMR, 2015, pp. 643–650.
[169] S. Yang et al., "From facial parts responses to face detection: A deep learning approach," in Proc. ICCV, 2015, pp. 3676–3684.
[170] S. Yang et al., "Face detection through scale-friendly deep convolutional networks," in Proc. CVPR, 2017, pp. 1–12.
[171] Z. Hao et al., "Scale-aware face detection," in Proc. CVPR, 2017, pp. 1913–1922.
[172] H. Wang et al. (2017). "Face R-CNN." [Online]. Available: https://arxiv.org/abs/1706.01061
[173] X. Sun et al. (2017). "Face detection using deep learning: An improved faster RCNN approach." [Online]. Available: https://arxiv.org/abs/1701.08289
[174] L. Huang et al. (2015). "DenseBox: Unifying landmark localization with end to end object detection." [Online]. Available: https://arxiv.org/abs/1509.04874
[175] Y. Li et al., "Face detection with end-to-end integration of a ConvNet and a 3D model," in Proc. ECCV, 2016, pp. 420–436.
[176] K. Zhang et al., "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[177] I. A. Kalinovsky and V. G. Spitsyn, "Compact convolutional neural network cascade for face detection," in Proc. CEUR Workshop, 2016, pp. 375–387.
[178] H. Qin et al., "Joint training of cascaded CNN for face detection," in Proc. CVPR, 2016, pp. 3456–3465.
[179] V. Jain and E. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Univ. Massachusetts, Amherst, Amherst, MA, USA, Tech. Rep. UM-CS-2010-009, 2010.
[180] H. Li et al., "A convolutional neural network cascade for face detection," in Proc. CVPR, 2015, pp. 5325–5334.
[181] B. Yang et al., "Aggregate channel features for multi-view face detection," in Proc. IJCB, 2014, pp. 1–8.
[182] N. Markuš et al. (2013). "Object detection with pixel intensity comparisons organized in decision trees." [Online]. Available: https://arxiv.org/abs/1305.4537
[183] M. Mathias et al., "Face detection without bells and whistles," in Proc. ECCV, 2014.
[184] J. Li and Y. Zhang, "Learning SURF cascade for fast and accurate object detection," in Proc. CVPR, 2013.
[185] S. Liao et al., "A fast and accurate unconstrained face detector," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 211–223, Feb. 2016.
[186] B. Yang et al., "Convolutional channel features," in Proc. ICCV, 2015, pp. 82–90.
[187] R. Ranjan et al. (2016). "HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition." [Online]. Available: https://arxiv.org/abs/1603.01249
[188] P. Hu and D. Ramanan, "Finding tiny faces," in Proc. CVPR, 2017, pp. 1522–1530.
[189] Z. Jiang and D. Q. Huynh, "Multiple pedestrian tracking from monocular videos in an interacting multiple model framework," IEEE Trans. Image Process., vol. 27, no. 3, pp. 1361–1375, Mar. 2018.
[190] D. M. Gavrila and S. Munder, "Multi-cue pedestrian detection and tracking from a moving vehicle," Int. J. Comput. Vis., vol. 73, no. 1, pp. 41–59, Jun. 2007.
[191] S. Xu et al., "Jointly attentive spatial-temporal pooling networks for video-based person re-identification," in Proc. ICCV, 2017, pp. 4743–4752.
[192] Z. Liu et al., "Stepwise metric promotion for unsupervised video person re-identification," in Proc. ICCV, 2017, pp. 2448–2457.
[193] A. Khan et al., "Cooperative robots to observe moving targets: Review," IEEE Trans. Cybern., vol. 48, no. 1, pp. 187–198, Jan. 2018.
[194] A. Geiger et al., "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[195] Z. Cai et al., "Learning complexity-aware cascades for deep pedestrian detection," in Proc. ICCV, 2015, pp. 3361–3369.
[196] Y. Tian et al., "Deep learning strong parts for pedestrian detection," in Proc. CVPR, 2015, pp. 1904–1912.
[197] P. Dollár et al., "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[198] S. Zhang et al., "Filtered channel features for pedestrian detection," in Proc. CVPR, 2015, pp. 1751–1760.
[199] S. Paisitkriangkrai et al., "Pedestrian detection with spatially pooled features and structured ensemble learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1243–1257, Jun. 2016.
[200] L. Lin et al., "Discriminatively trained And-Or graph models for object shape detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 5, pp. 959–972, May 2015.
[201] M. Mathias et al., "Handling occlusions with Franken-classifiers," in Proc. ICCV, 2013, pp. 1505–1512.
[202] S. Tang et al., "Detection and tracking of occluded people," Int. J. Comput. Vis., vol. 110, no. 1, pp. 58–69, 2014.
[203] L. Zhang et al., "Is faster R-CNN doing well for pedestrian detection?" in Proc. ECCV, 2016, pp. 443–457.
[204] Y. Tian et al., "Deep learning strong parts for pedestrian detection," in Proc. ICCV, 2015.
[205] J. Liu et al. (2016). "Multispectral deep neural networks for pedestrian detection." [Online]. Available: https://arxiv.org/abs/1611.02644
[206] Y. Tian et al., "Pedestrian detection aided by deep learning semantic tasks," in Proc. CVPR, 2015, pp. 5079–5087.
[207] X. Du et al., "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection," in Proc. WACV, 2017, pp. 953–961.
[208] Q. Hu et al., "Pushing the limits of deep CNNs for pedestrian detection," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 6, pp. 1358–1368, Jun. 2018.
[209] D. Tomé et al., "Reduced memory region based deep convolutional neural network detection," in Proc. ICCE, Berlin, Germany, 2016, pp. 15–19.
[210] J. Hosang et al., "Taking a deeper look at pedestrians," in Proc. CVPR, 2015, pp. 4073–4082.
[211] J. Li et al. (2015). "Scale-aware fast R-CNN for pedestrian detection." [Online]. Available: https://arxiv.org/abs/1510.08160
[212] Y. Gao et al., "Visual-textual joint relevance learning for tag-based social image search," IEEE Trans. Image Process., vol. 22, no. 1, pp. 363–376, Jan. 2013.
[213] T. Kong et al., "RON: Reverse connection with objectness prior networks for object detection," in Proc. CVPR, 2017, pp. 5244–5252.
[214] I. J. Goodfellow et al., "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672–2680.
[215] Y. Fang et al., "Object detection meets knowledge graphs," in Proc. IJCAI, 2017, pp. 1661–1667.
[216] S. Welleck et al., "Saliency-based sequential image attention with multiset prediction," in Proc. NIPS, 2017, pp. 5173–5183.
[217] S. Azadi et al., "Learning detection with diverse proposals," in Proc. CVPR, 2017, pp. 7369–7377.
[218] S. Sukhbaatar et al., "End-to-end memory networks," in Proc. NIPS, 2015, pp. 2440–2448.
[219] P. Dabkowski and Y. Gal, "Real time image saliency for black box classifiers," in Proc. NIPS, 2017, pp. 6967–6976.
[220] B. Yang et al., "CRAFT objects from images," in Proc. CVPR, 2016, pp. 6043–6051.
[221] I. Croitoru et al., "Unsupervised learning from video to detect foreground objects in single images," in Proc. ICCV, 2017, pp. 4345–4353.
[222] C. Wang et al., "Weakly supervised object localization with latent category learning," in Proc. ECCV, 2014, pp. 431–445.
[223] D. P. Papadopoulos et al., "Training object class detectors with click supervision," in Proc. CVPR, 2017, pp. 180–189.
[224] J. Huang et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. CVPR, 2017, pp. 3296–3297.
[225] Q. Li et al., "Mimicking very efficient network for object detection," in Proc. CVPR, 2017, pp. 7341–7349.
[226] G. Hinton et al., "Distilling the knowledge in a neural network," Comput. Sci., vol. 14, no. 7, pp. 38–39, 2015.
[227] A. Romero et al., "FitNets: Hints for thin deep nets," in Proc. ICLR, 2015, pp. 1–13.
[228] X. Chen et al., "3D object proposals for accurate object class detection," in Proc. NIPS, 2015, pp. 424–432.
[229] J. Dong et al., "Visual-inertial-semantic scene representation for 3D object detection," in Proc. CVPR, 2017, pp. 960–970.
[230] K. Kang et al., "Object detection in videos with tubelet proposal networks," in Proc. CVPR, 2017, pp. 889–897.

Zhong-Qiu Zhao (M’10) received the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China, Hefei, China, in 2007.
From 2008 to 2009, he held a post-doctoral position in image processing with the CNRS UMR6168 Lab Sciences de l’Information et des Systèmes, La Garde, France. From 2013 to 2014, he was a Research Fellow in image processing with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. He is currently a Professor with the Hefei University of Technology, Hefei. His current research interests include pattern recognition, image processing, and computer vision.

Peng Zheng received the bachelor’s degree from the Hefei University of Technology, Hefei, China, in 2010, where he is currently pursuing the Ph.D. degree.
His current research interests include pattern recognition, image processing, and computer vision.

Shou-Tao Xu is currently pursuing the master’s degree with the Hefei University of Technology, Hefei, China.
His current research interests include pattern recognition, image processing, deep learning, and computer vision.

Xindong Wu (F’11) received the Ph.D. degree in artificial intelligence from The University of Edinburgh, Edinburgh, U.K.
He is currently an Alfred and Helen Lamson Endowed Professor of computer science with the University of Louisiana at Lafayette, Lafayette, LA, USA. His current research interests include data mining, knowledge-based systems, and Web information exploration.
Dr. Wu is a Fellow of the AAAS. He is the Steering Committee Chair of the IEEE International Conference on Data Mining. He served as the Editor-in-Chief for the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (IEEE Computer Society) between 2005 and 2008.