Deep Learning Based Neck Models for Object Detection
Abstract—Artificial intelligence is the science of enabling computers to act without being explicitly programmed. In particular, computer vision is one of its innovative fields; it manages how computers acquire comprehension from videos and images. In previous decades, computer vision has been involved in many fields such as self-driving cars, efficient information retrieval, effective surveillance, and a better understanding of human behaviour. Based on deep neural networks, object detection is actively growing and pushing the limits of detection accuracy and speed. Object detection aims to locate each object instance in an image or a video sequence and assign a class to it. Object detectors are usually built from a backbone network designed for feature extraction, a neck model for feature aggregation, and finally a head for prediction. Neck models, which are the subject of study in this paper, are neural networks used to fuse high-level features with low-level features and are known for their efficiency in object detection. The aim of this study is to present a review of neck models and then a benchmark that researchers and scientists can use as a guideline for their work.

Keywords—Object detection; deep learning; computer vision; neck models; feature aggregation; feature fusion

I. INTRODUCTION

Object detection is often called image detection, object identification, or object recognition; all these concepts are synonymous. It is a computer vision method for locating instances of objects in an image or video sequence. Object detection algorithms, therefore, typically rely on machine learning or deep learning techniques to obtain meaningful results. When humans look at images or videos, they can locate and recognize objects of interest easily. The goal of object detection is to mimic this intelligence using a computer. With recent advancements in deep learning-based computer vision models, object detection use cases are spreading more than ever before. A wide range of applications has been implemented, for instance, self-driving cars, object tracking, anomaly detection, and video surveillance.

Object detection can be divided into two main categories: deep learning-based techniques and machine learning-based techniques. Deep learning-based techniques can be separated into two approaches: one-stage detectors and two-stage detectors. Deep learning-based object detectors share a common pipeline: starting from the input, a backbone model for feature extraction, then a neck model for feature fusion, and finally a head model for class/box prediction.

The neck of an object detector refers to the additional layers existing between the backbone [1] and the head. Their role is to collect feature maps from different stages. Neck models are composed of several top-down paths and several bottom-up paths. The idea behind this feature aggregation is to let low-level features interact more directly with high-level features by mixing information from the high-level features into the low-level ones. Necks achieve aggregation and feature interaction across many layers, since the distance between the two feature maps is large. Several methods can be implemented in this part, for example, PAN [2] or FPN [3] (see Fig. 1).

The head is the last model of the object detector; it predicts the bounding boxes and classes of objects. It can be a dense prediction head, which belongs to one-stage detectors such as YOLO [4], SSD [5], and CenterNet [6], or a sparse prediction head, which belongs to two-stage detectors such as Fast R-CNN [7], Faster R-CNN [8], and Mask R-CNN [9] (see Fig. 1). On the one hand, one-stage detectors have high inference speeds; these models predict bounding boxes in a single step without using region proposals. On the other hand, two-stage detectors have high localization and recognition accuracy: firstly, they use a Region Proposal Network to generate regions of interest; secondly, they send the region proposals on for object classification and bounding-box regression.

We hope that our benchmarking study can provide a timely comparison of neck models for object detection and help practitioners and researchers to further master research on object detection models. The rest of our study is organized as follows: In Section 2, we discuss the different existing related works about feature aggregation. In Section 3, we list the neck neural networks used for feature fusion in object detection and discuss their architectures and categories. In Section 4, our comparative study is presented. In Section 5, we highlight the different notable results, and Section 6 covers the discussion. Finally, in Section 7, we conclude and discuss future directions.
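To make the neck's role concrete, the listing below gives a minimal PyTorch sketch of an FPN-style top-down fusion [3]. It is an illustrative sketch, not the exact configuration of any detector compared here: the channel counts assume a ResNet-50-like backbone, and the class and layer names (SimpleFPNNeck, lateral, output) as well as the nearest-neighbour upsampling are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNNeck(nn.Module):
    """Minimal FPN-style neck: fuses backbone maps C3-C5 through a
    top-down path with 1x1 lateral connections (sketch, see [3])."""

    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone stage to a common width
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth each merged map before it reaches the head
        self.output = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: [C3, C4, C5], ordered from fine to coarse resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # Top-down path: upsample the coarser level and add it to the finer
        # one, so high-level semantics flow directly into low-level maps
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [conv(x) for conv, x in zip(self.output, laterals)]  # P3-P5

# Shapes as a ResNet-50 backbone would produce them for a 256x256 input
c3 = torch.rand(1, 512, 32, 32)
c4 = torch.rand(1, 1024, 16, 16)
c5 = torch.rand(1, 2048, 8, 8)
print([p.shape for p in SimpleFPNNeck()([c3, c4, c5])])
```

A PAN-style neck [2] would append a second, bottom-up pass over the resulting P3-P5 maps, shortening the path from low-level features back up to the head.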
Fig. 1. Models' Taxonomy of Object Detectors in each Part: Backbone, Head, and Neck.
IV. COMPARISON

Table I below lists the models that we are going to compare based on different comparison metrics. The measures were gathered carefully to cover several methods.

The table presents the deep learning models applied to the object detection task on the COCO dataset. The Model column gives the model used to predict classes and bounding boxes; the Backbone column gives the network used for feature extraction, where the associated number refers to the number of layers; and the Neck column gives the feature aggregation network used.

In full, Table I contains the model's reference, journal and year, name, backbone, neck, AP, AP50, AP75, APS, APM, and APL (see Table I).
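The AP columns follow the standard COCO evaluation protocol: AP averages precision over IoU thresholds 0.50:0.95, AP50 and AP75 fix the IoU threshold at 0.50 and 0.75, and APS, APM, and APL restrict evaluation to small, medium, and large objects. As a sketch of how such numbers are produced, the snippet below runs the pycocotools reference evaluation; the two file paths are placeholders for a ground-truth annotation file and a detector's JSON results.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO ground truth and detections in COCO result format
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")  # [{image_id, category_id, bbox, score}, ...]

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()

# ev.stats holds the metrics reported in Table I:
# stats[0]=AP, stats[1]=AP50, stats[2]=AP75,
# stats[3]=APS, stats[4]=APM, stats[5]=APL
ap, ap50, ap75, ap_s, ap_m, ap_l = ev.stats[:6]
print(f"AP={ap:.3f} AP50={ap50:.3f} AP75={ap75:.3f}")
```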
TABLE I. DETAILED COMPARISONS ON MULTIPLE POPULAR BASELINE OBJECT DETECTORS ON THE COCO DATASET

Ref   Journal     Model           Backbone            Neck     AP    AP50  AP75  APS   APM   APL
[18]  CVPR 2019   Libra R-CNN     ResNet-50           FPN      38.7  59.9  42.0  22.5  41.1  48.7
[18]  CVPR 2019   Libra R-CNN     ResNet-101          FPN      40.3  61.3  43.9  22.9  43.1  51.0
[18]  CVPR 2019   Libra R-CNN     ResNeXt-101         FPN      43.0  64.0  47.0  25.3  45.6  54.6
[8]   NIPS 2015   Faster R-CNN    ResNet-50           FPN      37.8  58.7  40.6  21.3  41.0  49.5
[8]   NIPS 2015   Faster R-CNN    ResNet-50           AdaFPN   39.0  58.8  41.8  22.6  42.3  50.0
[8]   NIPS 2015   Faster R-CNN    ResNet-50           AugFPN   38.8  61.5  42.0  23.3  42.1  47.7
[8]   NIPS 2015   Faster R-CNN    ResNet-101          AugFPN   41.5  63.9  45.1  23.8  44.7  52.8
[8]   NIPS 2015   Faster R-CNN    ResNeXt-101-32x4d   AugFPN   41.9  64.4  45.6  25.2  45.4  52.6
[8]   NIPS 2015   Faster R-CNN    ResNeXt-101-64x4d   AugFPN   43.0  65.6  46.9  26.2  46.5  53.9
[8]   NIPS 2015   Faster R-CNN    MobileNet-v2        AugFPN   34.2  56.6  36.2  19.6  36.4  43.1
[28]  ICCV 2019   FCOS            ResNet-50           AugFPN   37.9  58.0  40.4  21.2  40.5  47.9
[28]  ICCV 2019   FCOS            ResNet-50           FPN      39.1  57.9  42.1  23.3  43.0  50.2
[28]  ICCV 2019   FCOS            ResNet-50           AdaFPN   40.1  58.6  43.2  24.1  43.6  50.6
[28]  ICCV 2019   FCOS            ResNeXt-101         FPN      42.7  62.2  46.1  26.0  45.6  52.6
[9]   ICCV 2017   Mask R-CNN      ResNet-101          FPN      38.2  60.3  41.7  20.1  41.1  50.2
[9]   ICCV 2017   Mask R-CNN      ResNeXt-101         FPN      39.8  62.3  43.4  22.1  43.2  51.2
[9]   ICCV 2017   Mask R-CNN      ResNet-50           AugFPN   39.5  61.8  42.9  23.4  42.7  49.1
[9]   ICCV 2017   Mask R-CNN      ResNet-101          AugFPN   42.4  64.4  46.3  24.6  45.7  54.0
[9]   ICCV 2017   Mask R-CNN      ResNet-50           A²-FPN   36.6  59.3  39.1  19.8  39.3  48.0
[9]   ICCV 2017   Mask R-CNN      ResNet-101          A²-FPN   37.9  60.8  40.5  20.6  41.8  50.1
[29]  CVPR 2018   Cascade R-CNN   ResNet-50           FPN      36.5  59.0  39.2  20.3  38.8  46.4
[29]  CVPR 2018   Cascade R-CNN   ResNet-101          FPN      38.8  61.1  41.9  21.3  41.8  49.8
[29]  CVPR 2018   Cascade R-CNN   ResNet-101          AC-FPN   45.0  64.4  49.0  26.9  47.7  56.6
[30]  ICCV 2017   RetinaNet       ResNet-101          FPN      39.1  59.1  42.3  21.8  42.7  50.2
[30]  ICCV 2017   RetinaNet       ResNeXt-101         FPN      40.8  61.1  44.1  24.1  44.2  51.2
[30]  ICCV 2017   RetinaNet       ResNet-50           AugFPN   37.5  58.4  40.1  21.3  40.5  47.3
[30]  ICCV 2017   RetinaNet       MobileNet-v2        AugFPN   34.0  54.0  36.0  18.6  36.0  44.0
[31]  arXiv 2019  RetinaMask      ResNet-50           FPN      39.4  58.6  42.3  21.9  42.0  51.0
[32]  CVPR 2019   Grid R-CNN      ResNeXt-101         FPN      43.2  63.0  46.6  25.1  46.5  55.2
[33]  CVPR 2019   HTC             ResNeXt-101         FPN      47.1  63.9  44.7  22.8  43.9  54.6
[33]  CVPR 2019   HTC             ResNet-50           FPN      38.4  60.0  41.5  20.4  40.7  51.2
[33]  CVPR 2019   HTC             ResNet-101          FPN      39.7  61.8  43.1  21.0  42.2  53.5
[33]  CVPR 2019   HTC             ResNet-50           A²-FPN   39.8  62.3  43.0  21.6  42.4  52.8
[33]  CVPR 2019   HTC             ResNet-101          A²-FPN   40.8  63.6  44.1  22.3  43.5  54.4
[33]  CVPR 2019   HTC             ResNeXt-101         A²-FPN   42.1  65.3  45.7  23.6  44.8  56.0
[34]  CVPR 2020   DetectoRS       ResNeXt-101-DCN     RFP      53.3  71.6  58.5  33.9  56.5  66.9
[35]  arXiv 2021  CenterNet2      Res2Net-101-DCN     BiFPN    56.4  74.0  61.6  38.7  59.7  68.6
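To reproduce selections such as the "six top average precision" comparison in Section V, Table I can be ranked programmatically. A small sketch with pandas follows; only a handful of rows are transcribed here, all copied from the table above.

```python
import pandas as pd

# A subset of Table I (Model, Backbone, Neck, AP), copied from above
rows = [
    ("CenterNet2",    "Res2Net-101-DCN",   "BiFPN",  56.4),
    ("DetectoRS",     "ResNeXt-101-DCN",   "RFP",    53.3),
    ("HTC",           "ResNeXt-101",       "FPN",    47.1),
    ("Cascade R-CNN", "ResNet-101",        "AC-FPN", 45.0),
    ("Grid R-CNN",    "ResNeXt-101",       "FPN",    43.2),
    ("Libra R-CNN",   "ResNeXt-101",       "FPN",    43.0),
    ("Faster R-CNN",  "ResNeXt-101-64x4d", "AugFPN", 43.0),
]
df = pd.DataFrame(rows, columns=["Model", "Backbone", "Neck", "AP"])

# The six best backbone/neck fusions by average precision
print(df.nlargest(6, "AP"))
```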
5) HTC: For the HTC [33] model, the fusion of ResNeXt-101 and A²-FPN leads in performance; the second most performant fusion is ResNeXt-101 with FPN. Among the models based on a ResNet backbone, ResNet-50 with A²-FPN works better than ResNet-50 with FPN (see Fig. 12).

6) Cascade R-CNN: Cascade R-CNN [29] performance was led by merging ResNet-101 and AC-FPN. The combination of ResNet-101 as a backbone with an FPN neck gained less performance (see Fig. 13).

Fig. 13. Cascade R-CNN Comparison based on Different Feature Aggregation Models.

7) RetinaNet: Regarding RetinaNet [30], firstly, ResNeXt-101 as a backbone with FPN as the feature aggregation model gained the highest performance compared to the other fusions; secondly came ResNet-101 with FPN; thirdly, ResNet-50 with AugFPN; and finally, MobileNet-v2 with AugFPN (see Fig. 14).

Fig. 14. RetinaNet Comparison based on Different Feature Aggregation Models.

8) Six top average precision: On the one hand, after extracting the six best models in terms of average precision, we preferred to compare the methods that gain the top average precision. On the other hand, in terms of performance and based on our spider chart, CenterNet2 achieves the best performance. The best method is based on Res2Net-101-DCN as a backbone and BiFPN as a feature aggregation model. The second rank goes to DetectoRS, based on ResNeXt-101-DCN as a backbone and RFP as a feature aggregation model (see Fig. 15).

Fig. 15. Multicriteria Comparison based on Different Feature Aggregation Models.

VI. DISCUSSION

In this paper, we have systematically depicted the importance of object detection components, covering the deep learning methodologies used in object detection, including two-stage detectors and one-stage detectors.

Firstly, we started by presenting object detection methodologies, which have been categorized into traditional methods and deep learning-based methodologies. Secondly, we discussed the main arrangement of deep learning-based object detection, which includes a backbone, usually pretrained, used to extract features; then a feature aggregation model, called the neck, for merging high- and low-level features; and finally, the head used for prediction.

Based on our comparative study, we notice that CenterNet2, with Res2Net-101-DCN as a backbone and BiFPN as a feature fusion model, leads the performance and gains widespread dominance thanks to its supremacy regarding all criteria.

DetectoRS, with ResNeXt-101-DCN as a backbone and RFP as a feature fusion model, reaches the second score. HTC gains the third position with its high performance based on ResNeXt-101 as a backbone and FPN. We also notice that there is no intersection between the compared algorithms: each algorithm maintains its advantage over the lower-ranked ones across all criteria.

This comparison has also revealed the importance of making a benchmark in order to have a global, straightforward view of how to build efficient models with high performance.

The comparison has been made based on a set of criteria. The scores for each evaluated method were calculated using the weighted score model: each criterion is multiplied by a weight and the weighted values are summed into an overall score. The various scores have not only helped us determine an overall ranking, but they have also shown each method's internal strengths and weaknesses concerning each criterion; a minimal sketch of this scoring follows below.
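As a hedged illustration of the weighted score model, the sketch below aggregates the Table I criteria into a single score per method. The equal weights are a neutral assumption for demonstration only (the paper does not publish its exact weighting); the metric values are copied from Table I.

```python
# Criteria from Table I and assumed (equal) weights; adjust as needed
criteria = ["AP", "AP50", "AP75", "APS", "APM", "APL"]
weights = [1 / len(criteria)] * len(criteria)

models = {
    # values copied from Table I for the top-ranked fusions
    "CenterNet2 (Res2Net-101-DCN + BiFPN)": [56.4, 74.0, 61.6, 38.7, 59.7, 68.6],
    "DetectoRS (ResNeXt-101-DCN + RFP)":    [53.3, 71.6, 58.5, 33.9, 56.5, 66.9],
    "HTC (ResNeXt-101 + FPN)":              [47.1, 63.9, 44.7, 22.8, 43.9, 54.6],
}

def weighted_score(values):
    """Weighted score model: sum over criteria of weight * value."""
    return sum(w * v for w, v in zip(weights, values))

# Rank methods by their overall weighted score, best first
for name, vals in sorted(models.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{weighted_score(vals):6.2f}  {name}")
```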
On the one hand, we retain from this review and comparison study that the components of deep learning-based object detection models (backbone, neck, and head) highly impact performance. On the other hand, using more layers generally gives higher performance.

VII. CONCLUSION

From the study at hand, it has been noticed that many scientists and researchers from a diversity of backgrounds are working day after day on the object detection field, due to its utmost importance. New models are appearing every month with the growth of deep learning.

This comparison could be used as a support by handing researchers a scientific comparison of different object detection methodologies and their main models, in order to build performant models.

A comparison of the necks used for feature aggregation between high- and low-level features has been presented. We have been interested in presenting different necks and analysing the performance of their overall models.

Future work will focus on the implementation of some of the different deep learning-based object detection models. We aim to implement, test, and analyze the results.

REFERENCES
[1] S. Bouraya and A. Belangour, "Object detectors' convolutional neural networks backbones: a review and a comparative study," vol. 9, no. 11, pp. 1379–1386, 2021.
[2] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 8759–8768, 2018, doi: 10.1109/CVPR.2018.00913.
[3] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 936–944, 2017, doi: 10.1109/CVPR.2017.106.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 779–788, 2016, doi: 10.1109/CVPR.2016.91.
[5] W. Liu et al., "SSD: Single shot multibox detector," Lect. Notes Comput. Sci., vol. 9905 LNCS, pp. 21–37, 2016, doi: 10.1007/978-3-319-46448-0_2.
[6] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," Proc. IEEE Int. Conf. Comput. Vis., pp. 6568–6577, 2019, doi: 10.1109/ICCV.2019.00667.
[7] R. Girshick, "Fast R-CNN," Proc. IEEE Int. Conf. Comput. Vis., pp. 1440–1448, 2015, doi: 10.1109/ICCV.2015.169.
[8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," pp. 1–9.
[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, 2020, doi: 10.1109/TPAMI.2018.2844175.
[10] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," pp. 1–11, 2015. [Online]. Available: http://arxiv.org/abs/1511.04119.
[11] A. Kar, N. Rai, K. Sikka, and G. Sharma, "AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 5699–5708, 2017, doi: 10.1109/CVPR.2017.604.
[12] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. M. Snoek, "VideoLSTM convolves, attends and flows for action recognition," Comput. Vis. Image Underst., vol. 166, pp. 41–50, 2018, doi: 10.1016/j.cviu.2017.10.011.
[13] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," pp. 1–11, 2016.
[14] A. Karpathy and T. Leung, "Large-scale video classification with convolutional neural networks."
[15] J. Donahue, "Long-term recurrent convolutional networks for visual recognition and description," 2014.
[16] N. Ballas, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," pp. 4507–4515, 2015.
[17] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," pp. 1–8.
[18] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, "Libra R-CNN: Towards balanced learning for object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 821–830, 2019, doi: 10.1109/CVPR.2019.00091.
[19] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 7029–7038, 2019, doi: 10.1109/CVPR.2019.00720.
[20] N. Wang et al., "NAS-FCOS: Fast neural architecture search for object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 11940–11948, 2020, doi: 10.1109/CVPR42600.2020.01196.
[21] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1251–1258, 2017.
[22] A. Vaswani et al., "Attention is all you need," Adv. Neural Inf. Process. Syst., pp. 5999–6009, 2017.
[23] X. Wang and R. Girshick, "Non-local neural networks."
[24] Y. Chen, "A²-Nets: Double attention networks," NeurIPS, 2018.
[25] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation."
[26] Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis, "Graph-based global reasoning networks."
[27] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," pp. 10781–10790.
[28] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," Proc. IEEE Int. Conf. Comput. Vis., pp. 9626–9635, 2019, doi: 10.1109/ICCV.2019.00972.
[29] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 6154–6162, 2018, doi: 10.1109/CVPR.2018.00644.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, 2020, doi: 10.1109/TPAMI.2018.2858826.
[31] C.-Y. Fu, M. Shvets, and A. C. Berg, "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free."
[32] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, "Grid R-CNN," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 7355–7364, 2019, doi: 10.1109/CVPR.2019.00754.
[33] K. Chen et al., "Hybrid task cascade for instance segmentation," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 4969–4978, 2019, doi: 10.1109/CVPR.2019.00511.
[34] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," 2020. [Online]. Available: http://arxiv.org/abs/2006.02334.
[35] X. Zhou, V. Koltun, and P. Krähenbühl, "Probabilistic two-stage detection," 2021. [Online]. Available: http://arxiv.org/abs/2103.07461.