Object Detection and Tracking Algorithms For Vehicle Counting: A Comparative Analysis
Vishal Mandal and Yaw Adu-Gyamfi

Vishal Mandal is with the Department of Civil and Environmental Engineering, University of Missouri-Columbia and with WSP USA, 211 N Broadway Suite # 2800, St. Louis, MO 63102 USA (e-mail: vmghv@mail.missouri.edu).
Yaw Adu-Gyamfi is with the Department of Civil and Environmental Engineering, E2509 Lafferre Hall, Columbia, MO 65211 USA (e-mail: adugyamfiy@missouri.edu).
based counting [22, 5, 6, 16], counting by density estimation [12] and deep learning based counting [25, 27, 18, 28, 1, 21, 10]. The first three counting methods are environmentally sensitive and generally do not perform well in occluded environments or in videos with low frame rates. While counting by density estimation follows a supervised approach, it performs poorly in videos that have a large perspective or contain oversized vehicles. Density estimation based methods are also limited in their scope of detection and lack object tracking capabilities. Finally, out of all these counting approaches, deep learning based counting techniques have seen the greatest developments in recent years. Advances in their underlying architectures have significantly improved vehicle counting performance. In this study, we mainly focus on counting methods that are founded on deep learning based architectures.

Awang et al. [3] proposed a deep learning based technique that tabulates the number of vehicles based on a layer-skipping strategy within a convolutional neural network (CNN). Prior to performing counts, their approach classifies vehicles into different classes based on their distinct features. Dai et al. [9] deployed a video based vehicle counting technique using a three-step strategy of object detection, tracking and trajectory analysis. Their method uses a trajectory counting algorithm that accurately computes the number of vehicles in their respective categories and tracks vehicle routes to obtain traffic flow information. Similarly, a deep neural network is trained to detect and count the number of cars in [17]. This approach integrates residual learning alongside inception-style layers to count cars in a single look. Lately, it has been demonstrated that single-look techniques have the potential to excel at both the speed and accuracy requirements [20] useful for object recognition and localization. This could also prove beneficial for processing image frames at much faster rates so as to produce accurate vehicle counts in real-time conditions. The authors in [15] treat counting as a computer vision problem and present an adaptive real-time vehicle counting algorithm that takes a robust detection and counting approach in an urban setting. Another line of work combines fully convolutional networks (FCN) with long short-term memory (LSTM) networks in a residual learning environment. This approach leverages the capabilities of FCN based pixel-wise estimation and the strengths of LSTM to learn difficult time-based vehicle dynamics. The counting accuracy is thus improved by putting the time-based correlation into perspective.

III. DATA DESCRIPTION

Traffic images and video feeds were the two kinds of dataset used in this study. These datasets were obtained from cameras located at 6 different roadways over a seven-day period. The cameras were installed across different roadways in New Orleans and are maintained by the Louisiana Department of Transportation and Development (La DOTD). To train and generate robust models, datasets pertaining to different weather conditions were collected. To incorporate that, video feeds were recorded at the start of every hour for one minute, following the same loop for the entire 24 hours in a day. This recording was continued for 1 week at all 6 roadways. The traffic images and videos cover daylight, nighttime and rain. To train all the models used in this study, altogether 11,101 images were manually annotated for different classes of vehicles, viz. cars and trucks. Figure 1 shows all 6 cameras maintained by La DOTD and their respective camera views. Similarly, any vehicle that travelled across the green and blue polygons was counted and appended to the northbound and southbound totals respectively.
EfficientDet is a state-of-the-art object detection algorithm that follows the single-stage detector pattern [34]. The architecture of EfficientDet is shown in Figure 4. Here, ImageNet-pretrained EfficientNets are deployed as the network's backbone. Similarly, in order to obtain easier and quicker multi-scale fusion of features, a weighted bi-directional feature pyramid network (BiFPN) is used. BiFPN serves as the feature network: it receives level 3-7 features from the backbone network and repeatedly performs top-down and bottom-up bidirectional fusion of features. These synthesized features are passed to a class and box network to produce vehicle class and bounding box predictions respectively. Also, the class and box network weights are shared across all levels of features.
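To make the fusion step concrete, the short sketch below implements the fast normalized fusion rule described in the EfficientDet paper [34], in which every input to a BiFPN node receives a learnable non-negative weight. The shapes and weight values here are illustrative assumptions, not values from this study.

    import numpy as np

    def fast_normalized_fusion(features, weights, eps=1e-4):
        """Fuse feature maps: O = sum_i (w_i / (eps + sum_j w_j)) * I_i.

        features: equally-shaped arrays (already resized to one scale)
        weights:  learnable scalars, kept non-negative via ReLU
        """
        w = np.maximum(weights, 0.0)          # ReLU keeps weights >= 0
        norm = eps + w.sum()                  # normalizing term
        return sum(wi / norm * fi for wi, fi in zip(w, features))

    # Illustrative use: fuse two feature maps at one BiFPN node.
    p_td = np.random.rand(64, 64, 3)   # top-down pathway input (illustrative)
    p_in = np.random.rand(64, 64, 3)   # same-level backbone input (illustrative)
    fused = fast_normalized_fusion([p_td, p_in], np.array([0.7, 0.3]))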
1:  function Tracker(detections, σl, σh, σiou, mintsize)   ⊲ detections: per-frame dict(class, score, box)
2:      let σl ← low detection threshold
3:      let σh ← high detection threshold
4:      let σiou ← IOU threshold
5:      let mintsize ← minimum track size in frames
6:      let Ta ← []   ⊲ active tracks
7:      let Tf ← []   ⊲ finished tracks
8:      for frame, dets in detections do
9:          dets ← filter for dets with score ≥ σl
10:         let Tu ← []   ⊲ updated tracks
11:         for ti in Ta do
12:             if not empty(dets) then
13:                 biou, bbox ← find max iou box(tail box(ti), dets)
14:                 if biou ≥ σiou then
15:                     append new detection(ti, bbox)
16:                     set max score(ti, box score(bbox))
17:                     set class(ti, box class(bbox))
18:                     Tu ← append(Tu, ti)
19:                     remove(dets, bbox)   ⊲ remove box from dets
20:             if empty(Tu) or ti is not last(Tu) then   ⊲ ti found no match this frame
21:                 if get max score(ti) ≥ σh and size(ti) ≥ mintsize then
22:                     Tf ← append(Tf, ti)
23:         Tn ← new tracks from dets
24:         Ta ← Tu + Tn
25:     return Tf
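As a concrete reference, the following is a minimal Python sketch of the tracker logic above. The per-frame detection format (a list of dicts with class, score and box keys) and the threshold values are illustrative assumptions, not the exact implementation used in this study.

    def iou(a, b):
        """IOU of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    def track_iou(detections, sigma_l=0.3, sigma_h=0.5, sigma_iou=0.5, t_min=3):
        """detections: list over frames; each frame is a list of
        {'class': str, 'score': float, 'box': (x1, y1, x2, y2)} dicts."""
        tracks_active, tracks_finished = [], []
        for dets in detections:
            dets = [d for d in dets if d['score'] >= sigma_l]
            updated = []
            for track in tracks_active:
                if dets:
                    # match the detection with highest IOU to the track's last box
                    best = max(dets, key=lambda d: iou(track['boxes'][-1], d['box']))
                    if iou(track['boxes'][-1], best['box']) >= sigma_iou:
                        track['boxes'].append(best['box'])
                        track['max_score'] = max(track['max_score'], best['score'])
                        track['class'] = best['class']
                        updated.append(track)
                        dets.remove(best)
                        continue
                # track was not extended this frame: finish it if good enough
                if track['max_score'] >= sigma_h and len(track['boxes']) >= t_min:
                    tracks_finished.append(track)
            # remaining detections start new tracks
            new = [{'boxes': [d['box']], 'max_score': d['score'], 'class': d['class']}
                   for d in dets]
            tracks_active = updated + new
        return tracks_finished

As the pseudocode makes explicit, a track survives only while it finds a detection with sufficient overlap in every frame, which is why missed detections and low frame rates fragment tracks under the plain IOU tracker.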
Similarly, Kalman-IOU (KIOU) tracking has been further explored. The Kalman filter's ability to perform predictions allows users to skip frames while still keeping track of the object. Skipping frames speeds up the overall tracking-by-detection process, since running the detector on fewer frames lowers the computational requirement. Using an appropriate object detector with the Kalman-IOU tracker and configuring it to skip two-thirds of the frames each second could enable the tracker to run in real-time. Likewise, this feature could also improve the performance of the Kalman-IOU tracker compared to the original IOU tracker.
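The sketch below illustrates the frame-skipping idea with a constant-velocity Kalman filter from the filterpy library; the skip pattern (detect on every third frame), the state layout and the stand-in detector are illustrative assumptions.

    import numpy as np
    from filterpy.kalman import KalmanFilter

    def detect_centre(i):
        """Stand-in for a real detector: a target drifting right at 2 px/frame."""
        return 100.0 + 2.0 * i, 50.0

    # Constant-velocity filter over a box centre (cx, cy): state = [cx, cy, vx, vy].
    kf = KalmanFilter(dim_x=4, dim_z=2)
    kf.F = np.array([[1., 0., 1., 0.],    # position advances by velocity each frame
                     [0., 1., 0., 1.],
                     [0., 0., 1., 0.],
                     [0., 0., 0., 1.]])
    kf.H = np.array([[1., 0., 0., 0.],    # only the centre position is measured
                     [0., 1., 0., 0.]])

    for frame_idx in range(30):
        kf.predict()                      # always roll the state forward
        if frame_idx % 3 == 0:            # run the detector on every third frame only
            cx, cy = detect_centre(frame_idx)
            kf.update(np.array([cx, cy]))
        # kf.x[:2] holds the best position estimate, detected frame or not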
2. SORT

Simple Online and Realtime Tracking (SORT) is an implementation of the tracking-by-detection framework where the main objective is to detect objects in each frame and associate them for online and real-time tracking applications [36]. Methods such as the Kalman Filter and the Hungarian algorithm are used for tracking. The characteristic feature of SORT is that it only uses detection information from the previous and current frames, enabling it to competently perform online and real-time tracking. To further explain this, the object model is expressed in equation (2), where u, v, s, and r represent the horizontal pixel location, vertical pixel location, area and aspect ratio of the target object respectively. Anytime a detection is linked to a target object, the detected bounding box is used to update the target state, and the velocity components are solved optimally via a Kalman filter. This helps in identifying the target's identity in successive frames and facilitates tracking.

x = [u, v, s, r, u̇, v̇, ṡ]ᵀ        (2)
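A minimal sketch of this state model and of the Hungarian association step follows, assuming the filterpy and scipy libraries; the process and measurement noise are left at filterpy's defaults for brevity and would need tuning in practice.

    import numpy as np
    from filterpy.kalman import KalmanFilter
    from scipy.optimize import linear_sum_assignment

    def make_sort_filter(u, v, s, r):
        """Kalman filter over the 7-dim state of equation (2): [u, v, s, r, u', v', s']."""
        kf = KalmanFilter(dim_x=7, dim_z=4)
        kf.F = np.eye(7)                       # constant-velocity motion model
        kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0
        kf.H = np.zeros((4, 7))                # only [u, v, s, r] is observed
        kf.H[:4, :4] = np.eye(4)
        kf.x[:4, 0] = [u, v, s, r]
        return kf

    def associate(iou_matrix, iou_threshold=0.3):
        """Hungarian assignment of detections to tracks over a negated-IOU cost."""
        rows, cols = linear_sum_assignment(-iou_matrix)   # maximize total IOU
        return [(r, c) for r, c in zip(rows, cols)
                if iou_matrix[r, c] >= iou_threshold]     # drop weak matches

Each matched (track, detection) pair would then drive kf.update() with the detection's [u, v, s, r] measurement, while unmatched tracks simply retain their predicted state for that frame.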
3. Feature Based Object Tracker

In feature-based object tracking, appearance information is used to track objects in the respective traffic scenes. This method is useful for tracking vehicles in occluded settings. The system extracts object features from one frame and then matches that appearance information with successive frames based on a measure of similarity. Feature-based object tracking consists of both feature extraction and feature correspondence. The feature points are extracted from the objects in an image using various statistical approaches. Feature correspondence is considered an arduous task since a feature point in one image may have analogous points in other images.
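As an illustration of extraction plus correspondence (a generic sketch, not the specific tracker used in this study), the OpenCV snippet below extracts ORB keypoints in two frames and matches their binary descriptors by Hamming distance.

    import cv2

    def match_features(frame_a, frame_b, max_matches=50):
        """Extract ORB features in two grayscale frames and match descriptors."""
        orb = cv2.ORB_create()
        kp_a, des_a = orb.detectAndCompute(frame_a, None)   # feature extraction
        kp_b, des_b = orb.detectAndCompute(frame_b, None)
        # Brute-force Hamming matcher; cross-checking keeps only mutual best matches
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
        # Each match links kp_a[m.queryIdx] to kp_b[m.trainIdx]
        return kp_a, kp_b, matches[:max_matches]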
4. Deep SORT

Simple Online and Realtime Tracking with a Deep Association metric (Deep SORT) enables multiple object tracking by integrating appearance information into its tracking components [37]. A combination of the Kalman Filter and the Hungarian algorithm is used for tracking. Here, Kalman filtering is performed in image space while the Hungarian technique facilitates frame-by-frame data association using an association metric that computes bounding box overlap. To obtain motion and appearance information, a trained convolutional neural network (CNN) is applied.

By integrating a CNN, the tracker achieves greater robustness against object misses and occlusions while keeping its applicability to online and real-time scenarios intact. The CNN architecture of the system is shown in Table I. A wide residual network with two convolutional layers followed by six residual blocks is applied. In dense layer 10, a global feature map of dimensionality 128 is computed. Finally, batch and ℓ2 normalization project the features onto the unit hypersphere to ensure compatibility with the cosine appearance metric. Overall, Deep SORT is a highly versatile tracker and can match performance capabilities with other state-of-the-art tracking algorithms.

TABLE I.
OVERVIEW OF DEEP SORT'S CNN ARCHITECTURE

Name                          Patch Size/Stride    Output Size
Conv 1                        3 × 3/1              32 × 128 × 64
Conv 2                        3 × 3/1              32 × 128 × 64
Max Pool 3                    3 × 3/2              32 × 64 × 32
Residual 4                    3 × 3/1              32 × 64 × 32
Residual 5                    3 × 3/1              32 × 64 × 32
Residual 6                    3 × 3/2              64 × 32 × 16
Residual 7                    3 × 3/1              64 × 32 × 16
Residual 8                    3 × 3/2              128 × 16 × 8
Residual 9                    3 × 3/1              128 × 16 × 8
Dense 10                      —                    128
Batch and ℓ2 normalization    —                    128
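The role of the ℓ2 normalization can be seen in a small sketch: once embeddings lie on the unit hypersphere, the cosine distance used for appearance matching reduces to one minus a dot product. The random embeddings and the gate threshold below are hypothetical stand-ins for Deep SORT's CNN outputs.

    import numpy as np

    def l2_normalize(x, eps=1e-12):
        """Project an embedding onto the unit hypersphere."""
        return x / (np.linalg.norm(x) + eps)

    def cosine_distance(a, b):
        """For unit vectors, cosine distance is 1 - a.b."""
        return 1.0 - float(np.dot(l2_normalize(a), l2_normalize(b)))

    # Hypothetical 128-dim appearance embeddings for a track and a detection.
    emb_track = l2_normalize(np.random.rand(128))
    emb_det = l2_normalize(np.random.rand(128))
    if cosine_distance(emb_track, emb_det) < 0.2:   # illustrative gate threshold
        print("appearance match: associate detection with track")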
V. RESULTS

This section evaluates the performance of different combinations of object detectors and trackers. The main goal of this study is to identify the best performing object detector-tracker combination. For comparative analysis, the models are tested on a total of 546 video clips of length 1 minute each, comprising over 9 hours of total video. Figure 1 shows all the camera views with manually generated green and blue polygons that record the number of vehicles passing through them in the northbound and southbound directions respectively. The vehicle counts are evaluated based on four different categories: (i) overall count of all vehicles, (ii) total count of cars only, (iii) total count of trucks only, and (iv) overall vehicle counts for different times of the day (i.e. daylight, nighttime, rain). To establish ground truth, all the vehicles were manually counted from the existing 9 hours of video test data. The performance is assessed by expressing the automatic counts obtained from different model combinations over the ground truth value, as a percentage.
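In other words, the reported metric is a simple ratio; a minimal sketch with hypothetical counts is:

    def count_percentage(automatic_count, ground_truth_count):
        """Automatic counts over ground truth, as a percentage; values above
        100 indicate over-counting, values below 100 indicate under-counting."""
        return 100.0 * automatic_count / ground_truth_count

    print(count_percentage(automatic_count=418, ground_truth_count=400))  # 104.5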
Fig. 5. Heat Maps Generated for Different Object Detectors

Similarly, a detection is classified as a False Positive (FP) if the detector erroneously detects a vehicle at a spot with no vehicles present. As observed from the heat maps, the FP columns for the object detectors are generally clean, with a few camera views in Detectron2 and EfficientDet generating incorrect classifications. The camera views with flyovers or overpass roads caused the models to misclassify some of the detections. Sometimes, camera movements and conditions such as rain sticking on the camera lens and pitch darkness also cause such misclassifications. Ideally, we do not want to see intense heat maps for either false negatives or false positives. However, if we have higher false positives but obtain lower false negatives, then the model might have been too confident, which is not ideal. Finally, a True Positive (TP) is one where the detector correctly detects a vehicle when actual vehicles are present on the roadway. Most object detection models generated correct true positives except for a few camera views where the vehicles were either too distant or encountered low-light or nighttime conditions where only the vehicles' headlights were visible.
Fig. 6. Overall count percentage for all vehicles (northbound and southbound) for each model combination.
Figure 6 shows the overall count percentage for all vehicles. As seen from the figure, the overall count percentage for some of the model combinations exceeds 100 percent, while a couple of combinations obtain counting results below 45 percent of the actual counts. Any model combination that either over-counts or under-counts the actual number of vehicles is considered a faulty match, while the ones that produce counts in the order of 100 percent are termed an optimal match. The best performing model combinations that obtained a more accurate count estimate for all vehicles were YOLOv4 and Deep SORT, Detectron2 and Deep SORT, and CenterNet and Deep SORT. Thus, all these model combinations can be considered an optimal match.

Similarly, Figures 7-8 compare the performance of different model combinations for counting cars and trucks respectively. From Figure 7, it can be observed that CenterNet and IOU, CenterNet and SORT, Detectron2 and SORT, EfficientDet and Deep SORT, and YOLOv4 and Deep SORT obtained the best counting results. These detector-tracker combinations performed well in both the northbound and southbound directions. Occlusion issues created a hindrance in correctly locating cars, which would often be obstructed by larger vehicles whenever they were too close to the camera. Likewise, in Figure 8, the truck counting performance is assessed. Out of all the model combinations, only EfficientDet and Deep SORT obtained acceptable counting performance in both the northbound and southbound directions. Although the combinations of CenterNet and KIOU, and EfficientDet and SORT separately obtained accurate counting results, their scope was limited to only the northbound or southbound direction respectively. Most of the other model combinations did not accurately count trucks due to the presence of other heavy gross vehicles (HGV) such as buses, trailers, and multi-axle single units. These HGVs often confused the models and were assigned as trucks, which generated an over-estimate of truck counts. Exaggerating the actual number of vehicle counts (either trucks or cars) could be attributed to the fact that some of the detectors produced multiple bounding boxes for the same vehicle while it traversed the video scene. This impelled the tracker to confuse the same vehicle as different ones and assign it a new identity every time a bounding box re-appeared.

Likewise, Table II illustrates the performance comparison of the models in different weather conditions. The counting results show that the best performing model combinations were YOLOv4 and Deep SORT, Detectron2 and Deep SORT, and CenterNet and Deep SORT, analogous to the comparison chart shown in Figure 6. Vehicle counting accuracy largely depends on the precision of the object detection models. However, it is evident from Table II that the models did not achieve optimal results for the most part. The reasons could be partly attributed to inferior camera quality, unstable camera views due to wind blowing on the highways, and the presence of fog or mist on the camera lens.

During daylight, nighttime and rainy conditions, EfficientDet's combinations with SORT and KIOU failed miserably at counting the number of vehicles; EfficientDet mainly suffered in its detection capability. Model combinations that recorded a count percentage over 100 typically had both the detector and the tracker at fault. Object detectors generated multiple bounding boxes for the same vehicle, which resulted in over-counting of the number of vehicles. Also, some of the trackers did not perform ideally at predicting vehicle trajectories and assigned them as separate vehicles on certain occasions.
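One standard mitigation for such duplicate boxes, shown below as a generic sketch rather than a step the study reports using, is non-maximum suppression: among heavily overlapping boxes, only the highest-scoring one is kept before the detections reach the tracker.

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression; boxes is an (N, 4) array of (x1, y1, x2, y2)."""
        order = np.argsort(scores)[::-1]        # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            # IOU of the top box against all remaining candidates
            x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                      (boxes[order[1:], 3] - boxes[order[1:], 1]))
            iou = inter / (area_i + area_r - inter)
            order = order[1:][iou <= iou_threshold]  # drop duplicates of box i
        return keep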
Fig. 7. Car counts performance (northbound and southbound count percentages, with a 100% reference line) for each model combination.

Fig. 8. Truck counts performance (northbound and southbound count percentages) for each model combination.