Object Detection and Tracking Algorithms For Vehicle Counting: A Comparative Analysis
Vishal Mandal and Yaw Adu-Gyamfi

Vishal Mandal is with the Department of Civil and Environmental Engineering, University of Missouri-Columbia and with WSP USA, 211 N Broadway Suite # 2800, St. Louis, MO 63102 USA (e-mail: vmghv@mail.missouri.edu).
Yaw Adu-Gyamfi is with the Department of Civil and Environmental Engineering, E2509 Lafferre Hall, Columbia, MO 65211 USA (e-mail: adugyamfiy@missouri.edu).
based counting [22, 5, 6, 16], counting by density estimation [12] and deep learning based counting [25, 27, 18, 28, 1, 21, 10]. The first three counting methods are environmentally sensitive and generally do not perform well in occluded environments or in videos with low frame rates. While counting by density estimation follows a supervised approach, it performs poorly in videos that have a large perspective or contain oversized vehicles. Density estimation based methods are also limited in their scope of detection and lack object tracking capabilities. Finally, out of all these counting approaches, deep learning based counting techniques have seen the greatest developments in recent years. Advances in their underlying architectures have significantly improved vehicle counting performance. In this study, we mainly focus on counting methods that are founded on deep learning based architectures.

Awang et al. [3] proposed a deep learning based technique that tabulates the number of vehicles based on a layer-skipping strategy within a convolutional neural network (CNN). Prior to performing counts, their approach classifies vehicles into different classes based on their distinct features. Dai et al. [9] deployed a video based vehicle counting technique using a three-step strategy of object detection, tracking and trajectory analysis. Their method uses a trajectory counting algorithm that accurately computes the number of vehicles in their respective categories and tracks vehicle routes to obtain traffic flow information. Similarly, a deep neural network is trained to detect and count the number of cars in [17]. This approach integrates residual learning alongside inception-style layers to count cars in a single look. Lately, it has been demonstrated that single-look techniques have the potential to excel at both the speed and accuracy requirements [20] useful for object recognition and localization. This could also prove beneficial for processing image frames at much faster rates so as to produce accurate vehicle counts in real-time conditions. The authors in [15] treat counting as a computer vision problem and present an adaptive real-time vehicle counting algorithm that takes a robust detection and counting approach in an urban setting. Another line of work combines fully convolutional networks (FCN) with long short-term memory (LSTM) networks in a residual learning environment. This approach leverages the capabilities of FCN based pixel-wise estimation and the strengths of LSTM to learn difficult time-based vehicle dynamics. The counting accuracy is thus improved by putting the time-based correlation into perspective.

III. DATA DESCRIPTION

Traffic images and video feeds were the two kinds of dataset used in this study. These datasets were obtained from cameras located at 6 different roadways over a seven-day period. The cameras were installed across different roadways in New Orleans and are maintained by the Louisiana Department of Transportation and Development (La DOTD). To train and generate robust models, datasets pertaining to different weather conditions were collected. To incorporate that, video feeds were recorded at the start of every hour for one minute, following the same loop for the entire 24 hours in a day. This recording was continued for 1 week at all 6 roadways. The traffic images and videos cover daylight, nighttime and rain. To train all the models used in this study, altogether 11,101 images were manually annotated for different classes of vehicles, viz. cars and trucks. Figure 1 shows all 6 cameras maintained by La DOTD and their respective camera views. Similarly, any vehicle that travelled across the green and blue polygons was counted and appended to the northbound and southbound totals respectively.
EfficientDet is a state-of-the-art object detection algorithm that follows the single-stage detector pattern [34]. The architecture of EfficientDet is shown in Figure 4. Here, ImageNet-pretrained EfficientNets are deployed as the network's backbone. Similarly, in order to obtain easier and quicker multi-scale fusion of features, a weighted bi-directional feature pyramid network (BiFPN) is used. BiFPN serves as the feature network: it receives level 3-7 features from the backbone network and repeatedly performs top-down and bottom-up bidirectional fusion of features. These synthesized features are passed to a class and box network to produce vehicle class and bounding box predictions respectively. Also, the class and box network weights are shared across all levels of features.
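To make the fusion step concrete, the short sketch below implements the fast normalized fusion rule described in the EfficientDet paper [34], in which every input to a BiFPN node receives a learnable non-negative weight. The shapes and weight values here are illustrative assumptions, not values from this study.

    import numpy as np

    def fast_normalized_fusion(features, weights, eps=1e-4):
        """Fuse feature maps: O = sum_i (w_i / (eps + sum_j w_j)) * I_i.

        features: equally-shaped arrays (already resized to one scale)
        weights:  learnable scalars, kept non-negative via ReLU
        """
        w = np.maximum(weights, 0.0)          # ReLU keeps weights >= 0
        norm = eps + w.sum()                  # normalizing term
        return sum(wi / norm * fi for wi, fi in zip(w, features))

    # Illustrative use: fuse two feature maps at one BiFPN node.
    p_td = np.random.rand(64, 64, 3)   # top-down pathway input (illustrative)
    p_in = np.random.rand(64, 64, 3)   # same-level backbone input (illustrative)
    fused = fast_normalized_fusion([p_td, p_in], np.array([0.7, 0.3]))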
1:  function Tracker(detections, σl, σh, σiou, mintsize)   ⊲ detections: per-frame dict(class, score, box)
2:      let σl ← low detection threshold
3:      let σh ← high detection threshold
4:      let σiou ← IOU threshold
5:      let mintsize ← minimum track size in frames
6:      let Ta ← []   ⊲ active tracks
7:      let Tf ← []   ⊲ finished tracks
8:      for frame, dets in detections do
9:          dets ← filter for dets with score ≥ σl
10:         let Tu ← []   ⊲ updated tracks
11:         for ti in Ta do
12:             if not empty(dets) then
13:                 biou, bbox ← find max iou box(tail box(ti), dets)
14:                 if biou ≥ σiou then
15:                     append new detection(ti, bbox)
16:                     set max score(ti, box score(bbox))
17:                     set class(ti, box class(bbox))
18:                     Tu ← append(Tu, ti)
19:                     remove(dets, bbox)   ⊲ remove box from dets
20:             if empty(Tu) or ti is not last(Tu) then   ⊲ ti found no match this frame
21:                 if get max score(ti) ≥ σh and size(ti) ≥ mintsize then
22:                     Tf ← append(Tf, ti)
23:         Tn ← new tracks from dets
24:         Ta ← Tu + Tn
25:     return Tf
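As a concrete reference, the following is a minimal Python sketch of the tracker logic above. The per-frame detection format (a list of dicts with class, score and box keys) and the threshold values are illustrative assumptions, not the exact implementation used in this study.

    def iou(a, b):
        """IOU of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    def track_iou(detections, sigma_l=0.3, sigma_h=0.5, sigma_iou=0.5, t_min=3):
        """detections: list over frames; each frame is a list of
        {'class': str, 'score': float, 'box': (x1, y1, x2, y2)} dicts."""
        tracks_active, tracks_finished = [], []
        for dets in detections:
            dets = [d for d in dets if d['score'] >= sigma_l]
            updated = []
            for track in tracks_active:
                if dets:
                    # match the detection with highest IOU to the track's last box
                    best = max(dets, key=lambda d: iou(track['boxes'][-1], d['box']))
                    if iou(track['boxes'][-1], best['box']) >= sigma_iou:
                        track['boxes'].append(best['box'])
                        track['max_score'] = max(track['max_score'], best['score'])
                        track['class'] = best['class']
                        updated.append(track)
                        dets.remove(best)
                        continue
                # track was not extended this frame: finish it if good enough
                if track['max_score'] >= sigma_h and len(track['boxes']) >= t_min:
                    tracks_finished.append(track)
            # remaining detections start new tracks
            new = [{'boxes': [d['box']], 'max_score': d['score'], 'class': d['class']}
                   for d in dets]
            tracks_active = updated + new
        return tracks_finished

As the pseudocode makes explicit, a track survives only while it finds a detection with sufficient overlap in every frame, which is why missed detections and low frame rates fragment tracks under the plain IOU tracker.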
Similarly, Kalman-IOU (KIOU) tracking has been further explored. The Kalman filter's ability to perform predictions allows users to skip frames while still keeping track of the object. Skipping frames speeds up the overall tracking-by-detection process, since running the detector on fewer frames lowers the computational requirement. Using an appropriate object detector with the Kalman-IOU tracker and configuring it to skip two-thirds of the frames each second could enable the tracker to run in real-time. Likewise, this feature could also improve the performance of the Kalman-IOU tracker compared to the original IOU tracker.
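The sketch below illustrates the frame-skipping idea with a constant-velocity Kalman filter from the filterpy library; the skip pattern (detect on every third frame), the state layout and the stand-in detector are illustrative assumptions.

    import numpy as np
    from filterpy.kalman import KalmanFilter

    def detect_centre(i):
        """Stand-in for a real detector: a target drifting right at 2 px/frame."""
        return 100.0 + 2.0 * i, 50.0

    # Constant-velocity filter over a box centre (cx, cy): state = [cx, cy, vx, vy].
    kf = KalmanFilter(dim_x=4, dim_z=2)
    kf.F = np.array([[1., 0., 1., 0.],    # position advances by velocity each frame
                     [0., 1., 0., 1.],
                     [0., 0., 1., 0.],
                     [0., 0., 0., 1.]])
    kf.H = np.array([[1., 0., 0., 0.],    # only the centre position is measured
                     [0., 1., 0., 0.]])

    for frame_idx in range(30):
        kf.predict()                      # always roll the state forward
        if frame_idx % 3 == 0:            # run the detector on every third frame only
            cx, cy = detect_centre(frame_idx)
            kf.update(np.array([cx, cy]))
        # kf.x[:2] holds the best position estimate, detected frame or not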
2. SORT

Simple Online and Realtime Tracking (SORT) is an implementation of the tracking-by-detection framework where the main objective is to detect objects in each frame and associate them for online and real-time tracking applications [36]. Methods such as the Kalman Filter and the Hungarian algorithm are used for tracking. The characteristic feature of SORT is that it only uses detection information from the previous and current frames, enabling it to competently perform online and real-time tracking. To further explain this, the object model is expressed in equation (2), where u, v, s, and r represent the horizontal pixel location, vertical pixel location, area and aspect ratio of the target object respectively. Anytime a detection is linked to a target object, the detected bounding box is used to update the target state, and the velocity components are solved optimally via a Kalman filter. This helps in identifying the target's identity in successive frames and facilitates tracking.

x = [u, v, s, r, u̇, v̇, ṡ]ᵀ        (2)
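A minimal sketch of this state model and of the Hungarian association step follows, assuming the filterpy and scipy libraries; the process and measurement noise are left at filterpy's defaults for brevity and would need tuning in practice.

    import numpy as np
    from filterpy.kalman import KalmanFilter
    from scipy.optimize import linear_sum_assignment

    def make_sort_filter(u, v, s, r):
        """Kalman filter over the 7-dim state of equation (2): [u, v, s, r, u', v', s']."""
        kf = KalmanFilter(dim_x=7, dim_z=4)
        kf.F = np.eye(7)                       # constant-velocity motion model
        kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0
        kf.H = np.zeros((4, 7))                # only [u, v, s, r] is observed
        kf.H[:4, :4] = np.eye(4)
        kf.x[:4, 0] = [u, v, s, r]
        return kf

    def associate(iou_matrix, iou_threshold=0.3):
        """Hungarian assignment of detections to tracks over a negated-IOU cost."""
        rows, cols = linear_sum_assignment(-iou_matrix)   # maximize total IOU
        return [(r, c) for r, c in zip(rows, cols)
                if iou_matrix[r, c] >= iou_threshold]     # drop weak matches

Each matched (track, detection) pair would then drive kf.update() with the detection's [u, v, s, r] measurement, while unmatched tracks simply retain their predicted state for that frame.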
3. Feature Based Object Tracker

In feature-based object tracking, appearance information is used to track objects in the respective traffic scenes. This method is useful for tracking vehicles in occluded settings. The system extracts object features from one frame and then matches that appearance information with successive frames based on a measure of similarity. Feature-based object tracking consists of both feature extraction and feature correspondence. The feature points are extracted from the objects in an image using various statistical approaches. Feature correspondence is considered an arduous task since a feature point in one image may have analogous points in other images.
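As an illustration of extraction plus correspondence (a generic sketch, not the specific tracker used in this study), the OpenCV snippet below extracts ORB keypoints in two frames and matches their binary descriptors by Hamming distance.

    import cv2

    def match_features(frame_a, frame_b, max_matches=50):
        """Extract ORB features in two grayscale frames and match descriptors."""
        orb = cv2.ORB_create()
        kp_a, des_a = orb.detectAndCompute(frame_a, None)   # feature extraction
        kp_b, des_b = orb.detectAndCompute(frame_b, None)
        # Brute-force Hamming matcher; cross-checking keeps only mutual best matches
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
        # Each match links kp_a[m.queryIdx] to kp_b[m.trainIdx]
        return kp_a, kp_b, matches[:max_matches]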
4. Deep SORT

Simple Online and Realtime Tracking with a Deep Association metric (Deep SORT) enables multiple object tracking by integrating appearance information into its tracking components [37]. A combination of the Kalman Filter and the Hungarian algorithm is used for tracking. Here, Kalman filtering is performed in image space while the Hungarian technique facilitates frame-by-frame data association using an association metric that computes bounding box overlap. To obtain motion and appearance information, a trained convolutional neural network (CNN) is applied.

By integrating a CNN, the tracker achieves greater robustness against object misses and occlusions while keeping its applicability to online and real-time scenarios intact. The CNN architecture of the system is shown in Table I. A wide residual network with two convolutional layers followed by six residual blocks is applied. In dense layer 10, a global feature map of dimensionality 128 is computed. Finally, batch and ℓ2 normalization project the features onto the unit hypersphere to ensure compatibility with the cosine appearance metric. Overall, Deep SORT is a highly versatile tracker and can match performance capabilities with other state-of-the-art tracking algorithms.

TABLE I.
OVERVIEW OF DEEP SORT'S CNN ARCHITECTURE

Name                          Patch Size/Stride    Output Size
Conv 1                        3 × 3/1              32 × 128 × 64
Conv 2                        3 × 3/1              32 × 128 × 64
Max Pool 3                    3 × 3/2              32 × 64 × 32
Residual 4                    3 × 3/1              32 × 64 × 32
Residual 5                    3 × 3/1              32 × 64 × 32
Residual 6                    3 × 3/2              64 × 32 × 16
Residual 7                    3 × 3/1              64 × 32 × 16
Residual 8                    3 × 3/2              128 × 16 × 8
Residual 9                    3 × 3/1              128 × 16 × 8
Dense 10                      —                    128
Batch and ℓ2 normalization    —                    128
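The role of the ℓ2 normalization can be seen in a small sketch: once embeddings lie on the unit hypersphere, the cosine distance used for appearance matching reduces to one minus a dot product. The random embeddings and the gate threshold below are hypothetical stand-ins for Deep SORT's CNN outputs.

    import numpy as np

    def l2_normalize(x, eps=1e-12):
        """Project an embedding onto the unit hypersphere."""
        return x / (np.linalg.norm(x) + eps)

    def cosine_distance(a, b):
        """For unit vectors, cosine distance is 1 - a.b."""
        return 1.0 - float(np.dot(l2_normalize(a), l2_normalize(b)))

    # Hypothetical 128-dim appearance embeddings for a track and a detection.
    emb_track = l2_normalize(np.random.rand(128))
    emb_det = l2_normalize(np.random.rand(128))
    if cosine_distance(emb_track, emb_det) < 0.2:   # illustrative gate threshold
        print("appearance match: associate detection with track")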
V. RESULTS

This section evaluates the performance of different combinations of object detectors and trackers. The main goal of this study is to identify the best performing object detector-tracker combination. For comparative analysis, the models are tested on a total of 546 video clips of length 1 minute each, comprising over 9 hours of total video. Figure 1 shows all the camera views with manually generated green and blue polygons that record the number of vehicles passing through them in the northbound and southbound directions respectively. The vehicle counts are evaluated based on four different categories: (i) overall count of all vehicles, (ii) total count of cars only, (iii) total count of trucks only, and (iv) overall vehicle counts for different times of the day (i.e. daylight, nighttime, rain). To establish ground truth, all the vehicles were manually counted from the existing 9 hours of video test data. The performance is assessed by expressing the automatic counts obtained from different model combinations over the ground truth value, as a percentage.
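In other words, the reported metric is a simple ratio; a minimal sketch with hypothetical counts is:

    def count_percentage(automatic_count, ground_truth_count):
        """Automatic counts over ground truth, as a percentage; values above
        100 indicate over-counting, values below 100 indicate under-counting."""
        return 100.0 * automatic_count / ground_truth_count

    print(count_percentage(automatic_count=418, ground_truth_count=400))  # 104.5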
Fig. 5. Heat Maps Generated for Different Object Detectors

Similarly, a detection is classified as a False Positive (FP) if the detector erroneously detects a vehicle at a spot with no vehicles present. As observed from the heat maps, the FP columns for the object detectors are generally clean, with a few camera views in Detectron2 and EfficientDet generating incorrect classifications. The camera views with flyovers or overpass roads caused the models to misclassify some of the detections. Sometimes, camera movements and conditions such as rain sticking on the camera lens and pitch darkness also cause such misclassifications. Ideally, we do not want to see intense heat maps for either false negatives or false positives. However, if we have higher false positives but obtain lower false negatives, then the model might have been too confident, which is not ideal. Finally, a True Positive (TP) is one where the detector correctly detects a vehicle when actual vehicles are present on the roadway. Most object detection models generated correct true positives except for a few camera views where the vehicles were either too distant or encountered low-light or nighttime conditions where only the vehicles' headlights were visible.
Fig. 6. Overall count percentage for all vehicles (northbound and southbound) for each model combination.
Figure 6 shows the overall count percentage for all vehicles. As seen from the figure, the overall count percentage for some of the model combinations exceeds 100 percent, while a couple of combinations obtain counting results below 45 percent of the actual counts. Any model combination that either over-counts or under-counts the actual number of vehicles is considered a faulty match, while the ones that produce counts in the order of 100 percent are termed an optimal match. The best performing model combinations that obtained a more accurate count estimate for all vehicles were YOLOv4 and Deep SORT, Detectron2 and Deep SORT, and CenterNet and Deep SORT. Thus, all these model combinations can be considered an optimal match.

Similarly, Figures 7-8 compare the performance of different model combinations for counting cars and trucks respectively. From Figure 7, it can be observed that CenterNet and IOU, CenterNet and SORT, Detectron2 and SORT, EfficientDet and Deep SORT, and YOLOv4 and Deep SORT obtained the best counting results. These detector-tracker combinations performed well in both the northbound and southbound directions. Occlusion issues created a hindrance in correctly locating cars, which would often be obstructed by larger vehicles whenever they were too close to the camera. Likewise, in Figure 8, the truck counting performance is assessed. Out of all the model combinations, only EfficientDet and Deep SORT obtained acceptable counting performance in both the northbound and southbound directions. Although the combinations of CenterNet and KIOU, and EfficientDet and SORT separately obtained accurate counting results, their scope was limited to only the northbound or southbound direction respectively. Most of the other model combinations did not accurately count trucks due to the presence of other heavy gross vehicles (HGV) such as buses, trailers, and multi-axle single units. These HGVs often confused the models and were assigned as trucks, which generated an over-estimate of truck counts. Exaggerating the actual number of vehicle counts (either trucks or cars) could be attributed to the fact that some of the detectors produced multiple bounding boxes for the same vehicle while it traversed the video scene. This impelled the tracker to confuse the same vehicle as different ones and assign it a new identity every time a bounding box re-appeared.

Likewise, Table II illustrates the performance comparison of the models in different weather conditions. The counting results show that the best performing model combinations were YOLOv4 and Deep SORT, Detectron2 and Deep SORT, and CenterNet and Deep SORT, analogous to the comparison chart shown in Figure 6. Vehicle counting accuracy largely depends on the precision of the object detection models. However, it is evident from Table II that the models did not achieve optimal results for the most part. The reasons could be partly attributed to inferior camera quality, unstable camera views due to wind blowing on the highways, and the presence of fog or mist on the camera lens.

During daylight, nighttime and rainy conditions, EfficientDet's combinations with SORT and KIOU failed miserably at counting the number of vehicles; EfficientDet mainly suffered in its detection capability. Model combinations that recorded a count percentage over 100 typically had both the detector and the tracker at fault. Object detectors generated multiple bounding boxes for the same vehicle, which resulted in over-counting of the number of vehicles. Also, some of the trackers did not perform ideally at predicting vehicle trajectories and assigned them as separate vehicles on certain occasions.
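One standard mitigation for such duplicate boxes, shown below as a generic sketch rather than a step the study reports using, is non-maximum suppression: among heavily overlapping boxes, only the highest-scoring one is kept before the detections reach the tracker.

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression; boxes is an (N, 4) array of (x1, y1, x2, y2)."""
        order = np.argsort(scores)[::-1]        # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            # IOU of the top box against all remaining candidates
            x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                      (boxes[order[1:], 3] - boxes[order[1:], 1]))
            iou = inter / (area_i + area_r - inter)
            order = order[1:][iou <= iou_threshold]  # drop duplicates of box i
        return keep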
Fig. 7. Car counts performance (northbound and southbound count percentages, with a 100% reference line) for each model combination.

Fig. 8. Truck counts performance (northbound and southbound count percentages) for each model combination.