Visual-LiDAR Based 3D Object Detection and Tracking For Embedded Systems
ABSTRACT In recent years, persistent news coverage of autonomous vehicles and the claims of companies entering the space reinforce the notion that level 5 vehicular autonomy is just around the corner. However, the main hindrance to asserting full autonomy still boils down to environmental perception, which drives the autonomous decisions. An efficient perceptual system requires redundancy in sensor modalities capable of performing in varying environmental conditions and providing reliable information using limited computational resources. This work addresses the task of 3D object detection and tracking in the vehicle's environment, using a camera and a 3D LiDAR as the primary sensors. The proposed framework is designed to operate on an embedded system that visually classifies objects using a lightweight neural network, while tracking is performed in 3D space using LiDAR information. The main contributions of this work are 3D LiDAR point cloud classification using a visual object detector, and an IMM-UKF-JPDAF based object tracker that jointly performs 3D object detection and tracking. The performance evaluation is carried out using MOT16 metrics and the ground truths provided by the KITTI datasets. Furthermore, the proposed tracker is evaluated and compared with state-of-the-art approaches. The experiments suggest that the proposed framework offers a suitable solution for embedded systems to solve the 3D object detection and tracking problem, with added benefits.
INDEX TERMS Kalman filter, object detection, object tracking, point cloud classification, sensor fusion.
to the shortcomings of the detector. The key challenges in developing a framework for autonomous vehicles to perform 3D object detection and tracking include real-time performance, limited computational demand, applicability in a variety of weather and lighting conditions, and ease of adapting to changes in the number and positioning of sensors.

Autonomous vehicles are generally equipped with numerous sensors for environmental perception, such as ultrasonic, radar, LiDAR (light detection and ranging), and cameras. Among these sensors, many modern approaches use a camera, LiDAR, or a fusion of both for 3D object detection tasks. Although LiDAR and camera can perform object detection independently, each sensor possesses some limitations. LiDAR-based approaches are susceptible to harsh weather conditions and low resolution [4]–[6], whereas camera-based methods are primarily challenged by inadequate light and depth information [7]. Therefore, both sensors require a joint operation to complement the individual limitations and to enable applicability in a wider range of environmental conditions [8]–[11].

Visual-LiDAR based 3D object detection methods adopt either early, late, or deep fusion schemes [12]. In the early fusion scheme, the modalities are combined at the beginning of the process [13], with an interdependent representation of the data. The late fusion scheme processes the modalities independently up to the last stage, where the information is fused [10], [11]. Deep fusion schemes tend to mix the modalities hierarchically in neural network layers, allowing the features from different modalities to interact over the layers [8], [9], [14]. In order to exploit the redundant information of the modalities while compensating for the individual sensor limitations, late fusion schemes are the most appropriate selection.

The tasks of 3D object detection and 3D object tracking are traditionally approached independently. Recent works that jointly perform the 3D object detection and tracking tasks are proposed in [15]–[18], reflected on performance leaderboards across various benchmarking platforms. The approaches adopted in [15], [16] utilize a visual-LiDAR setup for detection, unlike the monocular setup for detection in [17] and 3D LiDAR as the only detection sensor in [18]. These methods, being learning based, require a laborious and expensive annotation process to prepare training datasets. Furthermore, a change in the number, type, or positioning of the sensors requires retraining of the networks. Moreover, the inference time and computational needs limit their use in real-time applications.

Currently, more and more trackers are being proposed that perform tracking in 3D space [19]–[22]. However, these trackers require reasonably accurate 3D detections, thereby adding computational demand on the system. The tracker proposed in [19] utilizes 3D IoU thresholding for data association under a 3D Kalman filter paradigm. The approach proposed in [20] does perform in real time, but at the cost of GPU utilization. Furthermore, a multi-modality approach proposed in [22] focuses on the fusion of detected object information, which remains infeasible from an application perspective as multiple networks run for detection. The authors of [21] distinctly address the problem of tracking by seeking diversity using determinantal point processes to forecast the trajectories of objects. The main drawbacks of the existing schemes are the network parameters that require training, the computational needs, the inapplicability on embedded systems, and the reliance on 3D object detector performance.

In this work, a comprehensive framework for joint 3D object detection and tracking on an embedded system is proposed. The framework makes use of a visual-LiDAR setup to exploit the information redundancy for real-time reliable results. The 3D LiDAR point cloud is represented in a cylindrical grid and possible object candidates are filtered. The candidates are tracked, and the information of position, pose, dimensions, and class vector is maintained. In parallel, a neural network is employed for visual classification of objects for proposal generation that temporally updates the class vector of the tracked candidates. The framework is an extension of previous work [23], where only LiDAR was considered as the perceptual sensor but lacked proper classification of objects, resulting in a large number of false positives.

The advantages of the proposed approach are manifold, as challenges pertaining to occlusions and missed visual detections are temporally addressed. Furthermore, even in poor lighting conditions the LiDAR detector continues to operate. Moreover, since no training is involved in the direct classification of point clouds, the approach can seamlessly integrate into a variety of sensor arrangements. In addition, the tracker can temporally provide dynamic attributes of the detected objects that can be directly used for autonomous decisions. The proposed framework is implemented on an embedded system, and the performance evaluation is carried out using well-established metrics for object detection and tracking on the ground truths provided by the KITTI datasets [24]. The main contributions of this work include a novel 3D object detector, an efficient point cloud processing for object candidate estimation, and a clutter-aware probabilistic tracking algorithm for an embedded system.

A perception module with efficient 3D object detection and tracking capabilities directly impacts the quality of spatial localization, mapping, and motion planning. The localization can benefit from the static part of the environment in conjunction with the redundancy of the sensory setup, especially in a dense and dynamic urban environment. Similarly, regions of the environment pertaining to the dynamic objects can be ignored for mapping. In addition, motion patterns of the dynamic objects in the vicinity can be utilized for safer motion planning [25].

The remainder of the paper is organized as follows: in Section II, the proposed architecture is described in terms of hardware/software and information flow. The working of the framework is explained in Section III, followed by the specifications of the implementation platform in Section IV. A detailed description of the proposed framework at the modular level is given in Section V. The evaluation criteria are defined in Section VI, including results and comparisons with the state-of-the-art. In Section VII the framework implementation on the platform is demonstrated. The added features of the proposed framework are discussed in Section VIII, followed by conclusions in Section IX.

is addressed by a tracker that maintains a unique ID for a detected object and predicts the motion patterns of the detected objects. Thus, accurate 3D object detection without tracking provides no information regarding object motion.
V. 3D MODT
The proposed 3D MODT framework comprises two threads, pertaining to the processing of the 3D LiDAR and the camera, respectively. The thread for processing the 3D LiDAR point cloud is composed of the sub-modules ground segmentation, clustering, box fitting, and tracking, whereas the thread responsible for treating the images from the camera is composed of YOLO v3 [42] implemented as a ROS package to provide object class information. The operation and structure of each sub-module is explained in the subsequent subsections.
A. GROUND CLASSIFICATION
Ground classification is an essential pre-processing task in which the LiDAR point cloud is partitioned into ground and non-ground measurements. The portion of the point cloud classified as ground can be further processed for road markings, curb detection, traversable area, and path planning tasks, whereas the part of the point cloud corresponding to non-ground LiDAR measurements is effectively used for the tasks pertaining to 3D object detection. Several approaches for ground classification exist in the literature that largely vary in terms of sensor setups and assumptions made about the environment. The prominent strategies for ground classification utilize scan-rings [26], voxels [27], height thresholds [28], or feature learning [29]. The scan-ring based approaches are generally applicable to single-LiDAR setups, in which the distance between consecutive scan lines is studied for ground classification. Voxelization of the point cloud into 2D or 3D space is also a common practice to scale down the number of measurements for estimation. Similarly, with the assumption of a planar ground environment, setting a height threshold is enough for ground classification. On the other hand, some approaches utilize neural networks to address the classification of the sparse LiDAR point cloud. In this work, the possibility of non-planar ground is considered, and the point cloud is assumed to be a merger of multiple calibrated LiDARs that are arbitrarily positioned. This assumption rules out the approaches that rely on height thresholds and scan-rings.

FIGURE 4. 2D polar grid for ground classification.

The approach adopted in this work involves indexing the point cloud into a 2D array that processes the classification task efficiently. Each cell of the array contains the indexes of the point cloud measurements that belong to a section of the cylinder vertically sliced into channels and bins, as shown in Fig. 4. Each channel is traversed independently, directed outward from the vehicle, to estimate the ground level in each cell of the grid. The sensor height from the ground is considered as the initial ground level, and the slope to the lowest measurement of the consecutive cell is computed. A slope exceeding a threshold relates to a cell containing non-ground measurements, and the previous ground level is maintained, whereas a slope within the threshold limit updates the ground level for the subsequent cells. With all cells of the grid assigned a ground level, the point cloud is segregated with a tolerance parameter to remove the edge noise.
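To make the channel-wise traversal concrete, the following is a minimal sketch of the slope test described above. It assumes a polar grid already populated with the lowest point height per (channel, bin) cell; the sensor height, slope threshold, and tolerance values are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np

def estimate_ground_levels(lowest_z, bin_width, sensor_height=1.73,
                           slope_thresh=np.tan(np.radians(10.0))):
    """Slope-based ground-level estimation over a polar (channel x bin) grid.

    lowest_z: 2D array [channels, bins] with the lowest point height per cell
              (np.nan for empty cells).
    Returns a 2D array of estimated ground heights per cell.
    """
    channels, bins = lowest_z.shape
    ground = np.full((channels, bins), -sensor_height, dtype=float)
    for c in range(channels):                 # each channel is traversed independently
        prev_ground = -sensor_height          # initial ground level below the sensor
        for b in range(bins):                 # move outward from the vehicle
            z = lowest_z[c, b]
            if not np.isnan(z):
                slope = abs(z - prev_ground) / bin_width
                if slope <= slope_thresh:     # gentle slope: update the local ground level
                    prev_ground = z
                # steep slope: cell holds non-ground points, keep the previous level
            ground[c, b] = prev_ground
    return ground

def label_ground_points(points, cell_idx, ground, tolerance=0.15):
    """points: [N, 3] (x, y, z); cell_idx: [N, 2] (channel, bin) index per point."""
    g = ground[cell_idx[:, 0], cell_idx[:, 1]]
    return points[:, 2] <= g + tolerance      # boolean mask: True = ground point
```

A point is then labeled ground if it lies within the tolerance of the ground level of its cell, which corresponds to the segregation step with edge-noise removal mentioned above.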
The proposed ground classification module, in comparison to the module developed in prior work [23], is optimized further to reduce the process time to about one half. The approach in the former work for ground classification was similar; however, the point cloud was traversed multiple times for the data representation into the cylindrical grid, the labeling of grid cells, and the labeling of LiDAR points, respectively. The ground classification module evaluated on the similar datasets processed the data in an average time of 7.8 ms, which prior to optimization consumed 15.7 ms. Similarly, the process time on the embedded system reduced to an average of 39.1 ms from 64.9 ms. The execution times are further expressed in the evaluation section.
The key modifications for optimization are:
• The lowest and highest point of each cell is found when the indexes of the point cloud are being distributed into the grid.
• Instead of traversing each channel twice, the estimation of the slope and the local ground level along the bins is carried out in a single traversal.
• The iterators for the point cloud indexes are efficiently utilized to form the point clouds for ground and non-ground points, skipping the step of labeling the bins.
In addition to the processing speed of the computational platform and the optimized programming approach, the parameters for the data representation contribute to the overall processing time. These include the range of measurements from the sensor R_range, the number of LiDAR measurements in a time step, the LiDAR field of view to be considered FOV, and the resolution of the grid-based representation, expressed in terms of the grid cell area,

GA_cell = (FOV · π) / (channels · 180) · (R_i² − R_{i−1}²),   (1)
R_i = (R_range / Bins) · i,  where i = 0, 1, 2, . . . , Bins.   (2)

The number of bins and channels determines the area and the number of grid cells that need to be traversed for the slope test and local ground estimation. A higher resolution demands additional processing, whereas a lower resolution provides a compact representation. Once the ground is classified, the point cloud pertaining to the non-ground LiDAR measurements is presented to the clustering module.
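As a quick numeric illustration of (1) and (2), the snippet below computes the ring boundaries and per-ring cell areas for one assumed configuration (360° FOV, 80 m range, 36 channels, 40 bins); these values are only an example, not the parameters used in the paper.

```python
import math

FOV, channels, bins, r_range = 360.0, 36, 40, 80.0      # assumed example values

R = [r_range / bins * i for i in range(bins + 1)]        # ring boundaries, Eq. (2)
cell_area = [FOV * math.pi / (channels * 180.0) * (R[i] ** 2 - R[i - 1] ** 2)
             for i in range(1, bins + 1)]                # Eq. (1), per ring

print(f"innermost cell: {cell_area[0]:.2f} m^2, outermost cell: {cell_area[-1]:.2f} m^2")
# The outward growth of the cell area is what lets the polar grid absorb the
# radially decreasing density of LiDAR returns.
```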
work used a rectangular grid-based representation of point
B. CLUSTERING cloud, requiring relatively higher resolution. Furthermore,
The concept of clustering is to group the entities based on to cluster the occupied grid cells, all 26 neighbors of each
some similarity [30]. Clustering a LiDAR point cloud such cell were traversed for occupancy check. The average process
that each cluster corresponds to a unique object is a chal- time of clustering together with the box fitting task on similar
lenging task, due to sparsity and lack of textured information. datasets is reduced to 3.66ms from 14.3ms, and on embedded
The clustering approaches generally utilize connectivity [28], system the process time is reduced to 17.3ms from 31.18ms.
centroid [30], density [31], distribution [32], or learned The execution times at modular level are further explained
features of the LiDAR measurements. Connectivity or in the evaluation section. The key modifications in clustering
hierarchy-based approaches rely on proximity of neighboring module are listed below,
measurements and expand iteratively. • Rectangular grid is replaced with a 3D cylindrical grid,
The centroid-based approaches require prior knowledge to exploit the point cloud of single LiDAR instead of
of the number of clusters to divide data into, such as merged point cloud of three LiDARs.
K-means, Gaussian Mixture Models, and Fuzzy c-mean. • Instead of searching 26 neighbors of grid cell for clus-
While, density-based approaches identify high density tering, 6 immediate neighbors are traversed.
regions for clustering, but density of LiDAR measurements In the clustering process, unlike 2D representation of LiDAR
radially decrease as a function of distance from the sensor. data for ground classification, 3D or volumetric representa-
Moreover, occluded measurements further affect the densi- tion is adopted. In addition to the bins and channels, the ver-
ties, thus density-based clustering approaches are not effec- tical range Vrange of LiDAR cloud is divided into layers. The
tively applicable in 3D object detection paradigm. number of LiDAR measurements however are reduced to
The distribution-based clustering methods utilize distri- only non-ground measurements within FOV and range Rrange .
bution models to fit potential clusters of objects, providing The volume of grid cell is then represented as,
more information compared to density-based methods, but
FOV · π Vrange 2
at the cost of complexity. However, absence of distribution GVcell = · Ri − R2i−1 , (3)
model and measurements under partial occlusion suffers in channels · 180 levels
proper clustering. Similarly, learning based approaches trains The clustering method adopted in this work requires that
a neural network for a set of optimization functions/criteria the adjacent cells of the grid pertaining to unique objects
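The following is a minimal sketch of the 6-neighbor connected-component labeling over an occupancy grid, as described above. The cylindrical indexing, the minimum-cell threshold, and the dimension filters are simplified, and the parameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from collections import deque

def cluster_cells(occupied, min_cells=4):
    """Label connected components in a 3D occupancy grid (channels x bins x levels).

    occupied: 3D boolean array marking cells that contain non-ground points.
    Returns an int array of the same shape: 0 = unlabeled, 1..K = cluster id.
    """
    labels = np.zeros(occupied.shape, dtype=np.int32)
    visited = np.zeros(occupied.shape, dtype=bool)
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    next_id = 0
    for seed in zip(*np.nonzero(occupied)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, members = deque([seed]), [seed]
        while queue:
            c, b, l = queue.popleft()
            for dc, db, dl in neighbors:          # only the 6 face neighbors are examined
                n = (c + dc, b + db, l + dl)
                if all(0 <= n[i] < occupied.shape[i] for i in range(3)) \
                        and occupied[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    members.append(n)
        if len(members) >= min_cells:             # drop tiny clusters as noise
            next_id += 1
            for m in members:
                labels[m] = next_id
    return labels
```

Per-point labels then follow from the cell index of each point, and clusters whose extent is too large (buildings) or whose lowest cell is elevated above the ground would be filtered before box fitting, as described above.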
C. BOX FITTING
Box fitting of the clustered LiDAR point cloud data is an essential and challenging task, as the measurements are always occluded because of obstructions in the sensor line-of-sight. An efficient box fitting technique estimates the correct object pose and centroid from the partial measurements. Several approaches tend to address this problem by either model-based [34] or feature-based methods [35]. The model-based methods match the raw point cloud with a known geometric model, whereas feature-based approaches exploit the edge features to estimate the pose. Lack of generality and excessive computational requirements bar the use of model-based approaches in MODT applications. Similarly, the selection of features that best describe the object pose is a difficult task. Currently, neural networks are also trained for the feature selection process, as utilized in [10]. However, a change in the sensor setup often requires labeling of datasets and retraining of the networks.

In this work, considering the computational constraints, a feature-based method is utilized that performs L-shape point cloud fitting within a minimum rectangle area. Initially, the indexes of the points with coordinates that define the minimum fitting box are traversed to identify the corners of the clustered point cloud on the horizontal axis. The farthest corners, based on the dimensions and location of the object cluster, are used to formulate a line, and all points of the cluster are traversed to find the farthest point from the line as a third corner. Using the three corners, the dimensions of the bounding box and the centroid are updated. Lastly, the pose of the clustered object is calculated about the updated centroid. Since the presence of occlusions affects the correct pose and centroid estimation, the tracker module maintains the history of the tracked object and heuristically adjusts the dimensions and pose of the object temporally. The information flow is expressed in Fig. 6, where points a and b are the farthest points of the cluster, identified by the maximum and minimum coordinates of the cluster, respectively.

FIGURE 6. L-shape box fitting.
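A minimal sketch of the L-shape fitting step described above is given below; it recovers two extreme corners, the farthest third corner, and a yaw estimate for the fitted rectangle. It is an illustrative simplification (no cluster-position heuristics and no temporal smoothing), not the authors' implementation.

```python
import numpy as np

def l_shape_fit(points_xy):
    """Fit an L-shaped box to a 2D cluster (N x 2 array of x, y coordinates)."""
    # Corners a and b: extreme points of the cluster on the horizontal plane
    # (a simple min/max heuristic stands in for the coordinate-based selection).
    a = points_xy[np.argmin(points_xy[:, 0] + points_xy[:, 1])]
    b = points_xy[np.argmax(points_xy[:, 0] + points_xy[:, 1])]
    # Third corner c: the point farthest from the line a-b.
    ab = b - a
    dist = np.abs(ab[0] * (points_xy[:, 1] - a[1]) - ab[1] * (points_xy[:, 0] - a[0]))
    dist /= (np.linalg.norm(ab) + 1e-9)
    c = points_xy[np.argmax(dist)]
    # Box edges follow the two legs of the L; yaw is the heading of the longer leg.
    leg1, leg2 = a - c, b - c
    length, width = np.linalg.norm(leg1), np.linalg.norm(leg2)
    if length < width:
        leg1, leg2 = leg2, leg1
        length, width = width, length
    yaw = np.arctan2(leg1[1], leg1[0])
    centroid = c + 0.5 * (leg1 + leg2)   # center of the rectangle spanned by the two legs
    return centroid, length, width, yaw
```

In the actual pipeline the measured dimensions and pose produced by such a fit are only provisional; the tracker refines them over time, as noted above.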
The box fitting task is performed within the clustering module, and finding the farthest points in the clusters contributes to the overall process time. The key factors modified to acquire an optimized process time are as follows:
• Instead of traversing all the points, the points corresponding to the minimum and maximum coordinates of the cluster are exploited.
• The farthest points of the cluster are heuristically found by making use of the dimensions and the cluster position with respect to the sensor.

D. TRACKING
Multi-object tracking is an essential component in the perception pipeline of autonomous vehicles.
Object tracking capabilities can enable the system to make better decisions about the actions to perform in cluttered environments. An extensive literature on 2D MOT algorithms exists that focuses on object tracking in the image plane [36], where the objective is to maintain unique identities and to provide temporally consistent locations of the detected objects. However, tracking of objects in 3D space is becoming popular in the research community [19], as more and more 3D MOT schemes are being proposed. The 3D MOT systems in general share similar components with 2D MOT systems, apart from the distinction of object detections in 3D space instead of the image plane. This potentially allows the design of motion models, data association, occlusion handling, and trajectory maintenance directly in 3D space without perspective distortion.

An autonomous vehicle acts like a dynamic system that is required to track objects in the environment. In this scenario, the tracked objects tend not to follow a regular motion pattern, giving rise to motion uncertainties. Similarly, the cluttered environments and sensor limitations impose partial or complete occlusion of objects, which adds to the uncertainty in the position and pose of the tracked objects. To perform tracking in the presence of uncertainties, Bayesian filtration strategies are usually deployed, where the state estimation is carried out with either the assumption of a Gaussian mixture for the density or a Gaussian distribution for the transition. The assumption leads to the use of the Gaussian Mixture Probability Hypothesis Density Filter (PHDF) [37] or the Joint Probabilistic Data Association Filter (JPDAF) [38] for state estimation, respectively. For the non-Gaussian assumption, Particle Filter (PF) methods are used [39]. The states of the tracked objects are updated after the association between tracks and detection information is established.

In this work, like the former implementation in [23], the uncertainties due to clutter are addressed with an assumption of Gaussian distribution, and a JPDAF is applied for data association. Similarly, the uncertainty due to motion is handled by an Interacting Multiple Model (IMM), to perform non-linear prediction of the states of the tracked objects. To cater for the non-linearities of the motion models under the Gaussian process, an Unscented Kalman Filter (UKF) is utilized. The implementation of IMM-UKF-JPDAF is an approach that efficiently addresses the problem of recursively estimating the states and mode probabilities of targets, described by a jump Markov non-linear system, in the presence of clutter.
The trackable objects are assumed to follow r motion models M = {M_j}_{j=1}^r, represented as a non-linear stochastic state space model whose process and measurement noises are mutually independent with covariance matrices Q_{j,k} and R_{j,k}, respectively, v_{j,k} denoting the measurement noise. Moreover, the progression of the system among the r models is considered as a first-order Markov chain with the time-invariant Markovian model transition probability matrix

      | p_11 · · · p_r1 |
Π  =  |  ⋮    ⋱    ⋮   |  ∈ R^{r×r}.   (6)
      | p_1r · · · p_rr |

The elements p_ij of the matrix represent the mode transition probability from model i to model j.

The proposed IMM-UKF-JPDAF tracker follows a five-step process: (a) interaction, (b) state prediction and measurement validation, (c) data association and model-based filtering, (d) mode probability update, and (e) combination. A similar approach for a single target is explained in [40], which addresses the data association of measurements to a single tracked object. In comparison, a JPDAF is deployed in the proposed framework to perform tracking of multiple objects. This requires the computation of the association probability between each track and measurement, while considering all feasible joint association events across all measurements, leading to a combinatorial explosion problem.

To mitigate the possible combinatorial explosion, a clustering technique is adopted, where the association matrix is clustered into the sets of marginal association events θ_{j,q} and joint association events Θ = ∩_{j=1}^{N_k} θ_{j,q_j}. The number of clusters equals the sum of marginal and joint association events. The clustering technique helps in mitigating the combinatorial explosion of hypotheses that naturally grows in cluttered environments. Furthermore, the covariance of a track prediction increases with unassociated measurements in consecutive time steps, consequently increasing the gate area for association. The larger gate area results in a larger number of joint association events.

From the hypotheses of all possible occurrences of events within every cluster, the marginal association probabilities are computed as the probability sum of the joint association events, given that measurement j belongs to track q:

β_jq^cl = Σ_{Θ^cl} P{Θ^cl | z_k} ω̂_jq^cl[Θ^cl],   j = 1, . . . , N^cl and q = 1, . . . , T^cl,   (7)

P{Θ^cl | z_k} = (1/c) ∏_{j=1}^{N^cl} g_jq P_D ∏_{q=1}^{T^cl} (1 − P_D)^{δ_q} ∏_{j=1}^{N^cl} β φ_j.   (8)
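As an illustration of how the transition matrix Π in (6) enters the interaction step (a), the sketch below computes the standard IMM mixing probabilities and mixed state estimates from the previous mode probabilities. It is a generic textbook formulation under assumed array shapes, not the paper's exact implementation.

```python
import numpy as np

def imm_interaction(Pi, mu, states, covs):
    """IMM mixing (interaction) step.

    Pi:     (r, r) model transition matrix, Pi[i, j] = P(model j at k | model i at k-1).
    mu:     (r,)   mode probabilities at k-1.
    states: (r, n) state estimate of each model filter.
    covs:   (r, n, n) covariance of each model filter.
    Returns mixed initial states/covariances per model and the predicted mode priors.
    """
    c_j = Pi.T @ mu                              # normalizers / predicted mode priors
    mix = (Pi * mu[:, None]) / c_j[None, :]      # mix[i, j] = P(model i at k-1 | model j at k)
    mixed_states = mix.T @ states                # weighted sums over the source models
    mixed_covs = np.zeros_like(covs)
    for j in range(len(mu)):
        for i in range(len(mu)):
            d = states[i] - mixed_states[j]
            mixed_covs[j] += mix[i, j] * (covs[i] + np.outer(d, d))
    return mixed_states, mixed_covs, c_j
```

Each model-matched UKF then predicts from its mixed initial condition, and the mode probabilities are re-weighted with the model likelihoods in step (d).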
The combined innovation for each model i of the filter is computed according to the associated measurement set:

z̃_{i,q,k} = Σ_{j=1}^{N^cl} β_{j,q}^{cl} z̃_{i,j,q,k}.   (9)

The cross-covariance matrix C_{x_k,z_k} between the predicted states and the measurements is used together with the innovation covariance matrix S_k to calculate the optimal Kalman gain K_k. Subsequently, the states and covariances of each model for the corresponding tracks are updated:

K_k = C_{x_k,z_k} S_k^{−1},   (10)
x_{i,q,(k|k)} = x_{i,q,(k|k−1)} + K_{i,q,k} z̃_{i,q,k}.   (11)

The states, covariances, and mode probabilities of the IMM-UKF-JPDAF are recursively estimated with the help of the individual model likelihoods. The individual filter states and covariances are combined into a single weighted output using the mode probabilities of the tracks. The flow of the tracker module is elaborated in Fig. 7, along with the clustered association matrix, forming three association clusters as sub-problems for the tracker. Furthermore, a tracked object is shown that presents the tracking parameters along with the class and confidence percentage.

FIGURE 7. IMM-UKF-JPDAF tracker with clustered associations.

The execution time of the tracker module mainly relies on the number of maintained tracks. The former implementation [23], without visual classification, pruned the tracks merely based on inconsistent measurements, resulting in a large number of tracks to maintain. With the additional condition of a tracked object being classified by the visual object detector, the number of tracks is reduced, resulting in a decreased process time. The average execution time of the tracker, including track management and visual class association, on similar datasets consumes 6.4 ms and 18.5 ms on the desktop and embedded computing platforms, respectively, whereas the former implementation consumed 8.3 ms on the desktop and 24.6 ms on the embedded system, without visual class association. The key factor responsible for optimizing the process time is an efficient track management module that limits the false positive measurements for tracking.

E. OBJECT CLASSIFICATION
The paradigm of fusing multiple modalities followed in this work can be regarded as late fusion, where the tracked clusters of point clouds are classified. The classification of tracked point cloud clusters relies on two components: class association and class management.
• Class association involves the process of assigning a visually detected object's class to the tracked point cloud cluster.
• Class management utilizes the assignment history to maintain and select a class for the tracked object.

The visual object detection is carried out using YOLO-v3 [42], which is pretrained on the Microsoft COCO dataset [43], and the detection classes are limited to trackable dynamic objects. YOLOv3 uses a Darknet-53 backbone (a CNN model with 53 convolutional layers) and delivers 57.9 mAP (AP50) on Microsoft's COCO dataset, using an input resolution of 608 × 608 pixels. The optimized variants of the network can perform in real time on embedded systems, but at the cost of compromised accuracy.

In this work, the input image resolution is tuned to 416 × 416 pixels, which maintains the process time well below 100 milliseconds on the embedded system, although the reduction in input image resolution results in missed detections due to size, saturation, and exposure issues in the image. The tracker module handles the missed detections with the class vector that probabilistically assigns the class to objects. In addition, the LiDAR range is limited to 60–80 m; beyond this range the point cloud is too sparse for an accurate estimation of the object dimensions and pose. A visual detector detecting an object beyond this range only adds additional complexity to the class association process.

Let T^k and D^k be the sets of maintained tracks and visually detected objects, respectively, at time step k. The corrected centroids of the tracked clusters o_i^k are projected onto the image frame of the corresponding time stamp, resulting in a 2D pixel location in the image ō_i^k. Similarly, the localization and dimensions of the visually detected objects are used to calculate their centroids m_j^k. Using the 2D centroids of both sources, the Euclidean distance cost matrix E^k = [c_ij^k] is populated, where i = {1, 2, . . . , T} and j = {1, 2, . . . , D}, further constrained by the criterion that at least 30% of overlap exists among the corresponding 2D bounding boxes:

c_ij^k = { d(ō_i^k, m_j^k)   if iou(t̄_i^k, d_j^k) > 0.3
           1000              otherwise.   (12)
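A minimal sketch of the gated cost matrix in (12), together with the Munkres (Hungarian) assignment applied to it in the next step, is given below using scipy's solver. The gating value 1000 and the 0.3 IoU threshold are taken from the text, while the helper names, box format, and projection handling are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_classes(track_boxes, track_centers, det_boxes, det_centers,
                      iou_gate=0.3, penalty=1000.0):
    """Build the gated Euclidean cost matrix of Eq. (12) and solve it with Munkres."""
    T, D = len(track_boxes), len(det_boxes)
    cost = np.full((T, D), penalty)
    for i in range(T):
        for j in range(D):
            if iou_2d(track_boxes[i], det_boxes[j]) > iou_gate:
                cost[i, j] = np.linalg.norm(track_centers[i] - det_centers[j])
    rows, cols = linear_sum_assignment(cost)          # minimum-cost assignment
    # Keep only pairs that passed the IoU gate (cost below the penalty value).
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < penalty]
```

Each accepted pair <i, j> then increments the corresponding dimension of the track's class vector, as formalized in (13) and (14) below.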
Following the Munkres association strategy, the optimized minimum-cost assignment is performed and a set of index pairs Υ, pertaining to the associated tracks t_i^k and visual detections d_j^k, is obtained. Using the set Υ, the class association matrix Ê^k = [ĉ_ij^k] can be formulated such that

ĉ_ij^k = { v_j   if <i, j> ⊂ Υ
           0     otherwise,   (13)

where v is the number that represents the class of the visually detected object and the dimension index of the class association vector A_v^i = (a_1^i, . . . , a_n^i). The matrix Ê^k is used to update the class association vector A^i with an increment in the associated class dimension,

a^i_{ĉ_ij} = { a^i_{ĉ_ij}        if ĉ_ij^k = 0
               a^i_{ĉ_ij} + 1    otherwise.   (14)

The updated vector A^i, along with the age of the track t_age, is utilized to compute the class certainty P_c^i of the tracked object and the ratio P_o^i that reasons the tracked object to be an outlier:

P_c^i = max(a_v) / t_age,   (15)
P_o^i = (t_age − Σ_{v=1}^n a_v^i) / t_age.   (16)

FIGURE 8. Visual object detector to classify tracked objects using class vector.

In Fig. 8, the flow of object classification is presented, which runs in parallel with the LiDAR based detection and tracking thread. Furthermore, the class vector of a mature track is shown that represents the age and the counts of class associations, resulting in a class certainty of 66.6%. Moreover, the red and blue dots in the image represent the projected centroid of the cluster and the center of the visually detected objects, respectively.

The tracked objects maintain a class vector with each dimension registered to a visually detectable class. At the fusion step, after a successful association, a unit increment is made in the corresponding class dimension of the vector. The maximum count in a dimension of the class vector specifies the object class, whereas the certainty is computed against the life of the track.

F. TRACK MANAGEMENT
The MODT task in the presence of uncertainties pertaining to classification, clutter, and motion of objects requires a robust track management module to maintain and provide reliable information. The main purposes of the track management module are to initialize and maintain track statistics, handle occlusion of the tracked objects, and prune out tracks pertaining to false positive measurements.

The track management module initiates new tracks for unassociated measurements with unique identities, and records the track age in terms of frame count. In addition, the dimensions and pose of the tracked object are retained while considering the LiDAR properties. The measurements related to objects moving farther from the sensor tend to experience increased occlusions and report decreased dimensions. On the other hand, objects approaching closer to the sensor get more exposure and provide comparatively more accurate dimensions. A similar pattern is observed in the estimated pose of the object and is handled by smoothing the sudden changes in the yaw angles. The accuracy of the maintained dimensions and pose aids the occlusion handling and centroid correction, as the centroid of the measurement pertaining to an object under occlusion is also shifted proportionally to the change in the dimensions. By capitalizing on the sensor characteristics and the maintained information, the centroid C of the tracked object is corrected using the change in length ΔL, width ΔW, height ΔH, and yaw ϕ:
C'_x = C_x + ΔL · cos(ϕ) / 2,   (17)
C'_y = C_y + ΔW · sin(ϕ) / 2,   (18)
C'_z = C_z + ΔH.   (19)
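The correction in (17)-(19) can be written compactly as below; the sign handling for objects on different sides of the sensor is omitted here and would follow the heuristics described in the text.

```python
import numpy as np

def correct_centroid(centroid, d_length, d_width, d_height, yaw):
    """Shift a measured centroid by half the change in box dimensions along the box axes.

    centroid: (3,) measured centroid (x, y, z); d_*: maintained minus measured dimensions.
    """
    cx, cy, cz = centroid
    cx += d_length * np.cos(yaw) / 2.0      # Eq. (17)
    cy += d_width * np.sin(yaw) / 2.0       # Eq. (18)
    cz += d_height                          # Eq. (19)
    return np.array([cx, cy, cz])
```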
The tasks associated with the track management module are presented in Fig. 9, with an example of centroid correction. The mature track of an object retains the dimensions as width W and length L, which are compared with the measured dimensions W_m and L_m to find the differences ΔW and ΔL. The change in dimensions is utilized together with the position of the object relative to the sensor to acquire the corrected centroid C'. Furthermore, the mature tracks of objects represent the measured dimensions by a wire-frame bounding box, along with a red box with the corrected dimensions and centroid. An initialized track requires measurement associations for five consecutive time steps to be regarded as a mature track. If a track misses an association measurement without getting a classification, it is pruned out.
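To tie the class vector of subsection E to the pruning logic above, the following sketch maintains the per-track class counts and evaluates the class certainty and outlier ratio of (15)-(16). The maturity threshold of five frames comes from the text, while the pruning threshold on the outlier ratio is an assumed illustrative value.

```python
import numpy as np

class TrackClassState:
    """Per-track class vector A^i with age-based certainty, following Eqs. (14)-(16)."""

    def __init__(self, num_classes):
        self.counts = np.zeros(num_classes, dtype=int)   # class association vector A^i
        self.age = 0                                      # track age t_age in frames

    def step(self, associated_class=None):
        """Call once per frame; associated_class is the detector class index, or None."""
        self.age += 1
        if associated_class is not None:
            self.counts[associated_class] += 1            # Eq. (14): unit increment

    @property
    def class_id(self):
        return int(np.argmax(self.counts))                # dimension with the maximum count

    @property
    def certainty(self):
        return self.counts.max() / max(self.age, 1)       # Eq. (15)

    @property
    def outlier_ratio(self):
        return (self.age - self.counts.sum()) / max(self.age, 1)   # Eq. (16)

    def should_prune(self, min_age=5, max_outlier_ratio=0.8):
        """Assumed rule: prune mature tracks that are almost never visually classified."""
        return self.age >= min_age and self.outlier_ratio > max_outlier_ratio
```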
TABLE 1. Tracking evaluation with datasets information.
TABLE 2. MODT time consumption on desktop and Jetson board.
TABLE 3. 3D MOT evaluation results.

The proposed tracker attains the leading score of precision in the 2D MOT format. Furthermore, the proposed method performs the computation at the highest speed of 236 FPS. Similarly, the identity switch count ranks third among the compared approaches, with a total of 21. Thus, the proposed tracker can perform on a budgeted computational resource, with performance metrics on par with the state-of-the-art approaches.
informed selection of visual features for visual odometry. Similarly, the motion patterns of dynamic objects can benefit path planning. Furthermore, the tracking of static objects can aid the odometry that is often required in dense urban environments where conventional localization methods become less reliable. Furthermore, instance-aware semantic segmentation of the visual scene can also be realized in a cost-effective way, requiring the up-sampling of the tracked LiDAR clusters projected on the image, as demonstrated in Fig. 11.

FIGURE 11. MODT based instance aware semantic segmentation.
IX. CONCLUSION
In this work an efficient MODT framework is proposed for embedded systems that operate on visual-LiDAR setups. The framework takes advantage of spatial LiDAR data and 2D scene understanding by performing late fusion of the modalities temporally. The framework is tested with well-established performance metrics against the publicly available KITTI datasets, whereas the tracking component is also independently tested on a 3D MOT benchmark for a fair comparison with state-of-the-art methods. It is intended to further extend this work in the future to improve the object detection by early classification of objects and to realize MODT based semantic annotations of images. Moreover, MODT can be exploited to aid visual odometry in dense environmental conditions.
REFERENCES
[1] Car Crash Deaths and Rates—Injury Facts. Car Crash Deaths and Rates. Accessed: Apr. 29, 2020. [Online]. Available: https://injuryfacts.nsc.org/motor-vehicle/historical-fatality-trends/deaths-and-rates/
[2] M. Cunneen, M. Mullins, F. Murphy, D. Shannon, I. Furxhi, and C. Ryan, "Autonomous vehicles and avoiding the trolley (Dilemma): Vehicle perception, classification, and the challenges of framing decision ethics," Cybern. Syst., vol. 51, no. 1, pp. 59–80, Jan. 2020.
[3] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, "The architectural implications of autonomous driving: Constraints and acceleration," in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst., Mar. 2018, pp. 751–766.
[4] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "LaserNet: An efficient probabilistic 3D object detector for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12677–12686.
[5] J. Zhou, X. Tan, Z. Shao, and L. Ma, "FVNet: 3D front-view proposal generation for real-time object detection from point clouds," 2019, arXiv:1903.10750. [Online]. Available: http://arxiv.org/abs/1903.10750
[6] B. Wu, A. Wan, X. Yue, and K. Keutzer, "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018.
[7] G. Brazil and X. Liu, "M3D-RPN: Monocular 3D region proposal network for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9287–9296.
[8] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 641–656.
[9] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, "Multi-task multi-sensor fusion for 3D object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7345–7353.
[10] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 918–927.
[11] Z. Wang and K. Jia, "Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 1742–1749.
[12] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3D object detection methods for autonomous driving applications," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 10, pp. 3782–3795, Oct. 2019.
[13] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3D proposal generation and object detection from view aggregation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1–8.
[14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
[15] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. M. Gross, "Complexer-YOLO: Real-time 3D object detection and tracking on semantic point clouds," in Proc. CVPRW, Jun. 2019, pp. 1–10.
[16] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3D tracking by detection," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018, pp. 635–642.
[17] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Kraehenbuehl, T. Darrell, and F. Yu, "Joint monocular 3D vehicle detection and tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5390–5399.
[18] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3569–3577.
[19] X. Weng, J. Wang, D. Held, and K. Kitani, "3D multi-object tracking: A baseline and new evaluation metrics," 2019, arXiv:1907.03961. [Online]. Available: http://arxiv.org/abs/1907.03961
[20] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, "FANTrack: 3D multi-object tracking with feature association network," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2019, pp. 1426–1433.
[21] X. Weng, Y. Yuan, and K. Kitani, "Joint 3D tracking and forecasting with graph neural network and diversity sampling," 2020, arXiv:2003.07847. [Online]. Available: http://arxiv.org/abs/2003.07847
[22] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 2365–2374.
[23] M. Sualeh and G.-W. Kim, "Dynamic multi-LiDAR based multiple object detection and tracking," Sensors, vol. 19, no. 6, p. 1474, Mar. 2019.
[24] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[25] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019, arXiv:1906.05113. [Online]. Available: http://arxiv.org/abs/1906.05113
[26] P. Narksri, E. Takeuchi, Y. Ninomiya, Y. Morales, N. Akai, and N. Kawaguchi, "A slope-robust cascaded ground segmentation in 3D point cloud for autonomous vehicles," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 497–504.
[27] M. Himmelsbach, F. V. Hundelshausen, and H.-J. Wuensche, "Fast segmentation of 3D point clouds for ground vehicles," in Proc. IEEE Intell. Vehicles Symp., Jun. 2010, pp. 560–565.
[28] Q. Li, L. Zhang, Q. Mao, Q. Zou, P. Zhang, S. Feng, and W. Ochieng, "Motion field estimation for a dynamic scene using a 3D LiDAR," Sensors, vol. 14, no. 9, pp. 16672–16691, 2014.
[29] M. Velas, M. Spanel, M. Hradis, and A. Herout, "CNN for very fast ground segmentation in Velodyne LiDAR data," in Proc. IEEE Int. Conf. Auton. Robot Syst. Competitions, Apr. 2018, pp. 97–103.
[30] S. K. Uppada, "Centroid based clustering algorithms—A clarion study," Int. J. Comput. Sci. Inf. Technol., vol. 5, no. 6, pp. 7309–7313, 2014.
[31] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014.
[32] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander, "A distribution-based clustering algorithm for mining in large spatial databases," in Proc. 14th IEEE Int. Conf. Data Eng., Feb. 1998, pp. 324–331.
[33] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, Oct. 2018.
[34] D. D. Morris, R. Hoffman, and P. Haley, "A view-dependent adaptive matching filter for LiDAR-based vehicle tracking," in Proc. 14th IASTED Int. Conf. Robot. Appl., Cambridge, MA, USA, Nov. 2009, pp. 1–9.
[35] Z. Luo, S. Habibi, and M. V. Mohrenschildt, "LiDAR based real time multiple vehicle detection and tracking," Int. J. Comput. Electr. Autom. Control Inf. Eng., vol. 10, no. 6, pp. 1125–1132, 2016.
[36] Y. Wu, J. Lim, and M. H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[37] B. N. Vo and W. K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4091–4104, Nov. 2006.
[38] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, "Joint probabilistic data association revisited," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3047–3055.
[39] A. Doucet, "On sequential simulation-based methods for Bayesian filtering," Dept. Eng., Univ. Cambridge, Cambridge, U.K., Tech. Rep. CUED-F-ENG-TR310, 1998.
[40] M. Schreier, V. Willert, and J. Adamy, "Compact representation of dynamic driving environments for ADAS by parametric free space and dynamic object maps," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 2, pp. 367–384, Feb. 2016.
[41] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler, "MOT16: A benchmark for multi-object tracking," 2016, arXiv:1603.00831. [Online]. Available: http://arxiv.org/abs/1603.00831
[42] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," Apr. 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[43] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," May 2014, arXiv:1405.0312. [Online]. Available: http://arxiv.org/abs/1405.0312
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013.
[45] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," Dec. 2018, arXiv:1812.04244. [Online]. Available: http://arxiv.org/abs/1812.04244
[46] X. Weng and K. Kitani, "Monocular 3D object detection with pseudo-LiDAR point cloud," 2019, arXiv:1903.09847. [Online]. Available: http://arxiv.org/abs/1903.09847
[47] J. Leonard et al., "A perception-driven autonomous urban vehicle," J. Field Robot., vol. 25, no. 10, pp. 727–774, Oct. 2008.
[48] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[49] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granstrom, "Mono-camera 3D multi-object tracking using deep learning detections and PMBM filtering," in Proc. IEEE Intell. Vehicles Symp. (IV), Changshu, China, Jun. 2018, pp. 433–440.

MUHAMMAD SUALEH received the B.S. degree in electronics engineering from COMSATS University Islamabad, Abbottabad Campus, Pakistan, in 2009, and the M.S. degree in systems, control, and mechatronics from the Chalmers University of Technology, Sweden, in 2011. He is currently pursuing the Ph.D. degree with the Department of Control and Robot Engineering, Chungbuk National University, South Korea. His research interests include robotics, semantic SLAM, object detection and tracking, and control systems.

GON-WOO KIM received the M.S. and Ph.D. degrees from Seoul National University, South Korea, in 2002 and 2006, respectively. He is currently a Professor with the School of Electronics Engineering, Chungbuk National University, South Korea. His research interests include navigation, localization, and SLAM for mobile robots and autonomous vehicles.