
Received June 22, 2020, accepted August 14, 2020, date of publication August 24, 2020, date of current version September 8, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3019187

Visual-LiDAR Based 3D Object Detection and Tracking for Embedded Systems
MUHAMMAD SUALEH AND GON-WOO KIM
Department of Robot and Control Engineering, Chungbuk National University, Cheongju 28644, South Korea
Corresponding author: Gon-Woo Kim (gwkim@cbnu.ac.kr)
This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT)
under Grant 2018006154; in part by the Ministry of Trade, Industry and Energy (MOTIE) and the Korea Institute for Advancement of
Technology (KIAT) through the International Cooperative Research and Development Program under Project P0004631; and in part by the
Ministry of Science and ICT (MSIT), South Korea, through the Grand Information Technology Research Center Support Program,
supervised by the Institute for Information and communications Technology Planning and Evaluation (IITP), under Grant
IITP-2020-0-01462.

ABSTRACT In recent years, persistent news updates on autonomous vehicles and the claims of companies entering the space reinforce the notion that level 5 vehicular autonomy is just around the corner. However, the main hindrance in asserting full autonomy still boils down to environmental perception, which affects the autonomous decisions. An efficient perceptual system requires redundancy in sensor modalities capable of performing in varying environmental conditions and providing reliable information using limited computational resources. This work addresses the task of 3D object detection and tracking in the vehicle's environment, using a camera and a 3D LiDAR as primary sensors. The proposed framework is designed to operate on an embedded system that visually classifies objects using a lightweight neural network, while tracking is performed in 3D space using LiDAR information. The main contributions of this work are 3D LiDAR point cloud classification using a visual object detector, and an IMM-UKF-JPDAF based object tracker that jointly performs 3D object detection and tracking. The performance evaluation is carried out using MOT16 metrics and ground truths provided by the KITTI datasets. Furthermore, the proposed tracker is evaluated and compared with state-of-the-art approaches. The experiments suggest that the proposed framework offers a suitable solution for embedded systems to solve the 3D object detection and tracking problem, with added benefits.

INDEX TERMS Kalman filter, object detection, object tracking, point cloud classification, sensor fusion.

(The associate editor coordinating the review of this manuscript and approving it for publication was Malik Jahan Khan.)

I. INTRODUCTION
The delay in large-scale commercialization of autonomous vehicles circles around factors pertaining to safety, feasibility, and affordability. The push towards driver-less cars has been supported by the prospect of saving human lives. The current car fatality rate in the US is about 1.22 deaths per 100 million miles driven, including safety violation cases [1]. Effectively, that sets the benchmark for acceptable autonomous vehicle failure, which remains a huge challenge [2]. Furthermore, autonomous vehicles are required to take decisions while making trade-offs between safety and feasibility. For instance, avoiding lane changes, driving slowly at all times, or not driving at all may be considered safe but remains infeasible.

Another challenge in the autonomous vehicles' paradigm is the rising computational demand to process the raw data from sensors in real time. However, a moving datacenter in the name of an autonomous vehicle is infeasible [3]. Although edge computing can offer remote processing of computationally expensive tasks, it comes with a compromise on the security and reliability of information.

An important capability of an effective environmental perception system is to understand the dynamic properties of coexisting entities. This has put forth huge requirements on the associated research domains related to 3D object detection and tracking. 3D object detection provides a faithful representation of the 3D space around the vehicle, in terms of class, dimensions, and pose, whereas tracking enables the estimation of the dynamic parameters. Furthermore, tracking also addresses the issue of temporally missed detections due to the shortcomings of the detector.


The key challenges in developing a framework for autonomous vehicles to perform 3D object detection and tracking include real-time performance, limited computational demand, applicability in a variety of weather and lighting conditions, and ease of adapting to changes in the number and positioning of sensors.

Autonomous vehicles are generally equipped with numerous sensors for environmental perception, such as ultrasonic sensors, radar, LiDAR (light detection and ranging), cameras, and so on. Among these, many modern approaches use the camera, LiDAR, or a fusion of both for 3D object detection tasks. Although LiDAR and camera can perform object detection independently, each sensor possesses some limitations. LiDAR-based approaches are susceptible to harsh weather conditions and low resolution [4]–[6], whereas camera-based methods are primarily challenged by inadequate light and depth information [7]. Therefore, both sensors require a joint operation to complement the individual limitations and to enable applicability in a wider range of environmental conditions [8]–[11].

Visual-LiDAR based 3D object detection methods adopt either early, late, or deep fusion schemes [12]. The modalities are combined at the beginning of the process in the early fusion scheme [13], with an interdependent representation of data. The late fusion scheme processes the modalities independently up to the last stage, where the information is fused [10], [11]. The deep fusion schemes tend to mix the modalities hierarchically in neural network layers, allowing the features from different modalities to interact over layers [8], [9], [14]. In order to exploit the redundant information of the modalities while compensating for individual sensor limitations, late fusion schemes are the most appropriate selection.

The tasks of 3D object detection and 3D object tracking are traditionally approached independently. Recent works that jointly perform 3D object detection and tracking are proposed in [15]–[18], reflected on performance leaderboards across various benchmarking platforms. The approaches adopted in [15], [16] utilize a visual-LiDAR setup for detection, unlike the monocular setup for detections in [17] and 3D LiDAR as the only sensor for detections in [18]. These methods, being learning based, require a laborious and expensive annotation process to prepare training datasets. Furthermore, a change in the number, type, or positioning of sensors requires retraining of the networks. Moreover, the inference time and computational needs limit their use in real-time applications.

Currently, more and more trackers are being proposed that perform tracking in 3D space [19]–[22]. However, these trackers require reasonably accurate 3D detections, thereby adding computational demand to the system. The tracker proposed in [19] utilizes 3D IoU thresholding for data association under a 3D Kalman filter paradigm. The approach proposed in [20] does perform in real time, but at the cost of GPU utilization. Furthermore, a multi-modality approach proposed in [22] focuses on fusion of detected object information, which remains infeasible from an application perspective as multiple networks run for detections. The authors of [21] distinctly address the problem of tracking by seeking diversity, using determinantal point processes to forecast the trajectories of objects. The main drawbacks in the existing schemes are the network parameters that require training, the computational needs, the inapplicability on embedded systems, and the reliance on 3D object detector performance.

In this work, a comprehensive framework for joint 3D object detection and tracking on an embedded system is proposed. The framework makes use of a visual-LiDAR setup to exploit the information redundancy for real-time, reliable results. The 3D LiDAR point cloud is represented in a cylindrical grid and possible object candidates are filtered. The candidates are tracked, and the information of position, pose, dimensions, and class vector is maintained. In parallel, a neural network is employed for visual classification of objects for proposal generation that temporally updates the class vector of the tracked candidates. The framework is an extension of previous work [23], where only LiDAR was considered as the perceptual sensor but which lacked proper classification of objects, resulting in a large number of false positives.

The advantages of the proposed approach are manifold, as challenges pertaining to occlusions and missed visual detections are temporally addressed. Furthermore, even in poor lighting conditions the LiDAR detector continues to operate. Moreover, since no training is involved in the direct classification of point clouds, the approach can seamlessly integrate into a variety of sensor arrangements. In addition, the tracker can temporally provide dynamic attributes of the detected objects that can be directly used for autonomous decisions. The proposed framework is implemented on an embedded system and the performance evaluation is carried out using well-established metrics for object detection and tracking on the ground truths provided by the KITTI datasets [24]. The main contributions of this work include a novel 3D object detector, efficient point cloud processing for object candidate estimation, and a clutter-aware probabilistic tracking algorithm for an embedded system.

The perception module with efficient 3D object detection and tracking capabilities directly impacts the quality of spatial localization, mapping, and motion planning. The localization can benefit from the static part of the environment in conjunction with the redundancy of the sensory setup, especially in a dense and dynamic urban environment. Similarly, regions of the environment pertaining to the dynamic objects can be ignored for mapping. In addition, motion patterns of the dynamic objects in the vicinity can be utilized for safer motion planning [25].

The remainder of the paper is organized as follows: in Section II, the proposed architecture is described in terms of hardware/software and information flow. The working of the framework is explained in Section III, followed by the specifications of the implementation platform in Section IV. A detailed description of the proposed framework at a modular level is given in Section V. The evaluation criteria are defined in Section VI, including results and comparisons with the state-of-the-art.


In Section VII, the framework implementation on the platform is demonstrated. The added features of the proposed framework are discussed in Section VIII, followed by conclusions in Section IX.

II. SYSTEM ARCHITECTURE
The scope of this work is 3D object detection and tracking, which is part of a smart car project that involves a broader V2X based autonomous vehicle architecture. The idea is to let vehicles communicate over the V2X protocol and share environmental information, as in a dense urban situation a large part of the environment is occluded by other dynamic objects. Given that all vehicles can share minimal MODT information of the environment, visibility beyond the sensor range can be attained. Furthermore, the fast-paced development of edge computing and 5G can enhance the computation and communication capacity.

FIGURE 1. MODT in a V2X based architecture.

The proposed MODT scheme is implemented in a Vehicle-to-everything (V2X) based autonomous vehicle architecture, shown in Fig. 1 in its basic form. The idea is to populate a Local Dynamic Map (LDM) to aid the controls of the individual 'n' smart cars in the network and to share safety messages. Once the smart car is localized in the map, the local MODT information is fused with the LDM information via a V2X transceiver. This provides an environmental perception ranging beyond the sensing capability of a single vehicle's sensors. This article is focused on the MODT framework for a single vehicle; therefore localization, transmission protocols, safety messages, and control mechanisms will not be discussed further.

III. PROPOSED FRAMEWORK
The objective of 3D MODT is to detect objects by class, dimensions, and orientation, and to maintain unique IDs along with the parameters pertaining to position and kinematics. However, researchers approach the 3D object detection and tracking problems separately, evident from the well-established evaluation metrics and leaderboard rankings. The motivation of this work is derived from the notion that consecutive visual frames in the application areas of object detection are in most cases temporal. That is, the scene does not change abruptly; rather, an object appearing in the scene remains visible for several frames. Furthermore, application areas such as autonomous vehicles largely benefit from the dynamic information of the detected objects. This is usually addressed by a tracker that maintains a unique ID for a detected object and predicts the motion patterns of the detected objects. Thus, accurate 3D object detection without tracking provides no information regarding object motion.

FIGURE 2. 3D MODT framework.

The proposed framework addresses the 3D object detection and tracking problem jointly in a temporal fashion; the information flow is shown in Fig. 2. The framework runs on two threads, associated with the LiDAR and camera inputs respectively. The LiDAR point cloud is treated with ground removal and clustering to predict the initial pose and dimensions of potentially trackable objects. The centroids of the objects are considered as measurements for the IMM-UKF-JPDAF based tracker. The second thread in parallel predicts visual detections in the image, providing localized bounding boxes and class information. Instead of assigning a fixed class and dimensions to an object in a single frame, a tracked object is assigned a class, whereas parameters pertaining to dimensions, pose, and velocity are updated temporally across multiple time frames. The tracking information is merged with the visual detections to provide 3D object poses along with the associated tracking parameters.

IV. PLATFORM FOR 3D MODT IMPLEMENTATION
The proposed framework is implemented and tested on a Hyundai i30 (Hyundai Motor Company, Seoul, South Korea), shown in Fig. 3. The platform is equipped with an OS1-64 Ouster LiDAR (Ouster, San Francisco, CA, USA), mounted on the center top of the platform. For visual perception, a ZED camera (Stereo Labs, San Francisco, CA, USA) is mounted beside the LiDAR inside a custom-made casing. The sensors provide raw measurements to a Jetson AGX Xavier unit by Nvidia (Nvidia Corporation, Santa Clara, CA, USA), which performs the computations associated with the proposed framework. Furthermore, the vehicle CAN is interfaced along with a V2X modem to perform V2X communication. The framework is developed to operate on ROS (robot operating system) ''Melodic Morenia'' middleware on top of Ubuntu Linux 18.04.1. The GPU of the Xavier is utilized through CUDA 10.0 libraries for visual detections, whereas LiDAR preprocessing and tracking tasks are handled by the NVIDIA Carmel ARM CPU processors.
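To make the two-thread information flow of Fig. 2 concrete, the following is a minimal, self-contained sketch in Python. All function names, topics, and the dummy data are hypothetical stand-ins for the actual ROS nodes; it only illustrates the late-fusion structure (LiDAR thread producing tracked boxes, camera thread producing class proposals), not the authors' implementation.

    import queue
    import threading
    import time

    lidar_out = queue.Queue()    # tracked 3D boxes from the LiDAR thread
    camera_out = queue.Queue()   # 2D boxes + classes from the camera thread

    def lidar_worker(n_frames=3):
        for k in range(n_frames):
            # one LiDAR sweep -> ground removal -> clustering -> box fitting -> tracker
            tracked_boxes = [{"id": 1, "centroid": (5.0, 0.2, 0.0), "frame": k}]
            lidar_out.put(tracked_boxes)
            time.sleep(0.01)

    def camera_worker(n_frames=3):
        for k in range(n_frames):
            # one image -> visual detector (e.g. YOLOv3) -> class proposals
            detections = [{"class": "car", "bbox2d": (300, 180, 80, 60), "frame": k}]
            camera_out.put(detections)
            time.sleep(0.01)

    threads = [threading.Thread(target=lidar_worker), threading.Thread(target=camera_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Late fusion: the class proposal updates the class vector of the matching track.
    while not lidar_out.empty() and not camera_out.empty():
        print("fuse track", lidar_out.get()[0]["id"], "with class", camera_out.get()[0]["class"])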


FIGURE 3. Sensor setup and implementation platform.

V. 3D MODT
The proposed 3D MODT framework comprises two threads pertaining to the processing of the 3D LiDAR and the camera, respectively. The thread for processing the 3D LiDAR point cloud is composed of the sub-modules ground segmentation, clustering, box fitting, and tracking, whereas the thread responsible for treating images from the camera is composed of YOLOv3 [42], implemented as a ROS package to provide object class information. The operation and structure of each sub-module is explained in the subsequent subsections.

A. GROUND CLASSIFICATION
Ground classification is an essential pre-processing task in which the LiDAR point cloud is partitioned into ground and non-ground measurements. The portion of the point cloud classified as ground can be further processed for road markings, curb detection, traversable area, and path planning tasks, whereas the part of the point cloud corresponding to non-ground LiDAR measurements is effectively used for the tasks pertaining to 3D object detection. Several approaches for ground classification exist in the literature that largely vary in terms of sensor setups and assumptions made about the environment. The prominent strategies for ground classification utilize scan-rings [26], voxels [27], a height threshold [28], or feature learning [29]. The scan-ring based approaches are generally applicable to single-LiDAR setups, in which the distance between consecutive scan lines is studied for ground classification. Voxelization of the point cloud into 2D or 3D space is also a common practice to scale down the number of measurements for estimation. Similarly, under the assumption of a planar ground environment, setting a height threshold is enough for ground classification. On the other hand, some approaches utilize neural networks to address the classification of the sparse LiDAR point cloud. In this work, the possibility of a non-planar ground is considered and the point cloud is assumed to be a merger of multiple calibrated LiDARs that are arbitrarily positioned. This assumption rules out the approaches that rely on a height threshold and scan-rings. Furthermore, the variability in the number and positioning of LiDAR sensors, and the constraints of embedded computing, limit the use of learning-based approaches.

FIGURE 4. 2D polar grid for ground classification.

The approach adopted in this work involves indexing of the point cloud into a 2D array that processes the classification task efficiently. Each cell of the array contains the indexes of the point cloud measurements that belong to a section of a vertically sliced cylinder divided into channels and bins, as shown in Fig. 4. Each channel is traversed independently, directed outward from the vehicle, to estimate the ground level in each cell of the grid. The sensor height from the ground is considered as the initial ground level, and the slope to the lowest measurement of the consecutive cell is computed. A slope exceeding a threshold relates to a cell containing non-ground measurements, and the previous ground level is maintained, whereas a slope within the threshold limit updates the ground level for subsequent cells. With all cells of the grid assigned a ground level, the point cloud is segregated with a tolerance parameter to remove the edge noise.
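The per-channel traversal described above can be sketched as follows. This is a minimal illustration with a simplified point format and placeholder parameter values; it is not the authors' exact implementation.

    import math
    import numpy as np

    def ground_segment(points, sensor_height=1.8, n_channels=180, n_bins=80,
                       r_max=80.0, slope_thresh=0.15, tol=0.15):
        """Label points as ground (True) / non-ground (False) with a polar-grid slope test."""
        r = np.hypot(points[:, 0], points[:, 1])
        ch = ((np.arctan2(points[:, 1], points[:, 0]) + math.pi)
              / (2 * math.pi) * n_channels).astype(int) % n_channels
        bi = np.minimum((r / r_max * n_bins).astype(int), n_bins - 1)

        ground = np.zeros(len(points), dtype=bool)
        for c in range(n_channels):
            ground_level, prev_r = -sensor_height, 0.0
            for b in range(n_bins):                      # walk outward along the channel
                idx = np.where((ch == c) & (bi == b))[0]
                if idx.size == 0:
                    continue
                lowest = idx[np.argmin(points[idx, 2])]
                dz = points[lowest, 2] - ground_level
                dr = max(r[lowest] - prev_r, 1e-3)
                if abs(dz / dr) < slope_thresh:          # gentle slope: update local ground level
                    ground_level, prev_r = points[lowest, 2], r[lowest]
                ground[idx] = points[idx, 2] < ground_level + tol
        return ground

    # Tiny synthetic example: two points on a flat road plus one elevated point.
    pts = np.array([[2.0, 0.0, -1.8], [10.0, 0.0, -1.75], [10.0, 0.1, -0.3]])
    print(ground_segment(pts))   # expected: [ True  True False ]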


The proposed ground classification module, in comparison to the module developed in prior work [23], is optimized further to reduce the process time to about one half. The approach in the former work for ground classification was similar; however, the point cloud was traversed multiple times for data representation into the cylindrical grid, labeling of grid cells, and labeling of LiDAR points, respectively. The ground classification module evaluated on similar datasets processed the data in an average time of 7.8 ms, which prior to optimization consumed 15.7 ms. Similarly, the process time on the embedded system reduced to an average of 39.1 ms from 64.9 ms. The execution times are further expressed in the evaluation section. The key modifications for optimization are:
• The lowest and highest point of each cell is found while the indexes of the point cloud are being distributed into the grid.
• Instead of traversing each channel twice, the estimation of the slope and local ground level along the bins is carried out in a single traversal.
• The iterators for the point cloud indexes are efficiently utilized to form point clouds for ground and non-ground points, skipping the step of labeling the bins.

In addition to the processing speed of the computational platform and the optimized programming approach, the parameters for data representation contribute to the overall processing time. These include the range of measurements from the sensor R_range, the number of LiDAR measurements in a time step, the LiDAR field of view to be considered FOV, and the resolution of the grid-based representation, expressed in terms of the grid cell area,

    GA_cell = (FOV · π) / (channels · 180) · (R_i² − R_{i−1}²),   (1)

    R_i = (R_range / Bins) · i,   where i = 0, 1, 2, . . . , Bins.   (2)

The number of bins and channels determines the area and number of grid cells that need to be traversed for the slope test and local ground estimation. A higher resolution demands additional processing, whereas a lower resolution provides a compact representation. Once the ground is classified, the point cloud pertaining to the non-ground LiDAR measurements is presented to the clustering module.

B. CLUSTERING
The concept of clustering is to group entities based on some similarity [30]. Clustering a LiDAR point cloud such that each cluster corresponds to a unique object is a challenging task, due to sparsity and the lack of texture information. The clustering approaches generally utilize connectivity [28], centroids [30], density [31], distributions [32], or learned features of the LiDAR measurements. Connectivity or hierarchy-based approaches rely on the proximity of neighboring measurements and expand iteratively.

The centroid-based approaches require prior knowledge of the number of clusters to divide the data into, such as K-means, Gaussian Mixture Models, and Fuzzy c-means. Density-based approaches identify high-density regions for clustering, but the density of LiDAR measurements radially decreases as a function of distance from the sensor. Moreover, occluded measurements further affect the densities; thus density-based clustering approaches are not effectively applicable in the 3D object detection paradigm.

The distribution-based clustering methods utilize distribution models to fit potential clusters of objects, providing more information compared to density-based methods, but at the cost of complexity. However, the absence of a distribution model and measurements under partial occlusion hinder proper clustering. Similarly, learning-based approaches train a neural network for a set of optimization functions/criteria to cluster, or perform per-point classification such as [14], [33]. These approaches prove to be effective for 3D object detection tasks but require excessive computational resources, beyond the constraints of embedded platforms.

In this work, the LiDAR point cloud is clustered using a connectivity-based approach. To reduce the complexity, a 3D cylindrical grid is used instead of point-wise clustering. The advantage of a 3D grid over a 2D grid is to cater for the measurements pertaining to elevated structures, like traffic lights and bridges. Furthermore, the cylindrical grid can address the sparsity of measurements that are far from the sensor. The point cloud for clustering is represented in a 3D array, where each cell contains the corresponding indexes of points.

The 3D array is processed through a 3D connected component clustering approach to group the grid cells in proximity. The formulation of clustering traverses all the cells of the 3D array and examines the immediate neighbors for a minimum number of cells to include in a cluster. The clusters of the point cloud are filtered based on dimensions; large clusters generally correspond to buildings, while very small clusters either belong to noise, insignificant obstacles, or over-segmentation. Furthermore, the clusters elevated from the ground are also filtered, as the intention is to track the moving objects on the ground. The remaining clusters are treated with the box fitting task to estimate the pose of the object and the centroid, further explained in the subsequent subsection.

Like ground segmentation, the clustering module developed in previous work [23] is optimized and the process time is substantially reduced. The clustering module developed in the former work used a rectangular grid-based representation of the point cloud, requiring a relatively higher resolution. Furthermore, to cluster the occupied grid cells, all 26 neighbors of each cell were traversed for an occupancy check. The average process time of clustering together with the box fitting task on similar datasets is reduced to 3.66 ms from 14.3 ms, and on the embedded system the process time is reduced to 17.3 ms from 31.18 ms. The execution times at the modular level are further explained in the evaluation section. The key modifications in the clustering module are listed below:
• The rectangular grid is replaced with a 3D cylindrical grid, to exploit the point cloud of a single LiDAR instead of the merged point cloud of three LiDARs.
• Instead of searching the 26 neighbors of a grid cell for clustering, the 6 immediate neighbors are traversed.

In the clustering process, unlike the 2D representation of the LiDAR data for ground classification, a 3D or volumetric representation is adopted. In addition to the bins and channels, the vertical range V_range of the LiDAR cloud is divided into layers (levels). The number of LiDAR measurements, however, is reduced to only the non-ground measurements within the FOV and range R_range. The volume of a grid cell is then represented as

    GV_cell = (FOV · π) / (channels · 180) · (V_range / levels) · (R_i² − R_{i−1}²).   (3)
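As a quick numerical illustration of (1)–(3), the snippet below computes the cell area and volume for an assumed configuration. The parameter values are placeholders for illustration, not the values used by the authors, and the functions simply evaluate the expressions as written above.

    import math

    def ring_radii(r_range, n_bins):
        # R_i = (R_range / Bins) * i, i = 0..Bins, as in (2)
        return [r_range / n_bins * i for i in range(n_bins + 1)]

    def cell_area(fov_deg, n_channels, r_outer, r_inner):
        # GA_cell as written in (1)
        return fov_deg * math.pi / (n_channels * 180.0) * (r_outer**2 - r_inner**2)

    def cell_volume(fov_deg, n_channels, v_range, n_levels, r_outer, r_inner):
        # GV_cell as written in (3): the cell area scaled by the height of one vertical level
        return cell_area(fov_deg, n_channels, r_outer, r_inner) * (v_range / n_levels)

    # Assumed configuration: 360 deg FOV, 80 m range, 180 channels, 80 bins,
    # 4 m of vertical range split into 10 levels.
    R = ring_radii(80.0, 80)
    print(round(cell_area(360, 180, R[1], R[0]), 4))               # innermost cell area
    print(round(cell_volume(360, 180, 4.0, 10, R[40], R[39]), 3))  # a mid-range cell volume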


The clustering method adopted in this work requires that adjacent cells of the grid pertaining to a unique object are populated with LiDAR point cloud indexes. Therefore, optimal resolution parameters are desired to set GV_cell, as a higher resolution results in over-segmentation despite the increased computational resource, whereas a low resolution representation tends to cluster LiDAR measurements pertaining to objects in proximity as a single object. Therefore, the volume GV_cell provides a balance between performance and computation time.

FIGURE 5. Cylindrical grid for clustering.

The clustering module includes the task of box fitting, which greatly affects the overall performance of tracking. Even if the tracked object undergoes a partial occlusion, the dimension and pose history of the tracked object still contributes to recovering the accurate centroid and pose, handled by the track management module.

C. BOX FITTING
Box fitting of clustered LiDAR point cloud data is an essential and challenging task, as the measurements are always occluded because of obstructions in the sensor line-of-sight. An efficient box fitting technique estimates the correct object pose and centroid, considering the partial measurements. Several approaches tend to address this problem by either model-based [34] or feature-based methods [35]. The model-based methods match the raw point cloud with a known geometric model, whereas feature-based approaches exploit the edge features to estimate the pose. The lack of generality and excessive computational requirements bar the use of model-based approaches in MODT applications. Similarly, the selection of features that best describe the object pose is a difficult task. Currently, neural networks are also trained for the feature selection process, as utilized in [10]. However, a change in sensor setup often requires labeling of datasets and retraining of networks.

In this work, considering the computational constraints, a feature-based method is utilized that performs the L-shape point cloud fitting within a minimum rectangle area. Initially, the indexes of points with coordinates that define the minimum fitting box are traversed to identify the corners of the clustered point cloud on the horizontal axis. The farthest corners, based on the dimensions and location of the object cluster, are used to formulate a line, and all points of the cluster are traversed to find the farthest point from the line as a third corner. Using the three corners, the dimensions of the bounding box and the centroid are updated. Lastly, the pose of the clustered object is calculated about the updated centroid. Since the presence of occlusions affects the correct pose and centroid estimation, the tracker module maintains the history of the tracked object and heuristically adjusts the dimensions and pose of the object temporally. The information flow is expressed in Fig. 6, where points a and b are the farthest points of the cluster, identified by the maximum and minimum coordinates of the cluster, respectively.

FIGURE 6. L-shape box fitting.

The box fitting task is performed within the clustering module, and finding the farthest points in the clusters contributes to the overall process time. The key factors modified to acquire an optimized process time are as follows (a sketch of the resulting procedure is given after the list):
• Instead of traversing all the points, the points corresponding to the minimum and maximum coordinates of the cluster are exploited.
• The farthest points of the cluster are heuristically found by making use of the dimensions and cluster position with respect to the sensor.
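A minimal sketch of the L-shape fitting step is given below. It assumes the cluster is provided as an array of x, y points and uses a simplified choice of the two extreme corners, so it should be read as an illustration of the idea rather than the exact implementation.

    import numpy as np

    def l_shape_box(cluster_xy):
        """Fit an oriented rectangle to a 2D cluster using two extreme corners and the
        point farthest from the line joining them (the 'L' corner)."""
        pts = np.asarray(cluster_xy, dtype=float)
        # Corners a and b: extreme points of the cluster footprint (simplified selection).
        a = pts[np.argmin(pts[:, 0] + pts[:, 1])]
        b = pts[np.argmax(pts[:, 0] + pts[:, 1])]
        ab = b - a
        # Third corner c: point with the largest perpendicular distance to line a-b.
        dist = np.abs(ab[0] * (pts[:, 1] - a[1]) - ab[1] * (pts[:, 0] - a[0]))
        dist = dist / (np.linalg.norm(ab) + 1e-9)
        c = pts[np.argmax(dist)]
        # Box edges follow a->c and c->b; yaw is taken from the longer edge.
        e1, e2 = c - a, b - c
        length, width = np.linalg.norm(e1), np.linalg.norm(e2)
        edge = e1 if length >= width else e2
        yaw = float(np.arctan2(edge[1], edge[0]))
        centroid = (a + b) / 2.0                  # rectangle center from opposite corners
        return centroid, max(length, width), min(length, width), yaw

    # Example: an L-shaped set of points from the two visible faces of a box.
    cluster = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    print(l_shape_box(cluster))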


D. TRACKING
Multi-object tracking is an essential component in the perception pipeline of autonomous vehicles. The object tracking capabilities can enable the system to make better decisions on the actions to perform in cluttered environments. An extensive literature on 2D MOT algorithms exists that focuses on object tracking in the image plane [36], where the objective is to maintain unique identities and to provide temporally consistent locations of detected objects. However, tracking of objects in 3D space is becoming popular in the research community [19], as more and more 3D MOT schemes are being proposed. The 3D MOT systems in general share similar components with 2D MOT systems, apart from the distinction of object detections in 3D space instead of the image plane. This potentially allows the design of motion models, data association, occlusion handling, and trajectory maintenance directly in 3D space without perspective distortion.

An autonomous vehicle acts like a dynamic system that is required to track objects in the environment. In this scenario, the tracked objects tend not to follow a regular motion pattern, giving rise to motion uncertainties. Similarly, the cluttered environments and sensor limitations impose partial or complete occlusion of objects, which adds to the uncertainty in position and pose of the tracked objects. To perform tracking in the presence of uncertainties, Bayesian filtration strategies are usually deployed, where the state estimation is carried out with either the assumption of a Gaussian mixture for the density or a Gaussian distribution for the transition. These assumptions lead to the use of the Gaussian Mixture Probability Hypothesis Density Filter (PHDF) [37] or the Joint Probabilistic Data Association Filter (JPDAF) [38] for state estimation, respectively, whereas under the non-Gaussian assumption Particle Filter (PF) methods are used [39]. The states of the tracked objects are updated after the association between tracks and detection information is established.

In this work, like the former implementation in [23], the uncertainties due to clutter are addressed with an assumption of Gaussian distribution, and a JPDAF is applied for data association. Similarly, the uncertainties due to motion are handled by an Interacting Multiple Model (IMM) to perform non-linear prediction of states for the tracked objects. To cater for the non-linearities of the motion models under a Gaussian process, an Unscented Kalman Filter (UKF) is utilized. The implementation of IMM-UKF-JPDAF is an approach that efficiently addresses the problem of recursively estimating states and mode probabilities of targets, described by a jump Markov non-linear system, in the presence of clutter.

The trackable objects are assumed to follow r motion models M = {M_j}_{j=1}^r, represented as a non-linear stochastic state space model,

    x_{k+1} = f_j(x_k, u_k) + w_{j,k},   (4)
    z_k = h_j(x_k, u_k) + v_{j,k},   (5)

which operates the system function f_j and measurement function h_j, with the input vector u_k ∈ R^p, state vector x_k ∈ R^n, and measurement vector z_k ∈ R^q at each time step k, where the zero-mean Gaussian noise sequences w_{j,k} and v_{j,k} are mutually independent with covariance matrices Q_{j,k} and R_{j,k}, respectively. Moreover, the progression of the system among the r models is considered a first-order Markov chain with the time-invariant Markovian model transition probability matrix:

    Π = [ p_{11}  · · ·  p_{r1} ]
        [   ⋮      ⋱      ⋮   ]  ∈ R^{r×r}.   (6)
        [ p_{1r}  · · ·  p_{rr} ]

The elements p_{ij} of the matrix represent the mode transition probability from model i to model j.

The proposed IMM-UKF-JPDAF tracker follows a five-step process: (a) interaction, (b) state prediction and measurement validation, (c) data association and model-based filtering, (d) mode probability update, and (e) combination. A similar approach for a single target is explained in [40], which addresses the data association of measurements to a single tracked object. In comparison, a JPDAF is deployed in the proposed framework to perform tracking of multiple objects. This requires the computation of an association probability between each track and measurement, while considering all feasible joint association events across all measurements, leading to a combinatorial explosion problem.

To mitigate the possible combinatorial explosion, a clustering technique is adopted, where the association matrix is clustered into the sets of marginal θ_{j,q} and joint association events Θ = ⋂_{j=1}^{N_k} θ_{j,q_j}. The number of clusters equals the sum of marginal and joint association events. The clustering technique helps in mitigating the combinatorial explosion of hypotheses that naturally grows in cluttered environments. Furthermore, the covariance of a track prediction increases with unassociated measurements in consecutive time steps, consequently increasing the gate area for association. The larger gate area results in a larger number of joint association events.

With the hypotheses of all possible occurrences of events within every cluster, the marginal association probabilities are computed, that is, the probability sum of the joint association events given that measurement j belongs to track q:

    β_{jq}^{cl} = Σ_{Θ^{cl}} P{Θ^{cl} | z_k} [ω̂_{jq}^{cl} ∈ Θ^{cl}],
        j = 1, . . . , N^{cl} and q = 1, . . . , T^{cl},   (7)

    P{Θ^{cl} | z_k} = (1/c) ∏_{j=1}^{N^{cl}} g_{jq} ∏_{q=1}^{T^{cl}} P_D ∏_{j=1}^{N^{cl}} (1 − P_D)^{δ_q} β^{φ_j},   (8)

where ω̂_{jq}^{cl} represents the joint association event within the cluster cl of N measurements and T tracks. Furthermore, g_{jq} is the likelihood of measurement j being associated with track q, normalized by a factor c. Moreover, δ_q and φ_j represent the number of unassociated tracks and measurements, respectively, within the cluster.
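To illustrate how the transition matrix in (6) enters the interaction step (a) of the five-step cycle, the following is a minimal sketch of the standard IMM mixing computation for r models. These are the generic IMM mixing equations, not code from the paper, and the two-model example values are purely illustrative.

    import numpy as np

    def imm_interaction(mu, Pi, states, covs):
        """IMM mixing: produce the mixed initial state/covariance for each model.
        mu:     (r,) mode probabilities from the previous step
        Pi:     (r, r) transition matrix, Pi[i, j] = P(model j at k | model i at k-1)
        states: list of r state vectors; covs: list of r covariance matrices
        """
        r = len(mu)
        c_bar = Pi.T @ mu                              # predicted mode probabilities
        mix = (Pi * mu[:, None]) / c_bar[None, :]      # mixing weights mix[i, j]
        mixed_states, mixed_covs = [], []
        for j in range(r):
            xj = sum(mix[i, j] * states[i] for i in range(r))
            Pj = sum(mix[i, j] * (covs[i] + np.outer(states[i] - xj, states[i] - xj))
                     for i in range(r))
            mixed_states.append(xj)
            mixed_covs.append(Pj)
        return c_bar, mixed_states, mixed_covs

    # Two-model example (e.g. constant velocity vs. turning) sharing a 2D state here.
    Pi = np.array([[0.95, 0.05],
                   [0.10, 0.90]])
    mu = np.array([0.7, 0.3])
    x = [np.array([1.0, 0.5]), np.array([1.2, 0.4])]
    P = [np.eye(2), 2 * np.eye(2)]
    print(imm_interaction(mu, Pi, x, P)[0])   # predicted mode probabilities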


Subsequently, the weighted measurement residual z̃_{i,q,k} is computed for each corresponding model i of the filter according to the associated measurement set:

    z̃_{i,q,k} = Σ_{j=1}^{N^{cl}} β_{j,q}^{cl} z̃_{i,j,q,k}.   (9)

The cross-covariance matrix C_{x_k,z_k} between the predicted states and measurements is used together with the innovation covariance matrix S_k to calculate the optimal Kalman gain K_k. Subsequently, the states and covariances of each model for the corresponding tracks are updated:

    K_k = C_{x_k,z_k} S_k^{−1},   (10)
    x_{i,q,(k|k)} = x_{i,q,(k|k−1)} + K_{i,q,k} z̃_{i,q,k}.   (11)

The states, covariances, and mode probabilities of the IMM-UKF-JPDAF are recursively estimated with the help of the individual model likelihoods. The individual filter states and covariances are combined into a single weighted output using the mode probabilities of the tracks. The flow of the tracker module is elaborated in Fig. 7, along with the clustered association matrix, forming three association clusters as sub-problems for the tracker. Furthermore, a tracked object is shown that presents the tracking parameters along with the class and confidence percentage.

FIGURE 7. IMM-UKF-JPDAF tracker with clustered associations.

The execution times of the tracker module mainly rely on the number of maintained tracks. The former implementation [23], without visual classification, pruned the tracks merely based on inconsistent measurements, resulting in a large number of tracks to maintain. With the additional condition of a tracked object being classified by the visual object detector, the number of tracks is reduced, resulting in a decreased process time. The average execution time of the tracker, including track management and visual class association, on similar datasets is 6.4 ms and 18.5 ms on the desktop and embedded computing platforms, respectively, whereas the former implementation consumed 8.3 ms on the desktop and 24.6 ms on the embedded system, without visual class association. The key factor responsible for optimizing the process time is an efficient track management module that limits the false positive measurements for tracking.

E. OBJECT CLASSIFICATION
The paradigm of fusing multiple modalities followed in this work can be regarded as late fusion, where the tracked clusters of point clouds are classified. The classification of tracked point cloud clusters relies on two components: class association and class management.
• The class association involves the process of assigning a visually detected object's class to the tracked point cloud cluster.
• The class management utilizes the assignment history to maintain and select a class for the tracked object.

The visual object detection is carried out using YOLOv3 [42], which is pretrained on the Microsoft COCO dataset [43], and the detection classes are limited to trackable dynamic objects. YOLOv3 uses a Darknet-53 backbone (a CNN model with 53 convolutional layers) and delivers 57.9 mAP (AP50) on Microsoft's COCO dataset, using an input resolution of 608 × 608 pixels. The optimized variants of the network can perform in real time on embedded systems, but at the cost of compromised accuracy.

In this work, the input image resolution is tuned to 416 × 416 pixels, which maintains the process time well below 100 milliseconds on the embedded system. However, the reduction in input image resolution results in missed detections due to size, saturation, and exposure issues in the image. The tracker module handles the missed detections with the class vector that probabilistically assigns the class to objects. In addition, the LiDAR range is limited to 60-80 m; beyond this range the point cloud is too sparse for accurate estimation of object dimensions and pose. A visual detector detecting an object beyond this range only adds additional complexity to the class association process.

Let T^k and D^k be the sets of maintained tracks and visually detected objects, respectively, at time step k. The corrected centroids of the tracked clusters o_i^k are projected onto the image frame of the corresponding time stamp, resulting in a 2D pixel location in the image ō_i^k. Similarly, the localization and dimensions of the visually detected objects are used to calculate the centroids m_j^k. Using the 2D centroids from both sources, the Euclidean distance cost matrix E^k = [c_{ij}^k] is populated, where i = {1, 2, . . . , T} and j = {1, 2, . . . , D}, constrained by the criterion that at least 30% overlap exists among the corresponding 2D bounding boxes:

    c_{ij}^k = { d(ō_i^k, m_j^k)   if iou(t̄_i^k, d_j^k) > 0.3
               { 1000              otherwise.   (12)
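A minimal sketch of building the gated cost matrix of (12) and solving the resulting assignment is shown below. It assumes projected track centroids and 2D detections are already available, and it uses SciPy's Hungarian solver as a stand-in for the Munkres association discussed next; names and thresholds are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou_2d(a, b):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def class_association(track_centroids, track_boxes, det_centroids, det_boxes,
                          iou_gate=0.3, big_cost=1000.0):
        """Cost per (12): centroid distance if the boxes overlap enough, else a large constant."""
        T, D = len(track_boxes), len(det_boxes)
        cost = np.full((T, D), big_cost)
        for i in range(T):
            for j in range(D):
                if iou_2d(track_boxes[i], det_boxes[j]) > iou_gate:
                    cost[i, j] = np.linalg.norm(np.subtract(track_centroids[i], det_centroids[j]))
        rows, cols = linear_sum_assignment(cost)          # Munkres / Hungarian step
        return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < big_cost]

    # One track and one detection that overlap: the pair (0, 0) should be returned.
    print(class_association([(110, 110)], [(100, 100, 160, 150)],
                            [(112, 108)], [(105, 95, 165, 145)]))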


Following the Munkres association strategy, an optimized minimum-cost assignment is performed and the set of index pairs ϒ pertaining to associated tracks t_i^k and visual detections d_j^k is obtained. Using the set ϒ, the class association matrix Ê^k = [ĉ_{ij}^k] can be formulated such that

    ĉ_{ij}^k = { v_j   if ⟨i, j⟩ ⊂ ϒ
               { 0     otherwise,   (13)

where v is the number that represents the class of the visually detected object and the dimension index of the class association vector A_v^i = (a_1^i, . . . , a_n^i). The matrix Ê^k is used to update the class association vector A^i with an increment in the associated class dimension,

    a_{ĉ_{ij}}^i = { a_{ĉ_{ij}}^i        if ĉ_{ij}^k = 0
                   { a_{ĉ_{ij}}^i + 1    otherwise.   (14)

The updated vector A^i, along with the age of the track t_age, is utilized to compute the class certainty P_c^i of the tracked object and the ratio P_o^i of the object, which reasons whether the tracked object is an outlier:

    P_c^i = max(a_v) / t_age,   (15)

    P_o^i = (t_age − Σ_{v=1}^{n} a_v^i) / t_age.   (16)

FIGURE 8. Visual object detector to classify tracked objects using the class vector.

In Fig. 8, the flow of object classification is presented, which runs in parallel with the LiDAR based detection and tracking thread. Furthermore, the class vector of a mature track is shown that represents the age and counts of class associations, resulting in a class certainty of 66.6%. Moreover, the red and blue dots in the image represent the projected centroid of the cluster and the center of the visually detected objects, respectively.

The tracked objects maintain a class vector with each dimension registered to a visually detectable class. At the fusion step, after a successful association, a unit increment is made in the corresponding class dimension of the vector. The maximum count in a dimension of the class vector specifies the object class, whereas the certainty is computed against the life of the track.

F. TRACK MANAGEMENT
The MODT task in the presence of uncertainties pertaining to classification, clutter, and motion of objects requires a robust track managing module to maintain and provide reliable information. The main purpose of the track management module is to initialize and maintain track statistics, handle occlusion of the tracked object, and prune out tracks pertaining to false positive measurements.

The track managing module initiates new tracks for unassociated measurements with a unique identity and records the track age in terms of frame count. In addition, the dimensions and pose of the tracked object are retained while considering the LiDAR properties. The measurements related to objects moving farther from the sensor tend to experience increased occlusions and report decreased dimensions. On the other hand, objects approaching closer to the sensor get more exposure and provide comparatively more accurate dimensions. A similar pattern is observed in the estimated pose of the object and is handled by smoothing the sudden changes in the yaw angles. The accuracy of the maintained dimensions and pose aids the occlusion handling and centroid correction, as the centroid of the measurement pertaining to the object under occlusion is also shifted proportionally to the change in the dimensions. By capitalizing on the sensor characteristics and the maintained information, the centroid C of the tracked object is corrected using the change in length ΔL, width ΔW, height ΔH, and yaw ϕ:

    C'_x = C_x + ΔL · cos(ϕ) / 2,   (17)
    C'_y = C_y + ΔW · sin(ϕ) / 2,   (18)
    C'_z = C_z + ΔH.   (19)

FIGURE 9. Tracks manager with pose, dimensions, and centroid corrector.

The tasks associated with the track management module are presented in Fig. 9, with an example of centroid correction. The mature track of an object retains the dimensions as width W and length L, which are compared with the measured dimensions W_m and L_m to find the differences ΔW and ΔL. The change in dimensions is utilized together with the position of the object relative to the sensor to acquire the correct centroid C'. Furthermore, the mature tracks of objects represent the measured dimensions by a wire-frame bounding box, along with a red box with the correct dimensions and centroid. An initialized track requires measurement association for five consecutive time-steps to be regarded as a mature track.
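The class-vector bookkeeping of (13)–(16) can be sketched as follows; the class list and the simplified track structure are illustrative stand-ins, and the thresholds mentioned in the comment come from the text.

    CLASSES = ["car", "pedestrian", "cyclist"]   # trackable classes (illustrative)

    class Track:
        def __init__(self):
            self.age = 0                             # t_age, in frames
            self.class_counts = [0] * len(CLASSES)   # class association vector A^i

        def step(self, associated_class=None):
            """Advance one frame; increment the class dimension on a successful association (14)."""
            self.age += 1
            if associated_class is not None:
                self.class_counts[CLASSES.index(associated_class)] += 1

        def class_certainty(self):                   # P_c^i, eq. (15)
            return max(self.class_counts) / self.age if self.age else 0.0

        def outlier_ratio(self):                     # P_o^i, eq. (16)
            return (self.age - sum(self.class_counts)) / self.age if self.age else 0.0

        def label(self):
            return CLASSES[self.class_counts.index(max(self.class_counts))]

    # A track seen for 6 frames, visually classified as a car in 4 of them.
    t = Track()
    for hit in ["car", None, "car", "car", None, "car"]:
        t.step(hit)
    print(t.label(), round(t.class_certainty(), 2), round(t.outlier_ratio(), 2))
    # -> car 0.67 0.33 ; tracks with outlier_ratio above 0.6 would be pruned as outliers.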


If a track misses an association measurement without getting a classification from a visual detector within the maturity period, the track is pruned out. Moreover, all mature tracks whose ratio P_o^i exceeds 60% are filtered out as outliers. Furthermore, mature tracks that share common measurements for five consecutive time-steps result in the pruning of the inconsistent, or younger, track. The trajectory of the tracked object, including the position, pose, and time, is stored and utilized to calculate the relative heading direction, velocity, and angular velocity of the tracked objects at every time-step, as shown in Fig. 9. In addition, the class vector A^i is updated, which provides the class certainty P_c^i to the tracked objects.

VI. EVALUATION
The KITTI datasets [44] are widely accepted as a standard evaluation platform for MOT tasks. However, a tool to evaluate 3D MODT systems directly in 3D space is not provided by the KITTI dataset. The convention for evaluating 3D MOT systems is to project the 3D tracking results to the 2D image plane to perform the evaluation. Recently, an extension to the official KITTI 2D MOT evaluation was proposed in [19], where the cost function is modified from 2D IoU to 3D IoU and the tracking evaluation is performed directly in 3D space. As the proposed framework performs MOT in 3D space, the extension for 3D MOT evaluation is more relevant. In the proposed framework, however, MODT is jointly performed, where an object being tracked may require several time steps before getting an accurate class, dimensions, and orientation. To evaluate the true potential and allow a fair comparison with state-of-the-art systems, MODT is evaluated on the KITTI datasets with ground truths [44], based on the MOT16 evaluation metrics proposed in [41], whereas the MOT component is independently evaluated using an off-the-shelf 3D object detector [45], [46] against the 3D MOT evaluation extension [19].

The metrics used to evaluate the proposed MODT are: (a) tracker-to-target assignment, (b) multi-object tracking accuracy MOTA, (c) multi-object tracking precision MOTP, and (d) track quality. The metric for tracker-to-target assignment measures the number of False Positives (FP), False Negatives (FN), and ID Switches (IDSW), where FP and FN deal with incorrect associations of the measurements. Furthermore, the IDSW determines the number of ID switches across all the frames for an object. The metric for MOTA is computed by:

    MOTA = 1 − Σ_t (FN_t + FP_t + IDSW_t) / Σ_t G_t,   (20)

where t is the frame index and G is the ground truth value. A negative MOTA indicates that the number of errors has exceeded the actual number of objects; the metric is maximized at 100. Furthermore, the metric MOTP is evaluated by the average 3D IoU of the tracked objects. The metric for track quality is described by the classification of a track into: Mostly Tracked (MT), Partially Tracked (PT), and Mostly Lost (ML). This measures the extent of the ground truth trajectory G recovered by the tracker. A target is MT if it is successfully tracked for at least 80% of its life span; the IDSW number is irrelevant in this metric, as the ID need not remain the same throughout the track. If the recovered track covers less than 20% of its total length, it is said to be ML. Other tracks fall under the class of PT. A higher number of MT and few ML is desirable.

In this work, a criterion is proposed that overlaps the ground truth information pertaining to the camera and LiDAR frame as a reference. Hence, a subset of the ground truth is attained such that: (a) the object is in the FOV of the LiDAR and camera, (b) the object exists within 40 meters range from the sensor, and (c) the lifetime of a track is specified by the duration for which the first two conditions are true. The range criterion is set to 40 meters, as the Velodyne HDL-64E sensor used in the KITTI datasets has an effective measurement range of 20-40 meters [47]. Furthermore, instead of evaluating the detected objects and corresponding tracks frame-wise, the trajectories of the objects are compared with the provided ground truths, as the proposed framework temporally assigns the class to the tracked objects. Since the dimensions of objects are not provided frame-wise in the ground truth, the dimensions of the tracked objects evaluated across the life of the track are used for the evaluations. The raw data provided by the KITTI datasets under the category of 'City', with the ground truth annotations, are used for being most relevant to this implementation. However, the objects of type 'Tram', 'Misc', and 'Person sitting' are excluded from the evaluation, and contribute to FP if detected and tracked.
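The accounting behind (20) and the MT/PT/ML track-quality labels can be written compactly as below; the frame-wise error counts are illustrative inputs, not results from the paper.

    def mota(fn, fp, idsw, gt):
        """MOTA per (20): fn, fp, idsw, gt are per-frame lists of counts."""
        return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / float(sum(gt))

    def track_quality(recovered_frames, lifespan_frames):
        """MT if >= 80% of the ground-truth life span is recovered, ML if < 20%, else PT."""
        ratio = recovered_frames / float(lifespan_frames)
        return "MT" if ratio >= 0.8 else ("ML" if ratio < 0.2 else "PT")

    # Example: 5 frames with 3 ground-truth objects each and a few errors.
    print(round(mota(fn=[0, 1, 0, 0, 1], fp=[0, 0, 1, 0, 0], idsw=[0, 0, 0, 1, 0], gt=[3] * 5), 3))
    print(track_quality(recovered_frames=42, lifespan_frames=50))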


A. EVALUATION RESULTS FOR MODT
The raw KITTI dataset is provided with the ground truth for tracking information structured in XML format. Moreover, support to generate a similar XML file directly from the tracking algorithm is provided. To perform the evaluation and benchmarking, the MODT framework is programmed to produce an XML file similar in format to that of the ground truth. In addition, a MATLAB wrapper is also offered by the KITTI dataset to extract tracking information from the XML files to perform the evaluations. The evaluation results of 10 dataset sequences, along with the dataset recording number, frame count, and the number of trackable objects that qualify the defined criterion, are tabulated in Table 1.

TABLE 1. Tracking evaluation with datasets information.

The evaluated metrics in Table 1 reflect that the MODT algorithm performs reasonably within a 40 meters range. However, the metric scores degrade drastically with an increase in range. The main contributors are the FP and FN, as IDSW notably remains low. The error metric specifies the average Euclidean distance between the object centroids of the detector and the ground truth at every time step. The centroid error being around a meter shows that the measurements of the detector lie within the proximity of the object under occlusion. The metrics for the quality of tracks establish that two-thirds of the tracks fall under the category of MT. The ML metric is contributed mainly by the sparsity of LiDAR at larger distances. The overall tracking quality is affected mainly by the variation in the datasets, with the lowest quality resulting in datasets 1, 9, and 51. These datasets also contribute to higher FP and FN. The algorithm is sensitive to sudden changes in speed and sharp turns that lose the track, which is reflected in the IDSW count.

B. EXECUTION TIMES
The metric evaluations for the KITTI datasets are carried out on a desktop computer; however, the time complexity is measured on the Jetson board. The time consumed by the individual modules of the algorithm while executing in the respective computational environment is presented in Table 2. The overall performance of the algorithm, in terms of the metrics pertaining to accuracy, precision, and quality, is comparable to state-of-the-art MODT paradigms. Furthermore, the computational requirements adhere to the demands of embedded systems, as the overall execution cycle of the algorithm remains within the sampling time of the LiDAR. Moreover, ample time is at disposal at every time step for communication across the platform.

TABLE 2. MODT time consumption on desktop and Jetson board.

The best performing parameters are used for the evaluations while considering the constraints of computational resources and real-time requirements. The area of the grid cell for ground classification, the volume of the grid cell for the clustering task, and the input image resolution are the key factors for optimization to realize a real-time and resource-constrained implementation of the proposed 3D object detection and tracking framework.

C. EVALUATION RESULTS FOR MOT
The extension of the 2D MOT evaluation tool by KITTI for 3D MOT evaluation [19] includes integral metrics to better express the performance of the frameworks. The purpose is to average the MOTA and MOTP at different threshold values for detection scores, like the existing approach of average precision for object detection [48]. Thus, AMOTA is defined as

    AMOTA = (1/L) Σ_{r ∈ {1/L, 2/L, . . . , 1}} ( 1 − Σ_t (FN_t + FP_t + IDSW_t) / Σ_t G_t ),   (21)

where L is the number of recall values, set at 40, and all metrics are computed at recall value r. Furthermore, a scaled accuracy metric (sAMOTA) is used that provides an absolute measure of the system performance at a recall value r [19].
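A minimal sketch of the recall-sweep averaging in (21) is shown below, reusing a per-recall MOTA computation; the per-recall error counts are placeholder inputs rather than measured values.

    def mota_at_recall(errors_at_r, gt_total):
        """MOTA for one recall threshold: errors_at_r = (sum FN, sum FP, sum IDSW)."""
        return 1.0 - sum(errors_at_r) / float(gt_total)

    def amota(errors_per_recall, gt_total):
        """AMOTA per (21): average the per-recall MOTA over L evenly spaced recall values."""
        L = len(errors_per_recall)
        return sum(mota_at_recall(e, gt_total) for e in errors_per_recall) / L

    # Placeholder example with L = 4 recall thresholds instead of the 40 used in the paper.
    errors = [(40, 2, 1), (25, 5, 2), (15, 12, 2), (10, 30, 3)]   # (FN, FP, IDSW) at each r
    print(round(amota(errors, gt_total=200), 3))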


TABLE 3. 3D MOT evaluation results. leading score of precision in 2D MOT format. Furthermore,
the proposed method performs computation at the highest
speed of 236 FPS. Similarly, the identity switching count
ranks third among the compared approaches with a total
of 21. Thus, the proposed tracker can perform on a budgeted
computational resource, with performance metrics at par with
the state-of-the-art approaches.

TABLE 4. 3D MOT evaluation comparison.

FIGURE 10. Detection and tracking of vehicle and pedestrian.

VII. EXPERIMENTS ON PLATFORM


The proposed algorithm is put to test under the scenario of
fast emergency vehicle passing by and a model dummy being
evaluation tool. As KITTI dataset does not have an official swiftly dragged across the front of platform as a pedestrian.
train/validation split, following the approach in [19], [49] the Fig. 10. Shows that the framework efficiently tracks and
sequences 1, 6, 8, 10, 12, 13, 14, 15, 16, 18, 19 are used reports the tracking parameters in both scenarios. The emer-
for validation. Furthermore, a modification is made in the gency vehicle is identified as a car with certainty of 100%,
tracking module to comply with the input format of detector this implies that the tracked object in every time frame got
and tracking output readable for the evaluation tool. assigned the same class, while maintaining a unique ID, and
TABLE 3. 3D MOT evaluation results.

The 3D MOT evaluation tabulated in Table 3 depicts better overall metric scores in the car and cyclist categories compared to the pedestrian category. Furthermore, the lower scores in the pedestrian category are also reflected in the higher numbers of identity switches and fragmentation counts, together with a comparatively lower MT percentage. This is mainly because of datasets with crowded pedestrians in close proximity. The 2D MOT metrics show better weightages, as tracks are evaluated on the image plane at the best-performing recall threshold value. The speed of the tracker, consistently remaining above 200 FPS without GPU utilization, is an exceptional advantage, as the framework is intended to run on embedded systems.

The recent 3D MOT methods on the KITTI leaderboards, including FANTrack [20], Complexer-YOLO [15], DSM [16], and FaF [18], have not yet released code that could be evaluated on the new metrics. However, in AB3DMOT [19] the reproduced results of FANTrack are evaluated and compared using the 3D MOT metrics. Similarly, mmMOT [22], GTrkForecast [21], and 3DT [17] are evaluated using the same evaluation tool, against common datasets, to produce the corresponding metric results. The evaluation results of 3D MOT over the validation KITTI-car dataset are tabulated in Table 4. For a fair comparison, all 3D MOT methods are provided with the object detections obtained by PointRCNN [45], so that only the tracking performance of the methods is evaluated.

The comparison results suggest that the performance weightages lie close to the state-of-the-art approaches, with a leading score of precision in the 2D MOT format. Furthermore, the proposed method performs computation at the highest speed of 236 FPS. Similarly, the identity switching count ranks third among the compared approaches, with a total of 21. Thus, the proposed tracker can perform on a budgeted computational resource, with performance metrics at par with the state-of-the-art approaches.

TABLE 4. 3D MOT evaluation comparison.

FIGURE 10. Detection and tracking of vehicle and pedestrian.

VII. EXPERIMENTS ON PLATFORM
The proposed algorithm is put to the test under the scenarios of a fast emergency vehicle passing by and a model dummy being swiftly dragged across the front of the platform as a pedestrian. Fig. 10 shows that the framework efficiently tracks and reports the tracking parameters in both scenarios. The emergency vehicle is identified as a car with a certainty of 100%; this implies that the tracked object in every time frame was assigned the same class, while maintaining a unique ID and getting a measurement associated in the last time step.

The analysis of the evaluation results against the benchmarking datasets and the experiments performed on the platform highlighted some failure cases and limitations of the proposed framework. The framework is set to capture and start tracking an object moving at a relative speed of less than 80 km/h; an object entering the detection region with a larger relative speed fails to establish a mature track. This results in ID switches and poor estimation of the pose and tracking parameters. Furthermore, the correct dimensions of a tracked object that remains partially occluded throughout the track life cannot be estimated, which is reflected as an error in the estimated centroid.

The parameters of the yaw angle (in radians) and the speed of the emergency vehicle (in m/sec) are reported relative to the heading and speed of the platform. Similarly, the trail of the pedestrian track relative to the platform reports the tracking parameters, as the platform stops at a distance to let the pedestrian vacate the path. Here, the class certainty is lower than 100%, either because of missed visual detections or an incorrect class assignment in the track history.
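A minimal sketch of how such platform-relative quantities can be obtained from a track state and the ego state is given below. The (x, y, yaw, speed) state layout and the field names are assumptions made for illustration, not the framework's actual state representation.

# Hedged sketch: express a tracked object's heading and speed relative to the
# ego platform. The (x, y, yaw, speed) state layout is assumed for illustration.
import math

def wrap_to_pi(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def relative_state(track, ego):
    """track/ego: dicts with 'x', 'y', 'yaw' (rad) and 'speed' (m/s) in a common global frame."""
    rel_yaw = wrap_to_pi(track['yaw'] - ego['yaw'])
    # Simplified speed difference along the object's heading; a full treatment
    # would subtract the 2D velocity vectors instead.
    rel_speed = track['speed'] - ego['speed'] * math.cos(rel_yaw)
    # Object position expressed in the platform frame (for the reported track trail).
    dx, dy = track['x'] - ego['x'], track['y'] - ego['y']
    c, s = math.cos(-ego['yaw']), math.sin(-ego['yaw'])
    rel_x, rel_y = c * dx - s * dy, s * dx + c * dy
    return rel_x, rel_y, rel_yaw, rel_speed

# Example with made-up states for the platform and a passing emergency vehicle.
ego = {'x': 0.0, 'y': 0.0, 'yaw': 0.0, 'speed': 8.0}
emergency_vehicle = {'x': 12.0, 'y': -3.5, 'yaw': 0.05, 'speed': 21.0}
print(relative_state(emergency_vehicle, ego))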
VIII. ADDED POSSIBLE APPLICATIONS
The prime focus of this work is to perform multiple object detection and tracking. However, the proposed approach offers additional features that can benefit the overall autonomous vehicle architecture. For instance, the classification of static and dynamic regions in the scene can realize an
informed selection of visual features for visual odometry. Similarly, the motion patterns of dynamic objects can benefit path planning. Furthermore, the tracking of static objects can aid odometry, which is often required in dense urban environments where conventional localization methods become less reliable. In addition, instance-aware semantic segmentation of the visual scene can also be realized in a cost-effective way, requiring only the upsampling of the tracked LiDAR clusters projected onto the image, as demonstrated in Fig. 11.

FIGURE 11. MODT-based instance-aware semantic segmentation.
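A minimal sketch of the projection step on which such an instance-aware labelling would rely is shown below, assuming a standard pinhole intrinsic matrix and a LiDAR-to-camera extrinsic transform from the sensor calibration; all variable names are illustrative rather than taken from the framework.

# Hedged sketch: project the 3D points of a tracked LiDAR cluster onto the
# image plane so each hit pixel can inherit the track's ID and class label.
# K (3x3 intrinsics) and T_cam_lidar (4x4 extrinsics) are assumed to come from
# the sensor calibration; names are illustrative.
import numpy as np

def project_cluster(points_lidar, K, T_cam_lidar):
    """points_lidar: (N, 3) array in the LiDAR frame -> (M, 2) pixel coordinates."""
    N = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((N, 1))])   # homogeneous (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]                 # (3, N) in the camera frame
    in_front = pts_cam[2] > 0.1                           # keep points ahead of the camera
    uvw = K @ pts_cam[:, in_front]                        # pinhole projection
    uv = (uvw[:2] / uvw[2]).T                             # normalise by depth
    return uv

def paint_instance_mask(mask, uv, track_id):
    """Write a track ID into an integer mask at the projected pixel locations."""
    h, w = mask.shape
    px = np.round(uv).astype(int)
    valid = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    mask[px[valid, 1], px[valid, 0]] = track_id
    return mask

# Example with synthetic calibration values and a tiny three-point cluster.
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
T = np.eye(4)                                             # identity extrinsics for the demo
cluster = np.array([[0.5, 0.2, 6.0], [0.6, 0.1, 6.2], [0.4, 0.3, 5.9]])
mask = paint_instance_mask(np.zeros((375, 1242), dtype=int),
                           project_cluster(cluster, K, T), 7)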
IX. CONCLUSION
In this work, an efficient MODT framework is proposed for embedded systems that operate on visual-LiDAR setups. The framework takes advantage of spatial LiDAR data and 2D scene understanding by performing a temporal late fusion of the modalities. The framework is tested on well-established performance metrics against the publicly available KITTI datasets, whereas the tracking component is also independently tested on a 3D MOT benchmark for a fair comparison with state-of-the-art methods. It is intended to further extend this work in the future to improve the object detection by early classification of objects and to realize MODT-based semantic annotations of images. Moreover, MODT can be exploited to aid visual odometry in dense environmental conditions.

REFERENCES
[1] Car Crash Deaths and Rates—Injury Facts. Accessed: Apr. 29, 2020. [Online]. Available: https://injuryfacts.nsc.org/motor-vehicle/historical-fatality-trends/deaths-and-rates/
[2] M. Cunneen, M. Mullins, F. Murphy, D. Shannon, I. Furxhi, and C. Ryan, "Autonomous vehicles and avoiding the trolley (Dilemma): Vehicle perception, classification, and the challenges of framing decision ethics," Cybern. Syst., vol. 51, no. 1, pp. 59–80, Jan. 2020.
[3] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, "The architectural implications of autonomous driving: Constraints and acceleration," in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst., Mar. 2018, pp. 751–766.
[4] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "LaserNet: An efficient probabilistic 3D object detector for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12677–12686.
[5] J. Zhou, X. Tan, Z. Shao, and L. Ma, "FVNet: 3D front-view proposal generation for real-time object detection from point clouds," 2019, arXiv:1903.10750. [Online]. Available: http://arxiv.org/abs/1903.10750
[6] B. Wu, A. Wan, X. Yue, and K. Keutzer, "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018.
[7] G. Brazil and X. Liu, "M3D-RPN: Monocular 3D region proposal network for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9287–9296.
[8] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 641–656.
[9] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, "Multi-task multi-sensor fusion for 3D object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7345–7353.
[10] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 918–927.
[11] Z. Wang and K. Jia, "Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 1742–1749.
[12] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3D object detection methods for autonomous driving applications," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 10, pp. 3782–3795, Oct. 2019.
[13] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3D proposal generation and object detection from view aggregation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1–8.
[14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
[15] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. M. Gross, "Complexer-YOLO: Real-time 3D object detection and tracking on semantic point clouds," in Proc. CVPRW, Jun. 2019, pp. 1–10.
[16] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3D tracking by detection," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018, pp. 635–642.
[17] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Kraehenbuehl, T. Darrell, and F. Yu, "Joint monocular 3D vehicle detection and tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5390–5399.
[18] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3569–3577.
[19] X. Weng, J. Wang, D. Held, and K. Kitani, "3D multi-object tracking: A baseline and new evaluation metrics," 2019, arXiv:1907.03961. [Online]. Available: http://arxiv.org/abs/1907.03961
[20] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, "FANTrack: 3D multi-object tracking with feature association network," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2019, pp. 1426–1433.
[21] X. Weng, Y. Yuan, and K. Kitani, "Joint 3D tracking and forecasting with graph neural network and diversity sampling," 2020, arXiv:2003.07847. [Online]. Available: http://arxiv.org/abs/2003.07847
[22] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 2365–2374.
[23] M. Sualeh and G.-W. Kim, "Dynamic multi-LiDAR based multiple object detection and tracking," Sensors, vol. 19, no. 6, p. 1474, Mar. 2019.
[24] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[25] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019, arXiv:1906.05113. [Online]. Available: http://arxiv.org/abs/1906.05113
[26] P. Narksri, E. Takeuchi, Y. Ninomiya, Y. Morales, N. Akai, and N. Kawaguchi, "A slope-robust cascaded ground segmentation in 3D point cloud for autonomous vehicles," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 497–504.
[27] M. Himmelsbach, F. V. Hundelshausen, and H.-J. Wuensche, "Fast segmentation of 3D point clouds for ground vehicles," in Proc. IEEE Intell. Vehicles Symp., Jun. 2010, pp. 560–565.
[28] Q. Li, L. Zhang, Q. Mao, Q. Zou, P. Zhang, S. Feng, and W. Ochieng, "Motion field estimation for a dynamic scene using a 3D LiDAR," Sensors, vol. 14, no. 9, pp. 16672–16691, 2014.
[29] M. Velas, M. Spanel, M. Hradis, and A. Herout, "CNN for very fast ground segmentation in Velodyne LiDAR data," in Proc. IEEE Int. Conf. Auton. Robot Syst. Competitions, Apr. 2018, pp. 97–103.
[30] S. K. Uppada, "Centroid based clustering algorithms—A clarion study," Int. J. Comput. Sci. Inf. Technol., vol. 5, no. 6, pp. 7309–7313, 2014.
[31] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014.
[32] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander, "A distribution-based clustering algorithm for mining in large spatial databases," in Proc. 14th IEEE Int. Conf. Data Eng., Feb. 1998, pp. 324–331.
[33] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, Oct. 2018.
[34] D. D. Morris, R. Hoffman, and P. Haley, "A view-dependent adaptive matching filter for LiDAR-based vehicle tracking," in Proc. 14th IASTED Int. Conf. Robot. Appl., Cambridge, MA, USA, Nov. 2009, pp. 1–9.
[35] Z. Luo, S. Habibi, and M. V. Mohrenschildt, "LiDAR based real time multiple vehicle detection and tracking," Int. J. Comput. Electr. Autom. Control Inf. Eng., vol. 10, no. 6, pp. 1125–1132, 2016.
[36] Y. Wu, J. Lim, and M. H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[37] B. N. Vo and W. K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4091–4104, Nov. 2006.
[38] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, "Joint probabilistic data association revisited," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3047–3055.
[39] A. Doucet, "On sequential simulation-based methods for Bayesian filtering," Dept. Eng., Univ. Cambridge, Cambridge, U.K., Tech. Rep. CUED-F-ENG-TR310, 1998.
[40] M. Schreier, V. Willert, and J. Adamy, "Compact representation of dynamic driving environments for ADAS by parametric free space and dynamic object maps," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 2, pp. 367–384, Feb. 2016.
[41] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler, "MOT16: A benchmark for multi-object tracking," 2016, arXiv:1603.00831. [Online]. Available: http://arxiv.org/abs/1603.00831
[42] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," Apr. 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[43] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," May 2014, arXiv:1405.0312. [Online]. Available: http://arxiv.org/abs/1405.0312
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013.
[45] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," Dec. 2018, arXiv:1812.04244. [Online]. Available: http://arxiv.org/abs/1812.04244
[46] X. Weng and K. Kitani, "Monocular 3D object detection with pseudo-LiDAR point cloud," 2019, arXiv:1903.09847. [Online]. Available: http://arxiv.org/abs/1903.09847
[47] J. Leonard et al., "A perception-driven autonomous urban vehicle," J. Field Robot., vol. 25, no. 10, pp. 727–774, Oct. 2008.
[48] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[49] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granstrom, "Mono-camera 3D multi-object tracking using deep learning detections and PMBM filtering," in Proc. IEEE Intell. Vehicles Symp. (IV), Changshu, China, Jun. 2018, pp. 433–440.

MUHAMMAD SUALEH received the B.S. degree in electronics engineering from COMSATS University Islamabad, Abbottabad Campus, Pakistan, in 2009, and the M.S. degree in systems, control, and mechatronics from the Chalmers University of Technology, Sweden, in 2011. He is currently pursuing the Ph.D. degree with the Department of Control and Robot Engineering, Chungbuk National University, South Korea. His research interests include robotics, semantic SLAM, object detection and tracking, and control systems.

GON-WOO KIM received the M.S. and Ph.D. degrees from Seoul National University, South Korea, in 2002 and 2006, respectively. He is currently a Professor with the School of Electronics Engineering, Chungbuk National University, South Korea. His research interests include navigation, localization, and SLAM for mobile robots and autonomous vehicles.