TrackFormer: Multi-Object Tracking with Transformers

Tim Meinhardt1*   Alexander Kirillov2   Laura Leal-Taixé1   Christoph Feichtenhofer2

1 Technical University of Munich   2 Facebook AI Research (FAIR)

* Work done during an internship at Facebook AI Research (FAIR).

Abstract

The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity-preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new tracking-by-attention paradigm and, while simple in its design, is able to achieve state-of-the-art performance on the task of multi-object tracking (MOT17) and segmentation (MOTS20). The code is available at https://github.com/timmeinhardt/trackformer

Figure 1. TrackFormer jointly performs object detection and tracking-by-attention with Transformers. Object and autoregressive track queries reason about track initialization, identity, and spatio-temporal trajectories.

1. Introduction

Humans need to focus their attention to track objects in space and time, for example, when playing a game of tennis, golf, or pong. This challenge is only increased when tracking not one, but multiple objects, in crowded and real world scenarios. Following this analogy, we demonstrate the effectiveness of Transformer [50] attention for the task of multi-object tracking (MOT) in videos.

The goal in MOT is to follow the trajectories of a set of objects, e.g., pedestrians, while keeping their identities discriminated as they are moving throughout a video sequence. Due to the advances in image-level object detection [7, 38], most approaches follow the two-step tracking-by-detection paradigm: (i) detecting objects in individual video frames, and (ii) associating sets of detections between frames, thereby creating individual object tracks over time. Traditional tracking-by-detection methods associate detections via temporally sparse [22, 25] or dense [18, 21] graph optimization, or apply convolutional neural networks to predict matching scores between detections [8, 23].

Recent works [4, 6, 28, 66] suggest a variation of the traditional paradigm, coined tracking-by-regression [12]. In this approach, the object detector not only provides frame-wise detections, but replaces the data association step with a continuous regression of each track to the changing position of its object. These approaches achieve track association implicitly, but provide top performance only by relying either on additional graph optimization [6, 28] or motion and appearance models [4]. This is largely due to the isolated and local bounding box regression which lacks any notion of object identity or global communication between tracks.

In this work, we introduce the tracking-by-attention paradigm which not only applies attention for data association [11, 67] but jointly performs tracking and detection. As shown in Figure 1, this is achieved by evolving a set of tracks from frame to frame, forming trajectories over time.

We present a first straightforward instantiation of tracking-by-attention, TrackFormer, an end-to-end trainable Transformer [50] encoder-decoder architecture. It encodes frame-level features from a convolutional neural network (CNN) [17] and decodes queries into bounding boxes associated with identities. The data association is performed through the novel and simple concept of track queries. Each query represents an object and follows it in space and time over the course of a video sequence in an autoregressive fashion. New objects entering the scene are detected by static object queries as in [7, 68] and subsequently transform to future track queries. At each frame, the encoder-decoder computes attention between the input image features and the track as well as object queries, and outputs bounding boxes with assigned identities. Thereby, TrackFormer performs tracking-by-attention and achieves detection and data association jointly without relying on any additional track matching, graph optimization, or explicit modeling of motion and/or appearance. In contrast to tracking-by-detection/regression, our approach detects and associates tracks simultaneously in a single step via attention (and not regression). TrackFormer extends the recently proposed set prediction objective for object detection [7, 47, 68] to multi-object tracking.

We evaluate TrackFormer on the MOT17 [29] benchmark where it achieves state-of-the-art performance for public and private detections. Furthermore, we demonstrate the extension with a mask prediction head and show state-of-the-art results on the Multi-Object Tracking and Segmentation (MOTS20) challenge [51]. We hope this simple yet powerful baseline will inspire researchers to explore the potential of the tracking-by-attention paradigm.

In summary, we make the following contributions:

• An end-to-end trainable multi-object tracking approach which achieves detection and data association in a new tracking-by-attention paradigm.

• The concept of autoregressive track queries which embed an object's spatial position and identity, thereby tracking it in space and time.

• The TrackFormer model which obtains state-of-the-art results on two challenging multi-object tracking (MOT17) and segmentation (MOTS20) benchmarks.

2. Related work

In light of the recent trend in MOT to look beyond tracking-by-detection, we categorize and review methods according to their respective tracking paradigm.

Tracking-by-detection approaches form trajectories by associating a given set of detections over time.

Graphs have been used for track association and long-term re-identification by formulating the problem as a maximum flow (minimum cost) optimization [3] with distance based [20, 36, 62] or learned costs [24]. Other methods use association graphs [45], learned models [22], motion information [21], general-purpose solvers [61], multicuts [48], weighted graph labeling [18], edge lifting [19], or trainable graph neural networks [6, 54]. However, graph-based approaches suffer from expensive optimization routines, limiting their practical application for online tracking.

Appearance driven methods capitalize on increasingly powerful image recognition backbones to track objects by relying on similarity measures given by twin neural networks [23], learned reID features [32, 41], detection candidate selection [8] or affinity estimation [10]. Similar to re-identification, appearance models struggle in crowded scenarios with many object-object occlusions.

Motion can be modelled for trajectory prediction [1, 25, 42] using a constant velocity assumption (CVA) [2, 9] or the social force model [25, 34, 43, 58]. Learning a motion model from data [24] accomplishes track association between frames [63]. However, the projection of non-linear 3D motion [49] into the 2D image domain still poses a challenging problem for many models.

Tracking-by-regression refrains from associating detections between frames but instead accomplishes tracking by regressing past object locations to their new positions in the current frame. Previous efforts [4, 14] use regression heads on region-pooled object features. In [66], objects are represented as center points which allow for an association by a distance-based greedy matching algorithm. To overcome their lacking notion of object identity and global track reasoning, additional re-identification and motion models [4], as well as traditional [28] and learned [6] graph methods have been necessary to achieve top performance.

Tracking-by-segmentation not only predicts object masks but leverages the pixel-level information to mitigate issues with crowdedness and ambiguous backgrounds. Prior attempts used category-agnostic image segmentation [30], applied Mask R-CNN [16] with 3D convolutions [51], mask pooling layers [37], or represented objects as unordered point clouds [57] and cost volumes [56]. However, the scarcity of annotated MOT segmentation data makes modern approaches still rely on bounding boxes.

Attention for image recognition correlates each element of the input with respect to the others and is used in Transformers [50] for image generation [33] and object detection [7, 68]. For MOT, attention has only been used to associate a given set of object detections [11, 67], not tackling the detection and tracking problem jointly.

In contrast, TrackFormer casts the entire tracking objective into a single set prediction problem, applying attention not only for the association step. It jointly reasons about track initialization, identity, and spatio-temporal trajectories. We only rely on feature-level attention and avoid additional graph optimization and appearance/motion models.

3. TrackFormer

We present TrackFormer, an end-to-end trainable multi-object tracking (MOT) approach based on an encoder-decoder Transformer [50] architecture. This section describes how we cast MOT as a set prediction problem and introduce the new tracking-by-attention paradigm. Furthermore, we explain the concept of track queries and their application for frame-to-frame data association.

3.1. MOT as a set prediction problem

Given a video sequence with K individual object identities, MOT describes the task of generating ordered tracks $T_k = (b^k_{t_1}, b^k_{t_2}, \dots)$ with bounding boxes $b_t$ and track identities k. The subset $(t_1, t_2, \dots)$ of the total frames T indicates the time span between an object entering and leaving the scene. These include all frames for which an object is occluded by either the background or other objects.

In order to cast MOT as a set prediction problem, we leverage an encoder-decoder Transformer architecture. Our model performs online tracking and yields per-frame object bounding boxes and class predictions associated with identities in four consecutive steps (a code sketch follows after this list):

(i) Frame-level feature extraction with a common CNN backbone, e.g., ResNet-50 [17].

(ii) Encoding of frame features with self-attention in a Transformer encoder [50].

(iii) Decoding of queries with self- and encoder-decoder attention in a Transformer decoder [50].

(iv) Mapping of queries to box and class predictions using multilayer perceptrons (MLP).
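The four steps can be summarized in a short PyTorch-style sketch. This is a minimal illustration under our own naming assumptions (TrackFormerSketch, bbox_head, class_head, and the passed-in backbone, encoder and decoder modules are all hypothetical), not the released implementation:

```python
import torch.nn as nn

class TrackFormerSketch(nn.Module):
    """Steps (i)-(iv) as a module skeleton; all names are illustrative."""

    def __init__(self, backbone, encoder, decoder, hidden_dim=256, num_classes=1):
        super().__init__()
        self.backbone = backbone  # (i) CNN feature extractor, e.g., ResNet-50
        self.encoder = encoder    # (ii) Transformer encoder (self-attention)
        self.decoder = decoder    # (iii) Transformer decoder (self- and enc-dec attention)
        # (iv) MLPs mapping output embeddings to boxes and class logits
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # normalized (cx, cy, w, h)
        )
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for background

    def forward(self, frame, queries):
        features = self.backbone(frame)             # (i) frame-level features
        memory = self.encoder(features)             # (ii) encoded frame features
        embeddings = self.decoder(queries, memory)  # (iii) output embeddings
        # The embeddings double as next-frame track queries (Section 3.2).
        return self.bbox_head(embeddings), self.class_head(embeddings), embeddings
```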
Objects are implicitly represented in the decoder queries, which are embeddings used by the decoder to output bounding box coordinates and class predictions. The decoder alternates between two types of attention: (i) self-attention over all queries, which allows for joint reasoning about the objects in a scene, and (ii) encoder-decoder attention, which gives queries global access to the visual information of the encoded features. The output embeddings accumulate bounding box and class information over multiple decoding layers. The permutation invariance of Transformers requires additive feature and object encodings for the frame features and decoder queries, respectively.

3.2. Tracking-by-attention with queries

The total set of output embeddings is initialized with two types of query encodings: (i) static object queries, which allow the model to initialize tracks at any frame of the video, and (ii) autoregressive track queries, which are responsible for tracking objects across frames.

The simultaneous decoding of object and track queries allows our model to perform detection and tracking in a unified way, thereby introducing a new tracking-by-attention paradigm. Different tracking-by-X approaches are defined by their key component responsible for track generation. For tracking-by-detection, the tracking is performed by computing/modelling distances between frame-wise object detections. The tracking-by-regression paradigm also performs object detection, but tracks are generated by regressing each object box to its new position in the current frame. Technically, our TrackFormer also performs regression in the mapping of object embeddings with MLPs. However, the actual track association happens earlier via attention in the Transformer decoder. A detailed architecture overview which illustrates the integration of track and object queries into the Transformer decoder is shown in the appendix.

Track initialization. New objects appearing in the scene are detected by a fixed number of N_object output embeddings, each initialized with a static and learned object encoding referred to as object queries [7]. Intuitively, each object query learns to predict objects with certain spatial properties, such as bounding box size and position. The decoder self-attention relies on the object encoding to avoid duplicate detections and to reason about spatial and categorical relations of objects. The number of object queries ought to exceed the maximum number of objects per frame.

Track queries. In order to achieve frame-to-frame track generation, we introduce the concept of track queries to the decoder. Track queries follow objects through a video sequence, carrying over their identity information while adapting to their changing position in an autoregressive manner.

For this purpose, each new object detection initializes a track query with the corresponding output embedding of the previous frame. The Transformer encoder-decoder performs attention on frame features and decoder queries, continuously updating the instance-specific representation of an object's identity and location in each track query embedding. Self-attention over the joint set of both query types allows for the detection of new objects while simultaneously avoiding re-detection of already tracked objects.

In Figure 2, we provide a visual illustration of the track query concept. The initial detections in frame t = 0 spawn new track queries following their corresponding objects to frame t and beyond.


Figure 2. TrackFormer casts multi-object tracking as a set prediction problem performing joint detection and tracking-by-attention. The architecture consists of a CNN for image feature extraction, a Transformer [50] encoder for image feature encoding, and a Transformer decoder which applies self- and encoder-decoder attention to produce output embeddings with bounding box and class information. At frame t = 0, the decoder transforms N_object object queries (white) to output embeddings either initializing new autoregressive track queries or predicting the background class (crossed). On subsequent frames, the decoder processes the joint set of N_object + N_track queries to follow or remove (blue) existing tracks as well as initialize new tracks (purple).

To this end, N_object object queries (white) are decoded to output embeddings for potential track initializations. Each valid object detection $\{b^0_0, b^1_0, \dots\}$ with a classification score above σ_object, i.e., an output embedding not predicting the background class (crossed), initializes a new track query embedding. Since not all objects in a sequence appear on the first frame, the track identities $K_{t=0} = \{0, 1, \dots\}$ only represent a subset of all K. For the decoding step at any frame t > 0, track queries initialize additional output embeddings associated with different identities (colored). The joint set of N_object + N_track output embeddings is initialized by (learned) object and (temporally adapted) track queries, respectively.

The Transformer decoder transforms the entire set of output embeddings at once and provides the input for the subsequent MLPs to predict bounding boxes and classes for frame t. The number of track queries N_track changes between frames as new objects are detected or tracks removed. Tracks and their corresponding query can be removed either if their classification score drops below σ_track or by non-maximum suppression (NMS) with an IoU threshold of σ_NMS. A comparatively high σ_NMS only removes strongly overlapping duplicate bounding boxes which we found to not be resolvable by the decoder self-attention.
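A hypothetical inference loop, built on the module sketch above, illustrates how track queries are spawned from confident detections, followed across frames, and removed again. The threshold values and all variable names (video_frames, object_queries, model) are illustrative assumptions:

```python
import torch

sigma_object, sigma_track = 0.9, 0.8  # illustrative values, not the paper's thresholds
hidden_dim = 256

track_queries = torch.empty(0, hidden_dim)  # no tracks before the first frame
track_ids, next_id = [], 0

for frame in video_frames:  # video_frames, object_queries, model assumed given
    queries = torch.cat([object_queries, track_queries])  # N_object + N_track
    boxes, logits, embeddings = model(frame, queries)
    scores = logits.softmax(-1)[:, :-1].max(-1).values    # non-background confidence
    n_obj = object_queries.shape[0]

    # Follow or remove existing tracks (classification score below sigma_track).
    keep = scores[n_obj:] >= sigma_track
    track_queries = embeddings[n_obj:][keep]
    track_ids = [k for k, kept in zip(track_ids, keep.tolist()) if kept]

    # Initialize new tracks from confident object query detections.
    new = scores[:n_obj] >= sigma_object
    track_queries = torch.cat([track_queries, embeddings[:n_obj][new]])
    track_ids += list(range(next_id, next_id + int(new.sum())))
    next_id += int(new.sum())
    # Duplicate removal via NMS with a high IoU threshold sigma_NMS is
    # omitted here for brevity; see torchvision.ops.nms.
```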
Track query re-identification. The ability to decode an arbitrary number of track queries allows for an attention-based short-term re-identification process. We keep decoding previously removed track queries for a maximum number of T_track-reid frames. During this patience window, track queries are considered to be inactive and do not contribute to the trajectory until a classification score higher than σ_track-reid triggers a re-identification. The spatial information embedded into each track query prevents their application for long-term occlusions with large object movement but, nevertheless, allows for a short-term recovery from track loss. This is possible without any dedicated re-identification training and, furthermore, cements TrackFormer's holistic approach of relying on the same attention mechanism for track initialization, identity preservation and trajectory forming, even through short-term occlusions.

3.3. TrackFormer training

For track queries to work in interaction with object queries and follow objects to the next frame, TrackFormer requires dedicated frame-to-frame tracking training. As indicated in Figure 2, we train on two adjacent frames and optimize the entire MOT objective at once. The loss for frame t measures the set prediction of all output embeddings N = N_object + N_track with respect to the ground truth objects in terms of class and bounding box prediction.

The set prediction loss is computed in two steps (see the training sketch below):

(i) Object detection on frame t − 1 with N_object object queries (see t = 0 in Figure 2).

(ii) Tracking of objects from (i) and detection of new objects on frame t with all N queries.

The number of track queries N_track depends on the number of successfully detected objects in frame t − 1. During training, the MLP predictions $\hat{y} = \{\hat{y}_j\}_{j=1}^{N}$ of the output embeddings from step (iv) are each assigned to one of the ground truth objects y or the background class. Each $y_i$ represents a bounding box $b_i$, object class $c_i$ and identity $k_i$.
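The two-step computation corresponds to two forward passes per training sample. The following sketch assumes the model from Section 3.1 and two hypothetical helpers, bipartite_match and set_prediction_loss, standing in for Eqs. (1)-(4) below:

```python
import torch

def training_step(model, object_queries, frame_prev, frame_cur, gt_prev, gt_cur):
    # (i) Object detection on frame t-1 with the N_object object queries.
    boxes, logits, embeddings = model(frame_prev, object_queries)
    matches = bipartite_match(boxes, logits, gt_prev)    # hypothetical helper, Eqs. (1)-(3)
    track_queries = embeddings[matches.pred_idx]         # successful detections become tracks

    # (ii) Tracking of the objects from (i) plus detection of new objects
    #      on frame t with all N = N_object + N_track queries.
    queries = torch.cat([object_queries, track_queries])
    boxes, logits, _ = model(frame_cur, queries)
    return set_prediction_loss(boxes, logits, gt_cur)    # hypothetical helper, Eq. (4)
```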

8847
Bipartite matching. The mapping j = π(i) from ground truth objects $y_i$ to the joint set of object and track query predictions $\hat{y}_j$ is determined either via track identity or via costs based on bounding box similarity and object class. For the former, we denote the subset of ground truth track identities at frame t with $K_t \subset K$. Each detection from step (i) is assigned to its respective ground truth track identity k from the set $K_{t-1} \subset K$. The corresponding output embeddings, i.e., track queries, inherently carry over the identity information to the next frame. The two ground truth track identity sets describe a hard assignment of the N_track track query outputs to the ground truth objects in frame t:

$K_t \cap K_{t-1}$: Match by track identity k.

$K_{t-1} \setminus K_t$: Match with background class.

$K_t \setminus K_{t-1}$: Match by minimum cost mapping.

The second set of ground truth track identities $K_{t-1} \setminus K_t$ includes tracks which either have been occluded or left the scene at frame t. The last set $K_\text{object} = K_t \setminus K_{t-1}$ of previously not yet tracked ground truth objects remains to be matched with the N_object object queries. To achieve this, we follow [7] and search for the injective minimum cost mapping $\hat{\sigma}$ in the following assignment problem,

$$\hat{\sigma} = \operatorname*{argmin}_{\sigma} \sum_{k_i \in K_\text{object}} \mathcal{C}_\text{match}(y_i, \hat{y}_{\sigma(i)}), \qquad (1)$$

with index σ(i) and pair-wise costs $\mathcal{C}_\text{match}$ between ground truth $y_i$ and prediction $\hat{y}_i$. The problem is solved with a combinatorial optimization algorithm as in [47]. Given the ground truth class labels $c_i$ and predicted class probabilities $\hat{p}_i(c_i)$ for output embeddings i, the matching cost $\mathcal{C}_\text{match}$ with class weighting $\lambda_\text{cls}$ is defined as

$$\mathcal{C}_\text{match} = -\lambda_\text{cls}\, \hat{p}_{\sigma(i)}(c_i) + \mathcal{C}_\text{box}(b_i, \hat{b}_{\sigma(i)}). \qquad (2)$$

The authors of [7] report better performance without logarithmic class probabilities. The $\mathcal{C}_\text{box}$ term penalizes bounding box differences by a combination of $\ell_1$ distance and generalized intersection over union (IoU) [39] cost $\mathcal{C}_\text{iou}$,

$$\mathcal{C}_\text{box} = \lambda_{\ell_1} \| b_i - \hat{b}_{\sigma(i)} \|_1 + \lambda_\text{iou}\, \mathcal{C}_\text{iou}(b_i, \hat{b}_{\sigma(i)}), \qquad (3)$$

with weighting parameters $\lambda_{\ell_1}, \lambda_\text{iou} \in \mathbb{R}$. In contrast to $\ell_1$, the scale-invariant IoU term provides similar relative errors for different box sizes. The optimal cost mapping $\hat{\sigma}$ determines the corresponding assignments in π(i).
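The minimum cost mapping can be sketched with the Hungarian solver from scipy, as is common in DETR-style implementations. The weighting defaults below are DETR-style assumptions rather than the paper's values, and boxes are assumed to be in (x1, y1, x2, y2) format:

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def min_cost_mapping(pred_probs, pred_boxes, gt_classes, gt_boxes,
                     lambda_cls=1.0, lambda_l1=5.0, lambda_iou=2.0):
    # C_match = -lambda_cls * p_hat(c_i) + C_box  (no log, following [7]).
    cost_cls = -lambda_cls * pred_probs[:, gt_classes]     # [N_pred, N_gt]
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)       # l1 distance term
    cost_iou = -generalized_box_iou(pred_boxes, gt_boxes)  # C_iou as a cost
    cost = cost_cls + lambda_l1 * cost_l1 + lambda_iou * cost_iou
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # the minimum cost mapping sigma_hat
```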
Set prediction loss. The final MOT set prediction loss is computed over all N = N_object + N_track output predictions:

$$\mathcal{L}_\text{MOT}(y, \hat{y}, \pi) = \sum_{i=1}^{N} \mathcal{L}_\text{query}(y, \hat{y}_i, \pi). \qquad (4)$$

The output embeddings which were not matched via track identity or $\hat{\sigma}$ are not part of the mapping π and will be assigned to the background class $c_i = 0$. We indicate the ground truth object matched with prediction i by $y_{\pi=i}$ and define the loss per query

$$\mathcal{L}_\text{query} = \begin{cases} -\lambda_\text{cls} \log \hat{p}_i(c_{\pi=i}) + \mathcal{L}_\text{box}(b_{\pi=i}, \hat{b}_i), & \text{if } i \in \pi \\ -\lambda_\text{cls} \log \hat{p}_i(0), & \text{if } i \notin \pi. \end{cases}$$

The bounding box loss $\mathcal{L}_\text{box}$ is computed in the same fashion as (3), but we differentiate its notation as the cost term $\mathcal{C}_\text{box}$ is generally not required to be differentiable.

Track augmentations. The two-step loss computation, see (i) and (ii), for training track queries represents only a limited range of possible tracking scenarios. Therefore, we propose the following augmentations to enrich the set of potential track queries during training. These augmentations will be verified in our experiments. We use three types of augmentations similar to [66] which lead to perturbations of object location and motion, missing detections, and simulated occlusions (a code sketch follows below).

1. The frame t − 1 for step (i) is sampled from a range of frames around frame t, thereby generating challenging frame pairs where the objects have moved substantially from their previous position. Such a sampling allows for the simulation of camera motion and low frame rates from usually benevolent sequences.

2. We sample false negatives with a probability of p_FN by removing track queries before proceeding with step (ii). The corresponding ground truth objects in frame t will be matched with object queries and trigger a new object detection. Keeping the ratio of false positives sufficiently high is vital for a joint training of both query types.

3. To improve the removal of tracks, i.e., by background class assignment, in occlusion scenarios, we complement the set of track queries with additional false positives. These queries are sampled from output embeddings of frame t − 1 that were classified as background. Each of the original track queries has a chance of p_FP to spawn an additional false positive query. We choose these with a large likelihood of occluding with the respective spawning track query.

Another common augmentation for improved robustness is to apply spatial jittering to previous frame bounding boxes or center points [66]. The nature of track queries, which encode object information implicitly, does not allow for such an explicit perturbation in the spatial domain. We believe our randomization of the temporal range provides a more natural augmentation from video data.
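Augmentations 2 and 3 amount to randomly dropping and re-inserting queries between the two training steps. A simplified sketch follows; the probabilities are illustrative placeholders, and the false positives are sampled uniformly here instead of by occlusion likelihood as in the paper:

```python
import torch

def augment_track_queries(track_queries, bg_embeddings, p_fn=0.4, p_fp=0.1):
    # 2. False negatives: drop track queries with probability p_FN so their
    #    ground truth objects must be re-detected by the object queries.
    kept = track_queries[torch.rand(len(track_queries)) > p_fn]
    # 3. False positives: re-insert embeddings classified as background on
    #    frame t-1 (the paper prefers ones occluding a spawning track query;
    #    this sketch samples them uniformly instead).
    extra = bg_embeddings[torch.rand(len(bg_embeddings)) < p_fp]
    return torch.cat([kept, extra])

# 1. The temporal jittering happens in the data loader: frame t-1 is sampled
#    from a range around frame t rather than always the directly preceding frame.
```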

4. Experiments

In this section, we present tracking results for TrackFormer on two MOTChallenge benchmarks, namely, MOT17 [29] and MOTS20 [51]. Furthermore, we verify individual contributions in an ablation study.

4.1. MOT benchmarks and metrics

Benchmarks. The MOT17 [29] benchmark consists of a train and test set, each with 7 sequences and pedestrians annotated with full-body bounding boxes. To evaluate the tracking (data association) robustness independently, three sets of public detections with varying quality are provided, namely, DPM [15], Faster R-CNN [38] and SDP [59]. MOTS20 [51] provides mask annotations for 4 train and test sequences of MOT17 but without annotations for small objects. The corresponding bounding boxes are not full-body, but based on the visible segmentation masks.

Metrics. Different aspects of MOT are evaluated by a number of individual metrics [5]. The community focuses on two compound metrics, namely, Multiple Object Tracking Accuracy (MOTA) and Identity F1 Score (IDF1) [40]. While the former focuses on object coverage, the identity preservation of a method is measured by the latter. For MOTS, we report MOTSA which evaluates predictions with a ground truth matching based on mask IoU.
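For reference, the two compound metrics follow the standard definitions of [5] and [40] (not restated in the paper itself):

$$\text{MOTA} = 1 - \frac{\sum_t \left( \text{FN}_t + \text{FP}_t + \text{IDSW}_t \right)}{\sum_t \text{GT}_t}, \qquad \text{IDF1} = \frac{2\,\text{IDTP}}{2\,\text{IDTP} + \text{IDFP} + \text{IDFN}}$$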
Public detections. The MOT17 [29] benchmark is evaluated in a private and public detection setting. The latter allows for a comparison of tracking methods independent of the underlying object detection performance. MOT17 provides three sets of public detections with varying quality. In contrast to classic tracking-by-detection methods, TrackFormer is not able to directly produce tracking outputs from detection inputs. Therefore, we report the results of TrackFormer and CenterTrack [66] in Table 1 by filtering the initialization of tracks with a minimum IoU requirement. For more implementation details and a discussion on the fairness of such a filtering, we refer to the appendix.
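The filtering amounts to a simple overlap test at track initialization time. A sketch with an assumed threshold (the actual value is given in the appendix):

```python
import torch
from torchvision.ops import box_iou

def filter_new_tracks(new_boxes, public_det_boxes, min_iou=0.5):
    # Keep a newly initialized track only if it overlaps at least one public
    # detection; already running tracks are not affected by the filter.
    if len(public_det_boxes) == 0:
        return torch.zeros(len(new_boxes), dtype=torch.bool)
    iou = box_iou(new_boxes, public_det_boxes)  # boxes in (x1, y1, x2, y2)
    return iou.max(dim=1).values >= min_iou
```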
4.2. Implementation details

TrackFormer follows the ResNet50 [17] CNN feature extraction and Transformer encoder-decoder architecture presented in Deformable DETR [68]. For track queries, the deformable reference points for the current frame are dynamically adjusted to the previous frame bounding box centers. Furthermore, for the decoder we stack the feature maps of the previous and current frame and compute cross-attention with queries over both frames. Queries are able to discriminate between features from the two frames by applying a temporal feature encoding as in [55]. For more detailed hyperparameters, we refer to the appendix.

Decoder queries. By design, TrackFormer can only detect a maximum of N_object objects. To detect the maximum number of 52 objects per frame in MOT17 [29], we train TrackFormer with N_object = 500 learned object queries. For optimal performance, the total number of queries must exceed the number of ground truth objects per frame by a large margin. The number of possible track queries is adaptive and only practically limited by the abilities of the decoder.

Simulate MOT from single images. The encoder-decoder multi-level attention mechanism requires substantial amounts of training data. Hence, we follow a similar approach as in [66] and simulate MOT data from the CrowdHuman [44] person detection dataset. The adjacent training frames t − 1 and t are generated by applying random spatial augmentations to a single image. To generate challenging tracking scenarios, we randomly resize and crop up to 20% with respect to the original image size.
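Conceptually, two random views of the same image act as an adjacent frame pair. A sketch with illustrative transform parameters (box transformation omitted):

```python
import random
import torchvision.transforms.functional as F

def simulate_frame_pair(image, max_shift=0.2):
    """Two random views of one image act as adjacent frames t-1 and t."""
    def random_view(img):
        w, h = img.size  # PIL image
        dx = int(random.uniform(0, max_shift) * w)  # shift/crop up to 20%
        dy = int(random.uniform(0, max_shift) * h)
        return F.resized_crop(img, top=dy, left=dx,
                              height=h - dy, width=w - dx, size=(h, w))
    # Ground truth boxes must be transformed with the same crop parameters
    # (omitted for brevity).
    return random_view(image), random_view(image)
```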
Training procedure. All trainings follow [68] and apply a batch size of 2 with initial learning rates of 0.0002 and 0.00002 for the encoder-decoder and backbone, respectively. For public detections, we initialize with the model weights from [68] pretrained on COCO [27] and then fine-tune on MOT17 for 50 epochs with a learning rate drop after 10 epochs. The private detections model is trained from scratch for 85 epochs on CrowdHuman [44] with simulated adjacent frames, and we drop the initial learning rates after 50 epochs. To avoid overfitting to the small MOT17 dataset, we then fine-tune for an additional 40 epochs on the combined CrowdHuman and MOT17 datasets. The fine-tuning starts with the initial learning rates, which are dropped after 10 epochs. By the nature of track queries, each sample has a different number of total queries N = N_object + N_track. In order to stack samples to a batch, we pad the samples with additional false positive queries. The training of the private detections model takes around 2 days on 7 × 32GB GPUs.

Mask training. TrackFormer predicts instance-level object masks with a segmentation head as in [7] by generating spatial attention maps from the encoded image features and decoder output embeddings. Subsequent upscaling and convolution operations yield mask predictions for all output embeddings. We adopt the private detection training pipeline from MOT17 but retrain TrackFormer with the original DETR [7] attention. This is due to the reduced memory consumption for single scale feature maps and the inferior segmentation masks from sparse deformable attention maps. Furthermore, the benefits of deformable attention vanish on MOTS20 as it excludes small objects. After training on MOT17, we freeze the model and only train the segmentation head on all COCO images containing persons. Finally, we fine-tune the entire model on MOTS20.
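The two learning rates are typically realized with optimizer parameter groups. A sketch assuming an AdamW optimizer as in Deformable DETR and name-based backbone matching (both assumptions about the implementation):

```python
import torch

backbone_params = [p for n, p in model.named_parameters() if "backbone" in n]
other_params = [p for n, p in model.named_parameters() if "backbone" not in n]
optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": 2e-4},     # encoder-decoder
    {"params": backbone_params, "lr": 2e-5},  # CNN backbone
])
# The learning rates are dropped at the schedule points given in the text,
# e.g., after 50 epochs for the private detections model:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50)
```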

Table 1. Comparison of multi-object tracking methods on the MOT17 [29] test set. We report private as well as public detection results and separate between online and offline approaches. Both TrackFormer and CenterTrack filter tracks by requiring a minimum IoU with public detections. For a detailed discussion on the fairness of such a filtering, we refer to the appendix. We indicate additional training Data: CH=CrowdHuman [44], PD=Parallel Domain [49] (synthetic), 6M=6 tracking datasets as in [64], JTA [13] (synthetic), M=Market1501 [65] and C=CUHK03 [26]. Runtimes (FPS) are self-measured.

| Method | Data | FPS ↑ | MOTA ↑ | IDF1 ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | ID Sw. ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Public, offline: | | | | | | | | | |
| jCC [21] | – | – | 51.2 | 54.5 | 493 | 872 | 25937 | 247822 | 1802 |
| FWT [18] | – | – | 51.3 | 47.6 | 505 | 830 | 24101 | 247921 | 2648 |
| eHAF [45] | – | – | 51.8 | 54.7 | 551 | 893 | 33212 | 236772 | 1834 |
| TT [63] | – | – | 54.9 | 63.1 | 575 | 897 | 20236 | 233295 | 1088 |
| MPNTrack [6] | M+C | – | 58.8 | 61.7 | 679 | 788 | 17413 | 213594 | 1185 |
| Lif_T [19] | M+C | – | 60.5 | 65.6 | 637 | 791 | 14966 | 206619 | 1189 |
| Public, online: | | | | | | | | | |
| FAMNet [10] | – | – | 52.0 | 48.7 | 450 | 787 | 14138 | 253616 | 3072 |
| Tracktor++ [4] | M+C | 1.3 | 56.3 | 55.1 | 498 | 831 | 8866 | 235449 | 1987 |
| GSM [28] | M+C | – | 56.4 | 57.8 | 523 | 813 | 14379 | 230174 | 1485 |
| CenterTrack [66] | – | 17.7 | 60.5 | 55.7 | 580 | 777 | 11599 | 208577 | 2540 |
| TMOH [46] | – | – | 62.1 | 62.8 | 633 | 739 | 10951 | 201195 | 1897 |
| TrackFormer | – | 7.4 | 62.3 | 57.6 | 688 | 638 | 16591 | 192123 | 4018 |
| Private, online: | | | | | | | | | |
| TubeTK [31] | JTA | – | 63.0 | 58.6 | 735 | 468 | 27060 | 177483 | 4137 |
| GSDT [54] | 6M | – | 73.2 | 66.5 | 981 | 411 | 26397 | 120666 | 3891 |
| FairMOT [64] | CH+PD | – | 73.7 | 72.3 | 1017 | 408 | 27507 | 117477 | 3303 |
| PermaTrack [49] | CH+PD | – | 73.8 | 68.9 | 1032 | 405 | 28998 | 115104 | 3699 |
| GRTU [53] | CH+6M | – | 75.5 | 76.9 | 1158 | 495 | 27813 | 108690 | 1572 |
| TLR [52] | CH+6M | – | 76.5 | 73.6 | 1122 | 300 | 29808 | 99510 | 3369 |
| CTracker [35] | – | – | 66.6 | 57.4 | 759 | 570 | 22284 | 160491 | 5529 |
| CenterTrack [66] | CH | 17.7 | 67.8 | 64.7 | 816 | 579 | 18498 | 160332 | 3039 |
| QuasiDense [32] | – | – | 68.7 | 66.3 | 957 | 516 | 26589 | 146643 | 3378 |
| TraDeS [56] | CH | – | 69.1 | 63.9 | 858 | 507 | 20892 | 150060 | 3555 |
| TrackFormer | CH | 7.4 | 74.1 | 68.0 | 1113 | 246 | 34602 | 108777 | 2829 |

Table 2. Comparison of multi-object tracking and segmentation methods evaluated on the MOTS20 [51] train and test sets. Methods indicated with TbD first perform tracking-by-detection without segmentation on SDP [60] public detections and then predict masks with a Mask R-CNN [16] fine-tuned on MOTS20.

| Method | TbD | sMOTSA ↑ | IDF1 ↑ | FP ↓ | FN ↓ | ID Sw. ↓ |
|---|---|---|---|---|---|---|
| Train set (4-fold cross-validation): | | | | | | |
| MHT_DAM [22] | × | 48.0 | – | – | – | – |
| FWT [18] | × | 49.3 | – | – | – | – |
| MOTDT [8] | × | 47.8 | – | – | – | – |
| jCC [21] | × | 48.3 | – | – | – | – |
| TrackRCNN [51] | | 52.7 | – | – | – | – |
| MOTSNet [37] | | 56.8 | – | – | – | – |
| PointTrack [57] | | 58.1 | – | – | – | – |
| TrackFormer | | 58.7 | – | – | – | – |
| Test set: | | | | | | |
| Track R-CNN [51] | | 40.6 | 42.4 | 1261 | 12641 | 567 |
| TrackFormer | | 54.9 | 63.6 | 2233 | 7195 | 278 |

4.3. Benchmark results

MOT17. Following the training procedure described in Section 4.2, we evaluate TrackFormer on the MOT17 [29] test set and report results in Table 1. First of all, we isolate the tracking performance and compare results in a public detection setting by applying a track initialization filtering similar to [66]. However, to improve fairness, we filter not by bounding box center distance as in [66] but by a minimum IoU as detailed in the appendix. TrackFormer performs on par with state-of-the-art results in terms of MOTA without pretraining on CrowdHuman [44]. Our identity preservation performance is only surpassed by [46] and offline methods which benefit from the processing of entire sequences at once.

On private detections, we achieve a new state of the art both in terms of MOTA (+5.0) and IDF1 (+1.7) for methods only trained on CrowdHuman [44]. Only the methods [49, 52, 53] which follow [64] and pretrain on 6 additional tracking datasets (6M) surpass our performance. In contrast to our public detection model, not only the detection but also the tracking performance is greatly improved. This is due to the additional tracking data provided by simulating adjacent frames on CrowdHuman, which satisfies the large data requirements of Transformers.

Our tracking-by-attention approach achieves top performance via global attention between encoded input pixels and decoder queries, without relying on additional motion [4, 10] or appearance models [4, 8, 10]. Furthermore, the frame-to-frame association with track queries avoids post-processing with heuristic greedy matching procedures [66] or additional graph optimization [28]. Our proposed TrackFormer represents the first application of Transformers to the MOT problem and could work as a blueprint for future research in this promising direction. In particular, we expect great potential for methods going beyond the two-frame training/inference regime.

MOTS20. In addition to object detection and tracking, TrackFormer is able to predict instance-level segmentation masks. As reported in Table 2, we achieve state-of-the-art MOTS results in terms of object coverage (MOTSA) and identity preservation (IDF1). All methods are evaluated in a private setting. A MOTS20 test set submission has only recently become possible; hence we also provide the 4-fold cross-validation evaluation established in [51] and report the mean best epoch results over all splits. TrackFormer surpasses all previous methods without relying on a dedicated tracking formulation for segmentation masks as in [57]. In Figure 3, we present a qualitative comparison of TrackFormer and Track R-CNN [51] on two test sequences.

Figure 3. We compare TrackFormer segmentation results with the popular Track R-CNN [51] on selected MOTS20 [51] test sequences.
The superiority of TrackFormer in terms of MOTSA in Table 2 can be clearly observed by the difference in pixel mask accuracy.

Table 3. Ablation study on TrackFormer components. We report MOT17 [29] training set private results on a 50-50 frame split. Components are removed (w\o) cumulatively from top to bottom; the last row without all components is only trained for object detection and associates tracks via greedy matching as in [66].

| Method | MOTA ↑ | ∆ | IDF1 ↑ | ∆ |
|---|---|---|---|---|
| TrackFormer | 71.3 | | 73.4 | |
| w\o pretraining on CrowdHuman | 69.3 | -2.0 | 71.8 | -1.6 |
| w\o track query re-identification | 69.2 | -0.1 | 70.4 | -1.4 |
| w\o track augmentations (FP) | 68.4 | -0.8 | 70.0 | -0.4 |
| w\o track augmentations (Range) | 64.0 | -4.4 | 59.2 | -10.8 |
| w\o track queries | 61.0 | -3.0 | 45.1 | -14.1 |

Table 4. We demonstrate the effect of jointly training for tracking and segmentation on a 4-fold split on the MOTS20 [51] train set. We evaluate with regular MOT metrics, i.e., matching to ground truth with bounding boxes instead of masks.

| Method | Mask training | MOTA ↑ | IDF1 ↑ |
|---|---|---|---|
| TrackFormer | ✗ | 61.9 | 54.8 |
| TrackFormer | ✓ | 61.9 | 56.0 |

4.4. Ablation study

The ablation studies on the MOT17 and MOTS20 training sequences are evaluated in a private detection setting with a 50-50 frame split and a 4-fold cross-validation split, respectively.

TrackFormer components. We ablate the impact of different TrackFormer components on the tracking performance in Table 3. Our full pipeline including pretraining on the CrowdHuman dataset provides a MOTA and IDF1 of 71.3 and 73.4, respectively. The baseline without (w\o) pretraining reduces this by -2.0 and -1.6 points, an effect expected to be even more severe for the generalization to the test set. The attention-based track query re-identification has a negligible effect on MOTA but improves IDF1 by 1.4 points. If we further ablate our false positive (FP) and frame range track augmentations, we see another drop of -5.2 MOTA and -11.2 IDF1 points. Both augmentations provide the training with rich tracking scenarios and prevent an early overfitting. The false negative track augmentations are indispensable for a joint training of object and track queries, hence we refrain from ablating these.

Our baseline without any tracking components and track queries is only trained for object detection. Data association is performed via greedy center distance matching as in [66], resulting in a huge drop of -3.0 MOTA and -14.1 IDF1. This version represents previous post-processing and matching methods and demonstrates the benefit of jointly addressing track initialization, identity and trajectory forming in our unified TrackFormer formulation.

Mask information improves tracking. This ablation studies the synergies between segmentation and tracking training. Table 4 only evaluates bounding box tracking performance and shows a +1.2 IDF1 improvement when trained jointly with mask prediction. The additional mask information does not improve track coverage (MOTA) but resolves ambiguous occlusion scenarios during training.

5. Conclusion

We have presented a unified tracking-by-attention paradigm for detection and multi-object tracking with Transformers. As an example of said paradigm, our end-to-end trainable TrackFormer architecture applies autoregressive track query embeddings to follow objects over a sequence. We jointly tackle track initialization, identity and trajectory forming with a Transformer encoder-decoder architecture, without relying on additional matching, graph optimization or motion/appearance modeling. Our approach achieves state-of-the-art results for multi-object tracking as well as segmentation. We hope that this paradigm will foster future work in Transformers for multi-object tracking.

Acknowledgements: We are grateful for discussions with Jitendra Malik, Karttikeya Mangalam, and David Novotny.

References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[2] Anton Andriyenko and Konrad Schindler. Multi-target tracking by continuous energy minimization. IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[3] Jerome Berclaz, Francois Fleuret, Engin Turetken, and Pascal Fua. Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell., 2011.
[4] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking without bells and whistles. In Int. Conf. Comput. Vis., 2019.
[5] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008.
[6] Guillem Brasó and Laura Leal-Taixé. Learning a neural solver for multiple object tracking. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[7] Nicolas Carion, F. Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. Eur. Conf. Comput. Vis., 2020.
[8] Long Chen, Haizhou Ai, Zijie Zhuang, and Chong Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Int. Conf. Multimedia and Expo, 2018.
[9] Wongun Choi and Silvio Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. Eur. Conf. Comput. Vis., 2010.
[10] Peng Chu and Haibin Ling. Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In Int. Conf. Comput. Vis., 2019.
[11] Qi Chu, Wanli Ouyang, Hongsheng Li, Xiaogang Wang, Bin Liu, and Nenghai Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In Int. Conf. Comput. Vis., 2017.
[12] Patrick Dendorfer, Aljosa Osep, Anton Milan, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking. Int. J. Comput. Vis., 2020.
[13] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. In Eur. Conf. Comput. Vis., 2018.
[14] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Int. Conf. Comput. Vis., 2017.
[15] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell., 2009.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[18] Roberto Henschel, Laura Leal-Taixé, Daniel Cremers, and Bodo Rosenhahn. Improvements to frank-wolfe optimization for multi-detector multi-object tracking. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[19] Andrea Hornakova, Roberto Henschel, Bodo Rosenhahn, and Paul Swoboda. Lifted disjoint paths with application in multiple object tracking. In Int. Conf. Mach. Learn., 2020.
[20] Hao Jiang, Sidney S. Fels, and James J. Little. A linear programming approach for multiple object tracking. IEEE Conf. Comput. Vis. Pattern Recog., 2007.
[21] Margret Keuper, Siyu Tang, Bjoern Andres, Thomas Brox, and Bernt Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[22] Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M. Rehg. Multiple hypothesis tracking revisited. In Int. Conf. Comput. Vis., 2015.
[23] Laura Leal-Taixé, Cristian Canton-Ferrer, and Konrad Schindler. Learning by tracking: siamese cnn for robust target association. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2016.
[24] Laura Leal-Taixé, Michele Fenzi, Alina Kuznetsova, Bodo Rosenhahn, and Silvio Savarese. Learning an image-based motion context for multiple people tracking. IEEE Conf. Comput. Vis. Pattern Recog., 2014.
[25] Laura Leal-Taixé, Gerard Pons-Moll, and Bodo Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. Int. Conf. Comput. Vis. Workshops, 2011.
[26] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., 2014.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context. arXiv:1405.0312, 2014.
[28] Qiankun Liu, Qi Chu, Bin Liu, and Nenghai Yu. Gsm: Graph similarity model for multi-object tracking. In Int. Joint Conf. Art. Int., 2020.
[29] Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
[30] Aljoša Ošep, Wolfgang Mehner, Paul Voigtlaender, and Bastian Leibe. Track, then decide: Category-agnostic vision-based multi-object tracking. IEEE Int. Conf. Rob. Aut., 2018.
[31] Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, and Cewu Lu. Tubetk: Adopting tubes to track multi-object in a one-step training model. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[32] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[33] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv:1802.05751, 2018.
[34] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You'll never walk alone: modeling social behavior for multi-target tracking. Int. Conf. Comput. Vis., 2009.
[35] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Eur. Conf. Comput. Vis., 2020.
[36] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[37] Lorenzo Porzi, Markus Hofinger, Idoia Ruiz, Joan Serrat, Samuel Rota Bulo, and Peter Kontschieder. Learning multi-object tracking and segmentation from automatic annotations. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst., 2015.
[39] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[40] Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Eur. Conf. Comput. Vis. Workshops, 2016.
[41] Ergys Ristani and Carlo Tomasi. Features for multi-target multi-camera tracking and re-identification. IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[42] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory prediction. Eur. Conf. Comput. Vis., 2016.
[43] Paul Scovanner and Marshall F. Tappen. Learning pedestrian dynamics from the real world. Int. Conf. Comput. Vis., 2009.
[44] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv:1805.00123, 2018.
[45] H. Sheng, Y. Zhang, J. Chen, Z. Xiong, and J. Zhang. Heterogeneous association graph fusion for target association in multiple object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[46] Daniel Stadler and Jurgen Beyerer. Improving multiple pedestrian tracking by track management and occlusion handling. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[47] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. End-to-end people detection in crowded scenes. In IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[48] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[49] Pavel Tokmakov, Jie Li, Wolfram Burgard, and Adrien Gaidon. Learning to track with object permanence. In Int. Conf. Comput. Vis., 2021.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Adv. Neural Inform. Process. Syst., 2017.
[51] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. Mots: Multi-object tracking and segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[52] Qiang Wang, Yun Zheng, Pan Pan, and Yinghui Xu. Multiple object tracking with correlation learning. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[53] Shuai Wang, Hao Sheng, Yang Zhang, Yubin Wu, and Zhang Xiong. A general recurrent tracking framework without real data. In Int. Conf. Comput. Vis., 2021.
[54] Yongxin Wang, Kris Kitani, and Xinshuo Weng. Joint object detection and multi-object tracking with graph neural networks. In IEEE Int. Conf. Rob. Aut., 2021.
[55] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[56] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[57] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In Eur. Conf. Comput. Vis., 2020.
[58] Kota Yamaguchi, Alexander C. Berg, Luis E. Ortiz, and Tamara L. Berg. Who are you with and where are you going? IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[59] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[60] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[61] Qian Yu, Gerard Medioni, and Isaac Cohen. Multiple target tracking using spatio-temporal markov chain monte carlo data association. IEEE Conf. Comput. Vis. Pattern Recog., 2007.
[62] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network flows. IEEE Conf. Comput. Vis. Pattern Recog., 2008.
[63] Y. Zhang, H. Sheng, Y. Wu, S. Wang, W. Lyu, W. Ke, and Z. Xiong. Long-term tracking with deep tracklet association. IEEE Trans. Image Process., 2020.
[64] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 2021.
[65] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Int. Conf. Comput. Vis., 2015.
[66] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. Eur. Conf. Comput. Vis., 2020.
[67] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In Eur. Conf. Comput. Vis., 2018.
[68] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. Int. Conf. Learn. Represent., 2021.

