Multi-Object Tracking as Attention Mechanism

Abstract

We propose a conceptually simple, and therefore fast, multi-object tracking (MOT) model that does not require any attached modules, such as the Kalman filter, Hungarian algorithm, transformer blocks, or graph networks. Conventional MOT models are built upon these multi-step modules, so their computational cost is high. Our proposed end-to-end MOT model, TicrossNet, is composed only of a base detector and a cross-attention module. As a result, the overhead of tracking does not increase significantly even when the number of instances (Nt) increases. We show that TicrossNet runs in real time; specifically, it achieves 32.6 FPS on MOT17 and 31.0 FPS on MOT20 (Tesla V100), the latter of which includes more than 100 instances per frame. We also demonstrate that TicrossNet is robust to Nt; thus, unlike other models aiming for real-time processing, it does not have to change the size of its base detector depending on Nt.

Architecture

To build an efficient model for MOT, we reduce both the number of modules and their complexity. Our core idea is to use the cross-attention mechanism [9] for MOT modeling. This idea is based on the similarity between the cross-attention mechanism and the key processes of MOT. We argue that this similarity enables us to perform MOT using only one cross-attention module (and a base detector). As a result, all the key processes of MOT can be completed entirely on the GPU, unlike in conventional models.
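The analogy between cross-attention and data association can be sketched as follows: treating the embeddings of tracked instances from the previous frame as queries and those of current detections as keys, a softmax over the scaled dot-product scores yields a soft assignment matrix. The NumPy sketch below is illustrative only; the function names, feature dimension, and instance counts are our assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def associate(prev_feats, curr_feats):
    """Soft data association via cross-attention (illustrative sketch):
    queries = embeddings of tracked instances (frame t-1),
    keys    = embeddings of current detections (frame t).
    Row i of the output is a soft assignment of track i over detections."""
    d = prev_feats.shape[-1]
    scores = prev_feats @ curr_feats.T / np.sqrt(d)  # scaled dot-product affinity
    return softmax(scores, axis=-1)

# Toy example: 3 tracks vs. 4 current detections (sizes are arbitrary).
rng = np.random.default_rng(0)
prev = rng.normal(size=(3, 64))
curr = rng.normal(size=(4, 64))
A = associate(prev, curr)
print(A.shape)  # (3, 4); each row sums to 1
```

Because this is a single matrix multiplication followed by a softmax, the association step stays on the GPU and scales smoothly with the number of instances, which is the property the section argues for.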

Following this idea, we propose a conceptually simple and fast tracker, called tracking crossword network (TicrossNet), which completes all the key processes of MOT using the cross-attention mechanism. It requires only minor modifications to the vanilla cross-attention mechanism, i.e., a softmax normalization, feature clipping, and a micro convolutional neural network (CNN), none of which increase computational cost significantly. As a result, the overhead of the tracking process does not increase significantly even when the number of instances increases. Note that TicrossNet uses the cross-attention mechanism for efficient MOT modeling itself, not only for feature extraction as in conventional transformers.
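As an illustration of how such light modifications could be layered on a vanilla cross-attention, the sketch below applies feature clipping before the dot product, a tiny convolution over the score map, and a final softmax normalization. The paper's actual designs are not reproduced here: the clipping threshold, the single 3x3 kernel, and the ordering of the steps are all hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modified_cross_attention(q, k, clip_val=5.0, kernel=None):
    """Hypothetical sketch of a lightly modified cross-attention:
    feature clipping -> scaled dot-product -> tiny conv -> softmax.
    clip_val and the 3x3 kernel are placeholders, not the paper's values."""
    q = np.clip(q, -clip_val, clip_val)  # feature clipping (placeholder threshold)
    k = np.clip(k, -clip_val, clip_val)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if kernel is not None:
        # Stand-in for the micro CNN: one 3x3 conv with zero padding
        # over the score map, so the output keeps the same shape.
        padded = np.pad(scores, 1)
        scores = np.array([[(padded[i:i + 3, j:j + 3] * kernel).sum()
                            for j in range(scores.shape[1])]
                           for i in range(scores.shape[0])])
    return softmax(scores, axis=-1)  # softmax normalization over detections

rng = np.random.default_rng(0)
A = modified_cross_attention(rng.normal(size=(3, 16)),
                             rng.normal(size=(5, 16)),
                             kernel=np.full((3, 3), 1 / 9.0))
print(A.shape)  # (3, 5)
```

Each added step is O(Nt^2) at most, consistent with the claim that the modifications do not significantly increase the cost of the attention itself.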

Detailed structure of TicrossNet

Results (MOT17 and MOT20 benchmarks)

TicrossNet achieves 32.6 FPS on MOT17 and 31.0 FPS on MOT20 even though the latter includes more than 100 instances per frame, while the other models, including the state-of-the-art (SOTA) MOT model ByteTrack, slow down significantly (except for the slower networks, i.e., TransCenter and TransTrack). Note that the video frame rates of MOT17 and MOT20 are 30 and 25 FPS, respectively; thus, we can safely say that TicrossNet runs in real time. In terms of MOTA, IDF1, and IDs, TicrossNet performs similarly to MOTR, which is the only end-to-end MOT model other than TicrossNet, but TicrossNet is significantly faster.

Nonetheless, in addition to its real-time speed, TicrossNet has an advantage that more than makes up for the lower MOTA, IDF1, and IDs: robustness to the number of instances (Nt). The right figure shows Nt vs. module latency. For a fair comparison, the same GPU (RTX 2080 Ti) is used for all models in this figure, which is not the case in the left table. We pick three fast models from the left table. The figure shows that the computational cost of TicrossNet does not increase significantly even as Nt increases, unlike the other fast models, including the SOTA MOT model ByteTrack. This is because (1) TicrossNet runs all the key processes of MOT on the GPU, unlike the others, and (2) it does not require attached tracking modules, which tend to increase computational cost significantly when Nt is large, as shown in the figure. This result demonstrates the robustness of TicrossNet to Nt.