This section provides the details of the systems under test.
High-level architecture of MSF-based MOT Perception System
VirTrack: TBD & Deep Fusion
JMODT: JDT & Deep Fusion
DFMOT: TBD & Late Fusion
YONTD: JDT & Late Fusion
The details are as follows.
VirTrack
VirTrack is a TBD-based multi-object tracking (MOT) system that sequentially performs detection and tracking tasks.
In the detection stage, it utilizes the VirConv model to identify objects. VirConv begins by generating a virtual point cloud from a depth map and employs a StVD (Stochastic Voxel Discard) scheme to prioritize and retain the most critical virtual points using bin-based sampling. It further incorporates an NRConv (Noise-Resistant Submanifold Convolution) layer to encode geometric features of voxels in both 3D space and 2D image space. By combining StVD and NRConv, the VirConv model introduces a VirConv operator, which effectively encodes the voxel features of virtual points, ensuring robust and accurate object detection.
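The bin-based sampling idea behind StVD can be sketched as follows: group virtual points by range bin and randomly keep at most a fixed number of points per bin, so that dense nearby regions are thinned while sparse distant bins survive. This is a hedged illustration; the function name, bin count, and per-bin cap are assumptions, not VirConv's actual API.

```python
import numpy as np

def stochastic_voxel_discard(points, num_bins=10, max_range=70.0,
                             keep_per_bin=100, rng=None):
    """Illustrative bin-based stochastic discard (not VirConv's code).

    Points are grouped into `num_bins` range bins by distance from the
    sensor; each bin randomly keeps at most `keep_per_bin` points.
    """
    rng = rng or np.random.default_rng(0)
    dist = np.linalg.norm(points[:, :3], axis=1)
    bin_idx = np.minimum((dist / max_range * num_bins).astype(int),
                         num_bins - 1)
    kept = []
    for b in range(num_bins):
        idx = np.flatnonzero(bin_idx == b)
        if len(idx) > keep_per_bin:
            idx = rng.choice(idx, size=keep_per_bin, replace=False)
        kept.append(idx)
    return points[np.concatenate(kept)]

# Example: thin 5000 dense virtual points down to at most 10 * 100 points
pts = np.random.default_rng(1).uniform(-70, 70, size=(5000, 3))
kept = stochastic_voxel_discard(pts)
```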
In the tracking stage, VirTrack employs a 3D tracker with a novel data association scheme guided by prediction confidence. This approach allows for the effective utilization of object features in point clouds and ensures the tracking of temporarily missed objects. The tracker takes the detected results as input, estimates the current states of tracked objects using a predictor based on a constant acceleration (CA) motion model, and assigns confidence levels to each predicted state, including those for temporarily missed detections. These predicted states are then associated with detected states by leveraging prediction confidence and an aggregated pairwise cost. Finally, the matched pairs are updated to complete the tracking process, resulting in accurate and reliable multi-object tracking.
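The constant-acceleration (CA) prediction step above can be sketched for a single axis; the state layout, time step, and process-noise model here are simplifying assumptions, not VirTrack's actual tracker implementation.

```python
import numpy as np

def ca_predict(state, cov, dt=0.1, q=1.0):
    """One CA Kalman prediction step for one axis.

    state = [position, velocity, acceleration]; the transition matrix
    advances position by v*dt + 0.5*a*dt^2 and velocity by a*dt.
    Process noise Q is a simplified isotropic placeholder.
    """
    F = np.array([[1.0, dt, 0.5 * dt**2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    Q = q * np.eye(3)
    return F @ state, F @ cov @ F.T + Q

state = np.array([0.0, 10.0, 2.0])   # 10 m/s, accelerating at 2 m/s^2
cov = np.eye(3)
pred, pred_cov = ca_predict(state, cov)
# position advances by 10*0.1 + 0.5*2*0.1^2 = 1.01 m
```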
The system architecture of VirTrack
JMODT
JMODT[4] uses LI-Fusion modules to produce 3D bounding boxes and association confidences for online mixed-integer programming. Robust affinity computation and data association methods are specifically proposed to improve multi-object tracking performance.
The JMODT architecture includes five main modules: (1) a region proposal network (RPN), (2) parallel detection and correlation networks, (3) affinity computation, (4) data association, and (5) track management.
The tracking pipeline consists of five stages: (1) the RPN takes calibrated sensor data from paired frames as input and generates regions of interest (RoIs) and multi-modal features of the region proposals; (2) the parallel detection and correlation networks use the RoI and proposal features to generate detection results, Re-ID affinities, and start-end probabilities; (3) the Re-ID affinities are further refined via the motion prediction and match score ranking modules; (4) the mixed-integer programming module performs comprehensive data association based on the detection results and computed affinities; (5) the association results are further managed to achieve continuous tracks despite object occlusions and reappearances.
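The core of stage (4) is an assignment over a track-detection affinity matrix. JMODT solves a richer mixed-integer program (which also incorporates start-end probabilities); the sketch below substitutes the Hungarian algorithm as a simplified stand-in, with a hypothetical affinity threshold, purely to illustrate the matching step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, min_affinity=0.3):
    """Optimal one-to-one assignment maximizing total affinity.

    A simplified stand-in for JMODT's mixed-integer programming module:
    pairs below `min_affinity` (an illustrative threshold) are dropped.
    """
    rows, cols = linear_sum_assignment(-affinity)  # negate to maximize
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if affinity[r, c] >= min_affinity]

# Example: 3 tracks vs. 2 detections; track 2 stays unmatched
aff = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.1, 0.1]])
matches = associate(aff)
```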
The system architecture of JMODT
DFMOT
DFMOT employs a four-level deep association mechanism to exploit the complementary advantages of cameras and LiDARs. This mechanism fuses the 2D and 3D trajectories without involving complex cost functions or feature extraction networks.
First, DFMOT calculates the IoU between the 2D bounding boxes from the image and the 2D bounding boxes obtained by projecting the 3D LiDAR bounding boxes onto the image plane. The detected objects are classified into three categories according to the matching results: objects detected only in the image, objects detected only in the LiDAR point cloud, and objects detected in both domains.
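This classification step can be sketched as a greedy IoU matching between the two box sets. The threshold and function names below are illustrative assumptions, not DFMOT's exact implementation.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def classify_detections(img_boxes, lidar_boxes_2d, thresh=0.5):
    """Split detections into the three DFMOT categories: fused pairs,
    image-only, and LiDAR-only. Greedy sketch; `thresh` is illustrative."""
    fused, img_only, matched_lidar = [], [], set()
    for i, ib in enumerate(img_boxes):
        best_j, best_iou = -1, thresh
        for j, lb in enumerate(lidar_boxes_2d):
            v = iou_2d(ib, lb)
            if j not in matched_lidar and v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            fused.append((i, best_j))
            matched_lidar.add(best_j)
        else:
            img_only.append(i)
    lidar_only = [j for j in range(len(lidar_boxes_2d))
                  if j not in matched_lidar]
    return fused, img_only, lidar_only

# Example: image box 0 overlaps projected LiDAR box 0; the rest are unmatched
img_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
lidar_boxes_2d = [[1, 1, 11, 11], [50, 50, 60, 60]]
fused, img_only, lidar_only = classify_detections(img_boxes, lidar_boxes_2d)
```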
Next, DFMOT uses the foregoing three types of objects as inputs to deep association. This deep association mechanism comprises four levels of data association.
In the 1st level of association, the existing 3D trajectories are associated with the fused detections.
In the 2nd level of association, the unmatched trajectories from the previous stage are associated with the detections only in the LiDAR domain.
The 3rd level of association handles 2D tracking in the image domain only, associating 2D trajectories with objects detected solely by the 2D detector.
In the 4th level of association, the unmatched 3D trajectories are associated with the 2D trajectories produced in the image domain during the third stage.
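The four-level cascade above can be sketched as follows. The `match` routine here is a toy placeholder that pairs equal IDs; in DFMOT it would be an IoU- or distance-based matcher, and all names are hypothetical.

```python
def match_by_id(tracks, dets):
    """Toy stand-in for a per-level matcher: pair items with equal IDs,
    returning (matches, unmatched_tracks, unmatched_dets)."""
    pairs = [(t, d) for t in tracks for d in dets if t == d]
    mt = {t for t, _ in pairs}
    md = {d for _, d in pairs}
    return (pairs,
            [t for t in tracks if t not in mt],
            [d for d in dets if d not in md])

def deep_association(trks_3d, trks_2d, fused_dets, lidar_dets, img_dets, match):
    """Illustrative sketch of DFMOT's four-level association cascade."""
    # Level 1: 3D trajectories vs. fused camera+LiDAR detections
    m1, trks_3d, _ = match(trks_3d, fused_dets)
    # Level 2: remaining 3D trajectories vs. LiDAR-only detections
    m2, trks_3d, _ = match(trks_3d, lidar_dets)
    # Level 3: 2D trajectories vs. image-only detections
    m3, trks_2d, _ = match(trks_2d, img_dets)
    # Level 4: remaining 3D trajectories vs. 2D trajectories matched in level 3
    m4, _, _ = match(trks_3d, [t for t, _ in m3])
    return m1, m2, m3, m4

# Example: tracks 1 and 2 are matched in levels 1-2, 2D track 4 in level 3
m1, m2, m3, m4 = deep_association([1, 2, 3], [4, 5], [1], [2], [4], match_by_id)
```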
The system architecture of DFMOT
YONTD
YONTD employs an end-to-end multi-modal fusion tracking approach, enabling a single model to seamlessly handle both detection and tracking tasks. This eliminates the need for the complex data association processes typically required in the classic TBD paradigm. The multi-modal fusion-based multi-object tracking framework is composed of three main modules: the data input module, the detection and trajectory regression module, and track management. The detection and trajectory regression module incorporates several essential components, including two-stage 2D and 3D detectors, confidence fusion, and trajectory-based NMS, and operates as follows:
3D Detection and Prediction: Bounding boxes for the current frame are generated using a two-stage 3D detector on the current frame's point cloud. The trajectory states from the previous frame are predicted using a Kalman filter.
Pose Compensation: Motion transformation between frames is derived from GPS/IMU data and used to compensate for predicted trajectory states.
Trajectory Regression: The compensated trajectories serve as proposals for the two-stage detector, which regresses their pose and confidence in the current frame.
2D Projection and Detection: The regressed 3D trajectories are projected onto a 2D image, and a two-stage image-based detector processes these projections and the current frame image to assess trajectory confidence.
Confidence Fusion: Confidence scores from the 2D and 3D detectors are fused with historical confidence values.
NMS and Data Association: The fused confidence scores guide NMS, ranking detections by trajectory confidence to achieve association and finalize results.
Note that the trajectory management module is adopted from DFMOT.
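The confidence fusion and trajectory-ranked NMS steps can be sketched as below. The smoothing factor, the averaging rule, and the 1-D IoU are illustrative simplifications, not YONTD's exact formulation.

```python
import numpy as np

def fuse_confidence(hist_conf, conf_3d, conf_2d, alpha=0.5):
    """Blend current 2D/3D detector confidences with a trajectory's
    historical confidence via exponential smoothing (alpha and the
    averaging rule are illustrative assumptions)."""
    return alpha * hist_conf + (1 - alpha) * 0.5 * (conf_3d + conf_2d)

def trajectory_nms(boxes, scores, iou_fn, iou_thresh=0.5):
    """NMS ranked by fused trajectory confidence: higher-confidence
    trajectories suppress overlapping lower-confidence ones, which
    implicitly performs the data association described above."""
    order = np.argsort(-np.asarray(scores))
    keep = []
    for i in order:
        if all(iou_fn(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(int(i))
    return keep

def interval_iou(a, b):  # toy 1-D "IoU" for illustration only
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / ((a[1] - a[0]) + (b[1] - b[0]) - inter)

# Example: box 1 overlaps box 0 and has lower fused confidence, so it is dropped
boxes = [(0, 10), (1, 11), (20, 30)]
scores = [fuse_confidence(0.9, 0.8, 0.6),   # 0.80
          fuse_confidence(0.3, 0.4, 0.2),   # 0.30
          fuse_confidence(0.6, 0.9, 0.9)]   # 0.75
result = trajectory_nms(boxes, scores, interval_iou)
```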
The system architecture of YONTD
BevFusion
BevFusion is a deep fusion-based multi-sensor perception system that integrates features from multiple modalities—typically LiDAR point clouds and multi-view camera images—into a unified bird’s-eye view (BEV) representation. In its design, raw sensory inputs are first transformed into modality-specific feature maps, with camera features being projected into the BEV space via geometric view transformation, and LiDAR features encoded directly in BEV using point-based or voxel-based encoders. These features are then fused at the BEV level, enabling the network to jointly exploit the geometric precision of LiDAR and the rich semantic context of camera images. By operating entirely in the BEV domain, BevFusion facilitates consistent spatial reasoning, efficient cross-modal alignment, and scalability to multiple perception tasks such as 3D object detection, tracking, and map segmentation, making it a strong baseline in autonomous driving research. In our implementation, we adopt the TBD strategy from the original paper, which performs object detection in the fused BEV space and then applies a greedy-matching-based tracker to build the tracking pipeline.
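The BEV-level fusion step can be sketched as channel-wise concatenation of the two BEV feature maps followed by a 1x1 linear projection. This is a minimal numpy sketch under assumed channel counts, not the exact BevFusion architecture.

```python
import numpy as np

def fuse_bev(cam_bev, lidar_bev, out_channels=64, seed=0):
    """Fuse camera and LiDAR BEV feature maps (shape (C, H, W)) by
    concatenating channels, then mixing them with a 1x1 projection
    (random weights stand in for a learned convolution)."""
    fused = np.concatenate([cam_bev, lidar_bev], axis=0)   # (C1+C2, H, W)
    c = fused.shape[0]
    w = np.random.default_rng(seed).standard_normal((out_channels, c)) * 0.01
    return np.einsum('oc,chw->ohw', w, fused)              # (out_channels, H, W)

cam_bev = np.zeros((80, 32, 32))    # camera features lifted into BEV (assumed C=80)
lidar_bev = np.zeros((64, 32, 32))  # voxelized LiDAR features in BEV (assumed C=64)
out = fuse_bev(cam_bev, lidar_bev)
```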
Configuration
These MOT perception systems utilize pre-trained models and are configured with the default tracker settings provided by their respective authors. Moreover, for single-sensor detection systems (e.g., used as branches in MOT systems), we employ CascadeRCNN as the camera detector and VoxelRCNN as the LiDAR detector.