In this section, we introduce how we collect the subject MSF systems and explain the selection criteria and sources. Then, we provide a detailed list of papers and analyze the trends of MSF systems from four perspectives: conference/journal, task, modality, and dataset.
To collect the most representative MSF systems into our benchmark, we set the following requirements for selecting valid candidates from the vast body of literature.
MSF-based perception systems are typically equipped with multiple sensors to sense the environment and combine data from different sources to achieve more accurate and reliable sensing. Therefore, we focus on MSF systems that involve two or more types of sensors and exclude data fusion within a single sensor type (e.g., multi-view data fusion from LiDAR).
In addition, as stated in the introduction, AI-enabled MSF systems show superior performance in processing and extracting complex semantic information from sensor data. Therefore, we pay more attention to AI-enabled MSF systems equipped with sensors that have complementary characteristics and different physical properties (e.g., camera and LiDAR). Such systems are more representative because they require novel and typical AI-enabled fusion methods to handle the more challenging fusion of heterogeneous data.
There are plenty of exciting MSF projects designed and built by companies and researchers, but not all of them are open source. In our paper collection, more than half of the papers do not provide a public code repository. To analyze and evaluate system performance, we need MSF projects whose complete systems are available.
In addition to a public code repository, the data used by the MSF project must be available so that we can synthesize large-scale corruption datasets with our corruption patterns and evaluate robustness against them. Data availability is a prerequisite for our evaluation of system performance. We therefore focus on multimodal datasets that contain a large amount of publicly available data (e.g., KITTI).
An MSF system should be designed for representative perception tasks with real-world applications. While browsing the available candidate MSF systems, we found that object detection, object tracking, and depth completion are the most common tasks, with object detection being the most prevalent. To obtain convincing evaluation results, we select as many representative tasks as possible rather than limiting our study to a single task.
To collect as many appropriate AI-enabled MSF perception systems as possible for our study, we mainly focus on two sources: (1) the leaderboard of the KITTI benchmark, and (2) MSF-related literature.
KITTI is one of the most popular autonomous driving datasets and supports multiple perception tasks, including 2D and 3D object detection, object tracking, and depth completion. KITTI uses four high-resolution cameras, a Velodyne HDL-64E LiDAR, and an advanced positioning system to collect data from different real-world driving scenarios. The data from KITTI have been processed (synced and rectified) and well labeled. In addition, the KITTI dataset provides an official development kit and a public leaderboard, which help researchers evaluate and compare the performance of MSF systems.
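For readers unfamiliar with the data layout, the following minimal sketch loads one KITTI frame. It assumes a local copy of the object-detection split in its standard directory layout (image_2/, velodyne/); the root path and frame id are placeholders.

```python
import numpy as np
from PIL import Image

# Illustrative only: assumes the standard KITTI object-detection layout.
KITTI_ROOT = "kitti/object/training"   # placeholder path
frame_id = "000000"

# Camera image: a dense, regular RGB grid (H x W x 3).
image = np.array(Image.open(f"{KITTI_ROOT}/image_2/{frame_id}.png"))

# Velodyne scan: an unordered set of float32 points (x, y, z, reflectance).
points = np.fromfile(f"{KITTI_ROOT}/velodyne/{frame_id}.bin",
                     dtype=np.float32).reshape(-1, 4)

print(image.shape)   # e.g., (375, 1242, 3)
print(points.shape)  # e.g., (~120000, 4)
```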
We collect papers published in relevant top-tier conferences and journals during the last four years, covering software engineering, robotics, computer vision, etc. Below is the complete list of selected venues.
software engineering: ICSE, FSE, ASE, ISSTA, ISSRE, FASE, ICST, SOSP, OSDI, PLDI
robotics: ICRA, IROS, RSS, RAL
transportation: ITSC, IV, ICCAR, TITS, TIV
machine learning: NeurIPS, ICML, AAAI, ICLR, IJCAI
computer vision: CVPR, ECCV, ICCV
security: CCS, USENIX, S&P, NDSS
We conducted our initial search by querying combinations of one fusion-related keyword ('multi-sensor', 'multisensor', 'multi-modal', 'multimodal', or 'data fusion') together with 'deep learning'. We then went through the titles and abstracts and skimmed the papers to determine whether they satisfied the criteria above. In total, we collected 27 relevant papers, shown in the next subsection.
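The query construction can be summarized by the sketch below; it is illustrative only, as the actual searches were issued manually against the venues' digital libraries.

```python
# Reconstruct the keyword combinations used in the initial search.
fusion_terms = ["multi-sensor", "multisensor", "multi-modal", "multimodal", "data fusion"]
queries = [f'"{term}" AND "deep learning"' for term in fusion_terms]
for q in queries:
    print(q)
```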
Paper Lists
This table contains a list of open-source MSF systems. In the column "Modality", C means camera, L means LiDAR, R means radar, M means map, G means GPS, and A means audio.
Note that during our search, we found that more than half of the papers are not open source. Some of these papers experiment with fusion using other sensors, such as IMU and ultrasonic devices; others are evaluated on other tasks, such as point cloud completion and physical device control. Readers can access the complete list of papers, including those that are not open source, from this [Link].
We count the number of papers by "Task", "Modality", "Conference/Journal", and "Dataset", respectively, and obtain the following insights:
Conference/Journal. We find that most of the papers were published in computer vision conferences (i.e., CVPR, ECCV, ICCV) and robotics-related venues (ICRA, IROS, RAL). In other fields, such as machine learning and transportation, and especially software engineering, there are hardly any relevant papers. As complex software systems deployed in safety-critical applications such as self-driving, MSF-based perception systems should receive more attention.
Task. We find that most of the studies focus on object detection (18/27). Object detection is one of the most classic perception tasks; it aims to locate, classify, and estimate oriented bounding boxes in 3D space. This benchmark covers the three most common tasks, i.e., object detection, object tracking, and depth completion.
Modality. We find that most of the studies focus on camera-LiDAR fusion (23/27). There are three reasons why camera-LiDAR fusion models are more effective and popular than fusion with other sensors.
First, both the camera and LiDAR are sensors with excellent sensing capabilities. The camera is more effective at capturing semantic information, while LiDAR provides more accurate geometric information.
Second, cameras and LiDAR have complementary characteristics. For example, cameras struggle in poor lighting conditions, while LiDAR can work in the dark; LiDAR has a low refresh rate and sparse output, while cameras capture the texture and fine details of objects.
Third, fusing heterogeneous data is more challenging and representative. The camera and LiDAR capture image and point cloud data, respectively. Images are regular, ordered, and discrete, while point clouds are irregular, disordered, and continuous.
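As a concrete illustration of this heterogeneity, a common first step in camera-LiDAR fusion is to project the point cloud into the image plane using the calibration matrices. The sketch below follows the KITTI convention and assumes the matrices P2, R0_rect, and Tr_velo_to_cam have already been parsed from a calib file; it is a minimal sketch, not the fusion method of any particular subject system.

```python
import numpy as np

def project_lidar_to_image(points, Tr_velo_to_cam, R0_rect, P2):
    """Project LiDAR points (N, >=3) into the camera image plane.

    Assumes the KITTI convention: P2 is the 3x4 projection matrix of the
    left color camera, R0_rect the 3x3 rectification matrix, and
    Tr_velo_to_cam the 3x4 LiDAR-to-camera transform.
    """
    n = points.shape[0]
    pts_h = np.hstack([points[:, :3], np.ones((n, 1))])   # (N, 4) homogeneous LiDAR coords
    pts_cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)         # (3, N) rectified camera frame
    pts_cam_h = np.vstack([pts_cam, np.ones((1, n))])      # (4, N)
    pts_img = P2 @ pts_cam_h                                # (3, N) homogeneous pixel coords
    pts_img = pts_img[:2] / pts_img[2]                      # perspective divide -> (2, N)
    in_front = pts_cam[2] > 0                               # keep points in front of the camera
    return pts_img.T[in_front]                               # (M, 2) pixel coordinates
```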
Dataset. We found that more than two-thirds (19/27) of the MSF systems are evaluated on KITTI. To this end, this benchmark uses KITTI as the base dataset to construct KITTI-C for benchmarking the performance and robustness of AI-enabled MSF systems. Note that the corruption patterns used in our benchmark can also generalize to other datasets, such as nuScenes.
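To give a flavor of what a corruption pattern looks like, the sketch below adds Gaussian noise to an image and jitters LiDAR coordinates at a chosen severity level. The severity scales shown are hypothetical placeholders; the actual KITTI-C operators and severity levels are defined in the benchmark itself.

```python
import numpy as np

def gaussian_noise_image(image, severity=3):
    """Add zero-mean Gaussian noise to an RGB image (uint8, H x W x 3)."""
    sigma = [4, 8, 12, 18, 26][severity - 1]              # hypothetical severity scale
    noisy = image.astype(np.float32) + np.random.normal(0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_noise_points(points, severity=3):
    """Jitter the (x, y, z) coordinates of a LiDAR scan (N x 4)."""
    sigma = [0.02, 0.04, 0.06, 0.08, 0.10][severity - 1]  # metres, hypothetical scale
    noisy = points.copy()
    noisy[:, :3] += np.random.normal(0, sigma, (points.shape[0], 3))
    return noisy
```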
From the paper lists, we select seven state-of-the-art MSF systems covering three different tasks (i.e., object detection, object tracking, and depth completion) and three different fusion mechanisms (i.e., late fusion, deep fusion, and weak fusion).
Object Detection: EPNet, FConv, CLOCs
Object Tracking: JMODT, DFMOT
Depth Completion: TWISE, MDANet
We implement all MSF systems with PyTorch 1.8 and Python 3.7. For each system, we use the default configuration to ensure a consistent runtime environment. Due to environment conflicts, we are unable to reproduce the MSF systems of papers #1, #4, #19, and #22 in the paper lists. All subject MSF systems we collected for benchmarking, together with their system architectures, application tasks, and image illustrations, are shown in Benchmarks.
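A minimal sanity check of the runtime described above might look as follows; the exact per-system dependencies are given by each project's own requirements.

```python
import sys
import torch

# Check the shared runtime (Python 3.7, PyTorch 1.8) before running a subject system.
assert sys.version_info[:2] == (3, 7), "expected Python 3.7"
assert torch.__version__.startswith("1.8"), "expected PyTorch 1.8"
print("CUDA available:", torch.cuda.is_available())
```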