Fig. 1. An illustration of the multi-sensor fusion perception system
We conducted preliminary experiments on the pipeline proposed in CLOCs [1], a representative multi-sensor fusion perception system. This pipeline consists of three ML components and one non-ML component and has been utilized by Gao et al. for testing perception fusion systems [2]. Our supplementary experiments aim to address two key questions:
How much adaptation is required to RegTrieve's design and implementation for new application scenarios?
How effective is RegTrieve in application scenarios beyond Spoken QA?
Next, we will introduce the additional application scenario, evaluation setups, and evaluation results.
Application Scenario Description
We briefly introduce CLOCs here; for details, please refer to its paper [1]. The system fuses the results of 2D and 3D object detection to achieve more accurate 3D object detection. As Fig. 1 shows, it consists of three ML components (a 2D object detection model, a 3D object detection model, and a fusion model) and one non-ML component (the Intersection Filter). First, the 2D images from a camera and the 3D point cloud data from LiDAR are processed by the respective 2D and 3D object detection models to generate 2D and 3D bounding boxes. Then, the Intersection Filter applies rule-based filtering to identify overlapping bounding boxes. Finally, the fusion model, a CNN-based model trained on labeled data, merges the filtered bounding boxes to produce the final detection results.
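To make the data flow concrete, the following Python sketch mirrors the pipeline structure described above. The function names (detect_2d, detect_3d, intersection_filter, fusion_model) are placeholders we introduce for illustration only; they do not correspond to the actual CLOCs implementation.

# Illustrative sketch of the CLOCs-style fusion pipeline described above.
# All function names are placeholders, not the real CLOCs code.
def run_fusion_pipeline(image, point_cloud, detect_2d, detect_3d,
                        intersection_filter, fusion_model):
    boxes_2d = detect_2d(image)         # 2D bounding boxes from the camera branch
    boxes_3d = detect_3d(point_cloud)   # 3D bounding boxes from the LiDAR branch
    # Non-ML component: keep only (2D, 3D) candidate pairs whose projections overlap.
    candidate_pairs = intersection_filter(boxes_2d, boxes_3d)
    # ML component: the learned fusion model merges the candidates into final 3D detections.
    return fusion_model(candidate_pairs)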
Evaluation Setups
Update Scenario
We use Cascade R-CNN [3] and YOLO-v5 [4] as the 2D object detection models, and SECOND [5] as the 3D detection model. The fusion model is the CLOCs fusion model [1]. In the experiments, we updated the 2D object detection model (replacing Cascade R-CNN with YOLO-v5) while keeping the other components unchanged.
Metrics
In object detection tasks, true positives and false positives are typically determined by the IoU (Intersection over Union) between predicted and ground-truth bounding boxes; recall, precision, and regression errors are therefore all computed based on IoU. Recall is used to measure the accuracy of the 2D object detection model, because any redundant boxes are filtered out by the Intersection Filter, which makes precision less informative at the model level. The F1 score is used to evaluate the overall system accuracy.
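For concreteness, below is a minimal sketch of how IoU-based matching and the resulting recall, precision, and F1 can be computed for axis-aligned 2D boxes. The greedy matching strategy and the 0.7 IoU threshold are illustrative assumptions, not necessarily the exact evaluation protocol used in our experiments.

# Minimal sketch: IoU-based matching and recall/precision/F1 for axis-aligned boxes.
# Boxes are (x1, y1, x2, y2); the 0.7 threshold and greedy matching are illustrative.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def detection_metrics(preds, gts, thr=0.7):
    matched, tp = set(), 0
    for p in preds:  # greedily match each prediction to the best unmatched ground-truth box
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    recall = tp / (tp + fn) if gts else 0.0
    precision = tp / (tp + fp) if preds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1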
Methods
Evaluation Results
The experimental results are shown in the table above. As can be seen, compared to the baselines, all three settings of RegTrieve reduce more regression errors without compromising model-level recall. Among them, System+ reduces the most regression errors. Since Simple-Old retains many non-overlapping bounding boxes from the old model, it incurs fewer regression errors; however, it does not incorporate enough results from the new model, leading to lower accuracy. Although RegTrieve causes a slight decrease in system-level F1, the reduction is minimal compared to the baselines, while RegTrieve also removes a relatively larger number of system regression errors.
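As a reminder of how regression errors are counted in this discussion, here is a minimal sketch. It assumes each ground-truth object yields a boolean "detected" flag for the old and the updated system (e.g., derived from IoU matching as above); this is an illustration, not the exact bookkeeping of our evaluation scripts.

# Illustrative sketch: a ground-truth object detected by the old system but missed
# by the updated system counts as one system-level regression error.
def count_regressions(old_hits, new_hits):
    # old_hits / new_hits: one boolean per ground-truth object
    return sum(old and not new for old, new in zip(old_hits, new_hits))

def regression_error_rate(old_hits, new_hits):
    return count_regressions(old_hits, new_hits) / max(len(old_hits), 1)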
Summary
Here are the answers to the two questions raised at the beginning:
RegTrieve Adaptation. When applying RegTrieve to this application scenario, the overall framework is the same as in the Spoken QA scenario, as both rely on loss-based model accuracy estimation to compute ensemble weights. However, the methods for calculating the loss and merging bounding boxes differ from the Spoken QA scenario (a sketch of this adaptation is given below). Note that the baseline methods also need to be adapted for bounding-box merging in this new scenario. Moreover, the Uncertainty-Dropout method cannot be used because the models lack dropout layers.
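To sketch the kind of adaptation described above, the snippet below shows one plausible way to turn estimated losses into ensemble weights and to merge a matched pair of old/new boxes. The softmax-over-negative-losses weighting and the weighted-average merge are our illustrative assumptions, not RegTrieve's exact formulas.

import math

# Illustrative sketch only: ensemble weights from estimated losses, then a simple
# weighted merge of a matched old/new box pair.
def ensemble_weights(est_loss_old, est_loss_new, temperature=1.0):
    # Lower estimated loss -> higher weight (softmax over negative losses).
    z_old = math.exp(-est_loss_old / temperature)
    z_new = math.exp(-est_loss_new / temperature)
    return z_old / (z_old + z_new), z_new / (z_old + z_new)

def merge_boxes(box_old, box_new, w_old, w_new):
    # Weighted average of two matched boxes, e.g. (x1, y1, x2, y2, score).
    return tuple(w_old * o + w_new * n for o, n in zip(box_old, box_new))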
RegTrieve Effectiveness. Compared to the baseline methods, RegTrieve reduces more regression errors while maintaining system accuracy in the multi-sensor fusion perception scenario. Additionally, considering system-level losses helps reduce more system-level regression errors than considering model-level losses alone.
Due to time constraints, we currently provide results for only one model update scenario. We will provide additional results for other model update scenarios in the future.
References:
[1] Pang et al. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. IROS 2020.
[2] Gao et al. MultiTest: Physical-Aware Object Insertion for Testing Multi-sensor Fusion Perception Systems. ICSE 2024.
[3] Cai et al. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. arXiv 2019.
[4] YOLO-v5. https://zenodo.org/records/7347926
[5] Yan et al. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018.
We conducted a supplementary evaluation of the datastore storage size and I/O time costs (in seconds) across seven scenarios. For the testing dataset, we repeated the inference process three times and used the averages as the final I/O time and inference time results.
The "model" and "system" under the "datastore storage" column in the table represent the sizes of the datastores that need to be introduced when using model-level and system-level loss in RegTrieve, respectively. As shown, the datastore storage size ranges from 2GB to 8GB, depending on the model parameters in the specific scenario. This size can generally be fully loaded into the RAM of mainstream machines.
Additionally, we recorded the time spent loading the datastore, model, and dataset, as well as the total inference time. The time taken to load the datastore is comparable to that of loading the model and dataset; since the latter two are essential for normal model inference anyway, the additional I/O cost introduced by RegTrieve is relatively small. Moreover, the I/O time during initialization (loading the model, dataset, and datastore) is minimal compared to the total inference time.
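To illustrate how such I/O costs can be measured, here is a small timing sketch. The file names, the NumPy-array datastore format, and the memory-mapped loading are assumptions made purely for illustration, not RegTrieve's actual storage layout.

import time
import numpy as np

# Illustrative timing sketch: the datastore is assumed to be a NumPy array of
# feature keys plus per-entry losses stored on disk; paths are placeholders.
def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

keys, t_keys = timed(np.load, "datastore_keys.npy", mmap_mode="r")   # hypothetical path
losses, t_losses = timed(np.load, "datastore_losses.npy")            # hypothetical path
print(f"datastore load time: {t_keys + t_losses:.2f}s, "
      f"storage: ~{keys.nbytes / 1e9:.1f} GB")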
Due to time constraints, we have currently evaluated the impact of the hyperparameter K on the performance of the RegTrieve Model, System, and System+ in Scenario 1 and Scenario 5. We will add results for other scenarios in the future.
Scenario 1: The impact of K on the performance under Model setting of RegTrieve
Scenario 1: The impact of K on the performance under System setting of RegTrieve
Scenario 1: The impact of K on the performance under System+ setting of RegTrieve
Scenario 5: The impact of K on the performance under Model setting of RegTrieve
Scenario 5: The impact of K on the performance under System setting of RegTrieve
Scenario 5: The impact of K on the performance under System+ setting of RegTrieve
As shown in the figures above, in both scenarios and for the three settings of RegTrieve, the F1 score initially increases with K and then stabilizes. Similarly, the system regression error rate decreases with increasing K and then stabilizes. This trend is expected, as a larger K means more neighbors, leading to more accurate estimates of model accuracy and ensemble weights. However, beyond 15 or 20 neighbors, the newly added neighbors are increasingly distant from the current sample and contribute less to the final result.
Therefore, setting K=20 in our experiments is a reasonable choice, as it is universally applicable across different scenarios and model settings, without requiring complex hyperparameter tuning.
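The plateau behaviour described above is easy to see in a small sketch of retrieval-based loss estimation: with distance-based weighting, neighbors added beyond the first 15-20 are far from the query and receive tiny weights, so the estimated loss (and hence the ensemble weights) barely changes. The use of scikit-learn's NearestNeighbors and inverse-distance weighting here are illustrative assumptions, not RegTrieve's exact retrieval implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative sketch of retrieval-based loss estimation and the role of K.
# In practice the index would be built once and reused, not rebuilt per query.
def estimate_loss(query_feat, datastore_feats, datastore_losses, k=20):
    index = NearestNeighbors(n_neighbors=k).fit(datastore_feats)
    dist, idx = index.kneighbors(np.asarray(query_feat).reshape(1, -1))
    weights = 1.0 / (dist[0] + 1e-8)      # closer neighbors contribute more
    weights /= weights.sum()
    # Neighbors beyond roughly 15-20 tend to be far away and get tiny weights,
    # which matches the plateau observed in the figures.
    return float(np.dot(weights, np.asarray(datastore_losses)[idx[0]]))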