This section summarizes the seven MSF systems selected in our benchmark. These seven systems cover three different tasks and three different fusion mechanisms.
The collected MSF systems.
EPNet[1] proposes a LiDAR-guided Image Fusion (LI-Fusion) module to improve 3D detection performance. The LI-Fusion module employs a multi-scale point-wise feature fusion scheme to enable interaction between the hidden features of the point cloud and the image data.
EPNet consists of a two-stream RPN for proposal generation and a refinement network for bounding-box refinement, and can be trained end-to-end. The two-stream RPN effectively combines the LiDAR point feature and the semantic image feature via the proposed LI-Fusion module.
The LiDAR-guided image fusion module consists of a grid generator, an image sampler, and an LI-Fusion layer.
First, the LI-Fusion module projects the LiDAR points onto the camera image; the corresponding mapping matrix is denoted M. The grid generator takes a LiDAR point cloud and the mapping matrix M as inputs, and outputs the point-wise correspondence between the LiDAR points and the camera image at different resolutions.
Next, the module uses an image sampler to obtain a semantic image feature representation for each point.
Then, a LiDAR-guided fusion layer uses the LiDAR feature to adaptively estimate the importance of the image feature in a point-wise manner.
Finally, the module combines the LiDAR feature FP and the weighted semantic image feature FI by concatenation.
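A minimal sketch of an LI-Fusion-style layer following the four steps above is given below. The sigmoid gating, the single feature scale, and the layer sizes are simplifying assumptions for illustration, not EPNet's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LIFusionLayer(nn.Module):
    """Simplified LI-Fusion-style layer: sample an image feature per LiDAR point,
    weight it by a LiDAR-guided gate, and concatenate it with the point feature."""

    def __init__(self, lidar_dim, image_dim):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, lidar_dim)    # align channel dimensions
        self.weight_net = nn.Linear(2 * lidar_dim, 1)      # point-wise importance estimate

    def forward(self, point_xyz, point_feat, image_feat, M, img_hw):
        """point_xyz: (N, 3) LiDAR points; point_feat: (N, C_p) LiDAR features;
        image_feat: (1, C_i, H, W) image feature map; M: (3, 4) projection matrix."""
        # Grid generator: project the points into the image plane with M.
        ones = torch.ones(point_xyz.shape[0], 1, device=point_xyz.device)
        uvw = torch.cat([point_xyz, ones], dim=1) @ M.t()          # (N, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)              # pixel coordinates
        # Image sampler: bilinearly sample a semantic feature for each point.
        h, w = img_hw
        grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,            # normalize to [-1, 1]
                            2 * uv[:, 1] / (h - 1) - 1], dim=1)
        sampled = F.grid_sample(image_feat, grid.view(1, -1, 1, 2),
                                align_corners=True)                # (1, C_i, N, 1)
        img_per_point = self.img_proj(sampled.squeeze(0).squeeze(-1).t())  # (N, C_p)
        # LiDAR-guided fusion layer: estimate the image-feature importance per point.
        w_img = torch.sigmoid(self.weight_net(
            torch.cat([point_feat, img_per_point], dim=1)))        # (N, 1)
        # Combine F_P and the weighted F_I by concatenation.
        return torch.cat([point_feat, w_img * img_per_point], dim=1)
```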
The system architecture of EPNet
Illustration of the LI-Fusion module
FConv[2] (F-ConvNet) uses a frustum-based point cloud feature extraction method. It first obtains 2D region proposals in RGB images, then generates a sequence of frustums for each 2D region proposal to aggregate local point-wise features into frustum-level feature vectors before feeding them into a fully convolutional network (FCN) for prediction.
First, given 2D region proposals in an RGB image, FConv generates a sequence of frustums for each region proposal and uses the obtained frustums to group local points.
Next, FConv aggregates point-wise features into frustum-level feature vectors and arranges these vectors as a 2D feature map for its subsequent fully convolutional network (FCN), which spatially fuses frustum-level features and supports end-to-end, continuous estimation of oriented boxes in 3D space.
Step 1: Generate a sequence of frustums from a region proposal in an RGB image.
Step 2: Aggregate point-wise features as frustum-level feature vectors.
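As a rough illustration of these two steps, the sketch below groups points into depth-sliced frustums for one region proposal and max-pools point-wise features per slice. The intrinsics handling, fixed depth range, and slice count are assumptions; F-ConvNet's actual frustum parameterization (overlapping frustums, PointNet-style aggregation) is richer than this.

```python
import numpy as np


def frustum_level_features(points, feats, box2d, K, num_slices=8, max_depth=70.0):
    """points: (N, 3) in camera coordinates; feats: (N, C) point-wise features;
    box2d: (x1, y1, x2, y2) region proposal; K: (3, 3) camera intrinsics."""
    # Project points onto the image plane.
    uvw = points @ K.T
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-6)
    x1, y1, x2, y2 = box2d
    in_box = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
              (uv[:, 1] >= y1) & (uv[:, 1] <= y2) & (points[:, 2] > 0))
    # Slice the frustum along depth and max-pool features within each slice.
    edges = np.linspace(0.0, max_depth, num_slices + 1)
    frustum_feats = np.zeros((num_slices, feats.shape[1]), dtype=feats.dtype)
    for i in range(num_slices):
        in_slice = in_box & (points[:, 2] >= edges[i]) & (points[:, 2] < edges[i + 1])
        if in_slice.any():
            frustum_feats[i] = feats[in_slice].max(axis=0)
    # The (num_slices, C) array is then arranged as a feature map for the FCN.
    return frustum_feats
```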
CLOCs[3] leverages geometric and semantic consistencies of 2D and 3D output candidates to produce more accurate final detection results.
Geometric consistency. CLOCs uses an image-based Intersection over Union (IoU) between the 2D bounding box and the bounding box of the projected corners of the 3D detection to quantify the geometric consistency between a 2D and a 3D detection.
Semantic consistency. CLOCs associates detections of the same category during fusion. It avoids thresholding detections at this stage (or uses very low thresholds), and leaves thresholding to the final output based on the final fused score.
First, CLOCs converts individual 2D and 3D detection candidates into a set of consistent joint detection candidates (a sparse tensor, the blue box), which can be fed into the fusion network (a sketch of this construction follows below);
Then, CLOCs uses a 2D CNN to process the non-empty elements of the sparse input tensor;
Finally, the processed tensor is mapped to the desired learning targets, a probability score map, through max-pooling.
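The sketch below shows how such a sparse joint-candidate tensor could be assembled. The per-element layout (IoU, 2D score, 3D score, normalized distance), the `project_3d_box` helper, and the distance normalization are assumptions made for illustration and are not spelled out in the description above.

```python
import numpy as np


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def build_joint_candidates(dets3d, dets2d, project_3d_box, max_range=100.0):
    """dets3d: dicts with 'box3d', 'score', 'dist'; dets2d: dicts with 'box2d',
    'score'. project_3d_box (hypothetical helper) returns the 2D bounding box
    of a 3D box's projected corners."""
    T = np.zeros((len(dets3d), len(dets2d), 4), dtype=np.float32)
    for i, d3 in enumerate(dets3d):
        proj_box = project_3d_box(d3["box3d"])
        for j, d2 in enumerate(dets2d):
            iou = iou_2d(proj_box, d2["box2d"])
            if iou > 0:                      # geometric consistency check
                T[i, j] = (iou, d2["score"], d3["score"], d3["dist"] / max_range)
    return T                                 # sparse tensor fed to the fusion CNN
```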
The system architecture of CLOCs
JMODT[4] uses LI-Fusion modules to produce 3D bounding boxes and association confidences for online mixed-integer programming. Robust affinity computation and data association methods are specifically proposed to improve multi-object tracking performance.
The JMODT architecture includes five main modules: (1) a region proposal network (RPN), (2) parallel detection and correlation networks, (3) affinity computation, (4) data association, and (5) track management.
The tracking pipeline consists of five stages: (1) the RPN takes calibrated sensor data from paired frames as input and generates regions of interest (RoIs) and multi-modal features of the region proposals; (2) the parallel detection and correlation networks use the RoIs and proposal features to generate detection results, Re-ID affinities, and start-end probabilities; (3) the Re-ID affinities are further refined via the motion prediction and match score ranking modules; (4) the mixed-integer programming module performs comprehensive data association based on the detection results and computed affinities; (5) the association results are further managed to achieve continuous tracks despite object occlusions and reappearances.
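JMODT solves data association with mixed-integer programming; the sketch below instead uses a plain Hungarian assignment over a blended Re-ID/motion affinity matrix to illustrate the association step. The blending weight, the affinity threshold, and the function names are assumptions, not the authors' formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(reid_aff, motion_aff, alpha=0.5, min_aff=0.3):
    """reid_aff, motion_aff: (num_tracks, num_dets) affinities in [0, 1].
    Returns (matches, unmatched_tracks, unmatched_dets)."""
    aff = alpha * reid_aff + (1.0 - alpha) * motion_aff
    rows, cols = linear_sum_assignment(-aff)        # maximize total affinity
    matches = [(r, c) for r, c in zip(rows, cols) if aff[r, c] >= min_aff]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [t for t in range(aff.shape[0]) if t not in matched_t]
    unmatched_dets = [d for d in range(aff.shape[1]) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```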
The system architecture of JMODT
DFMOT[5] employs a four-level deep association mechanism to make use of the different advantages of cameras and LiDARs. This mechanism does not involve complex cost functions or feature extraction networks while fusing the 2D and 3D trajectories.
First, DFMOT calculates the IoU between the 2D bounding boxes from the image and the 2D bounding boxes obtained by projecting the 3D bounding boxes from the LiDAR. The detected objects are classified into three categories according to the matching results, i.e., objects detected only in the image, objects detected only in the LiDAR, and objects detected simultaneously in both domains.
Next, DFMOT uses the foregoing three types of objects as inputs to deep association. This deep association mechanism comprises four levels of data association, sketched in code after the list below.
In the 1st level of association, the existing 3D trajectories are associated with the fused detections.
In the 2nd level of association, the unmatched trajectories from the previous stage are associated with the detections only in the LiDAR domain.
The 3rd level of association handles 2D tracking in the image domain: it associates 2D trajectories with objects detected only by the 2D detector.
In the 4th level of association, the unmatched 3D trajectories are associated with the image-domain trajectories from the third stage.
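A compact sketch of the four-level cascade is given below. The `match` helper (e.g., an IoU-based greedy or Hungarian matcher) and the container types are placeholders; only the ordering of the levels follows the description above.

```python
def deep_association(match, trajs3d, trajs2d, fused_dets, lidar_only, image_only):
    """match(tracks, dets) -> (matches, unmatched_tracks, unmatched_dets)."""
    # Level 1: existing 3D trajectories vs. fused (camera + LiDAR) detections.
    m1, un_trajs3d, _ = match(trajs3d, fused_dets)
    # Level 2: still-unmatched 3D trajectories vs. LiDAR-only detections.
    m2, un_trajs3d, _ = match(un_trajs3d, lidar_only)
    # Level 3: 2D trajectories vs. objects detected only by the 2D detector.
    m3, un_trajs2d, _ = match(trajs2d, image_only)
    # Level 4: remaining unmatched 3D trajectories vs. the image-domain
    # trajectories maintained after level 3.
    m4, un_trajs3d, _ = match(un_trajs3d, trajs2d)
    return m1, m2, m3, m4, un_trajs3d, un_trajs2d
```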
The system architecture of DFMOT
TWISE[6] first uses a twin-surface representation that explicitly models both foreground and background depths in the difficult occlusion-boundary regions. Then it predicts the final depth map by fusing the features from the foreground and background surfaces.
First, TWISE uses the LiDAR data and image (a) to extrapolate estimates of the foreground depth (b) and the background depth (c).
Next, TWISE fuses (a), (b), and (c) into the completed depth (d) along with a weight (e). The foreground-background depth difference (f) is small except at depth discontinuities.
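The fusion of (b) and (c) into (d) via the weight (e) can be sketched as a per-pixel convex combination, as below. This minimal version blends only the two surface estimates, and the sigmoid squashing of the weight channel is an assumed simplification.

```python
import torch


def fuse_twin_surfaces(d_fg, d_bg, weight_logits):
    """d_fg, d_bg: (B, 1, H, W) foreground/background depth estimates;
    weight_logits: (B, 1, H, W) raw weight channel predicted by the network."""
    sigma = torch.sigmoid(weight_logits)              # per-pixel foreground weight (e)
    d_final = sigma * d_fg + (1.0 - sigma) * d_bg     # completed depth (d)
    fg_bg_gap = d_bg - d_fg                           # large only at depth discontinuities (f)
    return d_final, fg_bg_gap
```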
Visualization of depth maps generated by TWISE
MDANet[7] contains three stacked multi-modal deep aggregation (MDA) blocks. Each MDA block consists of multiple connections and aggregation pathways for deeper fusion. MDANet also uses a deformable guided fusion layer to guide the generation of the dense depth map.
MDANet can be seen as consisting of three parts: RGB Encoder, Depth Pre-completion and Multi-Modal Deep Aggregation.
In the RGB Encoder, MDANet employs a series of down-sampling operations to extract image features.
In Depth Pre-completion, MDANet generates the semi-dense depth input via a pre-completion algorithm.
In Multi-Modal Deep Aggregation, MDANet stacks three MDA blocks, each receiving the image features and the semi-dense depth map at the corresponding resolution. Except for stage 0, each MDA block also absorbs information from the previous stage, as shown in the green dotted box. Each down-sampled depth feature is then aggregated with the previous-stage information, and each up-sampled depth feature is aggregated with the image feature. Finally, the dense depth output is calculated via the deformable guided aggregation layer.
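The sketch below shows one MDA-block-style stage under the description above: the down-sampled depth feature is aggregated with previous-stage information, and the up-sampled depth feature is aggregated with the image feature. Channel sizes, the concatenation-plus-1x1-convolution aggregation, and the single encoder/decoder level are assumptions; the paper's block has more pathways and a deformable guided layer at the output.

```python
import torch
import torch.nn as nn


def agg(c_in, c_out):
    """Aggregate two feature maps by concatenation followed by a 1x1 convolution."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.ReLU(inplace=True))


class MDABlockSketch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # down-sample the depth feature
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)       # up-sample it back
        self.agg_prev = agg(2 * ch, ch)                          # fuse with the previous stage
        self.agg_img = agg(2 * ch, ch)                           # fuse with the image feature

    def forward(self, depth_feat, image_feat, prev_stage=None):
        x = self.down(depth_feat)
        if prev_stage is not None:                               # stages > 0
            x = self.agg_prev(torch.cat([x, prev_stage], dim=1))
        x = self.up(x)
        x = self.agg_img(torch.cat([x, image_feat], dim=1))      # decoder-side fusion
        return x                                                  # passed on to the next stage
```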
Left part: the system architecture of MDANet. Right part: the details of the MDA block.
[1] T. Huang, Z. Liu, X. Chen, and X. Bai, “Epnet: Enhancing point features with image semantics for 3d object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 35–52.
[2] Z. Wang and K. Jia, “Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 1742–1749.
[3] S. Pang, D. Morris, and H. Radha, “Clocs: Camera-lidar object candidates fusion for 3d object detection,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10386–10393.
[4] K. Huang and Q. Hao, “Joint multi-object detection and tracking with camera-lidar fusion for autonomous driving,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 6983–6989.
[5] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, “Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association,” arXiv preprint arXiv:2202.12100, 2022.
[6] S. Imran, X. Liu, and D. Morris, “Depth completion with twin surface extrapolation at occlusion boundaries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2583–2592.
[7] Y. Ke, K. Li, W. Yang, Z. Xu, D. Hao, L. Huang, and G. Wang, “Mdanet: Multi-modal deep aggregation network for depth completion,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 4288–4294.