Yan-Hao Chen and Jui-Chiu Chiang
Fig. 1 Overview of the proposed architecture
With the rapid advancement of autonomous driving, perception has become a key factor in ensuring driving safety and reliable decision-making. LiDAR provides accurate 3D geometry and distance information, while cameras capture rich texture and semantic details. However, many existing fusion frameworks are complex and limited in handling heterogeneous features. To address this, we propose a novel query-based and BEV-based multimodal fusion architecture inspired by BEVFormer.
Experiments on the nuScenes dataset show that our method achieves 72.2% NDS and 70.3% mAP for 3D object detection, outperforming the baseline BEVFormer (51.7% NDS, 41.6% mAP), the LiDAR-only method PointPillars (61.3% NDS, 52.3% mAP), and the multi-modal method FUTR3D (68.3% NDS, 64.5% mAP). For BEV map segmentation, our model reaches 64.8% mIoU, surpassing PointPainting (49.1%) and MVP (49.0%). These results highlight the effectiveness and potential of the proposed framework for robust autonomous driving perception.
As shown in Fig. 1, the framework consists of two key components: a Fusion Module, which employs a query-based attention mechanism to integrate LiDAR BEV queries with image features for enhanced scene understanding; and a BEV Enhanced Module, which transforms multi-stage image features into BEV representations and connects them with the LiDAR BEV for a stronger spatial–texture representation.
The Fusion Module consists of self-attention, LiDAR cross-attention, and image cross-attention.
A. Self-Attention
To capture correlations within single-modality features, we adopt a self-attention mechanism. However, global self-attention may allow distant and irrelevant features to influence each other, leading to feature ambiguity. To mitigate this issue, we introduce a window partition strategy inspired by UniTR, as illustrated in Fig. 2. Specifically, each modality is divided into windows of size 30×30 tokens, and within each window, every set of 90 tokens is processed using self-attention. This design effectively models local dependencies while reducing unnecessary interactions across distant regions.
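The window-partition strategy above can be sketched as follows in PyTorch. Only the 30×30 window and the 90-token set size come from the text; the embedding dimension, head count, and the assumption that tokens arrive pre-partitioned into windows are illustrative:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Sketch: self-attention restricted to sets of 90 tokens inside a window."""

    def __init__(self, dim=64, num_heads=4, set_size=90):
        super().__init__()
        self.set_size = set_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, window_tokens):
        # window_tokens: (num_windows, tokens_per_window, dim)
        n, t, d = window_tokens.shape
        # split each 30x30 window (900 tokens) into sets of 90 tokens
        sets = window_tokens.reshape(n * t // self.set_size, self.set_size, d)
        out, _ = self.attn(sets, sets, sets)  # self-attention within each set only
        return out.reshape(n, t, d)

# a single 30x30 window of 64-d tokens from one modality
x = torch.randn(1, 900, 64)
y = WindowSelfAttention()(x)
print(y.shape)  # torch.Size([1, 900, 64])
```

Because attention is computed per 90-token set, distant tokens in other windows cannot attend to each other, which is the intended mitigation of feature ambiguity.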
Fig. 2 Self-attention
Fig. 3 LiDAR cross-attention
Fig. 4 Image cross-attention
To model the correlation between neighboring grids in the LiDAR BEV space, we introduce a LiDAR Cross-Attention mechanism. As illustrated in Fig. 3, this module is composed of a sliding-window grouping strategy and an attention operation. Specifically, the 1-D LiDAR BEV tokens are divided into groups, where each group contains three tokens (L_K) serving as keys and values. The queries are the LiDAR BEV tokens themselves, as shown in Fig. 1. We then apply cross-attention between each single query (Q_q) and its corresponding three key–value tokens (L_K), enabling local spatial interactions within the LiDAR BEV representation.
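A minimal sketch of this grouping, assuming a PyTorch implementation: each 1-D BEV token acts as the query Q_q and attends to its local group of three tokens L_K serving as both keys and values. The embedding dimension, the BEV grid size, and the zero padding at the sequence ends are assumptions, not details from the text:

```python
import torch
import torch.nn.functional as F

def lidar_cross_attention(bev_tokens, group=3):
    # bev_tokens: (L, d) flattened 1-D LiDAR BEV tokens
    L, d = bev_tokens.shape
    pad = group // 2
    # zero-pad the sequence so boundary tokens also get a 3-token group (assumption)
    padded = F.pad(bev_tokens.t().unsqueeze(0), (pad, pad)).squeeze(0).t()  # (L+2, d)
    # slide a window of `group` tokens to build each token's keys/values
    kv = padded.unfold(0, group, 1).permute(0, 2, 1)  # (L, group, d)
    q = bev_tokens.unsqueeze(1)                       # (L, 1, d) one query per token
    # scaled dot-product cross-attention per query against its 3 key-value tokens
    attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)  # (L, 1, group)
    return (attn @ kv).squeeze(1)                     # (L, d)

tokens = torch.randn(180 * 180, 64)  # e.g. a 180x180 BEV grid, flattened
out = lidar_cross_attention(tokens)
print(out.shape)  # torch.Size([32400, 64])
```

The output keeps the BEV token layout, so the updated tokens can feed directly into the next attention stage.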
In the Fusion Module (Fig. 1), we design an Image Cross-Attention mechanism to fuse LiDAR BEV queries (Q_q) with image features (I_total). Its purpose is to compensate for LiDAR’s lack of texture and semantic details, thereby enhancing semantic alignment and complementarity across modalities.
As shown in Fig. 4, Image Cross-Attention consists of a projection matrix and a window partition strategy. The projection matrix maps each LiDAR BEV query onto the image plane, where attention is performed with the corresponding image features. After establishing these correspondences, the image plane is divided into 30×30 windows, and together with the projected LiDAR queries, an attention mechanism is applied to every set of 90 multimodal tokens (M̃_Set). This design effectively bridges modality gaps and strengthens feature integration.
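The projection step can be sketched as a standard pinhole mapping from BEV cell centers to pixel coordinates; the projected queries would then join the windowed attention over 90-token multimodal sets, analogous to the self-attention stage. The calibration matrix below is random for illustration, and filtering of points that fall behind the camera is omitted:

```python
import torch

def project_bev_to_image(bev_xyz, proj):
    # bev_xyz: (N, 3) 3-D centers of the BEV query cells
    # proj:    (3, 4) camera projection matrix (intrinsics @ extrinsics)
    homo = torch.cat([bev_xyz, torch.ones(bev_xyz.shape[0], 1)], dim=1)  # (N, 4)
    uvw = homo @ proj.t()                          # (N, 3) homogeneous pixel coords
    # divide by depth to get pixel coordinates; clamp avoids division by ~0
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)   # (N, 2) pixel (u, v)
    return uv

bev_xyz = torch.randn(100, 3)   # hypothetical BEV cell centers
proj = torch.randn(3, 4)        # hypothetical calibration, for shape checking only
uv = project_bev_to_image(bev_xyz, proj)
print(uv.shape)  # torch.Size([100, 2])
```

Each resulting (u, v) decides which 30×30 image window a LiDAR query is grouped into before the multimodal attention is applied.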
To further enrich the semantic and geometric representation of fused features in BEV space, we design a lightweight yet effective BEV Enhanced Module following the Fusion Module (Fig. 5).
As illustrated in Fig. 1, multi-stage image features are progressively extracted through self-attention and cross-attention within the Fusion Module. At each fusion stage (N = 1, 2, 3), the enhanced image features are collected and concatenated, forming multi-stage inputs for the Image BEV Generation module. The Lift-Splat-Shoot (LSS) view transform then generates the image BEV map, which is concatenated with the updated LiDAR BEV.
Fig. 5 BEV Enhanced Module
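The data flow of the BEV Enhanced Module can be sketched as below. The concatenation of the three stage features and the final concatenation with the LiDAR BEV follow the text; the channel counts, depth-bin count, BEV resolution, and the heavily simplified depth-weighted "splat" (a mean over depth bins plus resizing, standing in for the real LSS scatter along camera rays) are all assumptions:

```python
import torch
import torch.nn as nn

class BEVEnhance(nn.Module):
    """Sketch: concat multi-stage image features, lift to BEV, fuse with LiDAR BEV."""

    def __init__(self, img_ch=64, stages=3, depth_bins=8, bev_ch=64):
        super().__init__()
        self.depth = nn.Conv2d(img_ch * stages, depth_bins, 1)  # per-pixel depth dist.
        self.feat = nn.Conv2d(img_ch * stages, bev_ch, 1)       # fused image features

    def forward(self, stage_feats, lidar_bev):
        # stage_feats: list of (B, img_ch, H, W), one per fusion stage (N = 1, 2, 3)
        x = torch.cat(stage_feats, dim=1)            # (B, img_ch*stages, H, W)
        d = self.depth(x).softmax(dim=1)             # (B, D, H, W) depth scores
        f = self.feat(x)                             # (B, C, H, W)
        # LSS-style "lift": outer product of features and depth scores
        frustum = f.unsqueeze(2) * d.unsqueeze(1)    # (B, C, D, H, W)
        # simplified "splat" stand-in: collapse depth, resize to the BEV grid
        img_bev = frustum.mean(dim=2)                # (B, C, H, W)
        img_bev = nn.functional.interpolate(img_bev, size=lidar_bev.shape[-2:])
        return torch.cat([img_bev, lidar_bev], dim=1)

feats = [torch.randn(1, 64, 28, 50) for _ in range(3)]  # 3 fusion-stage outputs
lidar_bev = torch.randn(1, 64, 180, 180)                # updated LiDAR BEV
out = BEVEnhance()(feats, lidar_bev)
print(out.shape)  # torch.Size([1, 128, 180, 180])
```

The concatenated output doubles the channel count, combining image-derived texture with the LiDAR geometry for the downstream detection and segmentation heads.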
In this section, we conduct experiments on 3D object detection and BEV map segmentation tasks using the nuScenes dataset, a challenging large-scale outdoor benchmark that provides diverse annotations for various tasks (e.g., 3D object detection and BEV map segmentation). It contains 40,157 annotated samples, each with six monocular camera images covering a 360-degree FoV and a 32-beam LiDAR sweep.
As shown in Table 1, our method achieves 72.2% NDS and 70.3% mAP, outperforming the baseline BEVFormer (51.7% NDS, 41.6% mAP) and the LiDAR-only PointPillars (61.3% NDS, 52.3% mAP), as well as strong multi-modal methods such as FUTR3D (68.3% NDS, 64.5% mAP). In BEV map segmentation, as shown in Table 2, our method achieves 64.8% mIoU, outperforming single-modality methods as well as strong multi-modal methods such as PointPainting [4] (49.1% mIoU) and MVP [5] (49.0% mIoU). These results demonstrate the effectiveness and potential of our approach for real-world autonomous driving applications.
Table 1. Experimental Comparison on 3D Object Detection
Table 2. Experimental Comparison on BEV Map Segmentation