Yan-Hao Chen and Jui-Chiu Chiang
Fig. 1 Overview of the proposed architecture
With the rapid advancement of autonomous driving, perception has become a key factor in ensuring driving safety and reliable decision-making. LiDAR provides accurate 3D geometry and distance information, while cameras capture rich texture and semantic details. However, many existing fusion frameworks are complex and limited in handling heterogeneous features. To address this, we propose a novel query-based and BEV-based multimodal fusion architecture inspired by BEVFormer.
Experiments on the nuScenes dataset show that our method achieves 72.2% NDS and 70.3% mAP for 3D object detection, outperforming the baseline BEVFormer (51.7% NDS, 41.6% mAP), the LiDAR-only method PointPillars (61.3% NDS, 52.3% mAP), and the multi-modal method FUTR3D (68.3% NDS, 64.5% mAP). For BEV map segmentation, our model reaches 64.8% mIoU, surpassing PointPainting (49.1%) and MVP (49.0%). These results highlight the effectiveness and potential of the proposed framework for robust autonomous driving perception.
As shown in Fig. 1, the framework consists of two key components: a Fusion Module, which employs a query-based attention mechanism to integrate LiDAR BEV queries with image features for enhanced scene understanding; and a BEV Enhanced Module, which transforms multi-stage image features into BEV representations and connects them with the LiDAR BEV for a stronger spatial–texture representation.
The Fusion Module consists of self-attention, LiDAR cross-attention, and image cross-attention.
A. Self-Attention
To capture correlations within single-modality features, we adopt a self-attention mechanism. However, global self-attention may allow distant and irrelevant features to influence each other, leading to feature ambiguity. To mitigate this issue, we introduce a window partition strategy inspired by UniTR, as illustrated in Fig. 2. Specifically, each modality is divided into windows of size 30×30 tokens, and within each window, every set of 90 tokens is processed using self-attention. This design effectively models local dependencies while reducing unnecessary interactions across distant regions.
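The window-partition strategy above can be sketched as follows in PyTorch. Only the 30×30 window and the 90-token set size come from the text; the embedding dimension, head count, and the assumption that tokens arrive pre-partitioned into windows are illustrative:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Sketch: self-attention restricted to sets of 90 tokens inside a window."""

    def __init__(self, dim=64, num_heads=4, set_size=90):
        super().__init__()
        self.set_size = set_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, window_tokens):
        # window_tokens: (num_windows, tokens_per_window, dim)
        n, t, d = window_tokens.shape
        # split each 30x30 window (900 tokens) into sets of 90 tokens
        sets = window_tokens.reshape(n * t // self.set_size, self.set_size, d)
        out, _ = self.attn(sets, sets, sets)  # self-attention within each set only
        return out.reshape(n, t, d)

# a single 30x30 window of 64-d tokens from one modality
x = torch.randn(1, 900, 64)
y = WindowSelfAttention()(x)
print(y.shape)  # torch.Size([1, 900, 64])
```

Because attention is computed per 90-token set, distant tokens in other windows cannot attend to each other, which is the intended mitigation of feature ambiguity.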
Fig. 2 Self-attention
Fig. 3 LiDAR cross-attention
Fig. 4 Image cross-attention
To model the correlation between neighboring grids in the LiDAR BEV space, we introduce a LiDAR Cross-Attention mechanism. As illustrated in Fig. 3, this module is composed of a sliding-window grouping strategy and an attention operation. Specifically, the 1-D LiDAR BEV tokens are divided into groups, where each group contains three tokens (L_K) serving as keys and values. The queries are the LiDAR BEV tokens themselves, as shown in Fig. 1. We then apply cross-attention between each single query (Q_q) and its corresponding three key–value tokens (L_K), enabling local spatial interactions within the LiDAR BEV representation.
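A minimal sketch of this grouping, assuming a PyTorch implementation: each 1-D BEV token acts as the query Q_q and attends to its local group of three tokens L_K serving as both keys and values. The embedding dimension, the BEV grid size, and the zero padding at the sequence ends are assumptions, not details from the text:

```python
import torch
import torch.nn.functional as F

def lidar_cross_attention(bev_tokens, group=3):
    # bev_tokens: (L, d) flattened 1-D LiDAR BEV tokens
    L, d = bev_tokens.shape
    pad = group // 2
    # zero-pad the sequence so boundary tokens also get a 3-token group (assumption)
    padded = F.pad(bev_tokens.t().unsqueeze(0), (pad, pad)).squeeze(0).t()  # (L+2, d)
    # slide a window of `group` tokens to build each token's keys/values
    kv = padded.unfold(0, group, 1).permute(0, 2, 1)  # (L, group, d)
    q = bev_tokens.unsqueeze(1)                       # (L, 1, d) one query per token
    # scaled dot-product cross-attention per query against its 3 key-value tokens
    attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)  # (L, 1, group)
    return (attn @ kv).squeeze(1)                     # (L, d)

tokens = torch.randn(180 * 180, 64)  # e.g. a 180x180 BEV grid, flattened
out = lidar_cross_attention(tokens)
print(out.shape)  # torch.Size([32400, 64])
```

The output keeps the BEV token layout, so the updated tokens can feed directly into the next attention stage.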
In the Fusion Module (Fig. 1), we design an Image Cross-Attention mechanism to fuse LiDAR BEV queries (Q_q) with image features (I_total). Its purpose is to compensate for LiDAR’s lack of texture and semantic details, thereby enhancing semantic alignment and complementarity across modalities.
As shown in Fig. 4, Image Cross-Attention consists of a projection matrix and a window partition strategy. The projection matrix maps each LiDAR BEV query onto the image plane, where attention is performed with the corresponding image features. After establishing these correspondences, the image plane is divided into 30×30 windows, and together with the projected LiDAR queries, an attention mechanism is applied to every set of 90 multimodal tokens (M̃_Set). This design effectively bridges modality gaps and strengthens feature integration.
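The projection step can be sketched as a standard pinhole mapping from BEV cell centers to pixel coordinates; the projected queries would then join the windowed attention over 90-token multimodal sets, analogous to the self-attention stage. The calibration matrix below is random for illustration, and filtering of points that fall behind the camera is omitted:

```python
import torch

def project_bev_to_image(bev_xyz, proj):
    # bev_xyz: (N, 3) 3-D centers of the BEV query cells
    # proj:    (3, 4) camera projection matrix (intrinsics @ extrinsics)
    homo = torch.cat([bev_xyz, torch.ones(bev_xyz.shape[0], 1)], dim=1)  # (N, 4)
    uvw = homo @ proj.t()                          # (N, 3) homogeneous pixel coords
    # divide by depth to get pixel coordinates; clamp avoids division by ~0
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)   # (N, 2) pixel (u, v)
    return uv

bev_xyz = torch.randn(100, 3)   # hypothetical BEV cell centers
proj = torch.randn(3, 4)        # hypothetical calibration, for shape checking only
uv = project_bev_to_image(bev_xyz, proj)
print(uv.shape)  # torch.Size([100, 2])
```

Each resulting (u, v) decides which 30×30 image window a LiDAR query is grouped into before the multimodal attention is applied.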
To further enrich the semantic and geometric representation of fused features in BEV space, we design a lightweight yet effective BEV Enhanced Module following the Fusion Module (Fig. 5).
As illustrated in Fig. 1, multi-stage image features are progressively extracted through self-attention and cross-attention within the Fusion Module. At each fusion stage (N = 1, 2, 3), the enhanced image features are collected and concatenated, forming multi-stage inputs for the Image BEV Generation module. The Lift-Splat-Shoot (LSS) view transform then generates the image BEV map, which is concatenated with the updated LiDAR BEV.
Fig. 5 BEV Enhanced Module
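The data flow of the BEV Enhanced Module can be sketched as below. The concatenation of the three stage features and the final concatenation with the LiDAR BEV follow the text; the channel counts, depth-bin count, BEV resolution, and the heavily simplified depth-weighted "splat" (a mean over depth bins plus resizing, standing in for the real LSS scatter along camera rays) are all assumptions:

```python
import torch
import torch.nn as nn

class BEVEnhance(nn.Module):
    """Sketch: concat multi-stage image features, lift to BEV, fuse with LiDAR BEV."""

    def __init__(self, img_ch=64, stages=3, depth_bins=8, bev_ch=64):
        super().__init__()
        self.depth = nn.Conv2d(img_ch * stages, depth_bins, 1)  # per-pixel depth dist.
        self.feat = nn.Conv2d(img_ch * stages, bev_ch, 1)       # fused image features

    def forward(self, stage_feats, lidar_bev):
        # stage_feats: list of (B, img_ch, H, W), one per fusion stage (N = 1, 2, 3)
        x = torch.cat(stage_feats, dim=1)            # (B, img_ch*stages, H, W)
        d = self.depth(x).softmax(dim=1)             # (B, D, H, W) depth scores
        f = self.feat(x)                             # (B, C, H, W)
        # LSS-style "lift": outer product of features and depth scores
        frustum = f.unsqueeze(2) * d.unsqueeze(1)    # (B, C, D, H, W)
        # simplified "splat" stand-in: collapse depth, resize to the BEV grid
        img_bev = frustum.mean(dim=2)                # (B, C, H, W)
        img_bev = nn.functional.interpolate(img_bev, size=lidar_bev.shape[-2:])
        return torch.cat([img_bev, lidar_bev], dim=1)

feats = [torch.randn(1, 64, 28, 50) for _ in range(3)]  # 3 fusion-stage outputs
lidar_bev = torch.randn(1, 64, 180, 180)                # updated LiDAR BEV
out = BEVEnhance()(feats, lidar_bev)
print(out.shape)  # torch.Size([1, 128, 180, 180])
```

The concatenated output doubles the channel count, combining image-derived texture with the LiDAR geometry for the downstream detection and segmentation heads.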
In this section, we conduct experiments on 3D object detection and BEV map segmentation tasks using the nuScenes dataset, a challenging large-scale outdoor benchmark that provides diverse annotations for various tasks (e.g., 3D object detection and BEV map segmentation). It contains 40,157 annotated samples, each with six monocular camera images covering a 360-degree FoV and a 32-beam LiDAR sweep.
As shown in Table 1, our method achieves 72.2% NDS and 70.3% mAP, outperforming the baseline BEVFormer (51.7% NDS, 41.6% mAP) and the LiDAR-only PointPillars (61.3% NDS, 52.3% mAP), as well as strong multi-modal methods such as FUTR3D (68.3% NDS, 64.5% mAP). In BEV map segmentation, as shown in Table 2, our method achieves 64.8% mIoU, outperforming single-modality methods as well as strong multi-modal methods such as PointPainting [4] (49.1% mIoU) and MVP [5] (49.0% mIoU). These results demonstrate the effectiveness and potential of our approach for real-world autonomous driving applications.
Table 1. Experimental Comparison on 3D Object Detection
Table 2. Experimental Comparison on BEV Map Segmentation