Accurate 3D object detection is essential for autonomous driving and advanced perception systems, yet reliable performance remains challenging when relying on camera data alone. This study enhances the GaussianLSS 3D object detection framework by incorporating LiDAR–camera fusion to improve Bird's-Eye-View (BEV) perception. GaussianLSS models per-pixel depth as Gaussian distributions for efficient, uncertainty-aware BEV generation, but its reliance on camera-only inputs limits geometric accuracy in long-range and occluded scenes. To address this limitation, we introduce a point cloud branch that extracts LiDAR features and fuses them with multi-view image features before Gaussian depth modeling. This design allows the network to exploit both the semantic richness of images and the geometric precision of LiDAR. Experimental results show that the proposed fusion achieves detection performance comparable to the original GaussianLSS baseline on nuScenes, a public autonomous-driving benchmark. Qualitative results further demonstrate that the fused model detects challenging objects that the baseline occasionally misses, confirming the effectiveness of the proposed integration.
Figure 1. Overview of the proposed architecture.
The proposed architecture extends GaussianLSS by introducing an early-fusion strategy that combines LiDAR and multi-view camera features before Gaussian depth estimation. This design preserves the efficiency and uncertainty-aware properties of GaussianLSS, while addressing its limitations in long-range and occluded scenarios.
Image Branch
Multi-view images are processed by a ResNet-50 backbone to extract multi-scale semantic features. This branch provides rich contextual information about objects, textures, and scene layouts, but suffers from depth ambiguity when relying solely on cameras.
LiDAR Branch
Raw point clouds are voxelized into vertical pillars using the PointPillars backbone, generating pseudo-images that capture dense geometric structures. These pseudo-images are further refined through a series of convolutional blocks, progressively aligning their spatial resolution and semantic representation with the image features.
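A minimal sketch of this branch is shown below: points are binned into vertical pillars on a BEV grid, each pillar is reduced to a small feature vector (here simply the per-pillar maximum height and intensity, a stand-in for the learned PointPillars encoder), and the resulting pseudo-image is refined by convolutions. Grid extents, channel widths, and the reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn

def points_to_pseudo_image(points, grid=(128, 128),
                           x_range=(-50.0, 50.0), y_range=(-50.0, 50.0)):
    """points: (N, 4) tensor of (x, y, z, intensity) -> (2, H, W) pseudo-image."""
    H, W = grid
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * W).long().clamp(0, W - 1)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * H).long().clamp(0, H - 1)
    idx = ys * W + xs                    # flat pillar index for each point
    img = torch.zeros(2, H * W)
    for c, col in enumerate((2, 3)):     # per-pillar max of z and intensity
        img[c].scatter_reduce_(0, idx, points[:, col], reduce="amax", include_self=False)
    return img.view(2, H, W)

# Progressive refinement toward the image-feature resolution and width;
# the 512-channel output matches the best setting from our ablation.
refine = nn.Sequential(
    nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, 3, stride=1, padding=1),
)

points = torch.rand(1000, 4) * torch.tensor([100.0, 100.0, 4.0, 1.0]) \
         - torch.tensor([50.0, 50.0, 2.0, 0.0])
pseudo = points_to_pseudo_image(points)     # (2, 128, 128) BEV pseudo-image
lidar_feat = refine(pseudo.unsqueeze(0))    # (1, 512, 32, 32) refined features
```

The scatter-based reduction keeps the operation fully vectorized over points, while the strided convolutions progressively align resolution and channel width with the image features.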
Fusion & Gaussian Depth Modeling
LiDAR and image features are concatenated to form a unified multi-modal tensor. This fused representation is passed through the GaussianLSS module, which models per-pixel depth as Gaussian distributions and applies Gaussian splatting to produce uncertainty-aware BEV features. Finally, the BEV features are forwarded to the 3D object detection head, which predicts oriented 3D bounding boxes and class scores.
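The fusion step can be sketched as follows: image and (image-aligned) LiDAR features at the same resolution are concatenated along the channel dimension, and a small head predicts a per-pixel depth mean and standard deviation, which GaussianLSS would then splat into BEV. The channel counts and the single-convolution head are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

C_img, C_lidar, H, W = 256, 512, 32, 88     # illustrative channel counts / resolution
img_feat = torch.randn(1, C_img, H, W)      # from the image branch
lidar_feat = torch.randn(1, C_lidar, H, W)  # from the LiDAR branch, image-aligned

fused = torch.cat([img_feat, lidar_feat], dim=1)           # unified multi-modal tensor
depth_head = nn.Conv2d(C_img + C_lidar, 2, kernel_size=1)  # predicts (mu, log_sigma)
mu, log_sigma = depth_head(fused).chunk(2, dim=1)
sigma = log_sigma.exp()                                    # positive std-dev

# Each pixel now carries a Gaussian depth N(mu, sigma^2) for BEV splatting.
```

Predicting log-sigma and exponentiating is a standard way to keep the standard deviation positive; the resulting per-pixel Gaussians are what the splatting stage consumes.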
Key Advantage
By integrating LiDAR geometry early in the pipeline, the model benefits from both the semantic richness of images and the geometric accuracy of LiDAR. This significantly improves robustness in long-range perception, heavily occluded areas, and complex driving environments, while retaining the computational efficiency of GaussianLSS.
Experiments are conducted on the nuScenes dataset (1,000 driving scenes with synchronized camera and LiDAR data; both the full and mini splits are used), and performance is evaluated with the standard nuScenes metrics. The proposed method achieves results comparable to the baseline GaussianLSS, with minor per-class variations, but shows clear qualitative advantages in detecting challenging objects. Ablation studies further confirm the benefit of progressive LiDAR feature refinement, with a 512-channel configuration yielding the best trade-off between accuracy and stability.
Table 1. Overall comparison on the nuScenes validation set. Bold indicates best performance; ↑ indicates higher is better; ↓ indicates lower is better.
Figure 2. Qualitative detection comparisons between the baseline and our method
In this work, we integrated LiDAR-derived pseudo-images into the GaussianLSS framework to enhance multi-modal 3D object detection on the nuScenes dataset. The proposed method achieved detection performance overall comparable to the baseline, with only slight variations in mAP and NDS despite the limited training schedule of 50 epochs. Qualitative comparisons further demonstrate that our approach can successfully detect certain challenging objects (e.g., bicycles and buses) that the baseline occasionally misses, indicating the potential benefits of incorporating LiDAR features.