End-to-end autonomous driving systems must reliably interpret complex urban environments using both visual and geometric information. However, conventional models often struggle to fuse camera and LiDAR data effectively, leading to failures in dense traffic, under occlusion, or in low-light conditions.
TransFuser addresses this limitation by introducing a Transformer-based multi-sensor fusion architecture that jointly processes RGB image and LiDAR point-cloud features. The model aligns spatial cues from both modalities through cross-attention, enabling richer scene understanding and more accurate waypoint prediction.
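The cross-attention idea can be sketched in a few lines of PyTorch. The module below is an illustrative simplification, not the published TransFuser architecture: token counts, feature dimensions, and the bidirectional attention layout are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-attention fusion between image and LiDAR feature tokens
    (a simplified sketch in the spirit of TransFuser, not its exact design)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Each modality queries the other for complementary context.
        self.img_to_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)

    def forward(self, img_tokens, lidar_tokens):
        # img_tokens: (B, N_img, dim); lidar_tokens: (B, N_lidar, dim)
        img_ctx, _ = self.img_to_lidar(img_tokens, lidar_tokens, lidar_tokens)
        lidar_ctx, _ = self.lidar_to_img(lidar_tokens, img_tokens, img_tokens)
        # Residual connections preserve modality-specific information.
        img_fused = self.norm_img(img_tokens + img_ctx)
        lidar_fused = self.norm_lidar(lidar_tokens + lidar_ctx)
        return img_fused, lidar_fused

fusion = CrossModalFusion()
img = torch.randn(2, 64, 256)    # e.g. a flattened 8x8 image feature grid
lidar = torch.randn(2, 64, 256)  # e.g. a flattened 8x8 BEV LiDAR grid
img_f, lidar_f = fusion(img, lidar)
```

Because attention operates over token sequences, the two modalities need not share spatial resolution; only the embedding dimension must match.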
In this study, TransFuser is trained and evaluated on the CARLA autonomous driving benchmark. Experimental results show that its attention-based fusion significantly improves performance in challenging scenarios, especially those involving dynamic agents, complex intersections, or partial sensor degradation. Compared to single-modality models, TransFuser demonstrates higher route completion rates, better collision avoidance, and more stable lateral control.
These findings highlight that Transformer-driven multi-modal fusion is a powerful approach for building reliable, real-world autonomous driving systems. TransFuser’s ability to integrate complementary sensor information makes it a strong foundation for secure, robust, and scalable autonomous driving pipelines.
End-to-end autonomous driving models have achieved promising performance by directly mapping raw sensor inputs to driving actions. However, many existing approaches remain limited in interpretability and robustness, as their fusion mechanisms often obscure how different sensor modalities and intermediate representations influence decision-making.
InterFuser is proposed to address this challenge by introducing an intermediate-aware sensor fusion architecture for end-to-end autonomous driving. Instead of performing fusion only at a single latent level, InterFuser explicitly integrates multi-modal features across multiple semantic stages, enabling more structured interaction between perception, reasoning, and control.
The architecture leverages Transformer-based fusion blocks to combine RGB camera inputs and LiDAR-derived features while preserving modality-specific information at intermediate layers. Through staged cross-attention and hierarchical feature alignment, InterFuser facilitates clearer information flow and improves the model’s ability to reason under ambiguous or partially degraded sensor conditions.
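The staged-fusion idea described above can be illustrated with a minimal backbone that fuses at every intermediate depth rather than once at the end. This is a hedged sketch under assumed shapes (3-channel RGB, 2-channel LiDAR BEV, three stages), not InterFuser's actual layer configuration.

```python
import torch
import torch.nn as nn

class StagedFusionBackbone(nn.Module):
    """Illustrative multi-stage fusion: image and LiDAR features interact
    at several intermediate depths instead of a single latent level.
    Stage widths and channel counts are assumptions for demonstration."""
    def __init__(self, dims=(64, 128, 256), heads=4):
        super().__init__()
        self.img_stages, self.lidar_stages, self.fusers = (
            nn.ModuleList(), nn.ModuleList(), nn.ModuleList())
        img_in, lidar_in = 3, 2  # RGB image, 2-channel LiDAR BEV raster
        for d in dims:
            self.img_stages.append(nn.Conv2d(img_in, d, 3, stride=2, padding=1))
            self.lidar_stages.append(nn.Conv2d(lidar_in, d, 3, stride=2, padding=1))
            self.fusers.append(nn.MultiheadAttention(d, heads, batch_first=True))
            img_in, lidar_in = d, d

    def forward(self, img, lidar_bev):
        for conv_i, conv_l, fuse in zip(self.img_stages, self.lidar_stages, self.fusers):
            img = torch.relu(conv_i(img))
            lidar_bev = torch.relu(conv_l(lidar_bev))
            b, c, h, w = img.shape
            img_t = img.flatten(2).transpose(1, 2)        # (B, HW, C)
            lid_t = lidar_bev.flatten(2).transpose(1, 2)
            ctx, _ = fuse(img_t, lid_t, lid_t)            # image queries LiDAR
            # Residual add keeps modality-specific features at each stage.
            img = (img_t + ctx).transpose(1, 2).reshape(b, c, h, w)
        return img, lidar_bev

net = StagedFusionBackbone()
img_out, lidar_out = net(torch.randn(2, 3, 64, 64), torch.randn(2, 2, 64, 64))
```

Fusing per stage is what makes intermediate representations inspectable: the attention weights at each depth can be read out to see which LiDAR regions informed which image regions.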
In this study, InterFuser is implemented and evaluated on the CARLA autonomous driving benchmark. Qualitative and quantitative results indicate that the model exhibits more stable driving behavior, improved handling of complex intersections, and increased resilience to occlusions and dynamic traffic compared to conventional single-stage fusion approaches. Notably, the intermediate fusion design also enables deeper analysis of internal representations, offering insights into how different sensor cues contribute to final control decisions.
These results suggest that intermediate-level multi-modal fusion is a promising direction for building autonomous driving systems that are not only robust but also more interpretable. InterFuser provides a flexible research framework for exploring explainable end-to-end driving and serves as a strong foundation for future work on secure, reliable, and transparent autonomous driving systems.
Scene understanding in autonomous driving requires modeling not only individual objects but also the relationships between them. Conventional perception pipelines often treat objects independently, limiting the ability to reason about contextual interactions such as proximity, occlusion, and traffic dynamics. Graph Neural Networks (GNNs) provide a natural framework for structured reasoning by representing traffic scenes as graphs, where nodes correspond to objects and edges encode spatial or semantic relationships.
In this work, we propose a scene graph-based object classification framework built on Graph Neural Networks. Scene graphs are constructed from CARLA simulation data, where each frame is represented as a structured graph composed of node features, edge attributes, object labels, and ego vehicle state. A baseline Graph Convolutional Network (GCN) model is first implemented to perform multi-task learning, predicting both object type (vehicle, pedestrian, traffic_light, traffic_sign) and object visibility. This establishes a structured alternative to traditional independent object classification.
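A minimal version of the baseline can be written in plain PyTorch with a Kipf-and-Welling-style graph convolution and two output heads. The feature dimension, hidden width, and head shapes below are illustrative assumptions; the four type classes and the visibility task follow the description above.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph convolution: symmetric-normalized adjacency (with
    self-loops) times a linear projection of the node features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))      # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)   # degree normalization
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return norm_adj @ self.lin(x)

class MultiTaskSceneGCN(nn.Module):
    """Sketch of the multi-task baseline: a shared GCN trunk with one head
    for object type and one for visibility. Dimensions are assumptions."""
    def __init__(self, in_dim=16, hidden=64, num_types=4):
        super().__init__()
        self.gc1 = SimpleGCNLayer(in_dim, hidden)
        self.gc2 = SimpleGCNLayer(hidden, hidden)
        # vehicle, pedestrian, traffic_light, traffic_sign
        self.type_head = nn.Linear(hidden, num_types)
        self.vis_head = nn.Linear(hidden, 1)      # visibility logit

    def forward(self, x, adj):
        h = torch.relu(self.gc1(x, adj))
        h = torch.relu(self.gc2(h, adj))
        return self.type_head(h), self.vis_head(h)

model = MultiTaskSceneGCN()
x = torch.randn(5, 16)                 # 5 objects in one frame's scene graph
adj = torch.tensor([[0., 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1], [1, 0, 0, 1, 0]])
type_logits, vis_logits = model(x, adj)
```

The two heads can be trained jointly with a summed cross-entropy and binary cross-entropy loss, which is the standard way to realize the multi-task setup described above.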
To further capture temporal dynamics in evolving traffic environments, we extend the baseline to a dynamic spatio-temporal GNN architecture. Consecutive scene graphs are organized into temporal sequences, enabling the model to incorporate historical context from previous frames. Dynamic node embeddings are formed by fusing prior hidden representations with current node features through gated mechanisms. Spatio-temporal correlation is enhanced via cross-attention modules that selectively extract relevant information from past frames. Additionally, the adjacency structure is updated dynamically at each timestep, allowing the model to reflect changing interactions among traffic participants.
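The three temporal mechanisms above (gated fusion of hidden state with current features, cross-attention over past frames, and dynamic adjacency rebuilding) can be sketched as follows. Shapes, the gating form, and the distance-threshold adjacency rule are illustrative assumptions rather than the exact model.

```python
import torch
import torch.nn as nn

class DynamicNodeUpdate(nn.Module):
    """Sketch of a dynamic node embedding step: a sigmoid gate mixes the
    previous hidden state with current features, then each node cross-attends
    to its own history of past-frame embeddings."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_t, h_prev, memory):
        # x_t, h_prev: (N, dim); memory: (N, T, dim) past embeddings per node
        z = torch.sigmoid(self.gate(torch.cat([x_t, h_prev], dim=-1)))
        h = z * x_t + (1 - z) * h_prev           # gated fusion
        q = h.unsqueeze(1)                       # each node queries its history
        ctx, _ = self.attn(q, memory, memory)
        return h + ctx.squeeze(1)

def rebuild_adjacency(positions, radius=20.0):
    """Dynamic adjacency: connect objects within `radius` (an assumed
    proximity rule) so edges track changing interactions each timestep."""
    dist = torch.cdist(positions, positions)
    adj = (dist < radius).float()
    adj.fill_diagonal_(0.0)
    return adj

update = DynamicNodeUpdate()
h_t = update(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 3, 64))
adj_t = rebuild_adjacency(torch.tensor([[0., 0], [1, 0], [50, 50]]))
```

Rebuilding the adjacency per timestep is what lets edges appear and disappear as vehicles and pedestrians move into and out of interaction range.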
The proposed framework is evaluated in the CARLA autonomous driving environment. Experimental results demonstrate that dynamic graph modeling improves classification stability and robustness under occlusions, dense traffic, and partially degraded observations. The multi-task visibility prediction further enhances the model’s awareness of perceptual uncertainty. Compared to static graph baselines, the dynamic GNN exhibits improved accuracy and stronger generalization across complex driving scenarios.
These findings highlight the effectiveness of graph-based structured reasoning for autonomous scene understanding. By explicitly modeling object relationships and temporal evolution, GNN-based scene graph learning provides a promising direction for building more context-aware, robust, and interpretable perception systems in autonomous driving research.
Autonomous driving models are often evaluated independently, making it difficult to directly compare how architectural differences influence real-time driving behavior. To address this, we developed a unified evaluation interface that runs seven architecturally heterogeneous autonomous driving models simultaneously within the same simulation environment. Each model differs in perception backbones, fusion strategies, and planning representations, yet all operate under identical simulation conditions with standardized input streams and synchronized scenarios.
The system provides real-time visualization of steering, throttle, braking, and safety-related control signals across models. By aligning outputs temporally, the framework enables direct behavioral comparison rather than relying solely on aggregate performance metrics. This makes it possible to observe how different architectural choices affect acceleration patterns, braking responses, lane stability, and obstacle handling in complex environments.
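The temporal-alignment step can be sketched with a small helper that buffers each model's control stream and resolves them against a shared simulation clock. Class and field names here are hypothetical; the actual interface's data model may differ.

```python
from dataclasses import dataclass

@dataclass
class ControlSample:
    """One control output at a simulation timestamp (names are assumed)."""
    t: float         # simulation time in seconds
    steer: float
    throttle: float
    brake: float

class AlignedComparator:
    """Hedged sketch: collect per-model control streams and align them on
    shared timestamps for side-by-side behavioral comparison."""
    def __init__(self, model_names):
        self.streams = {m: [] for m in model_names}

    def record(self, model, sample):
        self.streams[model].append(sample)

    def aligned_at(self, t, tol=0.05):
        # Return each model's sample nearest to time t, within tolerance,
        # so behaviors are compared at the same simulated instant.
        out = {}
        for m, samples in self.streams.items():
            best = min(samples, key=lambda s: abs(s.t - t), default=None)
            if best is not None and abs(best.t - t) <= tol:
                out[m] = best
        return out

cmp = AlignedComparator(["transfuser", "interfuser"])
cmp.record("transfuser", ControlSample(0.1, steer=0.02, throttle=0.6, brake=0.0))
cmp.record("interfuser", ControlSample(0.1, steer=-0.01, throttle=0.4, brake=0.0))
snapshot = cmp.aligned_at(0.1)
```

Aligning on timestamps rather than frame indices is what makes the comparison robust when models run at different internal rates within the same synchronized scenario.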
Designed as a structured behavioral analysis platform, the interface supports scenario-based failure analysis and safety evaluation under controlled conditions. It is extensible to adversarial perturbation experiments, constraint-based assessments, and supervisory modules such as rule-based or LLM-driven safety monitors. By benchmarking heterogeneous models within a consistent and reproducible setup, this system provides a scalable foundation for comparative analysis and safety validation in autonomous driving research.