Code is available at https://anonymous.4open.science/r/Causal-Fuzzer-2464.
Simulation-based testing is essential for evaluating the safety of Autonomous Driving Systems (ADSs). Comprehensive evaluation requires testing across diverse scenarios that can trigger various types of violations under different conditions. Existing methods typically measure diversity along a single dimension, such as input scenarios, ADS-generated motion commands, or system violations, and often fail to capture the complex interrelationships among these elements. For instance, identical motion commands can produce different collision risks in different scenes, and the same collision may result from different commands in different scenarios. This oversight leaves gaps in testing coverage and may miss critical issues in the ADS under evaluation. In this paper, we propose Causal-Fuzzer, the first causality-aware fuzzing technique that enables efficient and comprehensive testing of ADSs by constructing causal graphs to model the interrelationships among scenarios, actions, and violations. Unlike existing methods that treat diversity metrics independently, we recognize that these elements are causally interconnected and use their relationships to identify more diverse violations triggered by fundamentally different causal mechanisms. Specifically, Causal-Fuzzer introduces (1) a causality-based feedback mechanism that quantifies the combined diversity of test scenarios by assessing whether they activate new causal relationships, and (2) a causality-driven mutation strategy that prioritizes mutations of input scenario elements with higher causal impact on ego action changes and violation occurrence, enabling interpretable and efficient test generation. We evaluated Causal-Fuzzer on Apollo, an industry-grade ADS, using the high-fidelity simulator LGSVL.
Our empirical results demonstrate that Causal-Fuzzer significantly outperforms existing methods in (1) identifying a greater diversity of violations (96.5 violations on average, compared to 66.9 for the best baseline method), (2) providing enhanced testing sufficiency with improved coverage of causal relationships (13.6 unique scene-action-violation patterns on average, compared to 8.6 for the best baseline method), and (3) achieving greater efficiency in detecting critical scenarios, strong robustness under noise conditions, and good generalizability across varying scenario complexities and violation types.
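As a rough illustration of the feedback idea described above, a test scenario can be scored by how many previously unseen scene-action-violation patterns it activates. The sketch below is a simplification under stated assumptions, not Causal-Fuzzer's implementation: it replaces the causal graph with a plain set of triples, and the names `CausalPattern`, `PatternCoverage`, and the string labels are all hypothetical.

```python
from typing import NamedTuple, Set


class CausalPattern(NamedTuple):
    """Hypothetical scene-action-violation triple; labels are illustrative."""
    scene: str
    action: str
    violation: str


class PatternCoverage:
    """Tracks which patterns earlier test scenarios have already activated."""

    def __init__(self) -> None:
        self.seen: Set[CausalPattern] = set()

    def feedback(self, activated: Set[CausalPattern]) -> int:
        """Return the number of newly activated patterns, then record them."""
        new = activated - self.seen
        self.seen |= new
        return len(new)


# Assumed example data: two scenarios, the second repeats one pattern.
coverage = PatternCoverage()
first = {CausalPattern("npc_right_rear", "keep_lane", "collision")}
second = first | {CausalPattern("npc_front_right", "lane_change", "collision")}

score_first = coverage.feedback(first)    # one new pattern
score_second = coverage.feedback(second)  # only the added pattern is new
```

In this simplified form, a scenario that only repeats known patterns receives a score of zero, so a fuzzer using such a signal would favor seeds that exercise new combinations rather than new values along a single dimension.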
A collision caused by an NPC in the right rear.
A collision with an NPC in the front-right, caused by the ego vehicle.