Code is available at https://anonymous.4open.science/r/Causal-Fuzzer-2464
Simulation-based testing is a critical way to evaluate the safety of Autonomous Driving Systems (ADSs). To evaluate ADSs comprehensively, we need to test them across diverse scenarios that can induce different types of violations arising from various conditions. Existing methods primarily focus on individual diversity metrics, such as the diversity of input scenarios, ADS-generated motion commands, and system violations, or on linear combinations of these metrics, and often overlook the complex interrelationships among them. For example, the same motion command can lead to different collision risks in different scenes, and the same collision can be caused by different motion commands under different scenarios. As a result, existing methods miss some distinct critical scenarios and leave issues in the ADS under test undetected. However, quantifying the interrelationships among different diversity metrics is challenging. In this paper, we propose a novel causality-aware fuzzing technique, Causal-Fuzzer, to enable efficient and comprehensive testing of ADSs by exploring diverse interrelationships among different diversity metrics. At its core, Causal-Fuzzer constructs a causal graph that models the interrelationships among the diversity of input scenarios, ADS motion commands, and system violations; this graph then guides the generation of critical scenarios. Specifically, we propose (1) a causality-based feedback method that quantifies the combined diversity of a test scenario by assessing whether it activates new causal relationships, and (2) a causality-driven mutation strategy that prioritizes mutations on input scenario elements with a higher causal impact on ego action changes and the occurrence of violations, rather than treating all elements equally. We evaluated Causal-Fuzzer on Apollo, an industry-grade ADS, with the high-fidelity simulator LGSVL.
Empirical results show that Causal-Fuzzer outperforms existing methods: it identifies a greater diversity of violations (98.4 on average, compared to 42.4 for the best baseline method), provides greater testing sufficiency through improved coverage of causal relationships (12.9 on average, compared to 6.8), and detects the first critical scenarios more efficiently (32.1 scenarios on average, compared to 71.5 for the best baseline method). Our source code and experimental results are available at https://sites.google.com/view/causal-fuzzer.
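To make the two proposed components concrete, the following is a minimal Python sketch, not the implementation used in Causal-Fuzzer. It assumes that a causal relationship can be represented as a (cause, effect) pair of discretized event labels and that each scenario element already has a non-negative causal-impact estimate. The names `CausalFeedback`, `score`, `pick_element`, and the event labels used with them are hypothetical. The feedback part rewards a scenario by the number of causal relationships it activates for the first time, and the mutation part samples the element to mutate with probability proportional to its estimated impact.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple

# Assumption: a causal relationship is a (cause, effect) pair of event labels.
Edge = Tuple[str, str]


@dataclass
class CausalFeedback:
    """Tracks which causal relationships earlier scenarios have activated."""
    covered: Set[Edge] = field(default_factory=set)

    def score(self, activated: Iterable[Edge]) -> int:
        """Return how many relationships are new, then mark them as covered."""
        new_edges = set(activated) - self.covered
        self.covered |= new_edges
        return len(new_edges)


def pick_element(causal_impact: Dict[str, float], rng: random.Random) -> str:
    """Sample a scenario element with probability proportional to its
    (assumed, precomputed) causal impact, instead of sampling uniformly."""
    elements = list(causal_impact)
    weights = [causal_impact[e] for e in elements]
    return rng.choices(elements, weights=weights, k=1)[0]
```

In a fuzzing loop built on this sketch, a scenario whose `score` is greater than zero would be kept as a seed, and `pick_element` would choose which element of that seed to mutate next. How the causal graph and the impact estimates are obtained is outside the scope of this sketch.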
A collision caused by an NPC in the right rear.
A collision with an NPC in the front-right, caused by the ego vehicle.