Neeloy Chakraborty, John Pohovey, Melkior Ornik, and Katherine Driggs-Campbell
University of Illinois, Urbana-Champaign
Accepted to Findings of ACL 2026
Abstract
Large language models (LLMs) have recently demonstrated success in decision-making tasks, including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments with noisy or unreliable sensors. Characterizing how LLM planners respond to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We investigate the response of LLMs to perturbations along two dimensions. Following prior work, the first dimension generates semantically equivalent prompts with varied phrasing, e.g., by randomizing the order of details or modifying access to few-shot examples. Unique to our work, the second dimension simulates access to varied sensors and injects noise to mimic failures of raw sensors or detection algorithms. An initial case study in which perturbations are applied manually shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space across many scenarios is infeasible. We therefore propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation discovers scenarios, sensor configurations, and prompt phrasings that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can proactively reveal potential failures that may arise at runtime.
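To make the two perturbation dimensions concrete, the following minimal Python sketch illustrates how they might be implemented; all function and field names are illustrative assumptions, not the paper's actual code.

import random

# Hypothetical sketch of the two perturbation dimensions described above.
# Dimension 1: rephrase the prompt (shuffle detail order, toggle few-shot examples).
# Dimension 2: corrupt the observation (add sensor noise, drop sensor fields).

def perturb_prompt(details, few_shot_examples, rng, use_few_shot=True):
    """Dimension 1: a semantically equivalent prompt with varied phrasing."""
    details = details[:]
    rng.shuffle(details)                # randomize the order of scene details
    parts = []
    if use_few_shot:
        parts.extend(few_shot_examples) # optionally include in-context examples
    parts.extend(details)
    return "\n".join(parts)

def perturb_observation(obs, rng, noise_std=0.5, drop_prob=0.1):
    """Dimension 2: simulate noisy sensors or failed detections."""
    noisy = {}
    for key, value in obs.items():
        if rng.random() < drop_prob:
            continue                    # simulate a dropped sensor reading
        noisy[key] = value + rng.gauss(0.0, noise_std)
    return noisy

rng = random.Random(0)
details = ["Ego speed: 12 m/s.", "Lead vehicle 20 m ahead.", "Lane: center."]
examples = ["Example: if the lead vehicle brakes, slow down."]
print(perturb_prompt(details, examples, rng))
print(perturb_observation({"lead_dist_m": 20.0, "ego_speed_mps": 12.0}, rng))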
Manual Perturbation Case Study
We begin with a case study analyzing how language models act in a driving environment across different scenarios and varied prompt and observation settings. We find that each model exhibits distinct behavior at runtime, making it infeasible to use one model as a proxy to characterize the behavior of another. Furthermore, prompt and observation perturbations drastically impact a model's action inconsistency rate.
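As a rough illustration, an action inconsistency rate of this kind could be computed as the fraction of perturbed prompt variants whose chosen action disagrees with the majority action. The sketch below assumes a hypothetical query_model callable and is not necessarily the paper's exact metric.

from collections import Counter

def action_inconsistency_rate(query_model, prompts):
    """Fraction of perturbed prompts whose action deviates from the majority.

    query_model maps a prompt string to a discrete action label;
    prompts are perturbed variants of the same underlying scenario.
    """
    actions = [query_model(p) for p in prompts]
    _, majority_count = Counter(actions).most_common(1)[0]
    return 1.0 - majority_count / len(actions)

# Usage with a stand-in model that brakes only when a hazard is mentioned:
stub = lambda p: "brake" if "hazard" in p else "keep_speed"
variants = ["hazard ahead, speed 12", "speed 12, hazard ahead", "speed 12"]
print(action_inconsistency_rate(stub, variants))  # 1/3 of answers disagree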
Automatic Characterization with Adaptive Stress Testing
We tackle the problem of automatically characterizing the robustness of models under perturbations using adaptive stress testing with Monte-Carlo tree search. More capable models such as Qwen hallucinate less frequently than models such as Llama and Dolphin. Furthermore, our framework allows engineers to define custom undesirability functions that target specific risky behaviors during characterization. We define undesirability functions that seek out perturbed scenarios leading to uncertain decision-making or critical crashes with other vehicles.
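The sketch below shows a minimal UCT-style MCTS loop over a tree of prompt perturbations, where the reward is a user-defined undesirability score. The node structure, candidate perturbations, and toy undesirability function are illustrative assumptions rather than the paper's implementation.

import math, random

class Node:
    def __init__(self, perturbations, parent=None):
        self.perturbations = perturbations  # sequence of applied perturbations
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def uct_score(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")                 # expand unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def mcts(undesirability, candidate_perturbations, depth=3, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: uct_score(ch, node.visits))
        # Expansion: add one child per candidate perturbation, up to max depth.
        if len(node.perturbations) < depth:
            for p in candidate_perturbations:
                node.children.append(Node(node.perturbations + (p,), node))
            node = rng.choice(node.children)
        # Simulation: score the perturbed scenario with the undesirability function.
        reward = undesirability(node.perturbations)
        # Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.perturbations

# Usage with a toy undesirability function that rewards dropping the lidar and
# shuffling detail order (stand-ins for uncertainty or crash signals):
toy = lambda ps: ("drop_lidar" in ps) + 0.5 * ("shuffle_details" in ps)
print(mcts(toy, ["shuffle_details", "drop_lidar", "add_noise"]))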
Example Characterizations
Using our method, we characterized models on driving, moon landing, and robot crowd navigation tasks.