Decision-making is crucial for embodied AI systems like intelligent vehicles and robots, enabling effective planning and action in physical environments. Recently, deep learning has gained attention for decision-making tasks, in addition to its widespread use in perception and prediction. However, challenges remain in handling open-world corner cases and achieving generalized performance.
Large language models (LLMs), with strong generalization and reasoning abilities from massive data, offer promising solutions. Recent works have built LLM-based systems that generate decisions and behavior plans for physical agents from textual descriptions. These systems must accurately respond to environmental inputs and goals. Yet, generic LLMs often struggle with complex, domain-specific tasks. Fine-tuning and retrieval-augmented generation (RAG) are used to adapt LLMs efficiently, avoiding costly training from scratch and enabling domain-specific reasoning.
Nonetheless, LLMs are vulnerable to attacks like jailbreaking and in-context backdoors, posing physical risks in embodied applications. Existing research doesn’t fully address threats from fine-tuning, RAG, and real-world grounding, which create new attack surfaces.
We propose BALD, a comprehensive Backdoor Attack framework against LLM-based Decision-making in embodied AI. BALD explores: (1) Word injection – embedding triggers in prompts, (2) Scenario manipulation – altering physical environments to trigger malicious behavior, and (3) Knowledge injection – placing triggers in RAG databases. Tested on GPT-3.5, LLaMA-2, PaLM, and platforms like HighwayEnv, nuScenes, and VirtualHome, BALD causes behaviors like incorrect lane changes, crashes, and placing a knife on a bed. Our attacks reach near 100% success for word and knowledge injection with minimal performance loss and over 65% success in scenario manipulation.
Overview of our proposed BALD (Backdoor Attacks against LLM-enabled Decision-making systems) framework. We propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, with each targeting different stages of the representative abstraction of the LLM-based decision-making system pipeline.
We treat LLMs as a one-step optimizer to select the trigger words. The proposed strategies can: (1) maximize the attack effectiveness; (2) minimize the performance degradation on benign inputs; and (3) minimize the influence on benign LLMs
We propose a novel scenario-based trigger mechanism, utilizing a high-level distinct semantic scenario or environment as the trigger. We propose strategies to maximize the attack effectiveness while minimizing the rate the trigger is falsely triggered by other scenarios
We propose a novel dual trigger for retrieval and attack.
Retrieval: The correct data with trigger words is retrieved when similar scenarios are encountered
Attack: The trigger words activate the attack
Motivation: Fine-tuning can largely increase the performance of LLMs on specific embodied tasks.
Existing Approach: Attacks on ICL are much less effective in the fine-tuned embodied agent.
Effectiveness: BALD-word and BALD-RAG can achieve nearly 100% attack success rate, while BALD-scene is slightly less effective. All have shown severe safety consequences.
Mitigation: Inference stage defense can hardly defend against our attacks.
[ICLR'25] Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-Based Decision-Making Systems
Ruochen Jiao*, Shaoyuan Xie* (co-first author), Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu
The Thirteenth International Conference on Learning Representations (ICLR) 2025. Acceptance rate 32.08% = 3689/11500
BibTex for citation:
@inproceedings{
jiao2025can,
title={Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied {LLM}-Based Decision-Making Systems},
author={Ruochen Jiao and Shaoyuan Xie and Justin Yue and TAKAMI SATO and Lixu Wang and Yixuan Wang and Qi Alfred Chen and Qi Zhu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}
Ruochen Jiao, Ph.D. student, Northwestern University
Shaoyuan Xie, Ph.D. student, University of California, Irvine
Justin Yue, Undergraduate student, University of California, Irvine
Takami Sato, Ph.D. student, University of California, Irvine
Lixu Wang, Ph.D. student, Northwestern University
Yixuan Wang , Ph.D. student, Northwestern University
Qi Alfred Chen, Assistant Professor, University of California, Irvine
Qi Zhu, Professor, Northwestern University
This research was supported by
NSF under grants CNS-2145493;
USDOT under Grant 69A3552348327 for the CARMEN+ University Transportation Center.