RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors
Abstract
Evaluating deep reinforcement learning (DRL) agents against targeted behavior attacks is critical for assessing their robustness. These attacks aim to manipulate the victim into specific behaviors that align with the attacker’s objectives, often bypassing traditional reward-based defenses. Prior methods have primarily focused on reducing cumulative rewards; however, rewards are typically too generic to capture complex safety requirements effectively. As a result, focusing solely on reward reduction can lead to suboptimal attack strategies, particularly in safety-critical scenarios where more precise behavior manipulation is needed. To address these challenges, we propose RAT, a method designed for universal, targeted behavior attacks. RAT trains an intention policy that is explicitly aligned with human preferences, serving as a precise behavioral target for the adversary. Concurrently, an adversary manipulates the victim's policy to follow this target behavior. To enhance the effectiveness of these attacks, RAT dynamically adjusts the state occupancy measure within the replay buffer, allowing for more controlled and effective behavior manipulation. Our empirical results on robotic simulation tasks demonstrate that RAT outperforms existing adversarial attack algorithms in inducing specific behaviors. Additionally, RAT shows promise in improving agent robustness, leading to more resilient policies. We further validate RAT by guiding Decision Transformer agents to adopt behaviors aligned with human preferences in various MuJoCo tasks, demonstrating its effectiveness across diverse tasks.
Overview
We propose a universal targeted behavior attack method against DRL agents, designed to effectively induce specific behaviors in a victim agent.
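As a hedged illustration of this idea (a minimal sketch under our own assumptions, not the exact optimization in the paper; `victim_policy` and `intention_policy` are illustrative stand-ins for differentiable policy networks), the adversary can perturb the victim's observation within an l∞ budget ε so that the victim's action tracks the action proposed by the intention policy:

```python
import torch
import torch.nn.functional as F

def targeted_attack_step(obs, victim_policy, intention_policy,
                         epsilon=0.05, steps=10, alpha=0.01):
    """Illustrative sketch: craft an observation perturbation, bounded in the
    l-infinity ball of radius `epsilon`, that pushes the victim's action toward
    the intention policy's action (the attacker's behavior target)."""
    target_action = intention_policy(obs).detach()      # behavior the attacker wants
    delta = torch.zeros_like(obs, requires_grad=True)   # observation perturbation

    for _ in range(steps):
        victim_action = victim_policy(obs + delta)
        loss = F.mse_loss(victim_action, target_action)  # distance to the target behavior
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()            # gradient step on the perturbation
            delta.clamp_(-epsilon, epsilon)               # stay within the attack budget
        delta.grad.zero_()

    return (obs + delta).detach()                         # perturbed observation fed to the victim
```

In such a sketch, the adversary would be invoked at every environment step to produce the observation actually shown to the victim, while the intention policy supplies the behavior target.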
Experiments
Targeted Behavior Attacks
Our empirical results demonstrate RAT's strong behavior-inducing capability across several tasks.
We choose several publicly released Decision Transformer (DT) models (Cheetah, Walker) as victims, well trained with the official DT implementation on D4RL. Under behavior-oriented adversarial manipulation, the Cheetah agent lifts its leg and performs a 90-degree push-up, as depicted in Figures (a) and (b), while the Walker agent performs a one-legged dance and balances on one foot, as illustrated in Figures (c) and (d).
(a) Cheetah Lift leg
(b) 90-Degree Push-up
(c) Walker Dance
(d) Stand on One Foot
We also train victim models on Meta-world tasks using the SAC algorithm for 1 million time steps; these agents learn to lock the door, open the drawer, open the window, and turn on the faucet, respectively. Rather than simply driving the robot arm away from the manipulated object, RAT's attack precisely prompts it to perform the behaviors that humans desire (a minimal training sketch follows the target behaviors below).
Unlock the Door
Close the Drawer
Close the Window
Close the Faucet
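As a hedged sketch of this victim-training setup (assuming stable-baselines3 for SAC; a standard Gymnasium task stands in for the Meta-world environments, whose construction depends on the installed Meta-world version):

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Stand-in environment; the actual victims are trained on Meta-world
# manipulation tasks (door-lock, drawer-open, window-open, faucet-open).
env = gym.make("Pendulum-v1")

# Train a victim policy with SAC for 1M environment steps, as described above.
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("sac_victim")
```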
Improving Robustness
A practical application of RAT is in assessing the robustness of established models or in enhancing an agent's robustness via adversarial training. Our methods, RAT-ATLT and RAT-WocaR, significantly enhance the robustness of agents in the Meta-world environment.
This table presents the average episode rewards ± standard deviation for robust agents under various attack methods, with results averaged across 100 episodes.
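As a hedged, high-level sketch of adversarial training (the general alternating scheme, not the exact RAT-ATLT or RAT-WocaR procedures; `update_adversary` and `update_agent` are hypothetical RL update routines supplied by the caller):

```python
def adversarial_training(agent, adversary, env, update_adversary, update_agent,
                         iterations=100):
    """Illustrative alternating training loop: the adversary learns to perturb
    the agent's observations, and the agent learns to act well under those
    perturbations. The update routines are hypothetical placeholders."""
    for _ in range(iterations):
        # 1) Fix the agent; train the adversary to find effective perturbations.
        update_adversary(adversary, agent, env)
        # 2) Fix the adversary; train the agent on perturbed observations.
        update_agent(agent, adversary, env)
    return agent
```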
Additional Experiments
Contribution of Each Component
We perform additional experiments to explore the impact of each component in RAT across various tasks. This table reports the success rate on four robotic manipulation tasks from Meta-world, averaged over five runs.
The results in the table illustrate that:
1. The intention policy, serving as a flexible behavior target, plays a crucial role in RAT.
2. The weighting function can further enhance the asymptotic performance of RAT in the bi-level optimization.
3. The combined policy (behavior policy) boosts performance by mitigating the state-action distribution shift (a minimal mixing sketch is given below).
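As a hedged illustration of the third point (one plausible mixing scheme, not necessarily RAT's exact formulation; all function names are illustrative assumptions), a combined behavior policy for data collection could mix the intention policy's actions with the victim's actions under the adversary's perturbation:

```python
import random

def behavior_action(obs, victim_policy, intention_policy, adversary, mix_prob=0.5):
    """Illustrative combined behavior policy for data collection: with probability
    `mix_prob`, act with the intention policy (the target behavior); otherwise,
    act with the victim on the adversary's perturbed observation. Mixing keeps
    the collected data close to the states the attacked victim will actually
    visit, mitigating state-action distribution shift.
    (A plausible scheme, not necessarily the paper's exact one.)"""
    if random.random() < mix_prob:
        return intention_policy(obs)
    perturbed_obs = adversary(obs)      # bounded observation perturbation
    return victim_policy(perturbed_obs)
```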
Impact of Feedback Amount and Different Attack Budgets
We investigate the impact of the quantity of preferences and the attack budget on performance.
The results show that RAT's performance improves as the number of preference labels increases, underscoring the importance of feedback quantity.
Experimental results also indicate that the performance of all methods improves as the attack budget, which determines the adversary's perturbation power, increases.
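As a hedged note on what the attack budget means in practice (assuming, as is common in this literature, an l∞-bounded observation perturbation), a larger budget ε simply allows larger deviations from the clean observation:

```python
import numpy as np

def project_to_budget(clean_obs, perturbed_obs, epsilon):
    """Project a perturbed observation back into the l-infinity ball of radius
    `epsilon` around the clean observation; `epsilon` is the attack budget
    (a larger epsilon corresponds to a stronger adversary)."""
    delta = np.clip(perturbed_obs - clean_obs, -epsilon, epsilon)
    return clean_obs + delta
```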
Quality of Learned Reward Functions
We further analyze the quality of the reward functions learned by RAT in comparison to the true reward function. The figure below shows four time-series plots of the normalized learned reward (blue) and the ground-truth reward (red). The results suggest that the reward function learned from human feedback aligns well with the true reward function (a minimal sketch of the preference-based reward learning objective follows the panels below).
Drawer Close
Drawer Open
Faucet Close
Faucet Open
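As a hedged sketch of how such a reward function is commonly learned from pairwise human preferences (the standard Bradley-Terry style objective from preference-based RL; the paper's exact objective may differ, and all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, label):
    """Bradley-Terry style preference loss. `segment_a` and `segment_b` are
    tensors of (state, action) features for two trajectory segments, and
    `label` is 1.0 if the human prefers segment_a, else 0.0.
    (Standard preference-based RL objective, shown as an assumption about how
    the learned reward in the plots above could be trained.)"""
    # Sum the predicted per-step rewards over each segment.
    return_a = reward_model(segment_a).sum()
    return_b = reward_model(segment_b).sum()
    # Probability that segment_a is preferred, via a logistic (Bradley-Terry) link.
    logits = return_a - return_b
    return F.binary_cross_entropy_with_logits(logits, torch.tensor(label))
```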