On the Robustness of Safe Reinforcement Learning under Observational Perturbations

 Abstract

Safe reinforcement learning (RL) trains a policy to maximize the task reward while satisfying constraints. While prior works focus on performance optimality, we find that the optimal solutions to many safe RL problems are not robust and safe against carefully designed attackers. We formally analyze the unique properties of designing effective adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL and propose two new approaches: one that maximizes the cost and one that maximizes the reward. One interesting and counter-intuitive finding is that the maximum-reward attack is a strong one: it both induces unsafe behaviors and remains stealthy by maintaining the reward. We further propose a much safer adversarial training algorithm and evaluate it via comprehensive experiments. This work sheds light on the inherent connection between robustness and safety in RL and provides pioneering groundwork for future safe RL studies.

Method Overview


Adversarial attackers


Given an optimal policy of a tempting safe RL problem, we aim to design strong adversaries that effectively make the agent unsafe while keeping the attack stealthy in terms of reward. We propose two strong adversaries: the Maximum Reward (MR) critic attacker and the Maximum Cost (MC) critic attacker. We also propose an improved Maximum Action Difference (MAD) attacker, which we name the Adaptive MAD (AMAD) attacker. The AMAD attacker performs similarly to or better than the MAD attacker and is used as one of the baselines.
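As a rough illustration, the sketch below shows how such a critic-based attacker could be implemented with projected gradient ascent on the observation; the helper names (`policy`, `critic`, `cost_value_fn`) and the hyperparameters are placeholders and assumptions, not the released implementation.

```python
import torch

def critic_attack(obs, policy, critic, eps=0.05, n_steps=10, step_size=0.0125):
    """Perturb `obs` within an L-inf ball of radius `eps` so that the action the
    policy takes at the perturbed observation maximizes the given critic.
    Using the cost critic gives an MC-style attack; the reward critic, MR-style."""
    obs = obs.detach()
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(n_steps):
        # The critic is queried at the true observation, but with the action
        # the agent would choose at the perturbed observation.
        value = critic(obs, policy(obs + delta)).sum()
        value.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (obs + delta).detach()


def amad_should_attack(obs, cost_value_fn, threshold):
    """AMAD-style gate (sketch): only apply the MAD perturbation in high-risk
    regions, e.g. when a learned cost value exceeds a threshold."""
    return cost_value_fn(obs) > threshold
```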


Adversarial safe RL training algorithm


To defend against observational perturbations, we propose an adversarial safe RL training method. Similar to adversarial training in the supervised learning literature, we directly optimize the policy on trajectories sampled under attack. The meta adversarial training algorithm is shown on the right. We adopt the primal-dual methods that are widely used in the safe RL literature as the learner, use MC or MR as the adversary when sampling trajectories, and use a scheduler function to control the training of the reward and cost Q-value functions required by the MR and MC attackers.
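A minimal sketch of this training loop is given below, assuming hypothetical helpers (`rollout`, `primal_dual_update`, `update_attacker_critics`) for the sampling, the primal-dual learner, and the attacker-critic updates; it illustrates the structure of the meta algorithm rather than the exact released code.

```python
def adversarial_safe_rl_training(env, policy, attacker, scheduler, epochs):
    """Sketch of the meta adversarial training loop (placeholder helpers)."""
    for epoch in range(epochs):
        # The scheduler decides whether the attack is active at this epoch,
        # e.g. warming up the attacker's Q-functions before applying it.
        attack_active = scheduler(epoch)

        # 1) Sample trajectories: the agent acts on (possibly) attacker-perturbed
        #    observations, while rewards and costs come from the true state.
        batch = rollout(env, policy, attacker if attack_active else None)

        # 2) Primal-dual safe RL update (e.g. PPO-Lagrangian) directly on the
        #    attacked trajectories, as in adversarial training for supervised learning.
        primal_dual_update(policy, batch)

        # 3) Refresh the reward and cost Q-value functions used by the
        #    MR / MC attackers so the adversary keeps pace with the policy.
        update_attacker_critics(attacker, batch)
```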

Experiment

Environments. We evaluate our adversarial safe RL approach in the Bullet-Safety-Gym environments, as shown below. We consider two tasks (Run and Circle) and train three different robots (Car, Drone, Ant) on each task.

Circle task

Run task

Car

Drone

Ant
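For reference, the environments above can be instantiated roughly as follows; the registered environment IDs are an assumption based on the public Bullet-Safety-Gym package and may differ from the exact names used in our experiments.

```python
import gym
import bullet_safety_gym  # noqa: F401  (registers the Safety* environments)

# Assumed ID pattern "Safety{Robot}{Task}-v0"; check your installation's registry.
ROBOTS = ["Car", "Drone", "Ant"]
TASKS = ["Circle", "Run"]

envs = {
    f"{robot}-{task}": gym.make(f"Safety{robot}{task}-v0")
    for robot in ROBOTS
    for task in TASKS
}
```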

Adversarial attackers' performance comparison - video demo

The following animations show the vanilla PPO-Lagrangian agent's performance under different adversarial attacks in the Car-Circle task, where the agent should move along a circle while staying within the safety boundaries (two yellow planes). The video caption corresponds to the attacker's name, and "Natural" represents no attacks.

We can see that the agent performs well without adversarial attacks (Natural video) or under random noise, but it violates the safety constraints (crosses the safety boundaries) under the MC, MR, and AMAD adversarial attacks, which indicates that the vanilla PPOL algorithm is vulnerable to adversarial attacks.

Though the MAD attacker can make the agent perform poorly in terms of task reward, it does not induce any unsafe behaviors, as the agent still stays in the safe region. In contrast, the AMAD attacker, which only attacks in high-risk regions (close to the safety boundary), induces more constraint violations.

Natural

Random 

MAD

AMAD

MR

MC

The following figure shows the performance of the baseline attackers (Random, MAD, AMAD) and our MC and MR adversaries when attacking well-trained PPO-Lagrangian policies in different tasks. The trained policies achieve nearly zero constraint-violation cost without observational perturbations. We keep the trained model weights and environment seeds fixed for all attackers to ensure fair comparisons. From the figure, we can see that our proposed MC and MR attackers outperform all baseline attackers (Random, MAD, and AMAD) in terms of effectiveness, increasing the cost by a large margin in most tasks. In addition, the MR attacker is stealthy, as it maintains the reward very well.

Reward and cost curves of all 5 attackers evaluated on well-trained vanilla PPO-Lagrangian models w.r.t. the perturbation range ε. The curves are averaged over 5 random seeds and 50 episodes; the solid lines are the mean and the shaded areas are the standard deviation. The dashed line is the cost of the policy without perturbations (ε = 0).
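The evaluation protocol behind this figure can be sketched as follows, assuming a hypothetical `run_episode` helper that returns the episodic reward and cost; the frozen policy, fixed seeds, and ε sweep mirror the description above.

```python
import numpy as np

def evaluate_attacker(env, policy, attacker, epsilons, seeds=range(5), episodes=50):
    """Attack a frozen policy over a sweep of perturbation ranges epsilon,
    keeping environment seeds fixed so every attacker sees the same episodes."""
    results = {}
    for eps in epsilons:
        rewards, costs = [], []
        for seed in seeds:
            env.seed(seed)  # identical seeds across attackers for fair comparison
            for _ in range(episodes):
                ep_reward, ep_cost = run_episode(env, policy, attacker, eps)
                rewards.append(ep_reward)
                costs.append(ep_cost)
        results[eps] = {
            "reward_mean": np.mean(rewards), "reward_std": np.std(rewards),
            "cost_mean": np.mean(costs), "cost_std": np.std(costs),
        }
    return results
```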

Adversarially trained PPO-Lagrangian agent under adversarial attacks - video demo

After adversarial training, the agent is robust against various adversarial attackers: it stays in the safe region with far fewer constraint violations. The videos below show the performance of our ADV-PPOL(MC) agent tested on the Car-Circle task. It is apparent that the proposed adversarially trained agent is more robust than the vanilla PPO-Lagrangian agent.

ADV-PPOL(MC)

Random 

MAD

AMAD

MR 

MC

We use the PID PPO-Lagrangian (abbreviated as PPOL) algorithm as the base safe RL learner, while the proposed adversarial training can be used with other safe RL methods as well. We adopt five baselines, including the PPOL-vanilla method without robust training, naive adversarial training under random noise (PPOL-random), and the state-adversarial algorithm SA-PPOL, where we extend the original PPO for the standard RL setting to PPOL for the safe RL setting. The original SA-PPOL algorithm utilizes the MAD attacker to compute adversarial states and then adds a KL regularizer to penalize the divergence between the policy's outputs at the adversarial states and at the original states. We add two additional baselines, SA-PPOL(MC) and SA-PPOL(MR), for the ablation study, where we replace the MAD attacker with our proposed MC and MR adversaries.
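To make the SA-PPOL-style baselines concrete, the snippet below sketches the KL regularizer term in isolation; `policy.distribution` and the `attacker` callable are placeholder names, and the term would be added to the usual PPOL objective.

```python
import torch

def sa_kl_regularizer(policy, obs, attacker, kl_coef=1.0):
    """KL term penalizing the divergence between the action distributions at the
    original and adversarially perturbed observations (SA-PPOL-style, sketch)."""
    adv_obs = attacker(obs)                   # MAD in SA-PPOL; MC/MR in the ablations
    dist = policy.distribution(obs)           # pi(. | s)
    adv_dist = policy.distribution(adv_obs)   # pi(. | s_tilde)
    kl = torch.distributions.kl_divergence(dist, adv_dist).mean()
    return kl_coef * kl                       # added to the PPOL loss
```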

Our adversarial training methods are named ADV-PPOL(MC) and ADV-PPOL(MR), and are trained under the MC and MR attackers, respectively.

The complete results are shown in the table below. Although most algorithms achieve near-zero natural costs, the baseline approaches are vulnerable to strong MC and MR attacks. The proposed adversarial training methods (ADV-PPOL) consistently outperform the baselines in safety, achieving the lowest costs while maintaining high rewards.

Evaluation results of the natural performance (no attack) and under all 5 attackers. Our methods are ADV-PPOL(MC/MR). Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds. We shade the two lowest-cost agents under each attacker column and break ties based on rewards, excluding the failing agents (whose natural rewards are less than 30% of PPOL-vanilla's). We mark the failing agents with.