Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data

Abstract

Previous work demonstrates that the optimal safe reinforcement learning (SRL) policy in a noise-free environment is vulnerable and can be unsafe under observational attacks. While adversarial training effectively improves robustness and safety, collecting samples by attacking the behavior agent online can be expensive or prohibitively dangerous in many applications. We propose the robuSt vAriational ofF-policy lEaRning (SAFER) approach, which requires only benign training data and does not attack the agent during data collection. SAFER obtains an optimal non-parametric variational policy distribution via convex optimization and then uses it to robustly improve the parametrized policy via supervised learning. This two-stage policy optimization facilitates robust training, and extensive experiments on multiple robot platforms show that SAFER efficiently learns a robust and safe policy, achieving the same reward with far fewer constraint violations during training than on-policy baselines.

Method Overview

An overview of SAFER is shown in the left figure below. It consists of a constrained E-step and a robust M-step. The E-step aims to find the optimal variational distribution that maximizes the reward return while satisfying the safety constraint. This step can be written as a constrained optimization problem, which has a closed-form solution and can be solved efficiently by convex optimization. The robust M-step has two components: a vanilla M-step that fits the variational distribution obtained from the E-step with a parametrized policy, such as a neural network, so as to generalize beyond the state-action samples used for training; and an adversarial training step that improves the policy's robustness by optimizing against worst-case perturbations. The pseudo-code of SAFER is shown below.

Overview of SAFER

Pseudo-code of SAFER
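To make the two-stage update concrete, below is a minimal sketch of a constrained E-step followed by a robust M-step. The names are assumptions for illustration (reward-critic values q_r, cost-critic values q_c, Lagrange multiplier lam, KL radius eps_kl, temperature eta, an observation attacker attack_fn, and a hypothetical policy.log_prob interface); it shows the structure of the update, not the authors' exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def e_step_weights(q_r, q_c, lam, eta):
    """Closed-form E-step: the non-parametric variational distribution over K
    sampled actions per state is q(a|s) proportional to exp((Q_r - lam * Q_c) / eta).
    q_r, q_c: tensors of shape [batch, K]."""
    adv = q_r - lam * q_c                      # cost-penalized advantage
    return F.softmax(adv / eta, dim=-1)        # per-state action weights

def temperature_dual(eta, q_r, q_c, lam, eps_kl):
    """Convex dual of the KL-constrained E-step; minimizing it over eta > 0
    gives the temperature used in e_step_weights."""
    adv = (q_r - lam * q_c) / eta
    k = q_r.shape[-1]
    return eta * eps_kl + eta * (torch.logsumexp(adv, dim=-1) - math.log(k)).mean()

def robust_m_step(policy, optimizer, states, sampled_actions, weights, attack_fn):
    """Vanilla M-step: weighted maximum likelihood towards the E-step target.
    Robust M-step: repeat the same projection on adversarially perturbed
    observations so the policy matches the target under worst-case inputs."""
    for obs in (states, attack_fn(policy, states)):   # clean batch, then attacked batch
        logp = policy.log_prob(obs, sampled_actions)  # hypothetical interface: [batch, K]
        loss = -(weights.detach() * logp).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```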

Experiments

Experiment Setting 

We consider two tasks (Run and Circle) and four robots (Ball, Car, Drone, and Ant). In the Run task, the agents are rewarded for running fast between two boundaries and incur a constraint-violation cost if they cross the boundaries or exceed an agent-specific velocity threshold. In the Circle task, the agents are rewarded for running along a circle but are constrained to stay within a safe region that is smaller than the radius of the target circle. We refer to the resulting tasks as Ball-Circle, Car-Circle, Drone-Run, and Ant-Run.

Circle task

Run task

Ball

Car

Drone

Ant
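As a rough illustration of this constraint structure, a binary cost signal for the two tasks might look like the sketch below; the parameter names (boundary, velocity_limit, safe_region) are illustrative, and the actual thresholds are agent-specific.

```python
import numpy as np

def run_cost(y_position, velocity, boundary, velocity_limit):
    """Run task: cost of 1 if the agent crosses either boundary or exceeds
    its agent-specific velocity threshold, and 0 otherwise."""
    too_fast = np.linalg.norm(velocity) > velocity_limit
    return float(abs(y_position) > boundary or too_fast)

def circle_cost(x_position, safe_region):
    """Circle task: cost of 1 if the agent leaves the safe region, which is
    narrower than the radius of the target circle."""
    return float(abs(x_position) > safe_region)
```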

Baselines

On-policy Baselines. We use the adversarial training algorithm ADV-PPOL as the main on-policy baseline; it collects trajectories corrupted by the Maximum-Cost (MC) attacker to train the base safe RL agent, PPOL. We also use the Maximum-Reward (MR) attacker proposed in ADV-PPOL. As another baseline, we use SA-PPOL, a robust training algorithm that is effective in standard RL and utilizes the Maximum Action Difference (MAD) attacker. We further extend it by replacing the MAD attacker with the stronger MC attacker for the safe RL setting, which yields the SA-PPOL(MC) baseline.
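For reference, an observation-space Maximum-Cost attacker is commonly implemented as projected gradient ascent on a learned cost critic. The sketch below assumes a cost critic q_c(obs, act), a deterministic policy mean policy(obs), and illustrative epsilon/step-size values; it is not necessarily the baselines' exact attacker.

```python
import torch

def mc_attack(policy, q_c, obs, epsilon=0.05, steps=10, step_size=0.01):
    """Search within an L-infinity ball around the true observation for the
    perturbation that maximizes the predicted cost of the resulting action."""
    obs_min = (obs - epsilon).detach()
    obs_max = (obs + epsilon).detach()
    obs_adv = obs.clone().detach()
    for _ in range(steps):
        obs_adv.requires_grad_(True)
        cost = q_c(obs_adv, policy(obs_adv)).mean()
        grad, = torch.autograd.grad(cost, obs_adv)
        obs_adv = obs_adv.detach() + step_size * grad.sign()       # ascend on predicted cost
        obs_adv = torch.min(torch.max(obs_adv, obs_min), obs_max)  # project back to the ball
    return obs_adv.detach()
```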

Off-policy Baselines. Since SAFER is closely related to the EM-based safe RL algorithm CVPO, we use it as a basic baseline and name it CVPO-vanilla. We adopt its variant CVPO-random, which is trained under random observation noise, as another baseline. We directly apply the online adversarial training technique of ADV-PPOL with the MC attacker to the off-policy setting, which yields the ADV-CVPO baseline. We also consider a simple and intuitive adversarial training method that attacks the data sampled from the replay buffer, which we name the ADV-EM-CVPO baseline. The data-flow diagrams are shown in the figures below.

Data-flow diagrams of ADV-PPOL (left), ADV-CVPO (middle), and ADV-EM-CVPO (right)
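To clarify how ADV-EM-CVPO differs from online adversarial training: rollouts are collected benignly, and the attacker is applied only to observations sampled from the replay buffer before the EM update. The snippet below is a sketch under a hypothetical replay_buffer.sample / agent.em_update interface, not the exact baseline code.

```python
def adv_em_update(agent, replay_buffer, attacker, batch_size=256):
    """One ADV-EM-CVPO-style update: attack only the sampled off-policy data."""
    batch = replay_buffer.sample(batch_size)              # benign, off-policy data
    batch["obs"] = attacker(agent.policy, batch["obs"])   # perturb the sampled observations
    agent.em_update(batch)                                 # E-step and M-step on the attacked batch
```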

Final Performance of SAFER - video demos

The following animations show the performance of some baselines and SAFER under the MC and MR attackers in the Car-Circle environment. The first row shows the performance of CVPO-vanilla: the vanilla EM-based safe RL method is vulnerable to adversarial attacks, although it attains near-zero natural cost in a noise-free environment. The second row shows the performance of ADV-CVPO: the adversarial training techniques that succeed on-policy in ADV-PPOL do not transfer to the off-policy setting, as ADV-CVPO is unsafe under adversarial attacks and even performs poorly in noise-free environments. In contrast, the SAFER agents remain robust and safe against adversarial attacks. More detailed results are in the table below, and more videos are available here.

Sample Efficiency During Learning

The figure below shows how efficiently each agent converts constraint violations into task reward, i.e., how much task reward the agent can obtain within a given budget of constraint violations. SAFER is more sample efficient during learning and outperforms the on-policy baselines by a large margin across all tasks: it uses 4 to 20 times fewer cumulative constraint violations to achieve the same task reward.
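One way to summarize such a curve as a single number is sketched below: for a given budget of cumulative constraint violations, report the best task reward reached before the budget is exhausted. The per-episode reward and cost arrays and the budget are assumed inputs; this is not the paper's exact evaluation code.

```python
import numpy as np

def reward_at_cost_budget(episode_rewards, episode_costs, budget):
    """episode_rewards, episode_costs: per-training-episode arrays in time order."""
    episode_rewards = np.asarray(episode_rewards, dtype=float)
    cumulative_cost = np.cumsum(np.asarray(episode_costs, dtype=float))
    within_budget = cumulative_cost <= budget
    return episode_rewards[within_budget].max() if within_budget.any() else None
```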