Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones

Brijen Thananjeyan*, Ashwin Balakrishna*, Suraj Nair, Michael Luo, Krishnan Srinivasan, Minho Hwang,

Joseph E. Gonzalez, Julian Ibarz, Chelsea Finn, Ken Goldberg

Abstract:

Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration. We propose Recovery RL, an algorithm which navigates this tradeoff by (1) leveraging offline data to learn about constraint violating zones before policy learning and (2) separating the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent to safety when constraint violation is likely. We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task, and an image-based reaching task on a physical robot. We compare Recovery RL to 5 prior safe RL methods which jointly optimize for task performance and safety via constrained optimization or reward shaping and find that Recovery RL outperforms the next best prior method across all domains. Results suggest that Recovery RL trades off constraint violations and task successes 2 - 80 times more efficiently in simulation domains and 12 times more efficiently in physical experiments.

Recovery RL:

The key insights in Recovery RL are to (1) efficiently leverage offline data of constraint violations to learn about constraints before interacting with the environment and (2) separate the goals of task performance and safety across two policies: a task policy, which solely optimizes the unconstrained objective, and a recovery policy, which defines a new MDP for the task policy in which exploration is probabilistically safe under the task constraints. These choices enable safer learning online, since a human can often provide controlled examples of constraint violations before the robot interacts with the environment. They also yield a better balance of task performance and constraint satisfaction, since optimizing the two objectives separately mitigates the suboptimality and instability that can arise when a single objective must encourage both good task performance and safe learning.
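To make insight (1) concrete, the snippet below is a minimal PyTorch-style sketch of how a safety critic could be pretrained on offline transitions that contain constraint violations, before any online interaction. The class and function names (SafetyCritic, pretrain_step), the network architecture, and the value of gamma_risk are illustrative assumptions rather than the paper's exact implementation; the update simply treats the binary violation indicator as a discounted "risk" signal.

```python
import torch
import torch.nn as nn

# Illustrative sketch: pretraining a safety critic from offline transitions
# (obs, act, next_obs, c), where c = 1 if the transition violates a constraint
# and 0 otherwise. gamma_risk discounts future violation risk.
class SafetyCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output in [0, 1]: violation probability
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def pretrain_step(critic, target_critic, policy, batch, optimizer, gamma_risk=0.8):
    """One gradient step on offline data; names and hyperparameters are assumptions."""
    obs, act, next_obs, c = batch  # c is a float tensor of 0/1 violation labels
    with torch.no_grad():
        next_act = policy(next_obs)
        # Bellman-style backup: immediate violation, else discounted future risk.
        target = c + (1.0 - c) * gamma_risk * target_critic(next_obs, next_act)
    loss = nn.functional.mse_loss(critic(obs, act), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```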

To learn the task and recovery policies, we learn a safety critic, which estimates the probability of constraint violation in the near future under the current policy. When this probability is sufficiently low, the task policy is executed; when it is too high, the recovery policy is executed to protect the task policy from constraint violations. The recovery policy is itself an RL agent, trained to minimize the probability of constraint violation as measured by the learned safety critic.
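In code, this execution rule amounts to a simple filter on the task policy's proposed actions. The sketch below is a hypothetical illustration: the threshold eps_risk and the assumption that the policies and safety critic are plain callables are ours, not the paper's exact interface.

```python
def select_action(obs, task_policy, recovery_policy, safety_critic, eps_risk=0.3):
    """Filter the task policy's proposed action through the safety critic.

    eps_risk is an illustrative risk threshold: if the estimated probability
    of a near-term constraint violation exceeds it, the recovery policy acts
    instead of the task policy.
    """
    proposed = task_policy(obs)            # action the task policy wants to take
    risk = safety_critic(obs, proposed)    # estimated probability of violation
    if risk > eps_risk:
        return recovery_policy(obs)        # steer the agent back toward safety
    return proposed                        # safe enough: execute the task action
```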

Simulation Experiments:

We evaluate Recovery RL on a set of 2D navigation tasks, two contact-rich manipulation environments, and a visual navigation task, as illustrated below.

Since Recovery RL and prior methods trade off between safety and task progress, we report the ratio of the cumulative number of task successes to the cumulative number of constraint violations at each episode (higher is better). We tune all algorithms to maximize this ratio, and task success is determined by defining a goal set in the state space for each environment. To avoid division by zero, we add 1 to both the cumulative task successes and the cumulative constraint violations when computing this ratio. This metric provides a single scalar value quantifying how efficiently different algorithms balance task completion and constraint satisfaction.
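For concreteness, the following is a small sketch (with an illustrative function name and list-based interface) of how this ratio can be computed from per-episode success and violation indicators:

```python
def success_violation_ratio(successes, violations):
    """Cumulative-success-to-cumulative-violation ratio per episode (higher is better).

    `successes` and `violations` are per-episode 0/1 indicators; 1 is added to
    both cumulative totals to avoid division by zero, as described above.
    """
    ratios, cum_s, cum_v = [], 0, 0
    for s, v in zip(successes, violations):
        cum_s += s
        cum_v += v
        ratios.append((cum_s + 1) / (cum_v + 1))
    return ratios
```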

We compare Recovery RL with 6 prior algorithms: Unconstrained, which ignores constraints and optimizes only the unconstrained objective; LR, RSPO, and SQRL, which integrate constraints into the policy optimization objective using variations of the Lagrangian relaxation technique; and RP and RCPO, which leverage ideas from reward shaping to modify the task reward function to incorporate constraints. We compare these methods against Recovery RL with both model-free and model-based recovery policies, as shown below, and find that Recovery RL significantly outperforms prior methods in its ability to trade off task performance and constraint satisfaction.

Recovery RL is able to successfully prevent constraint violations on a variety of tasks in simulation, including a challenging object extraction task (left) and a Maze navigation task (right). The Maze navigation task is particularly illustrative, as it shows how the recovery mechanism interacts with the task policy as learning progresses. Here, the agent is randomly initialized in the left column and is provided a reward function which measures its negative L2 distance to the middle of the right column. Early in training, the task policy does not yet know about obstacles in the environment, and thus repeatedly attempts the most direct path to the goal, which would drive it into the walls. However, the recovery policy prevents this, leading the task policy to bounce back and forth (early training). Eventually, the task policy learns to reach the goal but still occasionally requires the recovery mechanism to prevent collisions (mid training). Finally, once it has learned enough about the structure of the maze, the task policy efficiently navigates to the goal with minimal activations of the recovery policy (late training). In the object extraction task, we see that Recovery RL learns to carefully nudge the red block out of the way before grasping it so it can avoid toppling the yellow blocks.

Early Training

Mid Training

Late Training

Physical Experiments:

We evaluate Recovery RL on 2 image-based tasks on the da Vinci Research Kit (dVRK). The dVRK is cable-driven and has relatively imprecise controls, motivating closed-loop control strategies to compensate for these errors. We first evaluate Recovery RL on an image-based obstacle avoidance task illustrated below, where contacts are detected via changes in the motor currents of the dVRK. We observe similar results: even in a contact-rich, image-based manipulation task, Recovery RL is able to effectively learn safe policies and trades off task successes and constraint violations more effectively than baseline algorithms. Notice that, as in the Maze navigation task above, early in training Recovery RL is prevented from colliding with obstacles by the recovery policy (early training), and eventually learns to navigate around the obstacles (late training).

Early Training

Late Training

We also evaluate Recovery RL on an image-based precision reaching task, where the robot must guide its end effector to a goal position without entering an invisible stay-out zone in the workspace. Again, Recovery RL learns to perform this task much faster than baselines while incurring fewer constraint violations.