Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization

Abstract

Safe Reinforcement Learning (RL) focuses on the problem of training a policy to maximize the reward while ensuring safety. It is an important step towards applying RL to safety-critical real-world applications. However, safe RL is challenging due to the trade-off between the two objectives of maximizing the reward and satisfying the safety constraints. In this work, we propose to learn a policy in a modified MDP in which the safety constraints are embedded into the action space. In this "safety-embedded MDP," the output of the RL agent is transformed into a sequence of actions by a trajectory optimizer that is guaranteed to be safe, under the assumption that the robot is currently in a safe and quasi-static configuration. We evaluate our method on the Safety Gym benchmark and show that it achieves significantly higher rewards and fewer safety violations during training than previous work; furthermore, it incurs no safety violations during inference. We also evaluate our method on a real-robot box-pushing task and demonstrate that it can be safely deployed in the real world.
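To make the action-space embedding concrete, the following is a minimal Python sketch of the general pattern: the agent's output is interpreted as a subgoal, and a simple trajectory optimizer converts it into an action sequence that stays outside hazard regions. All names here (plan_safe_trajectory, SafeActionLayer), the 2-D point-robot dynamics, and the circular-hazard representation are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a "safety-embedded" action interface, assuming a 2-D point robot
# and circular hazards. Names and dynamics are illustrative assumptions only.

import numpy as np


def plan_safe_trajectory(pos, subgoal, hazards, n_steps=5, step_size=0.2):
    """Greedy trajectory optimizer: step toward the subgoal and project each
    waypoint out of any hazard disc so the planned path never enters one."""
    waypoints = []
    pos = np.asarray(pos, dtype=float)
    subgoal = np.asarray(subgoal, dtype=float)
    for _ in range(n_steps):
        pos = pos + step_size * (subgoal - pos)        # move toward the subgoal
        for center, radius in hazards:                 # enforce hazard constraints
            center = np.asarray(center, dtype=float)
            offset = pos - center
            dist = np.linalg.norm(offset)
            if dist < radius:                          # inside a hazard disc:
                pos = center + offset / max(dist, 1e-8) * radius  # project out
        waypoints.append(pos.copy())
    return waypoints


class SafeActionLayer:
    """Turns the RL agent's output (a subgoal) into a sequence of low-level
    actions, i.e. the action-space embedding described in the abstract."""

    def __init__(self, hazards):
        self.hazards = hazards

    def __call__(self, robot_pos, agent_output):
        waypoints = plan_safe_trajectory(robot_pos, agent_output, self.hazards)
        # Convert waypoints into position-delta actions for a low-level controller.
        deltas, prev = [], np.asarray(robot_pos, dtype=float)
        for w in waypoints:
            deltas.append(w - prev)
            prev = w
        return deltas


# Example: the agent proposes a subgoal on the far side of a hazard; the
# planned waypoints stay outside the hazard disc rather than passing through it.
layer = SafeActionLayer(hazards=[((1.0, 0.1), 0.4)])
actions = layer(robot_pos=(0.0, 0.0), agent_output=(2.0, 0.0))
```

In this sketch the RL policy never commands the robot directly; it only picks subgoals, and the optimizer is responsible for keeping every executed step outside the hazards, which mirrors the division of labor described above.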

Videos of real-robot experiments

In this experiment, the robot needs to push the black box toward the goal (large green circle) while avoiding the hazards (small red circles). The blue cylinder is a pillar that the robot can interact with. We compare our method with the TRPO Lagrangian method in this section.

Our method

TRPO Lagrangian

Videos of simulation experiments

In this experiment, the robot needs to push the yellow object toward the goal (green circle) while avoiding the hazards (blue circles). The blue cylinder is a pillar that the robot can interact with. We compare our method with the TRPO Lagrangian method in this section.

Our method: PointPush1

TRPO Lagrangian: PointPush1

Our method: CarPush1

TRPO Lagrangian: CarPush1