Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Abstract

Reinforcement learning (RL) has achieved promising results on a wide range of robotic control tasks. The safety of learning-based controllers is essential to their effective deployment. Current methods enforce the full safety constraints throughout training, which results in inefficient exploration in the early stage.

In this paper, we propose the Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) algorithm to strike a balance between exploration and constraint satisfaction. In the early stage, our method loosens the practical constraints on unsafe transitions by adding an extra safety budget, with the aid of a new metric we propose. As training progresses, the constraints in our optimization problem gradually become tighter. Theoretical analysis and experiments demonstrate that our method gradually meets the cost limit in the final stage of training. Remarkably, our method achieves a significant performance improvement over the CPO algorithm under the same cost limit.


Motivation 

Intuitive example showing the impact of efficient exploration in the early stage. The red region represents the obstacles. When the robot is overly concerned with the safety constraints at the initial stage, it may only find a sub-optimal trajectory for the task. In contrast, if the robot is allowed to partially ignore the constraints on unsafe states at the beginning, it can find a direct path to finish the task. Afterward, it gradually meets the requirement of avoiding collisions, so that the optimal trajectory is finally obtained.

Method 

The ESB-CPO algorithm constructs a constrained optimization problem based on the trust region method. Unlike the CPO algorithm, we propose a new metric, Lyapunov-based Advantage Estimation (LAE), which consists of a stability value and a safety value. The safety value part magnifies the gap between safe and unsafe transitions. By treating the safety value part as an extra safety budget, our method loosens the constraints on unsafe transitions in the early stage. Meanwhile, our method gradually recovers the safety constraints, because its theoretical bound is very close to the bound of the CPO algorithm.
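To make the idea of an extra safety budget concrete, the following minimal Python sketch shows one way the cost-constraint threshold of a trust-region update could be relaxed early in training and annealed back to the original cost limit. The function name, the per-transition `safety_values` and `unsafe_mask` arrays, and the linear annealing schedule are illustrative assumptions, not the exact formulation of LAE or ESB-CPO from the paper.

```python
import numpy as np

def relaxed_cost_threshold(cost_limit, safety_values, unsafe_mask,
                           epoch, total_epochs):
    """Hedged sketch of an extra-safety-budget constraint threshold.

    Assumptions (not from the paper): `safety_values` holds the safety
    value part of LAE per transition, `unsafe_mask` flags unsafe
    transitions, and the budget is annealed linearly over epochs.
    """
    # Extra safety budget: the safety value part accumulated over
    # transitions flagged as unsafe. Early in training this loosens the
    # practical constraint on those transitions.
    extra_budget = float(np.sum(safety_values * unsafe_mask))

    # Anneal the extra budget to zero so the constraint tightens and the
    # original cost limit is recovered by the end of training.
    anneal = max(0.0, 1.0 - epoch / total_epochs)

    # Threshold used in the trust-region cost constraint:
    # expected episode cost <= cost_limit + annealed extra budget.
    return cost_limit + anneal * extra_budget

# Example usage with dummy data: early epochs yield a larger threshold,
# late epochs return the plain cost limit.
sv = np.array([0.2, 0.0, 0.5])
mask = np.array([1.0, 0.0, 1.0])
print(relaxed_cost_threshold(25.0, sv, mask, epoch=10, total_epochs=100))
print(relaxed_cost_threshold(25.0, sv, mask, epoch=100, total_epochs=100))
```

The sketch only illustrates how relaxing and then tightening the threshold reproduces the qualitative behavior described above; in the actual algorithm the budget comes from the LAE safety value rather than a hand-chosen schedule.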

Experiments 

Environments

Doggo-Goal

Point-Push

Ball-Reach

Drone-Circle

Comparisons

Ablation studies