Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies

Anonymous Authors

Motivation

TL;DR: We propose a new algorithm that learns constraint-satisfying policies under the supervision of a sub-optimal baseline policy, and we provide theoretical analysis and empirical demonstrations in the context of reinforcement learning with constraints.

Keywords: Safe reinforcement learning, Reinforcement learning with constraints, Imitation learning with constraints

(Code is provided at the bottom of the page.)

Consider a situation where you are learning how to drive under the supervision of two driving instructors. One instructor drives aggressively, since he or she used to be a race car driver. The other instructor drives conservatively, since he or she used to be a school bus driver. Learning from the aggressive instructor may lead to unsafe driving maneuvers. On the other hand, learning from the conservative instructor results in an inefficient driving policy. The natural question is: how can we learn from them efficiently without compromising the constraints (e.g., safety)?

Figure 1: The learning scenario considered in the paper. Race car drivers drive aggressively and have a higher risk of being unsafe, whereas bus drivers drive conservatively and have a lower risk of being unsafe. How can we safely learn from them based on our own cost constraint (i.e., safety)?

In this paper, we consider the problem of reinforcement learning when we are provided with a baseline control policy and a set of constraints that the controlled system must satisfy. The baseline policy might arise from a heuristic, a prior application, a teacher, or demonstration data. The constraints might encode safety, fairness, or other application-specific requirements. We want to use reinforcement learning to efficiently adapt the baseline policy so that it improves performance and satisfies the given constraints when applied to the new system. The key challenge is to effectively use the baseline policy (which need not satisfy the current constraints) to aid the learning of a constraint-satisfying policy in the new application.

We propose an iterative algorithm for solving this problem. Each iteration is composed of three steps. The first step performs a policy update to increase the expected reward, the second step performs a projection to minimize the distance between the current policy and the baseline policy, and the last step performs a projection onto the set of policies that satisfy the constraints. This procedure allows the learning process to leverage the baseline policy to achieve faster learning while improving reward performance and satisfying the constraints imposed on the current problem. We analyze the convergence of the proposed algorithm and provide a finite-sample guarantee. Empirical results demonstrate that the algorithm achieves superior performance, with 10 times fewer constraint violations and around 40% higher reward than state-of-the-art methods.
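For concreteness, the target problem can be written as the following constrained policy optimization, where J_R and J_C denote the expected reward and cumulative cost (the same quantities used later in the h_D update rule), and h_C is our symbol for the cost threshold; the baseline policy π_B is only a supervision signal and need not satisfy the constraint itself:

```latex
% Constrained policy optimization problem that SPACE targets (notation ours).
% J_R: expected reward, J_C: expected cumulative cost, h_C: cost threshold.
% The baseline policy \pi_B guides learning but may violate J_C(\pi_B) \le h_C.
\max_{\pi}\; J_R(\pi)
\quad \text{s.t.} \quad J_C(\pi) \le h_C .
```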

SPACE Update

SPACE is composed of three steps (Fig. 2); a minimal code sketch is given after the figure:

Step 1 (green) improves the reward in the trust region.

Step 2 (blue) projects the policy onto a region around π_B, whose size is controlled by h_D^k.

Step 3 (red) projects the policy onto the cost constraint set.

The next question is how to determine the region around π_B, i.e., the allowed distance between the learned and baseline policies.

Figure 2: Update procedure for SPACE.
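The sketch below illustrates the structure of one SPACE iteration in parameter space. It is an illustration only, not the authors' implementation: the gradient-ascent reward step, the 2-norm (Euclidean) projections, and the linear cost surrogate are toy stand-ins, and all names (project_ball, project_halfspace, reward_grad, a_cost, h_C) are ours.

```python
# A minimal, runnable sketch of the three-step SPACE update in parameter space.
# Quadratic/linear surrogates and 2-norm projections are toy stand-ins chosen
# for illustration; they mirror the update structure, not the paper's method.
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto the ball of given radius around center."""
    diff = theta - center
    dist = np.linalg.norm(diff)
    return theta if dist <= radius else center + diff * (radius / dist)

def project_halfspace(theta, a, b):
    """Euclidean projection onto the set {theta : a . theta <= b}."""
    violation = a @ theta - b
    return theta if violation <= 0 else theta - violation * a / (a @ a)

def space_iteration(theta_k, theta_B, h_D_k, reward_grad, a_cost, h_C,
                    trust_radius=0.1):
    # Step 1 (green): improve the reward inside the trust region around theta_k.
    theta_half = project_ball(theta_k + trust_radius * reward_grad(theta_k),
                              theta_k, radius=trust_radius)
    # Step 2 (blue): project onto the region around the baseline policy pi_B,
    # whose size is controlled by the current threshold h_D^k.
    theta_proj = project_ball(theta_half, theta_B, radius=h_D_k)
    # Step 3 (red): project onto the cost constraint set J_C(pi) <= h_C
    # (here approximated by a linear surrogate a_cost . theta <= h_C).
    return project_halfspace(theta_proj, a_cost, h_C)
```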

Updating h_D

We select h_D^0 to be small and gradually increase h_D^k at each iteration to expand the region around π_B. Specifically, we make h_D^{k+1} > h_D^k if either of the following holds (a code sketch of this rule follows Fig. 3):

(a) J_C(π^k) > J_C(π^{k−1}): this increase ensures a nonempty intersection between the region around π_B and the cost constraint set (feasibility). See Fig. 3(a).

(b) J_R(π^k) < J_R(π^{k−1}): this increase gives the next policy more freedom to improve the reward and the cost constraint performance (exploration). See Fig. 3(b).

Figure 3(a): Illustration of the case where π_B is outside the cost constraint set. The highest reward is achieved at the yellow star.

Figure 3(b): Illustration of the case where π_B is inside the cost constraint set. The highest reward is achieved at the yellow star.
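A minimal sketch of this update rule is given below. The multiplicative growth factor and the use of per-iteration estimates of J_R and J_C are assumptions made for illustration, not the authors' exact schedule.

```python
# Sketch of the h_D update rule (illustrative only; the growth factor is an
# assumed hyperparameter, not taken from the paper).
def update_h_D(h_D_k, J_R_k, J_R_prev, J_C_k, J_C_prev, grow=1.2):
    """Return h_D^{k+1}: expand the region around pi_B when (a) or (b) holds."""
    # (a) Cost increased: enlarge the region so it still intersects the
    #     cost constraint set (feasibility).
    if J_C_k > J_C_prev:
        return grow * h_D_k
    # (b) Reward decreased: give the next policy more freedom to improve
    #     reward and cost performance (exploration).
    if J_R_k < J_R_prev:
        return grow * h_D_k
    # Otherwise keep the current region size.
    return h_D_k
```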

Finite Sample Guarantees

We further analyze SPACE in terms of convergence rate. We show that, with either the KL divergence projection or the 2-norm projection, SPACE converges to a first-order stationary point in quadratic time. The convergence rate depends on the spectrum of the Fisher information matrix of the policy, which allows us to characterize the difference between the two projections.
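To make the comparison concrete, the two distance measures can be written schematically as follows (notation is ours): the 2-norm projection acts directly in parameter space, while the KL divergence, expanded to second order around the current policy, is governed by the Fisher information matrix, which is why its spectrum enters the convergence rate.

```latex
% Schematic distance measures for the two projections (notation ours);
% F(\theta^k) is the Fisher information matrix of the policy at \theta^k.
D_{2}(\theta, \theta^{k}) = \tfrac{1}{2}\,\lVert \theta - \theta^{k} \rVert_{2}^{2},
\qquad
D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\Vert\,\pi_{\theta^{k}}\right)
\approx \tfrac{1}{2}\,(\theta - \theta^{k})^{\top} F(\theta^{k})\,(\theta - \theta^{k}).
```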

Experiment Results

We provide a detailed examination of SPACE compared with baseline methods. The experiments include:

  • evaluation of SPACE on real human demonstration data (illustrated in Figure 4),

  • evaluation of the reward versus the cumulative constraint cost to demonstrate that SPACE achieves higher reward with fewer cost constraint violations,

  • comparison of a safe baseline policy and an aggressive baseline policy to demonstrate that SPACE safely learns from baseline policies under different cost constraints,

  • an ablation using a fixed h_D^k in SPACE to demonstrate that the proposed update scheme is necessary,

  • comparison of SPACE with other annealing approaches to demonstrate that SPACE exploits the baseline policy effectively,

  • comparison of SPACE with the KL divergence and the 2-norm projections to characterize the behavior of these two projections, and

  • evaluation of different initial values h_D^0 to demonstrate that the choice of initial value does not drastically affect the performance of SPACE.

Figure 4: We test the proposed algorithm on a car-racing task, where the baseline policy is obtained from human demonstration data collected by asking a human to play the game.