Gradient Shaping for Multi-Constraint Safe Reinforcement Learning

L4DC 2024

Carnegie Mellon University, Google DeepMind

Abstract

Online safe reinforcement learning (RL) involves training a policy that maximizes task efficiency while satisfying constraints by interacting with the environment. In this paper, we focus on the challenges of solving multi-constraint (MC) safe RL problems. We approach safe RL from the perspective of Multi-Objective Optimization (MOO) and propose a unified framework for MC safe RL algorithms that highlights how the gradients derived from the constraints are manipulated. Leveraging insights from this framework and recognizing the significance of redundant and conflicting constraints, we introduce the Gradient Shaping (GradS) method for general Lagrangian-based safe RL algorithms to improve training efficiency in terms of both reward and constraint satisfaction. Our extensive experiments demonstrate that the proposed method encourages exploration and learns policies that improve both safety and reward performance across various challenging MC safe RL tasks, while scaling well with the number of constraints.


Video

MCSafeRL.mp4

Outline:

[Intro] 0:00

[Method] 0:44

[Experiment] 3:57

[Conclusion] 5:50

Method Overview

The pipeline for online safe RL


The pipeline for online safe RL is shown in the figure above. The RL agent interacts with the world to collect data and stores it in the replay buffer. During agent learning, safe RL algorithms perform the critic update, the Lagrangian multiplier update, and the policy update in sequence. Our proposed framework and method lie in the policy update module, which consists of loss calculation, gradient shaping, and the actor update. The gradient shaping step is the key component of this work.
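As a concrete illustration, here is a minimal sketch of this loop with dummy data, assuming a simple dual-ascent multiplier update; all names (e.g., `collect_rollout`, `shape_gradients`) and hyperparameters are illustrative placeholders rather than the released implementation.

```python
import numpy as np

# Structural sketch of the pipeline in the figure: collect data, update
# critics, update the Lagrange multipliers, shape the constraint gradients,
# and update the actor. All data and updates here are dummies.

rng = np.random.default_rng(0)
num_constraints = 3
cost_limits = np.full(num_constraints, 25.0)   # assumed episodic cost limits
lambdas = np.zeros(num_constraints)            # Lagrange multipliers
lambda_lr, policy_lr = 0.05, 1e-3
theta = rng.normal(size=8)                     # stand-in for policy parameters

def collect_rollout():
    """Placeholder for environment interaction: returns episodic cost returns,
    the reward gradient, and per-constraint cost gradients (random here)."""
    cost_returns = rng.uniform(0.0, 50.0, size=num_constraints)
    reward_grad = rng.normal(size=theta.shape)
    cost_grads = rng.normal(size=(num_constraints,) + theta.shape)
    return cost_returns, reward_grad, cost_grads

def shape_gradients(cost_grads, lambdas):
    """Gradient-shaping hook: the baselines and GradS differ only here.
    Vanilla variant shown: sum all multiplier-weighted constraint gradients."""
    return np.einsum("i,ij->j", lambdas, cost_grads)

for iteration in range(100):
    cost_returns, reward_grad, cost_grads = collect_rollout()

    # (Critic updates would happen here; omitted in this sketch.)

    # Lagrange multiplier update: dual ascent on the constraint violations.
    lambdas = np.maximum(0.0, lambdas + lambda_lr * (cost_returns - cost_limits))

    # Policy update: ascend the reward gradient minus the shaped constraint gradient.
    theta += policy_lr * (reward_grad - shape_gradients(cost_grads, lambdas))
```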


The proposed unified framework for MC Safe RL


We introduce a unified framework for Lagrangian-based MC safe RL algorithms from the perspective of Multi-Objective Optimization (MOO); the detailed formulation is given in the paper. As shown in the figure above, the major difference among Lagrangian-based MC safe RL methods is the strategy for handling the gradients induced by the constraints (see the sketch after the list below):

(1) The vanilla method considers all the constraint gradients; 

(2) The CRPO method randomly selects one constraint for policy update;

(3) The Min-Max method selects the gradient of the most-violated constraint.
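The following is a minimal sketch contrasting the three strategies, given per-constraint gradients, multipliers, and violations; the multiplier weighting and the zero-violation handling are our simplifying assumptions to fit the Lagrangian template, not the original algorithms' exact code.

```python
import numpy as np

def vanilla(cost_grads, lambdas, violations):
    """Keep every constraint gradient, weighted by its Lagrange multiplier."""
    return np.einsum("i,ij->j", lambdas, cost_grads)

def crpo(cost_grads, lambdas, violations, rng):
    """Randomly select a single violated constraint for this policy update."""
    violated = np.flatnonzero(violations > 0)
    if violated.size == 0:
        return np.zeros(cost_grads.shape[1])   # no violation: reward-only update
    i = rng.choice(violated)
    return lambdas[i] * cost_grads[i]

def min_max(cost_grads, lambdas, violations):
    """Use only the gradient of the most-violated constraint."""
    i = int(np.argmax(violations))
    return lambdas[i] * cost_grads[i]

# Example usage: each function returns the aggregated constraint gradient
# that is subtracted from the reward gradient during the policy update.
rng = np.random.default_rng(0)
g = rng.normal(size=(3, 8))        # three constraint gradients
lam = np.array([0.2, 0.5, 0.1])    # Lagrange multipliers
viol = np.array([-1.0, 3.0, 0.5])  # violations J_ci - d_i
print(vanilla(g, lam, viol).shape,
      crpo(g, lam, viol, rng).shape,
      min_max(g, lam, viol).shape)
```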

Constraint types in MC Safe RL. Left: a toy example in autonomous driving illustrating the constraint types. Right: illustration of the redundant, conflicting, and independent areas.

(1) Redundant constraints: the constraint gradients push the policy update in similar directions. For example, the lane-keeping constraint and the collision constraint are redundant.

(2) Conflicting constraints: the constraint gradients push the policy update in conflicting directions. For example, the lane-keeping constraint and the low-speed constraint are conflicting.

(3) Independent constraints: the constraint gradients push the policy update in independent directions.


The visualization of constraint gradients is shown in the right figure. For the selected gradient (marked in black), the redundant and conflicting areas are shown as blue and red cones, respectively. As described in the next section, the idea of the proposed GradS is to eliminate redundant and conflicting gradients and keep only gradients in the "independent" area.
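One simple way to make these categories concrete is the cosine similarity between two constraint gradients; the threshold below is an illustrative assumption, not the paper's exact definition of the cones.

```python
import numpy as np

def classify_pair(g1, g2, eps=0.5):
    """Classify two constraint gradients by the cosine of their angle.
    The threshold eps is illustrative; the paper formalizes the regions
    shown as the blue (redundant) and red (conflicting) cones."""
    cos = float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12))
    if cos > eps:
        return "redundant"     # nearly the same update direction
    if cos < -eps:
        return "conflicting"   # nearly opposite update directions
    return "independent"       # close to orthogonal

print(classify_pair(np.array([1.0, 0.0]), np.array([0.9, 0.1])))   # redundant
print(classify_pair(np.array([1.0, 0.0]), np.array([-1.0, 0.2])))  # conflicting
print(classify_pair(np.array([1.0, 0.0]), np.array([0.1, 1.0])))   # independent
```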

Proposed GradS method and comparison with baseline methods


The objective of our approach is to address the challenges posed by redundant and conflicting constraints: eliminating the over-conservativeness caused by redundant constraints and escaping the local optima caused by conflicting constraints.

As shown in the figure above, the GradS steps are as follows: (1) Shuffle the constraint gradients and initialize the candidate gradient set with the first gradient. (2) Select gradients sequentially; if a newly chosen gradient is neither redundant nor conflicting with any gradient already in the candidate set, add it to the set. (3) After the selection process, sample one gradient from the resulting candidate set as the constraint gradient. The procedure is described in the algorithm below, and the convergence analysis is given in the paper.
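Below is a minimal sketch of this procedure, reusing the cosine-similarity test above; the threshold `eps` and the uniform sampling in step (3) are our assumptions where the paper's exact choices may differ.

```python
import numpy as np

def grads_select(cost_grads, eps=0.5, rng=None):
    """GradS sketch: shuffle the constraint gradients, greedily keep those that
    are neither redundant nor conflicting with any gradient already in the
    candidate set, then sample one candidate uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(cost_grads))
    candidates = [cost_grads[order[0]]]                 # (1) init with the first
    for idx in order[1:]:                               # (2) sequential filtering
        g = cost_grads[idx]
        independent = True
        for h in candidates:
            cos = g @ h / (np.linalg.norm(g) * np.linalg.norm(h) + 1e-12)
            if abs(cos) > eps:                          # redundant or conflicting
                independent = False
                break
        if independent:
            candidates.append(g)
    return candidates[rng.integers(len(candidates))]    # (3) sample one candidate

# Example: two redundant gradients and one that conflicts with both.
grads = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.1]])
print(grads_select(grads, rng=np.random.default_rng(0)))
```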

Task visualization


In the experiments, we consider the trajectory-wise safety problem, which differs from state-wise safety: we require the cumulative cost along a trajectory to be at most a non-zero cost limit. We select four model-free safe RL algorithms (PPO-Lag, TRPO-Lag, SAC-Lag, and DDPG-Lag) as the base safe RL algorithms, and Vanilla, CRPO, and Min-Max as the baseline methods.
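For reference, the trajectory-wise MC safe RL problem and its Lagrangian relaxation can be written as follows (a standard formulation in our notation; discount factors are omitted for brevity, and details may differ from the paper):

```latex
% Trajectory-wise multi-constraint safe RL (m constraints, cost limits d_i > 0):
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{T} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_{c_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{T} c_i(s_t, a_t)\Big] \le d_i,
\quad i = 1, \dots, m.

% Lagrangian relaxation solved by Lagrangian-based safe RL methods:
\min_{\lambda \ge 0} \; \max_{\pi} \;
\mathcal{L}(\pi, \lambda) = J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big( J_{c_i}(\pi) - d_i \big).
```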

The simulation environments are sourced from the public benchmarks Bullet-Safety-Gym and Safety-Gymnasium. We consider two tasks (Circle and Goal) and four robots, as shown in the figure above. To better simulate real-world scenarios, we introduce three representative costs (a code sketch follows the list):

(1) Boundary/collision cost: agents incur a cost if they cross the boundary or collide with obstacles.

(2) High-velocity cost: agents incur a cost if they exceed the upper velocity limit.

(3) Low-velocity cost: agents incur a cost if their speed falls below the minimum threshold.

All costs are binary; a detailed explanation is provided in the appendix. Intuitively, the boundary/collision cost and the high-velocity cost are likely redundant constraints, since high speed can lead to crossing the boundary or colliding. The high-velocity cost and the low-velocity cost are likely conflicting constraints, as activating them tends to pull the policy in conflicting optimization directions.
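As a rough illustration of how such binary per-step costs could be implemented (the thresholds, state layout, and function below are hypothetical, not the benchmarks' exact definitions):

```python
import numpy as np

def step_costs(position, velocity, obstacles,
               boundary=3.5, v_max=1.5, v_min=0.3, collision_radius=0.4):
    """Return the three binary costs described above for one time step.
    All thresholds and the state layout are illustrative assumptions."""
    speed = float(np.linalg.norm(velocity))
    out_of_bounds = float(np.abs(position).max() > boundary)
    collided = float(any(np.linalg.norm(position - o) < collision_radius
                         for o in obstacles))
    boundary_collision_cost = max(out_of_bounds, collided)  # 1 if either event
    high_velocity_cost = float(speed > v_max)
    low_velocity_cost = float(speed < v_min)
    return boundary_collision_cost, high_velocity_cost, low_velocity_cost

# Example: an agent slightly out of bounds and moving too fast.
print(step_costs(np.array([3.6, 0.0]), np.array([2.0, 0.0]),
                 obstacles=[np.array([1.0, 1.0])]))
# -> (1.0, 1.0, 0.0)
```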

The full set of experiments is reported in the paper; here we provide a snapshot of the results. The following figures show the final performance of PPO-based safe RL algorithms on the Car-Circle-v3 task (v3 means all the constraint types mentioned above are considered).

The x-axis in each figure indicates the number of constraints in the task. The first two figures show the reward and the normalized cost, while the remaining three show representative cost returns. The bars represent the mean and the error bars the standard deviation. All results are averaged over 5 random seeds with 10 trajectories per seed.


We can observe that:

(1) MC tasks expose over-conservativeness, exploration difficulty, and constraint-imbalance issues in safe RL: taking the setting with cost dimension 4 as an example, the Vanilla and Min-Max methods suffer from exploration issues induced by conflicting constraints and over-conservativeness induced by redundant constraints, and thus cannot attain a reasonable reward. The CRPO method explores well and achieves a high reward thanks to its stochastic gradient selection, but it cannot ensure safety because of the constraint-imbalance issue. Specifically, if one type of redundant constraint significantly outweighs the others in quantity, CRPO disproportionately activates constraints of that type and potentially overlooks the other constraints.

(2) The proposed GradS method shows outstanding performance on MC tasks: it overcomes the over-conservativeness observed in the Vanilla method by eliminating redundant gradients. Furthermore, eliminating redundant constraints reduces the risk of neglecting minor constraints, a drawback of CRPO. In scenarios with conflicting constraints, GradS achieves high reward and low cost violation compared with the baseline algorithms, as it eliminates conflicting constraint gradients and uses stochastic constraint-gradient sampling to encourage exploration. In addition, GradS scales well with the number of constraints: its performance is almost invariant as the cost dimension grows from 4 to 16.

Conclusion

In this paper, we analyze the MC safe RL problem through the lens of constraint types, identifying two challenging MC safe RL settings: redundant and conflicting constraints. To address these challenges, we propose the constraint gradient shaping (GradS) technique from the standpoint of Multi-Objective Optimization (MOO), ensuring compatibility with general Lagrangian-based safe RL algorithms. Our analysis highlights the necessity of efficient and effective algorithms for handling multiple costs, shedding light on the critical importance of multi-cost constraints in safe RL settings. The extensive experimental results confirm that GradS effectively solves MC safe RL problems in both redundant and conflicting constraint settings and is both safer and more rewarding than the baseline methods. By proposing the GradS technique and providing a comprehensive analysis, we hope to contribute to the advancement of safe RL algorithms and their successful deployment in complex, safety-critical real-world environments.