SafeDreamer: Safe Reinforcement Learning with World Models

Abstract

The deployment of Reinforcement Learning (RL) in real-world applications is constrained by its failure to satisfy safety criteria. Existing Safe Reinforcement Learning (SafeRL) methods, which rely on cost functions to enforce safety, often fail to achieve zero-cost performance in complex scenarios, especially vision-only tasks. These limitations stem primarily from model inaccuracies and poor sample efficiency. Integrating world models has proven effective in mitigating these shortcomings. In this work, we introduce SafeDreamer, a novel algorithm that incorporates Lagrangian-based methods into the world model planning process within the Dreamer framework. Our method achieves nearly zero-cost performance on a variety of tasks, spanning low-dimensional and vision-only inputs, within the Safety-Gymnasium benchmark, demonstrating its efficacy in balancing performance and safety in RL tasks.

Architecture

The architecture of SafeDreamer. (a) illustrates all components of SafeDreamer, which treats costs as safety indicators distinct from rewards and balances them using the Lagrangian method and a safe planner. The OSRP (b) and OSRP-Lag (c) variants perform online safety-reward planning (OSRP) within the world model to generate actions; OSRP-Lag additionally integrates online planning with the Lagrangian approach to balance long-term rewards and costs. The BSRP-Lag variant (d) employs background safety-reward planning (BSRP) via the Lagrangian method within the world model to update a safe actor.
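
As a rough illustration of the online variants, the sketch below shows what Lagrangian-penalized planning inside a learned world model could look like. It uses a generic cross-entropy-method planner and assumes a `world_model.rollout(state, actions)` helper that returns predicted reward and cost returns for each candidate action sequence; these names, and the planner itself, are simplifications rather than SafeDreamer's exact procedure.

```python
import numpy as np

def osrp_lag_plan(world_model, state, lagrange_multiplier,
                  horizon=15, num_candidates=500, num_elites=50,
                  iterations=5, action_dim=2):
    """Hypothetical online safety-reward planning with a Lagrangian penalty."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current proposal distribution.
        noise = np.random.randn(num_candidates, horizon, action_dim)
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Imagined rollouts: predicted reward and cost returns per candidate
        # (assumed helper on the learned world model).
        reward_return, cost_return = world_model.rollout(state, actions)
        # Lagrangian objective: trade predicted reward against predicted cost.
        objective = reward_return - lagrange_multiplier * cost_return
        # Refit the proposal distribution to the elite candidates (CEM-style).
        elites = actions[np.argsort(objective)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    # Execute only the first planned action and replan at the next step (MPC-style).
    return mean[0]
```

In the plain OSRP variant, the predicted cost would presumably enter as a constraint on candidate trajectories rather than through a multiplier; that detail is omitted here.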

Online & Background Safety-Reward Planning
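
The online side is sketched above. For the background side, a safe actor is trained on imagined rollouts rather than by replanning at every step. Below is a minimal sketch of what a Lagrangian-weighted actor objective and the multiplier update could look like, assuming the imagined reward returns, cost returns, and action log-probabilities have already been computed; the tensor names and the simple policy-gradient surrogate are illustrative, not SafeDreamer's exact update.

```python
import torch

def bsrp_lag_actor_loss(reward_returns, cost_returns, log_probs, lagrange_multiplier):
    """Hypothetical Lagrangian actor objective on imagined rollouts."""
    # Penalized return: rewards are traded off against predicted costs.
    penalized = reward_returns - lagrange_multiplier * cost_returns
    # Maximize the penalized return via a simple policy-gradient surrogate
    # (Dreamer-style actors typically mix this with straight-through gradients).
    return -(log_probs * penalized.detach()).mean()

def update_lagrange_multiplier(multiplier, episode_cost, cost_limit, lr=1e-2):
    """Dual ascent: raise the multiplier when costs exceed the limit, else relax it."""
    return max(0.0, multiplier + lr * (episode_cost - cost_limit))
```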


Demo

Demo videos for the Point agent in Safety-Gymnasium. Each task shows four clips: Unsafe DreamerV3, SafeDreamer, the world model's input, and the world model's prediction.

SafetyPointGoal1
SafetyPointGoal2
SafetyPointButton1
SafetyPointButton2
SafetyPointPush1
SafetyPointPush2

Demo videos for the Racecar agent (SafeDreamer). Each task shows three clips: a God-perspective view, the world model's input, and the world model's prediction.

SafetyRacecarGoal1
SafetyRacecarGoal2
SafetyRacecarButton1
SafetyRacecarButton2
SafetyRacecarPush1
SafetyRacecarPush2

MetaDrive

SafeDreamer (BSRP-Lag) vs. Unsafe Baseline

FormulaOne

SafeDreamer (BSRP-Lag) vs. Unsafe Baseline, each shown from the God perspective and the first-person perspective

Car-Racing

SafeDreamer (BSRP-Lag) vs. Unsafe Baseline

Assessment in Unseen Testing Environments

Video Prediction

Using the past 25 frames as context, our world model predicts the next 45 steps in Safety-Gymnasium from the given action sequence alone, without access to intermediate images.
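
A sketch of this open-loop evaluation protocol is shown below, assuming an RSSM-style world model that exposes `observe`, `imagine`, and `decode` methods; the method names are placeholders for illustration, not the actual API.

```python
def open_loop_prediction(world_model, context_obs, context_actions, future_actions):
    """Hypothetical open-loop video prediction: condition on context frames,
    then predict future frames from the action sequence alone."""
    # Filter the latent state with the 25 observed context frames (posterior).
    states = world_model.observe(context_obs, context_actions)  # assumed helper
    state = states[-1]
    predictions = []
    # Roll the latent dynamics forward for 45 steps using only the actions (prior).
    for action in future_actions:
        state = world_model.imagine(state, action)
        predictions.append(world_model.decode(state))  # reconstruct the image
    return predictions
```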

Video predictions for the Point agent tasks. In SafetyPointGoal1, the model leverages observed goals to forecast subsequent ones in future frames. In SafetyPointGoal2, the model predicts the robot's upcoming rightward movement to avoid an obstacle. In SafetyPointButton1, the model predicts the robot heading toward the green goal. In SafetyPointButton2, the model anticipates the robot's trajectory bypassing the yellow sphere on its left. In SafetyPointPush1, the model foresees the robot using its head to push the box. Finally, in SafetyPointPush2, the model predicts the appearance of previously unseen crates in future frames, indicating its ability to model the environment's transition dynamics.

Video predictions for the Racecar agent. In SafetyRacecarGoal1, the world model anticipates the agent adjusting its direction toward a circular obstacle. Similarly, in SafetyRacecarGoal2, the model predicts the Racecar gradually veering away from a vase. In SafetyRacecarButton1, the world model predicts the Racecar's careful navigation to avoid an obstacle on its right. In SafetyRacecarButton2, the model predicts the Racecar gradually approaching a circular obstacle. In the SafetyRacecarPush1 and SafetyRacecarPush2 tasks, the model predicts the appearance of the box and the Racecar's heading toward a box, respectively.

Experiment

Cost limit = 25 in SafetyPointGoal1 (low-dimensional input)

Ablation studies on the weight of the cost model loss in SafetyPointGoal1

We run SafeDreamer (BSRP-Lag) for 1M steps on SafetyPointGoal1. Applying different weights to unsafe interactions in the cost model's loss affects how quickly the cost converges, and a higher weight can aid its reduction. We hypothesize that this is due to the imbalanced distribution of costs in the environment: reweighting mitigates the imbalance and thereby accelerates the convergence of the cost model.
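
As a rough illustration of this reweighting, the sketch below up-weights the rare cost-incurring steps in a binary cost-prediction loss. It assumes the cost head is trained as a per-step classifier of unsafe interactions; the specific loss form and weighting scheme are illustrative, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def weighted_cost_model_loss(cost_logits, cost_labels, unsafe_weight=5.0):
    """Hypothetical reweighted loss for the cost predictor.

    cost_labels: 1.0 for unsafe (cost-incurring) steps, 0.0 otherwise. Because
    unsafe steps are rare, up-weighting them counteracts the class imbalance.
    """
    labels = cost_labels.float()
    # Per-step weights: rare unsafe interactions get a larger weight.
    weights = 1.0 + (unsafe_weight - 1.0) * labels
    per_step = F.binary_cross_entropy_with_logits(cost_logits, labels,
                                                  reduction="none")
    return (weights * per_step).mean()
```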

Ablation studies on the weight of the cost model loss in SafetyPointGoal2

We run SafeDreamer (BSRP-Lag) for 2M steps on SafetyPointGoal2. The results are similar to those on SafetyPointGoal1; we suggest tuning this hyperparameter according to the cost distribution of each environment.

1. Swift Convergence to Nearly Zero-cost

SafeDreamer surpasses model-free algorithms in both reward and cost. Although model-free algorithms can decrease costs over time, they struggle to achieve higher rewards. This stems from their reliance on learning a policy purely through trial and error, without the assistance of a world model, which hampers the discovery of good solutions from limited samples. In Safety-Gymnasium tasks, the agent begins each episode in a safe region, not in contact with any obstacles. A solution obvious to humans is to keep the agent stationary, preserving its position. Yet even with this simple policy, model-free algorithms either require a substantial number of updates to reach zero cost or fail to reach it at all in some tasks.

2. Dual Objective Realization: Balancing Enhanced Reward with Minimized Cost

SafeDreamer uniquely attains minimal costs while achieving higher rewards across the five vision-only safety tasks. In contrast, model-based algorithms such as LAMBDA and Safe SLAC reach a cost plateau below which further reduction is untenable, owing to inaccuracies in their world models. Meanwhile, in environments with denser or more dynamic obstacles, such as SafetyPointGoal2 and SafetyPointButton1, MPC struggles to ensure safety because it lacks a cost critic and operates over a limited online planning horizon. Integrating a world model with critics enables the agent to exploit information about current and historical states to stay safe. From the beginning of training, our algorithms exhibit safe behavior, enabling extensive safe exploration. Specifically, in the SafetyPointGoal1 and SafetyPointPush1 environments, SafeDreamer matches DreamerV3's reward while maintaining nearly zero cost.

3. Mastering Diverse Domains: Dominance in Visual and Low-dimensional Tasks

We also conducted evaluations in two environments with low-dimensional sensor inputs. The reward of MBPPO-Lag stops increasing once its cost begins to decrease, mirroring the behavior observed with PPO-Lag. In contrast, our algorithm continues to optimize reward while achieving a substantial reduction in cost.