SafeDreamer: Safe Reinforcement Learning with World Models
Abstract
The deployment of Reinforcement Learning (RL) in real-world applications is constrained by its failure to satisfy safety criteria. Existing Safe Reinforcement Learning (SafeRL) methods, which rely on cost functions to enforce safety, often fail to achieve zero-cost performance in complex scenarios, especially vision-only tasks. These limitations stem primarily from model inaccuracies and inadequate sample efficiency. Integrating world models has proven effective in mitigating these shortcomings. In this work, we introduce SafeDreamer, a novel algorithm that incorporates Lagrangian-based methods into the world-model planning processes of the Dreamer framework. Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only inputs, within the Safety-Gymnasium benchmark, demonstrating its efficacy in balancing performance and safety in RL tasks.
Architecture
The Architecture of SafeDreamer. (a) shows all components of SafeDreamer, which treats costs as safety indicators distinct from rewards and balances the two using the Lagrangian method and a safe planner. The OSRP (b) and OSRP-Lag (c) variants perform online safety-reward planning (OSRP) within the world model to generate actions; OSRP-Lag additionally combines online planning with the Lagrangian approach to balance long-term rewards and costs. The BSRP-Lag variant (d) performs background safety-reward planning (BSRP) with the Lagrangian method within the world model to update a safe actor.
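To make the planning variants concrete, the following is a minimal, hypothetical sketch of the idea behind OSRP-Lag: sample candidate action sequences, imagine their outcomes inside the world model, and score them by a Lagrangian-weighted combination of predicted reward and cost before executing only the first action. The `world_model.rollout` interface, the CEM-style hyperparameters, and the exact penalty form are illustrative assumptions rather than the paper's precise procedure.

```python
import numpy as np

def osrp_lag_plan(world_model, state, horizon=15, num_samples=500,
                  num_elites=50, iters=5, act_dim=2,
                  lagrange_mult=1.0, cost_budget=2.0, seed=0):
    """CEM-style online safety-reward planning (illustrative sketch).

    `world_model.rollout(state, actions)` is an assumed interface that
    imagines a trajectory and returns per-step rewards and costs as arrays.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current plan distribution.
        candidates = np.clip(
            rng.normal(mean, std, size=(num_samples, horizon, act_dim)), -1.0, 1.0)
        scores = np.empty(num_samples)
        for i, seq in enumerate(candidates):
            rewards, costs = world_model.rollout(state, seq)  # imagined trajectory
            ret, cost_ret = float(np.sum(rewards)), float(np.sum(costs))
            # Lagrangian objective: trade predicted reward against the
            # predicted cost exceeding the per-plan budget.
            scores[i] = ret - lagrange_mult * max(cost_ret - cost_budget, 0.0)
        # Refit the sampling distribution to the elite action sequences.
        elites = candidates[np.argsort(scores)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # MPC-style: execute only the first planned action
```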
Online & Background Safety-Reward Planning
Demo
SafetyPointGoal1 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyPointGoal2 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyPointButton1 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyPointButton2 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyPointPush1 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyPointPush2 - Unsafe DreamerV3, SafeDreamer, Model Input, Model Prediction
SafetyRacecarGoal1 - SafeDreamer: God Perspective, Model Input, Model Prediction
SafetyRacecarGoal2 - SafeDreamer: God Perspective, Model Input, Model Prediction
SafetyRacecarButton1 - SafeDreamer: God Perspective, Model Input, Model Prediction
SafetyRacecarButton2 - SafeDreamer: God Perspective, Model Input, Model Prediction
SafetyRacecarPush1 - SafeDreamer: God Perspective, Model Input, Model Prediction
SafetyRacecarPush2 - SafeDreamer: God Perspective, Model Input, Model Prediction
Assessment in Unseen Testing Environments
MetaDrive - SafeDreamer (BSRP-Lag), Unsafe Baseline
FormulaOne - SafeDreamer (BSRP-Lag) and Unsafe Baseline, each shown from god and first-person perspectives
Car-Racing - SafeDreamer (BSRP-Lag), Unsafe Baseline
Video Prediction
Using the past 25 frames as context, our world models predict the next 45 steps in Safety-Gymnasium from the given action sequence alone, without access to intermediate images.
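A rough illustration of this open-loop rollout, under an assumed world-model interface (`observe`, `imagine`, and `decode` are hypothetical names standing in for the encoder, latent transition, and decoder):

```python
import numpy as np

def open_loop_prediction(world_model, frames, actions, context=25, horizon=45):
    """Predict future frames from actions alone (illustrative sketch).

    The first `context` frames infer the latent state; the next `horizon`
    frames are imagined purely from the action sequence, with no access to
    intermediate images.
    """
    state = world_model.observe(frames[:context], actions[:context])
    predicted_frames = []
    for t in range(context, context + horizon):
        state = world_model.imagine(state, actions[t])      # latent transition only
        predicted_frames.append(world_model.decode(state))  # reconstruct the frame
    return np.stack(predicted_frames)
```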
Video predictions for the Point-agent tasks. In SafetyPointGoal1, the model leverages observed goals to forecast subsequent ones in future frames. In SafetyPointGoal2, the model predicts the robot's upcoming rightward movement to avoid an obstacle. In SafetyPointButton1, the model predicts the robot's direction toward the green goal. In SafetyPointButton2, the model anticipates the robot's trajectory as it bypasses the yellow sphere on its left. In SafetyPointPush1, the model foresees the robot's intention to use its head to move the box. Finally, in SafetyPointPush2, the model discerns the emergence of previously unseen crates in future frames, demonstrating its ability to predict environmental transition dynamics.
Video predictions for the Racecar-agent tasks. In SafetyRacecarGoal1, the world model anticipates the agent adjusting its direction toward a circular obstacle. Similarly, in SafetyRacecarGoal2, the model predicts the Racecar's gradual deviation away from a vase. In SafetyRacecarButton1, the world model predicts the Racecar's careful navigation to avoid an obstacle on its right. In SafetyRacecarButton2, the model predicts the Racecar gradually closing the distance to a circular obstacle. In SafetyRacecarPush1 and SafetyRacecarPush2, the model predicts the emergence of the box and the Racecar's direction toward a box, respectively.
Experiment
Cost limit = 25 in SafetyPointGoal1 (low-dimensional)
Ablation studies on the weight of the cost model loss in SafetyPointGoal1
We run SafeDreamer (BSRP-Lag) for 1M steps on SafetyPointGoal1. We find that assigning different weights to unsafe interactions in the cost model's loss affects how quickly the cost converges, and a higher weight can aid its reduction. We hypothesize that this effect stems from the imbalanced distribution of costs in the environment: appropriate weighting mitigates this imbalance, thereby accelerating the convergence of the cost model.
Ablation studies on the weight of the cost model loss in SafetyPointGoal2
We run SafeDreamer (BSRP-Lag) for 2M steps on SafetyPointGoal2. The results are similar to those on SafetyPointGoal1, but we recommend tuning this hyperparameter according to the cost distribution of each environment.
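One plausible implementation of the weighting examined in these two ablations, assuming the cost model is trained as a binary classifier of unsafe steps, is a weighted binary cross-entropy in which rare unsafe interactions receive a larger weight; the `unsafe_weight` factor below corresponds to the ablated hyperparameter, and the loss actually used by SafeDreamer may differ in form.

```python
import numpy as np

def weighted_cost_loss(pred_prob, cost_label, unsafe_weight=5.0):
    """Weighted binary cross-entropy for a cost predictor (illustrative sketch).

    Unsafe steps (cost_label == 1) are rare, so they are up-weighted by
    `unsafe_weight` to counteract the imbalanced cost distribution.
    """
    eps = 1e-8
    weights = np.where(cost_label > 0, unsafe_weight, 1.0)
    bce = -(cost_label * np.log(pred_prob + eps)
            + (1.0 - cost_label) * np.log(1.0 - pred_prob + eps))
    return float(np.mean(weights * bce))
```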
1. Swift Convergence to Nearly Zero-cost
SafeDreamer surpasses model-free algorithms in both reward and cost. Although model-free algorithms can decrease costs over time, they struggle to achieve higher rewards: they learn a policy purely through trial and error, without the assistance of a world model, which hampers discovering good solutions from limited data. In Safety-Gymnasium tasks, the agent begins each episode in a safe region, away from any obstacles. A feasible solution obvious to humans is to keep the agent stationary, preserving its position. Yet even with this simplistic policy, achieving zero cost with model-free algorithms either demands a substantial number of updates or remains elusive in some tasks.
2. Dual Objective Realization: Balancing Enhanced Reward with Minimized Cost
SafeDreamer uniquely attains minimal costs while achieving higher rewards across the five vision-only safety tasks. In contrast, model-based algorithms such as LAMBDA and Safe SLAC plateau at a cost level below which further reductions are untenable, owing to inaccuracies in their world models. Meanwhile, in environments with denser or more dynamic obstacles, such as SafetyPointGoal2 and SafetyPointButton1, MPC struggles to ensure safety because it lacks a cost critic and plans over only a limited online horizon. Integrating a world model with critics enables agents to effectively use information about current and historical states to stay safe. From the beginning of training, our algorithms exhibit safe behavior, allowing extensive safe exploration. In particular, in the SafetyPointGoal1 and SafetyPointPush1 environments, SafeDreamer matches DreamerV3's reward while maintaining nearly zero cost.
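The Lagrangian balancing referred to above can be pictured as a simple multiplier update: the multiplier grows while recent episode costs exceed the budget and is clipped at zero otherwise, shifting the planner or actor objective between reward and cost. The sketch below uses plain gradient ascent with an assumed learning rate; SafeDreamer's actual multiplier update may differ.

```python
import numpy as np

def update_lagrange_multiplier(lagrange_mult, episode_cost,
                               cost_limit=25.0, lr=0.01):
    """Gradient-ascent update of the Lagrange multiplier (illustrative sketch).

    With cost_limit=25 (the limit used in the low-dimensional experiment),
    an episode cost above 25 pushes the multiplier up, increasing the weight
    of the cost term in subsequent planning or actor updates.
    """
    lagrange_mult = lagrange_mult + lr * (episode_cost - cost_limit)
    return float(np.clip(lagrange_mult, 0.0, None))
```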
3. Mastering Diverse Domains: Dominance in Visual and Low-dimensional Tasks
We also conducted evaluations in two low-dimensional sensor-input environments. The reward of MBPPO-Lag stops increasing once its cost begins to decrease, a phenomenon also observed with PPO-Lag. In contrast, our algorithm optimizes reward while simultaneously achieving a substantial reduction in cost.