Safe DreamerV3: Safe Reinforcement Learning with World Models
Abstract
The widespread application of Reinforcement Learning (RL) in real-world settings has yet to materialize, largely because RL fails to satisfy the essential safety requirements of such systems. Existing safe reinforcement learning (SafeRL) methods, which employ cost functions to enhance safety, fail to achieve zero cost in complex scenarios, including vision-only tasks, even with extensive data sampling and training. To address this, we introduce Safe DreamerV3, a novel algorithm that integrates both Lagrangian-based and planning-based methods within a world model. Our method represents a significant advancement in SafeRL as the first algorithm to achieve nearly zero cost in both low-dimensional and vision-only tasks within the Safety-Gymnasium benchmark.
Architecture
Comparing Safe DreamerV3 with the Autonomous Intelligence architecture [1]. (a) Autonomous Intelligence uses a scalar cost reflecting the agent's level of discomfort, whereas Safe DreamerV3 (b) treats costs as safety indicators distinct from the reward and balances the two with a Lagrangian method and a safe planner.
The PLAN and PLAN-L variants (c) of Safe DreamerV3 generate actions with safe planners, while the LAG variant (d) generates actions with a safe actor.
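The sketch below illustrates, in simplified form, the two mechanisms named above: a Lagrangian term that trades reward against cost, and a planner that filters candidate action sequences by their predicted cost. It is a minimal reading of the idea, not the authors' implementation; all function names, the update rule, and the feasibility-based selection are assumptions for illustration.

```python
import numpy as np

def lagrangian_value(reward_value, cost_value, multiplier):
    """Illustrative scalarization of reward and cost returns: the cost term
    is penalized more heavily as the Lagrange multiplier grows."""
    return (reward_value - multiplier * cost_value) / (1.0 + multiplier)

def update_multiplier(multiplier, episode_cost, cost_budget, lr=1e-2):
    """Illustrative multiplier update: raise the multiplier when observed
    cost exceeds the budget, lower it (down to zero) otherwise."""
    return max(0.0, multiplier + lr * (episode_cost - cost_budget))

def select_safe_plan(action_candidates, reward_returns, cost_returns, cost_limit):
    """Hypothetical safe-planner selection rule: among action sequences
    evaluated inside the world model, prefer the highest-reward candidate
    that satisfies the cost limit; if none is feasible, fall back to the
    lowest-cost candidate."""
    feasible = [i for i, c in enumerate(cost_returns) if c <= cost_limit]
    if feasible:
        best = max(feasible, key=lambda i: reward_returns[i])
    else:
        best = int(np.argmin(cost_returns))
    return action_candidates[best]
```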
Pipeline
Demo
SafetyPointGoal1 - God Perspective
SafetyPointGoal1 - Model Input
SafetyPointGoal1 - Model Prediction
SafetyPointGoal2 - God Perspective
SafetyPointGoal2 - Model Input
SafetyPointGoal2 - Model Prediction
SafetyPointButton1 - God Perspective
SafetyPointButton1 - Model Input
SafetyPointButton1 - Model Prediction
SafetyPointButton2 - God Perspective
SafetyPointButton2 - Model Input
SafetyPointButton2 - Model Prediction
SafetyPointPush1 - God Perspective
SafetyPointPush1 - Model Input
SafetyPointPush1 - Model Prediction
SafetyPointPush2 - God Perspective
SafetyPointPush2 - Model Input
SafetyPointPush2 - Model Prediction
SafetyRacecarGoal1 - God Perspective
SafetyRacecarGoal1 - Model Input
SafetyRacecarGoal1 - Model Prediction
SafetyRacecarGoal2 - God Perspective
SafetyRacecarGoal2 - Model Input
SafetyRacecarGoal2 - Model Prediction
SafetyRacecarButton1 - God Perspective
SafetyRacecarButton1 - Model Input
SafetyRacecarButton1 - Model Prediction
SafetyRacecarButton2 - God Perspective
SafetyRacecarButton2 - Model Input
SafetyRacecarButton2 - Model Prediction
SafetyRacecarPush1 - God Perspective
SafetyRacecarPush1 - Model Input
SafetyRacecarPush1 - Model Prediction
SafetyRacecarPush2 - God Perspective
SafetyRacecarPush2 - Model Input
SafetyRacecarPush2 - Model Prediction
Video Predictions
The world model receives the previous 25 frames as context and predicts the next 45 steps from the given action sequence alone, without access to intermediate images. By leveraging the observed image sequence and the action information, the model uses its learned environment dynamics to infer future states, rewards, and costs.
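A minimal sketch of this open-loop prediction procedure, assuming a hypothetical world-model interface (`initial_state`, `encode_step`, `imagine_step`, `decode` are illustrative names, not the actual API):

```python
def open_loop_prediction(world_model, context_frames, context_actions, future_actions):
    """Condition on observed context frames, then roll the model forward
    using only the action sequence (no intermediate images).

    Hypothetical interface: `encode_step` folds a real frame and action into
    the latent state; `imagine_step` advances the latent state from an action
    alone; `decode` reconstructs a frame and predicts reward and cost.
    """
    state = world_model.initial_state()
    # Context phase (e.g. 25 frames): posterior updates use the real images.
    for frame, action in zip(context_frames, context_actions):
        state = world_model.encode_step(state, frame, action)
    # Prediction phase (e.g. 45 steps): prior rollout driven solely by actions.
    predictions = []
    for action in future_actions:
        state = world_model.imagine_step(state, action)
        predictions.append(world_model.decode(state))  # frame, reward, cost
    return predictions
```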
Experiment
1. Dual Objective Realization: Balancing Enhanced Reward with Minimized Cost
Safe DreamerV3 is the only algorithm that attains minimal costs while still reaping high rewards. It behaves conservatively, ensuring largely safe exploration. Specifically, in the SafetyPointGoal1 and SafetyPointPush1 environments, Safe DreamerV3 matches the reward of unsafe policies while maintaining nearly zero cost.
2. Swift Convergence to Nearly Zero-cost
Safe DreamerV3 surpasses model-free algorithms in both reward and cost. Although model-free algorithms may eventually approach zero cost, they struggle to meaningfully increase reward. This stems from their exclusive reliance on sampled data, without the assistance of a world model, which hampers the discovery of good solutions from limited samples.
3. Mastering Diverse Domains: Dominance in Visual and Low-dimensional Tasks
We conducted comprehensive evaluations on two low-dimensional vector-input environments, namely SafetyPointGoal1 and SafetyRacecarGoal1, and found performance on par with that in the visual tasks.
Code
Coming soon...
References:
[1] Yann LeCun. A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27. Open Review, 62, 2022.