Safe DreamerV3: Safe Reinforcement Learning with World Models

Abstract

The deployment of Reinforcement Learning (RL) in real-world settings remains limited, largely because RL agents fail to satisfy the safety requirements of such systems. Existing safe reinforcement learning (SafeRL) methods, which employ cost functions to enhance safety, fail to achieve zero cost in complex scenarios such as vision-only tasks, even with extensive data sampling and training. To address this, we introduce Safe DreamerV3, a novel algorithm that integrates both Lagrangian-based and planning-based methods within a world model. It is the first algorithm to achieve nearly zero cost in both low-dimensional and vision-only tasks within the Safety-Gymnasium benchmark, representing a significant advancement in SafeRL.

Architecture

Contrasting Safe DreamerV3 with the Autonomous Intelligence Architecture [1]. (a) Autonomous Intelligence uses scalar costs reflecting the agent's discomfort level, whereas Safe DreamerV3 (b) treats costs as safety indicators distinct from reward and balances the two using a Lagrangian method and a safe planner.

The PLAN and PLAN-L variants (c) of Safe DreamerV3 use a safe planner to generate actions, while the LAG variant (d) uses a safe actor.
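To make the distinction concrete, the sketch below illustrates, under an assumed world-model interface, how a safe planner might rank candidate action sequences by imagined reward while rejecting those whose imagined cost exceeds a budget, and how the Lagrangian variant instead folds cost into the objective through a multiplier. The names `world_model.step`, `imagine_returns`, `safe_plan`, and `update_lagrange_multiplier` are illustrative assumptions, not the actual Safe DreamerV3 API.

```python
import numpy as np

def imagine_returns(world_model, state, actions):
    # Roll a candidate action sequence through the (hypothetical) world model
    # and accumulate the imagined reward and cost.
    total_reward, total_cost = 0.0, 0.0
    for action in actions:
        state, reward, cost = world_model.step(state, action)  # imagined transition
        total_reward += reward
        total_cost += cost
    return total_reward, total_cost

def safe_plan(world_model, state, horizon, num_candidates, cost_budget, rng):
    """Safe-planning sketch: sample candidate action sequences, prefer those
    whose imagined cost stays within the budget, and among the feasible ones
    pick the sequence with the highest imagined reward."""
    action_dim = world_model.action_dim
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    rewards, costs = zip(*(imagine_returns(world_model, state, c) for c in candidates))
    rewards, costs = np.array(rewards), np.array(costs)

    feasible = costs <= cost_budget
    if feasible.any():
        # At least one candidate satisfies the constraint: maximize reward among them.
        idx = np.flatnonzero(feasible)[np.argmax(rewards[feasible])]
    else:
        # No feasible candidate: fall back to the sequence with the lowest imagined cost.
        idx = np.argmin(costs)
    return candidates[idx][0]  # execute only the first action (MPC-style)

def update_lagrange_multiplier(lam, episode_cost, cost_budget, lr=0.01):
    # Lagrangian variant (LAG) sketch: the multiplier grows while the episode
    # cost exceeds the budget and shrinks (toward zero) otherwise, shifting the
    # actor's objective from reward toward cost reduction.
    return max(0.0, lam + lr * (episode_cost - cost_budget))
```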

Pipeline


Demo

SafetyPointGoal1 - God Perspective

SafetyPointGoal1 - Model Input

SafetyPointGoal1 - Model Prediction

SafetyPointGoal2 - God Perspective

SafetyPointGoal2 - Model Input

SafetyPointGoal2 - Model Prediction

SafetyPointButton1 - God Perspective

SafetyPointButton1 - Model Input

SafetyPointButton1 - Model Prediction

SafetyPointButton2 - God Perspective

SafetyPointButton2 - Model Input

SafetyPointButton2 - Model Prediction

SafetyPointPush1 - God Perspective

SafetyPointPush1 - Model Input

SafetyPointPush1 - Model Prediction

SafetyPointPush2 - God Perspective

SafetyPointPush2 - Model Input

SafetyPointPush2 - Model Prediction

SafetyRacecarGoal1 - God Perspective

SafetyRacecarGoal1 - Model Input

SafetyRacecarGoal1 - Model Prediction

SafetyRacecarGoal2 - God Perspective

SafetyRacecarGoal2 - Model Input

SafetyRacecarGoal2 - Model Prediction

SafetyRacecarButton1 - God Perspective

SafetyRacecarButton1 - Model Input

SafetyRacecarButton1 - Model Prediction

SafetyRacecarButton2 - God Perspective

SafetyRacecarButton2 - Model Input

SafetyRacecarButton2 - Model Prediction

SafetyRacecarPush1 - God Perspective

SafetyRacecarPush1 - Model Input

SafetyRacecarPush1 - Model Prediction

SafetyRacecarPush2 - God Perspective

SafetyRacecarPush2 - Model Input

SafetyRacecarPush2 - Model Prediction

Video Predictions

The world model receives the previous 25 frames as contextual input and predicts the next 45 steps from the given action sequence, without access to the intermediate images. From the observed image sequence and the actions, the model uses its learned environment dynamics to infer future states, rewards, and costs.
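As a concrete illustration, the sketch below shows how such an open-loop rollout could be structured, assuming a hypothetical world-model interface; `observe`, `imagine_step`, `decode`, `predict_reward`, and `predict_cost` are placeholder names, not the project's actual API.

```python
def open_loop_prediction(world_model, context_frames, context_actions, future_actions):
    """Open-loop evaluation as described above: the model observes the first
    25 frames, then predicts the next 45 steps from actions alone."""
    # Phase 1: absorb the context frames to build up the latent state.
    state = world_model.initial_state()
    for frame, action in zip(context_frames, context_actions):
        state = world_model.observe(state, frame, action)

    # Phase 2: roll forward using only the action sequence; no intermediate
    # images are provided, so prediction errors can compound over the horizon.
    predictions = []
    for action in future_actions:
        state = world_model.imagine_step(state, action)
        predictions.append({
            "frame": world_model.decode(state),          # reconstructed image
            "reward": world_model.predict_reward(state),
            "cost": world_model.predict_cost(state),
        })
    return predictions
```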

Experiment

1. Dual Objective Realization: Balancing Enhanced Reward with Minimized Cost

Safe DreamerV3 is unique in attaining minimal cost while achieving high reward. The algorithm behaves conservatively, keeping exploration largely safe. In particular, in the SafetyPointGoal1 and SafetyPointPush1 environments, Safe DreamerV3 matches the reward of unsafe policies while maintaining nearly zero cost.

2. Swift Convergence to Nearly Zero-cost

Safe DreamerV3 surpasses model-free algorithms in both reward and cost. Although model-free algorithms may eventually approach zero cost, they struggle to substantially increase reward. This stems from their exclusive reliance on sampled data without the assistance of a world model, which hampers the discovery of good policies from limited samples.

3. Mastering Diverse Domains: Dominance in Visual and Low-dimensional Tasks

We conducted comprehensive evaluations on two low-dimensional vector-input environments, namely SafetyPointGoal1 and SafetyRacecarGoal1, and found performance on par with that in the visual tasks.

Code

Coming soon...

References:

[1] Yann LeCun. A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27. OpenReview, 2022.