Yihang Yao*, Zhepeng Cen*, Wenhao Ding, Haohong Lin, Shiqi Liu,
Tingnan Zhang, Wenhao Yu, Ding Zhao
* indicates equal contribution
Carnegie Mellon University, Google DeepMind
Abstract
Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach ensures compliance with safety constraints through effective data utilization and regularization techniques that benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets showcase OASIS's superiority in enabling offline safe RL agents to achieve high-reward behavior while satisfying safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.
Keywords: Data-centric Learning, Diffusion Model, Data Augmentation, Safe Reinforcement Learning.
Challenges for Offline Safe RL
(1) Distribution Shift: In offline RL, the agent may have poor generalizability when facing Out-of-Distribution (OOD) state-action pairs during online evaluation.
(2) Safety-Efficiency Performance Trade-Off: The agent tends to be over-conservative or overly aggressive (unsafe) when it overestimates or underestimates the safety requirements.
(3) Safe Dataset Mismatch (SDM) Problem: a preference mismatch between the behavior policy and the optimal policy. The SDM problem is explained in more detail below.
In the figure on the right, each point represents a single trajectory, with the x-axis representing the cost return of that trajectory and the y-axis representing the reward return. The dashed black line indicates the cost threshold defined by the user, while the red star marks the optimal policy under these conditions. The SDM problem arises when training on the blue dataset, which was collected by a conservative policy that yields low reward and low cost. In this case, the regularization term forces the learned policy to stay close to the conservative policy, leading to suboptimal performance. Similarly, when using the yellow dataset, the learned policy becomes overly aggressive, resulting in high costs. In both scenarios, the SDM problem causes the learned policy to be unsatisfactory.
In this paper, we primarily aim to address the SDM problem.
SDM problem illustration
Method Overview
Intuition of distribution shaping
Intuition: As shown in the figure above, the core idea of our approach is straightforward: first, learn a data generator from the current dataset; then, generate a new dataset that aligns with the desired safety-cost preference; finally, train an offline RL agent on this generated dataset.
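As a rough illustration of this three-step recipe, the Python sketch below simply wires the steps together. The `fit_generator` and `train_offline_rl` callables are placeholders for a learned conditional generator (a diffusion model in OASIS) and any offline safe RL trainer, and the dictionary-based condition format is an assumption made for illustration only.

```python
from typing import Any, Callable, Dict, List

# Illustrative types: a transition is a dict such as
# {"obs": ..., "act": ..., "reward": ..., "cost": ..., "next_obs": ...}
Transition = Dict[str, Any]
Dataset = List[Transition]
Condition = Dict[str, float]  # e.g. {"reward": 0.8, "cost": 0.1}, normalized


def distribution_shaping_pipeline(
    raw_dataset: Dataset,
    fit_generator: Callable[[Dataset], Callable[[Condition, int], Dataset]],
    train_offline_rl: Callable[[Dataset], Any],
    target_reward: float,
    target_cost: float,
) -> Any:
    """Three-step distribution shaping: fit a generator, synthesize data, train an agent."""
    # Step 1: learn a conditional data generator from the raw offline dataset
    # (in OASIS, a conditional diffusion model).
    generator = fit_generator(raw_dataset)

    # Step 2: synthesize a dataset aligned with the desired reward/cost preference.
    shaped_dataset = generator({"reward": target_reward, "cost": target_cost},
                               len(raw_dataset))

    # Step 3: train an offline safe RL agent on the shaped dataset.
    return train_offline_rl(shaped_dataset)
```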
OASIS overview
OASIS overview: The figure above illustrates the data generation and RL training pipeline of OASIS. Our data generator is based on a diffusion model. Given a safety threshold preference, we first sample an initial state from the original dataset and then condition on both the state and the preference to generate a subsequence of data, forming a new dataset aligned with the user’s preference.
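The sampling step can be pictured as a standard conditional reverse-diffusion loop. The sketch below is a generic DDPM-style sampler, not OASIS's released code: the `denoiser` signature, the linear noise schedule, and the inpainting of the sampled initial state are assumptions made for illustration.

```python
import torch


@torch.no_grad()
def sample_conditioned_subsequence(denoiser, init_state, preference,
                                   seq_len, obs_dim, act_dim,
                                   n_steps=50, device="cpu"):
    """Sample a (state, action) subsequence conditioned on an initial state and a
    reward/cost preference via a DDPM-style reverse diffusion loop (schematic)."""
    # Linear beta schedule; the actual schedule used by OASIS may differ.
    betas = torch.linspace(1e-4, 2e-2, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure noise over the whole subsequence.
    x = torch.randn(seq_len, obs_dim + act_dim, device=device)
    for t in reversed(range(n_steps)):
        # The denoiser predicts the noise given the noisy sequence, the timestep,
        # and the conditioning information (initial state and preference).
        eps = denoiser(x, t, init_state, preference)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
        # Keep the sampled initial state fixed (inpainting-style conditioning).
        x[0, :obs_dim] = init_state
    return x
```

Repeating this procedure with initial states drawn from the original dataset yields the new, preference-aligned dataset.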
Theoretical Analysis
In the theoretical analysis, we answer the following two questions:
(Q1) Why do we use the diffusion model for conditional distribution shaping? [Theorem 1]
(Q2) How does conditional distribution shaping benefit offline safe RL training? [Theorem 2]
A brief introduction and visualization are shown in the figure below; please see our paper for more details.
Theorem illustration
Theorem 1 [Distribution shaping error bound]: Using the diffusion model as a data generator, we can ensure that the TV distance between the generated state-action density and the density under the optimal policy is bounded.
Theorem 2 [Constraint violation bound]: When trained on the generated dataset, the RL agent's constraint violation is bounded, which guarantees its safety performance.
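Schematically, and with our own simplified notation rather than the exact statements and constants from the paper, the two results can be read as:

\[
D_{\mathrm{TV}}\!\left(\rho_{\mathrm{gen}}(s,a),\; \rho_{\pi^{*}}(s,a)\right) \;\le\; \varepsilon_{\mathrm{gen}}
\qquad \text{(Theorem 1)}
\]
\[
V_{c}^{\hat{\pi}} \;\le\; \kappa \;+\; \mathcal{O}\!\left(\varepsilon_{\mathrm{gen}}\right)
\qquad \text{(Theorem 2)}
\]

where \(\rho\) denotes a state-action occupancy density, \(\varepsilon_{\mathrm{gen}}\) the generator's estimation error, \(V_{c}^{\hat{\pi}}\) the expected cost return of the policy \(\hat{\pi}\) trained on the generated data, and \(\kappa\) the cost threshold.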
Experiment Results
In the experiments, we mainly answer the following questions:
(Q1) How does OASIS perform compared to data-generation and offline safe RL baselines?
(Q2) How well does OASIS realize the distribution shaping for RL tasks?
(Q3) How does OASIS improve the data efficiency of offline RL training?
To answer these questions, we conduct experiments on the DSRL benchmark. For detailed information on the experimental setup, please refer to the Experiment section and the Appendix of our paper.
(Q1) As shown in the table above, we observe that, first, OASIS consistently satisfies the safety requirements across all tested tasks, and second, it achieves high rewards while satisfying the safety constraints.
(Q2) To answer this question, we generate datasets with varying cost and reward conditions. In these two figures, we show the reward and cost distributions under different conditions; the values represent the mean reward and cost of the generated datasets, and both the conditions and the values are normalized to the same scale. Consider two examples: the red dataset is conditioned on low cost and medium reward, while the yellow dataset is conditioned on medium cost and high reward. We observe a clear alignment between the generated datasets and their target conditions. Additionally, we visualize the density map of the (x, y) positions, which shows that setting different conditions leads to distinctly different dataset distributions.
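As a concrete illustration of the normalization mentioned above, the snippet below maps a raw (reward return, cost return) target onto the [0, 1] scale spanned by the dataset's empirical trajectory returns. This is a minimal min-max sketch; the exact normalization used in OASIS may differ, and the example values in the comment are hypothetical.

```python
import numpy as np


def normalize_condition(target_reward, target_cost, reward_returns, cost_returns):
    """Map a raw (reward-return, cost-return) target onto the [0, 1] range spanned by
    the dataset's empirical trajectory returns (min-max normalization, schematic)."""
    rewards = np.asarray(reward_returns, dtype=float)
    costs = np.asarray(cost_returns, dtype=float)
    norm_r = (target_reward - rewards.min()) / max(rewards.max() - rewards.min(), 1e-8)
    norm_c = (target_cost - costs.min()) / max(costs.max() - costs.min(), 1e-8)
    return float(np.clip(norm_r, 0.0, 1.0)), float(np.clip(norm_c, 0.0, 1.0))


# Example (hypothetical numbers): condition on "low cost, medium reward"
# relative to the dataset's observed return range.
# norm_r, norm_c = normalize_condition(target_reward=300.0, target_cost=5.0,
#                                      reward_returns=[100, 250, 400, 600],
#                                      cost_returns=[0, 10, 40, 80])
```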
(Q3) In the last experiment, we show the effect of dataset quality on the data efficiency of offline RL training. In this figure, we train RL agents using different amounts of data: for OASIS, the agent is trained on a preference-aligned generated dataset, while the other baselines use the raw datasets. We can see that, thanks to the high-quality generated data, OASIS learns a good policy using only 1% of the data.
Bib info