Constrained RL framework: maximizes expected reward while keeping the expected trajectory cost within a given budget.
Recent approaches convert the trajectory-level cost constraint into surrogate, state-level problems, yielding solutions that over- or underestimate the cost constraint at each state.
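For reference, the trajectory-level constrained objective these methods build on can be written in standard constrained-MDP notation (the symbols r, c, γ, and the cost budget d follow common convention rather than the paper's exact notation):

$$
\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, c(s_t, a_t)\right] \le d
$$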
Our approach:
Avoids modifying the trajectory-level cost constraint. Instead, an oracle labels trajectories as "good" or "bad" based on a reward threshold and the overall cost constraint.
Can start from any initial policy or set of trajectories and improve on it.
Outperforms leading benchmark approaches for Constrained RL with respect to expected cost, CVaR cost, and even unknown cost constraints.
At its core, our approach uses a classifier K to help the agent recognize state-action pairs that appear in high-return, constraint-satisfying trajectories ΩG while avoiding those from trajectories that fail either criterion ΩB.
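A minimal sketch of such a state-action classifier, assuming a simple PyTorch MLP over concatenated state-action vectors; the class and function names here are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

class StateActionClassifier(nn.Module):
    """Binary discriminator over (state, action) pairs: ~1 for pairs from good
    trajectories (ΩG), ~0 for pairs from bad trajectories (ΩB)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: higher means "looks like a good pair"
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def classifier_loss(clf, good_states, good_actions, bad_states, bad_actions):
    """Binary cross-entropy: good pairs labeled 1, bad pairs labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    g = clf(good_states, good_actions)
    b = clf(bad_states, bad_actions)
    return bce(g, torch.ones_like(g)) + bce(b, torch.zeros_like(b))
```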
During training, the agent interacts with the environment to fill a replay buffer T. An oracle then sorts selected trajectories into a good buffer and a bad buffer. From these buffers, a classifier is trained to provide feedback that helps the agent distinguish whether a state-action pair is favorable or unfavorable.
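The oracle's labeling rule can be sketched as follows, assuming a trajectory record that stores its state-action pairs, total reward, and total cost; the data-structure and function names are illustrative, not the authors' code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    pairs: List[Tuple]      # (state, action) pairs collected along the trajectory
    total_reward: float
    total_cost: float

def oracle_split(replay: List[Trajectory], reward_threshold: float, cost_limit: float):
    """Label a trajectory 'good' if its return clears the reward threshold AND its
    total cost satisfies the overall constraint; everything else goes to the bad buffer."""
    good, bad = [], []
    for traj in replay:
        if traj.total_reward >= reward_threshold and traj.total_cost <= cost_limit:
            good.append(traj)   # contributes state-action pairs to ΩG
        else:
            bad.append(traj)    # contributes state-action pairs to ΩB
    return good, bad
```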
Good trajectories are essential for our algorithm to reach strong performance, but in tasks with strict constraints they are hard to obtain. To address this, we first train the agent in a relaxed-constraint setting with a higher cost threshold. This allows the agent to take riskier actions that can yield higher returns, letting us collect a small but valuable set of good trajectories and accelerating training.
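A tiny sketch of this warm-up idea; the relaxation factor of 2.0 below is a hypothetical value chosen for illustration, since the paper only states that a higher threshold is used in this phase:

```python
def warmup_cost_limit(true_limit: float, relax_factor: float = 2.0) -> float:
    """Relaxed cost limit used while collecting the initial good trajectories.

    The specific relaxation factor is an illustrative assumption; the key point is
    that the oracle temporarily accepts costlier, higher-return trajectories."""
    return relax_factor * true_limit

# e.g. with a true budget of 25, trajectories with total cost up to 50 may be
# labeled "good" early on, so a few high-return trajectories are gathered sooner.
print(warmup_cost_limit(25.0))  # 50.0
```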
We compare our algorithm with prior safe RL algorithms on six SafetyGym environments (a minimal environment-loading sketch follows the list):
Safety-PointGoal1
Safety-PointButton1
Safety-PointPush1
Safety-CarGoal1
Safety-CarButton1
Safety-CarPush1
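A minimal sketch of loading these tasks, assuming the safety-gymnasium package and its Gymnasium-style API; the exact environment IDs and version suffixes are assumptions and may differ across package releases:

```python
import safety_gymnasium

TASKS = [
    "SafetyPointGoal1-v0",
    "SafetyPointButton1-v0",
    "SafetyPointPush1-v0",
    "SafetyCarGoal1-v0",
    "SafetyCarButton1-v0",
    "SafetyCarPush1-v0",
]

for task in TASKS:
    env = safety_gymnasium.make(task)
    obs, info = env.reset(seed=0)
    # Safety-Gymnasium returns a cost signal alongside the reward at every step.
    obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
    env.close()
```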
We also answer several additional questions about our method; details can be found in our paper:
(Q1) Is it necessary to use both good and bad demonstrations in the training?
(Q2) How does SIM compare to BC-based and GAIL-based algorithms?
(Q3) How does SIM perform with different expertise levels of the initial policy π0? Can it benefit from a poorly trained policy?
(Q4) Can SIM provide a high-reward and safe policy using a relaxed-constraint expert?
(Q5) What happens if the cost function is inaccessible?
(Q6) Would an unconstrained problem benefit from our approach?
(Q7) Would our approach work with CVaR-constrained problems?
(Q8) Does the number of initial good trajectories impact the final performance?
@inproceedings{hoang2024imitate,
  title     = {Imitate the good and avoid the bad: An incremental approach to safe reinforcement learning},
  author    = {Hoang, Huy and Mai, Tien and Varakantham, Pradeep},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {38},
  number    = {11},
  pages     = {12439--12447},
  year      = {2024}
}