Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning

Huy Hoang, Tien Mai, Pradeep Varakantham

School of Computing and Information Systems 

Singapore Management University, Singapore

{mhhoang, atmai, pradeepv}@smu.edu.sg

Code, Paper

Accepted at AAAI-24

Abstract

A constrained RL framework enforces safe behavior while maximizing expected reward.

Recent approaches convert the trajectory-level cost constraint into surrogate per-state problems, yielding solutions that either over- or under-estimate the cost constraint at each state.

Our approach instead imitates state-action pairs from high-return, constraint-satisfying (good) trajectories and avoids state-action pairs from trajectories that fail these criteria (bad), using a classifier trained on both sets to guide the policy.


Constrained Reinforcement Learning
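For reference, this is the standard constrained MDP objective underlying the setting (the notation r, c, d, and γ is illustrative and may differ slightly from the paper):

\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t) \Big]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t=0}^{T} c(s_t, a_t) \Big] \le d

where r is the reward, c the per-step cost, and d the trajectory-level cost threshold.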

Overview

In its simplest form, our approach uses a classifier K to help the agent recognize state-action pairs that appear in high-return, constraint-satisfying trajectories (the good set G) while avoiding pairs from trajectories that fail these criteria (the bad set B).

In this process, the agent interacts with the environment to fill a replay buffer T. An oracle then sorts the collected trajectories into the good and bad buffers. A classifier is trained on these two buffers to provide feedback that tells the agent whether a state-action pair is favorable or unfavorable; a sketch of the classifier is given below, and the overall loop appears under Pseudo Code.
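A minimal sketch of how such a classifier could be trained on the two buffers (PyTorch-style; the class and function names below are illustrative, not the released implementation):

import torch
import torch.nn as nn
class GoodBadClassifier(nn.Module):
    # Binary classifier over (state, action) pairs: 1 = from a good trajectory, 0 = from a bad one.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)
def train_classifier(clf, good_s, good_a, bad_s, bad_a, epochs=10, lr=1e-3):
    # Fit the classifier to separate state-action pairs drawn from the good buffer G and the bad buffer B.
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        logits_good = clf(good_s, good_a)
        logits_bad = clf(bad_s, bad_a)
        loss = bce(logits_good, torch.ones_like(logits_good)) + bce(logits_bad, torch.zeros_like(logits_bad))
        opt.zero_grad(); loss.backward(); opt.step()
    return clf
# The classifier's score for a state-action pair is then used as feedback when updating the policy.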

Pseudo Code
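A hedged Python-style sketch of the overall loop described above (collect_trajectories, oracle.split, and update_policy are placeholder names, not the released code):

def train(env, policy, classifier, oracle, iterations):
    good_buffer, bad_buffer = [], []
    for _ in range(iterations):
        # 1. Roll out the current policy to fill the replay buffer T.
        trajectories = collect_trajectories(env, policy)
        # 2. The oracle sorts trajectories into good (high return, constraint satisfied) and bad buffers.
        good, bad = oracle.split(trajectories)
        good_buffer += good
        bad_buffer += bad
        # 3. Train the classifier to separate good from bad state-action pairs.
        classifier.fit(good_buffer, bad_buffer)
        # 4. Update the policy to imitate pairs the classifier marks good and avoid pairs it marks bad.
        update_policy(policy, classifier, trajectories)
    return policy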

Initial Policy

Good trajectories are critical to the performance of our algorithm, but in tasks with strict constraints they are hard to obtain. To address this, we first train the agent in a relaxed constraint setting with a higher cost threshold. This lets the agent take riskier actions that can yield higher returns, so we can gather a small but impactful set of good trajectories and accelerate the subsequent training.
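As an illustration, the warm-up can be written as two training phases with different cost thresholds (the threshold values, iteration budgets, and the train_with_limit helper are hypothetical):

true_cost_limit = 25.0                      # illustrative target trajectory-cost threshold
relaxed_cost_limit = 2 * true_cost_limit    # looser threshold used only during warm-up
warmup_iters, main_iters = 100, 900         # illustrative budget split
policy = train_with_limit(env, policy, cost_limit=relaxed_cost_limit, iterations=warmup_iters)  # riskier behavior yields initial good trajectories
policy = train_with_limit(env, policy, cost_limit=true_cost_limit, iterations=main_iters)       # continue under the real constraint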

Environments

We compare our algorithm with prior safe RL approaches on six SafetyGym environments.
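For reference, a minimal way to instantiate one of these tasks, assuming the Safety-Gymnasium package (the environment ID and the cost-returning step signature follow that library's convention; adapt if using the original safety-gym):

import safety_gymnasium
env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset()
action = env.action_space.sample()
# Safety-Gymnasium returns a per-step safety cost alongside the reward.
obs, reward, cost, terminated, truncated, info = env.step(action)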

Experimental Results

Results are reported on each of the six tasks: Safety-PointGoal1, Safety-PointButton1, Safety-PointPush1, Safety-CarGoal1, Safety-CarButton1, and Safety-CarPush1 (plots in the paper).

Additional Experiments

We also answer several additional questions about our method in our paper.

Citation

@article{hoang2023imitate,
  title={Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning},
  author={Hoang, Huy and Mai, Tien and Varakantham, Pradeep},
  journal={arXiv preprint arXiv:2312.10385},
  year={2023}
}