Constrained RL framework: maximizes expected reward while keeping the expected trajectory cost within a given budget.
Recent approaches convert the trajectory-level cost constraint into surrogate, state-level problems, yielding solutions that over- or underestimate the cost constraint at each state.
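For reference, the trajectory-level constrained objective these methods build on can be written in standard constrained-MDP notation (the symbols r, c, γ, and the cost budget d follow common convention rather than the paper's exact notation):

$$
\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, c(s_t, a_t)\right] \le d
$$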
Our approach:
Avoids modifying the trajectory-level cost constraint. Instead, an oracle labels trajectories as "good" or "bad" based on a reward threshold and the overall cost constraint.
Can start from any initial policy or set of trajectories and improve on it.
Outperforms leading benchmark approaches for Constrained RL with respect to expected cost, CVaR cost, and even unknown cost constraints.
At its core, our approach uses a classifier K to help the agent recognize state-action pairs that appear in high-return, constraint-satisfying trajectories ΩG while avoiding those from trajectories that fail either criterion ΩB.
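A minimal sketch of such a state-action classifier, assuming a simple PyTorch MLP over concatenated state-action vectors; the class and function names here are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

class StateActionClassifier(nn.Module):
    """Binary discriminator over (state, action) pairs: ~1 for pairs from good
    trajectories (ΩG), ~0 for pairs from bad trajectories (ΩB)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: higher means "looks like a good pair"
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def classifier_loss(clf, good_states, good_actions, bad_states, bad_actions):
    """Binary cross-entropy: good pairs labeled 1, bad pairs labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    g = clf(good_states, good_actions)
    b = clf(bad_states, bad_actions)
    return bce(g, torch.ones_like(g)) + bce(b, torch.zeros_like(b))
```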
During training, the agent interacts with the environment to fill a replay buffer T. An oracle then sorts selected trajectories into a good buffer and a bad buffer. From these buffers, a classifier is trained to provide feedback that helps the agent distinguish whether a state-action pair is favorable or unfavorable.
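The oracle's labeling rule can be sketched as follows, assuming a trajectory record that stores its state-action pairs, total reward, and total cost; the data-structure and function names are illustrative, not the authors' code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    pairs: List[Tuple]      # (state, action) pairs collected along the trajectory
    total_reward: float
    total_cost: float

def oracle_split(replay: List[Trajectory], reward_threshold: float, cost_limit: float):
    """Label a trajectory 'good' if its return clears the reward threshold AND its
    total cost satisfies the overall constraint; everything else goes to the bad buffer."""
    good, bad = [], []
    for traj in replay:
        if traj.total_reward >= reward_threshold and traj.total_cost <= cost_limit:
            good.append(traj)   # contributes state-action pairs to ΩG
        else:
            bad.append(traj)    # contributes state-action pairs to ΩB
    return good, bad
```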
Good trajectories are essential for our algorithm to reach strong performance, but in tasks with strict constraints they are hard to obtain. To address this, we first train the agent in a relaxed-constraint setting with a higher cost threshold. This allows the agent to take riskier actions that can yield higher returns, letting us collect a small but valuable set of good trajectories and accelerating training.
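A tiny sketch of this warm-up idea; the relaxation factor of 2.0 below is a hypothetical value chosen for illustration, since the paper only states that a higher threshold is used in this phase:

```python
def warmup_cost_limit(true_limit: float, relax_factor: float = 2.0) -> float:
    """Relaxed cost limit used while collecting the initial good trajectories.

    The specific relaxation factor is an illustrative assumption; the key point is
    that the oracle temporarily accepts costlier, higher-return trajectories."""
    return relax_factor * true_limit

# e.g. with a true budget of 25, trajectories with total cost up to 50 may be
# labeled "good" early on, so a few high-return trajectories are gathered sooner.
print(warmup_cost_limit(25.0))  # 50.0
```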
We compare our algorithm with prior safe RL algorithms on six SafetyGym environments (a minimal environment-loading sketch follows the list):
Safety-PointGoal1
Safety-PointButton1
Safety-PointPush1
Safety-CarGoal1
Safety-CarButton1
Safety-CarPush1
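A minimal sketch of loading these tasks, assuming the safety-gymnasium package and its Gymnasium-style API; the exact environment IDs and version suffixes are assumptions and may differ across package releases:

```python
import safety_gymnasium

TASKS = [
    "SafetyPointGoal1-v0",
    "SafetyPointButton1-v0",
    "SafetyPointPush1-v0",
    "SafetyCarGoal1-v0",
    "SafetyCarButton1-v0",
    "SafetyCarPush1-v0",
]

for task in TASKS:
    env = safety_gymnasium.make(task)
    obs, info = env.reset(seed=0)
    # Safety-Gymnasium returns a cost signal alongside the reward at every step.
    obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
    env.close()
```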
We also answer several additional questions about our method; details can be found in our paper:
(Q1) Is it necessary to use both good and bad demonstrations in the training?
(Q2) How does SIM compare to BC-based and GAIL-based algorithms?
(Q3) How does SIM perform with different expertise levels of the initial policy π0? Can it benefit from a poorly trained policy?
(Q4) Can SIM provide a high-reward and safe policy using a relaxed-constraint expert?
(Q5) What happens if the cost function is inaccessible?
(Q6) Would an unconstrained problem benefit from our approach?
(Q7) Would our approach work with CVaR-constrained problems?
(Q8) Does the number of initial good trajectories impact the final performance?
@inproceedings{hoang2024imitate,
  title     = {Imitate the good and avoid the bad: An incremental approach to safe reinforcement learning},
  author    = {Hoang, Huy and Mai, Tien and Varakantham, Pradeep},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {38},
  number    = {11},
  pages     = {12439--12447},
  year      = {2024}
}