Safe Option-Critic
Learning Safety in the Option-Critic Architecture
Designing hierarchical reinforcement learning algorithms that induce a notion of safety is not only vital for safety-critical applications but also brings a better understanding of an artificially intelligent agent's decisions. While learning options end-to-end automatically has recently been realised, in this work we propose a solution for learning safe options. We introduce the idea of controllability of states, based on temporal difference errors, in the option-critic framework. We then derive the policy-gradient theorem with controllability and propose a novel framework called safe option-critic. We demonstrate the effectiveness of our approach in the four-rooms grid-world, the puddle-world, and three games in the Arcade Learning Environment (ALE). Learning end-to-end options with the proposed notion of safety achieves a reduction in the variance of returns and boosts performance in environments with intrinsic variability in rewards. More importantly, the proposed algorithm outperforms the vanilla option-critic in all the environments and outperforms primitive actions in two out of three ALE games.
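As a minimal illustrative sketch (in our own notation, not necessarily the exact formulation derived in the paper), controllability of a state-option pair can be tied to the expected squared temporal-difference (TD) error, and the option-critic objective can be augmented with an assumed trade-off coefficient \(\psi \geq 0\) that penalises poorly controllable regions; here \(Q_\Omega\) and \(\beta_\omega\) denote the standard option-value function and termination function of the option-critic framework:
\[
\delta_t = r_{t+1} + \gamma\bigl(1-\beta_{\omega_t}(s_{t+1})\bigr)\,Q_\Omega(s_{t+1},\omega_t) + \gamma\,\beta_{\omega_t}(s_{t+1})\max_{\omega'}Q_\Omega(s_{t+1},\omega') - Q_\Omega(s_t,\omega_t),
\]
\[
J_{\text{safe}}(\theta) = Q_\Omega(s_0,\omega_0) - \psi\,\mathbb{E}\bigl[\delta_t^2 \mid s_t,\omega_t\bigr],
\]
so that, under this sketch, the intra-option policy and termination gradients follow \(J_{\text{safe}}\) rather than the expected return alone.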
[Figure: Four-rooms grid-world environment, learned flattened policies of Safe Option-Critic vs. Option-Critic]
[Figure: Continuous puddle-world environment, sampled trajectories of Safe Option-Critic vs. Option-Critic]
[Figure: Arcade Learning Environment, Safe A2OC vs. A2OC]