Safe Option-Critic

Learning Safety in the Option-Critic Architecture

Designing hierarchical reinforcement learning algorithms that induce a notion of safety is not only vital for safety-critical applications but also brings a better understanding of an artificially intelligent agent's decisions. While end-to-end option learning has recently been realised, in this work we propose an approach for learning safe options. We introduce the idea of controllability of states, based on temporal-difference errors, in the option-critic framework. We then derive the policy-gradient theorem with controllability and propose a novel framework, Safe Option-Critic. We demonstrate the effectiveness of our approach in the four-rooms grid-world and three games in the Arcade Learning Environment (ALE). Learning options end-to-end with the proposed notion of safety reduces the variance of returns and boosts performance in environments with intrinsic variability in rewards. Moreover, the proposed algorithm outperforms vanilla option-critic in all environments and outperforms primitive actions in two of the three ALE games.
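
To make the core idea concrete, here is a minimal tabular sketch of how a controllability-style penalty derived from the TD error could be folded into an option-critic update. The table sizes, the coefficient psi, the squared-TD-error form of the penalty, and the function names are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    # Minimal tabular sketch of a controllability-penalised option-critic update.
    # Assumption: the safety signal is the squared intra-option TD error, weighted
    # by a coefficient `psi`, and subtracted from the advantage used by the
    # intra-option policy gradient. Sizes and names are illustrative only.

    n_states, n_options, n_actions = 104, 4, 4
    gamma, lr_critic, lr_actor, psi = 0.99, 0.5, 0.25, 0.1

    Q_omega = np.zeros((n_states, n_options))               # option values Q_Omega(s, w)
    Q_u = np.zeros((n_states, n_options, n_actions))        # action values Q_U(s, w, a)
    pi_logits = np.zeros((n_states, n_options, n_actions))  # intra-option policy logits

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def safe_option_critic_step(s, w, a, r, s_next, beta_next, done):
        """One transition update with a controllability (TD-error) penalty."""
        # Utility of arriving in s_next while executing option w: continue with
        # probability (1 - beta) or terminate and pick the greedy option.
        u_next = 0.0 if done else (
            (1.0 - beta_next) * Q_omega[s_next, w] + beta_next * Q_omega[s_next].max()
        )
        delta = r + gamma * u_next - Q_u[s, w, a]            # intra-option TD error

        # Critic updates.
        Q_u[s, w, a] += lr_critic * delta
        Q_omega[s, w] += lr_critic * (r + gamma * u_next - Q_omega[s, w])

        # Actor update: advantage minus the assumed penalty psi * delta^2, so the
        # intra-option policy is pushed away from states with large TD surprises.
        advantage = Q_u[s, w, a] - Q_omega[s, w]
        safe_advantage = advantage - psi * delta ** 2
        probs = softmax(pi_logits[s, w])
        grad_log = -probs
        grad_log[a] += 1.0                                   # d log pi(a|s,w) / d logits
        pi_logits[s, w] += lr_actor * safe_advantage * grad_log

In a full agent, beta_next would come from the learned termination function of option w, and the same penalised signal could also enter the termination-gradient update; this sketch only illustrates the intra-option step.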

Grid World: Four Rooms Environment

[Figure: learned flattened policies, Safe Option-Critic vs. Option-Critic]

Continuous Puddle-World Environment

[Figure: sampled trajectory, Safe Option-Critic vs. Option-Critic]

Arcade Learning Environment

[Videos: MsPacman, Safe A2OC (MsPacmanWithControllability0PT10.mov) vs. A2OC (MsPacmanA2OC_NoControllability.mov)]
[Videos: Amidar, Safe A2OC (AmidarSafeA2OC_0PT10Controllability.mov) vs. A2OC (AmidarA2OC_NoControllability.mov)]