Options of Interest
Temporal Abstraction with Interest Functions
Khimya Khetarpal, Martin Klissarov, Maxime Chevalier-Boisvert, Pierre-Luc Bacon, Doina Precup
McGill University, Mila, Université de Montréal, Stanford University, Google DeepMind
In Proceedings of AAAI 2020; also presented at the NeurIPS 2019 Deep RL Workshop and the Learning Transferable Skills Workshop.
Abstract: Temporal abstraction refers to the ability of an agent to use behaviours of controllers which act for a limited, variable amount of time. The options framework describes such behaviours as consisting of a subset of states in which they can initiate, an internal policy, and a stochastic termination condition. However, much of the subsequent work on option discovery has ignored the initiation set, because of the difficulty of learning it from data. We provide a generalization of initiation sets suitable for general function approximation by defining an interest function associated with an option. We derive a gradient-based learning algorithm for interest functions, leading to a new interest-option-critic architecture. We investigate how interest functions can be leveraged to learn interpretable and reusable temporal abstractions. We demonstrate the efficacy of the proposed approach through quantitative and qualitative results, in both discrete and continuous environments.
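As a rough intuition for how interest functions generalize initiation sets, the policy over options can be reweighted by each option's interest in the current state, so that an option with low interest is rarely selected there. The sketch below illustrates this reweighting in isolation; the function name and the fallback behaviour when all interests are zero are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def interest_weighted_policy(interest, policy_over_options):
    """Reweight a policy over options by per-option interest.

    interest: shape (n_options,), values in [0, 1] -- i(s, w) evaluated
        at the current state s (hypothetical input representation).
    policy_over_options: shape (n_options,), a distribution pi_Omega(w | s).
    Returns the normalized, interest-weighted distribution over options.
    """
    weighted = interest * policy_over_options
    total = weighted.sum()
    if total == 0.0:
        # No option expresses interest here: fall back to the base policy
        # (an assumption made for this sketch).
        return policy_over_options
    return weighted / total

# Example: two options; option 1 has low interest in this state,
# so it is selected far less often than under the uniform base policy.
probs = interest_weighted_policy(np.array([0.9, 0.1]),
                                 np.array([0.5, 0.5]))
```

A hard initiation set is recovered as the special case where interest values are exactly 0 or 1, which is why a learned, continuous interest function is a strict generalization.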
Environments
Fourrooms
TMaze
MiniWorld
Half Cheetah
Options Learned with Interest Functions
HalfCheetah
Option 0 specializes in going forward
Option 1 specializes in going backward
Option 0 learns to move forward by dragging its limbs, whereas option 1 takes much larger hopping steps.
TMaze
Visualization of Interest Function
Figure 1: The first task has two goals; in the next task, only one remains.
Figure 2: Interest functions for Option 1 (top) and Option 2 (bottom)
The point-mass agent starts at the bottom end of a T-shaped maze, with two possible goal locations (the left and right ends of the T), as shown in the first row of Figure 1. Reaching either goal yields a reward of +1. After 150 iterations, the goal that has been visited the most is removed, and the agent has to adapt its policy to the only remaining reward location. We used 2 options in both the OC and IOC agents. Figure 2 visualizes the interest functions during learning; the x and y axes span the two-dimensional state space. Initially, the interest functions are random; over time, the options learn to specialize in different regions of the state space. Since the agent starts at the tail of the T, it is interesting that one option emerges to specialize near the tail and the other around the T junction. Once the task changes, the interest of each option adjusts and automatically reaches a similar separation of the state space between the options.
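Heat maps like those in Figure 2 can be produced by evaluating a learned interest function at every point of a grid over the 2-D state space. The sketch below assumes, purely for illustration, a sigmoid-of-linear-features parameterization of the interest function; the paper's actual parameterization may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interest_grid(theta, xs, ys):
    """Evaluate i_theta(s) = sigmoid(theta . [x, y, 1]) on a 2-D grid.

    theta: shape (3,), parameters of a hypothetical linear-sigmoid
        interest function for one option.
    xs, ys: 1-D arrays of grid coordinates over the state space.
    Returns an array of shape (len(ys), len(xs)) with values in (0, 1),
    suitable for plotting as a heat map.
    """
    grid = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            grid[i, j] = sigmoid(theta @ np.array([x, y, 1.0]))
    return grid

# Example: an option whose interest grows towards the right of the maze.
xs = np.linspace(-1.0, 1.0, 50)
ys = np.linspace(-1.0, 1.0, 50)
heat = interest_grid(np.array([2.0, 0.0, 0.0]), xs, ys)
```

Plotting one such grid per option (e.g. with a standard heat-map routine) makes the regions where each option specializes directly visible, which is what makes the learned abstractions interpretable.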
MiniWorld
The IOC agent learns two distinct options: option 0 scans the surroundings, whereas option 1 navigates directly towards the block once it has been located.
Options Learned without Interest Functions - OC Agent
The OC agent does not learn specialized skills: both option 0 and option 1 end up going backward, overfitting to the current task at hand (task 2, going backward), even though the first task was to go forward. In IOC, by contrast, an option's skill and interest from task 1 are protected.