Options of Interest

Temporal Abstraction with Interest Functions

Khimya Khetarpal, Martin Klissarov, Maxime Chevalier-Boisvert, Pierre-Luc Bacon, Doina Precup

McGill University, Mila, Université de Montréal, Stanford University, Google DeepMind

Abstract: Temporal abstraction refers to the ability of an agent to use behaviours of controllers which act for a limited, variable amount of time. The options framework describes such behaviours as consisting of a subset of states in which they can initiate, an internal policy, and a stochastic termination condition. However, much of the subsequent work on option discovery has ignored the initiation set, because of the difficulty of learning it from data. We provide a generalization of initiation sets suitable for general function approximation, by defining an interest function associated with an option. We derive a gradient-based learning algorithm for interest functions, leading to a new interest-option-critic architecture. We investigate how interest functions can be leveraged to learn interpretable and reusable temporal abstractions. We demonstrate the efficacy of the proposed approach through quantitative and qualitative results, in both discrete and continuous environments.
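The core idea can be illustrated with a short sketch (not the authors' implementation): an interest function I(s, ω) in (0, 1) smoothly generalizes an initiation set and reweights the policy over options, π_I(ω|s) ∝ I(s, ω) π_Ω(ω|s). The sigmoid parameterization and the softmax policy over options below are assumptions chosen for simplicity.

```python
# Minimal sketch of an interest-weighted policy over options.
# The linear-sigmoid interest function and linear-softmax policy over
# options are illustrative assumptions, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_options = 4, 2

# One learnable weight vector per option for the interest function,
# and one per option for the policy over options.
interest_params = rng.normal(size=(n_options, n_features))
policy_params = rng.normal(size=(n_options, n_features))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interest(state):
    """I(s, w) in (0, 1): a smooth generalization of an initiation set."""
    return sigmoid(interest_params @ state)

def policy_over_options(state):
    """Softmax policy over options, pi_Omega(w | s)."""
    logits = policy_params @ state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def interest_weighted_policy(state):
    """pi_I(w | s) proportional to I(s, w) * pi_Omega(w | s)."""
    weighted = interest(state) * policy_over_options(state)
    return weighted / weighted.sum()

state = rng.normal(size=n_features)
print(interest_weighted_policy(state))  # probabilities over the 2 options
```

Because the interest function is differentiable in its parameters, it can be trained by policy gradient alongside the intra-option policies and terminations, which is what makes the interest-option-critic architecture end-to-end learnable.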

Environments

FourRooms

TMaze

MiniWorld

HalfCheetah

Options Learned with Interest Functions

HalfCheetah

Option 0 specializes in going forward

Option 1 specializes in going backward

Option 0 moves forward by dragging its limbs, whereas option 1 takes much larger hopping steps.

TMaze

Visualization of Interest Functions

Figure 1: The first task has multiple goals; the next task has only one.

Figure 2: Interest functions for option 1 (top) and option 2 (bottom).

The point-mass agent starts at the bottom end of a T-shaped maze, and there are two possible goal locations (the left and right ends of the maze), as shown in the first row of Figure 1. Reaching either goal yields a reward of +1. After 150 iterations, the goal that has been visited the most is removed and the agent has to adapt its policy to the only remaining reward location. We used two options in both the OC and IOC agents. Figure 2 visualizes the interest functions during learning; the x-y axes correspond to the two-dimensional state space. Initially, the interest functions are random; over time, the options learn to specialize in different regions of the state space. One option emerges to specialize near the tail of the maze, where the agent starts, and the other around the T-junction. Once the task changes, the interest of each option adjusts and automatically recovers a similar separation of the state space into regions where each option specializes.
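Heatmaps in the style of Figure 2 can be produced by evaluating each option's interest function on a grid over the x-y state space. The sketch below is illustrative only: the linear-sigmoid interest function, its parameters, and the maze bounds are assumptions, whereas in the experiment these values would come from the trained IOC agent.

```python
# Sketch: Figure 2-style heatmaps of interest functions over a 2-D state space.
# Parameters are hypothetical stand-ins for learned interest-function weights.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights for two options' interest functions over (x, y) states.
interest_params = np.array([[2.0, -1.0], [-2.0, 1.0]])

xs, ys = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
states = np.stack([xs, ys], axis=-1)  # shape (100, 100, 2)

fig, axes = plt.subplots(2, 1, figsize=(4, 6))
for option, ax in enumerate(axes):
    values = sigmoid(states @ interest_params[option])  # I(s, option) on the grid
    image = ax.imshow(values, origin="lower", extent=(-1, 1, -1, 1))
    ax.set_title(f"Interest of option {option}")
    fig.colorbar(image, ax=ax)
plt.tight_layout()
plt.show()
```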

MiniWorld

The IOC agent learns two distinct options: option 0 scans the surroundings, whereas option 1 navigates directly towards the block once it has been located.

Options Learned without Interest Functions - OC Agent

The OC agent does not learn specialized skills: both options 0 and 1 end up going backward, overfitting to the task at hand (task 2, going backward), even though the first task was to go forward. In contrast, with IOC an option's interest, and hence the skill learned in task 1, is protected.