Language as an Abstraction for Hierarchical Deep Reinforcement Learning

Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn

Google Research

[Environment] [Pre-print]

Overview

Hierarchical reinforcement learning offers a promising framework for learning policies that span long time horizons, but designing the abstraction between the high-level and low-level policies is challenging. Language is a compositional representation that is human-interpretable and flexible, making it suitable for encoding a wide range of behaviors.

We propose to use language as the abstraction between the high-level and low-level policies, and demonstrate that the resulting policies can successfully solve long-horizon tasks with sparse rewards and can generalize well by exploiting the compositionality of language, even in challenging high-dimensional observation and action spaces. First, we demonstrate the benefits of our method in a low-dimensional observation space through various ablations and comparisons against different HRL methods, and then scale our method to a challenging pixel observation space where the baselines cannot make progress.
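Concretely, the interface can be pictured as a standard two-level control loop whose high-level action space is a set of language instructions. Below is a minimal Python sketch; the policy interfaces, horizons, and gym-style env.step API are illustrative assumptions, not our released code.

    # Minimal sketch of the hierarchy: the high-level policy emits language
    # instructions and the low-level policy conditions on them. All names
    # and horizons are illustrative assumptions.
    def run_episode(env, high_policy, low_policy,
                    num_instructions=10, steps_per_instruction=20):
        obs = env.reset()
        total_reward = 0.0
        for _ in range(num_instructions):
            # High-level action: a natural-language instruction, e.g.
            # "move the red ball to the left of the purple ball".
            instruction = high_policy.act(obs)
            for _ in range(steps_per_instruction):
                # Low-level action: a primitive control conditioned on
                # both the observation and the current instruction.
                action = low_policy.act(obs, instruction)
                obs, reward, done, info = env.step(action)
                total_reward += reward
                if done:
                    return total_reward
        return total_reward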

More environment documentation and code for the algorithms are coming soon!

State-based Results

Low Level Policy

  • Here the videos show the rollouts of the low-level policy trying to complete randomly sampled goals in 2 different environment settings.
  • The caption shows the current instruction and green text indicates that the instruction was completed.
low-level-rollout.mp4

Standard

  • This policy is trained on a fixed set of 5 spheres with different colors.
  • The policy is able to complete almost every instruction.
multi_object_rollout.mp4

Diverse

  • This policy is trained on objects with varying colors (5), shapes (2), materials (2), and sizes (2).
  • 1,086,008 possible object configurations (the same object can repeat); see the quick check after this list.
  • The policy is able to complete almost every instruction.
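The configuration count is the number of multisets of size 5 drawn from 5 x 2 x 2 x 2 = 40 object types, which can be verified directly:

    from math import comb

    num_types = 5 * 2 * 2 * 2            # colors x shapes x materials x sizes = 40
    # Multisets of 5 objects drawn from 40 types (objects can repeat):
    print(comb(num_types + 5 - 1, 5))    # 1086008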

High Level Policy

  • Here we show videos of random rollouts of the high-level policies in the 3 standard settings (not cherry-picked).
  • "Success" indicates that the task has been completed.
  • In all 3 tasks, the success rate is close to perfect.
    • The agents receive a reward only if all of the constraints are satisfied (e.g., in task (b), all the objects need to be ordered). The sparse binary reward makes these tasks extremely challenging for RL agents, as random exploration rarely produces a meaningful reward signal; a sketch of such a reward function follows this list.
    • A flat RL algorithm (DDQN) can only solve (a) and (b), and with much higher variance and worse asymptotic performance; other HRL algorithms (HIRO and Option-Critic) cannot solve the tasks at all.
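For intuition, a hypothetical version of such a binary task reward is sketched below; the predicate representation is an assumption made for illustration.

    def task_reward(state, constraints):
        # Sparse binary reward: 1 only when *every* constraint holds at once.
        # `constraints` is an assumed list of predicates over the state, e.g.
        # lambda s: s["red"][0] > s["purple"][0]  ("red right of purple").
        return 1.0 if all(c(state) for c in constraints) else 0.0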
statement_rollout.mp4

(a) Object Arrangement

In this task, the high-level policy needs to make the following statements simultaneously true (a sketch of how such statements can be checked follows the list):

  1. red ball to the right of purple ball
  2. green ball to the right of red ball
  3. green ball to the right of cyan ball
  4. purple ball to the left of cyan ball
  5. cyan ball to the right of purple ball
  6. red ball in front of blue ball
  7. red ball to the left of green sphere
  8. green ball in front of blue sphere
  9. purple ball to the left of cyan ball
  10. blue ball behind red ball
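For illustration, statements like these can be checked against object positions with a handful of geometric predicates. The coordinate convention (+x to the right, +y toward the front) and the data layout are assumptions:

    # Each relation maps a pair of (x, y) positions to a boolean.
    RELATIONS = {
        "right of":    lambda a, b: a[0] > b[0],
        "left of":     lambda a, b: a[0] < b[0],
        "in front of": lambda a, b: a[1] > b[1],
        "behind":      lambda a, b: a[1] < b[1],
    }

    def statements_satisfied(positions, statements):
        # positions: dict like {"red": (x, y), ...}
        # statements: (subject, relation, object) triples, e.g.
        # ("red", "right of", "purple") for statement 1 above.
        return all(RELATIONS[rel](positions[a], positions[b])
                   for a, rel, b in statements)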
ordering_rollout.mp4

(b) Object Ordering

In this task, the high-level policy needs to order the objects as follows:

  • cyan < purple < green < blue < red
  • The objects cannot be too far apart along the y-axis (a hypothetical check follows this list).
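A hypothetical check for this objective; the choice of the x-axis for the ordering and the y-spread threshold are illustrative assumptions:

    ORDER = ["cyan", "purple", "green", "blue", "red"]

    def correctly_ordered(positions, max_y_spread=0.5):  # threshold assumed
        xs = [positions[color][0] for color in ORDER]
        ys = [positions[color][1] for color in ORDER]
        in_order = all(x1 < x2 for x1, x2 in zip(xs, xs[1:]))
        close_in_y = max(ys) - min(ys) <= max_y_spread
        return in_order and close_in_y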
sorting_rollout.mp4

(c) Object Sorting

In this task, the high-level policy needs to sort the objects such that:

  • The red sphere is between and behind the purple and blue spheres
  • The blue sphere must be to the right of the purple sphere
  • The green sphere is between and in front of the purple and blue spheres
  • The cyan sphere is in the "middle" of all 4 other spheres
  • The other 4 spheres form a "diamond" around the cyan sphere

Vision-based Results

Low Level Policy

  • Here the videos show the rollouts of the low-level policy, whose observations are in pixel space, trying to complete randomly sampled goals in 2 different environment settings.
  • The agent's viewing angle is the same as shown in the videos, but the frames are downsized to 64 x 64 by cubic interpolation (a preprocessing sketch follows this list). This viewing angle introduces occlusion as well as a small degree of partial observability.
  • The caption shows the current instruction and green text indicates the instruction was completed.
  • The policy operates in observation and action spaces that are orders of magnitude larger than those of the state-based policies.
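The downsampling step can be reproduced with standard tooling; here is a sketch using OpenCV, where the library choice is an assumption and only the 64 x 64 cubic interpolation is specified above:

    import cv2

    def preprocess(frame):
        # frame: an RGB image rendered from the fixed viewing angle.
        return cv2.resize(frame, (64, 64), interpolation=cv2.INTER_CUBIC)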
image_based_low_level_rollout.mp4

Standard

  • Same 5-object setting as the state-based version.
image-low-level-rollout-diverse.mp4

Diverse

  • Objects with different colors (5) and shapes (3).
  • 3,003 possible object configurations (objects do not repeat); see the count after this list.
  • Unlike in the state-based settings, we observed that model capacity and training time can be limiting factors when scaling up to more diverse visual settings; as such, this level of visual diversity provides a good trade-off between training time and complexity.
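The count corresponds to choosing 5 distinct objects from 5 x 3 = 15 types:

    from math import comb

    num_types = 5 * 3          # colors x shapes = 15
    print(comb(num_types, 5))  # 3003 configurations without repetition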

High Level Policy

  • Here we show videos of random rollouts of the vision-based high-level policies on 3 standard settings and 3 diverse settings.
  • "Success" indicates that the task has been completed.
  • Success rates of the policies for (a-c) are nearly perfect, while the policies for (d-f), which are much more visually challenging, sometimes fail to complete the tasks.
  • Increased observation and action dimensionality exacerbates the exploration problem: even DDQN, which learns 2 of the 3 state-based high-level tasks, fails to learn a non-trivial policy (a sketch of the double-Q target used by this baseline follows).
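For reference, DDQN here refers to Double DQN, whose bootstrap target decouples action selection (online network) from action evaluation (target network). A generic sketch of that target, not our training code:

    import numpy as np

    def ddqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
        # next_q_online / next_q_target: per-action Q-values at the next state.
        # The online network selects the action; the target network scores it.
        best_action = int(np.argmax(next_q_online))
        return reward + gamma * (1.0 - float(done)) * next_q_target[best_action]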
high-level-image-arrange.mp4

(a) Object Arrangement

  • Same objective as the state-based version.
high-level-image-sort.mp4

(b) Object Ordering

  • Same objective as the state-based version.
high-level-image-2d.mp4

(c) Object Sorting

  • Same objective as the state-based version.
high-level-image-colorsort.mp4

(d) Color Ordering

  • red < green < blue < cyan < purple
high-level-image-shapesort.mp4

(e) Shape Ordering

  • sphere < cube < cylinder
high-level-image-colorshape.mp4

(f) Color & Shape Ordering

  • The colors must be sorted according to the color ordering, and within each color the shapes must be sorted according to the shape ordering.
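This combined criterion is a lexicographic ordering on (color, shape); a minimal sketch, assuming simple object records with color and shape fields:

    COLOR_RANK = {"red": 0, "green": 1, "blue": 2, "cyan": 3, "purple": 4}
    SHAPE_RANK = {"sphere": 0, "cube": 1, "cylinder": 2}

    def target_order(objects):
        # objects: list of dicts like {"color": "red", "shape": "cube"}.
        # Sort by color first, then by shape within each color.
        return sorted(objects,
                      key=lambda o: (COLOR_RANK[o["color"]], SHAPE_RANK[o["shape"]]))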