Language as an Abstraction for Hierarchical Deep Reinforcement Learning
Overview
Hierarchical reinforcement learning offers a promising framework for learning policies that span long time horizons, but designing the abstraction between the high-level and low-level policies is challenging. Language is a compositional representation that is human-interpretable and flexible, making it suitable for encoding a wide range of behaviors.
We propose to use language as the abstraction between the high-level and low-level policies, and demonstrate that the resulting agent can successfully solve long-horizon tasks with sparse rewards and can generalize well by exploiting the compositionality of language, even in challenging high-dimensional observation and action spaces. First, we demonstrate the benefits of our method in a low-dimensional observation space through various ablations and comparisons against other HRL methods, and then scale our method to a challenging pixel observation space where the baselines cannot make progress.
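As a rough, hypothetical sketch of how such a hierarchy operates (the interfaces below, including `env`, `high_level_policy`, and `low_level_policy`, are illustrative placeholders, not the released code): the high-level policy periodically emits a language instruction, and the low-level policy acts conditioned on the current observation and that instruction.

```python
# Minimal sketch of a language-conditioned two-level rollout loop.
# All objects here (env, high_level_policy, low_level_policy) are
# hypothetical placeholders used only to illustrate the abstraction.

def hierarchical_rollout(env, high_level_policy, low_level_policy,
                         horizon=500, instruction_interval=25):
    obs = env.reset()
    total_reward = 0.0
    instruction = None
    for t in range(horizon):
        # The high-level policy picks a new language instruction
        # every `instruction_interval` environment steps.
        if t % instruction_interval == 0:
            instruction = high_level_policy.select_instruction(obs)
        # The low-level policy acts conditioned on the observation
        # and the current instruction.
        action = low_level_policy.act(obs, instruction)
        obs, reward, done, info = env.step(action)
        total_reward += reward  # sparse task reward credited to the high level
        if done:
            break
    return total_reward
```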
More environment documentation and code for the algorithms are coming soon!
State-based Results
Low Level Policy
- Here the videos show rollouts of the low-level policy attempting to complete randomly sampled goals in 2 different environment settings.
- The caption shows the current instruction and green text indicates that the instruction was completed.
Standard
- This policy is trained on a fixed set of 5 spheres with different colors.
- The policy is able to complete almost every instruction.
Diverse
- This policy is trained on objects with varying colors (5), shapes (2), materials (2), and sizes (2).
- 1,086,008 possible object configurations (the same object can repeat); see the quick check after this list.
- The policy is able to complete almost every instruction.
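As a quick check of the configuration count above (assuming each scene contains 5 objects drawn with repetition from the 5 x 2 x 2 x 2 = 40 object types), the number of multisets of size 5 is:

```python
from math import comb

object_types = 5 * 2 * 2 * 2   # colors x shapes x materials x sizes = 40
objects_per_scene = 5          # assumed scene size, matching the standard setting
# Combinations with repetition: C(n + k - 1, k)
print(comb(object_types + objects_per_scene - 1, objects_per_scene))  # 1086008
```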
High Level Policy
- Here we show videos of random rollouts of the high-level policies on the 3 tasks in the standard setting (not cherry-picked).
- "Success" indicates that the task has been completed.
- In all 3 tasks, the success rate is close to perfect.
- The agents receive a reward only if all of the constraints are satisfied (e.g. in task (b), all the objects need to be ordered). This sparse binary reward makes the tasks extremely challenging for RL agents, as random exploration rarely produces a meaningful reward signal; a minimal sketch of such a reward follows this list.
- A flat RL algorithm (DDQN) can only solve (a) and (b), with much higher variance and worse asymptotic performance; other HRL algorithms (HIRO and Option-Critic) cannot solve the tasks at all.
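To make the sparsity concrete, here is a minimal sketch of such a binary task reward; `constraints` and `holds` are hypothetical names for the task's relational statements and a predicate checker, not part of the released environments.

```python
# Sparse binary task reward: 1.0 only when every relational constraint
# in the task holds at the current state, 0.0 otherwise.
def task_reward(state, constraints, holds):
    return 1.0 if all(holds(state, c) for c in constraints) else 0.0
```

Because the reward is all-or-nothing, a random policy almost never observes a nonzero reward, which is what makes these tasks so hard for flat RL.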
(a) Object Arrangement
In this task, the high-level policy needs to make the following statements simultaneously true (a sketch of how such relations could be checked follows the list):
- red ball to the right of purple ball
- green ball to the right of red ball
- green ball to the right of cyan ball
- purple ball to the left of cyan ball
- cyan ball to the right of purple ball
- red ball in front of blue ball
- red ball to the left of green sphere
- green ball in front of blue sphere
- purple ball to the left of cyan ball
- blue ball behind red ball
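For intuition only, relations like these could be checked with simple coordinate comparisons over object centers; the axis conventions and margin below are assumptions for illustration, not necessarily those used by the environment.

```python
# Hypothetical relational predicates over object centers (x, y), assuming
# +x points to the viewer's right and +y points away from the camera.
# A small margin avoids counting near-ties as satisfying a relation.

def right_of(a, b, margin=0.1):
    return a[0] > b[0] + margin   # a is to the right of b

def left_of(a, b, margin=0.1):
    return a[0] < b[0] - margin   # a is to the left of b

def in_front_of(a, b, margin=0.1):
    return a[1] < b[1] - margin   # a is closer to the camera than b

def behind(a, b, margin=0.1):
    return a[1] > b[1] + margin   # a is farther from the camera than b
```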
(b) Object Ordering
In this task, the high-level policy needs to order the objects as follows:
- cyan < purple < green < blue < red
- The objects cannot be too far apart along the y-axis
(c) Object Sorting
In this task, the high-level policy needs to sort the objects such that:
- The red sphere is between and behind the purple and blue spheres
- The blue sphere must be to the right of the purple sphere
- The green sphere is between, below, and in front of the purple and blue spheres
- The cyan sphere is in the "middle" of all 4 other spheres
- The other 4 spheres form a "diamond" around the cyan sphere
Vision-based Results
Low Level Policy
- Here the videos show rollouts of the low-level policy, whose observations are in pixel space, attempting to complete randomly sampled goals in 2 different environment settings.
- The agent's view is the same as shown in the videos, but downsized to 64 x 64 by cubic interpolation (a typical downsizing call is sketched after this list). This viewing angle includes occlusions as well as a small degree of partial observability.
- The caption shows the current instruction and green text indicates the instruction was completed.
- The policy operates on observation and action spaces that are orders of magnitude larger than those of the state-based policies.
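The exact preprocessing pipeline is not shown here; a typical way to produce such 64 x 64 observations with OpenCV (an assumption for illustration, not necessarily the code used here) would be:

```python
import cv2  # OpenCV, used here only to illustrate the downsizing step

def downsize_frame(frame):
    # Cubic interpolation down to a 64 x 64 observation, as described above.
    return cv2.resize(frame, (64, 64), interpolation=cv2.INTER_CUBIC)
```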
Standard
- Same 5-object setting as the state-based version.
Diverse
- Objects with different colors (5) and shapes (3).
- 3,003 possible object configurations (objects do not repeat); see the quick check after this list.
- Unlike in the state-based setting, we observed that model capacity and training time can be limiting factors when scaling up to more diverse visual settings; as such, this level of visual diversity provides a good trade-off between training time and complexity.
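As a quick check of the count above (assuming 5 distinct objects per scene, as in the standard setting): with 5 colors x 3 shapes = 15 object types and no repetition, there are C(15, 5) = 3003 possible configurations.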
High Level Policy
- Here we show videos of random rollouts of the vision-based high-level policies on the 3 tasks in the standard setting and 3 in the diverse setting.
- "Success" indicates that the task has been completed.
- Success rates of the policies for (a-c) are nearly perfect, while the policies for (d-f), which are much more visually challenging, sometimes fail to complete the tasks.
- The increased observation and action dimensionality exacerbates the exploration problem, and even DDQN, which learns 2 of the 3 state-based high-level tasks, fails to learn a non-trivial policy.
(a) Object Arrangement
- Same objective as the state-based version.
(b) Object Ordering
- Same objective as the state-based version.
(c) Object Sorting
- Same objective as the state-based version.
(d) Color Ordering
- red < green < blue < cyan < purple
(e) Shape Ordering
- sphere < cube < cylinder
(f) Color & Shape Ordering
- The colors must be sorted according to the color ordering, and within each color the shapes must be sorted according to the shape ordering.
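One way to read this combined criterion (the rank tables below follow the color ordering in (d) and the shape ordering in (e); the (color, shape) tuple representation is a hypothetical illustration):

```python
# Lexicographic ordering by (color rank, shape rank), following the
# color ordering in (d) and the shape ordering in (e).
COLOR_RANK = {"red": 0, "green": 1, "blue": 2, "cyan": 3, "purple": 4}
SHAPE_RANK = {"sphere": 0, "cube": 1, "cylinder": 2}

def target_order(objects):
    # `objects` is a list of (color, shape) tuples describing the scene.
    return sorted(objects, key=lambda o: (COLOR_RANK[o[0]], SHAPE_RANK[o[1]]))
```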