Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

Abstract: Hierarchical reinforcement learning methods typically focus on architectural design decisions, such as the inclusion of options in the action space. In many cases they also require some form of supervision to learn the options. In this work, we take a representation learning perspective on reinforcement learning, and show that this allows us to perform hierarchical RL with more flexibility and less supervision. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and compositional problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders and learns latent representations of trajectories, with the key difference that we learn both a latent-conditioned policy that generates trajectories by acting in the world and a latent-conditioned model that predicts the behavior of such policies. This model provides a built-in prediction mechanism, giving us a natural way to do hierarchical RL with model-based planning in latent space. We propose a novel algorithm for performing hierarchical RL, which combines model-based planning in a learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.
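To make the planning idea concrete, below is a minimal, self-contained sketch of shooting-based search over latent codes: sample candidate latents, score each one by applying the task reward to the states predicted by the state decoder, and execute the best latent with the latent-conditioned policy. The interfaces (`decode_states`, `policy`, `reward_fn`, `env`) and the random-shooting search are illustrative assumptions, not the paper's exact MPC procedure.

```python
# Illustrative sketch of planning in the learned latent space.
# decode_states(z) -> predicted state sequence, policy(state, z) -> action,
# reward_fn(states) -> scalar, and env are assumed interfaces, not the paper's code.
import numpy as np

def plan_in_latent_space(decode_states, policy, env, reward_fn,
                         latent_dim=8, num_candidates=128, horizon=50):
    # Random-shooting search: sample candidate latents and score each one
    # by evaluating the reward on the states predicted by the state decoder.
    candidates = np.random.randn(num_candidates, latent_dim)
    scores = [reward_fn(decode_states(z)) for z in candidates]
    best_z = candidates[int(np.argmax(scores))]

    # Execute the chosen latent with the latent-conditioned policy for one segment.
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state, best_z)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return best_z, total_reward
```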

Model: The SeCTAR model computation graph. A trajectory is encoded into a latent distribution, from which a latent is sampled. The latent is decoded in two ways: a state decoder directly predicts the sequence of states, and a policy decoder, conditioned on the same latent, produces a trajectory by executing actions sequentially in the environment.
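The computation graph can be summarized with a short sketch. The layer sizes, module layout, and names below (`latent_dim`, `horizon`, the GRU encoder, and the MLP decoders) are illustrative assumptions rather than the authors' released architecture; the sketch only shows the three components: a trajectory encoder producing a latent distribution, a state decoder that predicts the state sequence from a latent sample, and a latent-conditioned policy decoder.

```python
# Minimal PyTorch sketch of the computation graph described above.
# Layer sizes and architectural choices are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn

class SeCTARSketch(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=64, horizon=50):
        super().__init__()
        self.state_dim, self.horizon = state_dim, horizon
        # Trajectory encoder: state sequence -> parameters of a latent Gaussian.
        self.encoder = nn.GRU(state_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # State decoder: latent -> predicted sequence of states.
        self.state_decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * state_dim))
        # Policy decoder: latent-conditioned policy pi(a | s, z).
        self.policy = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def encode(self, traj):
        # traj: (batch, T, state_dim); reparameterized sample from q(z | traj).
        _, h = self.encoder(traj)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

    def predict_states(self, z):
        # Direct decoding of the latent into a sequence of states.
        return self.state_decoder(z).view(-1, self.horizon, self.state_dim)

    def act(self, state, z):
        # Latent-conditioned policy, executed sequentially in the environment.
        return self.policy(torch.cat([state, z], dim=-1))
```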

2D Navigation

Wheeled Locomotion

Object Manipulation

Swimmer

Results

Updated comparison of our method with the best settings of prior methods on 2D navigation, wheeled locomotion, object manipulation, and swimmer waypoint. The swimmer waypoint task is new: the swimmer receives a reward of 1 for every 3 waypoints it swims through correctly.

In these updated comparisons, we used PPO option-critic, which performed much better than DQN option-critic, and ran more extensive hyperparameter sweeps on the baselines. Dashed lines indicate truncated execution. We find that on all tasks, our method achieves higher reward much more quickly than the model-based, model-free, and hierarchical baselines. We did not evaluate FeUdal and A3C on the wheeled locomotion and swimmer tasks, as our implementations of these methods only accommodate discrete actions.

Video Demonstrations

Visualization of Baseline Methods

Below are visualizations of the baselines on the 2D Navigation task. The blue line indicates the position of the agent over time. The X's indicate goals that were reached in the correct order, and the colored dots indicate goals that have yet to be reached. The colors indicate the order of the goals, following the rainbow: red is the first goal and violet is the last. We visualize the final performance of one rollout for each of 5 goal configurations per task. (All methods were tested on the same 5 configurations; the ordering shown may differ.)
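For readers who want to reproduce this style of plot, the following is a small matplotlib sketch of the convention described above, using made-up path, goal, and reached-goal data; it is not the plotting code used for the figures.

```python
# Illustrative matplotlib sketch of the visualization convention described above
# (hypothetical data; not the figures' actual plotting code).
import numpy as np
import matplotlib.pyplot as plt

path = np.cumsum(np.random.randn(200, 2) * 0.05, axis=0)    # agent position over time
goals = np.random.uniform(-1, 1, size=(5, 2))                # 5 goals in task order
reached = [True, True, False, False, False]                  # first two reached in order

colors = plt.cm.rainbow(np.linspace(0, 1, len(goals)))       # red = first, violet = last
plt.plot(path[:, 0], path[:, 1], color="blue", label="agent position")
for (x, y), hit, c in zip(goals, reached, colors):
    if hit:
        plt.scatter(x, y, marker="x", color=c)               # reached in correct order
    else:
        plt.scatter(x, y, marker="o", color=c)               # not yet reached
plt.legend()
plt.show()
```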

TRPO on 2D Navigation

VIME on 2D Navigation

MPC on 2D Navigation

Option Critic on 2D Navigation

A3C on 2D Navigation

Feudal on 2D Navigation

Below are visualizations of the baselines on the Wheeled Locomotion task. The visualization format is the same as that of 2D Navigation.

TRPO on Wheeled Locomotion

VIME on Wheeled Locomotion

MPC on Wheeled Locomotion

Option Critic on Wheeled Locomotion

Below are visualizations of the baselines on the Block Manipulation task. The purple line indicates the position of the agent over time. The yellow, green, black, and magenta lines indicate the movements of the blocks over time. The X's indicate the goal location for each block, and the colored dots indicate the final location of each block. We visualize the final performance of one rollout for each of 5 goal configurations per task.

TRPO on Block Manipulation

VIME on Block Manipulation

MPC on Block Manipulation

Option Critic on Block Manipulation

A3C on Block Manipulation

Feudal on Block Manipulation

Below are visualizations of the baselines on the Swimmer Waypoint task. The blue line indicates the center of mass of the swimmer over time. The rest of the visualization format is the same as that of 2D Navigation.

TRPO on Swimmer Waypoint

VIME on Swimmer Waypoint

Option Critic on Swimmer Waypoint