Efficient Planning in a Compact Latent Action Space

Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, Yuandong Tian

Planning-based reinforcement learning has shown strong performance in tasks with discrete and low-dimensional continuous action spaces. However, scaling such methods to high-dimensional action spaces remains challenging. We propose the Trajectory Autoencoding Planner (TAP), which learns a compact discrete latent action space from offline data for efficient planning, enabling planning with a learned model for high-dimensional continuous control.

Dimension Scalability

Computational Scalability

Performance Scalability

As state/action dimensionality increases, the decision latency of the Trajectory Transformer (TT) grows quickly because of its dimension-wise autoregressive modelling. The decision latency of TAP is much lower and is unaffected by the state-action dimensionality. In addition, the relative advantage of TAP over the baselines grows with the action dimensionality. For Adroit manipulation tasks with 24 degrees of freedom, TAP surpasses existing model-based methods, including TT, by a large margin and also beats strong model-free actor-critic baselines.

TAP Modelling

Project to Discrete Latent Space

Model Latent Codes with a Transformer

Using a state-conditional VQ-VAE, TAP approximates the conditional distribution of trajectory segments given the current state with a discrete (categorical) distribution, where each latent code corresponds to three steps of a possible continuation of the existing trajectory. The distribution over latent codes is then modelled by an autoregressive Transformer, which is also conditioned on the initial state.
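The sketch below is a simplified illustration of how these two components could fit together in PyTorch, not the released implementation: a state-conditional VQ-VAE that encodes fixed-length trajectory chunks into discrete codes, and an autoregressive Transformer prior over the code indices. All sizes (hidden width, codebook size, the chunk length of 3, and transition_dim as the flattened per-step features) are illustrative assumptions.

import torch
import torch.nn as nn

class TrajectoryVQVAE(nn.Module):
    # Encodes chunk_len-step trajectory chunks into discrete codes, conditioned on the current state.
    def __init__(self, transition_dim, hidden=256, code_dim=128, n_codes=512, chunk_len=3):
        super().__init__()
        self.chunk_len = chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(chunk_len * transition_dim + transition_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, code_dim))
        self.codebook = nn.Embedding(n_codes, code_dim)
        self.decoder = nn.Sequential(
            nn.Linear(code_dim + transition_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, chunk_len * transition_dim))

    def quantize(self, z):
        # nearest codebook entry, with a straight-through estimator for gradients
        dist = torch.cdist(z, self.codebook.weight)            # (B, n_codes)
        idx = dist.argmin(dim=-1)                              # (B,)
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx

    def forward(self, chunk, state):
        # chunk: (B, chunk_len, transition_dim), state: (B, transition_dim)
        z = self.encoder(torch.cat([chunk.flatten(1), state], dim=-1))
        z_q, idx = self.quantize(z)
        recon = self.decoder(torch.cat([z_q, state], dim=-1))
        return recon.view_as(chunk), idx

class LatentPrior(nn.Module):
    # Autoregressive Transformer over latent code indices, conditioned on the initial state.
    def __init__(self, transition_dim, n_codes=512, d_model=128, n_layers=4):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, d_model)
        self.state_emb = nn.Linear(transition_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, codes, state):
        # codes: (B, T) int64 code indices, state: (B, transition_dim)
        x = torch.cat([self.state_emb(state)[:, None], self.code_emb(codes)], dim=1)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float('-inf')), diagonal=1)
        h = self.transformer(x, mask=causal)
        return self.head(h[:, :-1])                            # next-code logits for each prefix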

TAP Planning

When deployed as an RL agent, TAP avoids optimizing in the high-dimensional continuous action space; instead, it searches for the optimal plan in the latent space, by sampling or beam search, according to the distribution modelled by the Transformer.

The objective function for the search combines the estimated return with a penalty for out-of-distribution trajectories. In effect, the objective favours the plan with the highest estimated return among those whose probability under the model exceeds a threshold.
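Below is a minimal sketch of this kind of objective (the exact formulation in the paper may differ, and the penalty weight and threshold here are illustrative assumptions): the score is the estimated return, reduced by a large penalty whenever the plan's log-probability under the prior falls below the threshold.

import torch

def plan_score(est_return, log_prob, log_prob_threshold, penalty_weight=1e5):
    # est_return, log_prob: (n_plans,) tensors for a batch of candidate plans.
    # Plans whose log-probability falls below the threshold are penalized in
    # proportion to the violation; above the threshold, only the return matters.
    violation = (log_prob_threshold - log_prob).clamp(min=0.0)
    return est_return - penalty_weight * violation

With a sufficiently large penalty weight, maximizing this score effectively selects the highest-return plan among those above the probability threshold, which matches the behaviour described above.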

Direct Sampling

Beam Search

Here we visualize the plans generated by both direct sampling and beam search with 256 samples (beam width = 64 and expansion factor = 4 for beam search). Each frame shows 256 latent codes and the corresponding states at a particular step of the plan, where all plans start from the same initial state of the hopper task. The trajectories are sorted by objective score, so the front-most trajectory is the plan that will be executed. Direct sampling generates more diverse trajectories, but most of them have low prior probability. Some of the predicted trajectories (shown opaque) move quickly but do not follow the environment dynamics. The chosen plan follows the true dynamics but is suboptimal, as the hopper is falling down. Beam search, on the other hand, generates trajectories with both high return and high probability. The model predicts 144 steps into the future and is trained on hopper-medium-replay.
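As a rough sketch of how such a beam search over latent codes can be organized (the function names and interfaces here are hypothetical, not TAP's actual API): each partial plan in the beam is expanded with a few codes sampled from the Transformer prior, and only the top-scoring candidates under the planning objective are kept at every step.

import torch

@torch.no_grad()
def latent_beam_search(next_code_logits, score_fn, state, n_steps, beam_width=64, expansion=4):
    # next_code_logits(state, codes) -> (beam, n_codes) logits for the next code (assumed interface)
    # score_fn(state, codes) -> (beam,) objective scores for partial plans (assumed interface)
    codes = torch.zeros(beam_width, 0, dtype=torch.long)        # start from empty plans
    for _ in range(n_steps):
        probs = next_code_logits(state, codes).softmax(-1)
        nxt = torch.multinomial(probs, expansion)               # expand each plan with sampled codes
        codes = torch.cat([codes.repeat_interleave(expansion, dim=0),
                           nxt.reshape(-1, 1)], dim=1)          # (beam * expansion, t + 1)
        keep = score_fn(state, codes).topk(beam_width).indices  # prune back to the beam width
        codes = codes[keep]
    return codes[score_fn(state, codes).argmax()]               # best latent plan, to be decoded into actions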

Experiment Results

Gym Locomotion (3-8 degrees of freedom) Results

Adroit Robotic Hand Control (24 degrees of freedom) Results

Without advanced value estimation or policy iteration, TAP performs competitively on the D4RL gym locomotion tasks. Among these tasks, TAP does better on the ant tasks, which have higher action dimensionality (8 degrees of freedom). For Adroit robotic hand control with 24 degrees of freedom, TAP not only surpasses model-based offline RL methods (Opt-MOPO and TT) by a large margin but also outperforms strong model-free baselines (CQL and IQL).

Ablation Results

The ablation studies on gym locomotion control show how the key design choices affect TAP's performance. In particular, having each latent variable cover 3 steps of the trajectory helps both sampling speed and generalization. The objective function that considers both return and trajectory likelihood is also better than pure return maximization or pure probability maximization (behaviour cloning). On the other hand, planning with a shorter horizon does not hurt performance much, and the improvement from longer planning saturates beyond a horizon of 15. We also tested direct sampling of 2048 trajectories, either from the prior or uniformly, instead of beam search. Beam search still performs slightly better, and its decision latency is also lower. Sampling from the prior rather than uniformly over all latent codes also makes a big difference.

Citation

@article{jiang2023latentplan,
  author  = {Jiang, Zhengyao and Zhang, Tianjun and Janner, Michael and Li, Yueying and Rocktäschel, Tim and Grefenstette, Edward and Tian, Yuandong},
  title   = {Efficient Planning in a Compact Latent Action Space},
  journal = {ICLR 2023},
  year    = {2023},
}