IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control

Rohan Chitnis*, Yingchen Xu*, Bobak Hashemi, Lucas Lehnert,
Urun Dogan, Zheqing Zhu, Olivier Delalleau

Meta AI, FAIR

Published at ICRA 2024

arXiv link

Abstract

Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.
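To make the Manager–Worker interaction concrete, below is a minimal sketch of how a pre-trained IQL-TD-MPC Manager could augment a Worker's observations with intent embeddings at rollout time. The names (`manager.plan_intent`, `worker_policy`), the re-planning interval `K`, and the classic gym step/reset signatures are illustrative assumptions, not the exact interface used in the paper.

```python
# Illustrative sketch only: plan_intent, worker_policy, and K are assumed names,
# not the paper's actual API. Uses the classic gym step/reset signatures.
import torch

K = 25  # hypothetical number of environment steps between Manager re-plans


@torch.no_grad()
def rollout_with_intents(env, manager, worker_policy, max_steps=1000):
    """Roll out a Worker whose observations are augmented with the intent
    embedding (a latent-space subgoal) planned by a frozen IQL-TD-MPC Manager."""
    obs = env.reset()
    intent = None
    for t in range(max_steps):
        if t % K == 0:
            # The temporally abstract Manager plans in its latent space and
            # emits an intent embedding roughly corresponding to a subgoal.
            intent = manager.plan_intent(torch.as_tensor(obs, dtype=torch.float32))
        # The Worker (any off-the-shelf offline RL policy) conditions on the
        # raw state concatenated with the current intent embedding.
        aug_obs = torch.cat([torch.as_tensor(obs, dtype=torch.float32), intent])
        action = worker_policy(aug_obs)
        obs, reward, done, info = env.step(action.cpu().numpy())
        if done:
            break
```

During offline training of the Worker, the same augmentation would be applied to states in the fixed dataset, with the intents produced by the frozen, pre-trained Manager.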

IQL-TD-MPC

Videos of Experiments

Behavioral Cloning (BC) on Antmaze

We visualize an episode of the Behavioral Cloning (BC) agent on the antmaze-large-play-v2 task below. On the left, without intent embeddings, the ant gets stuck close to the start of the maze and never reaches the goal. On the right, the ant reaches the goal, guided by the intent embeddings whose decoding is visualized in green. To generate these green visualizations, we trained a separate decoder alongside the IQL-TD-MPC Manager that converts intent embeddings from the Manager's latent space back into the raw environment state space, which contains the position and velocity of the ant. The green dot shows the position, and the green line attached to the dot shows the velocity (its length indicates speed). This visualization shows that the intent embeddings effectively act as latent-space subgoals that the Worker exploits to learn a more effective policy.
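For reference, here is a minimal sketch of what such a visualization decoder and the dot-and-line rendering could look like. The layer sizes, the assumption that the decoded state stores position in its first two dimensions and velocity in the next two, and the dimensions in the usage comment are all illustrative, not the exact setup used for these videos.

```python
# Illustrative sketch only: architecture, state layout, and plotting details are
# assumptions; the decoder used for the videos may differ.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


class IntentDecoder(nn.Module):
    """Small MLP mapping an intent embedding (in the Manager's latent space)
    back to the raw environment state space; it can be trained alongside the
    Manager with an MSE loss against raw states."""

    def __init__(self, latent_dim, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, intent):
        return self.net(intent)


def draw_decoded_intent(ax, decoded_state, color="green", scale=1.0):
    """Render a decoded subgoal on a matplotlib Axes: a dot at its (x, y)
    position and a line whose length is proportional to its speed."""
    x, y = decoded_state[0], decoded_state[1]    # assumed position indices
    vx, vy = decoded_state[2], decoded_state[3]  # assumed velocity indices
    ax.plot([x], [y], marker="o", color=color)
    ax.plot([x, x + scale * vx], [y, y + scale * vy], color=color)


# Example usage (hypothetical dimensions):
# fig, ax = plt.subplots()
# decoded = IntentDecoder(latent_dim=50, state_dim=29)(intent).detach().numpy()
# draw_decoded_intent(ax, decoded)
```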

Without Intent Embeddings

With Intent Embeddings

Implicit Q-Learning (IQL) on Maze2D

Next, we visualize an episode of the Implicit Q-Learning (IQL) agent on the maze2d-large-v1 task below. The optimal policy is to control the ball (green) to reach the goal (red) and then stay in place until the episode times out. In both cases, the ball makes progress toward the goal. The difference is twofold: (1) without intent embeddings, the ball takes somewhat longer to reach the goal for the first time; and (2) without intent embeddings, after first reaching the goal, the ball wanders far away from it before returning. This is possibly because the IQL agent did not see enough data concentrated around the goal, so it has not properly learned to stay in place once it gets there. By contrast, we again see that our intent embeddings, whose decoding is visualized in blue, act as latent-space subgoals that guide the ball to the goal. These blue visualizations were generated in the same way as the green ones for antmaze-large-play-v2 above (we only changed the color because the ball is already green in maze2d).

Without Intent Embeddings

With Intent Embeddings