Planning -- breaking down a large problem into smaller solvable components -- is a core challenge for any intelligent system. The central importance of this challenge is accentuated in high-dimensional, long-horizon, continuous control, where agents need to efficiently plan in order to succeed. Our paper presents a new method for model-based learning and planning, which learns a graph-structured world model that facilitates temporally extended reasoning. By learning the landmarks of an environment in the latent space, our method allows an agent to build a compact and sparse mental map of the world, which can be combined with graph search to stitch together small tasks for long-horizon problems. We demonstrate the success of our new algorithm on a variety of robotics environments.
For the purpose of illustration, consider a simple U-shaped maze environment with an ant robot inside. The ant's goal is to navigate from one end of the maze to the other (denoted by a red dot). This environment is convenient for exposition because it is easy to visualize.
Although this maze looks conceptually simple, controlling the ant is fairly complicated and the reward is sparse. The agent will not receive any signal until it completes a long sequence of high-dimensional continuous actions, making the problem challenging or even impossible for naive model-free methods such as policy gradients. Of course, it would greatly help the agent if it could instead learn to complete smaller tasks en route to the overall goal. For example, in this simple case, the agent could teach itself to first navigate halfway between its current position and the goal. In other words, a long-horizon task can be solved by stitching together much simpler short-horizon tasks.
Planning Over Action Sequences
When I think about planning, I usually think of frameworks like Model-Based RL (MBRL) and Model-Predictive Control (MPC), where a learned dynamics model helps the agent project out a series of desired actions.
The transition distribution, which is estimated over past experience, is used to search over state-action transitions and help the agent plan a route from its current state s to a desired goal state g. We call this approach planning over action sequences, because the agent projects out a series of beneficial actions it would like to take.
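To make planning over action sequences concrete, here is a minimal random-shooting MPC sketch. Everything in it is illustrative rather than the paper's setup: the `dynamics` function is a hypothetical point-mass stand-in for a learned model, and the hyperparameters are arbitrary.

```python
import numpy as np

# Hypothetical point-mass dynamics standing in for a learned model f(s, a) -> s'.
def dynamics(state, action):
    return state + 0.1 * action

def plan_action_sequence(state, goal, horizon=20, n_candidates=256, seed=0):
    """Random-shooting MPC: sample candidate action sequences, roll each one
    out through the model, and keep the sequence whose predicted final state
    lands closest to the goal."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0,
                             size=(n_candidates, horizon, state.shape[0]))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        s = state.copy()
        for a in seq:
            s = dynamics(s, a)  # roll the model forward one step
        cost = np.linalg.norm(s - goal)  # distance of predicted outcome to goal
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

state, goal = np.zeros(2), np.array([1.0, 1.0])
seq, cost = plan_action_sequence(state, goal)
```

Note that the plan is a full sequence of per-timestep actions; the two failure modes discussed next (compounding model error and per-timestep planning cost) are both visible in this sketch.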
Planning over action sequences has two major disadvantages. First, the longer the planning horizon, the more opportunity small errors in the dynamics model have to compound. Learning a model accurate enough to avoid this, it turns out, is very difficult. Learning how to recover from these kinds of small model failures is an active area of research.
We see errors in the learned model (yellow) propagate. Thus planning diverges from the true outcome (green).
The second difficulty with planning over action sequences is that it requires planning over every single timestep, which quickly becomes intractable. If we consider our ant maze example, the agent needs to plan actions over hundreds of timesteps in order to successfully complete the task. Ideally, the agent would not need to plan out its actions at every timestep. Instead, the ideal agent could plan over a more sparse set of transitions across key states in the environment. That is precisely the intuition behind world model as a graph.
World Model as a Graph
When planning over action sequences, the learned transition distribution can be thought of as a model of the agent's world. This type of world model is popular because it is simple to understand and simple to use for look ahead and planning. However, we are interested in world models that are more sparse, in the sense that transitioning between states in the world model represents a major shift in the agent's place within the environment. In other words, transitioning between abstract states in the world model represents a more wholesale change in the agent's state that likely requires several actions to achieve. As we discuss in the paper, this notion of learning sparse states that are useful for planning is related to the idea of landmarks. Indeed, we call the abstract states that we learn latent landmarks.
We want to learn a world model that is actually useful for planning, so simply learning an auto-encoder with a sparsity bottleneck is unlikely to be sufficient. To learn an encoder that is useful for planning, we use the idea of reachability. The reachability from state A to state B is simply the minimum number of actions it takes to transition from A to B. By exploiting the special reward structure of goal-reaching (the agent receives a reward of -1 at every step until the goal is reached), we can obtain reachability estimates by first parameterizing the Q-function in a way that is disentangled from the discount factor during the Q-learning process:
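The construction rests on a simple identity (written here in our own notation; the paper's exact parameterization may differ in details). Under the -1-per-step reward, a policy that reaches the goal in exactly $D$ steps earns a discounted return of

$$Q(s, a, g) \;=\; \sum_{t=0}^{D-1} \gamma^{t} \cdot (-1) \;=\; -\frac{1 - \gamma^{D}}{1 - \gamma}.$$

Parameterizing the Q-network as $Q_\theta(s, a, g) = -\frac{1 - \gamma^{D_\theta(s, a, g)}}{1 - \gamma}$ therefore forces the inner network $D_\theta$ to predict the number of steps to the goal, independent of the particular choice of $\gamma$.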
And then we can distill the reachability estimates V from the function D:
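One natural way to write this distillation (a sketch in our notation, not necessarily the paper's exact loss) is to regress a value network $V_\psi$ onto $D_\theta$ evaluated at the policy's own action:

$$\mathcal{L}(\psi) \;=\; \mathbb{E}_{s,\,g}\!\left[\big(V_\psi(s, g) - D_\theta(s, \pi(s, g), g)\big)^{2}\right],$$

so that $V_\psi(s, g)$ estimates how many steps the current policy needs to travel from $s$ to $g$, without requiring an action as input.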
Given this reachability estimate, the key idea is to constrain the latent space of the auto-encoder such that the distance between two latent embeddings roughly corresponds to the reachability estimate:
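Concretely, one way to impose such a constraint is a regression penalty on pairs of states (again a sketch; since Euclidean distance is symmetric while reachability need not be, some symmetrization such as averaging $V(s_i, s_j)$ and $V(s_j, s_i)$ may be needed in practice):

$$\mathcal{L}_{\text{latent}} \;=\; \mathbb{E}_{s_i,\,s_j}\!\left[\big(\lVert \phi(s_i) - \phi(s_j) \rVert_2 - V(s_i, s_j)\big)^{2}\right],$$

where $\phi$ is the encoder.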
Now that we have a latent space where the distance between nodes in the space roughly corresponds to how far apart (in terms of reachability) the points are for the agent, we can simply do clustering in this latent space to obtain latent landmarks:
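As a concrete sketch of this step, plain k-means in the latent space already captures the idea: the centroids play the role of latent landmarks. This is an illustrative stand-in for whatever clustering variant the paper actually uses, and the toy embeddings below are synthetic.

```python
import numpy as np

def kmeans(latents, k, n_iters=50, seed=0):
    """Plain k-means over latent embeddings; the returned centroids
    serve as latent landmarks."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen embeddings.
    centroids = latents[rng.choice(len(latents), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each embedding to its nearest centroid.
        d = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned embeddings.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = latents[assign == j].mean(axis=0)
    return centroids, assign

# Toy latent embeddings with two well-separated groups.
latents = np.concatenate([np.random.default_rng(1).normal(0.0, 0.1, (50, 2)),
                          np.random.default_rng(2).normal(5.0, 0.1, (50, 2))])
landmarks, assign = kmeans(latents, k=2)
```

Because the latent metric approximates reachability, centroids in this space correspond to states that are roughly evenly spaced in terms of how many actions it takes to move between them.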
We now proceed to construct a graph over these latent landmarks, with the decoded latent centroids as the nodes and the reachability estimates as the edge weights.
Notice we have drawn the learned latent states as nodes on a graph, with the sparse multistep transitions between them as edges. This is deliberate: our high-level planning strategy is to perform graph search over the graph of encoded latent landmarks. In other words, how can we select a sequence of nodes for the agent to traverse in order to accomplish a high-level goal? We use the reachability estimates that we used for clustering as the weights of the edges on the graph. This makes intuitive sense, since we are trying to find a shortest path through the latent states to the desired goal state.
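The graph construction can be sketched as follows. Here `reachability(u, v)` is a hypothetical stand-in for the learned estimate of how many low-level steps it takes to travel from one landmark to another (in the toy example below we just use Euclidean distance), and `max_dist` is an illustrative cutoff for dropping edges between landmarks that are effectively unreachable from one another.

```python
import numpy as np

def build_landmark_graph(landmarks, reachability, max_dist=np.inf):
    """Weighted adjacency matrix over latent landmarks.
    Edges whose reachability estimate exceeds `max_dist` are dropped
    (represented as infinite weight)."""
    n = len(landmarks)
    adj = np.full((n, n), np.inf)
    for i in range(n):
        adj[i, i] = 0.0
        for j in range(n):
            if i != j:
                w = reachability(landmarks[i], landmarks[j])
                if w <= max_dist:
                    adj[i, j] = w
    return adj

# Toy stand-in: Euclidean distance between landmark states as "reachability".
landmarks = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
adj = build_landmark_graph(landmarks,
                           lambda u, v: float(np.linalg.norm(u - v)),
                           max_dist=1.2)
```

Note that the adjacency matrix is directed: reachability from A to B need not equal reachability from B to A.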
For online planning, we use a modified version of the Floyd-Warshall algorithm to do graph search. A low-level model-free policy learns to transition between the nodes suggested by graph search. The agent is allowed to use the model to replan as it goes, but crucially it must stick to each plan for a specific number of steps (the predicted reachability toward the sub-goal) before changing plans. We found that forcing the agent to stick to the same plan for its predicted reachability horizon was a crucial part of training and sampling.
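For reference, here is the classic Floyd-Warshall recurrence over a small landmark graph, with path reconstruction so the planner can read off the sequence of landmarks to visit. This is the textbook algorithm, not the paper's modified variant; the toy weight matrix is illustrative.

```python
import numpy as np

def floyd_warshall(adj):
    """All-pairs shortest paths: dist[i, j] is the cheapest cost from i to j,
    and nxt[i, j] is the first hop on that path (-1 if unreachable)."""
    n = adj.shape[0]
    dist = adj.copy()
    nxt = np.where(np.isfinite(adj), np.arange(n)[None, :], -1)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Relax: is going through intermediate node k cheaper?
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
                    nxt[i, j] = nxt[i, k]
    return dist, nxt

def shortest_path(nxt, i, j):
    """Recover the landmark sequence from i to j by following first hops."""
    path = [i]
    while path[-1] != j:
        path.append(int(nxt[path[-1], j]))
    return path

# Tiny graph where the route 0 -> 1 -> 2 beats the direct 0 -> 2 edge.
adj = np.array([[0.0, 1.0, 5.0],
                [1.0, 0.0, 1.0],
                [5.0, 1.0, 0.0]])
dist, nxt = floyd_warshall(adj)
```

The resulting path is exactly the sequence of sub-goals handed to the low-level policy, which then works on one hop at a time.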
If we project the learned latent space back onto the agent's state space, we can get an idea of the spatial position of each of the agent's learned latent landmarks. In the case of our ant maze example, the distribution of learned latent landmarks looks like this:
We see the learned landmarks, visualized as blue dots, are roughly uniformly spread throughout the state space. This is more or less in line with our expectations of this example. The agent's position is represented by the green dot. The planner's current goal is the black star, and the desired final goal is the red diamond. We can use this representation to chart the planner's course through the environment, like so:
As desired, the planner learns to move between close landmarks to successfully navigate the agent to the goal. Looking at the actual environment, we can watch this process unfold in real time.
We see that the ant's motion is somewhat jittery, as it seems to have some uncertainty about when exactly it reaches the desired latent landmarks. Nevertheless, it is able to successfully navigate the maze without much trouble.
In our paper, we also consider planning over a variety of manipulation tasks involving a Fetch robot. The most difficult task involves a block and a box. During training, the agent is mostly concerned with simply learning to pick and place the block. During test time, we ask the agent to generalize to placing the block inside the box. This is difficult, because collisions with the box can greatly destabilize planning.
Looking at the states corresponding to the learned latent landmarks, we see that they are generally states that help avoid collisions with the box. Indeed, the trained agent seems happy to avoid the box whenever necessary, which was in many ways the desired behavior.
Happily, all of this effort to safely navigate around the box proves worthwhile, as the high-level planner successfully instructs the agent to place the block in the box by navigating the learned latent landmarks.
Full videos of the train time and test time behavior for this task can be seen below.