Planning with Goal-Conditioned Policies

Soroush Nasiriany*, Vitchyr Pong*, Steven Lin, Sergey Levine

University of California, Berkeley

Advances in Neural Information Processing Systems (NeurIPS), 2019

*Equal Contribution

Paper | Code

Summary Video (3 Minutes)

LEAP_summary_3min__Nasiriany_Pong_Lin_Levine.mp4

Motivation and Idea

  • Problem: Solving temporally-extended robotic tasks with high-dimensional state spaces is a major challenge.
  • Solution: We perform planning at an abstracted level, decomposing a long-horizon task into a series of short-horizon tasks, each of which is significantly easier to solve.
  • Approach: Optimize for a sequence of high-level subgoals that guide the agent to the goal.
  • Questions: What space captures the set of valid subgoals that we can optimize over? How do we know whether each subgoal in a sequence is reachable from the previous one?

Our Method: Latent Embeddings for Abstracted Planning (LEAP)

We present a method for temporally-extended planning over high-dimensional state spaces by learning a state representation amenable to optimization and a goal-conditioned policy to abstract time.

Overview

  1. The planner is given a starting and goal state.
  2. The planner optimizes intermediate subgoals in a low-dimensional latent space. Because planning happens in this latent space, the subgoals correspond to valid state observations.
  3. The goal-conditioned policy then tries to reach the first subgoal. After t1 time steps, the agent replans and repeats steps 2 and 3 until the final goal is reached (see the sketch below).
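The loop below is a minimal sketch of this replan-and-execute cycle. The names env, encoder, plan_subgoals, and policy are hypothetical stand-ins for the environment, the learned VAE encoder, the subgoal optimizer, and the goal-conditioned policy; they are not the interfaces from our released code.

def run_episode(env, encoder, plan_subgoals, policy, t1, max_steps):
    # Hypothetical environment interface: reset() returns the current
    # observation and the goal; step() returns the next observation and done.
    obs, goal = env.reset()
    z_goal = encoder(goal)                            # encode the goal into the latent space
    subgoals = []
    for step in range(max_steps):
        if step % t1 == 0:                            # replan every t1 time steps
            z_obs = encoder(obs)
            subgoals = plan_subgoals(z_obs, z_goal)   # sequence of latent subgoals
        action = policy(obs, subgoals[0])             # pursue the nearest subgoal
        obs, done = env.step(action)
        if done:
            break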

State Representation

Optimizing subgoals over the space of raw observations is ill-defined, as the set of valid subgoals lies on a low-dimensional manifold of the raw observation space. To address this, we train a β-VAE whose latent space captures the space of valid states from the raw observation space. The latent space provides a state abstraction, which we can use to plan subgoals. We train the β-VAE on a dataset of states collected randomly from the environment.
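For reference, a minimal β-VAE training objective looks like the sketch below; the simple MLP encoder/decoder and squared-error reconstruction term are assumptions for brevity, not the architecture used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, obs_dim, latent_dim, beta):
        super().__init__()
        self.beta = beta
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def loss(self, obs):
        mu, logvar = self.encoder(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        recon = self.decoder(z)
        recon_loss = F.mse_loss(recon, obs, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + self.beta * kl                      # β weights the KL term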

Goal-Conditioned Policy

We need a goal-conditioned policy that measures reachability between a pair of states. We employ temporal difference models (TDMs) from Pong et al. (2018). Given a starting state s and goal g, a TDM measures how close the agent can get from s to g within a short time horizon. For long-horizon tasks, TDMs provide temporal abstraction, allowing us to chain multiple short-horizon tasks into a single long-horizon task.
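In a TDM, the Q-function is conditioned on the goal and on the remaining horizon tau, and its Bellman target switches between a terminal distance and a bootstrapped value. The sketch below illustrates that target; q_net and policy are hypothetical goal- and horizon-conditioned networks, tau is a per-sample tensor of remaining horizons, and observations and goals are assumed to live in the same space.

import torch

def tdm_target(q_net, policy, next_obs, goal, tau):
    # When the horizon has run out (tau == 0), the target is the negative
    # distance to the goal; otherwise bootstrap with the horizon decremented.
    terminal = -torch.norm(next_obs - goal, dim=-1)
    next_tau = (tau - 1).clamp(min=0)
    next_action = policy(next_obs, goal, next_tau)
    bootstrap = q_net(next_obs, next_action, goal, next_tau)
    return torch.where(tau == 0, terminal, bootstrap)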

Subgoal Planner

Given the current state s and goal g, our planner optimizes for a sequence of subgoals over the latent space of the VAE. Each consecutive pair of subgoals is assigned a feasibility score, a metric for how close the agent can get to the next subgoal when starting from the previous one. We maximize the overall feasibility of the plan, and employ an additional penalty that constrains the latent subgoals to stay within the prior distribution of the latent space.
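The objective can be sketched as follows. Here feasibility(z_from, z_to) is a hypothetical callable that scores how reachable one latent subgoal is from another (for example, a TDM value evaluated on the decoded subgoals), and the prior penalty keeps subgoals close to the standard-normal latent prior. Gradient ascent is used purely for compactness; the same objective can also be optimized with sampling-based methods such as the cross-entropy method.

import torch

def plan_subgoals(z_start, z_goal, feasibility, num_subgoals=3,
                  prior_weight=0.1, steps=200, lr=1e-1):
    latent_dim = z_start.shape[-1]
    subgoals = torch.randn(num_subgoals, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([subgoals], lr=lr)
    for _ in range(steps):
        seq = torch.cat([z_start.unsqueeze(0), subgoals, z_goal.unsqueeze(0)])
        # total feasibility over consecutive (subgoal -> next subgoal) segments
        total = sum(feasibility(seq[i], seq[i + 1]) for i in range(len(seq) - 1))
        # penalty keeping subgoals within the latent prior N(0, I)
        prior_penalty = subgoals.pow(2).sum()
        loss = -total + prior_weight * prior_penalty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return subgoals.detach()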

Experiments

2D Navigation

The pointmass must plan a globally-optimal path to reach the goal, navigating from inside the U-shaped wall to the other side.

Push and Reach

The robot must push the puck to the desired puck location, and then move its end effector to the desired end-effector location.

Ant Navigation

The ant must plan a globally-optimal path to reach the goal, from one side of the wall to the other.

Visualizations

We visualize:

  1. true goal: the goal of the task
  2. observation: the current observation
  3. next subgoal: the next subgoal to reach
  4. value function: a heatmap of the value function, showing reachability from the current observation over a short time horizon ahead (see the sketch below)
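As a rough illustration of how such a heatmap can be produced for the 2D tasks, the sketch below evaluates a value function on a grid of candidate positions and renders it with matplotlib; value_fn(position, current_obs) is a hypothetical interface returning the short-horizon reachability value of a position from the current observation.

import numpy as np
import matplotlib.pyplot as plt

def plot_value_heatmap(value_fn, current_obs, lo=-4.0, hi=4.0, n=50):
    xs = np.linspace(lo, hi, n)
    ys = np.linspace(lo, hi, n)
    # evaluate the value of reaching each grid cell from the current observation
    values = np.array([[value_fn(np.array([x, y]), current_obs) for x in xs]
                       for y in ys])
    plt.imshow(values, origin='lower', extent=[lo, hi, lo, hi], cmap='viridis')
    plt.colorbar(label='reachability value')
    plt.title('Value function from the current observation')
    plt.show()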

2D Navigation

Push And Reach

Ant Navigation

Citation

@inproceedings{nasiriany2019planning,
  title={Planning with Goal-Conditioned Policies},
  author={Nasiriany, Soroush and Pong, Vitchyr and Lin, Steven and Levine, Sergey},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}