Learning to Generalize across Long-Horizon Tasks from Human Demonstrations

RSS 2020 Talk

Motivation

  • Imitation Learning is effective for real-world robot learning, since it does not require an expensive online exploration process.

  • But learning policies that can generalize beyond the demonstrated behaviors is still an open challenge.

  • We present an imitation learning framework that enables robots to learn complex real-world manipulation tasks efficiently and synthesize new behaviors from training demonstrations.

  • Our key insight:

    • Multi-task domains often contain latent structure that allows demonstration trajectories to intersect in different regions of the state space (see figure below).

    • We leverage such structure in conjunction with offline imitation learning to train goal-directed policies that can generalize to unseen start and goal state combinations.

  • We present Generalization Through Imitation (GTI), a two-stage algorithm.

    • In the first stage, we train a stochastic policy that leverages trajectory intersections to compose behaviors from different demonstration trajectories.

    • In the second stage, we collect a small set of rollouts from the undirected stochastic policy and train a goal-directed agent that generalizes to the novel start and goal configurations visited by the stochastic policy.

Task demonstrations often intersect in certain areas of the state space. For example, assume we are given a set of demonstrations from A0 to AG (orange) and from B0 to BG (blue). By composing different demonstration sequences together, it should be possible to leverage such intersections so that policies also generalize from A0 to BG and from B0 to AG.

  • Consider the setup above. Task demonstrations are provided from A0 to AG (orange) and from B0 to BG (blue), and these trajectories intersect in a certain region of the state space.

  • Then, by imitating part of an orange trajectory and part of a blue trajectory, it should be possible to generalize to new start and goal pairs, such as going from A0 to BG or B0 to AG, even though no training examples explicitly show how to do this.

  • In the domain above, the orange demonstrations correspond to retrieving a loaf of bread from a closed container, placing it into a bowl, and serving it on a plate. The blue demonstrations correspond to picking a loaf of bread off the table, placing it into a bowl, and then placing it into an oven.

  • Our goal is to enable a policy to generalize to new start and goal pairs - such as retrieving the bread from the closed container and placing it into the oven.

Contributions

  1. We propose Generalization Through Imitation (GTI) - a novel algorithm for compositional task generalization based on learning from a fixed number of human demonstrations of long-horizon tasks.

  2. We present a real-world end-to-end imitation learning system that solves complex long-horizon manipulation tasks and generalizes to new pairs of start and goal task configurations.

  3. We demonstrate the effectiveness of our approach in both simulated and real-world experiments and show that it is possible to learn novel, unseen goal-directed behavior in long-horizon manipulation domains from under an hour of human demonstrations.

Task Demonstrations

We collected task demonstrations using the RoboTurk teleoperation interface.

task_demo1.mp4

Start: Bread in Container

End: Bread on Plate


task_demo2.mp4

Start: Bread on Table

End: Bread in Oven


GTI Algorithm Overview

  1. (Human Data Collection) We collect task demonstrations using the RoboTurk interface, which allows a human operator to teleoperate the robot arm with full 6-DoF control using a smartphone.

  2. (Stage 1 GTI Training) We train our Stage 1 policy, which has two main components. The first is a conditional Variational Autoencoder (cVAE) that models the distribution of future image observations conditioned on the current observation. The second is a recurrent goal-conditioned policy trained to imitate action sequences conditioned on goal observations, similar to prior works such as IRIS and Learning From Play. An important difference is that the policy is conditioned on latent goals, obtained by encoding image observations with the cVAE (see the first sketch after this list).

  3. (Autonomous Data Collection) We collect a new dataset by rolling out the stochastic Stage 1 policy. Diverse latent goals are sampled from the cVAE's Gaussian Mixture Model (GMM) prior and used to condition the policy, encouraging it to leverage trajectory intersections and, by selectively imitating parts of different trajectories, generate trajectories that were not seen in the original dataset (see the second sketch after this list).

  4. (Stage 2 GTI Training) We train a goal-conditioned policy on the new dataset to distill the undirected Stage 1 behavior into a goal-directed model that conditions directly on a final desired goal image in order to achieve it. The new policy is capable of generalizing to unseen pairs of start and goal observations.
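
To make Stage 1 concrete, below is a minimal PyTorch sketch of the two components from step 2: a cVAE over future observation features with a learned GMM prior, and a recurrent policy conditioned on the cVAE latent as its goal. The layer sizes, latent dimension, number of mixture modes, and the assumption that images have already been encoded into fixed-size feature vectors are illustrative placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class GoalProposalCVAE(nn.Module):
    """cVAE over future observation features, conditioned on the current observation."""

    def __init__(self, obs_dim=64, latent_dim=16, num_modes=10, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim * 2),           # posterior mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),                  # reconstructed future observation
        )
        # Learned GMM prior p(z | s_t): mixture logits, means, and log-stds per mode.
        self.prior_net = nn.Linear(obs_dim, num_modes * (1 + 2 * latent_dim))
        self.latent_dim, self.num_modes = latent_dim, num_modes

    def prior(self, obs):
        params = self.prior_net(obs)
        k, d = self.num_modes, self.latent_dim
        logits, means, log_stds = torch.split(params, [k, k * d, k * d], dim=-1)
        comp = D.Independent(
            D.Normal(means.view(-1, k, d), log_stds.view(-1, k, d).exp()), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), comp)

    def forward(self, obs, future_obs):
        mu, log_var = self.encoder(torch.cat([obs, future_obs], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
        recon = self.decoder(torch.cat([obs, z], dim=-1))
        return recon, z, mu, log_var


class LatentGoalPolicy(nn.Module):
    """Recurrent policy pi(a_t | s_t, z) conditioned on a latent goal z from the cVAE."""

    def __init__(self, obs_dim=64, latent_dim=16, action_dim=7, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim + latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, obs_seq, z, state=None):
        # Broadcast the latent goal across the time dimension of the observation sequence.
        z_seq = z.unsqueeze(1).expand(-1, obs_seq.shape[1], -1)
        out, state = self.rnn(torch.cat([obs_seq, z_seq], dim=-1), state)
        return self.head(out), state
```

In training, the cVAE reconstruction term and a divergence term against the GMM prior would be combined with a behavior-cloning loss on the recurrent policy's predicted actions; the exact objectives and weights are omitted from this sketch.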
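
Similarly, here is a hedged sketch of steps 3 and 4: rolling out the Stage 1 policy while resampling latent goals from the GMM prior every few steps, then distilling the collected rollouts into a Stage 2 goal-conditioned policy by behavior cloning, with each trajectory's final observation standing in for the goal image. The `env` object is assumed to follow a Gym-style reset()/step() interface over observation features, `goal_policy` is a hypothetical stand-in for the image-goal-conditioned network, and the horizon, goal-resampling interval, and optimizer settings are placeholders.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def collect_stage1_rollout(env, cvae, policy, horizon=400, goal_interval=20):
    """Roll out the Stage 1 policy, resampling a latent goal every few steps."""
    obs = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
    traj = {"obs": [], "actions": []}
    z, state = None, None
    for t in range(horizon):
        if t % goal_interval == 0:          # draw a fresh latent goal from the GMM prior
            z = cvae.prior(obs).sample()
            state = None                    # reset the policy's recurrent state
        action, state = policy(obs.unsqueeze(1), z, state)
        action = action[:, -1]              # last (and only) step of the sequence output
        traj["obs"].append(obs.squeeze(0))
        traj["actions"].append(action.squeeze(0))
        next_obs, _, done, _ = env.step(action.squeeze(0).numpy())
        obs = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)
        if done:
            break
    return traj


def train_stage2(goal_policy, trajectories, epochs=50, lr=1e-3):
    """Behavior cloning on Stage 1 rollouts, conditioned on each rollout's final observation."""
    opt = torch.optim.Adam(goal_policy.parameters(), lr=lr)
    for _ in range(epochs):
        for traj in trajectories:
            obs = torch.stack(traj["obs"])           # (T, obs_dim)
            acts = torch.stack(traj["actions"])      # (T, action_dim)
            goal = obs[-1].expand_as(obs)            # final observation acts as the goal
            pred = goal_policy(torch.cat([obs, goal], dim=-1))
            loss = nn.functional.mse_loss(pred, acts)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return goal_policy
```

Under these assumptions, `goal_policy` can be any module mapping concatenated observation and goal features to an action, e.g. `nn.Sequential(nn.Linear(2 * 64, 256), nn.ReLU(), nn.Linear(256, 7))`; the system described above instead conditions directly on goal images.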

Qualitative Results

GTI Stage 1 Diverse Policy Rollouts

Imitation of Seen Behaviors

stage1_1.mp4

Start: Bread in Container

End: Bread on Plate


stage1_2.mp4

Start: Bread on Table

End: Bread in Oven


Discovery of New Behaviors

stage1_3.mp4

Start: Bread in Container

End: Bread in Oven


stage1_4.mp4

Start: Bread on Table

End: Bread on Plate


GTI Stage 2: Generalizing to New Goals

Goal Observation

stage2_video.mp4

Goal-Conditioned Policy Rollout


GTI Stage 1: Latent Goals for Exploration

stage1_latent_goals.mp4

Sampling latent goals allows the policy to get "unstuck" in regions of conflicting supervision.

GTI Stage 1: Error Recovery

stage1_error_recovery.mp4

Closed-loop behavior that demonstrates the ability of the policy to recover from failures.