Watch, Try, Learn

Meta-Learning from Demonstrations and Rewards

Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, Chelsea Finn

Google Brain, X, and UC Berkeley

[paper][code]

Abstract

Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks.

Meta-Training

1. User records demonstrations

We record a handful of demonstrations per task for hundreds of different tasks using a virtual reality setup.
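For concreteness, here is a hypothetical sketch of how the per-task demonstration data might be organized; the Transition and TaskData names and fields are illustrative, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Transition:
    observation: np.ndarray   # e.g. an RGB image or low-dimensional robot state
    action: np.ndarray        # e.g. an end-effector / gripper command

@dataclass
class TaskData:
    task_id: int
    demos: List[List[Transition]] = field(default_factory=list)  # a handful of VR demos per task

# The meta-training set spans hundreds of such tasks.
meta_train_tasks: List[TaskData] = []
```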


2. Train a trial policy with meta-imitation

We train the trial policy to infer how to solve the task given just one or a few task demonstrations.
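As a rough illustration of this step, below is a minimal behavior-cloning sketch in PyTorch. It assumes the trial policy is a network conditioned on an encoding of one demonstration; the architecture, dimensions, and the TrialPolicy / meta_imitation_step names are illustrative rather than the paper's actual vision-based model.

```python
import torch
import torch.nn as nn

class TrialPolicy(nn.Module):
    """Maps (observation, demo context) -> action. The demonstration
    conditioning is what lets the policy adapt to a new task."""
    def __init__(self, obs_dim, act_dim, ctx_dim=64):
        super().__init__()
        self.demo_encoder = nn.GRU(obs_dim + act_dim, ctx_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(obs_dim + ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim))

    def forward(self, obs, demo):
        # demo: (batch, time, obs_dim + act_dim) -- a flattened demonstration
        _, ctx = self.demo_encoder(demo)
        return self.head(torch.cat([obs, ctx[-1]], dim=-1))

def meta_imitation_step(policy, optimizer, demo, obs, expert_action):
    """One meta-imitation (behavior cloning) update: condition on one demo
    of a task and regress onto expert actions from another demo of it."""
    loss = ((policy(obs, demo) - expert_action) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```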

3. Collect trial trajectories with this policy

We run our trained trial policy in the environment, collecting the resulting trajectories and recording whether each trajectory was a success or failure.
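Concretely, trial collection can be sketched as below; the gym-style env interface and the policy.act(...) wrapper are assumptions for illustration, and the reward is treated as a sparse, binary success signal.

```python
def collect_trials(env, trial_policy, demo, num_trials=1, max_steps=100):
    """Roll out the demo-conditioned trial policy and log each trajectory
    together with its sparse success / failure outcome."""
    trials = []
    for _ in range(num_trials):
        obs = env.reset()
        trajectory, success = [], False
        for _ in range(max_steps):
            action = trial_policy.act(obs, demo)      # hypothetical act() wrapper
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward))
            success = success or reward > 0           # sparse binary reward
            obs = next_obs
            if done:
                break
        trials.append({"trajectory": trajectory, "success": success})
    return trials
```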

4. Learn from the trial data

We train a new policy to infer how to solve the task given one or a few demonstrations and the corresponding trial trajectories.
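Here is a sketch of what this second-phase (retrial) policy might look like, continuing the PyTorch conventions above; the extra trial encoder and the way the reward is appended to each trial step are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RetrialPolicy(nn.Module):
    """Like the trial policy, but conditioned on both the demonstration and
    a trial trajectory (with its reward), so it can correct failed attempts."""
    def __init__(self, obs_dim, act_dim, ctx_dim=64):
        super().__init__()
        self.demo_encoder = nn.GRU(obs_dim + act_dim, ctx_dim, batch_first=True)
        self.trial_encoder = nn.GRU(obs_dim + act_dim + 1, ctx_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(obs_dim + 2 * ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim))

    def forward(self, obs, demo, trial):
        # trial: (batch, time, obs_dim + act_dim + 1), last channel = reward
        _, d = self.demo_encoder(demo)
        _, t = self.trial_encoder(trial)
        return self.head(torch.cat([obs, d[-1], t[-1]], dim=-1))

# Training mirrors meta_imitation_step above: behavior-clone expert actions
# from a held-out demo of the same task, now also conditioning on the trial
# so the policy can infer what went wrong (or right).
```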

Deployment

In this pick and place task, the user-provided demo moves the cup to the right edge of the table.

During the trial, our method moves the cup to the wrong side of the table and receives zero reward.

Our method learns from the incorrect trial and solves the task.
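Putting the pieces together, deployment on a new task might look like the following sketch, reusing the hypothetical collect_trials helper and .act(...) wrappers from above.

```python
def deploy(env, trial_policy, retrial_policy, demo, max_steps=100):
    """Watch one demo, try once with the trial policy, then retry with the
    retrial policy conditioned on that (possibly failed) trial."""
    trial = collect_trials(env, trial_policy, demo, num_trials=1)[0]
    obs = env.reset()
    success = False
    for _ in range(max_steps):
        action = retrial_policy.act(obs, demo, trial)   # hypothetical wrapper
        obs, reward, done, _ = env.step(action)
        success = success or reward > 0
        if done:
            break
    return success
```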

Gripper Experiments

We evaluated our method on complex tasks in a realistic simulation environment, using both state-space and vision-based observations. Each task belongs to one of four families: button pressing, grasping, pushing, and pick and place. Examples of the first three are shown below; a pick and place example is shown above.

The goal is to push one of the two buttons.

The goal is to grasp and lift one of the two objects.

The goal is to push one of the objects into the other.