One-Shot Imitation Learning

Authors: Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, Wojciech Zaremba

Abstract:

Imitation learning has been commonly applied to solve different tasks in isolation. This usually requires either careful feature engineering, or a significant number of samples. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given task, and instantly generalize to new situations of the same task, without requiring task-specific engineering. In this paper, we propose a meta-learning framework for achieving such capability, which we call one-shot imitation learning.

Specifically, we consider the setting where there is a very large set of tasks, and each task has many instantiations. For example, a task could be to stack all blocks on a table into a single tower, another task could be to place all blocks on a table into two-block towers, etc. In each case, different instances of the task would consist of different sets of blocks with different initial states. At training time, our algorithm is presented with pairs of demonstrations for a subset of all tasks. A neural net is trained that takes as input one demonstration and the current state (which initially is the initial state of the other demonstration of the pair), and outputs an action with the goal that the resulting sequence of states and actions matches as closely as possible with the second demonstration. At test time, a demonstration of a single instance of a new task is presented, and the neural net is expected to perform well on new instances of this new task. The use of soft attention allows the model to generalize to conditions and tasks unseen in the training data. We anticipate that by training this model on a much greater variety of tasks and settings, we will obtain a general system that can turn any demonstrations into robust policies that can accomplish an overwhelming variety of tasks.
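
To make the training setup concrete, here is a minimal sketch of one such training step in Python. Everything below is illustrative rather than the paper's actual code: policy_net, sample_task, and sample_demo_pair are hypothetical placeholders, and the loss is a generic behavioral-cloning objective.

    import torch

    # Illustrative one-shot imitation training step (a sketch, not the paper's code).
    # policy_net(demo, state) -> predicted action; all helper names are hypothetical.
    def training_step(policy_net, optimizer, sample_task, sample_demo_pair):
        task = sample_task()                     # e.g. a block-stacking layout
        demo_a, demo_b = sample_demo_pair(task)  # two demonstrations of the same task

        # Condition on demo_a; supervise with the state-action pairs of demo_b.
        loss = torch.zeros(())
        for state, expert_action in demo_b:
            predicted_action = policy_net(demo_a, state)
            loss = loss + torch.nn.functional.mse_loss(predicted_action, expert_action)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()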

Paper on arXiv

Illustration of one-shot imitation learning.

The left side shows the demonstration trajectory, and the right side shows the learned policy executing in a new situation of the same task, conditioned on the entire demonstration. The task to be achieved is to stack blocks into 4 towers: "ab," "cde," "fg," and "hij," where the blocks within each group are ordered from top to bottom.
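
As an aside, decoding a layout string like "ab cde fg hij" into concrete stacking operations is mechanical. The helper below is an illustrative sketch (not from the paper's code): since blocks are listed top to bottom, each tower is built bottom-up.

    def stacking_operations(task):
        """Decode a task string like 'ab cde fg hij' into (top, bottom) stacking pairs.

        Blocks in each group are listed top to bottom, so each tower is built
        bottom-up: for 'cde' we first stack d on e, then c on d.
        """
        ops = []
        for tower in task.split():
            # Walk the tower from the bottom upward.
            for top, bottom in zip(reversed(tower[:-1]), reversed(tower[1:])):
                ops.append((top, bottom))
        return ops

    print(stacking_operations("ab cde fg hij"))
    # [('a', 'b'), ('d', 'e'), ('c', 'd'), ('f', 'g'), ('i', 'j'), ('h', 'i')]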

The bottom left corner shows the attention weights over different time steps of the demonstration. The x-axis shows the time index in the demonstration, and the y-axis shows the different query heads of the attention operation.

The bottom right corner shows the attention weights over different blocks in the current state. The x-axis shows the different blocks, and the y-axis shows the different query heads of the attention operation.
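
For reference, the operation behind both heatmaps is soft (dot-product) attention with multiple query heads. The sketch below, written with plain numpy arrays, shows how such weights arise; it is a generic formulation, not necessarily the paper's exact parameterization.

    import numpy as np

    def soft_attention(queries, keys, values):
        """Minimal multi-query soft attention (illustrative, not the paper's code).

        queries: (num_heads, d)    one row per query head
        keys:    (num_items, d)    one row per demo time step, or per block
        values:  (num_items, d_v)
        Returns (readouts, weights); `weights` is what the heatmaps visualize.
        """
        scores = queries @ keys.T                      # (num_heads, num_items)
        scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax over items
        readouts = weights @ values                    # (num_heads, d_v)
        return readouts, weights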

Overall performance

  • We found that our proposed architecture performs well both on the tasks it was trained on and on new tasks not seen during training.
  • As the number of stacking operations increases, the task gets harder, and the hand-engineered policy ("Demonstration") we used to generate demonstrations is not perfect, so its success rate decreases. However, our architecture ("Entire trajectory") closely matches the performance of the hand-engineered policy. In principle, performance could be improved further with better demonstrations.
  • We also compared against alternative architectures, e.g. conditioning only on key frames or only on the final state. Overall, we found that our architecture performs best.

Training tasks

Test tasks

Side-by-side execution of all training tasks

Side-by-side execution of all test tasks

Failure case: Accidental wrong move, conditioning on the entire demonstration (DAGGER)

The intended task was "abcd efgh." The policy should form two towers by:

1st tower:

- Stacking block C on top of block D

- Stacking block B on top of block C

- Stacking block A on top of block B

2nd tower:

- Stacking block G on top of block H

- Stacking block F on top of block G

- Stacking block E on top of block F

However, due to a manipulation failure, block C bounced onto block J (around 00:22), which counts as a wrong move. Fortunately, the policy attempts to recover from the mistake, and would likely succeed given more time.
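
The "(DAGGER)" in the heading refers to the dataset-aggregation imitation algorithm used during training: the learner's own rollouts are relabeled with actions from the expert (here, the hand-engineered demonstration policy) and added to the training set, which exposes the policy to mistake states and helps explain why it can attempt to recover. Below is a minimal sketch of the loop; rollout, expert_action, and train_on are hypothetical helpers.

    def dagger(policy, expert_action, rollout, train_on, num_iters):
        """Minimal DAGGER loop (an illustrative sketch; all helpers are hypothetical).

        Each iteration rolls out the current learner, relabels the states it
        actually visited with expert actions, and retrains on the aggregate.
        """
        dataset = []
        for _ in range(num_iters):
            states = rollout(policy)                   # learner's own state visits
            dataset += [(s, expert_action(s)) for s in states]
            policy = train_on(dataset)                 # supervised learning step
        return policy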

Failure case: Intentional wrong move, conditioning on the final state

The intended task was "ab cd efg." The policy should form three towers by:

1st tower:

- Stacking block A on top of block B

2nd tower:

- Stacking block C on top of block D

3rd tower:

- Stacking block F on top of block G

- Stacking block E on top of block F

However, the policy executes the wrong task, intentionally placing block C on top of block A.

Failure case: Irrecoverable manipulation failure

The intended task was "abcd efgh." The policy should form two towers by:

1st tower:

- Stacking block C on top of block D

- Stacking block B on top of block C

- Stacking block A on top of block B

2nd tower:

- Stacking block G on top of block H

- Stacking block F on top of block G

- Stacking block E on top of block F

However, due to a manipulation failure, block A was shaken off the table (around 00:18). The demonstration policy does not know how to handle this situation, so the learned policy cannot recover either. Hence this execution fails.