Towards More Generalizable
One-shot Visual Imitation Learning

Zhao Mandi*, Fangchen Liu*, Kimin Lee, Pieter Abbeel
University of California, Berkeley

Paper | Code

Abstract

We extend one-shot imitation learning to an ambitious multi-task setup, and support this formulation with a vision-based robotic manipulation benchmark consisting of 7 tasks, a total of 61 variations, and a continuum of instances within each variation. We investigate multi-task training followed by one-shot imitation, evaluated (i) on variations within the training tasks, (ii) on unseen tasks, and (iii) after fine-tuning on unseen tasks.

We propose to tackle the challenges of multi-task one-shot imitation by improving network architecture and self-supervised representation learning. Our method, MOSAIC, outperforms the prior state of the art in learning efficiency and final performance, and learns a multi-task policy with promising generalization ability via fine-tuning.

Nut Assembly | Pick & Place | Stack Block | Basketball | Door | Press Button | Drawer

  • We illustrate the input images to our model as it performs 1 variation from each of the 7 task environments in our benchmark.

One-shot Visual Imitation Learning

A general-purpose robot should be able to master a wide range of tasks and quickly learn a novel one by leveraging past experiences. One-shot imitation learning is one approach towards this goal: an agent is trained with pairs of expert demonstrations, such that at test time, it can directly execute a new task from just one demonstration.

Below is an illustrative example of one-shot imitation learning on 2 variations of the Nut Assembly task.

  • Top row: Demonstration videos for each variation. No other context information (e.g. task IDs) is provided.

  • Bottom row: An agent must "watch" the demonstration to infer which variation it should perform.


However, prior works tend to assume a very strong similarity between training and test conditions: an agent is trained on many variations of just one task, then evaluated on unseen but similar variations of that same task. In our multi-task setup, an agent is trained on many tasks and many variations, and tested not only on all of the training tasks but also on novel tasks that are excluded from training.
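To make this setup concrete, below is a minimal sketch of demonstration-conditioned behavior cloning in PyTorch. The module, dimensions, and regression loss are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn as nn

class DemoConditionedPolicy(nn.Module):
    """Maps (demonstration video, current observation) to a predicted action."""
    def __init__(self, embed_dim=256, act_dim=7):
        super().__init__()
        # Shared frame encoder; a real model would use a CNN backbone.
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, act_dim)

    def forward(self, demo_frames, obs):
        # demo_frames: (B, T, C, H, W); obs: (B, C, H, W)
        B, T = demo_frames.shape[:2]
        demo_emb = self.encoder(demo_frames.flatten(0, 1)).view(B, T, -1).mean(dim=1)
        obs_emb = self.encoder(obs)
        return self.head(torch.cat([demo_emb, obs_emb], dim=-1))

def bc_step(policy, optimizer, batch):
    # Each batch pairs a demonstration of a task variation with a
    # state-action sample from a different trajectory of the same variation.
    pred = policy(batch["demo_frames"], batch["obs"])
    loss = nn.functional.mse_loss(pred, batch["expert_action"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```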

Benchmark

Full illustration of our robot manipulation benchmark of 7 tasks and a total of 61 semantically distinct variations. For each variation, we also add randomization to create more varied instances, such as different initial object positions.
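To illustrate the task → variation → instance hierarchy, here is a hypothetical sketch of how an episode could be sampled; the function names and per-task variation counts are placeholders, not the benchmark's actual API.

```python
import random

TASKS = ["nut_assembly", "pick_place", "stack_block", "basketball",
         "door", "press_button", "drawer"]
# Placeholder per-task counts; the real benchmark splits its 61 variations
# unevenly across the 7 tasks.
N_VARIATIONS = {task: 8 for task in TASKS}

def sample_episode(rng: random.Random):
    task = rng.choice(TASKS)
    variation = rng.randrange(N_VARIATIONS[task])   # semantically distinct variation
    instance_seed = rng.getrandbits(32)             # controls e.g. initial object positions
    return {"task": task, "variation": variation, "instance_seed": instance_seed}

print(sample_episode(random.Random(0)))
```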

Method

We propose MOSAIC (Multi-task One-Shot Imitation with self-Attention and Contrastive learning), which integrates (i) a self-attention model architecture and (ii) a temporal contrastive module to enable better task disambiguation and more robust representation learning.

Illustration of our network architecture. The policy network takes in a stack of demonstration video frames together with the current observation image, and predicts the expert action at each time step.
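A minimal sketch of such a self-attention policy, assuming a simple per-frame encoder and standard multi-head attention; the dimensions and number of heads are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    def __init__(self, embed_dim=128, act_dim=7, n_heads=4):
        super().__init__()
        self.frame_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim), nn.ReLU())
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.action_head = nn.Linear(embed_dim, act_dim)

    def forward(self, demo_frames, obs_frame):
        # demo_frames: (B, T, C, H, W); obs_frame: (B, C, H, W)
        B, T = demo_frames.shape[:2]
        demo_tokens = self.frame_encoder(demo_frames.flatten(0, 1)).view(B, T, -1)
        obs_token = self.frame_encoder(obs_frame).unsqueeze(1)   # (B, 1, D)
        tokens = torch.cat([demo_tokens, obs_token], dim=1)      # (B, T+1, D)
        attended, _ = self.attn(tokens, tokens, tokens)          # joint self-attention
        return self.action_head(attended[:, -1])                 # predict from the obs token
```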

Illustration of our contrastive module. A temporal contrastive loss is applied as an auxiliary objective alongside the policy's behavior cloning loss.
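A hedged sketch of one way such a temporal contrastive term can be combined with behavior cloning, using an InfoNCE-style loss over embeddings of temporally nearby frames; the positive/negative sampling scheme and the loss weight lambda_con are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(z_anchor, z_positive, temperature=0.1):
    """z_anchor, z_positive: (B, D) embeddings of temporally nearby frames.

    Each anchor's positive is its temporally adjacent frame; the other
    positives in the batch serve as negatives.
    """
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    logits = z_anchor @ z_positive.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_anchor.size(0), device=z_anchor.device)
    return F.cross_entropy(logits, labels)

def total_loss(bc_loss, z_anchor, z_positive, lambda_con=1.0):
    # Auxiliary contrastive term added to the behavior cloning loss.
    return bc_loss + lambda_con * temporal_infonce(z_anchor, z_positive)
```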

Experiments

Single and Multi-task One-shot Imitation Evaluated on Seen Tasks


We evaluate test-time one-shot imitation performance in both single-task and multi-task setups.

For each task: 1) each entry in the row named "single" reports results from a single-task model trained and tested on that same task; 2) the row named "multi" reports results from one multi-task model trained on all 7 tasks in the benchmark and tested on each task separately.

Multi-task One-shot Imitation on Novel Tasks


We compare fine-tuning multi-task models on their corresponding held-out novel task against training a single-task model from scratch on that novel task.

We take the same amount of expert data used in the single- and multi-task experiments of the previous section as the full "100% data", then repeat the fine-tuning and training-from-scratch experiments with 25%, 50%, and 75% of the full data.
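An illustrative sketch of this comparison loop; the helper functions, training procedure, and evaluation metric are hypothetical.

```python
import copy

FRACTIONS = [0.25, 0.50, 0.75, 1.00]

def run_comparison(pretrained_policy, make_fresh_policy, novel_task_demos, train_fn):
    """train_fn(policy, demos) -> evaluation score; all helpers are hypothetical."""
    results = {}
    for frac in FRACTIONS:
        demos = novel_task_demos[: int(frac * len(novel_task_demos))]       # expert-data subset
        results[frac] = {
            "finetune": train_fn(copy.deepcopy(pretrained_policy), demos),  # warm start
            "scratch": train_fn(make_fresh_policy(), demos),                # cold start
        }
    return results
```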

As the results indicate, fine-tuning a multi-task model on a completely novel task learns faster than training from scratch, sometimes converges to a higher final performance, and requires less expert data.