Scalable Multi-Task Imitation Learning with Autonomous Improvement
Avi Singh1, Eric Jang2, Alex Irpan2, Daniel Kappler3, Murtaza Dalal1, Sergey Levine1,2, Mohi Khansari3, Chelsea Finn2,4
1UC Berkeley, 2Google Brain, 3Google X, 4Stanford
- A critical challenge for robotic learning is scale: acquiring a large enough dataset for the system to generalize effectively
- Imitation learning is a stable and powerful approach for robot learning, but requires human operators for data collection
- In this work, we aim to build an imitation learning system that:
  - Improves through autonomously collected data
  - Avoids the explicit use of reinforcement learning or task-specific rewards
- Our method, which we call MILI, utilizes the following insight: in a multi-task setting, a failed attempt at one task might represent a successful attempt at another task. This allows us to use rollouts from the learned meta-policy itself to improve its performance.
1. Train a meta-policy on a multi-task imitation dataset
We start with a multi-task dataset of human demonstrations, and apply a meta-imitation learning algorithm to learn a demo-conditioned policy.
2. Use the same dataset to also learn a task embedding model
We use the same initially provided dataset to also learn a task embedding model. We use a contrastive loss that pulls demos from the same task close together in the latent space and pushes demos from different tasks apart.
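The text above does not specify the exact form of the contrastive loss, so the following is only a minimal sketch of one common choice: a pairwise margin-based loss. The function name `contrastive_loss`, the `margin` parameter, and the use of Euclidean distance are assumptions made for illustration and are not taken from the original work.

```python
import math

def contrastive_loss(embeddings, task_ids, margin=1.0):
    """Mean pairwise margin-based contrastive loss (illustrative assumption).

    embeddings: list of equal-length float tuples, one per demo.
    task_ids:   list of task labels, aligned with `embeddings`.
    """
    total, num_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            dist = math.dist(embeddings[i], embeddings[j])
            if task_ids[i] == task_ids[j]:
                # Same task: penalize any distance between the two demos.
                total += dist ** 2
            else:
                # Different tasks: penalize only if closer than the margin.
                total += max(0.0, margin - dist) ** 2
            num_pairs += 1
    return total / num_pairs if num_pairs else 0.0
```

With this form, the loss is zero once same-task demos coincide and different-task demos are at least `margin` apart, which matches the qualitative behavior described above.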
3. Use the meta-policy to collect new trials, and embed them
We use the meta-policy learned in step 1 to collect trials in new environments by conditioning it on random demonstrations from the original dataset. We embed these trials into the latent task space using the embedding model learned in step 2.
4. Generate new tasks by clustering trials that are close in the latent space into new tasks, and re-run meta-imitation
In order to utilize the trials to learn an improved meta-policy, we need to organize them into tasks. We do so by finding trials that are close to each other in the latent space, and grouping them to generate new tasks. We then perform meta-imitation on this new dataset, in addition to the original human demo dataset.
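The grouping step described above could be sketched as follows. The source does not state which clustering procedure is used, so this example assumes a simple greedy rule: pick a seed trial, collect every unassigned trial within a fixed radius of it, and keep the group only if it is large enough to form a task. The names `group_trials`, `radius`, and `min_size` are hypothetical.

```python
import math

def group_trials(trial_embeddings, radius=0.5, min_size=2):
    """Greedily group trial indices by latent-space proximity (illustrative assumption).

    Returns a list of tasks, each a list of indices into `trial_embeddings`.
    Trials that end up in groups smaller than `min_size` are discarded.
    """
    unassigned = list(range(len(trial_embeddings)))
    tasks = []
    while unassigned:
        seed = unassigned[0]
        # The seed always belongs to its own group (distance 0), so the loop terminates.
        members = [
            i for i in unassigned
            if math.dist(trial_embeddings[seed], trial_embeddings[i]) <= radius
        ]
        unassigned = [i for i in unassigned if i not in members]
        if len(members) >= min_size:
            tasks.append(members)
    return tasks
```

Each returned group can then be treated as one new task for meta-imitation, alongside the original human demonstrations.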
MILI: The Complete Pipeline
The MILI algorithm. We bootstrap a one-shot imitation policy using a multi-task imitation learning dataset. We then use this policy to collect trials in new environments. A latent task space, learned using the same initial dataset, is then used to find similarities in the collected trials and generate new tasks for meta-imitation learning. We update our meta-policy using the newly collected data, and repeat this process until convergence.
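The overall loop described above can be outlined as a short skeleton. This is a sketch under stated assumptions, not the authors' implementation: the four callables (`train_policy`, `train_embedding`, `collect_trials`, `group_into_tasks`) are hypothetical placeholders for the components of steps 1 to 4, and a fixed number of rounds (`num_rounds`) stands in for the convergence check, which the text does not define.

```python
def mili_loop(human_demos, train_policy, train_embedding,
              collect_trials, group_into_tasks, num_rounds=3):
    """Skeleton of the bootstrap / collect / group / retrain loop (illustrative).

    human_demos: dict mapping task name -> list of demonstrations.
    All four callables are hypothetical stand-ins supplied by the caller.
    """
    # The embedding model is learned once, from the initial dataset.
    embed = train_embedding(human_demos)
    dataset = dict(human_demos)
    policy = train_policy(dataset)

    for round_idx in range(num_rounds):  # assumption: fixed rounds, not a convergence test
        # Roll out the current policy, conditioned on demos from the original dataset.
        trials = collect_trials(policy, human_demos)
        latents = [embed(trial) for trial in trials]
        # Turn groups of nearby trials into new tasks.
        for k, members in enumerate(group_into_tasks(latents)):
            dataset[f"auto_r{round_idx}_t{k}"] = [trials[i] for i in members]
        # Retrain on the new tasks plus the original human demos.
        policy = train_policy(dataset)
    return policy, dataset
```

Passing the components in as callables keeps the skeleton independent of any particular policy architecture, embedding model, or clustering rule.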