Diffusion Meets DAgger

Supercharging Eye-in-hand Imitation Learning

Xiaoyu Zhang       Matthew Chang       Pranav Kumar       Saurabh Gupta

University of Illinois at Urbana-Champaign

[arxiv] [diffusion model code]

Abstract

A common failure mode for policies trained with imitation is compounding execution errors at test time. When the learned policy encounters states that are not present in the expert demonstrations, it fails, leading to degenerate behavior. The Dataset Aggregation, or DAgger, approach to this problem simply collects more data to cover these failure states. However, in practice, this is often prohibitively expensive. In this work, we propose Diffusion Meets DAgger (DMD), a method to reap the benefits of DAgger without the cost for eye-in-hand imitation learning problems. Instead of collecting new samples to cover out-of-distribution states, DMD uses recent advances in diffusion models to synthesize these samples. This leads to robust performance from few demonstrations. We compare DMD against a behavior cloning (BC) baseline across four tasks: pushing, stacking, pouring, and shirt hanging. In pushing, DMD achieves an 80% success rate with as few as 8 expert demonstrations, whereas naive behavior cloning reaches only 20%. In stacking, DMD succeeds on average 92% of the time across 5 cups, versus 40% for BC. When pouring coffee beans, DMD transfers to another cup successfully 80% of the time. Finally, DMD attains a 90% success rate for hanging a shirt on a clothing rack.

DMD System Overview. Our system operates in three stages. 

a) A diffusion model is trained, using task and play data, to synthesize novel views relative to a given image. 

b) This diffusion model is used to generate an augmenting dataset that contains off-trajectory views from expert demonstrations. Labels for these views (cyan arrows) are constructed such that off-trajectory views will still converge towards task success (right). Images with a green border are from trajectories in the original task dataset. Purple-outlined images are diffusion-generated augmenting samples. 

c) The original task data and augmenting dataset are combined for policy learning.
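To make the three stages concrete, the sketch below is a minimal, hypothetical Python rendition of the pipeline. The interfaces (`diffusion_model.render`, `demo.images`, `demo.poses`) and the perturbation magnitudes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the three-stage DMD pipeline (assumed interfaces, not the
# authors' code). Poses are 4x4 homogeneous wrist-camera transforms.
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(pose, trans_std=0.02, rot_std_deg=5.0):
    """Sample a small random SE(3) offset around an in-trajectory camera pose.
    The perturbation magnitudes here are illustrative."""
    offset = np.eye(4)
    offset[:3, :3] = R.from_euler(
        "xyz", np.random.normal(0.0, rot_std_deg, 3), degrees=True).as_matrix()
    offset[:3, 3] = np.random.normal(0.0, trans_std, 3)
    return pose @ offset

def corrective_action(offset_pose, expert_next_pose):
    """Stage (b): label the off-trajectory view with the relative motion that
    brings the camera back onto the expert trajectory (the cyan arrows)."""
    return np.linalg.inv(offset_pose) @ expert_next_pose

def build_augmented_dataset(task_demos, diffusion_model, k_per_frame=4):
    """Stages (a)+(b): synthesize off-trajectory views and corrective labels."""
    augmented = []
    for demo in task_demos:
        for t in range(len(demo.images) - 1):
            image, pose = demo.images[t], demo.poses[t]
            for _ in range(k_per_frame):
                new_pose = perturb_pose(pose)
                # Stage (a): render a novel view relative to the demo image.
                new_image = diffusion_model.render(
                    image, source_pose=pose, target_pose=new_pose)
                augmented.append((new_image, corrective_action(new_pose, demo.poses[t + 1])))
    return augmented

# Stage (c): train the policy on original task data plus the augmented set,
# e.g. train_policy(original_pairs + build_augmented_dataset(demos, model)).
```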

DMD Robotic Experiments

There are four tasks in total: pushing an apple to a target location, stacking five different cups on a box, pouring coffee beans into a cup, and hanging a shirt on a rack. We conduct our experiments on a Franka Research 3 robot with a wrist-mounted GoPro Hero 9. 

Non-prehensile Pushing

(a) DMD vs. BC

DMD outperforms BC across all settings. DMD achieves a 100% success rate when pushing an apple, greatly exceeding BC’s 30%. It also maintains an 80% success rate with only 8 demonstrations, whereas BC drops to 20%. 

*This video contains multiple sections for different experiments


(b) DMD vs. SPARTN 

Our diffusion model synthesizes higher-quality images than NeRFs, especially when scenes undergo deformations. This advantage translates into higher task performance: DMD achieves a 100% success rate, while SPARTN achieves only 50%.

(Note that this DMD-24-demos video is different from the DMD-24-demos video above because they are from two different pairwise randomized A/B tests.)

(c) Utility of Play Data

Training the diffusion model with additional play data boosts the task success rate to 100%, compared to 80% when using the model trained only on task data.

(Note that this DMD-24-demos video is different from the previous DMD-24-demos videos because they are all from different pairwise randomized A/B tests.)

Task & Play Data

Only Task Data

Stacking

*This video contains multiple sections for different experiments


Pouring

*Recorded with a third-person camera to better show the amount of coffee beans transferred or spilled. This view is not input into the policy.


Hanging a Shirt

*Recorded with a third-person camera for a clearer view of the task. This view is not input into the policy.

In-the-Wild Cup Arrangement

We leverage a diverse in-the-wild dataset from the recent Universal Manipulation Interface (UMI) paper. We adopt the same task definition as the in-the-wild generalization experiment in UMI: placing a cup on a saucer with its handle facing the left side of the robot. UMI collected 1447 demonstrations across 30 locations and 18 training cups. 

We use their publicly available demonstration data and conduct evaluation in our lab (i.e., a novel location) with and without DMD. We test on 5 held-out cups. For each cup, we test 5 different start configurations. We follow the experiment protocol outlined in UMI: we use pixel masks to ensure that the starting locations of the cups and saucers are the same across the two methods.
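One simple way to implement such a pixel-mask reset check is sketched below; the mask file name, green highlight, and blending weight are illustrative assumptions rather than the exact UMI protocol.

```python
# Sketch: overlay a reference start-configuration mask on the live camera
# feed so the operator can re-place the cup and saucer in the same spot
# before each paired trial. File name and colors are assumptions.
import cv2
import numpy as np

# Binary mask of the cup/saucer pixels in the reference start frame,
# e.g. drawn once by hand over a saved image (hypothetical file).
reference_mask = cv2.imread("start_mask.png", cv2.IMREAD_GRAYSCALE) > 0

def overlay_reference(live_frame, mask=reference_mask, alpha=0.4):
    """Blend the reference footprint onto the live frame; the operator nudges
    the objects until they line up with the highlighted region."""
    overlay = live_frame.copy()
    overlay[mask] = (0, 255, 0)  # highlight reference footprint in green
    return cv2.addWeighted(overlay, alpha, live_frame, 1 - alpha, 0)
```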

DMD

Diffusion Policy