Ryan Hoque, Ajay Mandlekar*, Caelan Garrett*, Ken Goldberg, Dieter Fox
I-MG Policy (Noiseless)
I-MG Policy (Noise)
I-MG Policy in Simulation (Noise)
Robust to Dynamic Pose Changes
Robust to Visual Distractors
Robust to Physical Perturbations
Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. One common real-world source of distribution shift is object pose estimation error, which can cause agents that rely on pose information to fail catastrophically during deployment. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning, in which a human operator provides corrective interventions during policy deployment. However, collecting a sufficient number of interventions to cover the distribution of policy mistakes can be burdensome for human operators. I-MG automatically generates large datasets of synthetic corrective interventions from a handful of human interventions, with coverage of both diverse scene configurations and the distribution of policy mistakes. The I-MG system can also facilitate sim-to-real transfer of pose-conditioned policies, which can mitigate the visual sim-to-real gap but suffer from inaccurate pose estimates. Below, the robot mistakenly believes the peg is at the position highlighted in red and requires a demonstration of recovery behavior toward the true peg position.
Quantitative results are in Tables 1, 2, and 3 of the main text. Here we show qualitative results of learned I-MG policy execution under pose noise. All 50 evaluation episodes of the highest-performing checkpoint are shown, with 5 frames skipped for every frame shown to speed up playback. The policies in Nut Insertion, 2-Piece Assembly, and Coffee move toward the noisy pose estimates (where they believe the target object to be) but recover toward the true target pose upon contact with the target object. Meanwhile, a single policy in Nut-and-Peg Assembly can both (Geometry 1) grasp and place the nut when the handle is specified correctly and (Geometry 2) recover toward an alternate handle location after disambiguating between the two via a missed grasp. Note that the policies are not perfect and still make mistakes, in part because they are derived from only 10 human interventions. A minimal sketch of injecting this kind of pose noise appears after the results below.
Nut Insertion (98%)
2-Piece Assembly (70%)
Coffee (80%)
Nut-and-Peg Assembly, Geometry 1 (92%)
Nut-and-Peg Assembly, Geometry 2 (88%)
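To make the pose-noise evaluation concrete, here is a minimal sketch of rolling out a pose-conditioned policy under a corrupted pose estimate. The environment/policy interface, the observation key "object_pos", the Gaussian noise model with a fixed per-episode offset, and the noise scale are all illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def rollout_with_pose_noise(env, policy, horizon=400, pos_std=0.02, rng=None):
    """Roll out a pose-conditioned policy while corrupting its object pose
    estimate with a fixed per-episode offset (a stand-in for pose estimation
    error). The policy therefore steers toward where it *believes* the object
    is and must recover on contact with the true object.

    Hypothetical interface: `env` is a gym-style environment whose observation
    dict contains "object_pos", and `policy` maps an observation dict to an
    action. Key names and noise scale are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = env.reset()
    # One pose error per episode: the perceived object position is offset
    # from the true one for the entire rollout.
    pos_offset = rng.normal(0.0, pos_std, size=3)
    for _ in range(horizon):
        noisy_obs = dict(obs)
        noisy_obs["object_pos"] = obs["object_pos"] + pos_offset
        obs, reward, done, info = env.step(policy(noisy_obs))
        if done:
            break
    return info
```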
I-MG can also be used entirely offline (i.e., without robot policy execution). Here, the human provides offline "mistake and recovery" demonstrations, indicating what could go wrong and how to recover from it. This can be desirable, for example, for "real2sim" corrections: observing sim2real gaps and manually correcting for them. By indicating which portion of each demonstration corresponds to recovery, the human allows these source interventions to be expanded with the same I-MG process, using mistake replay in place of policy execution. On the left below, the human teleoperates entire trajectories of intentionally toppling an object and setting it upright, with the red border indicating annotated recovery segments. On the right, these source interventions are automatically expanded with I-MG. A minimal sketch of how such a demonstration could be stored and split is shown after the clips below.
Offline Data Collection
Interventional Data Generation
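As a rough illustration of the offline workflow above, the sketch below stores a full teleoperated mistake-and-recovery trajectory together with a human-annotated recovery start index and splits it into the segment to replay verbatim (the induced mistake) and the object-relative segment to be expanded. All field names, shapes, and the object-relative transform convention are assumptions for illustration, not the actual I-MG data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OfflineSourceDemo:
    """A teleoperated mistake-and-recovery trajectory plus a human annotation
    of where recovery begins. Field names and shapes are illustrative."""
    actions: np.ndarray        # (T, action_dim) teleoperated actions
    eef_poses: np.ndarray      # (T, 4, 4) end-effector poses in the world frame
    object_poses: np.ndarray   # (T, 4, 4) pose of the relevant object
    recovery_start: int        # first timestep annotated as recovery

def split_offline_demo(demo: OfflineSourceDemo):
    """Split an offline demo into (1) the mistake segment, replayed verbatim in
    place of policy execution, and (2) the recovery segment, expressed relative
    to the object pose at recovery onset so it can be transformed to new scene
    configurations during data generation."""
    mistake_actions = demo.actions[: demo.recovery_start]
    recovery_actions = demo.actions[demo.recovery_start :]
    # Object-relative end-effector poses for the recovery segment.
    obj_T = demo.object_poses[demo.recovery_start]
    recovery_rel_poses = np.linalg.inv(obj_T) @ demo.eef_poses[demo.recovery_start :]
    return mistake_actions, recovery_actions, recovery_rel_poses
```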
MimicGen assumes that the task can be represented as a known sequence of object-centric subtasks. However, catastrophic policy failures may cross the boundaries between object-centric subtasks, reverting task progress to earlier stages. For instance, consider the task below, which consists of (1) grasping object 1, (2) placing object 1, (3) grasping object 2, and (4) placing object 2. An imprecise placement of object 1 can require a re-grasp, reverting the task to the first subtask. Thus, during data generation I-MG models the task as a sequence of subtask attempts rather than subtasks, allowing multiple attempts at each subtask and reversion to earlier stages. When evaluated under the mistake distribution, the learned policy can recover from this failure and occasionally exhibits closed-loop regrasping, an emergent behavior that does not appear in the dataset. A sketch of this attempt-based generation loop appears after the clips below.
Data Collection
Learned Policy Execution
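A minimal sketch of the attempt-based generation loop described above, under assumed helpers: each subtask exposes an `is_complete(env)` check, and a hypothetical `segment_for(env, subtask)` yields the actions of a transformed source segment. After every attempt, progress is recomputed from scratch, so a failure that undoes an earlier subtask reverts the loop to that stage.

```python
def generate_episode(env, subtasks, segment_for, max_attempts=3):
    """Treat the task as a sequence of subtask *attempts* rather than subtasks.

    `subtasks` is the ordered list of object-centric subtasks, each with an
    `is_complete(env)` predicate; `segment_for(env, subtask)` yields the actions
    of a transformed source segment. Both are illustrative stand-ins for the
    actual data generation code.
    """
    attempts = [0] * len(subtasks)
    idx = 0
    while idx < len(subtasks):
        if attempts[idx] >= max_attempts:
            return False  # discard this generation attempt
        attempts[idx] += 1
        for action in segment_for(env, subtasks[idx]):
            env.step(action)
        # Re-check progress from the start: e.g., an imprecise place of
        # object 1 can revert the loop to the grasp-object-1 stage.
        idx = 0
        while idx < len(subtasks) and subtasks[idx].is_complete(env):
            idx += 1
    return True
```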
We perform 3 training runs with different seeds (2000 epochs each) for each of the settings below, reporting the mean and standard deviation across seeds of each run's highest-performing checkpoint (checkpoints are evaluated over 50 trials every 50 epochs). We find low variance (standard deviation < 7%) across training seeds, indicating training stability. A sketch of this aggregation is shown after the results below.
Nut Insertion, Source Interventions: 32.0% +/- 5.9%
Nut Insertion, MimicGen Full Demos: 54.7% +/- 6.8%
Nut Insertion, I-MG Policy Ablation: 86.0% +/- 0.0%
Nut Insertion, I-MG: 98.7% +/- 0.9%
2-Piece Assembly, I-MG: 74.7% +/- 5.2%
Coffee, I-MG: 84.0% +/- 4.3%
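The aggregation behind the numbers above can be sketched as follows: for each training seed, take the success rate of its best checkpoint (each checkpoint evaluated over 50 trials), then report the mean and standard deviation across seeds. The helper below is illustrative; whether the reported deviation is the population or sample standard deviation is an assumption (NumPy's default, population, is used here).

```python
import numpy as np

def aggregate_across_seeds(per_seed_checkpoint_success):
    """Aggregate evaluation results across training seeds.

    `per_seed_checkpoint_success` has one entry per seed; each entry is the
    list of per-checkpoint success rates (each estimated from 50 trials).
    For each seed we keep its best checkpoint, then report the mean and
    standard deviation across seeds. Population std is an assumption.
    """
    best_per_seed = np.array([max(rates) for rates in per_seed_checkpoint_success])
    return float(best_per_seed.mean()), float(best_per_seed.std())

# Usage with illustrative (not actual) per-checkpoint success rates:
# mean, std = aggregate_across_seeds([[0.90, 0.98], [0.88, 0.96], [0.92, 0.94]])
```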