Ryan Hoque, Ajay Mandlekar*, Caelan Garrett*, Ken Goldberg, Dieter Fox
ArXiv Link: https://arxiv.org/abs/2405.01472
I-Gen Policy (Noiseless)
I-Gen Policy (Noise)
I-Gen Policy in Simulation (Noise)
Robust to Dynamic Pose Changes
Robust to Visual Distractors
Robust to Physical Perturbations
We run real robot experiments with a Franka Emika Panda robot on a cube grasping task, where the robot must reach and grasp a 5 cm x 5 cm x 5 cm cube placed at a random position (fixed orientation) in a 20 cm x 30 cm region. The policy input consists of the end-effector position, estimated block position, and gripper state, and the policy output consists of continuous delta-pose actions at 20 Hz (with fixed top-down orientation). The pose error distribution samples uniform noise with x in [-1 cm, 1 cm] and y in [-7 cm, -2.5 cm] U [2.5 cm, 7 cm] (chosen to produce contact between the cube and a gripper jaw for pose disambiguation). Here we investigate I-Gen's ability to facilitate sim2real transfer: the policy is trained entirely in simulation with no real-world data or fine-tuning. We include quantitative results below and qualitative videos above. The results are consistent with the trends in simulation, with I-Gen outperforming the baselines. The policy is also robust to physical perturbations, visual distractors, and dynamic pose changes, as shown above.
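For concreteness, here is a minimal sketch of how such a pose error could be sampled (the interval bounds match the text above; the function name and NumPy-based structure are our own illustration, not the exact implementation):

```python
import numpy as np

def sample_pose_error(rng=np.random.default_rng()):
    """Sample a synthetic cube pose estimation error (in meters).

    The x error is uniform in [-1 cm, 1 cm]; the y error is uniform over
    [-7 cm, -2.5 cm] U [2.5 cm, 7 cm], so the gripper reliably contacts a
    cube face and the true pose can be disambiguated on contact.
    """
    dx = rng.uniform(-0.01, 0.01)
    side = rng.choice([-1.0, 1.0])            # pick one of the two y intervals
    dy = side * rng.uniform(0.025, 0.07)
    return np.array([dx, dy])

# The policy then observes the corrupted estimate, e.g.:
# estimated_block_pos = true_block_pos[:2] + sample_pose_error()
```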
Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. One common real-world source of distribution shift is object pose estimation error, which can cause agents that rely on pose information to fail catastrophically during deployment. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning, in which a human operator provides corrective interventions during policy deployment. However, collecting a sufficient amount of interventions to cover the distribution of policy mistakes can be burdensome for human operators. IntervenGen (I-Gen) automatically generates large datasets of synthetic corrective interventions from a handful of human interventions, with coverage across both diverse scene configurations and policy mistake distributions. The I-Gen system can also facilitate sim2real transfer of pose-conditioned policies. Below, the robot mistakenly believes the peg is at the position highlighted in red and requires demonstration of recovery behavior toward the true peg position.
Quantitative results are in the main text in Tables 1, 2, and 3. Here we show qualitative results of learned I-Gen policy execution under pose noise. All 50 evaluation episodes of the highest-performing checkpoint are shown, with 5 frames skipped for every frame shown for faster viewing. The policies in Nut Insertion, 2-Piece Assembly, and Coffee move toward the noisy pose estimates (where they believe the target object to be) but recover toward the true target pose upon contact with the target object. Meanwhile, a single policy in Nut-and-Peg Assembly can both (Geometry 1) grasp and place the nut when the handle pose is specified correctly and (Geometry 2) recover toward an alternate handle location after disambiguating between the two via a missed grasp. Note that the policies are not perfect and still make mistakes, in part because they use only 10 human interventions.
Nut Insertion (98%)
2-Piece Assembly (70%)
Coffee (80%)
Nut-and-Peg Assembly, Geometry 1 (92%)
Nut-and-Peg Assembly, Geometry 2 (88%)
I-Gen can also be used entirely offline (i.e., without robot policy execution). Here, the human provides offline "mistake and recovery" demonstrations, indicating what could go wrong and how to recover from it. For example, this can be desirable for "real2sim" corrections: observing sim2real gaps and manually correcting for them. By annotating which portion of each demonstration corresponds to recovery, these source interventions can be expanded with the same I-Gen process, with mistake replay in place of policy execution; see the sketch after the videos below. On the left below, the human teleoperates entire trajectories of intentionally toppling an object and setting it upright, with the red border indicating annotated recovery segments. On the right, these source interventions are automatically expanded with I-Gen.
Offline Data Collection
Interventional Data Generation
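A minimal sketch of what an annotated offline source demonstration might look like (the class and field names below are illustrative, not I-Gen's actual data format):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OfflineSourceDemo:
    """A teleoperated 'mistake and recovery' trajectory with an annotated recovery segment."""
    states: np.ndarray     # (T, state_dim) simulator states along the demonstration
    actions: np.ndarray    # (T, action_dim) teleoperated actions
    recovery_start: int    # first timestep of the human-annotated recovery segment

    def mistake_segment(self) -> np.ndarray:
        """Replayed in new scene configurations to recreate the mistake (in place of policy execution)."""
        return self.actions[:self.recovery_start]

    def recovery_segment(self) -> np.ndarray:
        """Treated as the corrective intervention and adapted to the new object poses during expansion."""
        return self.actions[self.recovery_start:]
```

During offline expansion, the mistake segment is replayed to reproduce the failure in a new scene configuration, and the recovery segment is transformed and replayed as in the standard I-Gen process.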
MimicGen assumes that the task can be represented as a known sequence of object-centric subtasks. However, catastrophic policy failures may cross the boundaries between subtasks, reverting task progress to earlier stages. For instance, consider the task below, which consists of (1) grasping object 1, (2) placing object 1, (3) grasping object 2, and (4) placing object 2. An imprecise place of object 1 can require a re-grasp, reverting progress to the first subtask. Thus, during data generation I-Gen models the task as a sequence of subtask attempts rather than subtasks, allowing multiple attempts at each subtask and reversion to earlier stages (see the sketch after the videos below). When evaluated under the mistake distribution, the learned policy can recover from this failure and occasionally exhibits closed-loop regrasping, an emergent behavior that does not appear in the dataset.
Data Collection
Learned Policy Execution
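A rough sketch of this difference in Python (the callables passed in are placeholders standing in for the existing MimicGen-style machinery, not actual MimicGen or I-Gen APIs): instead of stepping through a fixed list of subtasks, the generation loop tracks subtask attempts and can move the subtask index backward when an attempt undoes earlier progress.

```python
def generate_episode(subtasks, env, transform_segment, execute, earliest_incomplete,
                     max_attempts=10):
    """Generate one synthetic episode by treating the task as a sequence of subtask attempts.

    subtasks: ordered object-centric subtasks, e.g. [grasp_obj1, place_obj1, grasp_obj2, place_obj2].
    transform_segment(subtask, env): adapts the source intervention segment to the current scene.
    execute(env, segment): replays the segment and returns the resulting list of transitions.
    earliest_incomplete(env, subtasks): index of the first subtask not yet completed.
    """
    idx, attempts, trajectory = 0, 0, []
    while idx < len(subtasks) and attempts < max_attempts:
        segment = transform_segment(subtasks[idx], env)
        trajectory += execute(env, segment)
        attempts += 1
        # Re-check progress: a failed attempt (e.g. an imprecise place that requires a
        # re-grasp) may revert the index to an earlier subtask instead of always advancing.
        idx = earliest_incomplete(env, subtasks)
    # Keep only episodes that eventually complete every subtask.
    return trajectory if idx == len(subtasks) else None
```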
We perform 3 training runs with different seeds (2000 epochs each) for each of the settings below, reporting the mean and standard deviation across seeds of the highest-performing checkpoint's success rate (50 evaluation trials every 50 epochs); a sketch of this protocol follows the results below. We find low variance (standard deviation < 7%) across training seeds, indicating stability.
Nut Insertion, Source Interventions: 32.0% +/- 5.9%
Nut Insertion, MimicGen Full Demos: 54.7% +/- 6.8%
Nut Insertion, I-Gen Policy Ablation: 86.0% +/- 0.0%
Nut Insertion, I-Gen: 98.7% +/- 0.9%
2-Piece Assembly, I-Gen: 74.7% +/- 5.2%
Coffee, I-Gen: 84.0% +/- 4.3%
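For clarity, one plausible reading of the reporting protocol as code (a minimal sketch; the array shape and per-seed selection of the best checkpoint are our assumptions about the bookkeeping, not the exact evaluation code):

```python
import numpy as np

def report(success_rates: np.ndarray) -> tuple[float, float]:
    """success_rates: (num_seeds, num_checkpoints) success over 50 trials per checkpoint,
    e.g. 3 seeds x 40 checkpoints (one evaluation every 50 of 2000 epochs).

    Returns the mean and standard deviation, across seeds, of each seed's best checkpoint.
    """
    best_per_seed = success_rates.max(axis=1)
    return float(best_per_seed.mean()), float(best_per_seed.std())
```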
Noise injection in executed actions during the MimicGen process can significantly increase downstream policy performance (Mandlekar et al., Laskey et al.); consequently, in our main set of experiments, we used MimicGen's default setting of additive unit Gaussian noise scaled by 0.05. However, we found that I-Gen can be less sensitive to the presence of action noise: in the Nut Insertion environment, with a 10x reduction in action noise magnitude, the performance of the ablation that removes policy execution during data generation falls from 86% to 66%, while I-Gen's performance remains at 98%. This could be due to the broad coverage of the mistake distribution generated by policy execution.
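Concretely, the noise setting corresponds to something like the following sketch of additive action noise during data-generation replay (the function and its handling of the action vector are illustrative, not MimicGen's actual code):

```python
import numpy as np

def noisy_action(action: np.ndarray, noise_scale: float = 0.05,
                 rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Perturb a continuous delta-pose action with additive Gaussian noise.

    Unit Gaussian noise is sampled and scaled by `noise_scale` (0.05 is the MimicGen
    default used in our main experiments; the 10x-reduced setting corresponds to 0.005).
    Gripper open/close dimensions would typically be left unperturbed, which this
    simplified sketch does not handle.
    """
    return action + noise_scale * rng.standard_normal(action.shape)
```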