Ryan Hoque, Ajay Mandlekar*, Caelan Garrett*, Ken Goldberg, Dieter Fox
I-MG Policy (Noiseless)
I-MG Policy (Noise)
I-MG Policy in Simulation (Noise)
Robust to Dynamic Pose Changes
Robust to Visual Distractors
Robust to Physical Perturbations
Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. One common real-world source of distribution shift is object pose estimation error, which can cause agents that rely on pose information to fail catastrophically during deployment. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning, in which a human operator provides corrective interventions during policy deployment. However, collecting a sufficient number of interventions to cover the distribution of policy mistakes can be burdensome for human operators. I-MG automatically generates large datasets of synthetic corrective interventions from a handful of human interventions, with coverage of both diverse scene configurations and the distribution of policy mistakes. The I-MG system can also facilitate sim-to-real transfer of pose-conditioned policies, which can mitigate the visual sim-to-real gap but suffer from inaccurate pose estimates. Below, the robot mistakenly believes the peg is at the position highlighted in red and requires a demonstration of recovery behavior toward the true peg position.
Quantitative results are in Tables 1, 2, and 3 of the main text. Here we show qualitative results of learned I-MG policy execution under pose noise. All 50 evaluation episodes of the highest-performing checkpoint are shown, with 5 frames skipped for every frame shown to speed up playback. The policies in Nut Insertion, 2-Piece Assembly, and Coffee move toward the noisy pose estimates (where they believe the target object to be) but recover toward the true target pose upon contact with the target object. Meanwhile, a single policy in Nut-and-Peg Assembly can both (Geometry 1) grasp and place the nut when the handle is specified correctly and (Geometry 2) recover toward an alternate handle location after disambiguating between the two via a missed grasp. Note that the policies are not perfect and still make mistakes, in part because they are derived from only 10 human interventions. A minimal sketch of injecting this kind of pose noise appears after the results below.
Nut Insertion (98%)
2-Piece Assembly (70%)
Coffee (80%)
Nut-and-Peg Assembly, Geometry 1 (92%)
Nut-and-Peg Assembly, Geometry 2 (88%)
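To make the pose-noise evaluation concrete, here is a minimal sketch of rolling out a pose-conditioned policy under a corrupted pose estimate. The environment/policy interface, the observation key "object_pos", the Gaussian noise model with a fixed per-episode offset, and the noise scale are all illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def rollout_with_pose_noise(env, policy, horizon=400, pos_std=0.02, rng=None):
    """Roll out a pose-conditioned policy while corrupting its object pose
    estimate with a fixed per-episode offset (a stand-in for pose estimation
    error). The policy therefore steers toward where it *believes* the object
    is and must recover on contact with the true object.

    Hypothetical interface: `env` is a gym-style environment whose observation
    dict contains "object_pos", and `policy` maps an observation dict to an
    action. Key names and noise scale are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = env.reset()
    # One pose error per episode: the perceived object position is offset
    # from the true one for the entire rollout.
    pos_offset = rng.normal(0.0, pos_std, size=3)
    for _ in range(horizon):
        noisy_obs = dict(obs)
        noisy_obs["object_pos"] = obs["object_pos"] + pos_offset
        obs, reward, done, info = env.step(policy(noisy_obs))
        if done:
            break
    return info
```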
I-MG can also be used entirely offline (i.e., without robot policy execution). Here, the human provides offline "mistake and recovery" demonstrations, indicating what could go wrong and how to recover from it. This can be desirable, for example, for "real2sim" corrections: observing sim2real gaps and manually correcting for them. By indicating which portion of each demonstration corresponds to recovery, the human allows these source interventions to be expanded with the same I-MG process, using mistake replay in place of policy execution. On the left below, the human teleoperates entire trajectories of intentionally toppling an object and setting it upright, with the red border indicating annotated recovery segments. On the right, these source interventions are automatically expanded with I-MG. A minimal sketch of how such a demonstration could be stored and split is shown after the clips below.
Offline Data Collection
Interventional Data Generation
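As a rough illustration of the offline workflow above, the sketch below stores a full teleoperated mistake-and-recovery trajectory together with a human-annotated recovery start index and splits it into the segment to replay verbatim (the induced mistake) and the object-relative segment to be expanded. All field names, shapes, and the object-relative transform convention are assumptions for illustration, not the actual I-MG data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OfflineSourceDemo:
    """A teleoperated mistake-and-recovery trajectory plus a human annotation
    of where recovery begins. Field names and shapes are illustrative."""
    actions: np.ndarray        # (T, action_dim) teleoperated actions
    eef_poses: np.ndarray      # (T, 4, 4) end-effector poses in the world frame
    object_poses: np.ndarray   # (T, 4, 4) pose of the relevant object
    recovery_start: int        # first timestep annotated as recovery

def split_offline_demo(demo: OfflineSourceDemo):
    """Split an offline demo into (1) the mistake segment, replayed verbatim in
    place of policy execution, and (2) the recovery segment, expressed relative
    to the object pose at recovery onset so it can be transformed to new scene
    configurations during data generation."""
    mistake_actions = demo.actions[: demo.recovery_start]
    recovery_actions = demo.actions[demo.recovery_start :]
    # Object-relative end-effector poses for the recovery segment.
    obj_T = demo.object_poses[demo.recovery_start]
    recovery_rel_poses = np.linalg.inv(obj_T) @ demo.eef_poses[demo.recovery_start :]
    return mistake_actions, recovery_actions, recovery_rel_poses
```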
MimicGen assumes that the task can be represented as a known sequence of object-centric subtasks. However, catastrophic policy failures may cross the boundaries between object-centric subtasks, reverting task progress to earlier stages. For instance, consider the task below, which consists of (1) grasping object 1, (2) placing object 1, (3) grasping object 2, and (4) placing object 2. An imprecise placement of object 1 can require a re-grasp, reverting the task to the first subtask. Thus, during data generation I-MG models the task as a sequence of subtask attempts rather than subtasks, allowing multiple attempts at each subtask and reversion to earlier stages. When evaluated under the mistake distribution, the learned policy can recover from this failure and occasionally exhibits closed-loop regrasping, an emergent behavior that does not appear in the dataset. A sketch of this attempt-based generation loop appears after the clips below.
Data Collection
Learned Policy Execution
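A minimal sketch of the attempt-based generation loop described above, under assumed helpers: each subtask exposes an `is_complete(env)` check, and a hypothetical `segment_for(env, subtask)` yields the actions of a transformed source segment. After every attempt, progress is recomputed from scratch, so a failure that undoes an earlier subtask reverts the loop to that stage.

```python
def generate_episode(env, subtasks, segment_for, max_attempts=3):
    """Treat the task as a sequence of subtask *attempts* rather than subtasks.

    `subtasks` is the ordered list of object-centric subtasks, each with an
    `is_complete(env)` predicate; `segment_for(env, subtask)` yields the actions
    of a transformed source segment. Both are illustrative stand-ins for the
    actual data generation code.
    """
    attempts = [0] * len(subtasks)
    idx = 0
    while idx < len(subtasks):
        if attempts[idx] >= max_attempts:
            return False  # discard this generation attempt
        attempts[idx] += 1
        for action in segment_for(env, subtasks[idx]):
            env.step(action)
        # Re-check progress from the start: e.g., an imprecise place of
        # object 1 can revert the loop to the grasp-object-1 stage.
        idx = 0
        while idx < len(subtasks) and subtasks[idx].is_complete(env):
            idx += 1
    return True
```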
We perform 3 training runs with different seeds (2000 epochs each) for each of the settings below, reporting the mean and standard deviation across seeds of each run's highest-performing checkpoint (checkpoints are evaluated over 50 trials every 50 epochs). We find low variance (standard deviation < 7%) across training seeds, indicating training stability. A sketch of this aggregation is shown after the results below.
Nut Insertion, Source Interventions: 32.0% +/- 5.9%
Nut Insertion, MimicGen Full Demos: 54.7% +/- 6.8%
Nut Insertion, I-MG Policy Ablation: 86.0% +/- 0.0%
Nut Insertion, I-MG: 98.7% +/- 0.9%
2-Piece Assembly, I-MG: 74.7% +/- 5.2%
Coffee, I-MG: 84.0% +/- 4.3%
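The aggregation behind the numbers above can be sketched as follows: for each training seed, take the success rate of its best checkpoint (each checkpoint evaluated over 50 trials), then report the mean and standard deviation across seeds. The helper below is illustrative; whether the reported deviation is the population or sample standard deviation is an assumption (NumPy's default, population, is used here).

```python
import numpy as np

def aggregate_across_seeds(per_seed_checkpoint_success):
    """Aggregate evaluation results across training seeds.

    `per_seed_checkpoint_success` has one entry per seed; each entry is the
    list of per-checkpoint success rates (each estimated from 50 trials).
    For each seed we keep its best checkpoint, then report the mean and
    standard deviation across seeds. Population std is an assumption.
    """
    best_per_seed = np.array([max(rates) for rates in per_seed_checkpoint_success])
    return float(best_per_seed.mean()), float(best_per_seed.std())

# Usage with illustrative (not actual) per-checkpoint success rates:
# mean, std = aggregate_across_seeds([[0.90, 0.98], [0.88, 0.96], [0.92, 0.94]])
```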