Human-in-the-Loop Imitation Learning using Remote Teleoperation

Video Summary

Motivation

Imitation Learning suffers from covariate shift

Small action errors can lead to unseen states, causing failure.

Intervention-based Policy Learning

Allowing a human to intervene during policy rollouts and provide corrections can mitigate these issues. However, most prior methods have been limited to 2D driving domains, which are substantially more tolerant to error than manipulation.

Bottlenecks in Manipulation Tasks

peg_covar_1.mp4

Human Demonstration

Insertion requires a precise sequence of actions - we call such regions of the state space bottlenecks.

peg_covar_2.mp4

Policy Execution

Small deviations near bottleneck regions can cause a trained policy to fail.

Contributions

  1. We develop a system that enables remote teleoperation for 6-DoF robot control and a natural human intervention mechanism well suited to robot manipulation.

  2. We introduce Intervention Weighted Regression (IWR), a simple yet effective method for learning from human interventions that encourages the policy to leverage the interventions to learn how to traverse bottlenecks.

  3. We evaluate our system and method on two challenging contact-rich manipulation tasks: a threading task and a coffee machine task. We demonstrate that (1) policies trained on data collected by our system outperform policies trained on an equivalent amount of full human demonstration trajectories, (2) IWR outperforms alternatives for learning from the intervention data, and (3) our results hold across data collected from multiple human operators.

Remote Teleoperation for Collecting Interventions

Our system allows operators to remotely monitor trained policies and intervene when necessary. An operator only needs a smartphone and a web browser to participate in data collection. The operator watches the trained policy in a video stream until they decide to intervene. During an intervention, they move their phone in free space to apply relative pose commands to the robot arm. This provides a natural way for users to apply corrections.
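
For illustration, below is a minimal sketch of how relative pose commands could be computed from consecutive phone poses. This is an assumed interface, not our system's code; get_phone_pose and send_delta_command are hypothetical placeholders for the device and robot interfaces.

    import numpy as np
    from scipy.spatial.transform import Rotation as R

    def relative_pose_command(prev_pos, prev_quat, curr_pos, curr_quat):
        # Translation delta between two consecutive phone poses.
        delta_pos = np.asarray(curr_pos) - np.asarray(prev_pos)
        # Rotation delta, returned in axis-angle form for the arm controller.
        delta_rot = R.from_quat(curr_quat) * R.from_quat(prev_quat).inv()
        return delta_pos, delta_rot.as_rotvec()

    # During an intervention (hypothetical device/robot interfaces):
    #     prev_pos, prev_quat = get_phone_pose()
    #     while intervening:
    #         curr_pos, curr_quat = get_phone_pose()
    #         send_delta_command(*relative_pose_command(prev_pos, prev_quat, curr_pos, curr_quat))
    #         prev_pos, prev_quat = curr_pos, curr_quat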

Intervention Weighted Regression (IWR)

Our method partitions the collected data into intervention and non-intervention samples, and then draws from the two sets in equal proportion during training. This re-weighting prioritizes the intervention samples while the non-intervention samples regularize the policy to stay close to the data-collection policy.
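
As a concrete illustration, here is a minimal PyTorch-style sketch of this 50/50 re-weighting. It assumes each (observation, action) sample carries a binary intervention flag; the names are hypothetical and this is not the released implementation.

    import torch
    from torch.utils.data import TensorDataset, WeightedRandomSampler, DataLoader

    def make_iwr_loader(obs, actions, is_intervention, batch_size=64):
        # Draw intervention and non-intervention samples in equal proportion,
        # regardless of how many of each were collected.
        is_intervention = is_intervention.bool()
        n_int = int(is_intervention.sum())
        n_non = len(is_intervention) - n_int
        weights = torch.empty(len(is_intervention))
        weights[is_intervention] = 0.5 / max(n_int, 1)    # intervention samples
        weights[~is_intervention] = 0.5 / max(n_non, 1)   # non-intervention samples
        sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
        dataset = TensorDataset(obs, actions, is_intervention)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)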

We apply IWR iteratively, alternating between collecting new data with a human operator supervising the latest policy and re-training the policy on the aggregated data with IWR.
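
The outer loop can be sketched as follows; collect_with_interventions and train_policy_iwr are hypothetical placeholders for the data-collection session and the IWR training step, not functions from our codebase.

    def human_in_the_loop_training(policy, num_rounds=3):
        dataset = []                                       # aggregated (obs, action, flag) samples
        for _ in range(num_rounds):
            # 1. Roll out the latest policy; the operator intervenes when needed.
            dataset += collect_with_interventions(policy)  # hypothetical helper
            # 2. Re-train on all data so far with the balanced sampling sketched above.
            policy = train_policy_iwr(dataset)             # hypothetical helper
        return policy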

IWR algorithm block

Multi-Stage Manipulation Tasks with Bottlenecks

task_threading.mp4

Threading

The robot must thread a rod into a wooden ring. The task contains two bottlenecks - grasping the rod and inserting it into the ring. The insertion must be performed carefully because the ring can easily move if the rod bumps into it.

task_coffee.mp4

Coffee Machine

The robot must prepare a cup of coffee. The task contains three bottlenecks - grasping the pod, fitting it into the machine, and closing the lid. Grasping and inserting the pod require precision - small errors can cause the pod to slip out of the gripper or fail to be inserted into the machine.

The robot needs to generalize to a diverse distribution of task instances.

reset_dist_SawyerCircusTeleop.mp4
reset_dist_SawyerCoffeeContactTeleop.mp4

Data Collection Study

We collected data from 3 different operators who differed in skill level and produced data of varying quality.

operator_diff_1.mp4

Operator 1 (Experienced)

operator_diff_2.mp4

Operator 3 (Inexperienced)

Intervention data outperforms full human demonstrations

IWR outperforms other intervention-based algorithms

IWR can learn from data collected by other intervention-based algorithms

Common Mistakes and Corrections

mistakes_threading_qual_1_ajay_v2_bs_true_int_only_slow.mp4
mistakes_coffee_qual_1_josiah_v3_bs_true_int_only_slow.mp4

Most mistakes and corrections occur near bottleneck regions.

Qualitative Policy Performance

IWR_threading_qual_1_ajay_v3_seed_1_epoch_1750.mp4
IWR_coffee_qual_1_ajay_v3_seed_100_epoch_1350.mp4

Trained policies demonstrate corrective behaviors.