ThriftyDAgger: Budget-Aware Novelty and Risk Gating for Interactive Imitation Learning

Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller,

Albert Wilcox, Daniel Brown, Ken Goldberg

Paper: [Link] | Code: [Link]

Abstract

Effective robot learning often requires online human feedback and interventions that can cost significant human time, giving rise to the central challenge in interactive imitation learning: is it possible to control the timing and length of interventions to both facilitate learning and limit burden on the human supervisor? This paper presents ThriftyDAgger, an algorithm for actively querying a human supervisor given a desired budget of human interventions. ThriftyDAgger uses a learned switching policy to solicit interventions only at states that are sufficiently (1) novel, where the robot policy has no reference behavior to imitate, or (2) risky, where the robot has low confidence in task completion. To detect the latter, we introduce a novel metric for estimating risk under the current robot policy. Experiments in simulation and on a physical cable routing task suggest that ThriftyDAgger's intervention criteria balance task performance and supervisor burden more effectively than prior algorithms. ThriftyDAgger can also be applied at execution time, where it achieves a 100% success rate on both the simulation and physical tasks. A user study (N=10) in which users control a three-robot fleet while also performing a concentration task suggests that ThriftyDAgger increases human and robot performance by 58% and 80% respectively compared to the next best algorithm while reducing supervisor burden.

The key insight is that even when states are familiar, the robot policy may have low probability of task success, so we train a Q-function to estimate the probability of successful convergence to a goal set under the robot's policy and use this to define a measure of risk. Since the Q-function need only distinguish clearly risky states (in which the robot needs help) from others, it requires far less data and need not be as accurate as Q-functions used for reinforcement learning. Experiments suggest that ThriftyDAgger
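To make the switching rule concrete, below is a minimal, illustrative sketch of this kind of novelty- and risk-gated control hand-off. It is not the released implementation: it assumes an ensemble-variance novelty estimate, a learned Q-function `q_success` that predicts the probability of reaching the goal set from a state-action pair, and hypothetical threshold names `tau_nov`, `tau_risk`, and `tau_disc`.

```python
import numpy as np

# Minimal sketch of novelty/risk gating for interactive imitation learning.
# Assumptions (not taken from the released code): novelty is the disagreement
# of an ensemble of robot policies, risk is 1 - Q(s, a) where Q estimates the
# probability of task success under the robot policy, and all threshold names
# are illustrative placeholders.

def novelty(policy_ensemble, state):
    """Disagreement among the ensemble members' proposed actions."""
    actions = np.stack([pi(state) for pi in policy_ensemble])
    return float(actions.std(axis=0).mean())

def risk(q_success, state, action):
    """Estimated probability of failing to reach the goal set."""
    return 1.0 - q_success(state, action)

def should_intervene(policy_ensemble, q_success, state, robot_action,
                     tau_nov, tau_risk):
    """Solicit a human intervention at sufficiently novel or risky states."""
    return (novelty(policy_ensemble, state) > tau_nov
            or risk(q_success, state, robot_action) > tau_risk)

def should_cede_control(q_success, state, robot_action, human_action,
                        tau_disc, tau_risk):
    """Return control to the robot once its action agrees with the human's
    and the state is no longer estimated to be risky."""
    discrepancy = float(np.linalg.norm(robot_action - human_action))
    return (discrepancy < tau_disc
            and risk(q_success, state, robot_action) < tau_risk)
```

In this sketch, the thresholds play the role of the budget-tuning knobs: loosening them solicits fewer interventions, tightening them solicits more.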

(1) Can learn robust policies for complex, long-horizon robotic manipulation tasks, including a vision-based cable routing task on a physical robot, while limiting supervisor burden to desired values.

(2) Solicits interventions that enable efficient fleet learning, boosting both robot productivity and human productivity on a distractor task while requiring fewer interventions and total human actions than prior algorithms.

Peg Insertion in Simulation

We evaluate ThriftyDAgger on a long-horizon peg insertion task in simulation, where the goal is to grasp a washer from an initial random pose and place it over a cylinder at a fixed location. To provide intuition on the task and ThriftyDAgger's switching mechanisms, we show 2 episodes from early in training below. In the first episode, the autonomous policy misses the grasp, resulting in a sufficiently novel state to switch control. In the second episode, the initial pose of the washer is tricky for the low-quality base policy, triggering a risk-based switch. We find that our switching mechanisms result in more informative interventions than prior robot-gated algorithms and even human-gated algorithms (Table 1).

Robot Fleet User Study

We then evaluate how ThriftyDAgger and comparison algorithms can be used to help users (N=10) control a fleet of 3 robots performing the same peg insertion task. Experiments suggest that ThriftyDAgger achieves significantly higher throughput (number of task completions by the robots) while helping users simultaneously achieve high performance on a distractor task, which they work on between solicited interventions. Notably, ThriftyDAgger achieves this while soliciting fewer interventions and human actions than comparisons, including a human-gated baseline where the human does not have a distractor task (Table 2). Below we provide videos of the robot-gated user interface and human-gated user interface.
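One simple way to route a single supervisor across a small fleet of robot-gated learners is sketched below. This is our own illustration, not the paper's implementation: it assumes each robot exposes a hypothetical `requests_help()` gate (e.g., the novelty/risk check above) and that pending requests are served first-come, first-served while the human works on the distractor task in between.

```python
from collections import deque

# Illustrative sketch of routing one supervisor across a fleet of robots.
# Assumptions (not from the paper): each robot exposes a requests_help()
# method implementing its gating criteria, and pending requests are served
# in first-come, first-served order.

class FleetSupervisor:
    def __init__(self):
        self.requests = deque()  # robot ids awaiting an intervention

    def poll(self, robots):
        """Check each robot's gate and enqueue new intervention requests."""
        for robot_id, robot in robots.items():
            if robot.requests_help() and robot_id not in self.requests:
                self.requests.append(robot_id)

    def next_robot(self):
        """Robot the human should attend to next, if any."""
        return self.requests.popleft() if self.requests else None
```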

User Interface for Robot-Gated Algorithms

Here we visualize the interface for robot-gated algorithms (ThriftyDAgger, LazyDAgger, SafeDAgger). We also visualize the raw values for ThriftyDAgger's novelty, risk, and discrepancy estimates to provide additional intuition, where the values are highlighted in red if they violate thresholds. Note that the user does not see these values.

User Interface for Human-Gated Algorithms

Here we show the UI for the human-gated algorithm HG-DAgger. The user does not perform a distractor task and can see a "bird's eye view" of all 3 robots performing the task on the left. The user can switch which robot to observe on the big screen with the buttons on the top right, and can take or cede control of the robot currently shown there.

Physical Cable Routing

Finally, we evaluate ThriftyDAgger and baselines on a long-horizon, image-based cable routing task, with results in Table 3 showing similar trends. We visualize several illustrative episodes below as GIFs (faster than real-time) of the raw RGB images provided to the learned policies (after downsampling). Teleoperation is performed through the da Vinci Research Kit master controllers visualized below on the right.

Full Demo


One of the 25 task demonstrations used to initialize the base policy.

Behavior Cloning Rollout (Failure)

Offline Behavior Cloning is unable to make significant task progress.

Teleoperation Interface

The human observes the workspace through the lenses. Control is initiated by pressing the right foot pedal and moving the right joystick, whose pose is synchronized with the robot arm.

HG-DAgger Auto Rollout (Success)

A successful rollout after 1,500 environment steps of training with HG-DAgger, with interventions not allowed during execution.

HG-DAgger Auto Rollout (Failure)

An unsuccessful rollout after 1,500 environment steps of training with HG-DAgger, with interventions not allowed during execution. The robot goes the wrong way after bumping into the final obstacle.

HG-DAgger Intervention-Aided Rollout (Success)

A successful rollout after 1,500 environment steps of training with HG-DAgger, with interventions allowed during execution. Interventions are denoted with white text and slowed down 5x for clarity. Here the human intervenes when they see the policy veer off-course.

ThriftyDAgger Auto Rollout (Success)

A successful rollout after 1,500 environment steps of training with ThriftyDAgger, with interventions not allowed during execution.

ThriftyDAgger Auto Rollout (Failure)

An unsuccessful rollout after 1,500 environment steps of training with ThriftyDAgger, with interventions not allowed during execution. The robot stalls upon reaching the second obstacle. It eventually escapes, but fails to reach the goal pose in the allotted time despite passing all 4 obstacles.

ThriftyDAgger Intervention-Aided Rollout (Success)

A successful rollout after 1,500 environment steps of training with ThriftyDAgger, with interventions allowed during execution. Interventions are denoted with white text and slowed down 5x for clarity. Here ThriftyDAgger solicits an intervention when the cable begins to push against itself, and resumes autonomous control once the human has untangled it.