ICRA 2020 Presentation
Large-scale supervision has accelerated progress in other fields such as Computer Vision and Natural Language Processing, but policy learning has witnessed no such success.
Several large-scale datasets exist today for Computer Vision and Natural Language Processing. These datasets have enabled rapid progress in these fields.
However, new supervision mechanisms such as RoboTurk allow thousands of task demonstrations to be collected in a matter of days. The advent of such mechanisms and large datasets motivates the following question: does a policy learning algorithm necessarily need to interact with the environment, or can a robust and performant policy be learned purely from external experiences stored in large datasets?
The RoboTurk supervision mechanism
The RoboTurk dataset
For example, suppose we are given a large collection of demonstrations on a pick-and-place task where the robot arm must pick up a soda can and place it in a target region. Given this large dataset, we would like to learn a policy without allowing the policy to collect additional data.
Our goal is to leverage a large collection of demonstrations to learn a performant policy without collecting any additional data.
Learning in this setting presents several challenges - the data can consist of suboptimal solution approaches and exhibit substantial diversity. To overcome these issues, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel policy learning framework for offline learning from large-scale datasets.
- We propose Implicit Reinforcement without Interaction at Scale (IRIS), a policy learning framework that enables offline learning from a large set of diverse and suboptimal demonstrations by selectively imitating local sequences from the dataset.
- We evaluate IRIS across three datasets collected on tasks of varying difficulty. The first dataset is a pedagogical dataset that exhibits significant diversity in the demonstrations. The second dataset exhibits significant suboptimality in the demonstrations and is collected by one user. The third dataset is the RoboTurk dataset collected by humans via crowdsourcing. While our framework can leverage rewards if present in the demonstrations, the experiments only assume sparse task completion rewards that occur at the end of each demonstration.
- Empirically, our experiments demonstrate that IRIS is able to leverage large-scale off-policy task demonstrations that exhibit suboptimality and diversity, and significantly outperforms other imitation learning and batch reinforcement learning baselines.
Why is learning from large-scale demonstrations difficult?
Demonstrations collected through large-scale supervision can have significant suboptimality.
Fumbling the Can
The robot fumbles the can around before grasping it.
Multiple Missed Grasps
The robot executes grasps at the wrong time.
Failed Sideways Grasp
The robot tries to pick the can up sideways and fails.
Failed Top-Down Grasp
The robot tries to approach the can from the top to grasp it but misses.
Getting Stuck During Placement
The robot has trouble placing the can in the bin.
Blocked by the Wall
The robot gets blocked by the wall.
Demonstrations collected through large-scale supervision can consist of diverse task instances and exhibit different solution strategies.
Diversity in Task Instances
The demonstrations start with significant variation in arm and can poses.
Diversity in Solution Strategies
Careful Top-Down Grasp
The robot carefully approaches the can from the top to grasp it.
Slide Gripper Into Can for Grasp
The robot slides the gripper into the can sideways to grasp it.
Partial Tilt and Grab
The robot partially tilts the can to grasp it.
Full Tilt and Top-Down Grasp
The robot tilts the can over and then grasps it from the top.
Full Tilt and Slide Grasp
The robot tilts the can over and then slides the gripper into the can to grasp it.
Use Wall for Support in Grasp
The robot uses the wall as a support point to aid in grasping.
- Our main insight is to decompose the policy learning problem into two components: (1) low-level goal-conditioned imitation and (2) high-level goal selection.
- The low-level controller is trained to imitate short sequences from the demonstration data. The last observation is treated as the goal. Thus, this controller is able to reach nearby goals.
- The high-level goal selection mechanism consists of a generative model that proposes goals, and a value function that is used to pick the best goals.
- At test time, given an observation, the high-level mechanism selects a new goal, and the low-level controller is unrolled for T timesteps to try to reach that goal.
- This process repeats until the end of the rollout.
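As an illustrative sketch of this two-level loop (not the paper's actual implementation — `goal_vae`, `value_fn`, `low_level_policy`, and `env` are hypothetical stand-ins for the learned components), the test-time rollout might look like:

```python
import numpy as np

def iris_rollout(env, goal_vae, value_fn, low_level_policy,
                 horizon=400, T=10, num_goal_samples=100):
    """Run one IRIS-style episode: alternate high-level goal selection
    with T-step low-level goal-reaching."""
    obs = env.reset()
    info = {}
    for _ in range(horizon // T):
        # High level: propose candidate goal observations reachable from
        # the current observation, then keep the one that the learned
        # value function scores highest.
        candidates = goal_vae.sample(obs, num_goal_samples)
        scores = [value_fn(g) for g in candidates]
        goal = candidates[int(np.argmax(scores))]

        # Low level: unroll the goal-conditioned controller for T steps
        # to try to reach the selected goal.
        for _ in range(T):
            action = low_level_policy(obs, goal)
            obs, reward, done, info = env.step(action)
            if done:
                return info
    return info
```

The key design point is that the high level never emits actions itself; it only re-targets the low-level controller every T steps.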
How does IRIS account for suboptimal demonstrations?
- The low-level controller only needs to reproduce short action sequences, so it does not have to account for suboptimal behavior; that burden falls on the high-level mechanism.
- The value function is used to choose goals that make significant task progress, accounting for suboptimal demonstrations.
How does IRIS account for diverse demonstrations?
- The goal-conditioned controller is trained to condition on future goal observations at a fine temporal resolution and produce unimodal action sequences.
- Consequently, it is not concerned with modeling diversity, but rather reproduces small action sequences in the dataset to move from one state to another.
- The generative model in the goal selection mechanism proposes potential future observations that are reachable from the current observation - this explicitly models the diversity of solution approaches.
- In this way, IRIS decouples the problem into reproducing specific, unimodal sequences (policy learning) and modeling state trajectories that encapsulate different solution approaches (diversity), allowing for selective imitation.
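A minimal sketch of how such goal-conditioned training data could be prepared — assuming demonstrations are stored as (observations, actions) trajectories; the slicing scheme and names here are illustrative, not the paper's exact pipeline:

```python
def make_goal_conditioned_samples(demos, T=10):
    """Slice each demonstration into overlapping length-T windows.
    The observation T steps ahead serves as the goal that the low-level
    controller is conditioned on, so the controller only ever has to
    reproduce short, locally unimodal action sequences.

    `demos` is a list of (observations, actions) pairs, where
    len(observations) == len(actions) + 1.
    """
    samples = []
    for observations, actions in demos:
        for start in range(len(actions) - T + 1):
            window_obs = observations[start:start + T]
            window_act = actions[start:start + T]
            goal = observations[start + T]  # observation T steps ahead
            samples.append((window_obs, goal, window_act))
    return samples
```

Because every training target is a short in-dataset sequence conditioned on its own future observation, diversity across demonstrations never has to be averaged over by the controller.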
How does IRIS learn from off-policy data?
- IRIS constrains learning to occur within the distribution of training data.
- The goal-conditioned controller directly imitates sequences from the training data, and the generative goal model is also trained to propose goal observations from the training data.
- The value learning component of the goal selection mechanism mitigates extrapolation error by making sure that the Q-network is only queried on state-action pairs that lie within the training distribution, following prior work on batch-constrained Q-learning.
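To illustrate the batch-constrained idea in isolation — this is a hedged sketch of the general technique, not IRIS's exact value-learning code; `action_sampler` and `q_fn` are hypothetical stand-ins for a generative model fit to the dataset and a learned Q-network:

```python
def constrained_bellman_target(reward, next_state, action_sampler, q_fn,
                               gamma=0.99, num_samples=10):
    """Compute a batch-constrained Bellman target: rather than maximizing
    Q over all actions (which would query the Q-network on actions never
    seen in the data, inviting extrapolation error), maximize only over
    candidate actions proposed by a generative model trained on the
    dataset itself."""
    candidates = action_sampler(next_state, num_samples)
    best_q = max(q_fn(next_state, a) for a in candidates)
    return reward + gamma * best_q
```

Restricting the maximization to data-supported actions is what allows the value function to be trained purely offline without diverging.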
Datasets and Tasks
A large, varied dataset generated by sampling random paths from the start location to the goal and executing noisy, random-magnitude actions along them. Demonstration paths that deviate from the central path are made to take longer detours before rejoining it. The dataset contains 250 demonstrations.
The goal is to actuate the Sawyer robot arm to grasp and lift the cube on the table. The demonstrator lifted the cube with a consistent grasping strategy, but took their time to grasp the cube, often moving the arm to the cube and then back, or actuating the arm from side to side near the cube. This was done intentionally to ensure that there would be several state-action pairs in the dataset with little value. The dataset contains 137 demonstrations.
A filtered version of the RoboTurk pilot dataset consisting of the fastest 225 trajectories. These demonstrations were collected across multiple humans and exhibit significant suboptimality and diversity in the solution approaches.
Qualitative Results: Learned Policies
Behavioral Cloning (BC)
BC tries to directly imitate actions at each state, making it extremely sensitive to suboptimal data.
Batch Constrained deep-Q Learning (BCQ)
BCQ learns a value function, and is consequently able to learn reaching behavior. However, a failed grasp causes the policy to diverge, since it moves into states the algorithm never saw at training time.
IRIS is able to selectively reproduce short behaviors from the dataset, leading to task success.
Behavioral Cloning (BC)
BC is unable to deal with diverse supervision because it only conditions on the current state.
Batch Constrained deep-Q Learning (BCQ)
BCQ is unable to reproduce reasonable behavior from the dataset since it is unable to capture the diversity in the dataset.
IRIS uses goal-conditioned imitation, which is crucial to enable learning from this diverse dataset.
Qualitative Results: Interesting Cases
IRIS can recover from bad states.
Even when IRIS fails, it exhibits stable, closed-loop behavior.
Qualitative Results: Visualizing Goal Selection
Visualization of the low-level controller taking goals selected on the right and trying to reach them.
Visualization of the goals selected by IRIS along a successful policy rollout.
Quantitative Results: Task Success Rate
IRIS can get ~80% success rate on new lift task instances.
IRIS is the only algorithm that is able to learn performant policies from crowdsourced demonstrations.
Quantitative Results: Learning Curves
We present a comparison of IRIS against several baselines on the Robosuite Lift and RoboTurk Cans datasets. There is a stark contrast in performance between variants of IRIS and the baseline models, which suggests that goal-conditioned imitation is critical for good performance.
Quantitative Results: Dataset Size Ablation
We present a dataset size comparison to understand how the performance of IRIS is affected by different quantities of data.