Pre-Training for Robots: Offline RL Enables Learning New Tasks in a Handful of Trials

Aviral Kumar*, Anikait Singh*, Frederik Ebert*, Yanlai Yang, Chelsea Finn, Sergey Levine

Contact emails: {aviralk, asap7772}@berkeley.edu

https://arxiv.org/abs/2210.05178 

Video

Overview

Progress in deep learning highlights the tremendous potential of large, diverse datasets for attaining effective generalization, and makes it enticing to leverage broad datasets for robust generalization in robotic learning as well. However, in practice we often want to learn a new skill in a new environment that is unlikely to be contained in the prior data. We therefore ask: how can we leverage existing diverse offline datasets, in combination with small amounts of task-specific data, to solve new tasks while still enjoying the generalization benefits of training on large amounts of data? In this paper, we demonstrate that end-to-end offline RL can be an effective approach for doing this, without the need for any representation learning or vision-based pre-training.

We present pre-training for robots (PTR), a framework based on offline RL that learns new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, using as few as 10 demonstrations. PTR utilizes an existing offline RL method, conservative Q-learning (CQL), but extends it with several crucial design decisions that enable PTR to actually work and outperform a variety of prior methods. To our knowledge, PTR is the first RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens. We also demonstrate that PTR can enable effective autonomous fine-tuning and improvement in a handful of trials, without needing any demonstrations.
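To make the core idea concrete, below is a minimal sketch of the conservative Q-learning (CQL) critic update that PTR builds on. The network interfaces (q_net, target_q_net, policy), the hyperparameter values, and the sampled-action approximation to the CQL regularizer are illustrative placeholders, not the exact design decisions used in PTR (see the paper for those).

```python
# Minimal sketch of a conservative Q-learning (CQL) critic update of the kind PTR
# builds on. Network classes, hyperparameters, and the action-sampling scheme
# below are illustrative placeholders, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, policy, batch, alpha=5.0, gamma=0.98, num_samples=4):
    """One CQL critic loss on a batch of (s, a, r, s', done) transitions."""
    obs, act, rew, next_obs, done = (
        batch["obs"], batch["act"], batch["rew"], batch["next_obs"], batch["done"]
    )

    # Standard Bellman backup using the target network and the current policy.
    # (Assumes policy(obs) returns an action distribution with .sample().)
    with torch.no_grad():
        next_act = policy(next_obs).sample()
        target_q = rew + gamma * (1.0 - done) * target_q_net(next_obs, next_act)
    q_data = q_net(obs, act)
    bellman_error = F.mse_loss(q_data, target_q)

    # Conservative regularizer: push down Q-values on sampled (potentially
    # out-of-distribution) actions, push up Q-values on dataset actions.
    sampled_q = torch.stack(
        [q_net(obs, policy(obs).sample()) for _ in range(num_samples)], dim=0
    )
    logsumexp_q = torch.logsumexp(sampled_q, dim=0)
    conservative_term = (logsumexp_q - q_data).mean()

    return bellman_error + alpha * conservative_term
```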

Setup

The figure above illustrates our setup. We use the toy kitchen setup described in prior work (Ebert et al. 2021) for our experiments, built around a 6-DoF WidowX 250 robot. (1): held-out toy kitchen used for experiments in Scenario 3 (denoted “toykitchen 6”); (2): re-targeting toy kitchen used for experiments in Scenario 1 (denoted “toykitchen 2”); (3): target objects used in the experiments of Scenario 3; (4): the held-out kitchen setup used for door opening (“toykitchen 1”).

Results

Scenario 1: Re-targeting skills for existing tasks to new objects during fine-tuning

Performance of PTR for “put sushi in metallic pot” in Scenario 1

We utilized the subset of the bridge data containing all pick-and-place tasks in one toy kitchen for pre-training, and selected the “put sushi in pot” task as our target task. To pose a scenario where the policy obtained after pre-training must be re-targeted to act on a different object, we collected only ten demonstrations that place the sushi in a metallic pot. PTR substantially outperforms BC (finetune), even though it only has access to demonstration data.
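One simple way to fine-tune on only ten demonstrations while retaining the benefits of pre-training is to mix target-task data with pre-training data in every batch. The sketch below illustrates this idea; the mixing fraction and the replay-buffer interface are assumptions for illustration, not necessarily the exact settings used in our experiments.

```python
# Hypothetical batch mixing during fine-tuning: each batch combines a fraction of
# target-task demonstrations with pre-training data so the policy adapts to the
# new task without discarding the prior data. The 0.7 fraction and the buffer
# .sample() API returning a dict of arrays are illustrative assumptions.
import numpy as np

def sample_mixed_batch(target_buffer, prior_buffer, batch_size=256, target_frac=0.7):
    n_target = int(batch_size * target_frac)
    n_prior = batch_size - n_target
    target_batch = target_buffer.sample(n_target)   # few-shot target-task demos
    prior_batch = prior_buffer.sample(n_prior)      # diverse pre-training data
    # Concatenate field-by-field (obs, act, rew, next_obs, done, task_id, ...).
    return {k: np.concatenate([target_batch[k], prior_batch[k]], axis=0)
            for k in target_batch}
```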

Scenario 2: Generalizing to previously unseen domains

We study whether PTR can adapt behaviors seen in the pre-training data to new domains. To this end, we study a door opening task, which requires significantly more complex maneuvers and more precise control than the pick-and-place tasks studied above. The target door we wish to open and the corresponding toy kitchen domain are never seen in the pre-training data, and the doors that do appear in the pre-training data exhibit different sizes, shapes, handle types, and visual appearances. Results are shown below.

Scenario 3: Learning to solve new tasks in new domains

Unlike the two scenarios studied above, in this scenario we attempt to solve a new task in a new kitchen scene. This task is represented via a unique task identifier, and during pre-training we are not provided with any data for this task identifier, or even any data from the kitchen scene where this task is situated. We pre-train on all 80 pick-and-place style tasks from the bridge dataset, while holding out any data from the new task's kitchen scene, and then fine-tune on 10 demonstrations for each of 4 target tasks independently in this new kitchen. Results are shown below.
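Since each task is represented by a task identifier, fine-tuning on a brand-new task can be implemented by conditioning the same Q-function and policy on a previously unused one-hot index. The sketch below illustrates this; the total number of task slots and the way the identifier is fused with visual features are assumptions for illustration.

```python
# Illustrative task conditioning: observations are paired with a one-hot task
# identifier, and the new target task simply occupies a previously unused index.
# The number of task slots and the concatenation with image features are
# assumptions for illustration, not the paper's exact architecture.
import numpy as np

NUM_TASK_SLOTS = 100   # assumed capacity; pre-training covers 80 pick-and-place tasks
NEW_TASK_ID = 80       # an unused slot, assigned to the new target task

def one_hot_task(task_id, num_slots=NUM_TASK_SLOTS):
    vec = np.zeros(num_slots, dtype=np.float32)
    vec[task_id] = 1.0
    return vec

def make_conditioned_input(image_features, task_id):
    # Concatenate visual features with the task one-hot before the Q/policy heads.
    return np.concatenate([image_features, one_hot_task(task_id)], axis=-1)
```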

To better understand why BC (finetune) performs worse than PTR, we visualize trajectory rollouts from both methods. As shown in the figure above, PTR accurately reaches the croissant and grasps it to solve the task, while BC (finetune) is imprecise and grasps the bowl instead of the croissant, resulting in failure. This indicates that PTR is able to prioritize the most “critical” transitions, success at which is crucial for solving the entire task, despite having access only to demonstration data. This observation is consistent with prior theoretical analyses such as: https://openreview.net/pdf?id=AP1MKT37rJ

Checkpoint Selection Heuristic

Q-Value Visualization

Since we wish to learn task-specific policies that do not overfit to small amounts of data, thereby losing their generalization ability, we must apply the right number of gradient steps during fine-tuning: too few gradient steps produce policies that do not succeed at the target task, while too many gradient steps produce policies that have likely lost the generalization ability of the pre-trained policy. To handle this tradeoff, we use a simple heuristic: we run fine-tuning for many iterations while plotting the learned Q-values over a held-out dataset of trajectories from the target task. We then pick the checkpoint for which the learned Q-values are (roughly) monotonically increasing over the course of a held-out trajectory. Empirically, we find that this heuristic reliably identifies a good checkpoint.
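A minimal sketch of this heuristic is shown below, assuming access to each checkpoint's learned Q-function and a handful of held-out target-task trajectories of (observation, action) pairs; the tolerance and the exact scoring rule are illustrative choices, not prescribed by the method.

```python
# Sketch of the checkpoint-selection heuristic: score each fine-tuning checkpoint
# by how close its Q-values are to monotonically increasing along held-out
# target-task trajectories, and keep the best-scoring checkpoint.
import numpy as np

def monotonicity_score(q_fn, trajectory, tol=0.0):
    """Fraction of consecutive steps where Q does not decrease (beyond tol)."""
    q_vals = np.array([q_fn(obs, act) for obs, act in trajectory])
    diffs = np.diff(q_vals)
    return float(np.mean(diffs >= -tol))

def select_checkpoint(checkpoints, heldout_trajectories):
    """checkpoints: list of (step, q_fn) pairs; returns the best-scoring step."""
    scores = {
        step: np.mean([monotonicity_score(q_fn, traj) for traj in heldout_trajectories])
        for step, q_fn in checkpoints
    }
    return max(scores, key=scores.get)
```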

Policy Rollouts using Checkpoint Heuristic

Poorly Chosen Checkpoint for Door Task

Well Chosen Checkpoint for Door Task

PTR improves the offline initialization through online fine-tuning

Offline Initialization

9K Steps

20K Steps

Above we see the evolution of learned behaviors during online fine-tuning. From a set of held-out initial positions, the offline initialization is unable to grasp the handle or open the door. After 9K steps, the policy is able to grasp the door handle but not yet pull the door open. After 20K steps, the policy is able to fully open the door successfully.