COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning

Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, Sergey Levine

UC Berkeley

link to arxiv | github | datasets | video

To appear in Conference on Robot Learning (CoRL), 2020

Introduction and Motivation

  • Reinforcement learning methods typically involve collecting a large amount of data for every new task. Since the amount of data we can collect for any single task is limited by time and cost considerations in the real world, the learned behavior is usually quite narrow.

  • In this paper, we propose an approach to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behavior. This prior data is not specific to any one task, and can be used to extend a variety of downstream skills.

  • We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real-world domains. Our hardest experimental setting involves composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion (a rough sketch of such an image-based Q-network is shown below).
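To make the end-to-end setup concrete, here is a minimal sketch, assuming PyTorch, of a Q-network that maps an image observation and a continuous control command to a scalar value. The layer sizes and names are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class ImageQNetwork(nn.Module):
    """Maps an image observation and a robot action to a scalar Q-value."""

    def __init__(self, img_channels=3):
        super().__init__()
        # Small convolutional encoder for the image observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Head is lazily sized: it takes the image features concatenated
        # with the low-level action and outputs a single Q-value.
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, action):
        feats = self.encoder(obs)               # (B, feat_dim)
        x = torch.cat([feats, action], dim=-1)  # condition on the action
        return self.head(x)                     # (B, 1) Q-value
```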

Problem Setting

The input to our method consists of two distinct types of datasets. The first dataset (the "prior" data) consists of unlabeled interaction data, with no associated reward labels whatsoever. For instance, in the examples we consider, shown pictorially below, this dataset need not contain any interaction with the task object of interest, and not all of it will be useful for the downstream task. In addition, we are provided with a smaller task-specific dataset that is labeled with sparse rewards, as in standard reinforcement learning settings. Our goal is to use both the prior data and the task-specific data to learn a policy that can execute the task from initial conditions that were unseen in the task data.
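As a concrete illustration, the two datasets could be represented as below. This is a minimal sketch with made-up field names, not the format of the released datasets.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Transition:
    obs: np.ndarray          # image observation
    action: np.ndarray       # low-level robot control command
    next_obs: np.ndarray
    reward: Optional[float]  # sparse +1/0 label for task data; None for prior data
    done: bool

# "Prior" dataset: a large amount of unlabeled interaction data, not specific
# to the downstream task and with no reward labels whatsoever.
prior_data: list[Transition] = [...]   # reward=None for every transition

# Task-specific dataset: much smaller, labeled with a sparse reward that is
# 1 only on task completion and 0 otherwise.
task_data: list[Transition] = [...]    # reward in {0.0, 1.0}
```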

We show that dynamic programming alone can leverage prior datasets

We start by running Q-learning on the task data, which allows Q-values to propagate from high-reward states to states further back from the goal. We then add the prior dataset to the replay buffer, assigning all of its transitions a reward of zero. Further dynamic programming on this expanded dataset allows Q-values to propagate to initial conditions that were unseen in the task data. Running reinforcement learning now results in a policy that can solve the task of interest from new initial conditions. Note that no single trajectory in our dataset solves the entire task from these new starting conditions, but Q-learning allows us to "stitch" together relevant trajectories from the prior and task data, without any additional supervision.
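The following is a minimal sketch, assuming PyTorch and the `Transition` and Q-network sketches above, of the procedure described in this section: prior transitions are added to the replay buffer with zero reward, and further Bellman backups propagate Q-values from the sparse task reward back into states that appear only in the prior data. Names are illustrative; purely offline training also generally requires a conservative regularizer (e.g., CQL) on top of this backup to avoid overestimating out-of-distribution actions, and the paper and code release describe the exact algorithm used.

```python
import dataclasses
import torch
import torch.nn.functional as F

def add_prior_data(replay_buffer, prior_data):
    """Add unlabeled prior transitions to the buffer, labeled with zero reward."""
    for t in prior_data:
        replay_buffer.append(dataclasses.replace(t, reward=0.0))
    return replay_buffer

def bellman_backup(q_net, target_q_net, policy, q_optimizer, batch, gamma=0.99):
    """One dynamic-programming step on a batch sampled from the combined buffer.

    Q-values propagate backward from high-reward task states, through the
    zero-reward prior transitions, to initial conditions that never appear in
    the task data -- this is the "stitching" effect described above.
    """
    obs, actions, rewards, next_obs, dones = batch  # batched tensors, shape (B, ...)

    with torch.no_grad():
        next_actions = policy(next_obs)  # actor proposes actions for bootstrapping
        target = rewards + gamma * (1.0 - dones) * target_q_net(next_obs, next_actions)

    loss = F.mse_loss(q_net(obs, actions), target)
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()
    return loss.item()
```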

Generalize to new initial conditions at test time

Executing tasks from initial conditions not seen in the task data: open drawer, closed drawer, blocked by drawer, and blocked by object.

Real Robot Evaluation





Video