Behavior Retrieval:  Few-Shot Imitation Learning by Querying Unlabeled Datasets


Maximilian Du, Suraj Nair, Dorsa Sadigh, Chelsea Finn

Stanford University

Paper | Code


The Behavior Retrieval Approach

Typically, to get a robot to learn a downstream task through behavior cloning, we need to collect many demonstrations of our own and train the robot from scratch.

Goal: Can we use offline, unlabeled datasets to help a robot learn a downstream task? 

Key Insight: Task-specific data not only provides new data for an agent to train on but can also inform the type of prior data the agent should use for learning.

We propose a simple approach that uses a small amount of downstream task-specific data to selectively query relevant behaviors from an offline, unlabeled dataset (which may include many sub-optimal behaviors). To do this, we first train a state-action embedder on the offline data. We then compute embeddings for all of the data and pick the transitions in the offline data that lie closest to the task-specific data in embedding space.

The agent is then jointly trained on the task-specific and retrieved data. Throughout this process, we make no assumptions about task labels in the offline data.
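The retrieval step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the distance metric and the `frac` retrieval knob are assumptions, and the embeddings here would come from the state-action embedder trained on the offline data.

```python
import numpy as np

def retrieve_transitions(offline_emb, task_emb, frac=0.25):
    """Select the fraction of offline transitions whose embeddings lie
    closest (L2 distance) to any task-specific transition.

    offline_emb: (N, d) embeddings of the unlabeled offline transitions
    task_emb:    (M, d) embeddings of the task-specific transitions
    frac:        fraction of offline data to retrieve (hypothetical knob)
    """
    # Distance from each offline embedding to its nearest task embedding.
    dists = np.linalg.norm(
        offline_emb[:, None, :] - task_emb[None, :, :], axis=-1
    ).min(axis=1)
    k = max(1, int(frac * len(offline_emb)))
    # Keep the k offline transitions that lie closest to the task data.
    return np.argsort(dists)[:k]
```

The returned indices select offline transitions that are then mixed with the task-specific demonstrations for joint training.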

Sink Environment 

In this environment, we want to grasp a soap container from a dish rack and place it on a red plate in the toy sink. There is no analogous task present in the offline dataset, but there are relevant sub-trajectories that involve the soap container. Can Behavior Retrieval extract these sub-trajectories from the offline data to improve downstream task performance?

Mixed Offline Dataset

We use a 485-demo subset of the Bridge Dataset that contains all the manipulations in this toy sink.

Task-Specific Dataset

We collect 10 task-specific demonstrations of placing the soap on the red plate in a copy of the toy sink found in the Bridge Dataset.

Ours: Behavior Retrieval (26/40)

By leveraging the relevant parts of the offline demos, Behavior Retrieval improves performance even on a task that is not explicitly part of the offline mixed dataset.

Task Data Only (17/40)

Without additional data, we can still get a functional policy, but it sometimes struggles to grasp the soap or place it properly on the plate.

All Data (7/40)

If we naively include all of the Bridge data in training, the model does not fit the desired task well, which yields some significant failures, like moving the sink around.

Goal Conditioned (0/40)

Due to the distribution shift between the Bridge Data and our environment, zero-shot goal-conditioned models do not work.

Goal Conditioned Finetuned (16/40)

After finetuning on the task data, we see a reasonable policy, but there is still a large gap in performance.

Pickle Environment

In this environment, we want to grasp a pickle and place it into a toy cup. The offline dataset contains an analogous task, but with different distractor objects and table colors. Can Behavior Retrieval extract the analogous task to improve downstream task performance?

Mixed Offline Dataset

We use a 285-demo subset of the Bridge Dataset that contains tabletop manipulations, including an analogous pickle-in-cup task.

Task-Specific Dataset

We collect 10 task-specific demonstrations of placing a pickle into a red cup. 

Ours: Behavior Retrieval (23/40)

By leveraging the relevant parts of the offline demos, Behavior Retrieval outperforms even ground-truth task selection, because it can leverage relevant sub-trajectories of otherwise irrelevant tasks. This is particularly noticeable in its stable grasping, a behavior shared across many of the Bridge trajectories.

Task Labels (10/20)

Following the technique used in the Bridge Dataset paper, we condition the policy on a one-hot task label covering all of the tasks in the Bridge data. At test time, we use the one-hot label corresponding to the analogous task in the Bridge data. This does not work zero-shot (due to shifts between the pre-collected dataset and our environment); only with fine-tuning were we able to reach this performance, despite the privileged labels.
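As a rough sketch of this baseline's conditioning (the label dimensionality and the exact policy input format here are illustrative assumptions, not details from the Bridge Dataset paper):

```python
import numpy as np

def one_hot(task_id, num_tasks):
    # One-hot task label used to condition the multi-task policy.
    v = np.zeros(num_tasks, dtype=np.float32)
    v[task_id] = 1.0
    return v

def conditioned_input(obs_features, task_id, num_tasks):
    # The policy sees observation features concatenated with the task label;
    # at test time, task_id is set to the analogous task in the offline data.
    return np.concatenate([obs_features, one_hot(task_id, num_tasks)])
```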

Task Data Only (11/40)

Without additional data, the policy struggles to find the pickle and lift it consistently.

All Data (12/40)

If we naively include all of the Bridge data in training, the model tries to capture more modes than necessary, which yields a drop in performance, especially on the pickle pickup and alignment.

Goal Conditioned (0/40)

Due to the distribution shift between the Bridge Data and our environment, zero-shot goal-conditioned models do not work.

Goal Conditioned Finetuned (11/40)

With finetuning on the target data, we get an increase in performance, but the robot still struggles to perform the correct task.

Robustness of Behavior Retrieval

With a large offline dataset, we get more exposure to different visual features and objects. Can Behavior Retrieval leverage this diversity to create a more robust policy?

Real Pickle (4/10 behavior retrieval vs. 0/10 task data only)

We substituted a real pickle (different color), which reduced task-data-only performance to zero, while Behavior Retrieval dropped only slightly.

Physical Perturbation (4/10 behavior retrieval vs. 3/10 task data only)

We moved the cup as the robot was aligning the pickle. Both models showed some robustness to this perturbation, though our method did a little better.

Wrong Cup (4/10 behavior retrieval vs. 1/10 task data only)

We switched out the red cup used in the task demonstrations for a blue cup. This reduced task-data-only performance drastically, but Behavior Retrieval dropped only slightly because it sees a similar blue cup in the Bridge Dataset.

Distractors (4/10 behavior retrieval vs. 0/10 task data only)

We included some toy kitchen items in the visual field of the robot. This caused task-data-only performance to drop to zero, while Behavior Retrieval dropped only slightly because it sees similar distractors in the retrieved Bridge data.

Nut Assembly (real)

In this environment, we want to grasp a toy square and insert it onto the red peg. The offline dataset contains this task, as well as an adversarial task that inserts the square onto the green peg. Can Behavior Retrieval select the relevant task and ignore the adversarial task? 

Mixed Offline Dataset

We consider an offline dataset of 160 demonstrations: 80 going to the correct red peg and 80 going to the wrong green peg.

Task-Specific Dataset

We collect 10 task-specific demonstrations that go to the correct red peg.

Ours: Behavior Retrieval (20/40)

By leveraging the relevant parts of the offline demos, Behavior Retrieval produces a robust policy.

Ground Truth (17/40)

If we assume privileged access to the task labels in the offline data and train only on data corresponding to the target task, we also get a usable policy. However, it performs worse than Behavior Retrieval because it excludes useful data from the irrelevant tasks, like how to grasp an object.

Task Data Only (0/40)

There is not enough data (only 10 demos) to learn a functional policy.

All Data (12/40)

If we naively include all of the offline data in training, the model gets confused and performs the wrong task (putting the square on the green peg).

Goal Conditioned (10/40)

Goal conditioning allows us to train on multiple tasks, but the goal state often underdetermines the necessary trajectory, leading to problems in harder parts of the task, like aligning the square with the peg.

Video

deanonymized_main_video_submission.mp4