Using Both Demonstrations and Language Instructions to More Efficiently Learn Robotic Tasks

Method Overview

Motivation

Current multitask policies use task embeddings based on one-hot vectors, language embeddings, or demonstration embeddings. However, language instructions and video demonstrations can often be ambiguous, especially if they were provided in environments that do not perfectly align with the environment the robot is evaluated in.

To resolve ambiguities and decrease teacher effort needed when specifying new tasks, we propose using task embedding vectors that are bimodal: containing both language embedding features and visual demonstration features. Directly conditioning a multitask policy on bimodal task embeddings allows the two modalities to contextually complement each other, enabling the robot to more clearly understand what a new task is and how to perform it.
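A minimal sketch of this idea (not the exact architecture from the paper; all module names and dimensions below are placeholders) is to concatenate the two modality embeddings into a single bimodal task embedding and feed it to the policy alongside the observation:

```python
import torch
import torch.nn as nn

class BimodalTaskConditionedPolicy(nn.Module):
    """Hypothetical sketch: a policy conditioned on a concatenated
    (language embedding, demonstration embedding) task vector."""

    def __init__(self, obs_dim, lang_dim, demo_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + demo_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, obs, z_lang, z_demo):
        # The bimodal task embedding is the concatenation of both modalities.
        task_embedding = torch.cat([z_lang, z_demo], dim=-1)
        return self.net(torch.cat([obs, task_embedding], dim=-1))


# Example usage with arbitrary placeholder dimensions.
policy = BimodalTaskConditionedPolicy(obs_dim=64, lang_dim=768, demo_dim=128, action_dim=7)
obs = torch.randn(1, 64)
z_lang = torch.randn(1, 768)   # e.g., from a language encoder
z_demo = torch.randn(1, 128)   # e.g., from a demonstration encoder
action = policy(obs, z_lang, z_demo)
```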

Contributions

We present DeL-TaCo (Joint Demo-Language Task Conditioning), a framework for conditioning a multitask policy simultaneously on both a demonstration and a corresponding language instruction. We introduce a challenging distribution of hundreds of robotic pick-and-place tasks and show that DeL-TaCo improves generalization and significantly decreases the number of expert demonstrations needed to learn novel tasks at test time.

To our knowledge, this is the first work to show that simultaneously conditioning a multitask robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.

Sample Tasks (with associated task language strings)

Each GIF depicts four rollouts. Container positions and distractor objects vary on each reset.

Selected Train Tasks

Task ID 1: "Put fountain vase in green bin."

Task ID 87: "Put red colored object in red bin."

Task ID 141: "Put chalice shaped object in front bin."

Task ID 270: "Put bottle in right bin."

Selected Test Tasks

Task ID 32: "Put black and white colored object in green bin."

Task ID 97: "Put trapezoidal prism shaped object in red bin."

Task ID 178: "Put box sofa in back bin."

Task ID 248: "Put cylinder shaped object in left bin."

Train/Test Splits

Color Split

Object Split

Shape Split

All Train and Test Tasks


Rows = object identifiers. Columns = container identifiers. Numbers indicate task IDs, which are used only by the one-hot policies.
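As a point of reference, the one-hot policies consume these task IDs as one-hot vectors rather than language or demonstration embeddings. A minimal sketch, assuming a fixed total task count (the value below is illustrative):

```python
import torch
import torch.nn.functional as F

NUM_TASKS = 300  # illustrative; the actual count comes from the task table above

def task_id_to_onehot(task_id: int) -> torch.Tensor:
    """Map a task ID from the table to the one-hot task embedding
    consumed by the one-hot (oracle) policies."""
    return F.one_hot(torch.tensor(task_id), num_classes=NUM_TASKS).float()

z_task = task_id_to_onehot(87)  # e.g., "Put red colored object in red bin."
```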


Architecture

Detailed Architecture


Results

Generalization to Novel Objects, Colors, and Shapes

Here, we train the demonstration encoder from scratch and use pretrained DistilBERT as the language encoder. DeL-TaCo (red) achieves better generalization performance than policies conditioned on language only (blue) or demonstrations only (orange), coming far closer to the performance of the one-hot oracle policy, which was given access to all of the test tasks during training.
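For context, a DistilBERT instruction embedding can be computed roughly as follows with Hugging Face Transformers; the mean-pooling over tokens below is an assumption for illustration, not necessarily the pooling used in the paper:

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

instruction = "Put fountain vase in green bin."
tokens = tokenizer(instruction, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**tokens).last_hidden_state  # (1, seq_len, 768)

# Mean-pool token embeddings into a single 768-d language embedding
# (an assumption; other pooling schemes are equally plausible).
z_lang = hidden.mean(dim=1)
```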

Using pretrained CLIP as both the language and demonstration encoder, conditioning on both demonstrations and language with DeL-TaCo (red) still outperforms learning novel tasks with language alone (blue) or demonstrations alone (orange).
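For reference, CLIP text and image features can be extracted along these lines with OpenAI's clip package; the per-frame averaging used to form the demonstration embedding is an assumption, and the frame paths are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Language embedding from the instruction.
text = clip.tokenize(["Put bottle in right bin."]).to(device)
with torch.no_grad():
    z_lang = model.encode_text(text)  # (1, 512)

# Demonstration embedding: naively average per-frame image features
# (an assumption; the actual demonstration encoding may differ).
frames = [preprocess(Image.open(f"frame_{i}.png")) for i in range(4)]  # placeholder paths
with torch.no_grad():
    feats = model.encode_image(torch.stack(frames).to(device))  # (4, 512)
z_demo = feats.mean(dim=0, keepdim=True)
```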



Generalization to Novel Colors and Shapes



When trained on all 32 objects and evaluated on only the test colors and shapes, DeL-TaCo outperforms language-only and demonstration-only policies by a wider margin of roughly 9%. Here, we train the demonstration encoder from scratch and use DistilBERT as the language encoder.

How many demonstrations is language worth?



To quantify the value of conditioning policies on both demonstrations and language, we finetune the demonstration-only policy on k demonstrations per test task, where k is indicated in the legend, and compare it against DeL-TaCo (red dotted line), which was not given any demonstrations of the test tasks. The demonstration-only policy only begins to match DeL-TaCo when finetuned on 50 demonstrations per test task, underscoring the value of specifying novel tasks with both language and demonstrations. In settings where demonstrations are collected in environments that do not perfectly align with the environment the robot is evaluated in, specifying new tasks with both demonstrations and language therefore requires substantially less teacher effort than specifying them with either modality alone.
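A rough sketch of this finetuning protocol, with hypothetical data structures (each demonstration is a list of (observation, demo embedding, expert action) tuples) and a simple behavior-cloning loss:

```python
import torch
import torch.nn.functional as F

def finetune_on_test_demos(policy, optimizer, test_task_demos, k, epochs=10):
    """Hypothetical sketch: behavior-clone a pretrained demonstration-conditioned
    policy on k expert demonstrations per test task."""
    for _ in range(epochs):
        for task_id, demos in test_task_demos.items():
            for demo in demos[:k]:  # only k demonstrations per test task
                for obs, z_demo, expert_action in demo:
                    pred_action = policy(obs, z_demo)
                    loss = F.mse_loss(pred_action, expert_action)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
```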