Using Both Demonstrations and Language Instructions to More Efficiently Learn Robotic Tasks
Method Overview
Motivation
Current multitask policies use task embeddings based on one-hot vectors, language embeddings, or demonstration embeddings. However, language instructions and video demonstrations can often be ambiguous, especially if they were provided in environments that do not perfectly align with the environment the robot is evaluated in.
To resolve ambiguities and decrease the teacher effort needed when specifying new tasks, we propose using bimodal task embedding vectors that contain both language embedding features and visual demonstration features. Directly conditioning a multitask policy on bimodal task embeddings allows the two modalities to contextually complement each other, enabling the robot to more clearly understand what a new task is and how to perform it.
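As a rough illustration of what conditioning on a bimodal task embedding can look like, the sketch below concatenates a language embedding with a demonstration embedding and feeds the result, together with an observation, to a toy linear policy. This is only a minimal sketch under stated assumptions: concatenation as the fusion step, the linear policy, and every name here (`bimodal_task_embedding`, `LinearPolicy`, the vector sizes) are hypothetical and are not taken from the DeL-TaCo implementation.

```python
import random
from typing import List

Vector = List[float]


def bimodal_task_embedding(lang_emb: Vector, demo_emb: Vector) -> Vector:
    """Join language and demonstration features into one task vector.

    Assumption: plain concatenation. The real model may fuse them differently.
    """
    return list(lang_emb) + list(demo_emb)


class LinearPolicy:
    """Toy stand-in for a multitask policy: action = W @ [obs; task_emb]."""

    def __init__(self, obs_dim: int, task_dim: int, action_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim = obs_dim + task_dim
        # Small random weights; a real policy would be a trained network.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
            for _ in range(action_dim)
        ]

    def act(self, obs: Vector, task_emb: Vector) -> Vector:
        x = list(obs) + list(task_emb)
        if len(x) != self.in_dim:
            raise ValueError(f"expected input of size {self.in_dim}, got {len(x)}")
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


# Hypothetical encoder outputs for one task (e.g. "Put bottle in right bin.").
lang_emb = [0.2, -0.1, 0.7]   # would come from a language encoder
demo_emb = [0.5, 0.3]         # would come from a demonstration encoder

task_emb = bimodal_task_embedding(lang_emb, demo_emb)
policy = LinearPolicy(obs_dim=4, task_dim=len(task_emb), action_dim=2)
action = policy.act([0.0, 1.0, 0.5, -0.5], task_emb)
```

Because the policy sees both feature sets in a single vector, either modality can fill in what the other leaves ambiguous. A language-only or demonstration-only baseline corresponds to passing just one of the two embeddings.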
Contributions
We present DeL-TaCo (Joint Demo-language Task Conditioning), a framework for conditioning a multitask policy simultaneously on both a demonstration and a corresponding language instruction. We introduce a challenging distribution of hundreds of robotic pick-and-place tasks and show that DeL-TaCo improves generalization and significantly decreases the number of expert demonstrations needed when learning novel tasks at test time.
To our knowledge, this is the first work to show that simultaneously conditioning a multitask robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.
Sample Tasks (with associated task language strings)
Each GIF depicts four rollouts. Container positions and distractor objects differ on each reset.
Selected Train Tasks
Task ID 1: "Put fountain vase in green bin."
Task ID 87: "Put red colored object in red bin."
Task ID 141: "Put chalice shaped object in front bin."
Task ID 270: "Put bottle in right bin."
Selected Test Tasks
Task ID 32: "Put black and white colored object in green bin."
Task ID 97: "Put trapezoidal prism shaped object in red bin."
Task ID 178: "Put box sofa in back bin."
Task ID 248: "Put cylinder shaped object in left bin."
Train/Test Splits
Color Split
Object Split
Shape Split
All Train and Test Tasks
Rows = object identifiers. Columns = container identifiers. Numbers indicate task IDs, which are only used by the one-hot policies.
Architecture
Detailed Architecture
Results
Generalization to Novel Objects, Colors, and Shapes
Here, we train the demonstration encoder from scratch and use pretrained DistilBERT as the language encoder. DeL-TaCo (red) generalizes better than policies conditioned on language only (blue) or demonstrations only (orange), coming far closer to the performance of the one-hot oracle policy, which was given access to all of the test tasks during training.
Using pretrained CLIP as both the language and demonstration encoder, we still see more value in conditioning on both demonstrations and language with DeL-TaCo (red) than in learning novel tasks with language alone (blue) or demonstrations alone (orange).
Generalization to Novel Colors and Shapes
When trained on all 32 objects and evaluated on only the test colors and shapes, DeL-TaCo outperforms language-only and demonstration-only policies by a wider ~9% margin. Here, we train the demonstration encoder from scratch and use DistilBERT as the language encoder.
How many demonstrations is language worth?
To evaluate the value of training policies conditioned on both demonstrations and language, we finetune the demonstration-only policy on k demonstrations per test task, where k is indicated in the legend, and compare it to the performance of DeL-TaCo (red dotted line) which was not given access to any demonstration on any test task. We find that the demonstration-only policy only starts to match the performance of DeL-TaCo when trained on 50 demonstrations per test task, showing the tremendous value of learning novel tasks with both language and demonstrations. This demonstrates that in situations where demonstrations are collected in environments that do not perfectly align with the environment the robot is evaluated in, providing both demonstrations and language to specify new tasks requires substantially less teacher effort than specifying tasks with either modality alone.
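The comparison above amounts to finding the smallest k at which the finetuned demonstration-only policy reaches a reference success rate. A minimal sketch of that bookkeeping follows. The helper name `break_even_demos` is hypothetical, and the numbers in the usage lines are made-up placeholders chosen only to exercise the function; they are not results from this work.

```python
from typing import Dict, Optional


def break_even_demos(
    finetuned_success: Dict[int, float], reference_success: float
) -> Optional[int]:
    """Return the smallest k whose finetuned success rate reaches the reference.

    finetuned_success maps k (demonstrations per test task) to a success rate.
    Returns None if no evaluated k reaches the reference.
    """
    for k in sorted(finetuned_success):
        if finetuned_success[k] >= reference_success:
            return k
    return None


# Made-up placeholder values, NOT results reported on this page.
toy_curve = {1: 0.10, 5: 0.20, 10: 0.30, 25: 0.40}
print(break_even_demos(toy_curve, reference_success=0.30))  # prints 10
print(break_even_demos(toy_curve, reference_success=0.90))  # prints None
```

The returned k can be read as a rough answer to "how many demonstrations is language worth" for a given pair of policies, under the assumption that success rates are measured on the same test tasks.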