Videos – Language-Conditioned Imitation Learning
Real-Robot Experiments – Qualitative Rollout Videos (including Failure Modes)
Overview
We provide a few qualitative videos depicting rollouts of the learned language-conditioned imitation learning policies, adapted on top of various pretrained representations. As in the main text, we evaluate the following models:
R-MVP – our reproduction of Masked Visual Pretraining trained on Something-Something v2.
R-R3M – our reproduction of Reusable Representations for Robotic Manipulation trained on Something-Something v2.
Note that we evaluate both the ResNet-50 and ViT-Small variants of R-R3M.
V-Cond – Voltron with only language-conditioning (single-frame).
V-Dual – Voltron with dual-frame (initial + current frame) conditioning.
V-Gen – Voltron with dual-frame conditioning and α = 0.5 (language generation).
All models are trained on the same demonstration data in the following environment:
Real-World Language-Conditioned Imitation Learning Environment
We study five unique behaviors, and how performance changes when adding various distractors.
Videos
We show videos for the following five evaluation settings (language instructions from the marked split; three without distractors and two with distractors) across each representation learning approach:
(In-Distribution / No Distractors) "close the drawer all the way"
(In-Distribution / No Distractors) "move the mug to the purple plate"
(In-Distribution / No Distractors) "throw away the used coffee pods"
(Distractor: Purple -> Green Textbook) "close the drawer all the way"
(Distractor: Play Voltron: the Animated Series in background) "throw the chips in the garbage"
We note that all evaluation language instructions are "unseen" (held-out), and the environment is reset between episodes (some objects are shuffled around, following the data collection procedure). The videos are shot on a mobile phone placed slightly below the robot camera (mounted on a tripod), so some parts of the scene are occluded.
R-MVP
"close the drawer all the way" – ✅
"move the mug to the purple plate" – ❌
"throw away the used coffee pods" – ❌
(Distractor: Textbook) "close the drawer all the way" – ✅
(Distractor: Voltron Video) "throw the chips in the garbage" – ❌
R-R3M (ResNet-50)
"close the drawer all the way" – 🤔
(partial credit – makes it to drawer)
"move the mug to the purple plate" – ❌
"throw away the used coffee pods" – ❌
(Distractor: Textbook) "close the drawer all the way" – ❌
(completely collapses)
(Distractor: Voltron Video) "throw the chips in the garbage" – ❌
(completely collapses)
R-R3M (ViT-Small)
"close the drawer all the way" – ✅
"move the mug to the purple plate" – ❌
"throw away the used coffee pods" – ❌
(Distractor: Textbook) "close the drawer all the way" – ❌
(completely collapses)
(Distractor: Voltron Video) "throw the chips in the garbage" – ❌
(completely collapses)
V-Cond
"close the drawer all the way" – ✅
"move the mug to the purple plate" – 🤔
(partial credit – attempts to grasp mug)
"throw away the used coffee pods" – 🤔
(partial credit – grasps coffee pods, fails to drop them)
(Distractor: Textbook) "close the drawer all the way" – ✅
(Distractor: Voltron Video) "throw the chips in the garbage" – 🤔
(partial credit – grasps chips, fails to drop in trash)
V-Dual
"close the drawer all the way" – ✅
"move the mug to the purple plate" – 🤔
(partial credit – grasps mug, misses drop)
"throw away the used coffee pods" – ❌
(Distractor: Textbook) "close the drawer all the way" – ✅
(Distractor: Voltron Video) "throw the chips in the garbage" – 🤔
(partial credit – grasps chips, but misses drop location completely)
V-Gen
"close the drawer all the way" – ✅
"move the mug to the purple plate" – 🤔
(partial credit – slightly misses grasp)
"throw away the used coffee pods" – ❌
(Distractor: Textbook) "close the drawer all the way" – ✅
(Distractor: Voltron Video) "throw the chips in the garbage" – ✅
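As a rough way to compare the rollouts listed above, the outcomes can be tallied per model. The sketch below is illustrative only: it counts just the five videos shown on this page (one rollout per setting), the names `ROLLOUTS`, `tally`, and `score` are hypothetical, and the 0.5 weight for partial credit is an assumption rather than the scoring used in the main text.

```python
from collections import Counter

# Outcomes transcribed from the per-model lists on this page, in the order the
# five settings are listed: drawer, mug, coffee pods, drawer with textbook
# distractor, chips with video distractor.
# "S" = success (✅), "P" = partial credit (🤔), "F" = failure (❌).
ROLLOUTS = {
    "R-MVP": "SFFSF",
    "R-R3M (ResNet-50)": "PFFFF",
    "R-R3M (ViT-Small)": "SFFFF",
    "V-Cond": "SPPSP",
    "V-Dual": "SPFSP",
    "V-Gen": "SPFSS",
}

# Assumption: a partial-credit rollout counts as half a success.
# This weight is illustrative and not taken from the source.
PARTIAL_WEIGHT = 0.5


def tally(outcomes: str) -> tuple[int, int, int]:
    """Return (successes, partials, failures) for one model's rollouts."""
    counts = Counter(outcomes)
    return counts["S"], counts["P"], counts["F"]


def score(outcomes: str, partial_weight: float = PARTIAL_WEIGHT) -> float:
    """Weighted score: full credit per success, partial_weight per partial."""
    successes, partials, _ = tally(outcomes)
    return successes + partial_weight * partials


if __name__ == "__main__":
    for model, outcomes in ROLLOUTS.items():
        s, p, f = tally(outcomes)
        print(f"{model:<20} success={s} partial={p} fail={f} "
              f"score={score(outcomes):.1f}")
```

Because each setting appears only once per model here, these counts summarize the videos on this page and should not be read as success rates.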