DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
Code for the retrieval and alignment components can be downloaded here.
In a nutshell, what's the strength of the method?
By utilising both the pixel-level and image-level understanding abilities of Vision Foundation Models, DINOBot can learn a task from a single demo and generalise to many different objects. As a result, only a handful of total demos are needed to obtain a general and versatile repertoire of manipulation abilities.
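To make the image-level half of this concrete, below is a minimal sketch of a retrieval step, assuming a pre-trained DINO ViT-S/16 backbone loaded via torch.hub; the helper names (embed, select_demo) and the pre-processing assumptions are our own illustration, not DINOBot's released code.

```python
# Minimal retrieval sketch: rank stored demos by image-level DINO similarity.
# Assumes images are pre-processed to 1x3x224x224 and ImageNet-normalised.
import torch
import torch.nn.functional as F

# Pre-trained DINO ViT-S/16; its forward pass returns the image-level CLS embedding.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

@torch.no_grad()
def embed(image: torch.Tensor) -> torch.Tensor:
    """L2-normalised image-level embedding for a 1x3xHxW tensor."""
    return F.normalize(model(image), dim=-1)

@torch.no_grad()
def select_demo(live_image: torch.Tensor, demo_images: list) -> int:
    """Return the index of the stored demo whose object best matches the live view."""
    live = embed(live_image)                                   # 1 x D
    bank = torch.cat([embed(d) for d in demo_images], dim=0)   # N x D
    scores = bank @ live.T                                     # cosine similarities
    return int(scores.argmax())
```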
Explainer Video: How does DINOBot work?
In this short explainer video we describe how we provide demonstrations to the robot, what data is recorded, and how DINOBot acts at test time via the retrieval, alignment and replay phases. Videos sped up 4x.
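For readers who prefer code to video, the loop below is a structural sketch of one test-time episode; the camera, alignment and robot-control helpers are assumed interfaces standing in for the components described in the video, not the actual implementation.

```python
# Structural sketch of a DINOBot episode: retrieve, then align, then replay.
# `observe`, `retrieve`, `alignment_step`, `move_relative` and `apply_velocity`
# are hypothetical placeholders for the camera, retrieval, correspondence-based
# servoing, and robot controller.
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np

@dataclass
class Demo:
    bottleneck_image: np.ndarray      # wrist-camera image saved at the start of the demo
    trajectory: Sequence[np.ndarray]  # end-effector velocities recorded during the demo

def run_episode(demo_bank: Sequence[Demo],
                observe: Callable[[], np.ndarray],
                retrieve: Callable[[np.ndarray, Sequence[Demo]], Demo],
                alignment_step: Callable[[np.ndarray, np.ndarray], np.ndarray],
                move_relative: Callable[[np.ndarray], None],
                apply_velocity: Callable[[np.ndarray], None],
                tol: float = 1e-3) -> None:
    # 1) Retrieval: pick the demo whose object best matches the live view.
    demo = retrieve(observe(), demo_bank)
    # 2) Alignment: servo the end-effector until the live wrist image
    #    matches the demo's bottleneck image.
    while True:
        delta = alignment_step(observe(), demo.bottleneck_image)
        if np.linalg.norm(delta) < tol:
            break
        move_relative(delta)
    # 3) Replay: execute the recorded demo trajectory verbatim.
    for velocity in demo.trajectory:
        apply_velocity(velocity)
```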
Efficiency and Transferability: One-Shot Imitation Learning Ability
DINOBot can learn novel tasks from a single demonstration, generalising to unseen objects with different sizes and appearances. In these uninterrupted videos, we show how the user quickly provides one demo on an object, with DINOBot immediately generalising to novel objects with different shapes and appearances. We include both successes and failures, as there are no cuts.
One-Shot Generalisation to Different Bottles
Kettles
Cups
Pans
Generality: Few-Shot Generalisation to Out-of-Distribution Objects
Thanks to the retrieval, alignment and replay framework, DINOBot can efficiently generalise behaviour to unseen, out-of-distribution objects. With as few as 6 total demos (one per object), DINOBot can grasp a wide range of very different, unseen objects. This demonstrates unprecedented scalability and learning efficiency.
Dexterity and Versatility: Some Examples of New Task Learning
Here we show uninterrupted videos of DINOBot quickly learning new tasks, including dexterous tasks that involve tools and contact-rich tasks, with immediate deployment to novel object poses and to novel objects. To demonstrate DINOBot's generality, we use all the tasks from these recently published papers:
1) Relational-NDF (CoRL 2022): instead of bottle-in-container, we perform multiple, more precise insertion tasks.
2) FISH (RSS 2023): the non-multi-fingered tasks; for door opening, we open a microwave door, and for key insertion, we perform several precise insertion tasks with a <5 mm error tolerance.
3) VINN (RSS 2022): instead of pushing directly, we sweep, i.e. push with a tool.
4) Relay Policy Learning (CoRL 2019): all the kitchen tasks, plus grasping kettles.
Here are a few example tasks. We include both successes and failures, as there are no cuts.
Flipping
Hanging
Opening and loading a dishwasher
Sweeping
Assembling blocks
Stacking
Opening a microwave
Turning a knob
Beyond Tabletop: One-Shot Learning of Dexterous, 6-DoF Skills
Here we demonstrate DINOBot's one-shot learning abilities, using the alignment and replay phases, in a complex 6-DoF kitchen environment. The video shows the step-by-step keypoint extraction and matching, demonstrating that DINOBot also works seamlessly in more challenging 6-DoF environments, quickly learning dexterous everyday tasks like opening a microwave. Videos sped up 4x.
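As a rough illustration of what the alignment phase computes from those matched keypoints, the snippet below solves for the rigid transform between two matched 3D point sets (correspondences lifted to 3D with the depth camera) using the standard SVD-based least-squares solution; this is a generic sketch of that classical step, not the paper's code.

```python
# Least-squares rigid registration (Kabsch/Umeyama) between matched keypoints.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Rotation R and translation t minimising ||dst - (src @ R.T + t)||.

    src, dst: (N, 3) arrays of matched 3D keypoints, N >= 3.
    The result is the relative 6-DoF motion needed to align the live
    view with the demo's bottleneck view.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```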
Multi-Object Manipulation Using Off-the-Shelf Object Detection
Here we show how, coupled with an off-the-shelf object segmentation model, DINOBot can tidy up a table by performing general grasping on many everyday objects. All the grasping behaviour is decided by our framework, while the in-box placement movement is scripted. Please refer to the Applications section of our paper for more details. Videos sped up 4x.
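Illustratively, the outer loop could look like the sketch below, where `segment`, `run_dinobot_on_crop` and `place_in_box` are hypothetical stand-ins for the off-the-shelf segmentation model, the retrieval-alignment-replay pipeline sketched earlier, and the scripted placement; none of these names come from the paper.

```python
# Hypothetical table-tidying loop: segment objects, grasp each with DINOBot,
# then run the scripted in-box placement.
def tidy_table(camera, robot, segment, run_dinobot_on_crop, place_in_box):
    masks = segment(camera.rgb())                  # one binary mask per detected object
    for mask in masks:
        crop = camera.rgb() * mask[..., None]      # isolate a single object
        run_dinobot_on_crop(robot, crop)           # retrieval + alignment + replay grasp
        place_in_box(robot)                        # scripted placement, as in the paper
```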
Robustness: Invariance to Distractors, Lighting, Background and Novel Objects
Here we demonstrate the robustness of DINOBot's one-shot imitation learning to distractors, different lighting conditions and different backgrounds. We provide a single demonstration in the GIF on the left, and deploy DINOBot immediately (using the alignment and replay phases) not only on unseen objects (e.g. different beverages), but also with added distractors and changed lighting and background (right).
Demonstration
Novel object + distractors
Novel object + different background + different lighting + distractors
Scalability: Additional Examples of Tasks on Unseen Test Objects
A few examples of test-time trials. All objects are novel and unseen: DINOBot efficiently learns to solve 15 tasks on 49 objects with just a few minutes of total demonstrations. After receiving a task specification (e.g. grasp or insert), the robot autonomously determines how to interact with the unseen objects through the retrieval and alignment phases. Videos sped up 4x.