DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

Code for the retrieval and alignment parts is available for download here.

In a nutshell, what's the strength of the method?

By utilising both the pixel-level and image-level understanding abilities of Vision Foundation Models, DINOBot can learn a task from a single demo and generalise it to many different objects. As a result, only a handful of demos in total are needed to obtain a general and versatile repertoire of manipulation abilities.
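To make "both levels" concrete, here is a minimal sketch assuming the publicly released DINO ViT-S/8 from torch.hub (the exact backbone and preprocessing we use may differ): the CLS token gives a single image-level embedding used for retrieval, while the patch tokens give pixel-level features used for correspondence during alignment.

```python
import torch
import torch.nn.functional as F

# Public DINO ViT-S/8 checkpoint from the DINO repo; an illustrative choice,
# not necessarily the backbone used in the paper. H and W must be multiples
# of the patch size (8).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
model.eval()

@torch.no_grad()
def image_level_embedding(img):
    """img: (1, 3, H, W) ImageNet-normalised tensor -> (1, D) CLS embedding.

    A single vector summarising the whole image, used to retrieve the most
    visually similar demonstration.
    """
    return F.normalize(model(img), dim=-1)

@torch.no_grad()
def pixel_level_features(img):
    """img: (1, 3, H, W) -> (1, N, D) per-patch features.

    One feature per image patch, used to find pixel-wise correspondences
    between the live view and the demonstration view during alignment.
    """
    tokens = model.get_intermediate_layers(img, n=1)[0]  # (1, 1 + N, D)
    return F.normalize(tokens[:, 1:], dim=-1)            # drop the CLS token
```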

Explainer Video: How does DINOBot work?

In this short explainer video we describe how we provide demonstrations to the robot, what data is recorded, and how DINOBot acts at test time via the retrieval, alignment and replay phases. Videos sped up 4x.

DINOBot Short Explainer.mp4
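As a rough sketch of the test-time loop the video describes (all function and field names below are illustrative assumptions, not our released API): retrieve the most similar demo by image-level similarity, servo the end-effector until the live wrist-camera view matches the demo's starting view, then replay the recorded trajectory.

```python
import numpy as np

def run_dinobot(demos, get_live_image, embed_image, compute_alignment, move_eef,
                aligned_tol=0.005):
    # 1) RETRIEVAL: pick the demo whose object looks most like the live one,
    #    by cosine similarity of image-level (CLS) embeddings.
    query = embed_image(get_live_image())
    demo = max(demos, key=lambda d: float(query @ d["embedding"]))

    # 2) ALIGNMENT: servo the end-effector until the wrist-camera view matches
    #    the demo's recorded starting view. compute_alignment is assumed to
    #    turn pixel-level correspondences into a small relative SE(3) motion.
    while True:
        delta = compute_alignment(get_live_image(), demo["bottleneck_image"])
        if np.linalg.norm(delta[:3]) < aligned_tol:  # translation residual (m)
            break
        move_eef(delta)

    # 3) REPLAY: execute the end-effector trajectory recorded in the demo,
    #    now expressed relative to the aligned pose.
    for action in demo["trajectory"]:
        move_eef(action)
```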

Efficiency and Transferability: One-Shot Imitation Learning Ability

DINOBot can learn novel tasks with a single demonstration, generalising to unseen objects of different sizes and aspects. In these uninterrupted videos, the user quickly provides one demo on an object, and DINOBot immediately generalises to novel objects with different shapes and aspects. We include both successes and failures, as there are no cuts.

One-shot generalisation to different bottles, kettles, cups and pans:

bottles_generalisation_hd.mp4

Bottles

kettles_generalisation_hd.mp4

Kettles

cups_generalisation_hd.mp4

Cups

pans_generalisation.mp4

Pans

Generality: Few-Shot Generalisation to Out-of-Distribution Objects

Thanks to the retrieval, alignment and replay framework, DINOBot can efficiently generalise behaviour to unseen, out-of-distribution objects. With as few as 6 demos in total, one per object, DINOBot can grasp a wide range of very different, unseen objects. This demonstrates unprecedented scalability and learning efficiency.

OOD generalisation.mp4

Dexterity and Versatility: Some Examples of New Task Learning

In these uninterrupted videos we show DINOBot quickly learning new tasks, including dexterous tasks that involve tools and contact-rich tasks, with immediate deployment to novel poses of the objects and to novel objects. To demonstrate DINOBot's generality, we use all the tasks from these recently published papers:
1) Relational-NDF (CoRL 2022): instead of bottle-in-container, we perform multiple, more precise insertion tasks.
2) FISH (RSS 2023): the non-multi-fingered tasks; for door opening we open a microwave door, and for key insertion we perform several precise insertion tasks with a <5 mm error tolerance.
3) VINN (RSS 2022): instead of pushing directly, we swipe, i.e. push with a tool.
4) Relay Policy Learning (CoRL 2019): all the kitchen tasks, plus grasping kettles.
Here are a few examples of these tasks. We include both successes and failures, as there are no cuts.

new_task_flip.mp4

Flipping

new_task_hang_cup.mp4

Hanging

multistage_dish_2.mp4

Opening and loading dishwasher

new_task_swipe.mp4

Sweeping

new_task_blocks.mp4

Assembling blocks

new_task_stack.mp4

Stacking

open_microwave_6dof.mp4

Open Microwave

turn_knob_6dof.mp4

Turn Knob

Beyond Tabletop: One-Shot Learning of Dexterous, 6-DoF Skills

Here we demonstrate DINOBot's one-shot learning abilities, using the alignment and replay phases, in a complex 6-DoF kitchen environment. The video shows step-by-step keypoint extraction and matching, confirming that DINOBot also works seamlessly in more challenging 6-DoF environments, quickly learning dexterous everyday tasks like opening a microwave. Videos sped up 4x.

Kitchen_Open_Microwave.mp4
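One standard way to turn matched keypoints into an alignment motion, shown below as a hedged sketch (the exact solver in the paper may differ): lift the DINO correspondences to 3D using the depth image, then solve the least-squares rigid transform between the demo and live point sets via the classic Kabsch/SVD method.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Least-squares rigid transform with dst ~ R @ src + t.

    src, dst: (N, 3) arrays of matched 3D keypoints, e.g. DINO
    correspondences lifted to 3D with the depth image.
    Kabsch/SVD solution to the orthogonal Procrustes problem.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t                                # relative motion to servo along
```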

Multi-Object Manipulation using Off-the-Shelf Object Detection

Here we show how, coupled with an off-the-shelf object segmentation model, DINOBot can tidy up a table by performing general grasping on many everyday objects. All the grasping behaviour is decided by our framework, while the in-box placement movement is scripted. Please refer to the Applications section of our paper for more details. Videos sped up 4x.

Tidying_Table.mp4
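A rough outline of how such a pipeline can be composed (the callables below are hypothetical placeholders, not our released code): an off-the-shelf segmenter proposes object masks, DINOBot decides each grasp, and a scripted motion places the object in the box.

```python
def tidy_table(get_image, segment_objects, dinobot_grasp, place_in_box):
    """Loop until the segmentation model finds no more objects on the table."""
    while True:
        masks = segment_objects(get_image())  # any off-the-shelf segmenter
        if not masks:
            break                             # table is clear
        dinobot_grasp(masks[0])               # grasp decided by DINOBot
        place_in_box()                        # scripted placement, per the paper
```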

Robustness: Invariance to Distractors, Lighting, Background and Novel Objects

Here we demonstrate the robustness of DINOBot's one-shot imitation learning to distractors, different lighting conditions and different backgrounds. We provide a single demonstration in the GIF on the left, and immediately deploy DINOBot (using the alignment and replay phases) not only on unseen objects (e.g. different beverages), but also with added distractors and changed lighting and background (right).

Demonstration

Novel object + distractors

Novel object + different background + different light + distractors

Scalability: Additional Examples of Tasks on Unseen Test Objects

A few examples of test-time trials. All objects are novel and unseen: DINOBot efficiently learns to solve 15 tasks on 49 objects with just a few minutes of total demonstration time. After receiving a task specification (e.g. grasp or insert), the robot autonomously works out how to interact with the unseen objects through the retrieval and alignment phases. Videos sped up 4x.