Yixin Lin†, Jan Humplik†, Sandy H. Huang†, Leonard Hasenclever*, Francesco Romano*, Stefano Saliceti*,
Daniel Zheng*, Jose Enrique Chen*, Catarina Barros*, Adrian Collister*, Matt Young*, Adil Dostmohamed*, Ben Moran*,
Ken Caluwaerts, Marissa Giustina, Joss Moore, Kieran Connell, Francesco Nori‡, Nicolas Heess‡, Steven Bohez‡, and Arunkumar Byravan‡
Google DeepMind
†co-first author, *core contributor, ‡co-last author
In robot learning, it is common either to ignore environment semantics, focusing on tasks like whole-body control that require reasoning only about robot-environment contacts, or, conversely, to ignore contact dynamics, focusing on grounding high-level movement in vision and language. In this work, we show that advances in generative modeling, photorealistic rendering, and procedural generation allow us to tackle tasks requiring both. By generating contact-rich trajectories with accurate physics in semantically diverse simulations, we can distill behaviors into large multimodal models that directly transfer to the real world: a system we call Proc4Gem. Specifically, we show that a foundation model, Gemini, fine-tuned only on simulation data, can be instructed in language to control a quadruped robot to push an object with its body to unseen targets in unseen real-world environments. Our real-world results demonstrate the promise of using simulation to imbue foundation models with physical agency.
After fine-tuning Gemini only on simulated trajectories, we test its performance across a suite of real-world evaluations in a living room with objects never seen during training.
Out-of-Distribution Evaluations
The fine-tuned Gemini policy generalizes to unseen target categories, including "giraffe" and "person". It is also robust to novel dynamics, succeeding even when the trolley's weight is substantially increased.
unseen target category: "giraffe"
unseen target category: "person"
target "orange chair" with novel dynamics:
trolley with 10 kg (22 lb) of weights
Quantitative Evaluations
Compared to a baseline policy (modeled after SPOC), the fine-tuned Gemini policy succeeds more often from "hard" starting configurations. In the "hard" setting, the target object is not initially in the robot's view, so the robot must first search for it, as seen in the videos below.
Red Sofa (hard)
fine-tuned Gemini (success)
baseline (failure): pushes to wrong target
baseline (success)
Orange Armchair (hard)
fine-tuned Gemini (success)
baseline (failure): pushes out of bounds
baseline (success)
White Bin (hard)
fine-tuned Gemini (success)
baseline (failure): pushes into obstacle
baseline (success)