Yixin Lin†, Jan Humplik†, Sandy H. Huang†, Leonard Hasenclever*, Francesco Romano*, Stefano Saliceti*,
Daniel Zheng*, Jose Enrique Chen*, Catarina Barros*, Adrian Collister*, Matt Young*, Adil Dostmohamed*, Ben Moran*,
Ken Caluwaerts, Marissa Giustina, Joss Moore, Kieran Connell, Francesco Nori‡, Nicolas Heess‡, Steven Bohez‡, and Arunkumar Byravan‡
Google DeepMind
†co-first author, *core contributor, ‡co-last author
In robot learning, it is common either to ignore environment semantics, focusing on tasks like whole-body control that require reasoning only about robot-environment contacts, or, conversely, to ignore contact dynamics, focusing on grounding high-level movement in vision and language. In this work, we show that advances in generative modeling, photorealistic rendering, and procedural generation allow us to tackle tasks requiring both. By generating contact-rich trajectories with accurate physics in semantically diverse simulations, we can distill behaviors into large multimodal models that directly transfer to the real world: a system we call Proc4Gem. Specifically, we show that a foundation model, Gemini, fine-tuned only on simulation data, can be instructed in language to control a quadruped robot to push an object with its body to unseen targets in unseen real-world environments. Our real-world results demonstrate the promise of using simulation to imbue foundation models with physical agency.
After fine-tuning Gemini only on simulated trajectories, we test its performance across a suite of real-world evaluations in a living room with objects never seen during training.
Out-of-Distribution Evaluations
The fine-tuned Gemini policy generalizes to unseen target categories, including "giraffe" and "person". It is also robust to novel dynamics, succeeding even when the trolley's weight is substantially increased.
unseen target category: "giraffe"
unseen target category: "person"
target "orange chair" with novel dynamics:
trolley with 10 kg (22 lb) of weights
Quantitative Evaluations
Compared to a baseline policy (modeled after SPOC), the fine-tuned Gemini policy succeeds more often from "hard" starting configurations. In the "hard" setting, the target object is not initially in the robot's view, so the robot must first search for it, as seen in the videos below.
Red Sofa (hard)
fine-tuned Gemini (success)
baseline (failure): pushes to wrong target
baseline (success)
Orange Armchair (hard)
fine-tuned Gemini (success)
baseline (failure): pushes out of bounds
baseline (success)
White Bin (hard)
fine-tuned Gemini (success)
baseline (failure): pushes into obstacle
baseline (success)