Language-Driven Representation Learning for Robotics

Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, Percy Liang


Balancing Visual Reconstruction & Language Generation to Shape Learned Representations
Pretraining Code & Model Artifacts | Evaluation Harness Code

Abstract

Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. A product of recipes such as masked autoencoding and contrastive learning, these representations demonstrate strong transfer when applied to policy learning for visuomotor control. 

Yet robot learning encompasses a diverse set of problems, including grasp affordance prediction, language-conditioned imitation learning, and intent tracking for human-robot collaboration, among others. To motivate this work, we show that evaluating existing representations against this spectrum produces diverging results: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches reflect the opposite. No existing approach offers flexibility in setting the balance of features captured in its learned representations.

To remedy this, we introduce Voltron, a framework for language-driven representation learning from human videos and associated captions. Voltron is characterized by its tunability, explicitly trading off language-conditioned visual reconstruction to learn low-level visual patterns, and visually-grounded language generation to encode high-level semantics. 
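
As a rough illustration of what a tunable trade-off between the two objectives could look like, the sketch below blends a reconstruction loss and a language generation loss with a single weight. This is an assumption made for illustration only, not necessarily the objective Voltron uses: the function name `combined_loss` and the parameter `alpha` are hypothetical, and the loss values are placeholder numbers rather than results.

```python
def combined_loss(reconstruction_loss: float,
                  generation_loss: float,
                  alpha: float) -> float:
    """Blend two scalar losses with a hypothetical trade-off weight.

    alpha = 1.0 uses only the visual reconstruction term (low-level features);
    alpha = 0.0 uses only the language generation term (high-level semantics).
    This weighting scheme is an illustrative assumption, not the paper's method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * reconstruction_loss + (1.0 - alpha) * generation_loss


if __name__ == "__main__":
    # Placeholder loss values, chosen only to show the effect of alpha.
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, combined_loss(2.0, 4.0, alpha))
```

Sweeping `alpha` from 0 to 1 moves the training signal from purely semantic (generation) toward purely spatial (reconstruction), which is the kind of knob the paragraph above describes.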

We evaluate Voltron models on a new evaluation suite that we construct, spanning five distinct robot learning problems and serving as a unified platform for holistically evaluating visual representations for robotics. Through comprehensive experiments across all five areas, we show that Voltron's language-driven representations strictly outperform the prior art, while the tunability of our framework affords even stronger performance on targeted problems requiring higher-level features.

Voltron – Evaluation Suite

Voltron – Framework