R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta
Meta AI | Stanford University
We study whether visual representations pre-trained on diverse human videos can enable efficient robotic manipulation. We pre-train a single representation, R3M, using an objective that combines time-contrastive learning, video-language alignment, and a sparsity penalty.
Overview
Results
Given just 20 demonstrations (<10 minutes of human supervision), we use R3M to learn manipulation tasks in the real world.
We also demonstrate that the pre-trained R3M representation enables data-efficient imitation learning in comprehensive simulated evaluations across three different benchmarks.
Try it yourself
Try out the pre-trained models at https://github.com/facebookresearch/r3m. Loading and using R3M takes only a few lines of code.
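As a minimal sketch of the intended usage (based on the repository's README; the exact function names, accepted backbones, and expected input scaling are assumptions that may differ from the released code):

```python
# Sketch: use a pre-trained R3M encoder as a frozen visual representation.
# Assumes the `r3m` package from https://github.com/facebookresearch/r3m
# is installed and exposes a `load_r3m` helper.
import torch
from r3m import load_r3m

r3m = load_r3m("resnet50")  # assumed to also accept "resnet18", "resnet34"
r3m.eval()                  # freeze for downstream imitation learning

# Assumed input: a batch of 224x224 RGB images with pixel values in [0, 255].
image = torch.randint(0, 255, (1, 3, 224, 224)).float()
with torch.no_grad():
    embedding = r3m(image)  # visual features for a downstream policy
```

The resulting embedding can then be fed, together with proprioceptive state, into a small policy network trained from the handful of demonstrations described above.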