Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, {Vikash Kumar*, Amy Zhang*}
FAIR, Meta AI | University of Pennsylvania
VIP trains an (implicit) goal-conditioned value function on large-scale, in-the-wild, unlabeled human videos to learn an effective visual representation that can perform zero-shot reward specification for unseen downstream robot tasks.
Given the VIP embedding and a goal image, the reward is simply the embedding distance to the goal: the reward for a transition is the decrease in this distance from one frame to the next.
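As a concrete illustration, here is a minimal sketch of goal-image reward labeling with a frozen VIP encoder; the function and variable names are illustrative, and the preprocessing of frames and goal is assumed to match whatever the encoder expects.

import torch

@torch.no_grad()
def vip_rewards(vip, frames, goal):
    # frames: (T, 3, H, W) trajectory of observations; goal: a single (3, H, W) goal image.
    emb = vip(frames)                  # (T, D) trajectory embeddings
    goal_emb = vip(goal.unsqueeze(0))  # (1, D) goal embedding
    # Value = negative L2 distance to the goal in embedding space.
    values = -torch.norm(emb - goal_emb, dim=-1)  # (T,)
    # Reward for transition t -> t+1 is the increase in value (decrease in goal distance).
    return values[1:] - values[:-1]                # (T-1,)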
Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question.
We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
The pre-trained VIP representation encodes a value function in its latent space and can be used to generate dense rewards for unseen robot tasks without any in-domain fine-tuning.
Try out VIP reward generation here!
Offline RL made simple: with just 20 trajectories, VIP's frozen reward and representation make offline RL simple and more effective than BC, with almost no added complexity.
CloseDrawer (articulated object)
PushBottle (transparent object)
PickPlaceMelon (soft object)
FoldTowel (deformable object)
VIP is completely self-supervised, requiring only an observation-only dataset (e.g., human videos) for pre-training:
VIP's core training loop can be implemented in as few as 10 lines of PyTorch code:
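Below is a minimal sketch of that loop (not the official implementation), assuming each batch contains initial frames o_0, goal frames o_g, and adjacent intermediate frames (o_t, o_t1) sampled from the same videos; phi is the encoder being trained, and all names are illustrative.

import torch

def vip_loss(phi, o_0, o_g, o_t, o_t1, gamma=0.98):
    # Implicit value: negative L2 distance between observation and goal embeddings.
    def V(o, g):
        return -torch.norm(phi(o) - phi(g), dim=-1)

    v_0, v_t, v_t1 = V(o_0, o_g), V(o_t, o_g), V(o_t1, o_g)
    r = -1.0  # constant -1 reward for every non-goal-reaching step

    # Dual goal-conditioned value objective: pull initial frames toward their goals,
    # and penalize one-step TD "advantages" on intermediate transitions via a log-E-exp term.
    attract = (1 - gamma) * (-v_0.mean())
    repel = torch.log(torch.exp(-(r + gamma * v_t1 - v_t)).mean())
    return attract + repel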
There are two ways to access VIP: (1) the official code release, and (2) TorchRL.
Use Cases: Visualize and play around with the released VIP model, or train new VIP models.
Example:
from vip import load_vip
vip = load_vip()
vip.eval()
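Once loaded, the frozen model maps images to embeddings; a minimal usage sketch follows, where the preprocessing (resize/crop and pixel range) is an assumption to verify against the official repository.

import torch
import torchvision.transforms as T
from PIL import Image
from vip import load_vip

vip = load_vip()
vip.eval()
# Assumed preprocessing; check the official repo for the exact expected input format.
transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
image = transform(Image.open("frame.png")) * 255.0  # assumed [0, 255] pixel range
with torch.no_grad():
    embedding = vip(image.unsqueeze(0))  # (1, D) VIP embedding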
Use Cases: Use VIP as an out-of-the-box visual representation and reward (feature coming soon!) for existing RL environments, and train RL policies with it.
Example:
from torchrl.envs.transforms import TransformedEnv, VIPTransform
env = TransformedEnv(my_env, VIPTransform(keys_in=["next_pixels"], download=True))
On 36 diverse vision-based manipulation tasks, with just a goal image as the task specification, VIP provides zero-shot visual reward and representation that enable effective trajectory optimization and online RL without any task-specific fine-tuning.
Provided with just a goal image, VIP's embedding generates smooth dense rewards that can be used to solve diverse unseen robot tasks.
The default number of trajectories is 32, and we used this configuration for Figure 4.
With VIP's reward and representation, offline RL is simple, sample-efficient, and more effective than BC.
VIP vs. R3M Real-World Offline RL (and IL) Policies
VIP's zero-shot visual reward and representation can enable simple and practical few-shot offline RL for real-world robot learning,
significantly outperforming VIP-BC, which fails to leverage reward information.
VIP (Offline RL) acquires complex manipulation behavior, such as executing recovery actions when failing to solve the task initially:
Through an extensive set of qualitative analyses, we find VIP's embedding to be much more temporally smooth and consistent than all other pre-trained representations.