VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
(ICLR 2023, Spotlight)
Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar*, Amy Zhang*
FAIR, Meta AI | University of Pennsylvania
Overview
VIP trains an (implicit) goal-conditioned value function on large-scale, in-the-wild, unlabeled human videos to learn an effective visual representation that can perform zero-shot reward specification on unseen downstream robot tasks.
Given the VIP embedding and a goal image, the reward is simply the change in embedding distance to the goal.
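Concretely, this recipe can be sketched in a few lines. In the snippet below, the `vip_reward` helper and the 1024-dimensional embeddings are illustrative stand-ins; in practice the embeddings come from the frozen pre-trained VIP encoder:

```python
import torch

def vip_reward(phi_obs, phi_next_obs, phi_goal):
    # The value of an observation is its negative embedding distance to the
    # goal; the reward for a transition is the change in that value, i.e.
    # how much closer the embedding moved toward the goal.
    dist = torch.norm(phi_obs - phi_goal, dim=-1)
    next_dist = torch.norm(phi_next_obs - phi_goal, dim=-1)
    return dist - next_dist

# Toy 1024-dim embeddings (dimensionality is illustrative): the transition
# moves the observation halfway toward the goal, so the reward is positive.
phi_g = torch.zeros(1024)
phi_o = torch.ones(1024)
phi_o_next = 0.5 * torch.ones(1024)
r = vip_reward(phi_o, phi_o_next, phi_g)  # 32.0 - 16.0 = 16.0
```

Because the reward is a difference of distances, it is dense (defined at every step) and positive exactly when a transition makes progress toward the goal image.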
Abstract
Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question.
We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
VIP Zero-Shot Reward:
The pre-trained VIP representation encodes a value function in its latent space and
can be used to generate dense rewards for unseen robot tasks without any in-domain fine-tuning.
Try out VIP reward generation here!
VIP Real-World Gallery:
Offline RL made simple: with just 20 trajectories, VIP's frozen reward and representation make offline RL simple and more effective than BC, with almost no added complexity.
CloseDrawer (articulated object)
PushBottle (transparent object)
PickPlaceMelon (soft object)
FoldTowel (deformable object)
Algorithm
VIP is completely self-supervised, requiring only an observation-only dataset (e.g., human videos) for pre-training:
VIP's core training loop can be implemented in as few as 10 lines of PyTorch code:
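The loop below is a minimal, self-contained sketch of that objective, written from the paper's description rather than copied from the released code: a tiny stand-in encoder replaces the ResNet-50 backbone, and random tensors stand in for sub-trajectory frames (o_0, o_t, o_{t+1}, g) sampled from videos. Details (e.g., reward convention, batching) may differ in the official implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in encoder; the actual VIP backbone is a ResNet-50.
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))

def vip_loss(o0, ot, ot_next, g, gamma=0.98):
    # V(o; g) = -||phi(o) - phi(g)||: the embedding distance defines the value.
    e_g = phi(g)
    v0 = -torch.norm(phi(o0) - e_g, dim=-1)            # value of initial frames
    vs = -torch.norm(phi(ot) - e_g, dim=-1)            # value of intermediate frames
    vs_next = -torch.norm(phi(ot_next) - e_g, dim=-1)  # value of successor frames
    r = -1.0  # constant -1 reward for every non-goal step
    attract = (1 - gamma) * (-v0).mean()               # pull initial frames toward goals
    repel = torch.log(torch.exp(r + gamma * vs_next - vs).mean())  # log-exp TD term
    return attract + repel

# Toy batch: 8 random 3x32x32 "frames" standing in for (o_0, o_t, o_{t+1}, g).
B = 8
o0, ot, ot_next, g = (torch.randn(B, 3, 32, 32) for _ in range(4))
loss = vip_loss(o0, ot, ot_next, g)
loss.backward()  # the objective is differentiable end-to-end through the encoder
```

Note that no actions appear anywhere in the loss: only frames are needed, which is what makes pre-training on unlabeled human videos possible.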
Code
There are two ways to access VIP: (1) the official code release, and (2) TorchRL.
Official Code Release:
Use Cases: Visualize and play around with the released VIP model, or train new VIP models.
Example:
from vip import load_vip
vip = load_vip()
vip.eval()
TorchRL:
Use Cases: Use VIP as an out-of-the-box visual representation and reward (feature coming soon!) for existing RL environments, and train RL policies.
Example:
from torchrl.envs import TransformedEnv
from torchrl.envs.transforms import VIPTransform
env = TransformedEnv(my_env, VIPTransform(keys_in=["next_pixels"], download=True))
Experiments:
Can VIP's frozen representation and reward support diverse visuomotor strategies?
Trajectory Optimization And Online Reinforcement Learning
On 36 diverse vision-based manipulation tasks, with just a goal image as the task specification, VIP provides zero-shot visual reward and representation that enable effective trajectory optimization and online RL without any task-specific fine-tuning.
VIP vs. R3M MPPI Trajectories
Provided with just a goal image, VIP's embedding generates smooth dense rewards that can be used to solve diverse unseen robot tasks.
Microwave-Close Goal Image (center view)
VIP
R3M
Leftdoor-open Goal Image (center view)
VIP
R3M
Sidedoor-Open Goal Image (right view)
VIP
R3M
Rightdoor-close Goal Image (left view)
VIP
R3M
VIP benefits from compute scaling in downstream control, whereas baselines often do worse as the optimizer becomes more powerful.
The default number of trajectories is 32; we used this configuration for Figure 4.
Real-World Few-Shot Offline Reinforcement Learning
With VIP's reward and representation, offline RL is simple, sample-efficient, and more effective than BC.
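Under this recipe, the only extra step over BC is labeling the demonstration trajectories with VIP rewards before running an off-the-shelf offline RL algorithm. A sketch of that labeling step follows; the encoder, helper name, and tensor shapes are illustrative, with a toy network standing in for the frozen pre-trained VIP encoder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def label_rewards(phi, frames, goal):
    """Label an observation-only trajectory with dense goal-distance rewards.

    phi:    frozen pre-trained encoder (toy stand-in below; VIP uses ResNet-50)
    frames: (T, C, H, W) video of one demonstration
    goal:   (C, H, W) goal image
    Returns a (T-1,) tensor of per-transition rewards: the decrease in
    embedding distance to the goal at each step.
    """
    with torch.no_grad():
        emb = phi(frames)                          # (T, d)
        goal_emb = phi(goal.unsqueeze(0))          # (1, d)
        dist = torch.norm(emb - goal_emb, dim=-1)  # (T,)
    return dist[:-1] - dist[1:]

# Toy stand-in encoder and a 5-frame random "demonstration".
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32))
traj = torch.randn(5, 3, 16, 16)
rewards = label_rewards(phi, traj, goal=traj[-1])  # last frame as the goal image
```

Since the encoder stays frozen, this labeling is a one-time preprocessing pass over the 20 trajectories; the downstream offline RL algorithm then consumes (embedding, action, reward) tuples as usual.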
VIP vs. R3M Real-World Offline RL (and IL) Policies
VIP's zero-shot visual reward and representation enable simple and practical few-shot offline RL for real-world robot learning, significantly outperforming VIP (BC), which fails to leverage reward information.
Task              VIP (Offline RL)   VIP (BC)   R3M (Offline RL)   R3M (BC)
CloseDrawer       100%               50%        80%                10%
PushBottle        90%                50%        70%                50%
PickPlaceMelon    60%                10%        0%                 0%
FoldTowel         90%                20%        0%                 0%
VIP (Offline RL) acquires complex manipulation behavior, such as executing recovery actions when failing to solve the task initially:
Qualitative Analysis
Through an extensive set of qualitative analyses, we find VIP's embedding to be much more temporally smooth and consistent than all other pre-trained representations.