VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
(ICLR 2023, Spotlight)

Overview

VIP trains an (implicit) goal-conditioned value function on large-scale, in-the-wild, unlabeled human videos to learn an effective visual representation that can perform zero-shot reward specification for unseen downstream robot tasks.

Given the VIP embedding and a goal image, the reward is simply the (difference in) embedding distance to the goal.
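
For example, with a frozen VIP encoder phi (a minimal sketch; the encoder handle, preprocessing, and tensor shapes are illustrative, not the released API):

import torch

def vip_reward(phi, frame_t, frame_next, goal_image):
    # Embed the current frame, the next frame, and the goal image with the frozen VIP encoder.
    with torch.no_grad():
        e_t, e_next, e_g = phi(frame_t), phi(frame_next), phi(goal_image)
    # Goal-conditioned value: negative embedding distance to the goal.
    v_t = -torch.linalg.norm(e_t - e_g, dim=-1)
    v_next = -torch.linalg.norm(e_next - e_g, dim=-1)
    # Dense reward: the decrease in embedding distance to the goal over one step.
    return v_next - v_t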

Abstract

Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. 

We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories. 

VIP Zero-Shot Reward:

The pre-trained VIP representation encodes a value function in its latent space and can be used to generate dense rewards for unseen robot tasks without any in-domain fine-tuning.

Try out VIP reward generation here

VIP Real-World Gallery:

Offline RL made simple: with just 20 trajectories, VIP's frozen reward and representation make offline RL simple and more effective than BC, with almost no added complexity.

CloseDrawer (articulated object)

PushBottle (transparent object)

PickPlaceMelon (soft object)

FoldTowel (deformable object)

Algorithm

VIP is completely self-supervised, requiring only an observation-only dataset (e.g., human videos) for pre-training.

VIP's core training loop can be expressed in as few as 10 lines of PyTorch code:
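
A minimal sketch of that loop is below, assuming a trainable frame encoder phi, an optimizer over its parameters, and a dataloader that samples, from each video, an initial frame o0, a goal frame g, an adjacent frame pair (ot, ot1), and the sparse goal reward r (0 at the goal frame, -1 otherwise); names and hyperparameters are illustrative rather than the exact released implementation:

import torch

gamma, epsilon = 0.98, 1e-8  # discount and numerical-stability constant (illustrative values)

for o0, g, ot, ot1, r in dataloader:
    # Embed the initial frame, goal frame, and an adjacent frame pair from the same video.
    e0, eg, et, et1 = phi(o0), phi(g), phi(ot), phi(ot1)
    # Goal-conditioned values: negative embedding distances to the goal.
    v0 = -torch.linalg.norm(e0 - eg, dim=-1)
    vt = -torch.linalg.norm(et - eg, dim=-1)
    vt1 = -torch.linalg.norm(et1 - eg, dim=-1)
    # Value-implicit objective: draw the initial frame toward the goal, while the
    # log-exp term enforces the recursive one-step structure over adjacent frames.
    loss = (1 - gamma) * -v0.mean() + torch.log(
        epsilon + torch.mean(torch.exp(-(r + gamma * vt1 - vt))))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()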

Code

There are two ways to access VIP: (1) the official code release, and (2) TorchRL.

Use Cases: Visualize and play around with the released VIP model, or train new VIP models.

Example:

from vip import load_vip

vip = load_vip()
vip.eval()
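
Continuing the example, the loaded model maps preprocessed frames to frozen visual features (a sketch; the 224x224 resolution and the [0, 255] pixel-value convention are assumptions to check against the repository's usage examples):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
vip.to(device)

# A batch of RGB frames, shape (B, 3, 224, 224), with pixel values in [0, 255] (assumed convention).
frames = torch.rand(2, 3, 224, 224, device=device) * 255
with torch.no_grad():
    embeddings = vip(frames)  # one frozen VIP embedding per frame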


Use Cases: Use VIP as an out-of-the-box visual representation and reward (feature coming soon!) for existing RL environments and train RL policies.


Example:

from torchrl.envs.transforms import TransformedEnv, VIPTransform

# my_env: an existing TorchRL environment that provides pixel observations
env = TransformedEnv(my_env, VIPTransform(keys_in=["next_pixels"], download=True))

Experiments: 

Can VIP's frozen representation and reward support diverse visuomotor strategies? 


Trajectory Optimization And Online Reinforcement Learning

On 36 diverse vision-based manipulation tasks, with just a goal image as the task specification, VIP provides a zero-shot visual reward and representation that enables effective trajectory optimization and online RL without any task-specific fine-tuning.
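
To make this concrete, the sketch below shows a simplified shooting-style planner that uses only the goal image and the frozen VIP encoder to score candidate action sequences; env.copy, step_and_render, and sample_action_sequences are hypothetical placeholders, not the benchmark's actual interfaces:

import torch

def plan_one_step(env, phi, goal_image, horizon=20, num_candidates=32):
    # Embed the goal image once with the frozen VIP encoder.
    with torch.no_grad():
        goal_emb = phi(goal_image)
    candidates = sample_action_sequences(num_candidates, horizon)  # hypothetical sampler
    scores = []
    for actions in candidates:
        sim = env.copy()                      # hypothetical simulator clone
        prev_dist, total = None, 0.0
        for a in actions:
            frame = sim.step_and_render(a)    # hypothetical step + camera render
            with torch.no_grad():
                dist = torch.linalg.norm(phi(frame) - goal_emb, dim=-1)
            if prev_dist is not None:
                total += float(prev_dist - dist)  # VIP reward: decrease in goal distance
            prev_dist = dist
        scores.append(total)
    # Execute the first action of the best-scoring sequence (receding-horizon control).
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best][0]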

VIP vs. R3M MPPI Trajectories

Provided with just a goal image, VIP's embedding generates smooth dense rewards that can be used to solve diverse unseen robot tasks.

[Rollout videos: VIP vs. R3M on Microwave-Close (center-view goal image), Leftdoor-Open (center-view goal image), Sidedoor-Open (right-view goal image), and Rightdoor-Close (left-view goal image).]

VIP benefits from compute scaling in downstream control, whereas baselines often do worse as the optimizer becomes more powerful.

The default number of sampled trajectories is 32, and we used this configuration for Figure 4.


Real-World Few-Shot Offline Reinforcement Learning

With VIP's reward and representation, offline RL is simple, sample-efficient, and more effective than BC.
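
Concretely, an offline dataset of robot trajectories can be relabeled with VIP rewards and embeddings before running any standard offline RL algorithm; a minimal sketch (the trajectory format, phi, and the goal image are illustrative placeholders):

import torch

def relabel_with_vip(trajectories, phi, goal_image):
    # Replace each transition's reward with the VIP goal-embedding reward and use
    # the frozen embeddings as the state representation for offline RL.
    with torch.no_grad():
        goal_emb = phi(goal_image)
    dataset = []
    for frames, actions in trajectories:   # frames: (T, 3, H, W), actions: (T-1, action_dim)
        with torch.no_grad():
            embs = phi(frames)             # (T, emb_dim) frozen VIP features
        dists = torch.linalg.norm(embs - goal_emb, dim=-1)
        rewards = dists[:-1] - dists[1:]   # reward = decrease in embedding distance to the goal
        dataset.append({
            "observations": embs[:-1],
            "actions": actions,
            "rewards": rewards,
            "next_observations": embs[1:],
        })
    return dataset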

VIP vs. R3M Real-World Offline RL (and IL) Policies

VIP's zero-shot visual reward and representation can enable simple and practical few-shot offline RL for real-world robot learning, significantly outperforming VIP (BC), which fails to leverage reward information.

Per-task success rates:

Task             VIP (Offline RL)   VIP (BC)   R3M (Offline RL)   R3M (BC)
CloseDrawer      100%               50%        80%                10%
PushBottle       90%                50%        70%                50%
PickPlaceMelon   60%                10%        0%                 0%
FoldTowel        90%                20%        0%                 0%

VIP (Offline RL) acquires complex manipulation behaviors, such as executing recovery actions when it initially fails to solve the task:


Qualitative Analysis

Through an extensive set of qualitative analyses, we find VIP's embedding to be substantially more temporally smooth and consistent than all other pre-trained representations.