Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Ethan Perez*

Anthropic

David Lindner*

ETH Zurich

Abstract. Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single-sentence text prompt describing the desired task, with minimal prompt engineering. We can improve performance by providing a second “baseline” prompt and projecting out the parts of the CLIP embedding space that are irrelevant for distinguishing between the goal and the baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become increasingly useful reward models for a wide range of RL applications.

Citation

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner. Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning. arXiv preprint arXiv:2310.12921, 2023.


@article{rocamonde2023vision,
  title={Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning},
  author={Rocamonde, Juan and Montesinos, Victoriano and Nava, Elvis and Perez, Ethan and Lindner, David},
  journal={arXiv preprint arXiv:2310.12921},
  year={2023}
}

Vision-Language Models as Reward Models (VLM-RMs)

We successfully use vision-language models (VLMs), and specifically CLIP models, as reward models for RL agents. Instead of manually specifying a reward function, we only need to provide a text prompt such as "a humanoid robot kneeling" to instruct the agent and provide feedback. We also propose "Goal-Baseline Regularization", an optional technique that uses an additional "baseline" prompt describing the task setting irrespective of the goal, such as "a humanoid", to improve VLM-RM performance.
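To make this concrete, below is a minimal Python sketch of a CLIP-based VLM-RM with optional Goal-Baseline Regularization, using the open_clip library. The class name CLIPReward, the default checkpoint tag, and the implementation details of the alpha-weighted projection are illustrative assumptions based on the description above, not the exact code used for the experiments.

# Minimal sketch of a CLIP-based VLM-RM with optional Goal-Baseline Regularization.
# Names, the default checkpoint, and implementation details are assumptions.
import torch
import open_clip


class CLIPReward:
    def __init__(self, goal_prompt, baseline_prompt=None, alpha=0.0,
                 model_name="ViT-bigG-14", pretrained="laion2b_s39b_b160k"):
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            model_name, pretrained=pretrained)
        self.tokenizer = open_clip.get_tokenizer(model_name)
        self.model.eval()

        self.goal = self._embed_text(goal_prompt)         # unit-norm goal embedding g
        self.alpha = alpha
        if baseline_prompt is not None:
            baseline = self._embed_text(baseline_prompt)  # unit-norm baseline embedding b
            direction = self.goal - baseline
            direction = direction / direction.norm()
            # Rank-1 projection onto the goal-baseline direction.
            self.proj = torch.outer(direction, direction)
        else:
            self.proj = None

    @torch.no_grad()
    def _embed_text(self, prompt):
        tokens = self.tokenizer([prompt])
        emb = self.model.encode_text(tokens).squeeze(0)
        return emb / emb.norm()

    @torch.no_grad()
    def __call__(self, image):
        """Reward for a rendered frame (a PIL image)."""
        x = self.preprocess(image).unsqueeze(0)
        s = self.model.encode_image(x).squeeze(0)
        s = s / s.norm()
        if self.proj is None or self.alpha == 0.0:
            return torch.dot(s, self.goal).item()         # plain cosine similarity
        # Goal-Baseline Regularization: interpolate between the raw embedding and
        # its projection onto the goal-baseline direction, then score against the goal.
        s_reg = self.alpha * (self.proj @ s) + (1 - self.alpha) * s
        return (1 - 0.5 * (s_reg - self.goal).pow(2).sum()).item()

With alpha = 0 (or no baseline prompt), this reduces to the cosine similarity between the rendered frame and the goal prompt; larger alpha projects the state embedding more strongly onto the direction separating the goal from the baseline.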

Using VLM-RMs to train a Humanoid Robot

We train a MuJoCo humanoid to learn complex tasks from a text description. Because we focus on tasks without a ground-truth reward function, numerical evaluations rely on human labels and can be misleading on their own. To address this, we provide videos of our agents' performance.
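To illustrate how the reward model plugs into training, the sketch below wraps a Gymnasium Humanoid environment so that each step is scored by the CLIPReward sketched above on a rendered frame. The wrapper name, environment version, and rendering details are assumptions; the actual training setup may differ.

# Illustrative only: replace the Humanoid environment's reward with a VLM-RM
# evaluated on rendered frames. Any standard RL algorithm can then be trained
# on the wrapped environment.
import gymnasium as gym
from PIL import Image


class VLMRewardWrapper(gym.Wrapper):
    def __init__(self, env, vlm_reward):
        super().__init__(env)
        self.vlm_reward = vlm_reward

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        frame = Image.fromarray(self.env.render())  # requires render_mode="rgb_array"
        reward = self.vlm_reward(frame)             # text-specified reward, no hand-coded shaping
        return obs, reward, terminated, truncated, info


# Usage sketch with the CLIPReward class defined above.
env = gym.make("Humanoid-v4", render_mode="rgb_array")
env = VLMRewardWrapper(env, CLIPReward(goal_prompt="a humanoid robot kneeling"))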

Kneeling

Prompt: "a humanoid robot kneeling".

Success Rate: 100%.

Lotus Position

Prompt: "a humanoid robot seated down, meditating in the lotus position".

Success Rate: 100%.

Splits

Prompt: "a humanoid robot practicing gymnastics, doing the side splits".

Success Rate: 100%.

Arms Raised

Prompt: "a humanoid robot standing up, with both arms raised".

Success Rate: 100%.

Hands on Hips

Prompt: "a humanoid robot standing up with hands on hips".

Success Rate: 64%.

Standing Up

Prompt: "a humanoid robot standing up".

Success Rate: 100%.

Standing on One Leg

Prompt: "a humanoid robot standing up on one leg".

Success Rate: 0%.

Arms Crossed

Prompt: "a humanoid robot standing up, with its arms crossed".

Success Rate: 0%.

VLM-RM Performance Scaling with Model Size

Importantly, we find that larger VLMs provide more accurate reward signals. We focus on the “kneeling” task and consider four CLIP models of increasing size. As model size increases, the EPIC distance to the (human-labelled) ground-truth reward steadily decreases. Moreover, only the largest model (ViT-bigG-14) yields a VLM-RM that successfully trains an agent to complete the task.

The left plot shows the EPIC distance between the CLIP reward model and human-labelled states for different sizes of CLIP models (lower is better). The right plot shows the human-evaluated success rate. Larger CLIP models are better reward models, and only the largest publicly available model achieves a non-zero success rate on the kneeling task.
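For rewards that depend only on the state, as is the case here, the EPIC pseudometric reduces (up to terms the Pearson correlation is invariant to) to the Pearson distance between the two reward functions over a distribution of states. The snippet below sketches this computation; the arrays clip_rewards and human_labels are hypothetical placeholders for VLM-RM scores and binary human goal annotations of the same sampled states.

# Sketch of the evaluation behind the left plot: Pearson distance between a
# state-based VLM-RM and human goal labels over sampled states.
import numpy as np


def pearson_distance(r1, r2):
    """Pearson distance sqrt((1 - rho) / 2); lies in [0, 1], lower is closer."""
    rho = np.corrcoef(r1, r2)[0, 1]
    return np.sqrt((1.0 - rho) / 2.0)


clip_rewards = np.array([0.71, 0.65, 0.80, 0.62])  # hypothetical VLM-RM scores
human_labels = np.array([1.0, 0.0, 1.0, 0.0])      # hypothetical human goal labels
print(pearson_distance(clip_rewards, human_labels))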

Below, we provide video samples of agent performance when training with CLIP models of different sizes.

RN50

Success Rate: 0%.

ViT-L-14

Success Rate: 0%.

ViT-H-14

Success Rate: 0%.

ViT-bigG-14

Success Rate: 100%.
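All four backbones compared above are available through the open_clip library. The snippet below shows one way to instantiate them; the pretrained checkpoint tags are assumptions (standard OpenAI and LAION releases) and may not match the exact checkpoints used in the experiments.

# One way to instantiate the four CLIP backbones compared above via open_clip.
# The checkpoint tags below are assumptions, not necessarily the ones used here.
import open_clip

BACKBONES = {
    "RN50": "openai",
    "ViT-L-14": "openai",
    "ViT-H-14": "laion2b_s32b_b79k",
    "ViT-bigG-14": "laion2b_s39b_b160k",
}

models = {
    name: open_clip.create_model_and_transforms(name, pretrained=tag)
    for name, tag in BACKBONES.items()
}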

Ablation Studies on the Humanoid Environment

We also perform ablation studies on the humanoid environment, since we obtained the best performance after changing the environment's default camera angle and textures. We find that the texture change was the most important factor for successful training.

Original Camera Angle + Original Texture

Success Rate: 36%.

Original Camera Angle + Modified Texture

Success Rate: 91%.

Modified Camera Angle + Modified Texture

Success Rate: 100%.