FuRL: Visual-Language Models as Fuzzy Rewards for RL
Yuwei Fu¹ ² * Haichao Zhang² Di Wu¹ Wei Xu² Benoit Boulet¹
¹McGill University ²Horizon Robotics
*Work done during an internship at Horizon Robotics
ICML 2024
Abstract
In this work, we investigate how to leverage pre-trained visual-language models (VLMs) for online Reinforcement Learning (RL). In particular, we focus on sparse-reward tasks with pre-defined textual task descriptions. We first identify the problem of reward misalignment when applying a VLM as a reward in RL tasks. To address this issue, we introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL), based on reward alignment and relay RL. Specifically, we enhance the performance of SAC/DrQ baseline agents on sparse-reward tasks by fine-tuning VLM representations and using relay RL to avoid local minima. Extensive experiments on Meta-World benchmark tasks demonstrate the efficacy of the proposed method.
Reward Misalignment
In our work, we point out a reward misalignment issue that has been neglected by much existing work: a pre-trained VLM can be misleading when used to train an RL agent, because its rewards are not well aligned with actual task progress.
For example, in the example shown on the left for the "button-pushdown" task, the L2 distance between the end-effector and the goal position decreases along the expert trajectory, but the VLM reward fluctuates noisily, although ideally it should be increasing.
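To make this concrete, below is a minimal diagnostic sketch (ours, not from the paper) that quantifies how far a per-step VLM reward is from the ideal monotone signal along an expert trajectory. The function and argument names (misalignment_report, vlm_rewards, goal_distances) are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def misalignment_report(vlm_rewards: np.ndarray, goal_distances: np.ndarray) -> None:
    """Compare per-step VLM rewards against the ideal signal -L2(ee, goal).

    Along an expert trajectory the goal distance shrinks monotonically,
    so an ideal reward should rank-correlate perfectly with its negation.
    """
    ideal = -goal_distances
    rho, _ = spearmanr(vlm_rewards, ideal)
    # Fraction of transitions where the VLM reward moves the wrong way
    # (decreases even though the agent is getting closer to the goal).
    wrong_direction = float(np.mean(np.diff(vlm_rewards) < 0.0))
    print(f"Spearman correlation with ideal reward: {rho:.3f}")
    print(f"Transitions with decreasing VLM reward: {wrong_direction:.1%}")
```

A well-aligned reward would show a Spearman correlation close to 1 and few decreasing transitions; the fluctuating curve described above shows neither.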
The VLM-as-Reward framework suffers from reward misalignment
VLM-as-Reward: using a pre-trained VLM as a reward function is a recently popular paradigm for RL, referred to as the VLM-as-Reward framework.
The VLM-as-Reward framework can suffer from the reward misalignment issue, leading to trained policies with undesired behaviors. For example, in the window-open task shown on the left, the VLM-reward-based agent moves to the final position (star) but never touches the handle, and therefore fails the task.
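For reference, here is a minimal sketch of the generic VLM-as-Reward recipe, using Hugging Face CLIP as the pre-trained VLM: the per-step reward is the cosine similarity between the current frame's image embedding and the task description's text embedding. The checkpoint choice and the absence of any reward scaling are our assumptions, not necessarily the exact setup evaluated in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def vlm_reward(frame, task_description: str) -> float:
    """Per-step reward: cosine similarity between image and text embeddings."""
    inputs = processor(text=[task_description], images=frame,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()
```

Because the similarity score only loosely tracks task progress, this raw reward is exactly the fuzzy signal that causes the failure mode above.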
Framework
FuRL contains two major, interacting components (a minimal sketch follows this list):
Reward Alignment: fine-tunes VLM representations in a lightweight manner to improve the VLM rewards, which aids exploration and policy learning;
Relay RL: helps the agent escape local minima caused by the fuzzy VLM rewards during exploration, and also collects more diverse data, which in turn improves both reward alignment and policy learning.
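Below is a highly simplified sketch of how the two components could fit together, under our own assumptions (the paper's actual alignment loss and relay schedule differ in their details): small trainable projection heads on top of frozen VLM embeddings for reward alignment, and an episode rollout that relays control from the RL agent to an exploration policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardAligner(nn.Module):
    """Lightweight projection heads on top of frozen VLM embeddings (sketch).

    Only these small heads are trained; the VLM itself stays frozen,
    which keeps the fine-tuning cheap.
    """
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.txt_proj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.img_proj(img_emb), dim=-1)
        txt = F.normalize(self.txt_proj(txt_emb), dim=-1)
        return (img * txt).sum(dim=-1)  # aligned fuzzy reward per step

def relay_rollout(env, agent_policy, explore_policy, switch_step: int, horizon: int):
    """Relay rollout (sketch): hand control from the RL agent to an
    exploration policy partway through the episode to escape local minima.
    Assumes the classic Gym step API (obs, reward, done, info)."""
    obs, trajectory = env.reset(), []
    for t in range(horizon):
        policy = agent_policy if t < switch_step else explore_policy
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory
```

Because only the projection heads are trained, reward alignment stays cheap, and the trajectories gathered under both policies feed both the aligner and the RL agent.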
Performance Comparison
FuRL outperforms a number of baseline methods on sparse-reward Meta-World benchmark tasks.
Related Publications and Resources
FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning
Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet
ICML 2024
@inproceedings{furl,
  title={{FuRL}: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning},
  author={Yuwei Fu and Haichao Zhang and Di Wu and Wei Xu and Benoit Boulet},
  booktitle={International Conference on Machine Learning},
  year={2024}
}