Guide Your Agent with Adaptive Multimodal Rewards

Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee,

Jinwoo Shin, Honglak Lee, Kimin Lee

Paper Code (Procgen)

TL;DR

We present ARP (Adaptive Return-conditioned Policy), a novel imitation learning framework that trains a return-conditioned policy using adaptive multimodal rewards computed in a pre-trained multimodal embedding space.

Generalization to Unseen Instructions

ARP can guide the agent using our multimodal reward even when the agent receives unseen text instructions that refer to new target objects of unseen shape and color.

CoinRun-bluegem

Train: the agent collects a coin that is consistently positioned on the far right of the map.  

Test: the target object is changed to a blue gem, and its location is randomized.

Train Instruction: "The goal is to collect the coin."

coinrun_none_video_5_instructrl.mp4

InstructRL (Train)

coinrun_none_video_5_arp.mp4

ARP (Ours) (Train)

Test Instruction: "The goal is to collect the blue gem."

coinrun_aisc_gem_video_18_instructrl.mp4

InstructRL (Test)

coinrun_aisc_gem_video_18_arp.mp4

ARP (Ours) (Test)

Maze III

Train: the agent approaches a yellow diagonal line located at a random position.

Test: the environment is modified to contain three objects: a yellow gem, a red diagonal line, and a red straight line. The goal of the agent is to reach the red diagonal line.

Train Instruction: "Navigate a maze to collect the line."

maze_ii_train.mov

InstructRL (Train)

maze_ii_train_arp.mov

ARP (Ours) (Train)

Test Instruction: "Navigate a maze to collect the red diagonal line."

maze_iii_instructrl_v2.mov

InstructRL (Test)

maze_iii_arp_v2.mov

ARP (Ours) (Test)

Procgen Experiments

While InstructRL misses the target object in test environments where its location is changed, our ARP successfully finds the goal even in these test environments.

CoinRun

Train: the agent collects a coin that is consistently positioned on the far right of the map.  

Test: the location of the coin is randomized.

coinrun_none_video_5_instructrl.mp4

InstructRL (Train)

coinrun_aisc_video_87_instructrl.mp4

InstructRL (Test)

coinrun_none_video_5_arp.mp4

ARP (Ours) (Train)

coinrun_aisc_video_87_ARP.mp4

ARP (Ours) (Test)

Maze I

Train: the agent reaches a yellow cheese that is always located at the top right corner.

Test: the cheese is placed at a random position.

maze_i_train.mov

InstructRL (Train)

maze_i_instructrl_v2.mov

InstructRL (Test)

maze_i_train_arp.mov

ARP (Ours) (Train)

maze_i_arp.mov

ARP (Ours) (Test)

Maze II

Train: the agent approaches a yellow diagonal line located at a random position.

Test: the environment is modified to contain two objects, a yellow gem and a red diagonal line; the goal is to reach the red diagonal line.

maze_ii_train.mov

InstructRL (Train)

maze_ii.mov

InstructRL (Test)

maze_ii_train_arp.mov

ARP (Ours) (Train)

maze_ii_arp.mov

ARP (Ours) (Test)

Abstract

Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This results in superior generalization performance compared to existing text-conditioned policies, even when faced with unseen text instructions. To improve the quality of the rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance.
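The multimodal reward described above is simply an image-text similarity score in a shared embedding space. The snippet below is a minimal sketch of such a reward using OpenAI's CLIP package; the exact encoder, preprocessing, and any fine-tuning used in the paper may differ.

```python
# Minimal sketch of a CLIP-based multimodal reward (an illustration, not the
# paper's released implementation).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def multimodal_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine similarity between an observation and the instruction in CLIP space."""
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([instruction]).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()
```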

Adaptive Return-conditioned Policy

Our main idea is to measure the similarity between visual observations and natural language task descriptions in a pre-trained multimodal embedding space (CLIP) and use it as a reward signal. We then train a return-conditioned policy using demonstrations annotated with these multimodal reward labels.
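As a rough sketch of the training side (an illustration under assumed data layout, not the released code), each demonstration can be relabeled with per-timestep multimodal rewards and their returns-to-go before fitting the return-conditioned policy. Here `reward_fn` stands for a reward such as the CLIP similarity sketched above, and the policy call in the comment is hypothetical.

```python
# Illustrative relabeling of an expert trajectory with multimodal rewards and
# returns-to-go; the trajectory format and policy interface are assumptions.
import numpy as np

def relabel_trajectory(frames, actions, instruction, reward_fn):
    # One multimodal reward per timestep, computed from the instruction.
    rewards = np.array([reward_fn(f, instruction) for f in frames])
    # Return-to-go at step t: sum of multimodal rewards from t to the episode end.
    returns_to_go = np.cumsum(rewards[::-1])[::-1].copy()
    return {"observations": frames, "actions": actions,
            "rewards": rewards, "returns_to_go": returns_to_go}

# At evaluation time, the reward is recomputed from whatever instruction is
# given (possibly unseen), so the return the policy is conditioned on adapts
# at every timestep, e.g.:
#   action = policy(observation, target_return)
```

Because the conditioning return is recomputed from the test-time instruction, the signal adapts to unseen instructions without retraining the policy.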

Main Experiments: Procgen

Main Experiments: RLBench

We evaluate our method on RLBench, a standard benchmark for vision-based robotic manipulation. We observe that adaptive multimodal rewards help the agent reach unseen goals in these complex robotic tasks.