Guide Your Agent with Adaptive Multimodal Rewards
Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee,
TL;DR
We present ARP (Adaptive Return-conditioned Policy), a novel imitation learning (IL) framework that trains a return-conditioned policy using adaptive multimodal rewards computed in the embedding space of a pre-trained multimodal encoder.
Generalization to Unseen Instructions
ARP can guide the agent using adaptive multimodal rewards even when it receives unseen text instructions describing new target objects with unseen shapes and colors.
CoinRun-bluegem
Train: the agent collects a coin that is consistently positioned on the far right of the map.
Test: the target object is changed to a blue gem, and its location is randomized.
Train Instruction: "The goal is to collect the coin."
InstructRL (Train)
ARP (Ours) (Train)
⬇
Test Instruction: "The goal is to collect the blue gem."
InstructRL (Test)
ARP (Ours) (Test)
Maze III
Train: the agent approaches a yellow diagonal line located at a random position.
Test: modified environment with three objects: a yellow gem, a red diagonal line, and a red straight line, where the goal is to reach the red diagonal line.
Train Instruction: "Navigate a maze to collect the line."
InstructRL (Train)
ARP (Ours) (Train)
⬇
Test Instruction: "Navigate a maze to collect the red diagonal line."
InstructRL (Test)
ARP (Ours) (Test)
Procgen Experiments
While InstructRL misses the target object in test environments where its location has changed, our ARP successfully reaches the goal in these environments.
CoinRun
Train: the agent collects a coin that is consistently positioned on the far right of the map.
Test: the location of the coin is randomized.
InstructRL (Train)
⬇
InstructRL (Test)
ARP (Ours) (Train)
⬇
ARP (Ours) (Test)
Maze I
Train: the agent reaches a yellow cheese that is always located at the top right corner.
Test: the cheese is placed at a random position.
InstructRL (Train)
⬇
InstructRL (Test)
ARP (Ours) (Train)
⬇
ARP (Ours) (Test)
Maze II
Train: the agent approaches a yellow diagonal line located at a random position.
Test: modified environment with two objects: a yellow gem and a red diagonal line, where the goal is to reach the red diagonal line.
InstructRL (Train)
⬇
InstructRL (Test)
ARP (Ours) (Train)
⬇
ARP (Ours) (Test)
Abstract
Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization, yielding superior generalization performance compared with existing text-conditioned policies, even when faced with unseen text instructions. To improve the quality of the rewards, we also introduce a fine-tuning method for the pre-trained multimodal encoders, further enhancing performance.
Adaptive Return-conditioned Policy
Our main idea is to measure the similarity between visual observations and natural language task descriptions in the pre-trained multimodal embedding space (CLIP) and use it as a reward signal. We then train a return-conditioned policy on demonstrations annotated with these multimodal reward labels.
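The labeling step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed_image` and `embed_text` are toy deterministic stand-ins for the pre-trained CLIP image/text encoders, and helper names such as `label_demo_with_returns` are hypothetical. The essential structure is that each observation receives a reward equal to its embedding-space similarity to the instruction, and the demonstration is then annotated with returns-to-go that condition the policy.

```python
import math

# Toy stand-ins for pre-trained multimodal encoders (e.g. CLIP's image
# and text towers). In practice you would call the real encoders; these
# just map a string deterministically to a small vector.
def embed_image(observation, dim=8):
    seed = sum(map(ord, observation))
    return [math.sin(seed + i) for i in range(dim)]

def embed_text(instruction, dim=8):
    seed = sum(map(ord, instruction))
    return [math.sin(seed + i) for i in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def multimodal_reward(observation, instruction):
    """Reward = similarity of observation and instruction embeddings,
    so the signal adapts per timestep and per instruction."""
    return cosine_similarity(embed_image(observation),
                             embed_text(instruction))

def label_demo_with_returns(observations, instruction, gamma=1.0):
    """Annotate a demonstration with per-step multimodal rewards and
    returns-to-go; the returns condition the policy during training."""
    rewards = [multimodal_reward(o, instruction) for o in observations]
    returns_to_go, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns_to_go.append(running)
    return rewards, list(reversed(returns_to_go))
```

Because the reward is recomputed from the current instruction at every timestep, swapping in an unseen instruction at test time (e.g. "collect the blue gem") changes the reward landscape without retraining the policy.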
Main Experiments: Procgen
Main Experiments: RLBench
We evaluate our method on RLBench, a standard benchmark for vision-based robotic manipulation. We observe that the adaptive multimodal rewards guide the agent to reach unseen goals in these complex robotics tasks.