Guide Your Agent with Adaptive Multimodal Rewards
Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee,
We present ARP (Adaptive Return-conditioned Policy), an imitation learning (IL) framework that trains a return-conditioned policy using adaptive multimodal rewards computed in a pre-trained multimodal embedding space.
These multimodal rewards can guide the agent even when it receives unseen text instructions referring to new target objects with unseen shapes and colors.
Train: the agent collects a coin that is consistently positioned on the far right of the map.
Test: the target object is changed to a blue gem and its location is randomized.
Train Instruction: "The goal is to collect the coin."
[Videos: InstructRL (Train) | ARP (Ours) (Train)]
Test Instruction: "The goal is to collect the blue gem."
[Videos: InstructRL (Test) | ARP (Ours) (Test)]
Train: the agent approaches a yellow diagonal line located at a random position.
Test: a modified environment with three objects (a yellow gem, a red diagonal line, and a red straight line); the goal is to reach the red diagonal line.
Train Instruction: "Navigate a maze to collect the line."
[Videos: InstructRL (Train) | ARP (Ours) (Train)]
Test Instruction: "Navigate a maze to collect the red diagonal line."
[Videos: InstructRL (Test) | ARP (Ours) (Test)]
While InstructRL misses the target object in test environments where its location has changed, ARP successfully reaches the goal even in these test environments.
Train: the agent collects a coin that is consistently positioned on the far right of the map.
Test: the location of the coin is randomized.
[Videos: InstructRL (Train) → InstructRL (Test) | ARP (Ours) (Train) → ARP (Ours) (Test)]
Train: the agent reaches a yellow cheese that is always located at the top right corner.
Test: the cheese is placed at a random position.
[Videos: InstructRL (Train) → InstructRL (Test) | ARP (Ours) (Train) → ARP (Ours) (Test)]
Train: the agent approaches a yellow diagonal line located at a random position.
Test: a modified environment with two objects (a yellow gem and a red diagonal line), where the goal is to reach the red diagonal line.
[Videos: InstructRL (Train) → InstructRL (Test) | ARP (Ours) (Train) → ARP (Ours) (Test)]
Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization, resulting in superior generalization performance compared to existing text-conditioned policies, even when faced with unseen text instructions. To further improve reward quality, we also introduce a fine-tuning method for the pre-trained multimodal encoders, which yields additional performance gains.
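To make the reward computation concrete, below is a minimal sketch (not the authors' released code) of how per-timestep multimodal rewards could be computed with OpenAI's pre-trained CLIP encoders. The function name multimodal_reward and the assumption that observations are HxWx3 uint8 frames are illustrative.

```python
# Minimal sketch of CLIP-based multimodal rewards: cosine similarity between
# each visual observation and the natural language instruction.
import torch
import clip  # OpenAI CLIP package (github.com/openai/CLIP)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def multimodal_reward(frames, instruction):
    """frames: list of HxWx3 uint8 arrays; returns one reward per timestep."""
    images = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
    text = clip.tokenize([instruction]).to(device)
    img_emb = model.encode_image(images)                     # (T, d)
    txt_emb = model.encode_text(text)                        # (1, d)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # normalize embeddings
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze(-1)                 # (T,) rewards
```

Because the reward is recomputed from whatever instruction is given, swapping in an unseen test instruction (e.g., "The goal is to collect the blue gem.") changes the reward signal without retraining the encoders.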
Our main idea is to measure the similarity between visual observations and natural language task descriptions in the pre-trained multimodal embedding space (e.g., CLIP) and use it as a reward signal. We then train a return-conditioned policy using demonstrations annotated with these multimodal reward labels.
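The sketch below illustrates how a return-conditioned policy could be trained on demonstrations relabeled with such rewards. The simple MLP architecture, the behavior-cloning loss, and names such as ReturnConditionedPolicy and returns_to_go are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch: relabel demonstrations with returns-to-go from multimodal
# rewards and train a policy conditioned on (observation, return-to-go).
import torch
import torch.nn as nn

def returns_to_go(rewards):
    """Suffix sums of per-timestep rewards: R_t = sum_{k >= t} r_k."""
    return torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])

class ReturnConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        # The target return is concatenated to the observation, so the
        # predicted action depends on both the state and the desired return.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, target_return):
        return self.net(torch.cat([obs, target_return], dim=-1))

def bc_loss(policy, obs, rewards, expert_actions):
    """Behavior cloning on (obs, return-to-go) -> expert action pairs."""
    rtg = returns_to_go(rewards).unsqueeze(-1)   # (T, 1)
    pred = policy(obs, rtg)                      # (T, act_dim)
    return ((pred - expert_actions) ** 2).mean()
```

At test time, the same policy can be conditioned on a high target return computed under the new instruction's rewards, which is what makes the guidance adaptive.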
We also evaluate our method on RLBench, a standard benchmark for vision-based robotic manipulation, and observe that adaptive multimodal rewards help the agent reach unseen goals in these complex robotic tasks.