In this work, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), a framework that utilizes a large-scale pretrained, frozen text-conditioned diffusion model to generate dense reward signals for policy learning. We demonstrate that TADPoLe enables the learning of novel, zero-shot policies that are flexibly and accurately conditioned on natural language inputs, across multiple robot configurations and environments, for both goal-achievement and continuous locomotion tasks. Furthermore, we observe that behaviors learned by TADPoLe are qualitatively appealing, as they align with natural priors distilled from large-scale pretraining.
The TADPoLe pipeline computes text-conditioned rewards for policy learning using a pretrained, frozen diffusion model. At each timestep, the frame rendered by the environment after the agent acts is converted into a reward signal measuring alignment with a provided text prompt. The rendered image is first corrupted with a sampled Gaussian source noise vector, and the frozen diffusion model is then used to predict the source noise that was added. The reward is designed to be large when the selected action produces frames that are well-aligned with the text prompt.
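For concreteness, below is a minimal sketch of how such a per-frame, text-conditioned reward can be computed from a frozen latent diffusion model via the `diffusers` library. The specific checkpoint, the fixed noising timestep, and the plain negative noise-prediction error are illustrative simplifications of the description above, not the exact TADPoLe reward formulation.

```python
import torch
from diffusers import StableDiffusionPipeline

# Any large-scale pretrained text-to-image latent diffusion model can serve as the
# frozen reward model; the checkpoint below is only an example.
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)


@torch.no_grad()
def text_conditioned_reward(frame: torch.Tensor, prompt: str, t: int = 400) -> float:
    """Score how well a single rendered frame aligns with a text prompt.

    frame: (3, H, W) float tensor scaled to [-1, 1] (H, W divisible by 8, e.g. 512).
    t:     diffusion timestep controlling how strongly the frame is corrupted.
    """
    # 1. Encode the rendered frame into the diffusion model's latent space.
    latents = pipe.vae.encode(frame.unsqueeze(0).to(device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor

    # 2. Corrupt the latent with a sampled Gaussian source noise vector.
    noise = torch.randn_like(latents)
    timesteps = torch.tensor([t], device=device, dtype=torch.long)
    noisy_latents = pipe.scheduler.add_noise(latents, noise, timesteps)

    # 3. Embed the text prompt with the frozen text encoder.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    # 4. Predict the added source noise, conditioned on the text prompt.
    pred_noise = pipe.unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

    # 5. The reward is large when the text-conditioned prediction recovers the true
    #    noise well, i.e. when the frame is consistent with the prompt under the
    #    model's pretrained prior.
    return -torch.nn.functional.mse_loss(pred_noise, noise).item()
```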
We perform comprehensive experiments to demonstrate the effectiveness of TADPoLe on goal achievement, continuous locomotion, and robotic manipulation tasks. For evaluation benchmarks, we select the Dog and Humanoid environments from the DeepMind Control Suite, both of which are known to be challenging due to their large action spaces and complex transition dynamics. Additionally, we select 8 robotic manipulation tasks from Meta-World, chosen to balance diversity and complexity.
We showcase text-conditioned goal-reaching behaviors learned via TADPoLe in both the Dog and Humanoid environments, and compare them against other text-to-reward approaches. TADPoLe successfully learns a variety of behaviors, from standing upright to doing splits to kneeling. Notably, the results of the Dog experiments appear more aligned with how dogs naturally "stand".
"a dog standing"
TADPoLe
"a dog standing"
VLM-RM
"a dog standing"
LIV
"a dog standing"
Text2Reward
"a person standing"
TADPoLe
"a person standing"
VLM-RM
"a person standing"
LIV
"a person standing"
Text2Reward
"a person in lotus position"
TADPoLe
"a person in lotus position"
VLM-RM
"a person in lotus position"
LIV
"a person in lotus position"
Text2Reward
"a person kneeling"
TADPoLe
"a person kneeling"
VLM-RM
"a person kneeling"
LIV
"a person kneeling"
Text2Reward
"a person doing splits"
TADPoLe
"a person doing splits"
VLM-RM
"a person doing splits"
LIV
"a person doing splits"
Text2Reward
We further investigate whether TADPoLe is sensitive to subtle variations of the input prompt. When we change the conditioning phrase from "a person standing" to "a person standing with hands above head" and train an agent as before, we find that the resulting policy indeed respects the additional detail in the text specification and keeps its hands above its head. We demonstrate similar behavior for the prompt "a person standing with hands on hips", where the agent learns to keep both hands on its hips. We take this as further evidence that TADPoLe is a powerful approach to learning text-conditioned policies, capable of respecting fine-grained details and subtleties of the input prompts. We also include an additional example of a Dog learning to chase its tail.
"a person standing"
TADPoLe
"a person standing
with hands above head"
TADPoLe
"a person standing
with hands on hips"
TADPoLe
"a dog chasing its tail"
TADPoLe
We further explore the ability to learn continuous locomotion behaviors conditioned on natural language specifications. This is challenging for approaches that statically select a canonical goal frame to achieve, such as CLIP or LIV; as a promising direction forward, we propose Video-TADPoLe, which leverages large-scale pretrained text-to-video generative models. Indeed, Video-TADPoLe outperforms ViCLIP-RM, our proposed extension of VLM-RM to text-video alignment, in both quantitative and qualitative evaluations on both Dog and Humanoid.
"a dog walking"
Video-TADPoLe
"a dog walking"
ViCLIP-RM
"a dog walking"
LIV
"a dog walking"
Text2Reward
"a person walking"
Video-TADPoLe
"a person walking"
ViCLIP-RM
"a person walking"
LIV
"a person walking"
Text2Reward
We investigate how well TADPoLe can be applied to learn robotic manipulation tasks through dense text-conditioned feedback. Whereas Meta-World normally provides manually designed dense rewards that guide the agent to complete each task, we demonstrate TADPoLe's promise by replacing this ground-truth dense reward signal with TADPoLe's text-conditioned reward. We perform thorough comparisons between TADPoLe and VLM-RM on a diverse set of selected Meta-World tasks, and observe that TADPoLe provides meaningful zero-shot dense supervision that enables success across a variety of robotic manipulation tasks through text-conditioning alone.
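As a minimal sketch of this substitution (assuming a Gymnasium-style environment API and a generic per-frame reward callable; the wrapper below is illustrative rather than the exact training code), the ground-truth dense reward can be discarded and replaced with a text-conditioned reward computed from rendered frames:

```python
import gymnasium as gym
import torch


class TextRewardWrapper(gym.Wrapper):
    """Hypothetical wrapper: replace an environment's dense reward with a
    text-conditioned reward computed from rendered frames.

    `reward_fn` is any callable mapping a (3, H, W) frame tensor and a prompt to a
    scalar, e.g. the diffusion-based sketch shown earlier.
    """

    def __init__(self, env: gym.Env, prompt: str, reward_fn):
        super().__init__(env)
        self.prompt = prompt
        self.reward_fn = reward_fn

    def step(self, action):
        # Ignore the environment's ground-truth dense reward entirely.
        obs, _, terminated, truncated, info = self.env.step(action)
        frame = self.env.render()  # (H, W, 3) uint8 array with render_mode="rgb_array"
        img = torch.from_numpy(frame.copy()).permute(2, 0, 1).float() / 127.5 - 1.0
        reward = self.reward_fn(img, self.prompt)
        return obs, reward, terminated, truncated, info
```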
"a robot arm is closing the drawer"
TADPoLe
"a robot arm is opening the drawer"
TADPoLe
"a robot arm is pushing a soccer ball into the net"
TADPoLe
"a robot arm is opening the window"
TADPoLe
"a robot arm is closing the drawer"
VLM-RM
"a robot arm is opening the drawer"
VLM-RM
"a robot arm is pushing a soccer ball into the net"
VLM-RM
"a robot arm is opening the window"
VLM-RM
We perform an additional, larger paid user study with 25 anonymous participants sourced from the general public through the Prolific platform. When asked which motion looked more natural between a policy trained with a diffusion-based supervisor and one trained with a CLIP-based supervisor, the majority of participants preferred the policies learned by TADPoLe or Video-TADPoLe in all cases. We take this as further evidence that our framework's use of a pretrained generative model as a reward signal also supervises the policy to behave in a visually natural way.
"a dog walking"
Video-TADPoLe
(76.0% Preferred Naturalness)
"a dog walking"
ViCLIP-RM
(24.0% Preferred Naturalness)
"a dog walking"
TADPoLe
(91.7% Preferred Naturalness)
"a dog walking"
VLM-RM
(8.3% Preferred Naturalness)
"a dog standing"
TADPoLe
(87.5% Preferred Naturalness)
"a dog standing"
VLM-RM
(12.5% Preferred Naturalness)
"a person walking"
Video-TADPoLe
(84.0% Preferred Naturalness)
"a person walking"
ViCLIP-RM
(16.0% Preferred Naturalness)
"a person walking"
TADPoLe
(83.3% Preferred Naturalness)
"a person walking"
VLM-RM
(16.7% Preferred Naturalness)
"a person in lotus position"
TADPoLe
(62.5% Preferred Naturalness)
"a person in lotus position"
VLM-RM
(37.5% Preferred Naturalness)
"a person doing splits"
TADPoLe
(62.5% Preferred Naturalness)
"a person doing splits"
VLM-RM
(37.5% Preferred Naturalness)
"a person kneeling"
TADPoLe
(70.8% Preferred Naturalness)
"a person kneeling"
VLM-RM
(29.2% Preferred Naturalness)
We report failure modes of Video-TADPoLe and TADPoLe. We discover that for certain seeds, Video-TADPoLe learns a Humanoid agent that walks in the direction specified by the text prompt; for other seeds, however, it may learn to walk in the opposite direction (visualized below). This demonstrates that although the agent consistently learns to walk, it may not always respect details such as direction. Providing fine-grained control over the text conditioning, so that key words such as direction are respected, is an interesting avenue for future work.
"a person walking to the right"
Video-TADPoLe
"a person walking to the left"
Video-TADPoLe
We visualize Video-TADPoLe dense text-conditioned rewards for natural videos, to demonstrate that it assigns aligned prompts a higher reward than less-aligned prompts, supporting its potential as a text-conditioned reward signal for policy learning in real-world settings. We utilize natural videos of robot arms performing tasks from BridgeData V2, and demonstrate that our reward computation assigns higher dense per-frame rewards to a more-aligned text prompt describing the action. We also use natural videos of humans performing repetitive exercises from the RepNet dataset, which we do not visualize explicitly due to potential anonymization concerns (we can provide the exact videos and details at the reviewers' request, with permission from the AC/PCs). We similarly find that the Video-TADPoLe dense reward prefers more-aligned text prompts for these natural videos.
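As a minimal sketch of this comparison protocol (assuming a generic per-frame text-conditioned reward callable such as the one sketched earlier; Video-TADPoLe itself scores sliding windows of frames with a text-to-video model), the hypothetical helper below scores every frame of a video under an aligned and a mis-aligned prompt:

```python
import numpy as np


def compare_prompt_alignment(frames, aligned_prompt, misaligned_prompt, reward_fn):
    """Score every frame of a natural video under two prompts and compare.

    frames:    list of (3, H, W) float tensors in [-1, 1] extracted from the video.
    reward_fn: any per-frame text-conditioned reward, e.g. the sketch shown earlier.
    """
    aligned_scores = [reward_fn(f, aligned_prompt) for f in frames]
    misaligned_scores = [reward_fn(f, misaligned_prompt) for f in frames]
    print(f"mean reward, aligned prompt:     {np.mean(aligned_scores):.4f}")
    print(f"mean reward, mis-aligned prompt: {np.mean(misaligned_scores):.4f}")
    return aligned_scores, misaligned_scores
```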
(Video from BridgeData V2)
(Video from BridgeData V2)
We visualize the frames corresponding to the lowest three scores for each prompt (left three columns) as well as the frames corresponding to the highest three scores for each prompt (right three columns). We observe that Video-TADPoLe assigns high scores to placing the sponge on the plate as opposed to the initial configuration. There is more confusion about which frames are preferred under the mis-aligned prompt.
We visualize the frames corresponding to the lowest three scores for each prompt (left three columns) as well as the frames corresponding to the highest three scores for each prompt (right three columns). We observe that Video-TADPoLe assigns high scores to opening the drawer as opposed to the initial configuration. There is more confusion about which frames are preferred under the mis-aligned prompt.
This video was sourced from the RepNet dataset and depicts a person performing jumping jacks; we do not show it here (without AC or PC approval) due to potential anonymization considerations for the subject.
This video was sourced from the RepNet dataset and depicts a person performing squats; we do not show it here (without AC or PC approval) due to potential anonymization considerations for the subject.
We implement TADPoLe for additional robotic tasks from Adroit and FrankaKitchen. For Adroit, we implement TADPoLe with a DrM RL backbone, but otherwise keep all hyperparameters the same as in the Meta-World experiments. For FrankaKitchen, we keep the exact same implementation as used for Meta-World.
We observe that TADPoLe is able to replace dense ground-truth rewards with a purely text-conditioned reward and still accomplish complex dexterous manipulation tasks. We reiterate that TADPoLe accomplishes this using a general-purpose, large-scale pretrained text-to-image diffusion model that has not (to our knowledge) observed any in-domain examples from Adroit.
We also report an example of a task that TADPoLe is able to learn in FrankaKitchen. FrankaKitchen is known to be difficult for online RL without access to prior demonstrations. Furthermore, prior work on LIV found that fine-tuning on FrankaKitchen is needed for language grounding, and that its text-conditioned reward is virtually meaningless without fine-tuning (Appendix G.4, Figure 20 vs. Figure 21). As TADPoLe does not use fine-tuning but still attempts language-conditioned policy learning, we expect FrankaKitchen to be challenging.
"opening a door"
Success Rate: 100%
TADPoLe
"spinning a pen"
Success Rate: 80%
TADPoLe
"hammering a nail"
Success Rate: 40%
TADPoLe
"switch on light"
Success Rate: 60%
TADPoLe