Video generative models hold great promise for facilitating robotic behavior: they can serve as visual planners, and they can critique the behavior of policies to supervise their learning. In particular, when video generative models are pretrained on internet-scale data, they encode natural motion priors and exhibit strong text-conditioned generalization capabilities; this makes them attractive to apply to robotic environments for similar generalization benefits across novel behaviors. However, such generally-pretrained video generative models may not understand the intricacies of particular environments of interest well enough to supervise the learning of high-performing policies within them.
In this work, we provide a comprehensive study of how the text-conditioning capabilities of large-scale pretrained video generative models can be combined with environment-specific information to deliver improvements in text-conditioned generalization for robotic tasks. Our study encompasses two areas: an evaluation of different adaptation techniques, and a comparison across two downstream task evaluation approaches.
NOTE: This website may be slow to load due to the high number of video visualizations - we appreciate your patience!
We explore how in-domain information can be integrated into large-scale text-to-video models through three different adaptation techniques: Subject Customization, Probabilistic Adaptation, and Direct Finetuning. Each technique has its own data and training resource considerations; these are outlined in the main paper.
Beyond simply reporting visual quality metrics such as FVD, we propose evaluating how adapted video models enable text-conditioned generalization for robotic tasks via two approaches: visual planning and policy supervision.
We perform comprehensive experiments across continuous locomotion and robotic manipulation environments. For each adaptation technique, we evaluate policy supervision and visual planning capabilities, measuring both how well performance is preserved on tasks seen during adaptation and how well it generalizes to novel tasks unseen during adaptation.
We apply adapted video generative models in a discriminative manner to perform policy supervision; below, we visualize the environment-rendered videos achieved by rolling out the learned policies for each adaptation technique. We note that inverse probabilistic adaptation achieves the best performance on MetaWorld tasks, whereas subject customization performs well on continuous control tasks from DeepMind Control.
"a robot arm pushing a white cup towards the coffee machine"
AnimateDiff (Vanilla)
✅
"a [D] robot arm pushing a white cup towards the coffee machine"
Subject Customization
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Probabilistic Adaptation
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Inverse Probabilistic Adaptation
✅
"a robot arm pushing a white cup towards the coffee machine"
Direct Finetuning
"a robot arm opening a door"
AnimateDiff (Vanilla)
"a [D] robot arm opening a door"
Subject Customization
"a robot arm opening a door"/"door open"
Probabilistic Adaptation
"a robot arm opening a door"/"door open"
Inverse Probabilistic Adaptation
✅
"a robot arm opening a door"
Direct Finetuning
"a humanoid walking"
AnimateDiff (Vanilla)
✅
"a [D] action figure walking"
Subject Customization
✅
"a humanoid walking"
Probabilistic Adaptation
"an action figure walking"
Direct Finetuning
"a dog walking"
AnimateDiff (Vanilla)
"a [D] dog walking"
Subject Customization
✅
"a dog walking"
Probabilistic Adaptation
"a dog walking"
Direct Finetuning
As a sanity check, we can visualize at an intuitive level how an adapted video model, leveraging powerful text-conditioning capabilities, can supervise the learning of a downstream policy conditioned on a novel text prompt. We therefore visualize the free-form video generated by adapted video models, conditioned on a novel text prompt ("a dog jumping") that was unseen during adaptation. When using the adapted video model for policy supervision (simply as a critic that provides text-conditioned rewards), we showcase that it can successfully supervise a downstream Dog agent to behave according to this novel text specification in a zero-shot manner.
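For concreteness, the sketch below illustrates how such a text-conditioned critic could plug into a standard online RL loop, with the video model's reward replacing the environment reward entirely. The `env`, `agent`, and `reward_fn` interfaces are hypothetical placeholders under assumed conventions, not our actual implementation.

```python
def supervise_policy(env, agent, reward_fn, prompt, episodes=100,
                     context_window=8):
    """Train a policy from text-conditioned video-model rewards only.

    `env`, `agent`, and `reward_fn` are hypothetical wrappers: `reward_fn`
    scores a short window of rendered frames against the text prompt using
    the adapted video model as a critic; no environment reward is used.
    """
    for _ in range(episodes):
        obs, done = env.reset(), False
        frames = [env.render()]
        while not done:
            action = agent.act(obs)
            next_obs, _, done, _ = env.step(action)  # discard env reward
            frames.append(env.render())
            # Text-conditioned reward from the adapted video model critic.
            reward = reward_fn(frames[-context_window:], prompt)
            agent.update(obs, action, reward, next_obs, done)
            obs = next_obs
```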
"a [D] dog jumping"
Free-form Generated Video w/ Subject Customization
"a dog jumping"
Free-form Generated Video w/ Direct Finetuning
"a dog jumping"
Env-Rendered Policy Rollout w/ Direct Finetuning
We apply adapted video generative models as visual planners, rolling out a sequence of imagined future frames and converting them into executable actions. Below, we showcase the text-conditioned video plan produced by the adapted video model, shown above the video of the actual behavior executed by the agent. Our experiments highlight the superiority of probabilistic adaptation and its inverse for novel text-conditioned robotic behavior generalization, as they outperform both an in-domain model and a large-scale pretrained video model used alone on novel tasks.
In practice, we use a plan horizon of 1 (in other words, we re-synthesize a plan after each environment step); since this is less informative for illustrating and interpreting the synthesized plans, we also visualize results with a plan horizon of 8 below.
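To make the planning procedure concrete, here is a minimal sketch of the planning loop, assuming a text-to-video model wrapper and a learned inverse dynamics model; the interfaces (`video_model.generate`, `inverse_dynamics`) are illustrative placeholders rather than the actual implementation.

```python
def visual_planning_rollout(env, video_model, inverse_dynamics, prompt,
                            plan_horizon=1, max_steps=200):
    """Plan by imagining future frames with the adapted text-to-video model,
    then convert consecutive imagined frames into actions with a learned
    inverse dynamics model. All wrappers here are hypothetical."""
    obs = env.reset()
    for _ in range(max_steps):
        frame = env.render()  # current rendered view of the scene
        # Text-conditioned video plan anchored on the current frame.
        plan = video_model.generate(prompt=prompt, first_frame=frame,
                                    num_frames=plan_horizon + 1)
        # Follow the plan; with plan_horizon=1 we re-plan after every step.
        for t in range(plan_horizon):
            action = inverse_dynamics(plan[t], plan[t + 1])
            obs, reward, done, info = env.step(action)
            if done:
                return obs
    return obs
```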
MetaWorld Drawer Close (Unseen/Novel Task) - Plan Horizon: 1
"a robot arm closing a drawer"
AnimateDiff (Vanilla)
✅
"drawer close"
In-Domain-Only
"a [D] robot arm closing a drawer"
Subject Customization
"a robot arm closing a drawer"/"drawer close"
Probabilistic Adaptation
✅
"a robot arm closing a drawer"/"drawer close"
Inverse Probabilistic Adaptation
✅
"a robot arm closing a drawer"
Direct Finetuning
✅
MetaWorld Button Press (Unseen/Novel Task) - Plan Horizon: 1
"a robot arm pushing a button"
AnimateDiff (Vanilla)
"button press"
In-Domain-Only
✅
"a [D] robot arm pushing a button"
Subject Customization
"a robot arm pushing a button"/"button press"
Probabilistic Adaptation
✅
"a robot arm pushing a button"/"button press"
Inverse Probabilistic Adaptation
✅
"a robot arm pushing a button"
Direct Finetuning
MetaWorld Button Press (Unseen/Novel Task) - Plan Horizon: 8
"a robot arm pushing a button"
AnimateDiff (Vanilla)
"button press"
In-Domain-Only
"a [D] robot arm pushing a button"
Subject Customization
"a robot arm pushing a button"/"button press"
Probabilistic Adaptation
"a robot arm pushing a button"/"button press"
Inverse Probabilistic Adaptation
"a robot arm pushing a button"
Direct Finetuning
We investigate whether adapting large-scale pretrained video models with suboptimal demonstration data can still facilitate powerful text-conditioned in-domain plans, or whether expert examples are explicitly needed for successful adaptation. To evaluate this, we perform planning with probabilistic adaptation and its inverse, where the available adaptation data is produced by a suboptimal agent. Below, we visualize these text-conditioned plans on two novel tasks unseen during adaptation, along with the executed behaviors.
In both the Drawer Close and Window Close tasks, the in-domain model fails to generate feasible plans, whereas the adapted video models successfully complete the task. This is a promising sign that expert demonstrations may not be explicitly needed when adapting large-scale text-to-video models for downstream robotic tasks.
"drawer close"
In-Domain-Only
"a robot arm closing a drawer"
Probabilistic Adaptation
✅
"a robot arm closing a drawer"
Inverse Probabilistic Adaptation
✅
"window close"
In-Domain-Only
"a robot arm closing a window"
Probabilistic Adaptation
✅
"a robot arm closing a window"
Inverse Probabilistic Adaptation
✅
We propose a visualization technique called continued denoising, where we treat the noised video as an initialization and iteratively continue sampling to produce a final clean video prediction - thus, “continuing” the denoising procedure. For qualitative purposes, continued denoising provides a visual intuition of how adapted video generative models can be used as policy supervisors to critique achieved frames, as well as a sanity check on the integration of in-domain information through adaptation. In our experiments, we perform continued denoising conditioned on a desired text prompt, with a noise level of 700, a total frame length of 16, and 10 denoising steps. Below, we visualize continued denoising across MetaWorld and DeepMind Control Suite policies, using different video generative models and adaptation techniques.
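The sketch below illustrates the continued denoising procedure under a standard DDPM noise schedule with deterministic DDIM-style updates; the `denoiser` interface is a hypothetical text-conditioned noise predictor, not the exact model used here.

```python
import torch

@torch.no_grad()
def continued_denoising(video, denoiser, prompt, noise_level=700,
                        num_steps=10, num_train_timesteps=1000):
    """Noise an achieved video to `noise_level`, then continue sampling back
    to a clean prediction. `denoiser(x, t, prompt)` is a hypothetical
    text-conditioned epsilon-prediction model."""
    # Standard linear beta schedule and cumulative alphas.
    betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    # Forward-noise the clean video to timestep `noise_level`.
    a_t = alphas_bar[noise_level]
    x = a_t.sqrt() * video + (1 - a_t).sqrt() * torch.randn_like(video)

    # Deterministic DDIM-style steps from `noise_level` down to 0.
    timesteps = torch.linspace(noise_level, 0, num_steps + 1).long()
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps = denoiser(x, t, prompt)                      # predicted noise
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # predicted clean video
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x  # final clean video prediction
```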
"a robot arm pushing a white cup towards the coffee machine"
AnimateDiff (Vanilla)
"coffee push"
In-Domain (Vanilla)
"a [D] robot arm pushing a white cup towards the coffee machine"
Subject Customization
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Probabilistic Adaptation
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Inverse Probabilistic Adaptation
"a robot arm pushing a white cup towards the coffee machine"
Direct Finetuning
"a dog walking"
AnimateDiff (Vanilla)
"a dog walking"
In-Domain (Vanilla)
"a [D] dog walking"
Subject Customization
"a dog walking"
Probabilistic Adaptation
"a dog walking"
Inverse Probabilistic Adaptation
"a dog walking"
Direct Finetuning
"a humanoid walking"
AnimateDiff (Vanilla)
"a humanoid walking"
In-Domain (Vanilla)
"a [D] action figure walking"
Subject Customization
"a humanoid walking"
Probabilistic Adaptation
"a humanoid walking"
Inverse Probabilistic Adaptation
"an action figure walking"
Direct Finetuning
Visual planning can be highly sensitive to the visual quality of videos produced by the generative model. For additional insight, we therefore visualize free-form generated video examples using different adaptation techniques below. We observe that in MetaWorld, the in-domain model alone cannot generate clear expert videos even for seen tasks like Coffee Push; this may be because it was constructed with inherently small model capacity. However, we find that probabilistic adaptation and its inverse enable video generation of much higher quality. This could be essential for in-domain visual planning, where the inverse dynamics model relies on clear visual context to extract correct actions. Compared to vanilla AnimateDiff, we also showcase how Subject Customization and Direct Finetuning can effectively incorporate in-domain information.
"a robot arm pushing a white cup towards the coffee machine"
AnimateDiff (Vanilla)
"coffee push"
In-Domain-Only
"a [D] robot arm pushing a white cup towards the coffee machine"
Subject Customization
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Probabilistic Adaptation
"a robot arm pushing a white cup towards the coffee machine"/"coffee push"
Inverse Probabilistic Adaptation
"a robot arm pushing a white cup towards the coffee machine"
Direct Finetuning
To determine reasonable hyperparameter settings (e.g. context window size, stride, and noise level) for policy supervision, we propose Policy Discrimination, an offline method that evaluates whether the adapted video model's VideoTADPoLe reward computation can correctly distinguish expert, text-aligned videos from poor, text-unaligned videos. For each task and adaptation technique, we calculate VideoTADPoLe rewards for an expert demonstration video and a video of poor behavior quality, given a task-specific prompt. When the adapted video model properly grasps the alignment between text prompts and in-domain videos, the VideoTADPoLe rewards of expert videos (Expert Rewards) are expected to be higher than those of poor videos (Poor Rewards). Below, we show examples of policy discrimination in two environments, illustrating the delta rewards (i.e. Expert Rewards - Poor Rewards) under different hyperparameter settings.
In our examples for both environments, the Expert Rewards are generally higher than the Poor Rewards across different hyperparameter settings. Furthermore, in MetaWorld we evaluate adapted video models on a task unseen during adaptation (Button Press); the rewards not only successfully distinguish expert behaviors from poor behaviors, but also show higher values for the ending frames of expert videos, where the robot arm is approaching the task goal.
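The following sketch outlines how the delta rewards could be computed over a grid of hyperparameter settings, assuming a hypothetical `reward_fn` wrapper that returns per-frame VideoTADPoLe-style rewards for a video and prompt; the grid values mirror the settings shown below.

```python
from itertools import product
import numpy as np

def policy_discrimination(reward_fn, expert_video, poor_video, prompt):
    """Offline check: does the adapted model's reward separate expert,
    text-aligned videos from poor ones? `reward_fn` is a hypothetical
    VideoTADPoLe-style reward wrapper returning per-frame rewards."""
    settings = product([8, 16],          # context window size
                       [4, 8],           # stride
                       [500, 700, 800])  # noise level
    results = {}
    for window, stride, noise in settings:
        expert_r = reward_fn(expert_video, prompt, context_window=window,
                             stride=stride, noise_level=noise)
        poor_r = reward_fn(poor_video, prompt, context_window=window,
                           stride=stride, noise_level=noise)
        # Delta reward: positive values mean the model correctly prefers
        # the expert, text-aligned behavior over the poor one.
        results[(window, stride, noise)] = float(np.mean(expert_r) -
                                                 np.mean(poor_r))
    return results
```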
DeepMind Control Dog Walking
Expert Video
Poor Video
"a [D] dog walking"
Subject Customization
Context Window Size: 8; Stride: 4; Noise Level: 700
"a [D] dog walking"
Subject Customization
Context Window Size: 16; Stride: 8; Noise Level: 800
Expert Video
Poor Video
"a dog walking"
Probabilistic Adaptation
Context Window Size: 8; Stride: 4; Noise Level: 700
"a dog walking"
Inverse Probabilistic Adaptation
Context Window Size: 8; Stride: 4; Noise Level: 500
MetaWorld Button Press
Expert Video
Poor Video
"a robot arm pushing a button"/"button press"
Inverse Probabilistic Adaptation
Context Window Size: 8; Stride: 4; Noise Level: 700
"a robot arm pushing a button"/"button press"
Inverse Probabilistic Adaptation
Context Window Size: 16; Stride: 8; Noise Level: 800
To understand how subject customization might behave differently under varying environment dynamics in downstream tasks, we design two Humanoid Walking tasks with different gravity constants, and evaluate the video models adapted with the same DreamBooth LoRA checkpoint through policy supervision. Below we show two policy rollouts achieved by subject customization under the different environment dynamics. Subject customization achieves superior performance in the environment with regular gravity; however, we also observe a performance gap when the gravity condition changes.
We hypothesize that this is because in subject customization, the model adapts to the environment's visual characteristics but preserves the motion prior from the large-scale pretrained video model. Such large-scale motion priors may be generally applicable across environments, such as under normal gravity conditions for the Humanoid/Dog, but may not be as robust if the environment exhibits dynamics that are severely out-of-distribution relative to the pretraining data of the large-scale video model (such as floating).
Regular Gravity
Average Return: 174.7 ± 42.7
"a [D] action figure walking"
Less Gravity
Average Return: 87.2 ± 19.3
"a [D] action figure walking"
Below, we visualize failure cases of applying probabilistic adaptation for visual planning. We note that in both cases shown below, the tasks were unseen during adaptation, thus testing novel text-conditioned generalization after adaptation. A planning horizon of 8 was used to visualize how well the inverse dynamics model enables the agent to follow imagined plans (with a planning horizon of 1, the plan visualization would essentially reduce to individual frames, since only the first frame is followed).
We first notice that the agent still follows the visual plans to a reasonable degree, suggesting the inverse dynamics model is not the main culprit for failure. Furthermore, the visual quality of the plans after adaptation appears acceptable, and the dynamics appear reasonably similar to in-domain dynamics. Rather, in many failed tasks, the video plans simply do not perform the specified task (e.g. the robot arm appears to reach towards the net rather than go low towards the soccer ball). This suggests that the bottleneck may not be the inverse dynamics or the in-domain motion modeling, but rather the learned text-motion alignment of the backbone model; additional improvements in downstream robotic performance may thus follow directly from further advances in text-to-video modeling.
"a robot arm pushing a soccer ball into the net"/"soccer"
Probabilistic Adaptation
"a robot arm opening a drawer"/"drawer open"
Probabilistic Adaptation