Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
Google DeepMind, The University of Tokyo, Stanford University
Overview
Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often producing unrealistic movements and frequent violations of real-world physics. One solution, inspired by large language models, is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which self-improvement algorithms, can most effectively resolve movement misalignment and produce realistic object interactions? We begin by showing that offline RL-finetuning algorithms for text-to-video models can be derived as equivalent from a unified probabilistic objective. This perspective highlights that no method is algorithmically dominant in principle; what matters instead are the properties of the reward and the data. While human feedback is less scalable, vision-language models can recognize events in a video much as humans do. We therefore propose leveraging vision-language models to provide perceptual feedback specifically tailored to object dynamics in videos. Compared to popular video quality metrics measuring alignment or dynamics (e.g., CLIP scores, optical flow), our experiments demonstrate that binary AI feedback drives the most significant improvements in the quality of interaction scenes, as confirmed by AI, human, and quality-metric evaluations. Notably, we observe substantial gains when using signals from vision-language models, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of falling objects.
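For reference, one standard way to write such a unified probabilistic objective is KL-regularized reward maximization; the notation below is generic and not necessarily the exact formulation used in the paper, but both reward-weighted and preference-based finetuning can be derived from its closed-form optimum.

```latex
% KL-regularized reward maximization (generic notation): \pi_\theta is the
% finetuned text-to-video model, \pi_{\mathrm{ref}} the pre-trained model,
% r(x, c) the reward for video x given prompt c, and \beta > 0 a
% regularization strength.
\[
\max_{\theta}\;
  \mathbb{E}_{c \sim \mathcal{D},\, x \sim \pi_\theta(\cdot \mid c)}\!\left[ r(x, c) \right]
  \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid c) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid c) \right),
\qquad
\pi^{*}(x \mid c) \;\propto\; \pi_{\mathrm{ref}}(x \mid c)\, \exp\!\left( r(x, c) / \beta \right).
\]
```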
RL-Finetuning with AI Feedback
We investigate a recipe for improving dynamic object interactions in text-to-video models by leveraging external feedback. We first generate videos from the pre-trained models and then annotate the generated videos with AI feedback and reward labels. For the choice of feedback, we test metric-based feedback on semantics, human preference, and dynamics, and also propose leveraging binary feedback obtained from large-scale VLMs capable of video understanding (such as Gemini). These labeled data are then used for offline and iterative RL-finetuning.
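The sketch below illustrates this data-collection step, assuming hypothetical helpers `generate_video` (the text-to-video sampler), `query_vlm_feedback` (the VLM call, sketched in the next section), and `finetune_offline` (the RWR- or DPO-style update); the actual implementation details differ.

```python
# Sketch of the offline recipe: sample videos from the pre-trained model,
# label them with binary AI feedback, then finetune on the labeled data.
# generate_video, query_vlm_feedback, and finetune_offline are hypothetical
# stand-ins for the text-to-video sampler, the VLM feedback call, and the
# RL-finetuning step (RWR- or DPO-style).

def collect_feedback_dataset(model, vlm_client, prompts, samples_per_prompt=4):
    """Label self-generated videos with binary AI feedback (1 accept / 0 reject)."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            video = generate_video(model, prompt)
            reward = query_vlm_feedback(vlm_client, prompt, video)
            dataset.append({"prompt": prompt, "video": video, "reward": reward})
    return dataset

# Offline RL-finetuning is then one round of labeling followed by one update:
#   dataset = collect_feedback_dataset(pretrained_model, vlm_client, prompts)
#   finetuned_model = finetune_offline(pretrained_model, dataset)
```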
AI Feedback from Vision-Language Models
Feedback from humans is one of the most reliable ways to evaluate any generative model, but human evaluation is costly. A scalable alternative to subjective human evaluation is AI feedback, which has already proven successful in improving LLMs. Inspired by this, we propose employing VLMs capable of video understanding to provide AI feedback for text-to-video models.
We provide the textual prompt and the video as inputs and ask the VLM to evaluate the video in terms of overall coherence, physical accuracy, task completion, and the existence of inconsistencies. The VLM's feedback is a binary label: the video is accepted if it is coherent and the task is completed correctly, and rejected if it fails any of the evaluation criteria.
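The sketch below illustrates this binary query; `vlm_client.generate` stands in for a video-capable VLM API (e.g., a Gemini wrapper), and the evaluation prompt is paraphrased rather than the exact one used in our experiments.

```python
# Binary AI feedback from a video-capable VLM. The client call and the
# evaluation prompt below are illustrative placeholders.
EVAL_PROMPT = (
    "Prompt: {prompt}\n"
    "Evaluate the attached video for overall coherence, physical accuracy, "
    "task completion, and visual inconsistencies. "
    "Answer ACCEPT if the video is coherent and the task is completed "
    "correctly; otherwise answer REJECT."
)

def query_vlm_feedback(vlm_client, prompt: str, video) -> int:
    """Return a binary reward: 1 if the VLM accepts the video, else 0."""
    response = vlm_client.generate(
        text=EVAL_PROMPT.format(prompt=prompt),
        video=video,
    )
    return int("ACCEPT" in response.text.upper())
```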
To demonstrate the physical understanding of VLMs, we conduct preliminary evaluations: we ask Gemini to assess generated videos, such as "taking one body spray of many similar", and then analyze the feedback and rationale in the response. The VLM scores each generation individually (i.e., point-wise) and can recognize the scene correctly, such as the success or failure of grasping a bottle. In addition, under the prior that the real video is always preferable, we prepare pairs of real and generated videos and measure the accuracy of the AI feedback. We observe that VLMs classify the real video as preferable to the generated one 90.3% of the time, which supports that VLMs are capable enough to stand in for human supervision.
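A sketch of this real-vs-generated check is given below; how ties (both videos receiving the same point-wise label) are counted is an assumption of the sketch rather than a detail taken from the experiments.

```python
# Real-vs-generated sanity check: a pair counts as correct when the real
# video is accepted and the generated one is rejected. Treating ties as
# incorrect is an assumption of this sketch.
def real_vs_generated_accuracy(pairs, vlm_client):
    """pairs: iterable of (prompt, real_video, generated_video) tuples."""
    correct = 0
    total = 0
    for prompt, real_video, generated_video in pairs:
        real_label = query_vlm_feedback(vlm_client, prompt, real_video)
        gen_label = query_vlm_feedback(vlm_client, prompt, generated_video)
        correct += int(real_label > gen_label)
        total += 1
    return correct / max(total, 1)
```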
Challenging Object Movements
To characterize the hallucinated behavior of text-to-video models in dynamic scenes, we curate dynamic and challenging object movements from pairs of prompts and reference videos.
Object Removal: Moving something out of a container in the scene, or out of the camera frame itself. A transition to a new scene often induces out-of-place objects. For example, "taking a pen out of the box".
Multiple Objects: Object interactions involving multiple instances. In a dynamic scene, it is challenging to maintain the consistency of all the content. For example, "moving lego away from mouse".
Deformable Object: Manipulating deformable objects, such as cloth and paper. Realistic movement of non-rigid objects requires sufficient expressiveness and tests text-content alignment. For example, "twisting shirt wet until water comes out".
Directional Movement: Moving something in a specified direction. Text-to-video models can roughly follow the direction in the prompt, although they often fail to keep the objects in the scene consistent. For example, "pulling water bottle from left to right".
Falling Down: Making something fall. This category often requires dynamic motion along the depth dimension. For example, "putting marker pen onto plastic bottle so it falls off the table".
VLM / Human Preference / Quality Evaluation
We evaluate combinations of algorithms and rewards with VLM (Gemini/GPT), human preference, and VBench evaluations, where {algorithm}-{reward} denotes finetuning the text-to-video model to optimize {reward} with {algorithm}. Compared to other metric rewards popular in the video domain, AI feedback from Gemini (RWR-AIF and DPO-AIF) achieves the best quality as assessed by Gemini, GPT, and human raters, as well as on many VBench scores focusing on consistency and smoothness. RWR may achieve better quality on the train split than DPO while exhibiting over-fitting, with its performance on the test split degrading below that of the pre-trained model.
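For context, the two finetuning objectives compared here can be written in their generic (non-diffusion-specific) forms as follows; the notation is ours and the diffusion-specific implementations differ in detail. RWR reweights the log-likelihood of each individual video by its reward, whereas DPO contrasts a preferred video against a rejected one for each prompt.

```latex
% Generic forms (notation ours). \sigma is the logistic function and
% (x_w, x_l) a preferred/rejected pair of videos for prompt c.
\[
\mathcal{L}_{\mathrm{RWR}}(\theta)
  = -\,\mathbb{E}_{(c,\, x,\, r) \sim \mathcal{D}}
      \left[ \exp\!\left( r / \beta \right) \log \pi_\theta(x \mid c) \right]
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(c,\, x_w,\, x_l) \sim \mathcal{D}}
      \left[ \log \sigma\!\left(
          \beta \log \frac{\pi_\theta(x_w \mid c)}{\pi_{\mathrm{ref}}(x_w \mid c)}
        - \beta \log \frac{\pi_\theta(x_l \mid c)}{\pi_{\mathrm{ref}}(x_l \mid c)}
      \right) \right]
\]
```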
Iterative RL-finetuning can resolve the distribution shift issues of offline finetuning and leads to further alignment with preferable outputs. RWR-AIF continually improves its outputs by leveraging its own generations (52.66% → 56.41% → 58.83%), while DPO-AIF saturates (52.66% → 56.17% → 56.25%). This may be because, after one iteration of DPO, both videos in a pair become comparably good due to the overall improvement, making it hard to assign binary preferences correctly.
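A minimal sketch of the iterative variant, reusing the hypothetical helpers from the earlier sketches: each round regenerates videos with the current model, re-labels them with AI feedback, and finetunes again, so the training data tracks the model's own output distribution.

```python
# Iterative RL-finetuning (e.g., RWR-AIF over several rounds), reusing the
# hypothetical collect_feedback_dataset and finetune_offline helpers above.
def iterative_rl_finetune(model, vlm_client, prompts, num_rounds=3):
    for _ in range(num_rounds):
        dataset = collect_feedback_dataset(model, vlm_client, prompts)
        model = finetune_offline(model, dataset)
    return model
```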
Analysis on Generated Object Movement
Analysis of preferable generations per category, with AI evaluation (above) and absolute improvement over the pre-trained models (below). Text-to-video models generate high-quality videos for deformable object (DO) and directional movement (DM) prompts, while struggling with object removal (OR), multiple objects (MO), and falling down (FD). AI feedback notably increases preferable outputs for multiple objects and falling down.
Example Videos
Prompt: taking rose bud from bush
Pre-Trained
RL-Finetuned (AIF)
Prompt: taking a pen out of the book
Pre-Trained
RL-Finetuned (AIF)
Prompt: taking one body spray of many similar
Pre-Trained
RL-Finetuned (AIF)
Prompt: tearing receipt into two pieces
Pre-Trained
RL-Finetuned (AIF)
Prompt: pushing a bottle so that it falls off the table
Pre-Trained
RL-Finetuned (AIF)