AVID: Adapting Video Diffusion Models to World Models
Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma
We explore whether pretrained image-to-video diffusion models can be adapted into action-conditioned world models without access to the parameters of the pretrained model. AVID trains a lightweight adapter on a small action-labelled dataset. Given a sequence of actions, the AVID adapter guides the pretrained model towards an accurate generation.
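To make the adapter mechanism concrete, below is a minimal PyTorch sketch of one way such a combination can work. It assumes the adapter takes the noisy video latents, the frozen pretrained model's noise prediction, and the action sequence, and outputs its own noise prediction together with a per-pixel mask that blends the two. The class names, tensor shapes, and the toy convolutional architecture are illustrative placeholders, not the exact design used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVIDStyleAdapter(nn.Module):
    """Illustrative lightweight adapter; the real adapter is a small video
    diffusion network rather than this toy 3D conv stack."""

    def __init__(self, channels: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, hidden)
        self.net = nn.Sequential(
            # Inputs: noisy latents + pretrained noise prediction + action embedding.
            nn.Conv3d(2 * channels + hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            # Outputs: adapter noise prediction (channels) + mask logits (1 channel).
            nn.Conv3d(hidden, channels + 1, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latents, pretrained_eps, actions):
        # noisy_latents, pretrained_eps: (B, C, T, H, W); actions: (B, T, action_dim)
        b, c, t, h, w = noisy_latents.shape
        a = self.action_proj(actions).permute(0, 2, 1)       # (B, hidden, T)
        a = a[:, :, :, None, None].expand(-1, -1, -1, h, w)  # broadcast over space
        out = self.net(torch.cat([noisy_latents, pretrained_eps, a], dim=1))
        adapter_eps, mask_logits = out[:, :c], out[:, c:]
        mask = torch.sigmoid(mask_logits)                    # 1 = adapter, 0 = pretrained
        # Blend the two noise predictions with the learned mask.
        eps = mask * adapter_eps + (1.0 - mask) * pretrained_eps
        return eps, mask


def adapter_training_step(adapter, frozen_model, optimizer,
                          noisy_latents, timesteps, cond_frame, actions, true_noise):
    """Standard diffusion loss; only the adapter receives gradients.
    `frozen_model(noisy_latents, timesteps, cond_frame)` is a placeholder for
    whatever interface the pretrained image-to-video model exposes."""
    with torch.no_grad():
        pretrained_eps = frozen_model(noisy_latents, timesteps, cond_frame)
    eps, _ = adapter(noisy_latents, pretrained_eps, actions)
    loss = F.mse_loss(eps, true_noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, the same blended noise prediction is plugged into the usual diffusion sampler, so the pretrained model's outputs are still used but its weights are never updated.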
RT1 Experiments
For our experiments with the RT1 dataset, we use DynamiCrafter as the pretrained base model. Below, we compare AVID against an action-conditioned diffusion model trained from scratch with the same number of parameters and the same compute budget (28 GPU days). We see that AVID maintains much better consistency with the conditioning image.
Ground Truth
AVID 145M
Action-Cond. Diffusion 145M
By conditioning on different actions, AVID can generate alternative videos given the same initial frame:
Forward
Upwards
Right
The mask generated by AVID shows that the pretrained model is predominantly used to maintain background consistency, while the adapter is mostly responsible for generating the correct motion:
AVID Generation
AVID Mask
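The mask videos above can be rendered with a few lines of NumPy/PIL once the per-pixel mask has been extracted from the adapter. The snippet below assumes the mask is exported as an array of shape (T, H, W) with values in [0, 1]; that format is an assumption for illustration, not something specified on this page.

```python
import numpy as np
from PIL import Image


def save_mask_frames(mask, out_prefix="avid_mask"):
    """Save each frame of the adapter mask as a grayscale image.
    White regions (mask near 1) are where the adapter's prediction dominates
    (typically the moving content); black regions (mask near 0) are where the
    pretrained model's prediction dominates (typically the static background)."""
    mask = np.clip(np.asarray(mask, dtype=np.float32), 0.0, 1.0)  # (T, H, W) in [0, 1]
    for t, frame in enumerate(mask):
        Image.fromarray((frame * 255).astype(np.uint8)).save(f"{out_prefix}_{t:03d}.png")
```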
We found that finetuning the original pretrained model achieves the strongest performance, so this is the best approach if the model parameters are available. Below are examples of videos generated by finetuning DynamiCrafter (using a larger computational budget than AVID):
Generated Videos (DynamiCrafter 1.4B Finetune 100 GPU Days)
Ground Truth
Procgen Experiments
For the pretrained image-to-video model, we train a diffusion model on videos from Procgen, excluding one of the games, Coinrun. Examples from the pretraining dataset are shown below:
We then use AVID to adapt this pretrained diffusion model to generate action-conditioned videos for Coinrun:
Ground Truth
AVID
Action-Cond. Diffusion
For quantitative results, please see the paper!