AVID: Adapting Video Diffusion Models to World Models
Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma
We explore whether pretrained image-to-video diffusion models can be adapted into action-conditioned world models without access to the parameters of the pretrained model. AVID trains a lightweight adapter on a small action-labelled dataset. Given a sequence of actions, the AVID adapter guides the pretrained model towards an accurate generation.
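To make the adapter mechanism concrete, below is a minimal PyTorch sketch of one way such a combination can work. It assumes the adapter takes the noisy video latents, the frozen pretrained model's noise prediction, and the action sequence, and outputs its own noise prediction together with a per-pixel mask that blends the two. The class names, tensor shapes, and the toy convolutional architecture are illustrative placeholders, not the exact design used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVIDStyleAdapter(nn.Module):
    """Illustrative lightweight adapter; the real adapter is a small video
    diffusion network rather than this toy 3D conv stack."""

    def __init__(self, channels: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, hidden)
        self.net = nn.Sequential(
            # Inputs: noisy latents + pretrained noise prediction + action embedding.
            nn.Conv3d(2 * channels + hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            # Outputs: adapter noise prediction (channels) + mask logits (1 channel).
            nn.Conv3d(hidden, channels + 1, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latents, pretrained_eps, actions):
        # noisy_latents, pretrained_eps: (B, C, T, H, W); actions: (B, T, action_dim)
        b, c, t, h, w = noisy_latents.shape
        a = self.action_proj(actions).permute(0, 2, 1)       # (B, hidden, T)
        a = a[:, :, :, None, None].expand(-1, -1, -1, h, w)  # broadcast over space
        out = self.net(torch.cat([noisy_latents, pretrained_eps, a], dim=1))
        adapter_eps, mask_logits = out[:, :c], out[:, c:]
        mask = torch.sigmoid(mask_logits)                    # 1 = adapter, 0 = pretrained
        # Blend the two noise predictions with the learned mask.
        eps = mask * adapter_eps + (1.0 - mask) * pretrained_eps
        return eps, mask


def adapter_training_step(adapter, frozen_model, optimizer,
                          noisy_latents, timesteps, cond_frame, actions, true_noise):
    """Standard diffusion loss; only the adapter receives gradients.
    `frozen_model(noisy_latents, timesteps, cond_frame)` is a placeholder for
    whatever interface the pretrained image-to-video model exposes."""
    with torch.no_grad():
        pretrained_eps = frozen_model(noisy_latents, timesteps, cond_frame)
    eps, _ = adapter(noisy_latents, pretrained_eps, actions)
    loss = F.mse_loss(eps, true_noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, the same blended noise prediction is plugged into the usual diffusion sampler, so the pretrained model's outputs are still used but its weights are never updated.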
RT1 Experiments
For our experiments with the RT1 dataset, we use DynamiCrafter as the pretrained base model. Below, we compare AVID against an action-conditioned diffusion model trained from scratch with the same number of parameters and the same compute budget (28 GPU days). We see that AVID maintains much better consistency with the conditioning image.
Ground Truth
AVID 145M
Action-Cond. Diffusion 145M
By conditioning on different actions, AVID can generate alternative videos given the same initial frame:
Forward
Upwards
Right
The mask generated by AVID shows that the pretrained model is predominantly used to maintain background consistency, while the adapter is mostly responsible for generating the correct motion:
AVID Generation
AVID Mask
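The mask videos above can be rendered with a few lines of NumPy/PIL once the per-pixel mask has been extracted from the adapter. The snippet below assumes the mask is exported as an array of shape (T, H, W) with values in [0, 1]; that format is an assumption for illustration, not something specified on this page.

```python
import numpy as np
from PIL import Image


def save_mask_frames(mask, out_prefix="avid_mask"):
    """Save each frame of the adapter mask as a grayscale image.
    White regions (mask near 1) are where the adapter's prediction dominates
    (typically the moving content); black regions (mask near 0) are where the
    pretrained model's prediction dominates (typically the static background)."""
    mask = np.clip(np.asarray(mask, dtype=np.float32), 0.0, 1.0)  # (T, H, W) in [0, 1]
    for t, frame in enumerate(mask):
        Image.fromarray((frame * 255).astype(np.uint8)).save(f"{out_prefix}_{t:03d}.png")
```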
We found that finetuning the original pretrained model achieves the strongest performance, so this is the best approach if the model parameters are available. Below are examples of videos generated by finetuning DynamiCrafter (using a larger computational budget than AVID):
Generated Videos (DynamiCrafter 1.4B Finetune 100 GPU Days)
Ground Truth
Procgen Experiments
For the pretrained image-to-video model, we train a diffusion model on videos from Procgen, excluding one of the games, Coinrun. Examples from the pretraining dataset are shown below:
We then use AVID to adapt this pretrained diffusion model to generate action-conditioned videos for Coinrun:
Ground Truth
AVID
Action-Cond. Diffusion
For quantitative results, please see the paper!