Diffusion Models for Video Prediction and Infilling

Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi

[Paper] [Code]

Introduction

Predicting future outcomes and reasoning about missing information in a sequence are key abilities for agents that need to make intelligent decisions. Both require strong, temporally coherent generative capabilities. Diffusion models have recently shown great success in several generative tasks, but have not been extensively explored in the video domain. We therefore present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions and introduces a new conditioning technique during training: the frames to condition on are simply left un-noised, and which frames are conditioned on is randomized across training steps. This lets us use the same architecture as for unconditional training and enables us to train the model conditionally and unconditionally at the same time. We achieve state-of-the-art results in video prediction and competitive results in unconditional video generation.
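The conditioning scheme fits in a few lines. Below is a minimal sketch of one training step, assuming a standard DDPM noise schedule and a denoiser operating on (B, C, T, H, W) video tensors; the names ramvid_training_step, alphas_cumprod, and p_uncond are illustrative and are not taken from the released code.

```python
import torch

def ramvid_training_step(model, x0, alphas_cumprod, p_uncond=0.25):
    """One RaMViD-style training step (minimal sketch, not the exact paper code).

    x0:              clean video batch, shape (B, C, T, H, W)
    alphas_cumprod:  DDPM noise schedule, shape (num_timesteps,)
    """
    B, C, T, H, W = x0.shape

    # Randomly pick the conditioning frames. With probability p_uncond we
    # condition on nothing, so the model also learns unconditional generation.
    mask = torch.zeros(T, dtype=torch.bool)              # True = conditioning frame
    if torch.rand(()).item() >= p_uncond:
        k = int(torch.randint(1, T, ()))                 # how many frames to condition on
        mask[torch.randperm(T)[:k]] = True               # their (random) positions

    # Standard DDPM forward process, applied only to the non-conditioning frames.
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    x_t[:, :, mask] = x0[:, :, mask]                     # conditioning frames stay clean

    # The loss is computed only on the frames the model has to generate.
    pred = model(x_t, t)
    loss = ((pred - noise)[:, :, ~mask] ** 2).mean()
    return loss
```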


Experiments

We evaluate our model on three widely used video benchmark datasets: BAIR to evaluate predictive performance, Kinetics-600 to test a range of video completion tasks, and UCF-101 to evaluate whether the model can also perform unconditional video generation.

BAIR

On the BAIR dataset we follow the widely used evaluation protocol: condition on one frame and predict the next 15. This tests the generative ability of our model. We trained for 250,000 iterations on 8 GPUs, which took about three days.
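At sampling time, conditioning works by holding the observed frames fixed at their clean values throughout the reverse diffusion process. Below is a minimal sketch for this BAIR setting (one observed frame, 15 generated frames), where p_sample stands in for a generic DDPM reverse step and is an assumed helper, not a function from our repository.

```python
import torch

@torch.no_grad()
def predict_video(model, cond_frame, p_sample, num_timesteps=1000,
                  num_frames=16, shape=(3, 64, 64)):
    """Sample a 16-frame video conditioned on a single observed frame
    (minimal sketch; `p_sample` is an assumed DDPM reverse-step helper).

    cond_frame: the observed frame, shape (B, C, H, W).
    """
    B = cond_frame.shape[0]
    C, H, W = shape
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[0] = True                                   # condition on the first frame only

    # Start from pure noise and run the reverse diffusion process.
    x = torch.randn(B, C, num_frames, H, W)
    for t in reversed(range(num_timesteps)):
        # Conditioning frames are never noised: overwrite them with the observation.
        x[:, :, mask] = cond_frame.unsqueeze(2)
        t_batch = torch.full((B,), t, dtype=torch.long)
        x = p_sample(model, x, t_batch)              # one reverse diffusion step
    x[:, :, mask] = cond_frame.unsqueeze(2)
    return x                                         # frame 0 observed, frames 1-15 generated
```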

Conditional Frame

Generated Video

Kinetics-600

We also train a larger model on the Kinetics-600 dataset. In the following experiments, we show that the same model can perform several tasks: video prediction, infilling, and completion. This model was trained on 8 GPUs for 500,000 steps (about 9 days).

Prediction

First, we test our model on prediction. We condition on the first 5 frames of a video to predict the next 11 frames. Below you can find several examples.

Frame 0

Frame 1

Frame 2

Frame 3

Frame 4

Generated Video

Further Examples

Infilling

For infilling, we condition on the first two and the last two frames. That is, the model observes the beginning and the end of the motion and has to complete the action in the 12 missing frames.

Frame 0

Frame 1

Frame 14

Frame 15

Generated Video

Further Examples

Video Completion

We again generate a sequence of 16 frames, but now condition on frames 0, 5, 10, and 15. In this case the motion is not given directly; the model has to infer it from the static frames.
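All three Kinetics-600 tasks shown here (prediction, infilling, and completion) use the same model and the same sampling loop; only the set of conditioned frame indices changes. A small illustration with a hypothetical make_mask helper:

```python
import torch

NUM_FRAMES = 16

def make_mask(cond_idx, num_frames=NUM_FRAMES):
    """Boolean mask with True at the conditioning frames."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[list(cond_idx)] = True
    return mask

prediction_mask = make_mask(range(5))          # condition on frames 0-4, generate 5-15
infilling_mask  = make_mask([0, 1, 14, 15])    # condition on the first and last two frames
completion_mask = make_mask([0, 5, 10, 15])    # condition on evenly spaced key frames
```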

Frame 0

Frame 5

Frame 10

Frame 15

Generated Video

Further Examples

UCF-101

By increasing the unconditional training rate, we can use the model not only for video completion but also for unconditional video generation. We trained this model on the UCF-101 dataset for 450,000 steps on 8 GPUs, which took about 9 days.
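As a sketch, reusing the hypothetical ramvid_training_step and mask convention from above (the rate 0.5 is illustrative only, not the exact value used for this run):

```python
import torch

# Hypothetical illustration: raise the unconditional rate during training so the
# model sees sequences with no conditioning frames more often.
loss = ramvid_training_step(model, x0, alphas_cumprod, p_uncond=0.5)

# At sampling time an all-False mask means no frame is held fixed, so the same
# reverse-diffusion loop generates the entire 16-frame video from pure noise.
unconditional_mask = torch.zeros(16, dtype=torch.bool)
```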

Unconditional Video Generation

The unconditional samples of our model lack some clarity. The background is generated realistically; however, moving objects are often unrealistically deformed.

Citation

@misc{https://doi.org/10.48550/arxiv.2206.07696,
  doi       = {10.48550/ARXIV.2206.07696},
  url       = {https://arxiv.org/abs/2206.07696},
  author    = {Höppe, Tobias and Mehrjou, Arash and Bauer, Stefan and Nielsen, Didrik and Dittadi, Andrea},
  title     = {Diffusion Models for Video Prediction and Infilling},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}