In recent years, the task of video prediction—forecasting future video given past
video frames—has attracted attention in the research community. In this paper
we propose a novel approach to this problem with Vector Quantized Variational
AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into
a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this
compressed latent space has dramatically reduced dimensionality, allowing us to
apply scalable autoregressive generative models to predict video. In contrast to
previous work that has largely emphasized highly constrained datasets, we focus
on very diverse, large-scale datasets such as Kinetics-600. To our knowledge, we predict unconstrained
video at a higher resolution, 256×256, than any previous method.
We further validate our approach against prior work via a crowdsourced human
evaluation.
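As an illustration of the core idea only (the sizes and code below are hypothetical and not drawn from the paper's architecture), vector quantization in a VQ-VAE replaces each continuous encoder output with its nearest entry in a learned codebook, so a video frame becomes a small grid of discrete indices that an autoregressive prior can model instead of raw pixels:

```python
import numpy as np

# Hypothetical sizes, not taken from the paper: a codebook of K embeddings
# of dimension D, and an H x W grid of latent vectors from the encoder.
K, D, H, W = 512, 64, 16, 16
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))        # learned embedding table e_1..e_K
encoder_out = rng.normal(size=(H, W, D))  # continuous encoder outputs z_e(x)

# Quantization: map each latent vector to its nearest codebook entry.
flat = encoder_out.reshape(-1, D)                              # (H*W, D)
d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
codes = d2.argmin(axis=1)                                      # discrete indices
quantized = codebook[codes].reshape(H, W, D)                   # z_q(x)

# The (H, W) grid of integer codes is the compressed representation: far
# fewer positions than pixels, each a symbol in {0, ..., K-1}.
print(codes.reshape(H, W).shape)
```

The dimensionality reduction is what makes scalable autoregressive prediction feasible: the prior generates a short sequence of discrete symbols per frame rather than hundreds of thousands of pixel values.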
Conditioning videos are licensed under Creative Commons BY 2.0:
https://creativecommons.org/licenses/by/2.0/
All videos have been cropped and modified by our algorithm.
“Ocean Beach Surfing” By dakine kane: https://www.flickr.com/photos/61112791@N00/8361264621
“The Cable Car” By Aaron Tait: https://www.flickr.com/photos/96272984@N00/2442344223
“dave skiing 2” By Leigh Blackall: https://www.flickr.com/photos/97283472@N00/3072947082
Selected videos were sampled with a temperature of 1.0; random videos were sampled with a temperature of 1.2.