In recent years, the task of video prediction—forecasting future video given past
video frames—has attracted attention in the research community. In this paper
we propose a novel approach to this problem with Vector Quantized Variational
AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into
a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this
compressed latent space has dramatically reduced dimensionality, allowing us to
apply scalable autoregressive generative models to predict video. In contrast to
previous work that has largely emphasized highly constrained datasets, we focus
on very diverse, large-scale datasets such as Kinetics-600. To our knowledge, we predict unconstrained
video at a higher resolution, 256×256, than any previous method.
We further validate our approach against prior work via a crowdsourced human
evaluation.
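As an illustration of the core idea only (the sizes and code below are hypothetical and not drawn from the paper's architecture), vector quantization in a VQ-VAE replaces each continuous encoder output with its nearest entry in a learned codebook, so a video frame becomes a small grid of discrete indices that an autoregressive prior can model instead of raw pixels:

```python
import numpy as np

# Hypothetical sizes, not taken from the paper: a codebook of K embeddings
# of dimension D, and an H x W grid of latent vectors from the encoder.
K, D, H, W = 512, 64, 16, 16
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))        # learned embedding table e_1..e_K
encoder_out = rng.normal(size=(H, W, D))  # continuous encoder outputs z_e(x)

# Quantization: map each latent vector to its nearest codebook entry.
flat = encoder_out.reshape(-1, D)                              # (H*W, D)
d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
codes = d2.argmin(axis=1)                                      # discrete indices
quantized = codebook[codes].reshape(H, W, D)                   # z_q(x)

# The (H, W) grid of integer codes is the compressed representation: far
# fewer positions than pixels, each a symbol in {0, ..., K-1}.
print(codes.reshape(H, W).shape)
```

The dimensionality reduction is what makes scalable autoregressive prediction feasible: the prior generates a short sequence of discrete symbols per frame rather than hundreds of thousands of pixel values.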
Conditioning videos are licensed under Creative Commons BY 2.0:
https://creativecommons.org/licenses/by/2.0/
All videos have been cropped and modified by our algorithm.
“Ocean Beach Surfing” By dakine kane: https://www.flickr.com/photos/61112791@N00/8361264621
“The Cable Car” By Aaron Tait: https://www.flickr.com/photos/96272984@N00/2442344223
“dave skiing 2” By Leigh Blackall: https://www.flickr.com/photos/97283472@N00/3072947082
Selected videos were sampled with a temperature of 1.0; random videos were sampled with a temperature of 1.2.