Improved Conditional VRNNs for Video Prediction


Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches rely on probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future sequences, they tend to produce blurry predictions. In this work, we argue that this blurriness is a sign of underfitting the data, and we propose to mitigate it by increasing the expressiveness of the latent distributions and by using higher-capacity likelihood models. Our model leverages a hierarchy of latent variables that defines a family of prior and posterior distributions with complex covariance structures, allowing it to better capture future outcomes. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our approach performs favorably under several metrics on three different datasets.
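To make the idea of a latent hierarchy more concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the linear parameterization, the function names (`kl_diag_gaussians`, `sample_hierarchy`), and the dimensions are all assumptions chosen for illustration. It shows only the generic mechanism, in which each level is a diagonal Gaussian whose parameters depend on the sample drawn at the level above, so that the joint distribution over all levels is no longer a simple diagonal Gaussian.

```python
import numpy as np


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )


def sample_hierarchy(weights, context, rng):
    """Ancestral sampling through a stack of latent levels.

    Hypothetical toy parameterization: at each level, the mean and
    log-variance are linear functions of the context vector concatenated
    with the sample from the level above (zeros for the top level).
    """
    latent_dim = weights[0][0].shape[1]
    z_above = np.zeros(latent_dim)
    samples, params = [], []
    for w_mu, w_logvar in weights:
        inputs = np.concatenate([context, z_above])
        mu = inputs @ w_mu
        logvar = inputs @ w_logvar
        # Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(latent_dim)
        samples.append(z)
        params.append((mu, logvar))
        z_above = z
    return samples, params


# Toy usage: 3 levels, an 8-dimensional context, 4-dimensional latents.
rng = np.random.default_rng(0)
context_dim, latent_dim, num_levels = 8, 4, 3
weights = [
    (
        0.1 * rng.standard_normal((context_dim + latent_dim, latent_dim)),
        0.1 * rng.standard_normal((context_dim + latent_dim, latent_dim)),
    )
    for _ in range(num_levels)
]
context = rng.standard_normal(context_dim)
samples, params = sample_hierarchy(weights, context, rng)
```

In a variational model of this kind, a prior network and a posterior network would each produce such per-level parameters, and the KL term of the training objective would be accumulated level by level with a function like `kl_diag_gaussians`.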

Read the paper

Download the code

Predictions on Cityscapes