Stochastic Video Generation with a Learned Prior

Emily Denton and Rob Fergus

Department of Computer Science, Courant Institute, New York University


Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of outcomes, or yield blurry generations, or both. In this paper we introduce an unsupervised video generation model that learns a prior model of uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple and easily trained end-to-end on a variety of datasets. Sample generations are both varied and sharp, even many frames into the future, and compare favorably to those from existing approaches.


arXiv preprint: 1802.07687, 2018



We propose a new stochastic video generation (SVG) model that combines a deterministic frame predictor with time-dependent stochastic latent variables. The model comes in two variants: one with a fixed prior over the latent variables (SVG-FP) and another with a learned prior (SVG-LP).
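
The difference between the two variants can be illustrated with a toy generation loop. The sketch below is not the authors' implementation: frames are reduced to a single scalar position, and `toy_learned_prior` and `toy_frame_predictor` are hypothetical hand-written stand-ins for what would be trained networks. It only shows the control flow: at each time step a latent variable z_t is sampled from the prior, which is either fixed or conditioned on past frames, and is then passed to a deterministic predictor.

```python
import random

def fixed_prior(history):
    """SVG-FP style prior: a standard Gaussian that ignores past frames."""
    return 0.0, 1.0  # (mean, std)

def toy_learned_prior(history):
    """Hypothetical stand-in for a learned prior (not the paper's network).

    The std depends on past frames: it is large only when the last
    "frame" (a scalar position in [0, 1]) is close to a wall.
    """
    last = history[-1]
    near_wall = min(last, 1.0 - last) < 0.1
    return 0.0, (1.0 if near_wall else 0.05)

def toy_frame_predictor(history, z):
    """Hypothetical deterministic predictor: next frame from past frames and z_t."""
    velocity = history[-1] - history[-2]
    nxt = history[-1] + velocity + 0.1 * z
    return min(1.0, max(0.0, nxt))  # keep the position inside [0, 1]

def generate(context, steps, prior, rng):
    """Roll forward: sample z_t from the prior, then predict frame x_t."""
    frames = list(context)
    for _ in range(steps):
        mean, std = prior(frames)
        z = rng.gauss(mean, std)
        frames.append(toy_frame_predictor(frames, z))
    return frames[len(context):]

if __name__ == "__main__":
    rng = random.Random(0)
    print(generate([0.4, 0.5], 5, fixed_prior, rng))
    print(generate([0.4, 0.5], 5, toy_learned_prior, rng))
```

Passing `fixed_prior` injects the same amount of noise at every step, whereas the history-dependent prior can concentrate uncertainty at the moments where the future is genuinely ambiguous.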

SVG-LP generations:

We now show some generations from our SVG-LP model. From left to right, each column shows:

  • Ground truth sequence of video frames from a test video
  • A video generated by running the approximate posterior inference network forward to sample the latent variables at each time step. Note that this column does not show a true generated video, since the inference network has access to the future when inferring the latent variables.
  • Best SSIM was constructed by sampling 100 sequences and picking the sequence with the highest Structural Similarity (SSIM) with respect to the ground truth sequence.
  • Random sample 1-3 shows three randomly generated sequences. Note the variability in arm movement between the different samples.
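
The "Best SSIM" selection described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the videos: `global_ssim` computes SSIM over the whole frame as a single window (standard SSIM implementations average over local windows), frames are flat lists of intensities in [0, 1], and a sequence is scored by the mean of its per-frame SSIM values, which is an assumption about how the per-sequence score is aggregated. All function names are hypothetical.

```python
def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two frames given as flat lists of floats.

    Simplification: standard SSIM averages this statistic over local windows.
    """
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # usual SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def sequence_ssim(sequence, truth):
    """Mean per-frame SSIM of a generated sequence against the ground truth."""
    scores = [global_ssim(f, g) for f, g in zip(sequence, truth)]
    return sum(scores) / len(scores)

def best_ssim_sample(samples, truth):
    """Return (index, score) of the sampled sequence closest to the ground truth."""
    scores = [sequence_ssim(s, truth) for s in samples]
    best = max(range(len(samples)), key=scores.__getitem__)
    return best, scores[best]
```

With 100 sampled sequences in `samples`, `best_ssim_sample(samples, truth)` returns the index of the sequence that would be shown in the Best SSIM column under these assumptions.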

BAIR robot pushing dataset (action free):

The BAIR robot pushing dataset (Ebert et al., 2017) contains videos of a Sawyer robotic arm pushing a variety of objects around a table top. The movements of the arm are highly stochastic, providing a good test for our model. Although the dataset does contain actions given to the arm, we discard them during training and make frame predictions based solely on the video input.

Stochastic Moving MNIST (SM-MNIST)

Stochastic Moving MNIST (SM-MNIST) is a dataset consisting of sequences of frames of size 64×64, containing one or two MNIST digits moving and bouncing off the edges of the frame (walls). In the original Moving MNIST dataset (Srivastava et al., 2015) the digits move with constant velocity and bounce off the walls in a deterministic manner. By contrast, SM-MNIST digits move with a constant velocity along a trajectory until they hit a wall, at which point they bounce off with a random speed and direction. This dataset thus contains segments of deterministic motion interspersed with moments of uncertainty, i.e. each time a digit hits a wall.
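
The motion rule described above can be sketched for a single digit as follows. This is an illustrative simulation, not the dataset's actual generation script: the 28×28 digit size, the speed range, and the function name `simulate_trajectory` are assumptions, and only the top-left corner of the digit is tracked rather than rendered frames.

```python
import math
import random

def simulate_trajectory(steps, frame=64, digit=28,
                        min_speed=1.0, max_speed=4.0, rng=None):
    """Simulate the top-left corner of one digit in a frame x frame image.

    Returns (path, bounces): the positions per step and the steps at
    which a wall was hit. Digit size and speed range are assumed values.
    """
    rng = rng or random.Random()
    limit = float(frame - digit)  # largest valid corner coordinate

    def random_velocity():
        speed = rng.uniform(min_speed, max_speed)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        return speed * math.cos(angle), speed * math.sin(angle)

    x, y = rng.uniform(0.0, limit), rng.uniform(0.0, limit)
    vx, vy = random_velocity()
    path, bounces = [], []
    for t in range(steps):
        nx, ny = x + vx, y + vy
        if nx < 0.0 or nx > limit or ny < 0.0 or ny > limit:
            # Wall hit: draw a new random speed and direction, then force
            # the velocity to point back into the frame.
            new_vx, new_vy = random_velocity()
            if nx < 0.0:
                new_vx = abs(new_vx)
            elif nx > limit:
                new_vx = -abs(new_vx)
            if ny < 0.0:
                new_vy = abs(new_vy)
            elif ny > limit:
                new_vy = -abs(new_vy)
            vx, vy = new_vx, new_vy
            nx = min(max(nx, 0.0), limit)  # clamp the position onto the wall
            ny = min(max(ny, 0.0), limit)
            bounces.append(t)
        x, y = nx, ny
        path.append((x, y))
    return path, bounces
```

Between two entries of `bounces` the displacement per step is constant, which corresponds to the deterministic segments; the velocity is resampled only at the bounce steps, which are the moments of uncertainty.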

Two digit SM-MNIST generations:

Single digit SM-MNIST generations: