Generating 100 frames into the future with 3 input frames

Network Architecture

VideoFlow consists of an auto-regressive latent dynamic prior that models the latent variables at time-step t as a Gaussian distribution whose mean and standard deviation are functions of:

1) Latent variables at higher levels

2) Latent variables at previous time-steps
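One way to write this factorization down is sketched below. The notation is an assumption made for illustration: z_t^{(l)} stands for the latent at level l and time-step t, L for the number of levels, z_{<t}^{(l)} for the latents at the same level at previous time-steps, z_t^{(>l)} for the latents at higher levels at the current time-step, and \mu_\theta^{(l)}, \sigma_\theta^{(l)} for the outputs of a network that takes those two sets of latents as input.

```latex
p\left(z_t \mid z_{<t}\right)
  = \prod_{l=1}^{L} p\left(z_t^{(l)} \mid z_{<t}^{(l)},\, z_t^{(>l)}\right),
\qquad
p\left(z_t^{(l)} \mid z_{<t}^{(l)},\, z_t^{(>l)}\right)
  = \mathcal{N}\!\left(z_t^{(l)};\ \mu_\theta^{(l)},\ \big(\sigma_\theta^{(l)}\big)^{2}\right)
```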

Below are network illustrations for a setup where we model the frame at step t as a function of two previous steps.


Modelling X_t as a function of X_{t-1} and X_{t-2}


Generating X_t as a function of X_{t-1} and X_{t-2}

3-D Residual Network Architecture

Left: Network that predicts the latent Gaussian at time-step t as a function of the latents at previous steps and higher levels. Right: The detailed components of the 3-D residual network marked in the left figure.

Generating different outcomes with the same input and different noise

We demonstrate diversity in outcomes by providing the same input and displaying different outcomes sampled at T=0.6.
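As a rough illustration of how one input can yield several outcomes, the sketch below draws from a Gaussian whose standard deviation is scaled by a temperature T. It assumes that T multiplies the standard deviation of the latent Gaussian; the function name `sample_latent` and the toy mean and standard deviation values are hypothetical and do not come from the VideoFlow code.

```python
import random


def sample_latent(mean, std, temperature, rng):
    """Draw one value per dimension from N(mean, (temperature * std)^2).

    Assumption: temperature scales the standard deviation of the Gaussian.
    """
    return [m + temperature * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Toy (hypothetical) prior parameters for a 3-dimensional latent.
mean = [0.0, 1.0, -0.5]
std = [1.0, 0.5, 2.0]
rng = random.Random(0)

# Same conditioning (same mean and std), different noise: two different outcomes.
sample_a = sample_latent(mean, std, 0.6, rng)
sample_b = sample_latent(mean, std, 0.6, rng)

# With temperature 0 the noise term vanishes and the sample equals the mean.
mode = sample_latent(mean, std, 0.0, rng)
```

Under this assumption, lower temperatures keep samples closer to the mean of the prior, and higher temperatures give more varied samples.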

Effect of temperature on the BAIR robot pushing dataset

A blue border represents conditioning frames and a red border represents generated frames.

We sample 100 videos from our model at different temperatures, compute the cosine similarity between each sample and the ground truth using features from a VGG network pre-trained on ImageNet, and display the best results according to this metric for different conditioning frames.
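The selection step can be sketched as follows. This is a minimal illustration, not the evaluation code: it assumes each video has already been turned into a flat feature vector (on this page, by a VGG network), and the names `cosine_similarity` and `rank_samples`, along with the toy vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_samples(sample_features, truth_features):
    """Return sample indices sorted from most to least similar, plus the scores."""
    scores = [cosine_similarity(f, truth_features) for f in sample_features]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy stand-ins for feature vectors of the ground truth and three samples.
truth = [1.0, 0.0, 1.0]
samples = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

order, scores = rank_samples(samples, truth)
best, worst = order[0], order[-1]  # indices of the best and worst sample
```

The first index in `order` picks out the best sample, and the last index picks out the worst one.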

T = 0.7

T = 1.0

T = 0.9

T = 0.5

T = 0.3

T = 0.1

We follow the same procedure as above but instead display our worst results according to the VGG metric. The generated videos remain temporally consistent and stay on the image manifold.

Video Modelling on the Stochastic Shapes dataset

Interpolations in Latent Space - BAIR robot pushing

We encode the first and last frames of a video from the test set into two multi-level latent representations using our trained VideoFlow encoder. We then interpolate between the two representations in three ways: at all levels, at the top level only, and at the bottom level only. The bottom level interpolates the motion of smaller-scale objects, while the top level interpolates the position of the arm.
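A minimal sketch of level-wise interpolation is given below. It makes several assumptions that this page does not confirm: the interpolation is linear, each multi-level representation is a list of flat vectors ordered from bottom (index 0) to top (last index), and levels that are not interpolated keep the first frame's values. The names `lerp` and `interpolate_levels` are hypothetical.

```python
def lerp(a, b, alpha):
    """Linear interpolation between two equal-length vectors."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def interpolate_levels(z_first, z_last, alpha, levels):
    """Interpolate only the chosen levels of a multi-level latent.

    Assumption: levels not listed in `levels` keep the first frame's latents.
    """
    return [
        lerp(a, b, alpha) if i in levels else list(a)
        for i, (a, b) in enumerate(zip(z_first, z_last))
    ]


# Toy two-level latents: index 0 is the bottom level, index 1 the top level.
z_first = [[0.0, 0.0], [0.0]]
z_last = [[2.0, 4.0], [10.0]]

all_levels = interpolate_levels(z_first, z_last, 0.5, {0, 1})
bottom_only = interpolate_levels(z_first, z_last, 0.5, {0})
top_only = interpolate_levels(z_first, z_last, 0.5, {1})
```

Sweeping `alpha` from 0 to 1 and decoding each interpolated latent would then give one frame per step.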

First frame

Interpolate - All levels

Interpolate - Bottom Only

Interpolate - Top Only

Last Frame

Interpolations in Latent Space - Stochastic Shapes

We fix the shape and display interpolations between two (color, size) pairs.







Bits-per-pixel vs quality

We compare the quality of generated videos conditioned on test-set frames vs the mean bits-per-pixel as training progresses.
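For reference, a value in bits can be obtained from a negative log-likelihood in nats by dividing by ln 2 and by the number of elements it is averaged over. The sketch below assumes this standard conversion. Whether the values on this page are normalized per pixel or per pixel-channel element is not stated here, so `num_elements` is left as a parameter, and the frame size in the example is hypothetical.

```python
import math


def bits_per_element(nll_nats, num_elements):
    """Convert a total negative log-likelihood in nats to bits per element."""
    return nll_nats / (num_elements * math.log(2.0))


# Hypothetical example: one 64 x 64 RGB frame, counted per pixel-channel element.
num_elements = 64 * 64 * 3
# A total NLL of num_elements * ln(2) nats corresponds to exactly 1 bit per element.
example = bits_per_element(num_elements * math.log(2.0), num_elements)
```

Lower values mean the model assigns higher likelihood to the held-out frames.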

Bpp = 1.91

Bpp = 2.11


Bpp = 2.32


Comparison with SAVP-VAE


VideoFlow at T=0.7