# VideoFlow

### Generating 100 frames into the future with 3 input frames

### Network Architecture

VideoFlow consists of an auto-regressive latent dynamics prior that models the latent variables at time-step t as a Gaussian distribution whose mean and standard deviation are functions of:

1) the latent variables at higher levels, and

2) the latent variables at previous time-steps.
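As a rough sketch of this prior, the snippet below samples a latent at time t from a Gaussian whose mean and log-standard-deviation are computed from the latents at the two previous steps and the level above. The linear map here is a toy stand-in for the 3-D residual network described below; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy latent dimensionality per level

# Hypothetical linear stand-in for the 3-D residual network that maps
# the conditioning latents to the mean and log-std of the prior.
w_prev = rng.standard_normal((2 * d, 2 * d)) * 0.1   # two previous steps
w_up = rng.standard_normal((d, 2 * d)) * 0.1         # one level higher
b = np.zeros(2 * d)

def prior_sample(z_tm1, z_tm2, z_higher):
    """Sample z_t ~ N(mu, sigma^2), with (mu, log sigma) predicted
    from z_{t-1}, z_{t-2} (same level) and the latent one level up."""
    h = np.concatenate([z_tm1, z_tm2]) @ w_prev + z_higher @ w_up + b
    mu, log_sigma = np.split(h, 2)
    eps = rng.standard_normal(d)
    return mu + np.exp(log_sigma) * eps

z_t = prior_sample(rng.standard_normal(d), rng.standard_normal(d),
                   rng.standard_normal(d))
print(z_t.shape)  # (4,)
```

At generation time this sampling step is applied autoregressively: each sampled z_t becomes a conditioning input for z_{t+1}.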

Below are network illustrations for a setup where we model the frame at step t as a function of two previous steps.

**Training:**

Modelling X_t as a function of X_{t-1} and X_{t-2}

**Generation:**

Generating X_t as a function of X_{t-1} and X_{t-2}

**3-D Residual Network Architecture**

**Left**: Network that predicts the latent Gaussian at time t as a function of the latents at previous steps and higher levels. **Right**: The detailed components of the 3-D residual network marked in the left figure.

### Generating different outcomes with the same input and different noise

We demonstrate diversity in outcomes by providing the same input and displaying the different outcomes sampled at temperature T = 0.6.
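Sampling the same conditioning input with different noise can be sketched as drawing from the Gaussian prior with its standard deviation scaled by the temperature T; lower T concentrates samples near the prior mean, higher T increases diversity. This is a minimal sketch of temperature-scaled sampling, not the model's actual prior network.

```python
import numpy as np

def sample_with_temperature(mu, sigma, temperature, seed):
    """Draw z ~ N(mu, (T * sigma)^2). At T = 0 the sample collapses
    to the prior mean; larger T gives more diverse outcomes."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(mu.shape)
    return mu + temperature * sigma * eps

mu, sigma = np.zeros(3), np.ones(3)
a = sample_with_temperature(mu, sigma, 0.6, seed=0)
b = sample_with_temperature(mu, sigma, 0.6, seed=1)
# Same input and temperature, different noise -> different outcomes.
print(np.allclose(a, b))  # False
```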

### Effect of temperature on the BAIR robot pushing dataset

A blue border represents conditioning frames and a red border represents generated frames.

We sample 100 videos from our model at different temperatures, compute the cosine similarity to the ground truth using features from a VGG network pre-trained on ImageNet, and display the best results according to this metric for different conditioning frames.
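The best-of-100 selection above amounts to scoring each sample by the cosine similarity of its features to the ground truth's and keeping the argmax. In the sketch below, `feature_fn` is a hypothetical stand-in for the pre-trained VGG feature extractor.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flat feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_of_n(samples, ground_truth, feature_fn):
    """Return the sample whose features are most cosine-similar to
    the ground truth (feature_fn stands in for VGG features)."""
    gt_feat = feature_fn(ground_truth)
    scores = [cosine_similarity(feature_fn(s), gt_feat) for s in samples]
    idx = int(np.argmax(scores))
    return samples[idx], scores[idx]

rng = np.random.default_rng(0)
feature_fn = lambda x: x.ravel()  # toy stand-in for VGG features
gt = rng.standard_normal((4, 4))
samples = [gt + 0.5 * rng.standard_normal((4, 4)) for _ in range(100)]
best, score = best_of_n(samples, gt, feature_fn)
```

The worst-of-100 display further down simply takes the argmin of the same scores.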

**T = 1.0**

**T = 0.9**

**T = 0.7**

**T = 0.5**

**T = 0.3**

**T = 0.1**

We follow the same procedure as above but instead display our worst results according to the VGG metric. The generated videos remain temporally consistent and stay on the image manifold.

### Video Modelling on the Stochastic Shapes dataset

### Interpolations in Latent Space - BAIR robot pushing

We encode the first and last frames of a video from the test set into two multi-level latent representations using our trained VideoFlow encoder. We then interpolate between the representations at all levels, only the top level, and only the bottom level. The bottom level interpolates the motion of smaller-scale objects, while the top level interpolates the position of the arm.
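Level-restricted interpolation can be sketched as linearly blending the encoded latents only at the chosen levels and keeping the remaining levels fixed at the first frame's encoding; the decoder (VideoFlow's invertible transform, omitted here) then maps each blended latent back to a frame. The function below is an illustrative sketch, not the paper's code.

```python
import numpy as np

def interpolate_levels(z_first, z_last, alpha, levels):
    """Linearly interpolate between two multi-level latent encodings
    at the given levels only; other levels keep the first encoding."""
    out = []
    for lvl, (za, zb) in enumerate(zip(z_first, z_last)):
        if lvl in levels:
            out.append((1 - alpha) * za + alpha * zb)
        else:
            out.append(za)
    return out

z_first = [np.zeros(2), np.zeros(3)]  # [bottom level, top level]
z_last = [np.ones(2), np.ones(3)]
mid = interpolate_levels(z_first, z_last, 0.5, levels={1})
# Bottom level unchanged, top level halfway between the endpoints.
print(mid[0], mid[1])
```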

**First frame**

**Interpolate - All levels**

**Interpolate - Bottom Only**

**Interpolate - Top Only**

**Last Frame**

### Interpolations in Latent Space - Stochastic Shapes

We fix the shape and display interpolations between two (color, size) pairs.

**Color=blue**

**Interpolations**

**Color=yellow**

**Color=blue**

**Interpolations**

**Color=red**

### Bits-per-pixel vs. quality

We compare the quality of generated videos conditioned on test-set frames against the mean bits-per-pixel as training progresses.
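Bits-per-pixel is the model's negative log-likelihood converted from nats to bits and normalized by the number of pixel values, so lower is better; the conversion can be sketched as follows (the NLL value here is a made-up toy number, not a result from the model).

```python
import numpy as np

def bits_per_pixel(nll_nats, num_pixels):
    """Convert a total negative log-likelihood in nats into
    bits-per-pixel: divide by ln(2) to get bits, then by pixel count."""
    return nll_nats / (np.log(2) * num_pixels)

# Toy example: a 64x64 RGB frame with a total NLL of 2.3e4 nats.
bpp = bits_per_pixel(2.3e4, 64 * 64 * 3)
```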

Bpp = 1.91

Bpp = 2.11

Bpp = 2.21

Bpp = 2.32

Bpp = 2.73

### Comparison with SAVP-VAE

**SAVP-VAE**

**VideoFlow at T = 0.7**