VideoFlow consists of an auto-regressive latent dynamic prior that models the latent variables at time-step t as a Gaussian distribution whose mean and standard deviation are functions of:
1) Latent variables at higher levels
2) Latent variables at previous time-steps
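As a rough illustration of the dependencies listed above, the sketch below computes the mean and standard deviation of a Gaussian prior from the latents at previous time-steps and the latent at the level above. It is not the VideoFlow network: the linear maps `w_mean` and `w_log_std`, the function names, and the flat vector shapes are all assumptions chosen to keep the example small.

```python
import numpy as np

def conditional_gaussian_prior(prev_latents, higher_latent, weights):
    """Return (mean, std) of a Gaussian prior over the latent at step t.

    prev_latents:  list of arrays, latents at earlier time-steps (same level).
    higher_latent: array with the latent from the level above, or None at the top level.
    weights:       dict with hypothetical toy parameters 'w_mean' and 'w_log_std'.
    """
    context = list(prev_latents)
    if higher_latent is not None:
        context.append(higher_latent)
    h = np.concatenate([np.ravel(c) for c in context])
    mean = weights["w_mean"] @ h
    log_std = weights["w_log_std"] @ h  # predict log-std so that std is always positive
    return mean, np.exp(log_std)

def gaussian_log_prob(z, mean, std):
    """Log-density of z under a diagonal Gaussian, summed over dimensions."""
    per_dim = -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2
    return float(np.sum(per_dim))

# Toy usage: two previous steps and one higher level, each a 4-dimensional latent.
rng = np.random.default_rng(0)
toy_weights = {
    "w_mean": 0.1 * rng.standard_normal((4, 12)),
    "w_log_std": 0.1 * rng.standard_normal((4, 12)),
}
z_prev = [rng.standard_normal(4), rng.standard_normal(4)]
z_higher = rng.standard_normal(4)
mean_t, std_t = conditional_gaussian_prior(z_prev, z_higher, toy_weights)
```

Predicting the log of the standard deviation is a common design choice because it keeps the standard deviation positive without any clipping.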
Below are network illustrations for a setup where we model the frame at step t as a function of two previous steps.
Training:
Modelling X_t as a function of X_{t-1} and X_{t-2}
Generation:
Generating X_t as a function of X_{t-1} and X_{t-2}
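The generation setup above, where each new frame depends on the two previous frames, can be sketched as a simple rollout loop. The `predict_next` callable is a hypothetical stand-in: in the actual model this step would sample latents from the prior and invert the flow, whereas here a toy linear extrapolation is used purely so the loop can run.

```python
import numpy as np

def generate_frames(cond_frames, predict_next, num_steps):
    """Roll out num_steps frames, each computed from the two most recent frames.

    cond_frames:  list of at least two conditioning frames (arrays).
    predict_next: hypothetical callable (x_prev1, x_prev2) -> x_t.
    """
    if len(cond_frames) < 2:
        raise ValueError("need at least two conditioning frames")
    frames = list(cond_frames)
    for _ in range(num_steps):
        x_prev2, x_prev1 = frames[-2], frames[-1]
        frames.append(predict_next(x_prev1, x_prev2))
    # Return only the newly generated frames.
    return frames[len(cond_frames):]

def extrapolate(x_prev1, x_prev2):
    """Toy stand-in for the model: continue the most recent change linearly."""
    return 2.0 * x_prev1 - x_prev2

generated = generate_frames([np.zeros((2, 2)), np.ones((2, 2))], extrapolate, num_steps=3)
```

Each generated frame is fed back in as context for the next step, which is what makes the procedure auto-regressive.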
3-D Residual Network Architecture
Left: Network that predicts the latent Gaussian at step t as a function of the latents at previous steps and at higher levels. Right: The detailed components of the 3-D residual network marked in the left figure.
We demonstrate diversity in outcomes by providing the same input and displaying different outcomes sampled at temperature T=0.6.
A blue border represents conditioning frames and a red border represents generated frames.
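The page does not spell out how the temperature T enters sampling. A common convention for flow-based models, assumed here, is to scale the standard deviation of the latent Gaussian by T before drawing a sample. The function name and arguments below are illustrative only.

```python
import numpy as np

def sample_with_temperature(mean, std, temperature, rng):
    """Draw z = mean + T * std * eps with eps ~ N(0, I).

    Assumption: temperature scales the standard deviation of the latent Gaussian.
    """
    eps = rng.standard_normal(np.shape(mean))
    return mean + temperature * std * eps

rng = np.random.default_rng(0)
mean = np.zeros(100_000)
std = 2.0 * np.ones(100_000)
z = sample_with_temperature(mean, std, temperature=0.6, rng=rng)
```

Under this convention, T below 1 concentrates samples around the mean, and T=0 returns the mean itself, which trades diversity for sharper, more conservative outcomes.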
We sample 100 videos from our model at different temperatures, compute their cosine similarity with the ground truth using a VGG network pre-trained on ImageNet, and display the best results according to this metric for different conditioning frames.
T = 0.7
T = 1.0
T = 0.9
T = 0.5
T = 0.3
T = 0.1
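The selection procedure described above can be sketched as ranking sampled videos by the cosine similarity between their features and the ground-truth features. The feature extraction itself is left out: the arrays below are placeholders for whatever VGG features are used, and the function names are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature arrays, flattened to vectors."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_samples(sample_features, truth_features):
    """Return (order, scores), where order lists sample indices from most to least similar."""
    scores = np.array([cosine_similarity(f, truth_features) for f in sample_features])
    order = np.argsort(-scores)
    return order, scores

# Placeholder features standing in for VGG activations of three sampled videos.
truth = np.array([1.0, 0.0])
samples = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
order, scores = rank_samples(samples, truth)
best_index, worst_index = int(order[0]), int(order[-1])
```

The first index in `order` corresponds to the best sample under the metric, and the last index to the worst.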
We follow the same procedure as above but instead display our worst results according to the VGG metric. The generated videos are still temporally consistent and remain on the image manifold.
We encode the first and last frames of a video from the test set into two multi-level latent representations using our trained VideoFlow encoder. We then interpolate between the two representations in three ways: at all levels, at only the top level, and at only the bottom level. The bottom level interpolates the motion of objects at a smaller scale, while the top level interpolates the position of the arm.
First frame
Interpolate - All levels
Interpolate - Bottom Only
Interpolate - Top Only
Last Frame
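A minimal sketch of the level-wise interpolation described above, under several assumptions: the interpolation is linear, each representation is a list of arrays with index 0 as the bottom level, and levels that are not interpolated keep the first frame's latent. None of these details are stated on the page, and decoding the result back to frames is omitted.

```python
import numpy as np

def interpolate_latents(z_first, z_last, alpha, levels=None):
    """Linearly interpolate multi-level latents at the selected levels.

    z_first, z_last: lists of arrays, one per level (index 0 = bottom, assumed).
    alpha:           interpolation weight in [0, 1].
    levels:          iterable of level indices to interpolate; None means all levels.
    Levels not selected keep the value from z_first (an assumption).
    """
    selected = set(range(len(z_first))) if levels is None else set(levels)
    out = []
    for i, (a, b) in enumerate(zip(z_first, z_last)):
        if i in selected:
            out.append((1.0 - alpha) * a + alpha * b)
        else:
            out.append(a.copy())
    return out

z_a = [np.zeros(4), np.zeros(2)]  # toy latents: bottom level, top level
z_b = [np.ones(4), np.ones(2)]
all_levels = interpolate_latents(z_a, z_b, alpha=0.5)
bottom_only = interpolate_latents(z_a, z_b, alpha=0.5, levels=[0])
top_only = interpolate_latents(z_a, z_b, alpha=0.5, levels=[1])
```

Sweeping `alpha` from 0 to 1 and decoding each result would give the intermediate frames for each of the three settings.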
We fix the shape and display interpolations between two (color, size) pairs.
Color=blue
Interpolations
Color=yellow
Color=blue
Interpolations
Color=red
We compare the quality of generated videos conditioned on test-set frames vs the mean bits-per-pixel as training progresses.
Bpp = 1.91
Bpp = 2.11
Bpp = 2.21
Bpp = 2.32
Bpp = 2.73
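For reference, bits-per-pixel figures like those above are typically obtained by converting a negative log-likelihood from nats to bits and dividing by the number of dimensions. The sketch below shows that conversion. Whether the count includes colour channels is not stated here, so `num_dims` is an assumption the caller has to settle, and the 64x64x3 frame size is only an example.

```python
import math

def bits_per_pixel(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension.

    nll_nats: negative log-likelihood of one frame, in nats.
    num_dims: number of dimensions to normalise by (assumed: height * width * channels).
    """
    return nll_nats / (num_dims * math.log(2))

# Example: a hypothetical 64x64 RGB frame whose likelihood costs exactly 2 bits per dimension.
num_dims = 64 * 64 * 3
nll = 2.0 * num_dims * math.log(2)
bpp = bits_per_pixel(nll, num_dims)
```

Lower values mean the model assigns higher likelihood to the test frames.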
SAVP-VAE
VideoFlow at T=0.7