Accepted at ICML 2020
Designing video prediction models that account for the inherent uncertainty of the future is challenging. Most works in the literature are based on stochastic image-autoregressive recurrent networks, which raises several performance and applicability issues. An alternative is to use fully latent temporal models which untie frame synthesis and temporal dynamics. However, no such model for stochastic video prediction has been proposed in the literature yet, due to design and training difficulties. In this paper, we overcome these difficulties by introducing a novel stochastic temporal model whose dynamics are governed in a latent space by a residual update rule. This first-order scheme is motivated by discretization schemes of differential equations. It naturally models video dynamics as it allows our simpler, more interpretable, latent model to outperform prior state-of-the-art methods on challenging datasets.
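To illustrate the residual update rule mentioned above, here is a toy numerical sketch of a first-order (Euler-like) latent scheme, y<sub>t+Δt</sub> = y<sub>t</sub> + Δt·f(y<sub>t</sub>, z<sub>t</sub>). This is not the paper's architecture: the function `f_theta` (a single linear map with tanh), the weight names, and the dimensions are all hypothetical stand-ins for the learned residual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(y, z, W_y, W_z):
    # Hypothetical residual function; in practice this would be a learned MLP.
    return np.tanh(y @ W_y + z @ W_z)

def step(y, z, W_y, W_z, dt=1.0):
    # First-order residual update of the latent state, akin to an Euler
    # discretization step: y_{t+dt} = y_t + dt * f(y_t, z_t).
    return y + dt * f_theta(y, z, W_y, W_z)

d_y, d_z = 8, 4  # illustrative latent dimensions
W_y = rng.normal(scale=0.1, size=(d_y, d_y))
W_z = rng.normal(scale=0.1, size=(d_z, d_y))

y = rng.normal(size=d_y)
for _ in range(10):
    z = rng.normal(size=d_z)  # stochastic latent variable sampled at each step
    y = step(y, z, W_y, W_z)
```

Because frame synthesis is untied from these dynamics, a decoder would map each latent state y<sub>t</sub> to a frame separately; the recurrence itself never touches pixel space.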
Below, we show video samples from our model and from the baselines we compare to. Ground truth and conditioning frames are highlighted with a green background, while model predictions have a red background. For each example, we present the best and worst predictions with respect to the ground-truth sequence, as well as a random sample. For each baseline, we only present the best sample.
Stochastic Moving MNIST
This dataset consists of MNIST digits bouncing randomly within the video frame. After each bounce, a digit moves linearly according to its newly sampled direction and speed.
We only compare to SVG, the only state-of-the-art baseline under consideration that has been tested on this dataset.
In particular, SVG tends to change the shape of digits when they cross, while our model maintains temporal consistency.
A typical failure case of prior methods occurs when no subject appears in the conditioning frames, as they have a hard time predicting its appearance. In contrast, our model manages to handle this case.
Additional samples are presented below.
Human3.6M

This dataset also consists of videos of subjects performing various actions. While there are more actions and details to capture, with fewer training subjects than in KTH, the video backgrounds are less varied and subjects always remain within the frame.
We compare to the state-of-the-art method StructVRNN. While neither StructVRNN nor our model perfectly captures the dynamics of this challenging dataset, ours produces more realistic subjects and movements. Moreover, StructVRNN predictions generate more artefacts, especially at the initial position of the subject.

An area of improvement for both methods is the appearance of the subject, which deviates from the ground truth. We suppose that this is due to the low number of subjects seen during training, preventing the models from generalizing over subject appearances.
Generation at Higher Frame Rates
As explained in the paper, our model can generate videos at a higher frame rate than those of the dataset, whether or not it was trained to this end. This is a challenging task, as there is no supervision on the intermediate frames.
The following examples show the best samples from our model trained with step size Δt = 1 on BAIR but tested with smaller step sizes. Since the same model is tested with different time steps, the best samples are not expected to coincide across settings.
We observe that predictions across settings are similar, but those obtained with smaller step sizes are smoother, showing the ability of our model to generate videos at frame rates unseen during training.
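The mechanism behind this property can be sketched with a toy deterministic drift (not the paper's learned dynamics): since the latent update is an explicit first-order scheme, halving the step size and doubling the number of steps traverses the same horizon while producing intermediate latent states, which decode to intermediate frames. The drift `f` and all dimensions below are illustrative assumptions.

```python
import numpy as np

def f(y):
    # Toy deterministic drift (a rotation field) standing in for the learned
    # residual function; the stochastic variables z are omitted for clarity.
    return np.array([-y[1], y[0]])

def rollout(y0, dt, n_steps):
    # Euler rollout y_{t+dt} = y_t + dt * f(y_t). A smaller dt with more steps
    # covers the same time horizon with intermediate states.
    traj = [y0]
    y = y0
    for _ in range(n_steps):
        y = y + dt * f(y)
        traj.append(y)
    return np.array(traj)

y0 = np.array([1.0, 0.0])
coarse = rollout(y0, dt=1.0, n_steps=4)  # step size seen during training
fine = rollout(y0, dt=0.5, n_steps=8)    # finer test-time step size, same horizon
```

Here `fine` holds nine latent states over the same horizon as the five in `coarse`, so decoding it frame by frame yields a higher-frame-rate video.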
The following examples show, from left to right, the best samples from our model trained and tested with step size Δt = 1, trained and tested with Δt = 1/2, and trained with Δt = 1/2 but tested with Δt = 1/4, on KTH.
Content Swap

To assess the ability of our model to separate the content of a video from its dynamics (see the paper), we perform an experiment where we extract the content variable from a source video (left frame, content) but infer the dynamics from a target video (top, pose). The following examples on Human3.6M and BAIR show the results of combining these content and dynamics variables (bottom right, swap).
We observe that the content of a video, which consists of its background, is indeed combined with the dynamics inferred from the target video (the movement of the subject or of the robot arm).
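The swap procedure above can be sketched as follows. This is a minimal stand-in, not the paper's encoders or decoder: `encode_content`, `infer_dynamics`, and `decode` are hypothetical functions (here, crude mean-frame operations on arrays) that only illustrate how a time-invariant content variable from one video is recombined with per-frame dynamics variables from another.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_content(video):
    # Hypothetical content encoder: a single time-invariant code per video
    # (here, the mean frame serves as a crude stand-in).
    return video.mean(axis=0)

def infer_dynamics(video):
    # Hypothetical dynamics inference: one state per frame
    # (here, each frame's deviation from the mean frame).
    return video - video.mean(axis=0, keepdims=True)

def decode(content, dynamics):
    # Hypothetical decoder recombining a static content code with
    # per-frame dynamics states to synthesize a video.
    return content[None] + dynamics

source = rng.random((16, 8, 8))  # provides the content (e.g. background)
target = rng.random((16, 8, 8))  # provides the dynamics (e.g. movement)
swap = decode(encode_content(source), infer_dynamics(target))
```

The key design point this mirrors is that the content variable carries no temporal index, so any motion in the swapped output can only come from the dynamics variables of the target video.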