The proposed MS-GAN framework. The stage-1 generator produces a low-resolution version of the predicted frames, which is then fed to the stage-2 generator. Discriminators at both stages output 0 or 1 for each predicted frame, denoting its origin: synthetic or original.
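As a concrete illustration of this two-stage pipeline, the sketch below wires it up in PyTorch. The layer counts, channel widths, and class names (Stage1Generator, Stage2Generator, Discriminator) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the two-stage pipeline described above (assumed layers/widths).
import torch
import torch.nn as nn

class Stage1Generator(nn.Module):
    """Predicts a low-resolution future frame from the input frames."""
    def __init__(self, in_frames=4, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * in_frames, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x_lr):           # x_lr: (B, 3*in_frames, H/2, W/2)
        return self.net(x_lr)          # low-resolution prediction

class Stage2Generator(nn.Module):
    """Refines the upsampled stage-1 prediction at full resolution."""
    def __init__(self, in_frames=4, ch=64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.net = nn.Sequential(
            nn.Conv2d(3 * in_frames + 3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x_hr, pred_lr):
        pred_up = self.up(pred_lr)     # upsample the stage-1 output
        return self.net(torch.cat([x_hr, pred_up], dim=1))

class Discriminator(nn.Module):
    """Outputs a scalar in (0, 1): probability the frame is original."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch * 2, 1), nn.Sigmoid(),
        )
    def forward(self, frame):
        return self.net(frame)

# One forward pass: 4 input frames at 64x64, predicting the next frame.
x_hr = torch.randn(2, 12, 64, 64)               # full-resolution input frames
x_lr = nn.functional.avg_pool2d(x_hr, 2)        # downsampled inputs for stage 1
g1, g2 = Stage1Generator(), Stage2Generator()
d1, d2 = Discriminator(), Discriminator()
pred_lr = g1(x_lr)                              # stage-1 low-resolution prediction
pred_hr = g2(x_hr, pred_lr)                     # stage-2 refined prediction
scores = d1(pred_lr), d2(pred_hr)               # per-stage discriminator scores
```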
Performance comparison of different methods using PSNR/SSIM scores on the UCF-101 and KITTI datasets. The first five rows report results from [1]. (*) indicates models fine-tuned on patches of size 64 × 64 [1]. (-) denotes unavailable data. GDL stands for Gradient Difference Loss [1]. Best results are shown in bold.
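For reference, the per-frame PSNR and SSIM scores reported in the table can be computed with scikit-image as sketched below; the uint8 RGB frame format and the frame_scores helper are assumptions for illustration.

```python
# Sketch of per-frame PSNR/SSIM computation with scikit-image.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_scores(pred, gt):
    """Return (PSNR in dB, SSIM) for one predicted frame vs. its ground truth."""
    # PSNR = 10 * log10(data_range^2 / MSE)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    # channel_axis=-1 treats the last dimension as RGB (scikit-image >= 0.19).
    ssim = structural_similarity(gt, pred, data_range=255, channel_axis=-1)
    return psnr, ssim

# Toy example with synthetic frames; a real evaluation averages over a test set.
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noise = np.random.randint(-10, 11, gt.shape)
pred = np.clip(gt.astype(int) + noise, 0, 255).astype(np.uint8)
print(frame_scores(pred, gt))
```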
Figure: Qualitative results on five example sequences. For each sequence, rows show (top to bottom): frames predicted using the ''Combined'' objective, frames predicted using the ''L1'' objective with the adversarial loss, and the ground-truth frames. Frames with red borders are outputs of the proposed model (with different objective functions).
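The ''Combined'' objective contrasted above plausibly follows the L1 + adversarial + GDL decomposition of [1]; a minimal sketch of such a generator loss is given below. The gdl_loss and combined_loss helpers and all loss weights are illustrative assumptions, not the paper's values.

```python
# Sketch of a combined generator objective: L1 + adversarial + gradient
# difference loss (GDL) in the style of [1]. Weights are assumed values.
import torch
import torch.nn.functional as F

def gdl_loss(pred, gt):
    """Penalize mismatched image gradients between prediction and ground truth."""
    dy_p, dx_p = pred[..., 1:, :] - pred[..., :-1, :], pred[..., :, 1:] - pred[..., :, :-1]
    dy_g, dx_g = gt[..., 1:, :] - gt[..., :-1, :], gt[..., :, 1:] - gt[..., :, :-1]
    return (dy_p.abs() - dy_g.abs()).abs().mean() + (dx_p.abs() - dx_g.abs()).abs().mean()

def combined_loss(pred, gt, d_score, lam_l1=1.0, lam_adv=0.05, lam_gdl=1.0):
    """d_score: discriminator output in (0, 1) for the predicted frame."""
    l1 = F.l1_loss(pred, gt)
    # The generator wants the discriminator to output 1 ("original") on its frames.
    adv = F.binary_cross_entropy(d_score, torch.ones_like(d_score))
    return lam_l1 * l1 + lam_adv * adv + lam_gdl * gdl_loss(pred, gt)

# Toy usage with random tensors in place of real frames and a real discriminator.
pred, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
d_score = torch.rand(2, 1)
print(combined_loss(pred, gt, d_score))
```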
[1] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
[2] N. Sedaghat. Next-flow: Hybrid multi-tasking with next-frame prediction to boost optical-flow estimation in the wild. arXiv preprint arXiv:1612.03777, 2016.
[3] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. arXiv preprint arXiv:1702.02463, 2017.