The proposed MS-GAN framework. The stage-1 generator produces a low-resolution version of the predicted frames, which is then fed to the stage-2 generator. Discriminators at both stages output 0 or 1 for each predicted frame, denoting its origin: synthetic or original.
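As a concrete illustration of this two-stage pipeline, the sketch below wires it up in PyTorch. The layer counts, channel widths, and class names (Stage1Generator, Stage2Generator, Discriminator) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the two-stage pipeline described above (assumed layers/widths).
import torch
import torch.nn as nn

class Stage1Generator(nn.Module):
    """Predicts a low-resolution future frame from the input frames."""
    def __init__(self, in_frames=4, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * in_frames, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x_lr):           # x_lr: (B, 3*in_frames, H/2, W/2)
        return self.net(x_lr)          # low-resolution prediction

class Stage2Generator(nn.Module):
    """Refines the upsampled stage-1 prediction at full resolution."""
    def __init__(self, in_frames=4, ch=64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.net = nn.Sequential(
            nn.Conv2d(3 * in_frames + 3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x_hr, pred_lr):
        pred_up = self.up(pred_lr)     # upsample the stage-1 output
        return self.net(torch.cat([x_hr, pred_up], dim=1))

class Discriminator(nn.Module):
    """Outputs a scalar in (0, 1): probability the frame is original."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch * 2, 1), nn.Sigmoid(),
        )
    def forward(self, frame):
        return self.net(frame)

# One forward pass: 4 input frames at 64x64, predicting the next frame.
x_hr = torch.randn(2, 12, 64, 64)               # full-resolution input frames
x_lr = nn.functional.avg_pool2d(x_hr, 2)        # downsampled inputs for stage 1
g1, g2 = Stage1Generator(), Stage2Generator()
d1, d2 = Discriminator(), Discriminator()
pred_lr = g1(x_lr)                              # stage-1 low-resolution prediction
pred_hr = g2(x_hr, pred_lr)                     # stage-2 refined prediction
scores = d1(pred_lr), d2(pred_hr)               # per-stage discriminator scores
```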
Performance comparison of different methods using PSNR/SSIM scores on the UCF-101 and KITTI datasets. The first five rows report results from [1]. (*) indicates models fine-tuned on patches of size 64 × 64 [1]. (-) denotes unavailable data. GDL stands for Gradient Difference Loss [1]. Best results are shown in bold.
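For reference, the per-frame PSNR and SSIM scores reported in the table can be computed with scikit-image as sketched below; the uint8 RGB frame format and the frame_scores helper are assumptions for illustration.

```python
# Sketch of per-frame PSNR/SSIM computation with scikit-image.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_scores(pred, gt):
    """Return (PSNR in dB, SSIM) for one predicted frame vs. its ground truth."""
    # PSNR = 10 * log10(data_range^2 / MSE)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    # channel_axis=-1 treats the last dimension as RGB (scikit-image >= 0.19).
    ssim = structural_similarity(gt, pred, data_range=255, channel_axis=-1)
    return psnr, ssim

# Toy example with synthetic frames; a real evaluation averages over a test set.
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noise = np.random.randint(-10, 11, gt.shape)
pred = np.clip(gt.astype(int) + noise, 0, 255).astype(np.uint8)
print(frame_scores(pred, gt))
```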
Figure: Qualitative results on five example sequences. For each sequence, rows show (top to bottom): frames predicted using the ''Combined'' objective, frames predicted using the ''L1'' objective with the adversarial loss, and the ground-truth frames. Frames with red borders are outputs of the proposed model (with different objective functions).
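The ''Combined'' objective contrasted above plausibly follows the L1 + adversarial + GDL decomposition of [1]; a minimal sketch of such a generator loss is given below. The gdl_loss and combined_loss helpers and all loss weights are illustrative assumptions, not the paper's values.

```python
# Sketch of a combined generator objective: L1 + adversarial + gradient
# difference loss (GDL) in the style of [1]. Weights are assumed values.
import torch
import torch.nn.functional as F

def gdl_loss(pred, gt):
    """Penalize mismatched image gradients between prediction and ground truth."""
    dy_p, dx_p = pred[..., 1:, :] - pred[..., :-1, :], pred[..., :, 1:] - pred[..., :, :-1]
    dy_g, dx_g = gt[..., 1:, :] - gt[..., :-1, :], gt[..., :, 1:] - gt[..., :, :-1]
    return (dy_p.abs() - dy_g.abs()).abs().mean() + (dx_p.abs() - dx_g.abs()).abs().mean()

def combined_loss(pred, gt, d_score, lam_l1=1.0, lam_adv=0.05, lam_gdl=1.0):
    """d_score: discriminator output in (0, 1) for the predicted frame."""
    l1 = F.l1_loss(pred, gt)
    # The generator wants the discriminator to output 1 ("original") on its frames.
    adv = F.binary_cross_entropy(d_score, torch.ones_like(d_score))
    return lam_l1 * l1 + lam_adv * adv + lam_gdl * gdl_loss(pred, gt)

# Toy usage with random tensors in place of real frames and a real discriminator.
pred, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
d_score = torch.rand(2, 1)
print(combined_loss(pred, gt, d_score))
```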
[1] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
[2] N. Sedaghat. Next-flow: Hybrid multi-tasking with next-frame prediction to boost optical-flow estimation in the wild. arXiv preprint arXiv:1612.03777, 2016.
[3] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. arXiv preprint arXiv:1702.02463, 2017.