Depth-Aware Video Frame Interpolation

Wenbo Bao*, Wei-Sheng Lai#, Chao Ma*, Xiaoyun Zhang*, Zhiyong Gao*, Ming-Hsuan Yang#&

*Shanghai Jiao Tong University, #University of California, Merced, &Google


Video frame interpolation aims to synthesize non-existent frames between the original frames. While significant advances have been made with deep convolutional neural networks, the quality of interpolation is often reduced by large object motion or occlusion. In this work, we propose to explicitly detect occlusion by exploiting the depth cue in frame interpolation. Specifically, we develop a depth-aware flow projection layer to synthesize intermediate flows that preferentially sample closer objects over farther ones. In addition, we learn hierarchical features as contextual information. The proposed model then warps the input frames, depth maps, and contextual features based on the optical flow and local interpolation kernels to synthesize the output frame. Our model is compact, efficient, and fully differentiable, so all components can be optimized jointly. We conduct extensive experiments to analyze the effect of the depth-aware flow projection layer and the hierarchical contextual features. Quantitative and qualitative results demonstrate that the proposed model performs favorably against state-of-the-art frame interpolation methods on a wide variety of datasets.
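To make the idea of depth-aware flow projection more concrete, here is a minimal NumPy sketch. It is not the released DAIN layer: the function name, the array layouts, and the choice of inverse depth as the weight are assumptions made for illustration. Each pixel of frame 0 is pushed along its forward flow to time t, and where several pixels land on the same location their flows are averaged with weights 1/depth, so closer objects dominate. Locations that no pixel reaches are simply left at zero here.

```python
import numpy as np

def depth_aware_flow_projection(flow_01, depth_0, t):
    """Illustrative sketch (hypothetical helper, not the authors' code).

    flow_01: (H, W, 2) forward flow from frame 0 to frame 1, stored as (dx, dy).
    depth_0: (H, W) depth of frame 0; larger values mean farther away.
    t:       time of the intermediate frame, between 0 and 1.
    Returns an (H, W, 2) estimate of the flow from time t back to frame 0.
    """
    h, w = depth_0.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Where each frame-0 pixel lands at time t (nearest pixel).
    tx = np.rint(xs + t * flow_01[..., 0]).astype(int)
    ty = np.rint(ys + t * flow_01[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    # Assumed weighting: inverse depth, so nearer pixels count more.
    weight = 1.0 / np.maximum(depth_0, 1e-6)

    num = np.zeros((h, w, 2))
    den = np.zeros((h, w))
    # np.add.at accumulates correctly when several pixels hit one target.
    np.add.at(den, (ty[valid], tx[valid]), weight[valid])
    np.add.at(num, (ty[valid], tx[valid]),
              weight[valid][:, None] * flow_01[valid])

    flow_t0 = np.zeros((h, w, 2))
    hit = den > 0
    # Weighted average of the colliding flows, reversed and scaled by t.
    flow_t0[hit] = -t * num[hit] / den[hit][:, None]
    return flow_t0
```

When a near pixel and a far pixel collide at the same intermediate location, the projected flow stays close to the near pixel's motion, which is the behavior the depth cue is meant to provide around occlusions. A full implementation would also fill the unreached locations, for example from neighboring pixels.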

News, Blogposts, and Applications

Since the birth of photography in the 19th century, video has become an important medium for preserving vivid memories of the era in which it was captured. It takes many forms, including movies, animations, and vlogs. However, because of past limits in video technology, including sensor density, storage, and compression, a large amount of older video content remains at low quality. Among the important metrics of video quality, the most important one is temporal resolution, measured in frames per second (fps). Higher-frame-rate videos give viewers a more immersive visual experience, so the captured content feels more real. The demand to improve low-frame-rate videos, particularly 12 fps old films, 5~12 fps animations, pixel art and stop motion, 25~30 fps movies, and 30 fps video games, is therefore becoming more and more urgent.

Temporal interpolation of video is more complex than spatial interpolation because objects move between frames. Our DAIN algorithm, an AI tool, nevertheless synthesizes the unseen frames efficiently and gracefully. It has received attention from Leiphone (雷锋网), tomorrowscience (明日科學), Futurism, redsharknews, metro, Vintage News, Boing Boing, Xataka (Español), Sina (新浪网), gadgety (עברית), gismeteo (русский), ifanr (爱范儿), and ScienceAlert; pick the language you are most familiar with to read more. See more astonishing results from various media sources, blog posts, and demonstrations. Check out more videos in the DAIN-App Playground collection.

Old Movie by Denis Shiryaev

Two Minute Papers

Stop Motion by LegoEddy

Demo by Omega Sisters

Demo by Geekerwan

Animations by Gabriel Poetsch

Apollo 16 by Denis Shiryaev

Video Demos


    author    = {Bao, Wenbo and Lai, Wei-Sheng and Ma, Chao and Zhang, Xiaoyun and Gao, Zhiyong and Yang, Ming-Hsuan}, 
    title     = {Depth-Aware Video Frame Interpolation}, 
    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition},
    year      = {2019}


Network Architecture

Code and Results

