Decomposing Motion and Content for Natural Video Sequence Prediction

Ruben Villegas¹, Jimei Yang², Seunghoon Hong³, Xunyu Lin⁴, Honglak Lee¹,⁵

¹Dept. of Computer Science and Engineering, University of Michigan, Ann Arbor, USA

²Adobe Research, San Jose, CA

³Dept. of Computer Science and Engineering, POSTECH, Korea

⁴Beihang University, Beijing, China

⁵Google Brain, Mountain View, CA

Abstract:

We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle the complex evolution of pixels in videos, we propose to decompose motion and content, the two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content using the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos from the KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatio-temporal dynamics for pixel-level future prediction in natural videos.

Paper:

Decomposing Motion and Content for Natural Video Sequence Prediction.

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, Honglak Lee.

In International Conference on Learning Representations (ICLR), 2017.


Architecture Overview:

Overall architecture of the proposed network. (a) illustrates MCnet without the motion-content residual skip connections, and (b) illustrates MCnet with such connections. Our network observes a history of image differences through the motion encoder and the last observed image through the content encoder. It then computes motion-content features and passes them to the decoder to predict the next frame.
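
The data flow described above can be illustrated with a toy NumPy sketch. This is not the released MCnet code: random linear maps stand in for the learned convolutional encoders and decoder, a plain tanh recurrence stands in for the Convolutional LSTM, and the residual skip connections of variant (b) are omitted. All names, sizes, and weights below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 8, 8   # toy frame size (assumption, not from the paper)
D = 16        # toy feature size (assumption, not from the paper)

# Random linear maps stand in for the learned networks (assumption).
W_motion_in = rng.normal(scale=0.1, size=(D, H * W))
W_motion_rec = rng.normal(scale=0.1, size=(D, D))
W_content = rng.normal(scale=0.1, size=(D, H * W))
W_combine = rng.normal(scale=0.1, size=(D, 2 * D))
W_decode = rng.normal(scale=0.1, size=(H * W, D))


def motion_encoder(frames):
    """Recurrently encode the history of image differences x_t - x_{t-1}."""
    state = np.zeros(D)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = (curr - prev).ravel()
        # Plain tanh recurrence used here in place of a Convolutional LSTM.
        state = np.tanh(W_motion_in @ diff + W_motion_rec @ state)
    return state


def content_encoder(last_frame):
    """Encode the spatial layout of the last observed frame only."""
    return np.tanh(W_content @ last_frame.ravel())


def predict_next_frame(frames):
    """Combine motion and content features, then decode the next frame."""
    motion = motion_encoder(frames)
    content = content_encoder(frames[-1])
    combined = np.tanh(W_combine @ np.concatenate([motion, content]))
    return np.tanh(W_decode @ combined).reshape(H, W)


# Five random frames play the role of the observed video history.
history = [rng.uniform(-1, 1, size=(H, W)) for _ in range(5)]
next_frame = predict_next_frame(history)
print(next_frame.shape)
```

In this sketch the motion pathway never sees raw pixels, only differences between consecutive frames, while the content pathway sees only the last frame. A static history therefore yields an all-zero motion feature, and the prediction then depends on content alone.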