Decomposing Motion and Content for
Natural Video Sequence Prediction

Ruben Villegas1,    Jimei Yang2,    Seunghoon Hong3,    Xunyu Lin4,    Honglak Lee1,5

1Dept. of Computer Science and Engineering, University of Michigan, Ann Arbor, USA
2Adobe Research, San Jose, CA
3Dept. of Computer Science and Engineering, POSTECH, Korea
4Beihang University, Beijing, China
5Google Brain, Mountain View, CA

We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle the complex evolution of pixels in videos, we propose to decompose motion and content, the two key components that generate dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos from the KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatio-temporal dynamics for pixel-level future prediction in natural videos.

Decomposing Motion and Content for Natural Video Sequence Prediction.
Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, Honglak Lee.
In International Conference on Learning Representations (ICLR), 2017.

Architecture Overview:
Overall architecture of the proposed network. (a) illustrates MCnet without the Motion-Content Residual skip connections, and (b) illustrates MCnet with such connections. Our network observes a history of image differences through the motion encoder and the last observed image through the content encoder. It then computes motion-content features and passes them to the decoder, which predicts the next frame.
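The decomposition described above can be sketched in code. The following is a minimal, illustrative PyTorch sketch (not the paper's exact architecture): a motion encoder runs a convolutional LSTM over frame differences, a content encoder processes the last observed frame, and a decoder converts the fused motion-content features into the next frame. All layer sizes, the `ConvLSTMCell` implementation, and the fusion by concatenation are simplifying assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (illustrative, not the paper's exact cell)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class MCnetSketch(nn.Module):
    """Toy motion-content decomposition for next-frame prediction."""

    def __init__(self, ch=1, feat=32):
        super().__init__()
        # Content encoder: spatial layout of the last observed frame.
        self.content_enc = nn.Sequential(
            nn.Conv2d(ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Motion encoder: features of frame differences, fed to a ConvLSTM.
        self.motion_conv = nn.Sequential(
            nn.Conv2d(ch, feat, 3, stride=2, padding=1), nn.ReLU())
        self.motion_lstm = ConvLSTMCell(feat, feat)
        # Fuse motion and content features, then decode the next frame.
        self.combine = nn.Conv2d(2 * feat, feat, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, ch, 3, padding=1), nn.Tanh())

    def forward(self, frames):
        # frames: (batch, time, ch, H, W); returns the predicted next frame.
        b, t, ch, h, w = frames.shape
        diffs = frames[:, 1:] - frames[:, :-1]  # history of image differences
        hm = cm = None
        for i in range(t - 1):
            m = self.motion_conv(diffs[:, i])
            if hm is None:
                hm, cm = torch.zeros_like(m), torch.zeros_like(m)
            hm, cm = self.motion_lstm(m, (hm, cm))
        content = self.content_enc(frames[:, -1])  # last observed frame
        fused = self.combine(torch.cat([hm, content], dim=1))
        return self.decoder(fused)
```

For example, observing 4 frames of a 32x32 single-channel video, `MCnetSketch()(torch.rand(2, 4, 1, 32, 32))` yields a tensor of shape `(2, 1, 32, 32)`, one predicted frame per batch element; the paper additionally feeds predictions back in to generate multiple future frames.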

Video comparisons among the methods in the paper; a green border marks input frames, and a red border marks predicted frames.

UCF-101 6-frame prediction: all models are trained on Sports-1M to observe 4 frames and predict 1 frame, then tested on UCF-101:

KTH 20-frame prediction (mostly foreground motion): all models are trained to observe 10 frames and predict 10 frames:

Running | Jogging

Walking | Boxing

Handclapping | Handwaving

KTH 20-frame prediction (with zoom-in/zoom-out camera transitions): all models are trained to observe 10 frames and predict 10 frames (the only actions affected by the camera were boxing, handclapping, and handwaving):

Boxing | Handclapping


Weizmann action 20-frame prediction: all models are trained on KTH to observe 10 frames and predict 10 frames, then tested on Weizmann action:

Running | Walking

One-hand waving | Two-hand waving