Learning to Generate Long-term Future
via Hierarchical Prediction

Ruben Villegas1*, Jimei Yang2, Yuliang Zou1, Sungryull Sohn1, Xunyu Lin3, Honglak Lee1,4

1Dept. of Computer Science and Engineering, University of Michigan, Ann Arbor, USA
2Adobe Research, San Jose, CA
3Beihang University, Beijing, China
4Google Brain, Mountain View, CA

We propose a hierarchical approach for making long-term predictions of future frames. To avoid the compounding errors inherent in recursive pixel-level prediction, we first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally construct the future frames from a single past frame and the predicted high-level structure, without having to observe any of the pixel-level predictions. Long-term video prediction is difficult to perform by recurrently observing the predicted frames because small errors in pixel space amplify exponentially as predictions are made deeper into the future. Our approach prevents pixel-level error propagation by removing the need to observe the predicted frames. Our model combines an LSTM and an analogy-based encoder-decoder convolutional neural network, which independently predict the video structure and generate the future frames, respectively. In experiments, we evaluate our model on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions, and demonstrate significantly better results than the state of the art.

Learning to Generate Long-term Future via Hierarchical Prediction.
Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee.
In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
[PDF][Supplementary Material][ArXiv][code coming soon]

Architecture Overview:
Overall hierarchical approach to pixel-level video prediction. Our algorithm first observes frames from the past and estimates the high-level structure, in this case human pose xy-coordinates, in each frame. The estimated structure is then used to predict the future structures in a sequence-to-sequence manner. Finally, our algorithm takes the last observed frame, its estimated structure, and the predicted structure sequence, in this case represented as heat-maps, and generates the future frames. Green denotes input to our network and red denotes output from our network.
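The heat-map representation of pose mentioned above can be sketched as follows. This is a minimal illustration, assuming one isotropic Gaussian per joint; the function name pose_to_heatmaps and the sigma value are our own assumptions and are not taken from the paper.

```python
import numpy as np

def pose_to_heatmaps(joints_xy, height, width, sigma=2.0):
    """Render one Gaussian heat-map per joint.

    joints_xy: (num_joints, 2) array of (x, y) pixel coordinates.
    Returns an array of shape (num_joints, height, width) with values in (0, 1].
    sigma is an assumed spread in pixels, not a value from the paper.
    """
    joints_xy = np.asarray(joints_xy, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]          # pixel grids, each (height, width)
    x0 = joints_xy[:, 0][:, None, None]           # joint x, broadcast over the grid
    y0 = joints_xy[:, 1][:, None, None]           # joint y, broadcast over the grid
    sq_dist = (xs[None] - x0) ** 2 + (ys[None] - y0) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Two hypothetical joints at (x=10, y=20) and (x=40, y=5) on a 64x64 frame.
heatmaps = pose_to_heatmaps([[10, 20], [40, 5]], height=64, width=64)
print(heatmaps.shape)
```

Each heat-map peaks at its joint location, which gives the image generator a spatially aligned encoding of the pose.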

Video Prediction:
Video comparison between our method and the baselines. A green frame marks input and a red frame marks prediction. In each comparison, the leftmost video is generated by our method.
Clarification: we did not train a separate model for each action category. All videos in each dataset are generated by a single model trained on all actions.

Penn Action up to 64-frame prediction for each action category
(some actions' ground-truth ends before 64 frames are predicted)
Baseball pitch        Baseball swing        Clean and jerk
Golf swing            Jump rope             Jumping jacks
Tennis forehand       Tennis serve
More videos

Human3.6M 128-frame prediction for each action category
Directions            Discussion              Greeting
Take photo            Talking on the phone    Eating
Posing                Sitting                 Purchases
Sitting down          Smoking                 Waiting
Walk dog              Walking                 Walk together

Numerical Evaluation:
Per-Class Human Evaluation and Analysis:
We follow a human psycho-physical quantitative evaluation metric similar to Vondrick et al. (2016). Amazon Mechanical Turk (AMT) workers are given a two-alternative choice to indicate which of two videos looks more realistic. Specifically, the workers are shown a pair of videos (generated by two different methods) consisting of the same input frames indicated by a green box and predicted frames indicated by a red box, in addition to the action label of the video.
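The percentages reported for this two-alternative test can be read as the share of worker judgments that preferred each method. A minimal sketch of that tally is given below; the function name preference_rates and the vote list are illustrative assumptions, and the votes are made-up toy data rather than results from the study.

```python
from collections import Counter

def preference_rates(votes):
    """Return the percentage of judgments won by each method.

    votes: iterable with one method name per worker judgment,
    naming the video that the worker found more realistic.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return {method: 100.0 * n / total for method, n in counts.items()}

# Toy example (made-up votes, not data from the study).
votes = ["ours", "ours", "conv_lstm", "ours"]
rates = preference_rates(votes)
print(rates)
```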

Penn Action: Quantitatively, the action sequences generated by our network are of perceptually higher quality than those of the baselines and also predict the correct action sequence. A relatively small (although still substantial) margin is observed when comparing to convolutional LSTM for the jump rope action (i.e., 66.7% for ours vs 33.3% for Convolutional LSTM). We hypothesize that convolutional LSTM is able to do a reasonable job for this action class because jumping up and down in place is a highly cyclic motion. The remaining human actions contain more complicated non-linear motion, which is much harder to predict. Overall, our method outperforms the baselines by a large margin (i.e., 82.4% for ours vs 17.6% for Convolutional LSTM, and 86.1% for ours vs 13.9% for Optical Flow).

Human3.6M: On average, the videos generated by our network are of perceptually higher quality and reflect a more reasonable future than those of the baselines. Unexpectedly, our network does not perform well on videos where the action involves minimal motion, such as sitting, sitting down, eating, taking a photo, and waiting. These actions usually involve the person staying still or making barely noticeable motion, so a static prediction (by convolutional LSTM and/or optical flow) can look far more realistic than the prediction from our network. Overall, our method outperforms the baselines by a large margin (i.e., 70.3% for ours vs 29.7% for Convolutional LSTM, and 72.3% for ours vs 27.7% for Optical Flow). Figure 5 shows that our network generates far higher quality future frames compared to the convolutional LSTM baseline.

Motion-Based Pixel-Level Evaluation, Analysis and Control Experiments:
In this section, we evaluate the predictions by deciles of motion, similar to Villegas et al. (2017), using the Peak Signal-to-Noise Ratio (PSNR) measure, where the 10th decile contains the videos with the most overall motion. We add a modification to our hierarchical method based on a simple heuristic: we copy the background pixels from the last observed frame, using the predicted pose heat-maps as foreground/background masks (Ours BG). Additionally, we perform experiments based on an oracle that provides our image generator with the exact future pose trajectories (Ours GT-pose∗), and we also apply the background-copying heuristic to it (Ours GT-pose BG∗). The ∗ marks indicate that these are hypothetical methods, as they require ground-truth future pose trajectories.
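The PSNR measure and the background-copying heuristic described above can be sketched as follows. This is an illustration under stated assumptions: PSNR uses its standard definition with an assumed 8-bit peak value of 255, and the foreground mask is obtained by thresholding the maximum over joint heat-maps. The threshold value and the function names psnr and copy_background are our own choices, not details from the paper.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def copy_background(pred_frame, last_frame, heatmaps, threshold=0.1):
    """Keep predicted pixels on the foreground, copy the rest from last_frame.

    pred_frame, last_frame: (H, W, 3) images.
    heatmaps: (num_joints, H, W) predicted pose heat-maps.
    threshold is an assumed cut-off for deciding what counts as foreground.
    """
    foreground = heatmaps.max(axis=0) > threshold   # (H, W) boolean mask
    return np.where(foreground[..., None], pred_frame, last_frame)

# Toy check with constant images and a single active heat-map pixel.
pred = np.full((4, 4, 3), 200.0)
last = np.full((4, 4, 3), 50.0)
hm = np.zeros((2, 4, 4))
hm[1, 1, 2] = 1.0
merged = copy_background(pred, last, hm)
print(psnr(merged, last))
```

In the toy check only the pixel at row 1, column 2 keeps its predicted value; every other pixel is copied from the last observed frame.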

In our method, the future frames are strictly dictated by the future structure. Therefore, the prediction based on the future pose oracle sheds light on how much predicting a different future structure affects PSNR scores. (Note: many future trajectories are possible given a single past trajectory.) Further, we show that our conditional image generator, given perfect knowledge of the future pose trajectory (e.g., Ours GT-pose∗), produces high-quality video prediction that both matches the ground-truth video closely and achieves much higher PSNRs. These results suggest that our hierarchical approach is a step in the right direction towards solving the problem of long-term pixel-level video prediction.

Penn Action: Below, we show evaluation on the 10th, 9th, and 8th deciles (please refer to our paper for all deciles). The plots show that our method outperforms the baselines for long-term frame prediction. In addition, by using the future pose determined by the oracle as input to our conditional image generator, our method can achieve even higher PSNR scores. We hypothesize that predicting future frames whose action semantics are similar to the ground truth, but whose pose trajectories may differ, causes lower PSNR scores. The figure below supports this hypothesis by showing that higher MSE in predicted pose tends to correspond to lower PSNR score.

The fact that PSNR can be low even if the predicted future is one of the many plausible futures suggests that PSNR may not be the best way to evaluate long-term video prediction when only a single future trajectory is predicted. This issue might be alleviated when a model can predict multiple possible future trajectories, but this investigation using our hierarchical decomposition is left as future work. Below, we show a video where PSNR is low because a future different from the ground truth is predicted (left), and a video where PSNR is high because the predicted future is close to the ground-truth future (right).

                                                Low PSNR                                                                                     High PSNR




To directly compare our image generator using the predicted future pose (Ours) and the ground-truth future pose given by the oracle (Ours GT-pose∗), we present qualitative experiments below. We can see that both predicted videos contain the action in the video. The oracle-based video prediction reflects the exact future very well.


Human3.6M: Below, we show the evaluation (PSNR over time) of different methods on the 10th, 9th, and 8th deciles of motion (please refer to our paper for all deciles). Our hierarchical approach (e.g., Ours BG) tends to achieve PSNR that is better than the optical-flow-based method and comparable to convolutional LSTM. In addition, when the oracle future pose is used as input to our image generator, the PSNR scores get a larger boost than in the Penn Action results (Section A.1). This is because there is higher uncertainty about the actions being performed in the Human3.6M dataset than in the Penn Action dataset. Therefore, even plausible future predictions can still deviate significantly from the ground-truth future trajectory, which lowers PSNR.

To gain further insight into this problem, we provide two additional analyses. First, we compute how the average PSNR changes as the future pose MSE increases in the plot below. The figure clearly shows the negative correlation between the predicted pose MSE and frame PSNR, meaning that a larger deviation of the predicted future pose from the ground-truth future pose tends to cause lower PSNRs.
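The analysis of average PSNR against pose MSE can be sketched as a simple binning procedure. The snippet below is an illustration only: the equal-width binning, the function names, and the numbers are our own assumptions, and the arrays are made-up toy values rather than measurements from the paper.

```python
import numpy as np

def mean_psnr_by_pose_mse(pose_mse, frame_psnr, num_bins=3):
    """Average frame PSNR inside equal-width bins of predicted-pose MSE."""
    pose_mse = np.asarray(pose_mse, dtype=np.float64)
    frame_psnr = np.asarray(frame_psnr, dtype=np.float64)
    edges = np.linspace(pose_mse.min(), pose_mse.max(), num_bins + 1)
    # Using only the interior edges maps every sample to a bin in 0..num_bins-1.
    bin_idx = np.digitize(pose_mse, edges[1:-1])
    means = np.array([
        frame_psnr[bin_idx == b].mean() if np.any(bin_idx == b) else np.nan
        for b in range(num_bins)
    ])
    return edges, means

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])

# Toy values (made up for illustration).
toy_mse = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_psnr = [30.0, 28.0, 27.0, 25.0, 22.0, 20.0]
edges, means = mean_psnr_by_pose_mse(toy_mse, toy_psnr)
print(means, pearson(toy_mse, toy_psnr))
```

A negative Pearson coefficient, together with bin means that decrease from left to right, corresponds to the trend described above.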

Second, we show snapshots of video prediction from different methods along with the PSNRs as they change over time. Our method tends to predict a plausible future pose trajectory, but that trajectory can deviate from the ground-truth future pose trajectory; in such cases, our method tends to achieve low PSNRs. However, when the future pose prediction from our method matches the ground truth well, the PSNR is much higher and the generated image frame is perceptually very similar to the ground-truth frame. In contrast, optical flow and convolutional LSTM make predictions that often lose the structure of the foreground (e.g., the human) over time, and eventually their predicted videos tend to become static. It is interesting to note that our method is comparable to convolutional LSTM in terms of PSNR, yet still strongly outperforms convolutional LSTM in terms of human evaluation.

                                                Low PSNR                                                                                     High PSNR



To directly compare our image generator using the predicted future pose (Ours) and the ground-truth future pose given by the oracle (Ours GT-pose∗), we present qualitative experiments below. We can see that both predicted videos contain the action in the video. However, the oracle-based video reflects the exact future very well.


Control Experiment: Long-term Frame Generation from Pose Oracle
To generate the videos below, we use an oracle that provides our image generator the exact future pose trajectories (Ours GT-pose∗). Our image generator observes the initial frame in the video, the corresponding human pose, and the future pose provided by the oracle to generate up to 1000-frame videos. These additional videos are provided to show that our image generator can produce frames far into the future. This gives further evidence that our hierarchical prediction approach is a step in the right direction towards a solution to the pixel-level video prediction problem.
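The generation procedure described above never feeds a generated frame back into the generator, which is why it can run far into the future without accumulating pixel errors. The loop below sketches that structure. The toy_generator is a hypothetical stand-in that merely shifts the first frame by the mean joint displacement; it is not the paper's network, and all names here are illustrative.

```python
import numpy as np

def generate_video(first_frame, first_pose, future_poses, generator):
    """Generate each future frame from the first frame and a target pose.

    Every call receives the same observed frame and pose, so an error
    in one generated frame cannot propagate to later frames.
    """
    return [generator(first_frame, first_pose, pose_t) for pose_t in future_poses]

def toy_generator(frame, pose, target_pose):
    """Hypothetical stand-in: shift the frame by the mean joint displacement."""
    dx, dy = np.rint((target_pose - pose).mean(axis=0)).astype(int)
    return np.roll(frame, shift=(int(dy), int(dx)), axis=(0, 1))

# One bright pixel at row 2, column 3, with a single joint at (x=3, y=2).
frame0 = np.zeros((8, 8))
frame0[2, 3] = 1.0
pose0 = np.array([[3.0, 2.0]])
oracle_poses = [np.array([[4.0, 2.0]]), np.array([[5.0, 4.0]])]
video = generate_video(frame0, pose0, oracle_poses, toy_generator)
print(len(video))
```

Because each output depends only on the first frame and the target pose, the frames could also be generated in any order or in parallel.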


Limitations and Future Work:
Our method is not perfect and has the following limitations that we aim to address in future work:
  • High-level structure needed for making analogies of the future needs to be given during training.
    • We are currently using pose annotations as structure information which we then represent as heat-maps to train our conditional image generator. As future work, we aim to automatically discover the structure.
  • We only predict a single future.
    • Multiple futures are possible given the past. Our network is currently capable of predicting a single future, but we aim to predict multiple futures given the past in future work.
  • We currently do not handle background motion.
    • This is a highly challenging task since background comes in and out of sight, which makes it difficult for the network to "imagine" what the unseen background should look like far into the future.