Video Prediction via Example Guidance
Abstract
In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states. The key insight is that the potential distribution of a sequence can be approximated by analogous sequences in a repertoire drawn from the training pool, namely, expert examples. By further incorporating a novel optimization scheme into the training procedure, plausible predictions can be sampled efficiently from the distribution constructed from the retrieved examples. Meanwhile, our method can be seamlessly integrated with existing stochastic predictive models; significant improvements are observed in comprehensive experiments, both quantitatively and qualitatively. We also demonstrate the ability to generalize to the motion of unseen classes, i.e., without access to the corresponding data during the training phase.
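To illustrate the sampling idea at a high level, below is a toy sketch that fits a per-step diagonal Gaussian over the motion features of retrieved examples and draws samples from it. The `(K, T, D)` feature layout and the Gaussian form are illustrative assumptions; this is not the actual optimization scheme described in the paper.

```python
import numpy as np

def sample_future_features(example_features, num_samples=3, eps=1e-6):
    """Toy approximation: fit a diagonal Gaussian over the motion features
    of retrieved examples and sample plausible future features from it.

    example_features: (K, T, D) array -- K retrieved examples, T future
                      steps, D-dimensional motion feature (assumed layout).
    """
    mu = example_features.mean(axis=0)           # (T, D) per-step mean
    sigma = example_features.std(axis=0) + eps   # (T, D) per-step std
    noise = np.random.randn(num_samples, *mu.shape)
    return mu[None] + sigma[None] * noise        # (num_samples, T, D)
```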
Insights
Predictive models rely heavily on the similarity between past experiences and new ones, implying that sequences with similar motion fall into the same mode with high probability. The key insight of our work, deduced from this observation, is that the potential distribution of the sequence to be predicted can be approximated by analogous sequences in a data pool, namely, examples. As shown in the following demos on the PennAction dataset, the input sequence generally falls into one of the variation patterns of the retrieved examples, which confirms the key insight of our work.
For each demo, the leftmost sequence is the input, while the other four are retrieved examples.
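To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor retrieval in a learned motion-feature space. The extractor `encode_motion`, the pool layout, and the Euclidean metric are illustrative assumptions rather than the exact procedure used in our model.

```python
import numpy as np

def retrieve_examples(query_frames, pool_features, pool_sequences,
                      encode_motion, k=5):
    """Retrieve the k most similar example sequences from a training pool.

    query_frames:   observed input frames of the sequence to predict
    pool_features:  (N, D) array of precomputed motion features, one per
                    training sequence (assumed layout)
    pool_sequences: list of N full training sequences
    encode_motion:  hypothetical feature extractor mapping frames -> (D,)
    """
    q = encode_motion(query_frames)                    # (D,) query feature
    dists = np.linalg.norm(pool_features - q, axis=1)  # distance to pool
    nearest = np.argsort(dists)[:k]                    # indices of top-k
    return [pool_sequences[i] for i in nearest]
```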
We also plot one dimension of the motion feature (here, a key-point coordinate) for further illustration. In all sub-figures, the X-axis denotes the time step and the Y-axis denotes the value of one dimension of the learned motion feature. The five solid lines are the retrieved example sequences, the blue starred line is the predicted sequence, and the orange dotted line is the ground truth.
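The sketch below reproduces the layout of such a sub-figure with matplotlib; the trajectories are synthetic placeholders standing in for the learned motion features.

```python
import numpy as np
import matplotlib.pyplot as plt

T = 20                                  # number of time steps (placeholder)
t = np.arange(T)
rng = np.random.default_rng(0)

# Placeholder trajectories for one dimension of the motion feature.
examples = [np.sin(0.3 * t + p) + 0.05 * rng.standard_normal(T)
            for p in np.linspace(0.0, 0.5, 5)]   # 5 retrieved examples
predicted = np.sin(0.3 * t + 0.25)               # predicted sequence
ground_truth = np.sin(0.3 * t + 0.2)             # ground truth

for ex in examples:
    plt.plot(t, ex, '-')                         # solid: retrieved examples
plt.plot(t, predicted, 'b-*', label='predicted')             # blue star line
plt.plot(t, ground_truth, color='orange', linestyle=':',
         label='ground truth')                   # orange dotted line
plt.xlabel('time step')
plt.ylabel('motion feature (one dimension)')
plt.legend()
plt.show()
```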
Experimental results
1. MovingMnist prediction under the deterministic setting.
Red box: input frames; Blue box: predicted frames. For each demo, from left to right: GT, Retrieved Example, Ours, DFN, SVG
2. MovingMnist prediction under the stochastic setting.
Red box: input frames; Blue box: predicted frames. For each demo, from left to right: GT, Retrieved Example, Ours, DFN, SVG
3. RobotPush prediction (for evaluation of motion accuracy).
Red box: input frames; Blue box: predicted frames. For each demo, from left to right: GT, Retrieved Example, Ours, SV2P, SVG
4. RobotPush prediction (for evaluation of motion diversity).
Red box: input frames; Blue box: predicted frames. For each demo, all 5 images are randomly predicted.
5. PennAction prediction.
For each demo, from left to right: GT, Example 1, Example 2, Ours, Kim et al. (2019)
6. Generalizing to unseen motion.
For each demo, from left to right: GT, Example 1, Example 2, Ours, Kim et al. (2019)
Training classes: baseball_swing, golf_swing, tennis_serve. Testing class: Jumping_Jack