Video Prediction with Appearance and Motion Conditions
Yunseok Jang†‡, Gunhee Kim‡ and Yale Song¶
†: University of Michigan, ‡: Seoul National University, ¶: Microsoft AI & Research
Figure 1. Given an input frame (appearance condition), our method generates videos showing different emotions (motion condition). The appearance and motion conditions impose constraints on the future frames, which reduces uncertainty and yields sharper frames compared to previous approaches.
Abstract:
Video prediction aims to generate realistic future frames by learning dynamic visual patterns. One fundamental challenge is dealing with future uncertainty: How should a model behave when there are multiple correct, equally probable futures? We propose an Appearance-Motion Conditional GAN to address this challenge. We provide appearance and motion information as conditions that specify what the future may look like, reducing the level of uncertainty. Our model consists of a generator, two discriminators that take charge of the appearance and motion pathways, and a perceptual ranking module that encourages videos generated under similar conditions to look similar. To train our model, we develop a novel conditioning scheme based on different combinations of appearance and motion conditions. We evaluate our model on facial expression and human action datasets and report favorable results compared to existing methods.
Approach:
Our goal is to generate a video given appearance and motion information. We formulate this as learning the conditional distribution p(x|y), where x is a video and y is a set of conditions known to occur. We define two conditioning variables, ya and ym, which encode appearance and motion information, respectively.
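For reference, the sketch below writes out a standard two-discriminator conditional adversarial objective under these definitions. The terms and the weighting factor lambda are illustrative assumptions, not the exact objective used here; in particular, the full training objective also includes the perceptual ranking term mentioned in the abstract.

```latex
% Minimal sketch of a two-discriminator conditional adversarial objective.
% D_a scores individual frames given the appearance condition y_a;
% D_m scores whole videos given the motion condition y_m.
% The weight \lambda and the exact form of each term are assumptions;
% the full objective additionally includes the perceptual ranking loss.
\begin{aligned}
\mathcal{L}_{a}(G, D_a) &= \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D_a(x \mid y_a)\big]
  + \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D_a(G(z \mid y_a, y_m) \mid y_a)\big)\big] \\
\mathcal{L}_{m}(G, D_m) &= \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D_m(x \mid y_m)\big]
  + \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D_m(G(z \mid y_a, y_m) \mid y_m)\big)\big] \\
\min_{G}\; \max_{D_a, D_m}\; & \;\mathcal{L}_{a}(G, D_a) + \lambda\, \mathcal{L}_{m}(G, D_m)
\end{aligned}
```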
We propose an Appearance-Motion Conditional GAN, shown below. The generator G (marked as a blue box) seeks to produce realistic future frames. The two discriminator networks (marked as orange boxes) attempt to distinguish generated videos from real ones: the appearance discriminator checks whether individual frames look realistic given the appearance condition, while the motion discriminator checks whether a video contains realistic motion given the motion condition. Note that either discriminator alone would be insufficient to achieve our goal: without the appearance discriminator, a generated video may have inconsistent visual appearance across frames; without the motion discriminator, a generated video may not depict the motion we intend to hallucinate.
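To make the two-pathway design concrete, here is a minimal PyTorch-style sketch of how a generator and the two conditional discriminators could be wired together. The layer sizes, the conditioning-by-concatenation scheme, and the module names are illustrative assumptions, not the exact architecture.

```python
# Sketch of the generator / two-discriminator layout. Assumed shapes:
# videos are (B, T, C, H, W) with T=32, C=3, H=W=64; conditions are flat vectors.
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Maps noise z plus appearance (ya) and motion (ym) conditions to a video."""

    def __init__(self, z_dim=100, ya_dim=128, ym_dim=128, frames=32):
        super().__init__()
        self.frames = frames
        self.fc = nn.Linear(z_dim + ya_dim + ym_dim, frames * 64 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.ReLU(),   # 4 -> 8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Tanh(),    # 32 -> 64
        )

    def forward(self, z, ya, ym):
        h = self.fc(torch.cat([z, ya, ym], dim=1))
        h = h.view(-1, 64, 4, 4)                 # (B*T, 64, 4, 4)
        frames = self.deconv(h)                  # (B*T, 3, 64, 64)
        return frames.view(-1, self.frames, 3, 64, 64)


class AppearanceDiscriminator(nn.Module):
    """Scores individual frames for realism given the appearance condition ya."""

    def __init__(self, ya_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 32 -> 16
        )
        self.fc = nn.Linear(64 * 16 * 16 + ya_dim, 1)

    def forward(self, frame, ya):
        h = self.conv(frame).flatten(1)
        return self.fc(torch.cat([h, ya], dim=1))


class MotionDiscriminator(nn.Module):
    """Scores whole videos for realistic motion given the motion condition ym."""

    def __init__(self, ym_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # (32,64,64) -> (16,32,32)
            nn.Conv3d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # (16,32,32) -> (8,16,16)
        )
        self.fc = nn.Linear(64 * 8 * 16 * 16 + ym_dim, 1)

    def forward(self, video, ym):
        # video: (B, T, C, H, W) -> (B, C, T, H, W) for 3D convolution
        h = self.conv(video.permute(0, 2, 1, 3, 4)).flatten(1)
        return self.fc(torch.cat([h, ym], dim=1))


if __name__ == "__main__":
    G, Da, Dm = Generator(), AppearanceDiscriminator(), MotionDiscriminator()
    z = torch.randn(2, 100)
    ya, ym = torch.randn(2, 128), torch.randn(2, 128)
    video = G(z, ya, ym)                  # (2, 32, 3, 64, 64)
    frame_score = Da(video[:, 0], ya)     # per-frame realism given ya
    video_score = Dm(video, ym)           # whole-video motion realism given ym
    print(video.shape, frame_score.shape, video_score.shape)
```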
Datasets:
The MUG dataset contains 931 video clips of subjects performing six basic emotions (anger, disgust, fear, happiness, sadness, surprise). We preprocess the clips so that each video has 32 frames of 64 x 64 pixels. For each frame we use 11 facial landmark locations (the 2nd, 9th, 16th, 20th, 25th, 38th, 42nd, 45th, 47th, 52nd, and 58th landmarks) as keypoints, detected with the OpenFace toolkit.
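As an illustration of how the keypoint condition could be assembled, the sketch below selects the 11 listed landmarks from a per-frame array of OpenFace landmark coordinates and normalizes them by the frame size. The (T, 68, 2) array layout and the 1-based numbering of the listed landmarks are assumptions about the data format, not details given here.

```python
# Hypothetical helper: build the per-frame keypoint condition for MUG from
# OpenFace landmark coordinates. Assumes `landmarks` is a (T, 68, 2) array of
# (x, y) pixel coordinates in the 64 x 64 frames, and that the listed landmark
# numbers (2, 9, 16, ...) are 1-based, so we subtract 1 when indexing.
import numpy as np

MUG_LANDMARK_IDS = [2, 9, 16, 20, 25, 38, 42, 45, 47, 52, 58]


def mug_keypoints(landmarks, frame_size=64):
    landmarks = np.asarray(landmarks, dtype=np.float32)            # (T, 68, 2)
    selected = landmarks[:, [i - 1 for i in MUG_LANDMARK_IDS], :]  # (T, 11, 2)
    return selected / frame_size                                   # normalize to [0, 1]


# Example with random coordinates for a 32-frame clip.
keypoints = mug_keypoints(np.random.uniform(0, 64, size=(32, 68, 2)))
print(keypoints.shape)  # (32, 11, 2)
```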
The NATOPS dataset contains 9,600 video clips of subjects performing 24 action categories. We crop each video to 180 x 180 pixels centered on the chest and rescale it to 64 x 64 pixels. For each frame we use 9 joint locations (head, chest, navel, L/R shoulders, L/R elbows, L/R wrists) as keypoints, provided by the dataset.
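A minimal sketch of this crop-and-rescale step is given below, assuming a frame stored as an (H, W, 3) uint8 array and a chest location given in pixel coordinates; skipping border padding when the window runs off the image is a simplification.

```python
# Hypothetical preprocessing sketch for NATOPS frames: crop a 180 x 180 window
# centered on the chest joint, then rescale to 64 x 64. Assumes the frame is an
# (H, W, 3) uint8 array, (cx, cy) is the chest location in pixels, and the crop
# window lies fully inside the frame (no border padding is handled here).
import cv2
import numpy as np


def crop_and_rescale(frame, cx, cy, crop=180, out=64):
    half = crop // 2
    window = frame[cy - half:cy + half, cx - half:cx + half]   # (180, 180, 3)
    return cv2.resize(window, (out, out), interpolation=cv2.INTER_AREA)


# Example with a dummy 240 x 320 frame and a chest located at (160, 120).
dummy = np.zeros((240, 320, 3), dtype=np.uint8)
print(crop_and_rescale(dummy, cx=160, cy=120).shape)  # (64, 64, 3)
```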
Results:
Figure 3. Our results under different motion conditions. We show results generated from the same appearance condition (input image) but with different motion conditions (a mismatched condition means the motion information differs from that of the ground-truth video).
Acknowledgements:
We thank Kang In Kim for helpful comments on building the human evaluation page. We also appreciate Youngjin Kim, Youngjae Yu, Juyoung Kim, Insu Jeon and Jongwook Choi for helpful discussions on the design of our model. This work was partially supported by the Korea-U.K. FP Programme through the National Research Foundation of Korea (NRF-2017K1A3A1A16067245).