Unsupervised Learning of Disentangled Representations from Video

Emily Denton, Vighnesh Birodkar

Department of Computer Science, Courant Institute, New York University


We present a new model, DrNet, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on synthetic and real videos, demonstrating the ability to coherently generate up to several hundred steps into the future.


In Neural Information Processing Systems (NIPS), 2017

[PDF] [code]

Model overview:

Our model, which we call Disentangled Representation Net (DrNet), is a form of predictive auto-encoder that uses a novel adversarial loss to factor the latent representation of each video frame into two components: one that is roughly time-independent (i.e. approximately constant throughout the clip) and another that captures the dynamic aspects of the sequence and therefore varies over time. We refer to these as the content and pose components, respectively. The adversarial loss relies on the intuition that while the content features should be distinctive of a given clip, individual pose features should not. Thus the loss encourages pose features to carry no information about clip identity.

Two separate encoders produce distinct feature representations of content and pose for each frame. They are trained by requiring that the content representation of frame x_t and the pose representation of a future frame x_{t+k} can be combined (via concatenation) and decoded to predict the pixels of x_{t+k}. We also introduce a novel adversarial loss on the pose features that prevents them from being discriminable from one video to another, ensuring that they cannot contain content information. A further constraint, motivated by the notion that content information should vary slowly over time, encourages temporally close content vectors to be similar to one another.
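As an illustration only, the three training signals described above (reconstruction of the future frame, similarity of nearby content vectors, and an adversarial term on the pose features) can be sketched in NumPy. This is not the released code: the function names, the mean-squared-error form of the first two terms, the choice to push the discriminator output toward 0.5, and the weights `alpha` and `beta` are all assumptions made for the sketch.

```python
import numpy as np

def reconstruction_loss(decoded, target):
    # Assumed form: mean squared error between the decoded frame and x_{t+k}.
    return float(np.mean((decoded - target) ** 2))

def similarity_loss(content_t, content_tk):
    # Assumed form: squared distance between content vectors of nearby frames.
    return float(np.mean((content_t - content_tk) ** 2))

def adversarial_pose_loss(disc_prob, eps=1e-7):
    # Assumed form: cross entropy against a target of 0.5, which is smallest
    # when the discriminator cannot tell whether two pose vectors share a clip.
    p = np.clip(disc_prob, eps, 1.0 - eps)
    return float(np.mean(-0.5 * np.log(p) - 0.5 * np.log(1.0 - p)))

def encoder_decoder_loss(decoded, target, content_t, content_tk, disc_prob,
                         alpha=1.0, beta=0.1):
    # alpha and beta are hypothetical weights, not values from the paper.
    return (reconstruction_loss(decoded, target)
            + alpha * similarity_loss(content_t, content_tk)
            + beta * adversarial_pose_loss(disc_prob))
```

With a perfect reconstruction and identical content vectors, only the adversarial term remains, and it bottoms out at log 2 when the discriminator outputs 0.5 for every pair.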

Left: Scene discriminator C is trained with binary cross entropy (BCE) loss to predict if a pair of pose vectors comes from the same (top portion) or different (lower portion) scenes. Right: The overall model, showing all terms in the loss function. Note that when the pose encoder E_p is updated, the scene discriminator is held fixed.
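The discriminator's training data can be pictured as labelled pairs of pose vectors: label 1 when both come from the same clip, label 0 otherwise. The sketch below shows one way to sample such pairs and score predictions with BCE. The helper names, the array layout `(num_clips, num_frames, dim)`, and the one-positive-one-negative sampling scheme are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    # Binary cross entropy between predicted probabilities and 0/1 labels.
    p = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))))

def sample_pose_pairs(poses, rng):
    # poses: array of shape (num_clips, num_frames, dim), with num_clips >= 2
    # and num_frames >= 2. For every clip, draw one same-clip pair (label 1)
    # and one different-clip pair (label 0).
    n_clips, n_frames, _ = poses.shape
    pairs, labels = [], []
    for c in range(n_clips):
        i, j = rng.choice(n_frames, size=2, replace=False)
        pairs.append((poses[c, i], poses[c, j]))
        labels.append(1.0)
        # Offset in [1, n_clips - 1] guarantees a clip other than c.
        other = (c + int(rng.integers(1, n_clips))) % n_clips
        k = int(rng.integers(n_frames))
        pairs.append((poses[c, i], poses[other, k]))
        labels.append(0.0)
    return pairs, np.array(labels)
```

A discriminator that outputs 0.5 for every pair scores a BCE of log 2, which is the point at which it has no usable information about clip identity.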

Video prediction:

Predicting future video frames is straightforward to do using our representation. We apply a standard LSTM model to the pose features, conditioning on the content features from the last observed frame. Despite the simplicity of our model relative to other video generation techniques, we are able to generate convincing long-range frame predictions, out to hundreds of time steps in some instances.

Generating future frames by recurrently predicting h_p, the latent pose vector.
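The recurrent prediction of pose vectors can be sketched as a rollout loop: the LSTM first consumes the observed pose vectors, then feeds its own predictions back in, with the content vector of the last observed frame held fixed throughout. The `TinyLSTM` class below is a hypothetical, untrained single-layer cell written in NumPy purely to show the data flow; the layer sizes and the linear output projection are assumptions, and a decoder would still be needed to turn each predicted pose into pixels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    # Hypothetical single-layer LSTM with random weights (illustration only).
    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, in_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.W_out = rng.normal(0.0, 0.1, size=(out_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x, state):
        h, c = state
        gates = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        # Linear projection from the hidden state to the next pose vector.
        return self.W_out @ h, (h, c)

def rollout(lstm, content, observed_poses, n_future):
    # content: content vector of the last observed frame (kept fixed).
    # observed_poses: non-empty sequence of pose vectors from observed frames.
    state = (np.zeros(lstm.hidden_dim), np.zeros(lstm.hidden_dim))
    for pose in observed_poses:
        pred, state = lstm.step(np.concatenate([content, pose]), state)
    future = []
    for _ in range(n_future):
        future.append(pred)
        # Feed the predicted pose back in as the next input.
        pred, state = lstm.step(np.concatenate([content, pred]), state)
    return np.stack(future)
```

For example, `rollout(TinyLSTM(in_dim=12, hidden_dim=16, out_dim=4), np.zeros(8), np.ones((5, 4)), 100)` returns an array of 100 predicted pose vectors of dimension 4.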

KTH 100-step video prediction:

100-step video generation on KTH, where green frames indicate conditioned input and red frames indicate generations. Generations from the MCNet of Villegas et al. (2017) [1] are shown for comparison.


Hand clapping

Hand waving





[1] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.