Marco Fraccaro*¹, Simon Kamronn*¹, Ulrich Paquet², Ole Winther¹
¹Technical University of Denmark, ²DeepMind
* indicates equal contribution
The Kalman variational auto-encoder (KVAE) is a framework for unsupervised learning of sequential data that disentangles two latent representations: an object representation, coming from a recognition model, and a latent state describing its dynamics. The recognition model is a convolutional variational auto-encoder, and the latent dynamics are modeled by a linear Gaussian state space model (LGSSM).
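To make the two-level structure concrete, here is a minimal numpy sketch of the generative side of such a model. All dimensions and parameter values are illustrative assumptions, not the learned parameters from the paper: a latent state z_t evolves under linear Gaussian dynamics, is emitted into the low-dimensional VAE latent a_t, and a convolutional decoder (omitted here) would map each a_t to a video frame.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: LGSSM state z_t and VAE latent a_t.
dim_z, dim_a, T = 4, 2, 10

# Illustrative LGSSM parameters (in the KVAE these are learned).
A = 0.9 * np.eye(dim_z)               # state transition matrix
C = rng.normal(size=(dim_a, dim_z))   # emission into the VAE latent space
Q = 0.01 * np.eye(dim_z)              # transition noise covariance
R = 0.01 * np.eye(dim_a)              # emission noise covariance

# Generate a latent trajectory: z_t = A z_{t-1} + w_t,  a_t = C z_t + v_t.
z = np.zeros((T, dim_z))
a = np.zeros((T, dim_a))
z[0] = rng.multivariate_normal(np.zeros(dim_z), np.eye(dim_z))
a[0] = C @ z[0] + rng.multivariate_normal(np.zeros(dim_a), R)
for t in range(1, T):
    z[t] = A @ z[t - 1] + rng.multivariate_normal(np.zeros(dim_z), Q)
    a[t] = C @ z[t] + rng.multivariate_normal(np.zeros(dim_a), R)

# A VAE decoder (not shown) would then render each a_t as a frame x_t.
print(a.shape)
```

The key design point is the separation of concerns: all temporal structure lives in the cheap, low-dimensional (z_t, a_t) chain, while appearance is handled per frame by the auto-encoder.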
As shown in the paper, the KVAE can be trained end-to-end and is able to learn both a recognition and a dynamics model from the videos. The model can be used to generate new sequences, as well as to impute missing data, without the need to generate high-dimensional frames at each time step.
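The reason imputation avoids frame generation is that missing time steps can be handled entirely inside the LGSSM: a Kalman filter simply skips the update step when no observation is available, so the prediction carries the state forward. The sketch below illustrates this with a plain Kalman filter over the pseudo-observations a_t; the function name and all parameter values are assumptions for illustration (the paper uses smoothing and learned parameters).

```python
import numpy as np

def kalman_filter_missing(a, A, C, Q, R, observed):
    """Kalman filter over pseudo-observations a_t. When observed[t] is
    False, the measurement update is skipped and the predicted state
    stands in for the missing step -- no frame is ever decoded."""
    T, _ = a.shape
    dim_z = A.shape[0]
    mu = np.zeros(dim_z)
    P = np.eye(dim_z)
    filtered = np.zeros((T, dim_z))
    for t in range(T):
        if t > 0:                      # predict: propagate through dynamics
            mu = A @ mu
            P = A @ P @ A.T + Q
        if observed[t]:                # update only where data exists
            S = C @ P @ C.T + R
            K = P @ C.T @ np.linalg.inv(S)
            mu = mu + K @ (a[t] - C @ mu)
            P = (np.eye(dim_z) - K @ C) @ P
        filtered[t] = mu
    return filtered

# Toy usage with hypothetical parameters; ~30% of steps are dropped.
rng = np.random.default_rng(0)
dim_z, dim_a, T = 4, 2, 10
A = 0.9 * np.eye(dim_z)
C = rng.normal(size=(dim_a, dim_z))
Q, R = 0.01 * np.eye(dim_z), 0.01 * np.eye(dim_a)
a = rng.normal(size=(T, dim_a))
observed = rng.random(T) > 0.3
zf = kalman_filter_missing(a, A, C, Q, R, observed)
print(zf.shape)
```

Frames for the missing steps are then reconstructed only once at the end, by decoding the imputed latents.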
The dynamics parameter network learns the appropriate mixture of multiple linear dynamics at each step by observing only the low-dimensional latent representation.
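A minimal sketch of this mixing idea: a bank of K base transition matrices is combined with softmax weights produced from the previous low-dimensional latent. The single linear layer standing in for the network, and all values, are assumptions for illustration (the paper's parameter network is a learned neural network conditioned on the history of latents).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K, dim_z, dim_a = 3, 4, 2

# Bank of K base dynamics matrices A^(k) (illustrative random values).
A_bank = rng.normal(size=(K, dim_z, dim_z))
# Stand-in for the dynamics parameter network: one linear layer.
W = rng.normal(size=(K, dim_a))

# Mixture weights from the previous VAE latent a_{t-1} only --
# the network never sees the high-dimensional frames.
a_prev = rng.normal(size=dim_a)
alpha = softmax(W @ a_prev)                 # weights sum to 1
A_t = np.tensordot(alpha, A_bank, axes=1)   # effective transition at step t

print(A_t.shape)
```

The effective transition sum_k alpha_k A^(k) varies over time, letting a globally nonlinear dynamics be represented as a smoothly interpolated set of linear ones.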
We demonstrate the model’s ability to separately learn a recognition and dynamics model from video, and use it to impute missing data and perform long-term generation in four different environments.
All videos and data used for training are available here.
Imputation where 30% of the frames are dropped at random. The red ball is the ground truth.
Long-term generation with 4 frame initialization.
Imputation where 30% of the frames are dropped at random. The red ball is the ground truth.
Long-term generation with 4 frame initialization.
Imputation where 30% of the frames are dropped at random. The red ball is the ground truth.
Long-term generation with 4 frame initialization.
Imputation where 30% of the frames are dropped at random. The red ball is the ground truth.
Long-term generation with 4 frame initialization.