Keyframing the Future: Keyframe Discovery for Visual Prediction and Planning


Conference on Learning for Dynamics and Control, 2020

Abstract

Temporal observations such as videos contain essential information about the dynamics of the underlying scene, but they are often interleaved with inessential, predictable details. One way to deal with this problem is to focus on the most informative moments in a sequence. In this paper, we propose a model that learns to discover these important events and the times at which they occur, and uses them to represent the full sequence. We do so with a hierarchical Keyframe-Inpainter (KeyIn) model that first generates a video's keyframes and then inpaints the rest of the video by generating the frames at the intervening times. We propose a fully differentiable formulation that learns this procedure efficiently. We show that KeyIn discovers informative keyframes on several datasets with different dynamics and visual properties, and that it outperforms other recent hierarchical predictive models for planning.

Approach

We propose a hierarchical Keyframe-Inpainter (KeyIn) model that first generates a video's keyframes, together with the times at which they occur, using a keyframe predictor module, and then inpaints the rest of the sequence with a sequence inpainter module. We formulate KeyIn as a latent variable model and train it by maximizing a variational lower bound. We show how optimizing this lower-bound objective leads to the discovery of keyframe structure.
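In its generic form, such a bound is the standard evidence lower bound; the expression below is schematic, with z standing in for the model's latent variables (the exact factorization over keyframes and inpainted segments follows the paper):

\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right) \le \log p(x)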

We want our model to dynamically predict the keyframe placement τ. However, learning a distribution over this discrete variable is challenging because it requires sampling discrete values, which is not differentiable. To learn the keyframe placement efficiently and in a differentiable manner, we propose a continuous relaxation of the objective.


In our soft relaxation of the objective, instead of sampling a placement from τ and computing the loss against the corresponding target frame, we compute the loss against the expected target. The target for each keyframe is simply the weighted average of the ground-truth frames under the corresponding distribution over placements, and the keyframe prediction loss is the reconstruction loss between the prediction and this soft target, as sketched below.
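A minimal sketch of this computation, assuming τ is parameterized by per-keyframe logits over time steps (the function and variable names here are illustrative, not the paper's):

import torch

def soft_keyframe_targets(frames, tau_logits):
    # frames:     (T, C, H, W) ground-truth frames
    # tau_logits: (K, T) unnormalized scores; a softmax over T gives each
    #             keyframe's distribution over placement times
    tau = torch.softmax(tau_logits, dim=-1)            # rows of (K, T) sum to 1
    # Expected target: weighted average of the frames under each row of tau.
    # Differentiable w.r.t. tau_logits, so placement can be learned by gradient descent.
    return torch.einsum('kt,tchw->kchw', tau, frames)

The keyframe prediction loss is then an ordinary reconstruction loss, e.g. torch.mean((predicted_keyframes - targets) ** 2), which remains differentiable with respect to both the predicted frames and the placement logits.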

We can now use our keyframe model to perform hierarchical planning. To construct a plan, we search for a sequence of keyframes that starts at the currently observed image and reaches the goal. We optimize the keyframe sequence with the cross-entropy method (CEM) in the model's latent space.
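A minimal CEM sketch, assuming the plan is represented as a single flattened latent vector and that a cost function is available (e.g. the distance between the decoded final keyframe and the goal image); all names and hyperparameters here are illustrative:

import numpy as np

def cem_plan(cost_fn, z_dim, n_iters=10, pop_size=100, n_elite=10):
    # Fit a Gaussian over latent plans by iteratively refitting to the elites.
    mu, sigma = np.zeros(z_dim), np.ones(z_dim)
    for _ in range(n_iters):
        samples = mu + sigma * np.random.randn(pop_size, z_dim)   # candidate latent plans
        costs = np.array([cost_fn(z) for z in samples])           # e.g. decoded-keyframe-to-goal distance
        elite = samples[np.argsort(costs)[:n_elite]]              # keep the lowest-cost candidates
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the sampling distribution
    return mu                                                     # mean of the final distribution as the plan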

Once we have constructed the plan, expressed in terms of keyframes, we treat the keyframes as subgoals and pass them to a low-level controller, e.g. one based on model predictive control. The controller produces action trajectories that reach each keyframe in turn, until the final goal is reached.
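The resulting execution loop might look like the following sketch, assuming a gym-style environment, a short-horizon low-level controller, and a user-supplied predicate for deciding when a subgoal is reached (all names here are hypothetical):

def execute_plan(env, keyframe_subgoals, controller, reached, max_steps=50):
    # Track each planned keyframe in turn with a low-level controller,
    # e.g. a short-horizon MPC that drives the observation toward the subgoal.
    obs = env.reset()
    for subgoal in keyframe_subgoals:
        for _ in range(max_steps):
            action = controller(obs, subgoal)   # action toward the current subgoal
            obs, _, done, _ = env.step(action)
            if done or reached(obs, subgoal):
                break                           # move on to the next keyframe
    return obs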

Results

Keyframe Discovery

Example demonstration from the Pushing dataset.

Example of discovered keyframe structure. Shown are aggregated predictions from the network, with the initial state in purple, keyframes in pink, and intermediate states shown semi-transparent.

Generations on Stochastic Brownian Motion Data

Trajectories from the SBM dataset.

Corresponding KeyIn reconstructions.

KeyIn video generation results conditioned on the same beginning sequence.

Generations on Gridworld Data

Trajectories from the gridworld environment.

Corresponding KeyIn reconstructions.

KeyIn video generation results conditioned on the same beginning sequence.

Generations on Pushing Data

Ground truth demonstrations on the Pushing data.

Corresponding KeyIn video reconstruction results.

KeyIn video generation results conditioned on the same frame.

CIGAN (Wang et al., RSS 2019) video generation results conditioned on the same frame.

Planning Results

KeyIn (ours): Complete test set planning results. Semi-transparent object shows the goal location. Success rate: 64.2%.

Fixed Segment Baseline: Complete test set planning results. Semi-transparent object shows the goal location. Success rate: 58.8%.

No Hierarchy Baseline: Complete test set planning results. Semi-transparent object shows the goal location. Success rate: 15.0%.

Bibtex

@article{pertsch2020keyin,
  title={KeyIn: Keyframing for Visual Planning},
  author={Pertsch, Karl and Rybkin, Oleh and Yang, Jingyun and Zhou, Shenghao and Derpanis, Kosta and Lim, Joseph and Daniilidis, Kostas and Jaegle, Andrew},
  journal={Conference on Learning for Dynamics and Control},
  year={2020}
}