MSPred: Video Prediction at Multiple Spatio-Temporal Scales with Hierarchical Recurrent Networks

Angel Villar-Corrales¹*, Ani Karapetyan¹*, Andreas Boltres², and Sven Behnke¹

1: Autonomous Intelligent Systems, University of Bonn
2: Autonomous Learning Robots, Karlsruhe Institute of Technology (KIT) & SAP SE  
* Denotes equal contribution

Abstract

Autonomous systems not only need to understand their current environment, but should also be able to predict future actions conditioned on past states, for instance based on captured camera frames. However, existing models mainly focus on forecasting future video frames for short time horizons, hence being of limited use for long-term action planning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model able to simultaneously forecast possible future outcomes at different levels of granularity and at different spatio-temporal scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations such as human poses or locations over long time horizons, while still maintaining competitive performance for video frame prediction. In our experiments, we demonstrate that MSPred accurately predicts future video frames as well as high-level representations (e.g., keypoints or semantics) on bin-picking and action recognition datasets, while consistently outperforming popular approaches for future frame prediction. Furthermore, we ablate different modules and design choices in MSPred, experimentally validating that combining features of different spatial and temporal granularity leads to superior performance.

Method

Given a sequence of seed frames, MSPred predicts representations of different levels of granularity at distinct time scales. Low-level representations, such as video frames, are predicted for short time horizons with a fine temporal resolution. Conversely, higher-level representations, such as human poses or locations, are forecast further into the future using coarser time resolutions, hence allowing for long-term predictions with a small number of iterations.
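The coarser time resolutions at higher levels can be sketched as a simple update schedule: each level of the hierarchy updates its recurrent state only every few time steps, so higher levels cover the same horizon with far fewer iterations. The sketch below is illustrative only; the level names and update periods are hypothetical and not taken from the paper.

```python
# Minimal sketch of multi-scale temporal scheduling: each hierarchy level
# updates only every `period` steps, so higher levels span the same
# horizon with fewer recurrent iterations. Periods are hypothetical.

def schedule_updates(num_steps, periods):
    """Return, per level, the time steps at which that level updates."""
    return {
        level: [t for t in range(num_steps) if t % period == 0]
        for level, period in periods.items()
    }

# Hypothetical levels: frames update every step, abstract representations
# (e.g. poses, locations) update less frequently.
periods = {"frames": 1, "poses": 4, "locations": 8}
updates = schedule_updates(16, periods)
# Over a 16-step horizon, the frame-level recurrence runs 16 times,
# while the location-level recurrence runs only twice.
```

The key efficiency argument is visible here: predicting far into the future at a high level of abstraction requires only a handful of recurrent iterations, rather than one per video frame.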

MSPred Architecture


Comparison with Video Prediction Methods