OBJECT-CENTRIC VIDEO PREDICTION VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS

Angel Villar-Corrales*, Ismail Wahdan*, and Sven Behnke

All authors are part of Autonomous Intelligent Systems, University of Bonn, Germany. * denotes equal contribution.

IEEE International Conference on Image Processing (ICIP) 2023

Abstract

We present a framework for object-centric video prediction, i.e., parsing a video sequence into objects and modeling their dynamics and interactions in order to predict future object states, from which video frames are rendered. To facilitate the learning of meaningful spatio-temporal object representations and the forecasting of their states, we propose two novel object-centric video prediction (OCVP) transformer modules, which decouple the processing of temporal dynamics from that of object interactions. We show that OCVP predictors outperform object-agnostic video prediction models on two different datasets. Furthermore, we observe that OCVP modules learn consistent and interpretable object representations.
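The core idea of decoupling can be illustrated with a minimal sketch: given a sequence of per-object slot representations of shape (time, objects, features), temporal attention lets each object attend only over its own history, while relational attention lets objects attend to each other within a single time step. The sketch below uses single-head dot-product attention with identity projections and a sequential temporal-then-relational ordering; the function names and this particular ordering are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (L, D) -- single-head scaled dot-product self-attention
    # (identity query/key/value projections for brevity)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def temporal_attention(slots):
    # attend along the time axis independently per object:
    # no information flows between different objects here
    T, N, D = slots.shape
    out = np.empty_like(slots)
    for n in range(N):
        out[:, n] = self_attention(slots[:, n])
    return out

def relational_attention(slots):
    # attend across objects within each time step:
    # models object interactions, no mixing across time here
    T, N, D = slots.shape
    out = np.empty_like(slots)
    for t in range(T):
        out[t] = self_attention(slots[t])
    return out

def ocvp_block(slots):
    # one decoupled predictor block (hypothetical sequential variant):
    # temporal update first, then relational update, with residual connections
    x = slots + temporal_attention(slots)
    return x + relational_attention(x)
```

A parallel variant would instead apply both attention types to the same input and merge the results; either way, the decoupling keeps object dynamics and object interactions in separate, interpretable attention maps.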

Proposed Object-Centric Video Prediction Framework

Investigated Transformer Predictor Modules

Quantitative Results

Qualitative Results on Obj3D

Qualitative Results on MOVi-A