Improving Generative Imagination in Object-Centric World Models

Zhixuan Lin, Yi-Fu Wu, Skand, Bofeng Fu, Jindong Jiang, Sungjin Ahn

Rutgers University, Zhejiang University & Tianjin University

in ICML 2020

Introduction

We propose a new temporal generative model, called Generative Structured World Models (G-SWM), for unsupervised learning of object-centric state representations and efficient future simulation. G-SWM not only unifies the key abilities of previous models in a principled framework but also adds two new ones: multimodal uncertainty and situation-aware behavior. The main contributions of this paper are as follows:

Integration of important abilities of previous models: interaction, occlusion, scalability, background modeling, etc.
Two crucial new abilities: multimodal uncertainty and situation awareness.

Overview

Objects and Contexts

In G-SWM, we assume that each frame of the video can be divided into objects and context. Each object is modeled with its own set of latent variables describing its state. Everything that is not an object is considered “context” and is modeled with a single context latent variable.

Object-Centric Generation

G-SWM generates frames in an object-centric way. To generate a new frame, we first determine the new object latents and the context latent. The object latents are then rendered into the foreground of the frame, and the context latent into the background.
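
The rendering step can be illustrated with a small compositing sketch. This is not the G-SWM renderer, which decodes learned latents with neural networks; it is a hand-written toy that assumes the latents have already been decoded into a grayscale background image and, for each object, a patch with a transparency mask. The function name render_frame and the dictionary keys top, left, patch, and alpha are hypothetical and chosen only for this example.

```python
def render_frame(background, objects):
    """Composite object patches over a background (grayscale values in [0, 1]).

    background: list of rows (H x W floats), standing in for the decoded
                context latent (assumption for this sketch).
    objects:    list of dicts with hypothetical keys 'top', 'left', 'patch',
                'alpha', standing in for decoded object latents. Objects are
                drawn in list order, so later objects occlude earlier ones.
    """
    frame = [row[:] for row in background]  # copy so the input is not mutated
    height, width = len(frame), len(frame[0])
    for obj in objects:
        for dy, (patch_row, alpha_row) in enumerate(zip(obj["patch"], obj["alpha"])):
            for dx, (value, alpha) in enumerate(zip(patch_row, alpha_row)):
                y, x = obj["top"] + dy, obj["left"] + dx
                if 0 <= y < height and 0 <= x < width:  # clip to the frame
                    # Alpha blending: foreground over whatever is already there.
                    frame[y][x] = alpha * value + (1.0 - alpha) * frame[y][x]
    return frame


# Usage: a 2x2 bright square placed on a 4x4 black background.
background = [[0.0] * 4 for _ in range(4)]
square = {
    "top": 1,
    "left": 1,
    "patch": [[1.0, 1.0], [1.0, 1.0]],
    "alpha": [[1.0, 0.5], [1.0, 1.0]],
}
frame = render_frame(background, [square])
```

Drawing objects in list order is one simple way to resolve occlusion in a toy setting; it is an assumption of this sketch rather than a description of how G-SWM handles depth.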

Versatile Propagation

The core of G-SWM is the Versatile Propagation (V-Prop) module, which integrates information from previous steps to generate the new object states. This information includes the previous object state, object-object interaction, and object-context interaction. At the same time, the hierarchical modeling of object dynamics allows for multimodal behavior.
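
To make the three information sources concrete, here is a deliberately simplified propagation step. It is only a sketch: V-Prop is a learned module, whereas the rules below are hand-written, the objects live in one dimension, and the function name propagate, the wall entry of the context, and all constants are assumptions made for this example. A sampled discrete value stands in for the stochastic latent that makes the predicted future multimodal.

```python
import random


def propagate(states, context, rng):
    """One toy propagation step for 1-D objects.

    states:  list of (position, velocity) tuples, the previous object states.
    context: dict with a hypothetical 'wall' position that objects bounce off.
    rng:     random.Random used to sample a discrete per-object latent.
    """
    new_states = []
    for i, (pos, vel) in enumerate(states):
        # Object-object interaction: a simple repulsion from nearby objects.
        push = 0.0
        for j, (other_pos, _) in enumerate(states):
            if i != j and abs(pos - other_pos) < 1.0:
                push += 0.1 if pos >= other_pos else -0.1
        # Sampled latent: choosing between two modes yields multimodal futures.
        kick = rng.choice([-0.05, 0.05])
        # Previous object state: the old velocity carries over and is adjusted.
        vel = vel + push + kick
        new_pos = pos + vel
        # Object-context interaction: reflect off the wall given by the context.
        if new_pos > context["wall"]:
            new_pos = 2 * context["wall"] - new_pos
            vel = -vel
        new_states.append((new_pos, vel))
    return new_states


# Usage: two objects, one moving toward the other, with a wall at 1.0.
states = [(0.0, 0.5), (0.5, 0.0)]
context = {"wall": 1.0}
next_states = propagate(states, context, random.Random(0))
```

Calling propagate repeatedly with different random generators produces different rollouts from the same starting states, which is the toy counterpart of sampling several possible futures.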