Improving Generative Imagination in Object-Centric World Models

Zhixuan Lin, Yi-Fu Wu, Skand, Bofeng Fu, Jindong Jiang, Sungjin Ahn

Rutgers University, Zhejiang University & Tianjin University

in ICML 2020

Paper | Code | Slides

Introduction

We propose a new temporal generative model, the Generative Structured World Model (G-SWM), for unsupervised learning of object-centric state representations and efficient future simulation. G-SWM not only unifies the key abilities of previous models in a principled framework but also models multimodal uncertainty and situation-aware behavior. The main contributions of this paper are as follows:

  1. Integration of important abilities of previous models: interaction, occlusion, scalability, background modeling, etc.

  2. Two crucial new abilities: multimodal uncertainty and situation awareness.

Overview

Objects and Contexts

In G-SWM, we assume that each frame of the video can be divided into objects and contexts. Each object is modeled with its own set of latents describing its state. Everything that is not an object is considered “context” and is modeled with a context latent variable.
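
As a rough illustration of this decomposition, the sketch below stores one latent vector per object plus a single context latent for each frame. The class and field names (ObjectLatent, SceneLatents, state, context) are hypothetical and chosen only for this example; the actual latent structure in G-SWM may be richer.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class ObjectLatent:
    """Latents describing the state of one object (hypothetical layout)."""
    state: Vector


@dataclass
class SceneLatents:
    """Latents for one frame: one entry per object plus one context latent."""
    objects: List[ObjectLatent] = field(default_factory=list)
    context: Vector = field(default_factory=list)


# A frame with two objects and a 3-dimensional context latent.
scene = SceneLatents(
    objects=[ObjectLatent(state=[0.1, 0.2]), ObjectLatent(state=[0.5, 0.4])],
    context=[0.0, 0.0, 0.0],
)
```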

Object-Centric Generation

G-SWM performs generation in an object-centric way. To generate a new frame, we first determine the new object latents and the context latent. These latents are then rendered into the foreground and background of the frame, respectively.
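
The final rendering step can be pictured as compositing rendered objects over a rendered background. The sketch below assumes standard back-to-front alpha compositing on small grayscale images; this is an illustrative stand-in, not necessarily the renderer used in G-SWM, and the names RenderedObject and composite are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 1]


@dataclass
class RenderedObject:
    appearance: Image  # per-pixel intensity of the object
    alpha: Image       # per-pixel opacity in [0, 1]


def composite(objects: List[RenderedObject], background: Image) -> Image:
    """Paint objects over the background, farthest object first."""
    frame = [row[:] for row in background]  # copy so the input is untouched
    for obj in objects:  # assumption: already sorted from farthest to nearest
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                a = obj.alpha[y][x]
                # Standard alpha blending of the object over what is behind it.
                row[x] = a * obj.appearance[y][x] + (1.0 - a) * pixel
    return frame


background = [[0.0, 0.0]]
far = RenderedObject(appearance=[[0.5, 0.5]], alpha=[[1.0, 0.0]])
near = RenderedObject(appearance=[[1.0, 1.0]], alpha=[[1.0, 0.0]])
frame = composite([far, near], background)
```

Because the nearer object is painted last, it hides the farther one wherever both are opaque, which is the simplest way to express occlusion in this kind of renderer.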

Versatile Propagation

The core of G-SWM is the Versatile Propagation (V-Prop) module, which integrates several sources of information from previous steps to generate the new object states: the previous object state, object-object interaction, and object-context interaction. At the same time, the hierarchical modeling of object dynamics allows for multimodal behavior.
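
To make the three information sources concrete, here is a minimal sketch of one propagation step. It assumes that each source contributes an additive update to the object state and that the new state is sampled from a Gaussian around the result. The function propagate and its arguments (self_fn, pair_fn, context_fn, noise_std) are hypothetical placeholders for learned networks; the sketch does not reproduce the hierarchical latent structure of the actual V-Prop module.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


def propagate(
    objects: List[Vector],
    context: Vector,
    self_fn: Callable[[Vector], Vector],
    pair_fn: Callable[[Vector, Vector], Vector],
    context_fn: Callable[[Vector, Vector], Vector],
    noise_std: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """Compute new object states from the previous states and the context."""
    rng = rng or random.Random(0)
    new_objects = []
    for i, obj in enumerate(objects):
        mean = self_fn(obj)                         # previous object state
        for j, other in enumerate(objects):
            if j != i:
                mean = add(mean, pair_fn(obj, other))  # object-object interaction
        mean = add(mean, context_fn(obj, context))  # object-context interaction
        # Sampling (rather than returning the mean) keeps the step stochastic.
        new_objects.append([m + rng.gauss(0.0, noise_std) for m in mean])
    return new_objects


# Toy stand-ins for the learned functions (assumptions for this example only).
next_states = propagate(
    objects=[[0.0], [1.0]],
    context=[10.0],
    self_fn=lambda o: o[:],
    pair_fn=lambda a, b: [b[0] - a[0]],
    context_fn=lambda o, c: [0.0],
)
```

With noise_std set to zero the step is deterministic, which makes the contribution of each term easy to inspect; a positive value yields different futures on repeated calls.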

Experiments

Interaction, Occlusion, and Scalability

Through experiments on four different bouncing ball datasets, we show that G-SWM can handle occlusion and interaction jointly, while being scalable.

Multimodal Uncertainty and Situation Awareness

The maze experiment demonstrates G-SWM's modeling of multimodal uncertainty and situation awareness:

  1. Multimodal uncertainty: in generation, the agents explore all possible paths.

  2. Situation awareness: the agents correctly follow the corridor of the maze.

3D Interactions

Through experiments on a more realistic 3D environment, we show that G-SWM can model 3D collisions and occlusions.

More Visualizations

Bouncing Balls

Maze

3D Interactions