We propose a new temporal generative model, called Generative Structured World Models (G-SWMs), for unsupervised learning of object-centric state representation and efficient future simulation. G-SWM not only unifies the key abilities of previous models in a principled framework but also achieves multimodal uncertainty and situated behavior. The main contributions of this paper are as follows:
Integration of important abilities of previous models: interaction, occlusion, scalability, background modeling, etc.
Two crucial new abilities: multimodal uncertainty and situation awareness.
Objects and Contexts
In G-SWM, we assume that each frame of the video can be divided into objects and contexts. Each object is modeled with a set of latents describing its state. Everything non-object is considered "context" and modeled with a single context latent variable.
G-SWM performs generation in an object-centric way. To generate a new frame, we first determine the new object latents and the context latent. These latents are then rendered into the foreground and background of the frame.
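The generation step above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `render_object` and `render_context` stand in for learned decoder networks, and the frame size, latent dimension, and alpha compositing scheme are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 16   # frame size (illustrative)
N_OBJ = 3    # number of object slots (illustrative)
Z_DIM = 4    # latent dimension (illustrative)

def render_object(z):
    """Toy stand-in for a learned object decoder: latent -> RGB patch + alpha mask."""
    rgb = np.full((H, W, 3), np.tanh(z[:3]))              # constant color per object
    alpha = np.full((H, W, 1), 1 / (1 + np.exp(-z[3])))   # scalar opacity in (0, 1)
    return rgb, alpha

def render_context(z_ctx):
    """Toy stand-in for a learned background decoder."""
    return np.full((H, W, 3), np.tanh(z_ctx[:3]))

# 1. Determine the new object latents and the context latent.
z_objs = rng.normal(size=(N_OBJ, Z_DIM))
z_ctx = rng.normal(size=Z_DIM)

# 2. Render the background, then composite each foreground object over it.
frame = render_context(z_ctx)
for z in z_objs:
    rgb, alpha = render_object(z)
    frame = alpha * rgb + (1 - alpha) * frame

print(frame.shape)  # (16, 16, 3)
```

In the real model the decoders are neural networks and the latents come from learned priors/posteriors, but the structure is the same: per-object latents plus one context latent, rendered and composited into a frame.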
At the core of G-SWM, the Versatile Propagation (V-Prop) module integrates information from previous steps and generates the new object states. This information includes the previous object state, object-object interaction, and object-context interaction. At the same time, hierarchical modeling of object dynamics allows for multimodal behavior.
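A minimal sketch of one propagation step, assuming toy stand-ins for the learned components: `interaction` replaces a relational network over object pairs, `context_effect` replaces the learned object-context interaction, and the sampled latent `z` loosely mirrors the hierarchical sample-then-update dynamics that give multimodality. None of the function names or dimensions come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # object state dimension (illustrative)

def interaction(s_i, others):
    """Toy stand-in for a relational net: sum of pairwise features."""
    return sum(np.tanh(s_i - s_j) for s_j in others) if others else np.zeros(D)

def context_effect(s_i, ctx):
    """Toy stand-in for the object-context interaction."""
    return np.tanh(s_i * ctx)

def v_prop_step(states, ctx):
    """One propagation step: each new state conditions on the previous state,
    object-object interaction, and object-context interaction. Sampling a
    per-object latent first makes the update stochastic, so repeated rollouts
    from the same state can diverge (multimodal behavior)."""
    new_states = []
    for i, s in enumerate(states):
        others = [t for j, t in enumerate(states) if j != i]
        z = rng.normal(size=D)  # stochastic latent -> multimodal rollouts
        s_new = s + 0.1 * (interaction(s, others) + context_effect(s, ctx) + z)
        new_states.append(s_new)
    return new_states

states = [rng.normal(size=D) for _ in range(3)]
ctx = rng.normal(size=D)
states = v_prop_step(states, ctx)
print(len(states))  # 3
```

Because each update sees all other objects and the context, the same module covers interaction, occlusion handling, and situation awareness; scalability follows from applying the identical update to every object slot.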
Interaction, Occlusion, and Scalability
Through experiments on four different bouncing-ball datasets, we show that G-SWM can handle occlusion and interaction jointly while remaining scalable to many objects.
Multimodal Uncertainty and Situation Awareness
The maze experiment demonstrates G-SWM's modeling of multimodal uncertainty and situation awareness:
Multimodal uncertainty: in generation, the agents explore all possible paths.
Situation awareness: the agents correctly follow the corridor of the maze.
Through an experiment on a more realistic 3D environment, we show that G-SWM can model 3D collisions and occlusions.