SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

{Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun}*, Gautam Singh, Fei Deng, Jindong Jiang, Sungjin Ahn

Rutgers University & Zhejiang University

*Names in {___} indicate equal contribution



Human perception naturally decomposes a scene into objects and background. Our model, SPACE, provides a unified probabilistic framework for modeling scenes with multiple objects and complex backgrounds. Combining the best of previous models (i.e., mixture-scene and spatial-attention models), SPACE explicitly provides a factorized representation for each foreground object while also decomposing the background into segments of complex morphology. With the proposed parallel-spatial attention, SPACE resolves the scalability problem of previous methods, making the model applicable to scenes with a much larger number of objects without performance degradation.
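To make the foreground/background decomposition concrete, here is a minimal sketch of how a scene could be composed from one foreground layer and several background segments. It assumes a pixel-wise mixture: a foreground mask `alpha` blends the foreground with a weighted sum of background components. The function name `compose_scene`, the array shapes, and the toy values are all illustrative assumptions, not the released SPACE code.

```python
import numpy as np

def compose_scene(fg, alpha, bg_components, bg_weights):
    """Blend a foreground layer with a mixture of background segments.

    Assumed (hypothetical) shapes:
      fg:            (H, W, C)    rendered foreground objects
      alpha:         (H, W, 1)    foreground mask in [0, 1]
      bg_components: (K, H, W, C) one image per background segment
      bg_weights:    (K, H, W, 1) per-pixel weights that sum to 1 over K
    """
    # Weighted sum over the K background segments.
    background = (bg_weights * bg_components).sum(axis=0)
    # Pixel-wise mixture of foreground and background.
    return alpha * fg + (1.0 - alpha) * background

# Toy example: a 2x2 RGB scene with two background segments.
H, W, C, K = 2, 2, 3, 2
fg = np.ones((H, W, C))
alpha = np.full((H, W, 1), 0.2)
bg_components = np.stack([np.zeros((H, W, C)), np.full((H, W, C), 0.5)])
bg_weights = np.full((K, H, W, 1), 0.5)

scene = compose_scene(fg, alpha, bg_components, bg_weights)
```

In this toy setup every pixel of `scene` is 0.2 * 1.0 + 0.8 * 0.25 = 0.4, since the two background segments average to 0.25.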

Atari Games


When trained on Atari games with colorful, complex backgrounds, SPACE correctly extracts foreground objects with bounding boxes, while providing a clean background segmentation.

Dynamic Background

Even when the background is highly dynamic, SPACE still produces a meaningful segmentation.

Handling Many Objects

With parallel-spatial attention instead of sequential processing, SPACE scales well to games with a large number of objects.
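The scaling argument can be illustrated with a toy sketch. Assuming the image is divided into a grid of cells and each cell is encoded independently of the others, all cells can be handled in one batched operation rather than one step at a time. The linear "encoder", the helper names, and the sizes below are hypothetical stand-ins, not the SPACE architecture. In sequential models each step may also depend on earlier steps, which is what prevents this kind of batching.

```python
import numpy as np

def split_into_cells(image, grid):
    """Split an (H, W, C) image into grid*grid flattened, non-overlapping cells."""
    h, w, c = image.shape
    ch, cw = h // grid, w // grid
    cells = image[:ch * grid, :cw * grid].reshape(grid, ch, grid, cw, c)
    # Bring the two grid axes to the front, then flatten each cell to a vector.
    return cells.transpose(0, 2, 1, 3, 4).reshape(grid * grid, ch * cw * c)

def encode_sequential(cells, weights):
    # One cell per step: the number of steps grows with the number of cells.
    return np.stack([cell @ weights for cell in cells])

def encode_parallel(cells, weights):
    # All cells in a single batched matrix product.
    return cells @ weights

rng = np.random.default_rng(0)
image = rng.random((48, 48, 3))
cells = split_into_cells(image, grid=4)             # 16 cells of 12*12*3 values
weights = rng.standard_normal((cells.shape[1], 8))  # toy 8-dimensional latent per cell

seq = encode_sequential(cells, weights)
par = encode_parallel(cells, weights)
```

Both functions compute the same per-cell latents. The batched version simply removes the Python-level loop, which is the property that lets a parallel design keep its cost manageable as the grid grows.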

3D Room

We further tested SPACE on two synthetic 3D room datasets with different numbers of objects. Objects in these datasets are more diverse, varying in size, color and shape in a regular way. Occlusions also occur frequently. SPACE successfully handles these difficulties, producing accurate bounding boxes and clear background segments.

3D-Room-Small and 3D-Room-Large



Thanks to its parallel inference design, SPACE is the fastest of all the models we have implemented. Even with a large decomposition capacity (i.e., maximum number of components) such as 24×24, SPACE is still fast compared to other models.

Scene Representation

We compare our model to SPAIR, IODINE and GENESIS. SPAIR works reasonably well, but it encodes the background in a single vector, so the background is not disentangled. IODINE and GENESIS do not distinguish between foreground and background. SPACE provides the most comprehensive and disentangled scene representation.

More Examples

Atari Games

3D Room