Human perception naturally decomposes a scene into objects and background. Our model, SPACE, provides a unified probabilistic framework for modeling scenes with multiple objects and complex backgrounds. Combining the best of previous models (i.e., mixture-scene and spatial-attention models), SPACE explicitly provides a factorized representation for each foreground object while also decomposing the background into segments of complex morphology. With the proposed parallel-spatial attention, SPACE resolves the scalability problem of previous methods, making the model applicable to scenes with a much larger number of objects without performance degradation.
When trained on Atari games with colorful, complex backgrounds, SPACE correctly extracts foreground objects with bounding boxes, while providing a clean background segmentation.
Even when the background is highly dynamic, SPACE still produces a meaningful segmentation.
Handling Many Objects
With parallel-spatial attention instead of sequential processing, SPACE scales well to games with a large number of objects.
We further tested SPACE on two synthetic 3D room datasets with different numbers of objects. Objects in these datasets are more diverse, varying in size, color and shape in a regular way. Occlusions also occur frequently. SPACE successfully handles these difficulties, producing accurate bounding boxes and clear background segments.
3D-Room-Small and 3D-Room-Large
SPACE is the fastest of all the models we implemented, thanks to its parallel inference design. Even with a large decomposition capacity (i.e. maximum number of components) such as 24×24, SPACE remains fast compared to the other models.
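To make the idea of decomposition capacity concrete, here is a minimal NumPy sketch of how per-cell predictions on a grid could be decoded into bounding boxes for all cells in one vectorized step, with no loop over objects. It assumes the capacity corresponds to an H×W grid of cells, each proposing at most one object. The function name `decode_boxes`, the latent names `z_pres`, `z_shift` and `z_scale`, and the exact parameterization are illustrative assumptions, not the released SPACE implementation.

```python
import numpy as np

def decode_boxes(z_pres, z_shift, z_scale, threshold=0.5):
    """Decode per-cell predictions into image-space boxes for all cells at once.

    Hypothetical parameterization (an assumption for illustration):
      z_pres:  (H, W)    presence probability per cell
      z_shift: (H, W, 2) center offset (x, y) relative to the cell, in [-1, 1]
      z_scale: (H, W, 2) box width and height as fractions of the image
    Returns an (N, 4) array of (cx, cy, w, h) in normalized [0, 1] coordinates.
    """
    H, W = z_pres.shape
    # Normalized centers of every grid cell, computed in one shot.
    ys, xs = np.meshgrid((np.arange(H) + 0.5) / H,
                         (np.arange(W) + 0.5) / W,
                         indexing="ij")
    centers = np.stack([xs, ys], axis=-1)            # (H, W, 2)
    cell_size = np.array([1.0 / W, 1.0 / H])         # (2,)
    # A shift of +/-1 moves the box center to the edge of its cell.
    box_centers = centers + 0.5 * cell_size * z_shift
    boxes = np.concatenate([box_centers, z_scale], axis=-1)  # (H, W, 4)
    # Keep only cells that claim an object.
    return boxes[z_pres > threshold]

# Example: a 24x24 grid (up to 576 proposals) with a single active cell.
z_pres = np.zeros((24, 24))
z_pres[3, 5] = 0.9
boxes = decode_boxes(z_pres, np.zeros((24, 24, 2)), np.full((24, 24, 2), 0.1))
print(boxes.shape)
```

Because every cell is processed by the same array operations, the cost of this decoding step does not depend on how many objects are actually present, only on the grid size.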
We compare our model to SPAIR, IODINE and GENESIS. SPAIR works reasonably well, but the background is encoded in a single vector and thus not disentangled. IODINE and GENESIS never distinguish between foreground and background. SPACE provides the most comprehensive and disentangled scene representation.
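The separation between foreground objects and background segments described above can be illustrated with a small compositing sketch. It assumes the rendered foreground comes with a per-pixel foreground mask `alpha`, and that the background is a pixel-wise mixture of `K` components weighted by a softmax over per-component logits. The function `compose_scene` and its arguments are hypothetical names chosen for illustration, and the code is a sketch of the general idea, not the authors' implementation.

```python
import numpy as np

def compose_scene(fg, alpha, bg_components, bg_logits):
    """Blend a foreground rendering with a mixture-of-components background.

    fg:            (H, W, 3)    rendered foreground objects
    alpha:         (H, W, 1)    foreground mask in [0, 1]
    bg_components: (K, H, W, 3) appearance of each background segment
    bg_logits:     (K, H, W, 1) unnormalized per-pixel segment scores
    Returns the composed image (H, W, 3) and the mixing weights (K, H, W, 1).
    """
    # Numerically stable softmax over the K background components.
    logits = bg_logits - bg_logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Each pixel's background is a weighted sum of the K segments.
    background = (weights * bg_components).sum(axis=0)
    # Foreground covers the background wherever alpha is high.
    image = alpha * fg + (1.0 - alpha) * background
    return image, weights
```

Taking the per-pixel argmax of `weights` over the `K` components gives a background segmentation map, while `alpha` marks the pixels explained by foreground objects.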