SIMONe

View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition

Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia Creswell
Matthew Botvinick, Alexander Lerchner, Christopher P. Burgess

Animation 1: Reconstructed crossovers showing SIMONe object latents re-composed with frame latents
(from four scene videos each) in a matrix of combinations (fully unsupervised)

CATER (moving camera)

Objects Room 9
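
The crossover matrix in Animation 1 can be illustrated with a minimal sketch. It assumes each video yields one set of per-object latents (shared across time) and one set of per-frame latents (shared across objects), and that a decoder consumes one object latent and one frame latent at a time. The function name `crossover_grid` and the toy latent values are hypothetical and are not taken from the SIMONe code; the decoder itself is omitted.

```python
from itertools import product


def crossover_grid(object_latents, frame_latents):
    """Pair every video's object latents with every video's frame latents.

    object_latents[i]: list of K per-object vectors inferred from video i.
    frame_latents[j]:  list of T per-frame vectors inferred from video j.
    Returns a dict keyed by (i, j); each value is a T x K table of
    (object_vector, frame_vector) pairs, i.e. the inputs a decoder would need.
    """
    grid = {}
    for i, j in product(range(len(object_latents)), range(len(frame_latents))):
        grid[(i, j)] = [
            [(obj, frame) for obj in object_latents[i]]
            for frame in frame_latents[j]
        ]
    return grid


# Toy stand-ins (hypothetical values): 4 videos, 3 objects and 2 frames each.
objects = [[[float(10 * v + k)] for k in range(3)] for v in range(4)]
frames = [[[float(100 * v + t)] for t in range(2)] for v in range(4)]
grid = crossover_grid(objects, frames)
```

With four source videos this gives a 4 x 4 matrix of combinations: the diagonal entries (i == j) correspond to ordinary reconstructions, and the off-diagonal entries show one video's objects rendered with another video's frame latents.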

Animation 2: Novel view synthesis from limited context
(using view-supervised variant, SIMONe-VS)

Playroom

Playroom

Animation 3: Instance segmentation (fully unsupervised)

CATER (moving camera). Note the distant yellow sphere in example 2, which is occluded for some frames; SIMONe nevertheless tracks it stably. Moreover, SIMONe assigns each object's shadows (up to three, due to multiple lights) to the same segment as the object itself.

Playroom. Number of unique foreground objects across the sequence in each example: 28, 15, and 29.

Animation 4: Learnt representations, visualized by manipulating single latent attributes (fully unsupervised)

Object latent attributes

Frame latent attributes
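
Manipulating a single latent attribute, as in Animation 4, amounts to a latent traversal: one dimension of one latent vector is swept over a range of values while all other dimensions stay fixed, and each variant is decoded. The sketch below shows only the sweep. The helper name `traverse_attribute`, the latent size, and the value range are illustrative assumptions, not details from the SIMONe implementation.

```python
def traverse_attribute(latent, index, values):
    """Return copies of `latent` with dimension `index` set to each value."""
    variants = []
    for v in values:
        changed = list(latent)  # copy so the original latent is untouched
        changed[index] = v
        variants.append(changed)
    return variants


# Hypothetical 3-D latent; sweep its second attribute over three values.
latent = [0.3, -1.2, 0.8]
sweep = traverse_attribute(latent, index=1, values=[-2.0, 0.0, 2.0])
```

Applying the sweep to an object latent would be expected to change one object across all frames, while applying it to a frame latent would be expected to change one frame across all objects.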