SIMONe
View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition
Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia Creswell, Matthew Botvinick, Alexander Lerchner, Christopher P. Burgess
Animation 1: Reconstructed crossovers showing SIMONe object latents re-composed with frame latents
(from four scene videos each) in a matrix of combinations (fully unsupervised)
CATER (moving camera)
Objects Room 9
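The crossovers in Animation 1 rely on SIMONe's factorization of each video into per-object latents (shared across the whole clip) and per-frame latents (capturing view and other global, time-varying state). Below is a minimal sketch of the re-composition under assumed interfaces: the encode/decode functions, slot count K, latent size D, and resolution are illustrative stand-ins for a trained model, not SIMONe's actual API.

import numpy as np

K, T, D = 16, 10, 32          # object slots, frames, latent size (illustrative)
H, W = 64, 64                 # frame resolution (illustrative)

def encode(video):
    # Stand-in for the encoder: K per-object latents summarizing the whole clip,
    # and T per-frame latents capturing view / global state.
    rng = np.random.default_rng(abs(hash(video.tobytes())) % (2**32))
    return rng.normal(size=(K, D)), rng.normal(size=(T, D))

def decode(object_latents, frame_latents):
    # Stand-in for the pixel-wise mixture decoder: any (object, frame) pairing
    # can be decoded, which is what makes the crossover matrix possible.
    return np.zeros((frame_latents.shape[0], H, W, 3))

videos = [np.random.rand(T, H, W, 3) for _ in range(4)]   # four scene clips
latents = [encode(v) for v in videos]

# Matrix of combinations: objects from clip i composed with frames (views) from clip j.
crossovers = [[decode(latents[i][0], latents[j][1]) for j in range(4)] for i in range(4)]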
Animation 2: Novel view synthesis from limited context
(using view-supervised variant, SIMONe-VS)
Playroom
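The view synthesis in Animation 2 can be thought of as inferring object latents from a few context views and decoding them at query camera poses. The sketch below is only an assumed reading of that setup: encode_objects, decode, and the 6-dimensional pose vectors are hypothetical names and shapes, with random stand-ins in place of a trained SIMONe-VS model.

import numpy as np

K, D, H, W = 16, 32, 64, 64   # object slots, latent size, resolution (illustrative)

def encode_objects(context_frames, context_poses):
    return np.random.randn(K, D)                 # stand-in for the object encoder

def decode(object_latents, poses):
    return np.zeros((poses.shape[0], H, W, 3))   # stand-in for the decoder

context_frames = np.random.rand(4, H, W, 3)      # limited context: four views
context_poses = np.random.randn(4, 6)            # e.g. camera position + orientation
query_poses = np.random.randn(8, 6)              # unseen viewpoints to render

object_latents = encode_objects(context_frames, context_poses)
novel_views = decode(object_latents, query_poses)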
Animation 3: Instance segmentation (fully unsupervised)
CATER (moving camera). Note the object that is occluded for some frames in example 2 (the distant yellow sphere); SIMONe tracks it stably. Moreover, SIMONe assigns each object's shadows (up to three, due to multiple lights) to the same segment as the object.
Playroom. Number of unique foreground objects across the sequence in each example: 28, 15, and 29.
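The segmentations in Animation 3 are read directly off the model's pixel-wise mixture over object slots: each pixel is assigned to the slot with the largest mixture weight, with no labels involved. A minimal sketch, with random logits standing in for a trained decoder's outputs and illustrative shapes:

import numpy as np

K, T, H, W = 16, 10, 64, 64
# Stand-in for per-slot mixture logits from a trained decoder: one value per
# object slot, frame, and pixel.
mixture_logits = np.random.randn(K, T, H, W)

# Unsupervised segmentation: argmax over slots. Because each object keeps the
# same slot across frames, the resulting masks are temporally consistent, which
# is why occluded objects (and their shadows) stay in the same segment.
segmentation = mixture_logits.argmax(axis=0)    # (T, H, W) slot ids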
Animation 4: Learnt representations, visualized by manipulating single latent attributes (fully unsupervised)
Object latent attributes
Frame latent attributes
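The manipulations in Animation 4 are latent traversals: one dimension of a single object or frame latent is swept while everything else is held fixed, and the video is re-decoded at each value. A minimal sketch, reusing the same hypothetical decode interface and random stand-in latents as above:

import numpy as np

K, T, D, H, W = 16, 10, 32, 64, 64
object_latents = np.random.randn(K, D)       # stand-ins for encoded latents
frame_latents = np.random.randn(T, D)

def decode(object_latents, frame_latents):
    return np.zeros((frame_latents.shape[0], H, W, 3))   # decoder stand-in

slot, dim = 3, 7                             # which object latent and attribute to vary
frames_per_value = []
for value in np.linspace(-2.0, 2.0, num=9):
    z = object_latents.copy()
    z[slot, dim] = value                     # manipulate a single latent attribute
    frames_per_value.append(decode(z, frame_latents))
# Sweeping a frame-latent dimension instead varies global properties such as viewpoint.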