STEVE: Slot-TransformEr for VidEos
Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations, and thereby promises to resolve many critical limitations of traditional single-vector representations, such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple, synthetic scenes and not with complex, naturalistic images or videos. In this paper, we propose Slot-Transformer for Videos, or STEVE, an unsupervised model for object-centric learning in videos. We test a minimal architecture that combines the SLATE decoder with a standard slot-level recurrence model. Moreover, the learning objective is simply to reconstruct the observation. Our experimental results on various complex and naturalistic videos show that the proposed architecture significantly outperforms previous state-of-the-art baseline models.
Figure 1. Left: The SLATE auto-encoder combines the representational bottleneck of slots with an auto-regressive transformer decoder and demonstrates object discovery in visually complex images. However, whether this framework can handle complex and naturalistic videos remains unexplored. Middle: Conventional object-centric video models such as Slot Attention Video handle videos by applying slot attention recurrently over the frames and using a pixel-mixture decoder to reconstruct them. However, they struggle with complex and naturalistic videos. Right: Our model, Slot-Transformer for Videos, provides a simple and minimal architecture that leverages an auto-regressive transformer decoder and can effectively handle complex and naturalistic videos.
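To make "applying slot attention recurrently on the frames" concrete, here is a heavily simplified sketch in NumPy. It is not the STEVE or SAVi implementation: the learned projections, the recurrent update cell, and any transition module between frames are omitted, and the names `slot_update` and `run_over_video` are hypothetical. It only illustrates that the slots from one frame are carried over as the starting point for the next.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_update(slots, feats, eps=1e-8):
    """One simplified attention step (assumption: no learned weights).

    slots: (K, D) current slot vectors; feats: (N, D) frame features.
    """
    d = slots.shape[1]
    logits = feats @ slots.T / np.sqrt(d)           # (N, K)
    attn = softmax(logits, axis=1)                  # slots compete for each feature
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalize per slot
    return weights.T @ feats                        # (K, D) weighted mean of features

def run_over_video(frame_feats, init_slots):
    """Carry slots across frames. frame_feats: (T, N, D)."""
    slots = init_slots
    history = []
    for feats in frame_feats:
        slots = slot_update(slots, feats)           # previous slots seed this frame
        history.append(slots)
    return np.stack(history)                        # (T, K, D)

rng = np.random.default_rng(0)
T, N, K, D = 5, 16, 3, 8
frame_feats = rng.normal(size=(T, N, D))
init_slots = rng.normal(size=(K, D))
slot_history = run_over_video(frame_feats, init_slots)
```

Because each slot is initialized from its value in the previous frame, a slot tends to keep attending to the same region over time, which is what makes the per-slot segments temporally consistent.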
Unsupervised Video Segmentation in Complex and Naturalistic Videos
Here, we visualize a comparison of STEVE with the baseline Slot Attention Video. For Slot Attention Video, we use the fully unsupervised version, which is trained using only the image reconstruction objective, without optical flow or first-frame label information. For simplicity, we refer to this unsupervised Slot Attention Video as SAVi. We focus on a comparison with SAVi because our model is most similar to it, with one main difference: our model uses transformer-based decoding while SAVi uses mixture-based decoding.
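As a rough illustration of what mixture-based decoding means, the sketch below composites a frame from per-slot outputs in NumPy. It assumes each slot has already been decoded into an RGB image and a map of mask logits; the function name `mixture_composite` and the random inputs are hypothetical stand-ins, not the SAVi code.

```python
import numpy as np

def mixture_composite(rgb, alpha_logits):
    """Combine per-slot decodings into one frame.

    rgb: (K, H, W, 3) per-slot colour predictions.
    alpha_logits: (K, H, W) per-slot unnormalized mask logits.
    Returns the reconstructed frame (H, W, 3) and the masks (K, H, W).
    """
    # Softmax over the slot axis so the masks sum to 1 at every pixel.
    shifted = alpha_logits - alpha_logits.max(axis=0, keepdims=True)
    masks = np.exp(shifted)
    masks /= masks.sum(axis=0, keepdims=True)
    # Each pixel is a mask-weighted mixture of the slot colours.
    frame = (masks[..., None] * rgb).sum(axis=0)
    return frame, masks

rng = np.random.default_rng(0)
K, H, W = 4, 8, 8
rgb = rng.random((K, H, W, 3))            # stand-in for decoder outputs
alpha_logits = rng.normal(size=(K, H, W))
frame, masks = mixture_composite(rgb, alpha_logits)
```

In this scheme every pixel is a weighted average of independently decoded slot images. A transformer decoder instead predicts the frame auto-regressively while attending to all slots, so no such per-pixel mixture is formed.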
In these visualizations, the rows (from top to bottom) show: the input video, the ground-truth object segmentation, the predicted object segmentation, the predicted object segmentation overlaid on the input video, and finally, for better interpretability, a separate visualization of each predicted segment, obtained by multiplying each segment mask with the input video.
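The last row can be reproduced with a simple masking step. The following is a minimal sketch, assuming the predicted segmentation is available as an integer id per pixel; the function name `per_segment_views` and the array layouts are assumptions made for illustration.

```python
import numpy as np

def per_segment_views(video, seg, num_segments):
    """Isolate each predicted segment in the input video.

    video: (T, H, W, 3) float frames.
    seg: (T, H, W) integer segment id per pixel, in [0, num_segments).
    Returns (K, T, H, W, 3): one masked copy of the video per segment.
    """
    ids = np.arange(num_segments).reshape(-1, 1, 1, 1)
    masks = (seg[None] == ids).astype(video.dtype)   # (K, T, H, W) binary masks
    return masks[..., None] * video[None]            # zero out other segments

rng = np.random.default_rng(0)
T, H, W, K = 3, 6, 6, 4
video = rng.random((T, H, W, 3))
seg = rng.integers(0, K, size=(T, H, W))
views = per_segment_views(video, seg, K)
```

Since every pixel belongs to exactly one segment, summing the per-segment views over the segment axis recovers the original video.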
MOVi-E
MOVi-Tex
CATERTex
MOVi-Solid
MOVi-D
Youtube Traffic
Youtube Aquarium