This site is a companion to the paper "Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking", which introduces Spatially Invariant, Label-free Object Tracking (SILOT), an architecture for unsupervised object tracking inspired by Attend, Infer, Repeat (AIR) and Sequential Attend, Infer, Repeat (SQAIR). SILOT scales to large scenes containing many objects far better than these previous architectures, a capability made possible by extensive use of spatially invariant computations such as convolution and spatial attention. Here we present a number of videos showing object tracking by trained SILOT networks. In supervised object tracking, networks are given bounding box annotations at training time; in the unsupervised setting explored here, no such annotations are provided.
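To make the idea of a spatially invariant computation concrete, here is a minimal NumPy sketch of glimpse extraction via bilinear sampling, one common form of spatial attention. This is an illustrative sketch only, not SILOT's actual implementation; the function name and parameters are our own. The key property is that the same sampling computation is applied no matter where in the scene an object's bounding box lies.

```python
import numpy as np

def extract_glimpse(image, cy, cx, h, w, out_h=3, out_w=3):
    """Bilinearly sample an out_h x out_w glimpse from a 2D `image`,
    centered at (cy, cx) with height h and width w, all in pixels.
    Because the computation depends only on the box parameters and not
    on absolute position in any learned weights, it is spatially
    invariant: moving the object moves the box, nothing else changes."""
    ys = np.linspace(cy - h / 2.0, cy + h / 2.0, out_h)
    xs = np.linspace(cx - w / 2.0, cx + w / 2.0, out_w)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    # Integer corners for bilinear interpolation, clipped to stay in bounds.
    y0 = np.clip(np.floor(gy).astype(int), 0, image.shape[0] - 2)
    x0 = np.clip(np.floor(gx).astype(int), 0, image.shape[1] - 2)
    fy, fx = gy - y0, gx - x0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x0 + 1]
    bot = (1 - fx) * image[y0 + 1, x0] + fx * image[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bot

# A 10x10 test image whose value at (row, col) is 10*row + col, so
# sampled values are easy to check by eye.
image = np.arange(100, dtype=float).reshape(10, 10)
glimpse = extract_glimpse(image, cy=5, cx=5, h=2, w=2)
```

In an architecture like SILOT, a glimpse such as this would be fed to a small network shared across all objects, so the per-object cost does not grow with scene size.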
In all video pairs below, the ground truth video is shown on the left and SILOT's reconstruction is shown on the right. On both videos, bounding boxes of objects proposed by SILOT are superimposed. Object identity according to the network is represented by box color. Objects discovered in the current frame have dashed boxes, while objects propagated from the previous frame have solid boxes. The network is trained on videos containing at most 8 frames, whereas here we apply it to videos of up to 100 frames. Notice that the network is able to maintain object identities over time, even when objects become heavily occluded by other objects.