SCALOR: Scalable Object-Oriented Sequential Generative Models

We introduce SCALOR, a generative approach to SCALable sequential Object-oriented Representation learning. With the proposed spatially-parallel attention and proposal-rejection mechanism, SCALOR can track orders of magnitude more objects than current state-of-the-art models. In addition, SCALOR contains a "Background Model" to handle scenes that contain a complex moving background as well as many moving foreground objects. SCALOR is the first completely unsupervised method capable of performing reasonably well in realistic natural scenes containing several tens of moving objects.


  • SCALOR significantly improves tracking scalability (by two orders of magnitude) compared to state-of-the-art models.
  • It is applicable to nearly a hundred objects while being comparable in computation time to SQAIR, which scales only to a few objects.
  • The propagation-discovery process is parallelized by introducing the propose-reject model, reducing the time complexity from O(N) to O(1).
  • SCALOR can model scenes with a complex moving background.
  • SCALOR is the first probabilistic model shown to work on natural images at a significant level of complexity: tens of objects together with a background.
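The propose-reject idea in the list above can be illustrated with a minimal sketch. It is not SCALOR's implementation: it assumes that objects are axis-aligned boxes (x1, y1, x2, y2) and that a newly proposed object is rejected when its intersection-over-union with any already-propagated object reaches a hypothetical threshold. The function names, the box format, the IoU criterion and the threshold value are all assumptions made for illustration; this page does not specify the rejection rule.

```python
import numpy as np


def pairwise_iou(a, b):
    """IoU between every box in a (M, 4) and every box in b (K, 4).

    Boxes are (x1, y1, x2, y2) with positive area (an assumption of this sketch).
    """
    a = a[:, None, :]  # (M, 1, 4)
    b = b[None, :, :]  # (1, K, 4)
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter)


def propose_reject(proposals, propagated, iou_threshold=0.5):
    """Return a boolean mask marking which proposals are accepted.

    All proposals are checked against all propagated objects in one
    vectorized step, so no proposal waits on the decision for another.
    iou_threshold is a hypothetical value, not taken from the source.
    """
    if len(propagated) == 0:
        return np.ones(len(proposals), dtype=bool)
    overlap = pairwise_iou(proposals, propagated)  # (M, K)
    return overlap.max(axis=1) < iou_threshold


# Three proposed boxes; the first coincides with an already-tracked object.
proposals = np.array([[0, 0, 2, 2], [5, 5, 7, 7], [10, 10, 12, 12]], dtype=float)
propagated = np.array([[0, 0, 2, 2], [20, 20, 22, 22]], dtype=float)
accepted = propose_reject(proposals, propagated)
```

In this example the first proposal duplicates a propagated object and is rejected, while the other two are kept as newly discovered objects. Because the decision for each proposal depends only on the propagated set, the check can run for all proposals at once, which is the sense in which a sequential O(N) loop is avoided.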

Object Tracking in Scenes With Complex Background

On the left, we see SCALOR's performance on a scene containing 50-60 moving DSprite objects in the foreground in addition to a complex moving background. This experiment is intended to evaluate SCALOR's tracking performance in scenes with complex dynamics present in the background.

SCALOR is able to correctly separate moving background from the foreground objects while tracking the objects accurately and consistently.

From top left to bottom right: (1) Original scene, (2) Reconstructed scene, (3) Inferred dynamic background, (4) Tracked objects, (5) Bounding boxes of discovery step, (6) Bounding boxes of propagation steps.

Tracking MNIST Digits

We also evaluate tracking performance on the "Crowded Moving MNIST Dataset", which contains 40-60 moving MNIST digits in each frame. This dataset is more challenging than moving DSprite shapes because MNIST digits are more fine-grained. As shown, SCALOR also performs well on this task.

Very High Density Environment

To test the limits of SCALOR, we evaluate tracking performance in an extreme case where a very high density of objects (80-100 objects) is present in the scene. Interestingly, SCALOR can identify and keep track of the objects even though the number of objects is larger than the number of its cells. As a result, discovery of certain objects is delayed to time-step t=2.
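The delay to t=2 can be read as a simple capacity argument. The sketch below assumes that each cell can discover at most one new object per time-step and uses a hypothetical 8x8 grid; neither the per-cell limit nor the grid size is stated on this page, so both are labeled as assumptions.

```python
import math


def min_discovery_steps(num_objects: int, grid_h: int, grid_w: int) -> int:
    """Lower bound on the time-steps needed to discover all objects.

    Assumes each of the grid_h * grid_w cells discovers at most one
    new object per time-step (an assumption of this sketch).
    """
    cells = grid_h * grid_w
    if cells <= 0:
        raise ValueError("grid must contain at least one cell")
    return math.ceil(num_objects / cells)


# Hypothetical 8x8 grid (64 cells) facing 100 objects in the first frame.
steps_needed = min_discovery_steps(100, 8, 8)
```

Under these assumptions, 100 objects exceed the 64 available cells, so at least two time-steps are needed before every object has been discovered, which is consistent with some discoveries being delayed to t=2.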

Frequent Discovery in Dense Environments

This experiment evaluates the ability to discover many newly introduced objects across several time-steps. This is important because in many applications only the key-frames of a video, i.e. the frames at which significant changes happen, are available. In the object tracking domain, an example of a key-frame is one where many objects are introduced in the same frame due to a sudden change of the observer’s viewpoint. Here, 10-15 objects are introduced at time-steps 1, 7, 14 and 21. In such scenarios, SCALOR can discover many objects while consistently tracking the previous objects, thanks to the design of the discovery module.

Natural Scenes

We also evaluate SCALOR's performance on natural real-world footage from a CCTV camera recording pedestrian movement in Grand Central Station. SCALOR is able to decompose the scene into foreground objects and a background image, and succeeds in accurate pedestrian detection and tracking.

From top left to bottom right: (1) Original scene, (2) Inferred segmentation mask with colored IDs, (3) Inferred location of each object, (4) Reconstructed scene, (5) Inferred background, (6) Extracted object trajectories.

Future time prediction: SCALOR is shown the first 5 time-steps (without the margin) and conditionally generates the next 5 time-steps (marked by the red margin). Interestingly, SCALOR is able to generate consistent trajectories while also introducing new objects into the scene.

From top left to bottom right: (1) Original scene, (2) Inferred/Generated segmentation mask with colored IDs, (3) Inferred/Generated location of each object, (4) Reconstructed/Generated scene, (5) Inferred/Generated background, (6) Extracted/Generated object trajectories.

More Examples

Grand Central Station Dataset

Inference

Conditional Generation