Generative World Models with Scalable Object Representations

Jindong Jiang*, Sepehr Janghorbani*, Gerard de Melo, Sungjin Ahn



We introduce SCALOR, a generative approach to SCALable sequential Object-oriented Representation learning. With the proposed spatially parallel attention and proposal-rejection mechanisms, SCALOR can track orders of magnitude more objects than state-of-the-art models. SCALOR can be seen as a first step toward holistic unsupervised perception, performing detection, segmentation, tracking, and generation in a single model without supervision.


  • SCALOR improves tracking scalability by two orders of magnitude compared to state-of-the-art models.
  • It handles nearly a hundred objects while remaining significantly faster than state-of-the-art models in computation time.
  • The propagation–discovery process is parallelized by introducing the propose–reject model and propagation attention, reducing the time complexity from O(N) to O(1).
  • SCALOR can model scenes with complex, dynamic backgrounds.
  • SCALOR is the first unsupervised object-representation model shown to work on natural scenes containing several tens of moving objects.

Architecture Details

Propagation Attention

1. In the propagation phase, SCALOR tracks the existing objects.

2. A fully parallelized attention mechanism lets each object attend to its own region of interest.

3. This is done by applying bilinear pooling to the image features according to each object's location at the previous time step.
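The attention step above can be sketched as follows. This is a minimal illustration, not SCALOR's exact implementation: the `bilinear_glimpse` function, the (H, W, C) feature layout, the fixed glimpse size, and the normalized (y, x) center coordinates are all assumptions for the sketch.

```python
import numpy as np

def bilinear_glimpse(feature_map, center, size):
    """Extract a size x size glimpse from a (H, W, C) feature map by
    bilinear pooling around center = (y, x) in normalized [0, 1] coords.
    Illustrative sketch; layout and signature are assumptions."""
    H, W, _ = feature_map.shape
    cy, cx = center[0] * (H - 1), center[1] * (W - 1)
    # Sampling coordinates of the glimpse grid, clipped to the feature map.
    ys = np.clip(np.linspace(cy - size / 2, cy + size / 2, size), 0, H - 1)
    xs = np.clip(np.linspace(cx - size / 2, cx + size / 2, size), 0, W - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    f = feature_map
    # Blend the four nearest feature columns for every grid point.
    return ((1 - wy) * (1 - wx) * f[y0][:, x0] + (1 - wy) * wx * f[y0][:, x1]
            + wy * (1 - wx) * f[y1][:, x0] + wy * wx * f[y1][:, x1])

# Each object's glimpse depends only on its own previous location, so all
# objects can be processed in parallel (one batched call instead of a loop).
fmap = np.random.rand(32, 32, 8)
glimpse = bilinear_glimpse(fmap, center=(0.5, 0.5), size=7)
print(glimpse.shape)  # (7, 7, 8)
```

Because no glimpse depends on any other object's state, this step has constant depth in the number of objects, which is where the O(N) to O(1) reduction comes from.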


Discovery and Proposal-Rejection

1. In the discovery phase, SCALOR discovers newly introduced objects.

2. Exploiting the spatial-invariance property of CNNs, all new objects are proposed at different locations in parallel.

3. This powerful discovery module introduces the problem of propagation collapse: instead of learning to propagate, the model learns to rediscover all objects at each time step, thus losing the ability to track.

4. To resolve this problem, we introduce the proposal-rejection mechanism, in which every proposed object is put through an accept-reject test. If a proposal has a high overlap rate with an object in the propagation list, it is deemed a false proposal and rejected.
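The accept-reject test can be sketched with a simple IoU criterion. The (y1, x1, y2, x2) box format, the `iou_threshold=0.5` value, and the function names are assumptions for illustration; SCALOR's exact overlap measure may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (y1, x1, y2, x2)."""
    y1 = max(box_a[0], box_b[0]); x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2]); x2 = min(box_a[3], box_b[3])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def reject_proposals(proposals, propagated, iou_threshold=0.5):
    """Keep only proposals that do not strongly overlap any propagated
    object; a strong overlap means the object is already being tracked,
    so the proposal is a rediscovery and gets rejected."""
    return [p for p in proposals
            if all(iou(p, q) < iou_threshold for q in propagated)]

propagated = [(0, 0, 10, 10)]                    # already-tracked object
proposals = [(1, 1, 11, 11), (20, 20, 30, 30)]   # rediscovery + new object
print(reject_proposals(proposals, propagated))   # [(20, 20, 30, 30)]
```

Because rejection only compares proposals against the propagation list, the discovery module can still fire everywhere in parallel without being able to "steal" objects from the tracker.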

Background Module

1. After inferring the foreground objects, the background module infers the background representation.

2. It does so by incorporating information from both the input image and the inferred foreground objects.
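One way this conditioning could look is sketched below: the image is paired with the region not explained by any foreground object before encoding. The channel layout, the union-of-masks construction, and the function name are assumptions; the actual encoder/decoder networks are omitted.

```python
import numpy as np

def background_encoder_input(image, fg_masks):
    """Build the input to a background encoder from the image and the
    per-object foreground masks inferred in the foreground pass.
    image:    (H, W, 3) array
    fg_masks: (N, H, W) per-object alpha masks
    Returns (H, W, 4): RGB plus a background-visibility channel.
    Illustrative sketch, not SCALOR's exact parameterization."""
    fg_union = np.clip(fg_masks.sum(axis=0), 0.0, 1.0)  # union of object masks
    bg_visibility = 1.0 - fg_union                      # 1 where background shows
    return np.concatenate([image, bg_visibility[..., None]], axis=-1)

image = np.random.rand(64, 64, 3)
masks = np.zeros((5, 64, 64))  # five (here empty) object masks
x_bg = background_encoder_input(image, masks)
print(x_bg.shape)  # (64, 64, 4)
```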

Qualitative Results

Tracking and detection

SCALOR (SC) outperforms the state-of-the-art model SQAIR (SQ) on tracking and detection in all experimental settings.

Training and inference speed

SCALOR not only trains faster but also converges to a lower minimum. Moreover, its processing time is constant regardless of the number of objects in the scene, in contrast to the linear processing time of the baseline model.

Quantitative Results

Tracking in scenes with complex background

SCALOR is able to correctly separate moving background from the foreground objects while tracking the objects accurately and consistently.

From top left to bottom right: (1) Original scene, (2) Bounding boxes of discovery step, (3) Bounding boxes of propagation steps, (4) Reconstructed scene, (5) Inferred dynamic background, (6) Tracked objects.

Tracking in "Crowded Moving MNIST Dataset"

This dataset is more challenging than the moving dSprites shapes, as MNIST digits are more fine-grained.

Tracking in "Very High Density Environment"

In order to test the limits of SCALOR, we evaluate its tracking performance in an extreme case where a very high density of objects (80–100 objects) is present in the environment.

Frequent discovery in dense environments

This experiment evaluates the ability to discover many newly introduced objects across several time-steps. This matters because in many applications only the key-frames of a video, i.e. the frames at which significant changes happen, are available. In the object tracking domain, an example of a key-frame is one where many objects appear at once due to a sudden change of the observer's viewpoint. Here, 10–15 objects are introduced at time-steps 1, 7, 14, and 21. In such scenarios, SCALOR discovers the many new objects while tracking the previous ones consistently, owing to the design of the discovery module.

Scene Generation

Conditional generation and generation from time step 0

As a generative model, SCALOR is also capable of video generation. The top video shows the result of conditional generation (beginning when the red margin appears), and the bottom video shows generation from time step 0, where all objects and the background are generated from the prior distribution.

Natural Scenes


We also evaluate SCALOR's performance on natural real-world footage from a CCTV camera recording pedestrian movement in Grand Central Station. SCALOR decomposes the scene into foreground objects and a background image, and succeeds at accurate pedestrian detection and tracking.

From top left to bottom right: (1) Original scene, (2) Inferred segmentation mask with colored IDs, (3) Inferred location of each object, (4) Reconstructed scene, (5) Inferred background, (6) Extracted object trajectories.


Future time prediction: SCALOR is shown the first 5 time-steps and conditionally generates the next 5 time-steps (marked by the red margin). Interestingly, SCALOR generates consistent trajectories while introducing new objects into the scene at each time step.

From top left to bottom right: (1) Original scene, (2) Inferred/Generated segmentation mask with colored IDs, (3) Inferred/Generated location of each object, (4) Reconstructed/Generated scene, (5) Inferred/Generated background, (6) Extracted/Generated object trajectories.

More Examples

Grand Central Station Dataset


Conditional Generation

BibTeX entry

@inproceedings{jiang2020scalor,
  title={SCALOR: Generative World Models with Scalable Object Representations},
  author={Jindong Jiang and Sepehr Janghorbani and Gerard {de Melo} and Sungjin Ahn},
  booktitle={Proceedings of ICLR 2020},
  location={Addis Ababa, Ethiopia},
}