1. In the propagation phase, SCALOR tracks the existing objects.
2. We apply a fully parallelized attention mechanism so that each object attends to its own region of interest.
3. This is done by applying bilinear pooling to the image features according to each object's location at the previous time step.
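The per-object attention step can be illustrated with a minimal sketch. It assumes that the "bilinear pooling" above amounts to resampling a box of the feature map onto a small fixed grid by bilinear interpolation. The function names, the `(cy, cx, h, w)` box format, and the single-channel list-of-rows feature map are all hypothetical, and the plain Python loop over objects merely stands in for a batched, parallel operation.

```python
import math

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map borders
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bot = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bot * wy

def extract_glimpse(fmap, box, out_size=4):
    """Resample the region box = (cy, cx, h, w) onto an out_size x out_size grid (out_size >= 2)."""
    cy, cx, h, w = box
    # Relative offsets from -0.5 to 0.5, so the grid spans the whole box.
    steps = [i / (out_size - 1) - 0.5 for i in range(out_size)]
    return [[bilinear_sample(fmap, cy + sy * h, cx + sx * w) for sx in steps]
            for sy in steps]

def attend_all(fmap, prev_boxes, out_size=4):
    """One glimpse per tracked object, taken at that object's previous-step box (assumed format)."""
    return [extract_glimpse(fmap, box, out_size) for box in prev_boxes]
```

Because every glimpse depends only on the shared feature map and that object's own previous box, the glimpses are independent of one another, which is what makes the step parallelizable across objects.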
1. In the discovery phase, SCALOR discovers newly introduced objects.
2. We exploit the spatial invariance of CNNs and propose all new objects at different locations in parallel.
3. This powerful discovery module introduces the problem of propagation collapsing: the model does not learn to propagate but instead learns to rediscover all objects at each time step, thus losing the ability to track.
4. To resolve this problem, we introduce a proposal-rejection mechanism in which every proposed object is put through an accept-reject test. If a proposed object has a high overlap rate with objects in the propagation list, it is deemed a false proposal and is rejected.
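The accept-reject test can be sketched as follows. This is only an illustration under stated assumptions: the overlap rate is taken to be the intersection over union (IoU) of axis-aligned bounding boxes, and the `0.5` threshold, the `(x_min, y_min, x_max, y_max)` box format, and the function names are hypothetical rather than taken from SCALOR itself.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def reject_proposals(proposals, propagated, threshold=0.5):
    """Keep only proposals that do not overlap heavily with any propagated object.

    threshold is an assumed value; a proposal whose IoU with any
    propagated box exceeds it is treated as a false proposal.
    """
    return [p for p in proposals
            if all(iou(p, q) <= threshold for q in propagated)]
```

For example, with one propagated box `(0, 0, 10, 10)`, a proposal at `(1, 1, 10, 10)` has an IoU of 0.81 and is rejected, while a proposal at `(20, 20, 30, 30)` does not overlap and is accepted as a newly discovered object.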
1. After inferring the foreground objects, the background module infers the background representation.
2. To do so, the module combines information from the input image with the inferred foreground objects.
SCALOR (SC) outperforms the state-of-the-art model SQAIR (SQ) on tracking and detection in all experimental settings.
SCALOR not only trains faster but also converges to a lower minimum. Its processing time also stays constant regardless of the number of objects in the scene, whereas the processing time of the baseline model grows linearly.
SCALOR is able to correctly separate moving background from the foreground objects while tracking the objects accurately and consistently.
From top left to bottom right: (1) Original scene, (2) Bounding boxes of discovery step, (3) Bounding boxes of propagation steps, (4) Reconstructed scene, (5) Inferred dynamic background, (6) Tracked objects.
This dataset is more challenging than the moving dSprites shapes because MNIST digits are more fine-grained.
To test the limits of SCALOR, we evaluate its tracking performance in an extreme case where a very high density of objects (80-100 objects) is present in the environment.
This experiment evaluates the ability to discover many newly introduced objects across several time-steps. This is important because in many applications, only key-frames of a video, i.e. the frames at which significant changes happen, are available. In the object tracking domain, an example of a key-frame is one where many objects are introduced in the same frame due to a sudden change of the observer's viewpoint. Here, 10-15 objects are introduced at each of time-steps 1, 7, 14 and 21. In such scenarios, SCALOR can discover many objects while consistently tracking the previous objects, thanks to the powerful design of the discovery module.
As a generative model, SCALOR is also capable of image generation. The top video shows the result of conditional generation (when the red margin appears). The bottom video shows generation from time step 0, where all objects and the background are generated from the prior distribution.
We also evaluate SCALOR's performance on natural real-world footage obtained from a CCTV camera recording pedestrian movement in Grand Central Station. SCALOR is able to decompose the scene into foreground objects and a background image, and it succeeds in accurate pedestrian detection and tracking.
From top left to bottom right: (1) Original scene, (2) Inferred segmentation mask with colored IDs, (3) Inferred location of each object, (4) Reconstructed scene, (5) Inferred background, (6) Extracted object trajectories.
Future time prediction: SCALOR is shown the first 5 time-steps and conditionally generates the next 5 time-steps (indicated by the red margin). Interestingly, SCALOR is able to generate consistent trajectories while introducing new objects into the scene at each time step.
From top left to bottom right: (1) Original scene, (2) Inferred/Generated segmentation mask with colored IDs, (3) Inferred/Generated location of each object, (4) Reconstructed/Generated scene, (5) Inferred/Generated background, (6) Extracted/Generated object trajectories.
@inproceedings{JiangJanghorbaniDeMeloAhn2020SCALOR,
title={SCALOR: Generative World Models with Scalable Object Representations},
author={Jindong Jiang and Sepehr Janghorbani and Gerard {de Melo} and Sungjin Ahn},
booktitle={Proceedings of ICLR 2020},
year={2020},
publisher = {OpenReview.net},
location = {Addis Ababa, Ethiopia},
url = {https://openreview.net/pdf?id=SJxrKgStDH},
}