Abstract: We test the hypothesis that modeling a scene in terms of entities and their local interactions, as opposed to modeling the scene globally, provides a significant benefit in generalizing to physical tasks in a combinatorial space the learner has not encountered before. We present object-centric perception, prediction, and planning (OP3), which to the best of our knowledge is the first fully probabilistic entity-centric dynamic latent variable framework for model-based reinforcement learning that acquires entity representations from raw visual observations without supervision and uses them to predict and plan. OP3 enforces entity abstraction -- symmetric processing of each entity representation with the same locally-scoped function -- which enables it to scale to model different numbers and configurations of objects than those seen in training. Our approach to the key technical challenge of grounding these entity representations in actual objects in the environment is to frame this variable binding problem as an inference problem, and we develop an interactive inference algorithm that uses temporal continuity and interactive feedback to bind information about object properties to the entity variables. On block-stacking tasks, OP3 generalizes to novel block configurations and more objects than observed during training, outperforming an oracle model that assumes access to object supervision and achieving two to three times better accuracy than a state-of-the-art video prediction model.
OP3 enforces the entity abstraction, factorizing the latent state into local entity states, each of which is symmetrically processed with the same function that takes a generic entity as an argument. In contrast, prior work either processes a global latent state or assumes a fixed set of entities processed in a permutation-sensitive manner.
Importantly, because all objects obey the same physical laws, we can define these learnable entity-centric functions to take as input a variable that represents a generic entity, every specific instantiation of which is processed by the same function. If we assume access to these variables, then it is possible to perform symbolic relational computation with the dynamics model in the space of entities, rather than of pixel features.
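To make this concrete, below is a minimal sketch (PyTorch-style, not the released OP3 code) of an entity-centric dynamics step: the same pairwise interaction function is applied to every ordered pair of entity latents and the same update function to every entity, so the model is indifferent to how many entities there are. The module and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class EntityDynamics(nn.Module):
    """Applies the same pairwise and update functions to every entity latent."""
    def __init__(self, entity_dim, action_dim, hidden_dim=128):
        super().__init__()
        # g(h_i, h_j): effect of (sender) entity j on (receiver) entity i
        self.pairwise_net = nn.Sequential(
            nn.Linear(2 * entity_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # d(h_i, aggregated effects, action): per-entity update
        self.update_net = nn.Sequential(
            nn.Linear(entity_dim + hidden_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, entity_dim))

    def forward(self, entities, action):
        # entities: (K, D) entity latents; action: (A,). Works for any K.
        K, D = entities.shape
        receivers = entities.unsqueeze(1).expand(K, K, D)
        senders = entities.unsqueeze(0).expand(K, K, D)
        # sum the effects of all senders on each receiver (self-effect included for brevity)
        effects = self.pairwise_net(torch.cat([receivers, senders], dim=-1)).sum(dim=1)
        act = action.unsqueeze(0).expand(K, -1)
        return entities + self.update_net(torch.cat([entities, effects, act], dim=-1))
```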
However, the general difficulty with using purely symbolic, abstract representations is that it is unclear how to continuously update these representations with more raw data. In the case of OP3, this symbol grounding problem is a variable binding problem: how do we bind information about object properties to the entity variables, without any supervision on what constitutes an object and what the latent variables should correspond to?
Iterative inference computes the recognition distribution via a procedure, rather than a single forward pass of an encoder, that iteratively refines an initial guess for the posterior parameters lambda by using gradients from how well the model is able to predict the observation based on the current posterior estimate.
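As a rough illustration, here is a minimal sketch of one such refinement loop, assuming a decoder that reconstructs the observation from the posterior parameters; the squared-error loss is a stand-in for the actual variational objective and this is not OP3's exact implementation.

```python
import torch

def refine_posterior(decoder, obs, lam, num_steps=4, lr=0.1):
    """Refine posterior parameters lam by gradient steps on reconstruction error."""
    lam = lam.detach().clone()
    for _ in range(num_steps):
        lam.requires_grad_(True)
        recon = decoder(lam)                        # predict the observation from the current estimate
        loss = ((recon - obs) ** 2).mean()          # stand-in for the (negative) ELBO
        grad, = torch.autograd.grad(loss, lam)
        lam = (lam - lr * grad).detach()            # one refinement step on the posterior parameters
    return lam
```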
We develop an interactive inference algorithm that uses the dynamics model to produce an initial estimate of the posterior parameters at a timestep, and then uses a refinement network to iteratively refine the estimate within a timestep.
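Schematically, the alternation looks like the sketch below, where refine(obs, lam) performs a few within-timestep refinement steps (as above) and dynamics(lam, act) is the learned entity-wise dynamics model; both are placeholders for the corresponding OP3 components.

```python
def interactive_inference(dynamics, refine, observations, actions, lam0):
    """Alternate prediction steps (dynamics) and refinement steps (refine) over a trajectory."""
    lam = refine(observations[0], lam0)             # refine the initial guess against the first frame
    posteriors = [lam]
    for obs, act in zip(observations[1:], actions):
        lam = dynamics(lam, act)                    # prediction step: propagate entity latents forward
        lam = refine(obs, lam)                      # refinement steps: correct against the new observation
        posteriors.append(lam)
    return posteriors
```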
Trained only to predict the effects of gravity, OP3 can automatically plan to solve various block-stacking tasks, with different numbers of blocks than observed during training, without ever having been trained to stack blocks.
On this block-stacking dataset, OP3 was only trained on up to five objects but can generalize to nine objects and new configurations during testing.
OP3 achieves 82% accuracy, compared to 76% accuracy for O2P2 and 24% accuracy for SAVP. SAVP is a state-of-the-art video prediction model that does not infer separate latents per object. O2P2 is an oracle model that assumes access to object segmentations.
Below is an example of a planning sequence.
[Figure: planning sequence, steps 1 through 6]
In interactive inference, OP3 interacts with the objects by executing pre-specified actions in order to disambiguate objects already present in the scene. It takes advantage of temporal continuity and of feedback from comparing its prediction of how an action affects an object with the ground-truth result.
The figure on the right shows the execution of interactive inference during training, where OP3 alternates between four refinement steps and one prediction step.
Notice that OP3 infers entity representations that decompose the scene into coherent objects, while the entities that do not correspond to objects model the background.
We also observe in the last column (t=2) that OP3 predicts the appearance of the green block even though the green block was partially occluded in the previous timestep, which shows its ability to retain information across time.
Estimating the state of each object through interactive inference produces a pointer to each object (the index of the entity latent that represents that object), enabling OP3 to plan rollouts in a higher-level object space rather than in the low-level position-control action space, as is typically done.
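This can be pictured as a simple shooting-style planner operating on entity latents; the sketch below is illustrative (the uniform action sampler and the cost function are stand-ins, not the exact planner used in the paper).

```python
import torch

def plan(dynamics, cost_fn, entities, goal_entities, action_dim, horizon=5, num_samples=256):
    """Random-shooting planner that scores rollouts in the space of entity latents."""
    best_cost, best_actions = float('inf'), None
    for _ in range(num_samples):
        actions = torch.rand(horizon, action_dim)    # candidate action sequence (uniform for illustration)
        state = entities
        for act in actions:
            state = dynamics(state, act)             # roll out with the learned entity-wise dynamics
        cost = float(cost_fn(state, goal_entities))  # e.g. distance between matched entity latents
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions
```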
Whereas typical approaches to image segmentation rely on labelled supervision, the latent entities inferred by OP3 can naturally be decoded into semantically and temporally coherent image segmentations of objects without any supervision.
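For intuition, here is a minimal sketch of how per-entity decoders can yield such a segmentation, assuming each entity latent decodes to an image and a mask logit (in the style of IODINE-like mixture decoders); the decoder interface shown is hypothetical.

```python
import torch

def segment(decoder, entities):
    """Decode each entity latent and assign every pixel to the most responsible entity."""
    # entities: (K, D); decoder maps one latent to (rgb: 3 x H x W, mask_logit: 1 x H x W)
    rgbs, mask_logits = zip(*[decoder(z) for z in entities])
    masks = torch.softmax(torch.stack(mask_logits), dim=0)   # normalize masks across the K entities
    segmentation = masks.argmax(dim=0).squeeze(0)            # (H, W) map of entity indices
    recon = (masks * torch.stack(rgbs)).sum(dim=0)           # mixture reconstruction of the image
    return segmentation, recon
```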
We compare OP3, applied to dynamic videos, with IODINE applied independently to each frame of the video, to illustrate that using a dynamics model to propagate information across time enables better object disambiguation. Initially, both OP3 (green circle) and IODINE (cyan circles) disambiguate objects via color segmentation, because color is the only signal for grouping pixels in a static image. However, as time progresses, OP3 separates the arm, object, and background into separate latents (purple) by using its current latent estimates to predict the next observation and comparing this prediction with the actually observed next observation. In contrast, applying IODINE on a per-frame basis does not benefit from temporal consistency or interactive feedback (red).
The code is available in this repo: https://github.com/jcoreyes/OP3.