Abstract: Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of configurations of underlying entities and the states they can take. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, and that outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks.
Solving object rearrangement requires solving two challenging problems:
(a) The correspondence problem: abstracting raw sensorimotor signal into representations of entities such that there is a correspondence between how an agent intervenes on an entity and how its action affects an object in the environment.
(b) The combinatorial problem: representing the combinatorial space of tasks in a way that captures common state transitions while remaining agnostic to the type of the target object being moved and to the configuration of the context objects in the rest of the scene.
Neural Constraint Satisfaction (NCS) addresses these problems with a two-level hierarchy that abstracts video interactions into a graph over individual state transitions.
The modeling component of NCS constructs a two-level abstraction hierarchy.
(a) Level one: NCS learns to infer a set of entities from sensorimotor transitions with pick-and-move actions, in which one entity is moved per transition. We explicitly enforce that the type z of an entity remains unchanged between time-steps, while a transformer dynamics model learns to sparsely predict the states s of the entities at the next time-step. Empirically, NCS learns to modify only the state of the entity it uses to represent the target object that was moved in the environment. This solves the correspondence problem by forcing the network to predict and reconstruct observations through the entity bottleneck.
(b) Level two: NCS abstracts transitions over entity-sets into transitions over individual states, constructing a graph in which states are nodes and transitions between them are edges. This is done by clustering entity transitions that share similar initial states and similar final states. It solves the combinatorial problem by making the state transitions agnostic to the identities and contexts of the entities from which they were derived.
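A minimal sketch of the level-two abstraction, assuming level one has already produced, for each pick-and-move transition, the state of the single entity that changed (before and after) along with the action taken. The use of k-means, the number of nodes, and the edge dictionary are illustrative choices, not the paper's exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_transition_graph(states_before, states_after, actions, num_nodes=20):
    """Cluster entity states into nodes and tag edges with the actions that
    moved an entity from one node (state cluster) to another.
    states_before, states_after: (num_transitions, state_dim) arrays.
    actions: sequence of length num_transitions."""
    # Fit one clustering over all observed entity states so that the initial
    # and final states of a transition map into the same set of nodes.
    all_states = np.concatenate([states_before, states_after], axis=0)
    kmeans = KMeans(n_clusters=num_nodes, n_init=10).fit(all_states)

    # Each transition becomes an edge from the node of its initial state to
    # the node of its final state, tagged with the action that caused it.
    src = kmeans.predict(states_before)
    dst = kmeans.predict(states_after)
    edges = {}  # (src_node, dst_node) -> actions observed for that edge
    for s, d, a in zip(src, dst, actions):
        edges.setdefault((int(s), int(d)), []).append(a)
    return kmeans, edges
```

Because a node depends only on the state of the moved entity, the same edge is reused regardless of which object was moved or how the rest of the scene was configured, which is what collapses the combinatorial structure.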
Given a rearrangement problem specified only by the current and goal observations, NCS decomposes it into one subproblem per entity.
(a) For each subproblem, NCS infers the entities from both the current and goal observations, and selects an action that transforms the entity's state from its current value to its goal value. Breaking this down:
(b) NCS aligns the indices of current and goal entities by matching their types.
(c) It selects the index of the next goal constraint to satisfy (indicated by the red box in the figure). The selected goal constraint and its corresponding current entity have the same type but different states.
(d) It binds the selected goal constraint and its corresponding current entity to nodes in the transition graph.
(e) Lastly, it identifies the edge connecting those two nodes and executes the action tagged to that edge in the environment.
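A sketch of one step of this control loop, continuing from the graph built above. Entities are assumed to be (type_vector, state_vector) pairs, and the nearest-type matching and tolerance are illustrative simplifications rather than the paper's exact procedure:

```python
import numpy as np

def rearrangement_step(current_entities, goal_entities, kmeans, edges):
    """One control step: pick an unsatisfied goal constraint, bind the current
    and goal states to nodes of the transition graph, and return the action
    tagged to the connecting edge (None if everything is already satisfied)."""
    for cur_type, cur_state in current_entities:
        # (b) Align indices by type: find the goal entity with the closest type.
        goal_type, goal_state = min(
            goal_entities, key=lambda g: np.linalg.norm(g[0] - cur_type))
        # (c) Skip goal constraints that are already satisfied (states agree).
        if np.allclose(cur_state, goal_state, atol=1e-2):
            continue
        # (d) Bind both states to nodes (state clusters) of the transition graph.
        cur_node = int(kmeans.predict(cur_state[None])[0])
        goal_node = int(kmeans.predict(goal_state[None])[0])
        # (e) Look up the connecting edge and return one of its tagged actions.
        actions = edges.get((cur_node, goal_node))
        if actions:
            return actions[0]
    return None
```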
We observe that NCS discovers that states and identities correspond to object locations and appearances, respectively. Here we show the clustering of the state component of the entities inferred for the robogym environment, where each centroid is treated as a node in our transition graph. Each cluster is labeled with an attention mask computed by averaging the masks that slot attention produces for the entities in the cluster. These masks show that similar states correspond to similar object positions, and that the nodes of the graph effectively partition the states in the training set into equivalence classes.
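A small sketch of how such per-cluster masks could be computed, reusing the clustering from the earlier sketch and assuming each training entity is stored with the slot-attention mask it was decoded from (the array shapes are illustrative):

```python
import numpy as np

def cluster_masks(states, masks, kmeans):
    """Average the slot-attention masks of all entities assigned to each state
    cluster, producing one mask image per node of the transition graph.
    states: (N, state_dim), masks: (N, H, W) with values in [0, 1]."""
    labels = kmeans.predict(states)
    return {int(c): masks[labels == c].mean(axis=0)
            for c in np.unique(labels)}
```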
The crucial test of knowledge reuse is to solve tasks from parts of the combinatorial problem space that are disjoint from the part given for training, i.e. generalization to different numbers of objects than were seen during training. We find that NCS does generalize successfully, though it also makes mistakes.
We consider three challenging environments, and we test settings where all objects or only a subset of objects have associated constraints (e.g. top left example).
We measure the mean change in the number of satisfied constraints divided by the number of initially unsatisfied constraints for each episode. We compare NCS with an MPC baseline that replaces the factorized graph search with CEM, an ablation that constructs a monolithic graph over transitions between sets of slot states rather than individual slot states (MGS), and state-of-the-art pixel-based behavior cloning (BC) and implicit Q-learning (IQL) implementations based on JAXRL.
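As a concrete sketch of this metric, assuming constraint satisfaction is counted from the first and last observations of each episode (the function name and guard against division by zero are our own choices):

```python
def fractional_improvement(initial_satisfied, final_satisfied, total_constraints):
    """Change in the number of satisfied constraints, normalized by how many
    were initially unsatisfied; the reported score is the mean over episodes."""
    unsatisfied = total_constraints - initial_satisfied
    return (final_satisfied - initial_satisfied) / max(unsatisfied, 1)

# Example: 1 of 4 constraints satisfied at the start, 3 at the end -> 2/3.
# score = fractional_improvement(1, 3, 4)
```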
We find that NCS performs significantly better than the baselines (about a 5-10x improvement), suggesting that modeling the properties of independence, symmetry, and factorization of entities enables NCS to collapse the combinatorial structure of the task space.