Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into Deep Latent Particles (DLP), which are then processed by our entity-centric Transformer that computes attention at the particle level, simultaneously predicting object dynamics and the agent's actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to perform tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training.
We propose Entity-Centric Diffuser (EC-Diffuser), a diffusion-based policy that leverages an object-centric representation and an entity-centric Transformer to denoise future states and actions.
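To make the architecture concrete, here is a minimal PyTorch sketch of an entity-centric denoiser: one token per DLP particle plus one token per action step, with self-attention over the combined set. All names (`EntityDenoiser`), layer sizes, and the choice to predict the clean signal directly are illustrative assumptions, not the paper's released implementation; conditioning on the current observation and goal is omitted for brevity.

```python
# Minimal sketch (assumed names/sizes, not the official EC-Diffuser code).
import torch
import torch.nn as nn

class EntityDenoiser(nn.Module):
    """Entity-centric Transformer denoiser: attention is computed at the
    particle level, jointly predicting denoised future particle states
    (object dynamics) and denoised actions."""

    def __init__(self, particle_dim, action_dim, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.particle_proj = nn.Linear(particle_dim, d_model)  # one token per DLP particle
        self.action_proj = nn.Linear(action_dim, d_model)      # one token per action step
        self.time_emb = nn.Embedding(1000, d_model)            # diffusion-timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.particle_head = nn.Linear(d_model, particle_dim)  # denoised particle states
        self.action_head = nn.Linear(d_model, action_dim)      # denoised actions

    def forward(self, noisy_particles, noisy_actions, t):
        # noisy_particles: (B, N, particle_dim); N may vary, since attention
        # runs over a set of particle tokens rather than a fixed-size vector.
        # noisy_actions: (B, H, action_dim); t: (B,) diffusion timesteps.
        tokens = torch.cat(
            [self.particle_proj(noisy_particles), self.action_proj(noisy_actions)], dim=1
        ) + self.time_emb(t)[:, None, :]
        h = self.transformer(tokens)
        n = noisy_particles.shape[1]
        return self.particle_head(h[:, :n]), self.action_head(h[:, n:])
```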
We train our policy with 3 objects, their colors randomly chosen from 6 colors, and then zero-shot transfer it to environments with up to 6 objects (see the sketch after the examples below).
Below, we visualize (trajectory rollout, goal image) pairs.
Tested on 4 cubes
Tested on 5 cubes
Tested on 6 cubes
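The zero-shot transfer above is possible because attention runs over a variable-length set of particle tokens: the same trained weights accept 3-object and 6-object scenes without any architectural change. A toy illustration using the hypothetical `EntityDenoiser` sketched earlier (all dimensions are made up):

```python
import torch

# Same weights, different entity counts: only the particle-token set grows.
model = EntityDenoiser(particle_dim=16, action_dim=4)
for num_objects in (3, 6):
    particles = torch.randn(1, num_objects, 16)   # one DLP latent per entity
    actions = torch.randn(1, 8, 4)                # noisy 8-step action plan
    t = torch.randint(0, 1000, (1,))
    denoised_particles, denoised_actions = model(particles, actions, t)
    print(num_objects, denoised_particles.shape, denoised_actions.shape)
```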
We train an agent with 3 objects and 3 goals. We then zero-shot transfer it to scenarios with up to 4 objects and varying goal compositions.
Below, we visualize (trajectory rollout, goal image) pairs.
Tested on 4 objects, 1 color
Tested on 4 objects, 2 colors
Tested on 4 objects, 3 colors
From the same initial and goal observations, EC-Diffuser can generate multimodal behaviors (see the sampling sketch after these examples).
Red → Blue → Yellow
Counter-clockwise
Push green T down
Red → Yellow → Blue
Clockwise
Push green T in-place
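The distinct behaviors above come from the stochasticity of the diffusion sampler: starting the reverse process from different Gaussian noise draws, under the same conditioning, yields different valid plans. A schematic sketch under the same assumptions as before (standard DDPM schedule, clean-signal prediction, goal conditioning omitted; not the paper's exact sampler):

```python
import torch

# Standard DDPM schedule (assumed; not necessarily the paper's choice).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def ddpm_step(x_t, x0_pred, t):
    """One reverse step: sample x_{t-1} from the DDPM posterior, using the
    model's clean-signal prediction x0_pred (Ho et al., 2020, Eq. 7)."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (torch.sqrt(ab_prev) * betas[t] / (1 - ab_t)) * x0_pred \
         + (torch.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)) * x_t
    if t == 0:
        return mean
    var = betas[t] * (1 - ab_prev) / (1 - ab_t)
    return mean + torch.sqrt(var) * torch.randn_like(x_t)

def sample_plans(model, num_particles, particle_dim=16, horizon=8, action_dim=4, k=3):
    """Draw k plans by varying only the initial noise; each draw can land
    on a different behavior mode."""
    plans = []
    for _ in range(k):
        particles = torch.randn(1, num_particles, particle_dim)  # fresh noise
        actions = torch.randn(1, horizon, action_dim)            # fresh noise
        for t in reversed(range(T)):
            tb = torch.full((1,), t, dtype=torch.long)
            p0, a0 = model(particles, actions, tb)
            particles, actions = ddpm_step(particles, p0, t), ddpm_step(actions, a0, t)
        plans.append(actions)
    return plans

# e.g., three candidate plans for a 3-object scene:
# plans = sample_plans(EntityDenoiser(16, 4), num_particles=3)
```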
We train our policy with 3 objects, their colors randomly chosen from 6 colors. We then zero-shot transfer it to the following environments:
New-colored
Star-shaped
Rectangular-shaped
T-shaped (location push)
Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, and Amy Zhang. "EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation." In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025.
@inproceedings{qi2025ecdiffuser,
  title={{EC}-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation},
  author={Carl Qi and Dan Haramati and Tal Daniel and Aviv Tamar and Amy Zhang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=o3pJU5QCtv}
}