Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into Deep Latent Particles (DLP), which are then processed by our entity-centric Transformer that computes attention at the particle level, simultaneously predicting object dynamics and the agent's actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to perform tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training.
We propose Entity-Centric Diffuser (EC-Diffuser), a diffusion-based policy that leverages an object-centric representation and an entity-centric Transformer to denoise future states and actions.
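To make the architecture concrete, here is a minimal PyTorch sketch of an entity-centric denoiser: one token per DLP particle plus one token per action step, with self-attention over the combined set. All names (`EntityDenoiser`), layer sizes, and the choice to predict the clean signal directly are illustrative assumptions, not the paper's released implementation; conditioning on the current observation and goal is omitted for brevity.

```python
# Minimal sketch (assumed names/sizes, not the official EC-Diffuser code).
import torch
import torch.nn as nn

class EntityDenoiser(nn.Module):
    """Entity-centric Transformer denoiser: attention is computed at the
    particle level, jointly predicting denoised future particle states
    (object dynamics) and denoised actions."""

    def __init__(self, particle_dim, action_dim, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.particle_proj = nn.Linear(particle_dim, d_model)  # one token per DLP particle
        self.action_proj = nn.Linear(action_dim, d_model)      # one token per action step
        self.time_emb = nn.Embedding(1000, d_model)            # diffusion-timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.particle_head = nn.Linear(d_model, particle_dim)  # denoised particle states
        self.action_head = nn.Linear(d_model, action_dim)      # denoised actions

    def forward(self, noisy_particles, noisy_actions, t):
        # noisy_particles: (B, N, particle_dim); N may vary, since attention
        # runs over a set of particle tokens rather than a fixed-size vector.
        # noisy_actions: (B, H, action_dim); t: (B,) diffusion timesteps.
        tokens = torch.cat(
            [self.particle_proj(noisy_particles), self.action_proj(noisy_actions)], dim=1
        ) + self.time_emb(t)[:, None, :]
        h = self.transformer(tokens)
        n = noisy_particles.shape[1]
        return self.particle_head(h[:, :n]), self.action_head(h[:, n:])
```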
We train our policy with 3 objects, their colors randomly chosen from 6 colors, and then zero-shot transfer it to environments with up to 6 objects (see the sketch after the examples below).
Below, we visualize (trajectory rollout, goal image) pairs.
Tested on 4 cubes
Tested on 5 cubes
Tested on 6 cubes
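The zero-shot transfer above is possible because attention runs over a variable-length set of particle tokens: the same trained weights accept 3-object and 6-object scenes without any architectural change. A toy illustration using the hypothetical `EntityDenoiser` sketched earlier (all dimensions are made up):

```python
import torch

# Same weights, different entity counts: only the particle-token set grows.
model = EntityDenoiser(particle_dim=16, action_dim=4)
for num_objects in (3, 6):
    particles = torch.randn(1, num_objects, 16)   # one DLP latent per entity
    actions = torch.randn(1, 8, 4)                # noisy 8-step action plan
    t = torch.randint(0, 1000, (1,))
    denoised_particles, denoised_actions = model(particles, actions, t)
    print(num_objects, denoised_particles.shape, denoised_actions.shape)
```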
We train an agent with 3 objects and 3 goals. We then zero-shot transfer it to scenarios with up to 4 objects and varying goal compositions.
Below, we visualize (trajectory rollout, goal image) pairs.
Tested on 4 objects, 1 color
Tested on 4 objects, 2 colors
Tested on 4 objects, 3 colors
From the same initial and goal observations, EC-Diffuser can generate multimodal behaviors (see the sampling sketch after these examples).
Red → Blue → Yellow
Counter-clockwise
Push green T down
Red → Yellow → Blue
Clockwise
Push green T in-place
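The distinct behaviors above come from the stochasticity of the diffusion sampler: starting the reverse process from different Gaussian noise draws, under the same conditioning, yields different valid plans. A schematic sketch under the same assumptions as before (standard DDPM schedule, clean-signal prediction, goal conditioning omitted; not the paper's exact sampler):

```python
import torch

# Standard DDPM schedule (assumed; not necessarily the paper's choice).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def ddpm_step(x_t, x0_pred, t):
    """One reverse step: sample x_{t-1} from the DDPM posterior, using the
    model's clean-signal prediction x0_pred (Ho et al., 2020, Eq. 7)."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (torch.sqrt(ab_prev) * betas[t] / (1 - ab_t)) * x0_pred \
         + (torch.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)) * x_t
    if t == 0:
        return mean
    var = betas[t] * (1 - ab_prev) / (1 - ab_t)
    return mean + torch.sqrt(var) * torch.randn_like(x_t)

def sample_plans(model, num_particles, particle_dim=16, horizon=8, action_dim=4, k=3):
    """Draw k plans by varying only the initial noise; each draw can land
    on a different behavior mode."""
    plans = []
    for _ in range(k):
        particles = torch.randn(1, num_particles, particle_dim)  # fresh noise
        actions = torch.randn(1, horizon, action_dim)            # fresh noise
        for t in reversed(range(T)):
            tb = torch.full((1,), t, dtype=torch.long)
            p0, a0 = model(particles, actions, tb)
            particles, actions = ddpm_step(particles, p0, t), ddpm_step(actions, a0, t)
        plans.append(actions)
    return plans

# e.g., three candidate plans for a 3-object scene:
# plans = sample_plans(EntityDenoiser(16, 4), num_particles=3)
```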
We train our policy with 3 objects, their colors randomly chosen from 6 colors. We then zero-shot transfer it to the following environments:
New-colored
Star-shaped
Rectangular-shaped
T-shaped (location push)
Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, and Amy Zhang. "EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation." In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025.
@inproceedings{qi2025ecdiffuser,
  title={{EC}-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation},
  author={Carl Qi and Dan Haramati and Tal Daniel and Aviv Tamar and Amy Zhang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=o3pJU5QCtv}
}