Entity-Centric Reinforcement Learning
for Object Manipulation from Pixels
Dan Haramati, Tal Daniel, Aviv Tamar
ICLR 2024 - Spotlight (top 5%)
Goal-Conditioned Reinforcement Learning Workshop, NeurIPS 2023 - Spotlight Talk
Code | arXiv | OpenReview
Abstract
Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, based on a theoretical result for compositional generalization, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects.
Compositional Generalization
The agent is trained to manipulate 3 cubes. During inference, it is provided with a goal image containing 3 cubes of different colors.
The agent is then deployed in an environment containing 12 cubes, 4 of each color.
[Video panels: Agent Rollout (Inference), Goal Image, Agent Rollout (Training)]
The agent is trained to manipulate 3 cubes and evaluated on 6 cubes.
[Video panels: Goal Image (Training), Agent Rollout (Training), Agent Rollout (Inference), Goal Image (Inference)]
In the Eyes of the Agent
The agent learns from an object-centric latent representation of images, extracted with Deep Latent Particles (DLP).
This latent representation is fed to the Entity Interaction Transformer (EIT), our Transformer-based architecture for the RL agent.
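To illustrate why an attention-based architecture over entities can accept a varying number of objects, here is a minimal sketch in plain Python. It is not the authors' EIT: there are no learned projections or training, and the names `attend` and `policy_features`, as well as the random vectors standing in for DLP particles, are assumptions made purely for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention (no learned weights).

    Each query vector is replaced by a weighted average of `keys_values`,
    so the number of outputs equals the number of queries.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


def policy_features(state_entities, goal_entities):
    """Hypothetical feature extractor over sets of entity vectors."""
    # 1) State entities exchange information with each other (self-attention).
    mixed = attend(state_entities, state_entities)
    # 2) Each state entity reads from the goal entities (cross-attention).
    goal_aware = attend(mixed, goal_entities)
    # 3) Mean-pool over entities to get a vector whose size is independent
    #    of how many entities were observed.
    n, dim = len(goal_aware), len(goal_aware[0])
    return [sum(e[i] for e in goal_aware) / n for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Random stand-ins for per-entity latent vectors (assumed dimension 4).
    make = lambda n: [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]
    goal = make(3)
    print(len(policy_features(make(3), goal)))   # same output size for 3 entities
    print(len(policy_features(make(12), goal)))  # ... and for 12 entities
```

Because the pooled output has the same size for 3 or 12 entities and does not depend on their order, a downstream action head could in principle be reused unchanged as the object count grows.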
[Figure panels: Goal Image, Sideview, Frontview]
Deep Latent Particles (DLP) Decomposition
The Entity Interaction Transformer (EIT)
Complex Objects and Goals
We present preliminary results on the Push-2T task.
The agent is required to push two T-shaped blocks to a single goal orientation (angle) specified by an image.
[Video panels: Rollout, Goal]
The agent handles objects with more complex dynamics and goals that are not explicit in the latent representation.