Relational object rearrangement (ROR) tasks (e.g., putting a plate in a rack) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations, which struggle to capture complex geometric constraints, or generate goal-state observations to encode semantic and geometric knowledge but fail to explicitly couple object transformation with action prediction, leading to errors caused by generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric object constraints into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct the corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object–action consistency strategy with soft pose supervision explicitly aligns the predicted end-effector motion with the generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and to predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies.
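As a concrete, purely illustrative sketch of the imagination stage, the code below shows one way an imagined goal point cloud could be obtained: an instruction-conditioned generative model edits the initial observation into a goal image, and the result is lifted to 3D by back-projecting estimated depth through the camera intrinsics. The callables edit_image and estimate_depth are hypothetical placeholders for the generative and depth models, not Imagine2Act's actual interfaces.

import numpy as np

def imagine_goal_point_cloud(rgb, instruction, K, edit_image, estimate_depth):
    # rgb: (H, W, 3) initial observation; instruction: language command;
    # K: (3, 3) camera intrinsics. edit_image and estimate_depth are
    # hypothetical stand-ins for the generative and depth models.
    goal_rgb = edit_image(rgb, instruction)   # imagined goal image
    depth = estimate_depth(goal_rgb)          # (H, W) predicted depth
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel into camera coordinates.
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = goal_rgb.reshape(-1, 3)
    return points, colors                     # imagined goal point cloud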
Before robot execution, the semantic–geometric constraint generation module produces an imagined point cloud conditioned on the initial observation. During training, this imagined point cloud serves as an additional input to the policy. Furthermore, through Object–Action Consistency Learning, we compute the transformation between the initial and imagined object poses; this transformation serves as an auxiliary prior input and contributes a loss term that enforces the strong correlation between object transformation and end-effector motion.
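To make Object–Action Consistency Learning concrete, the following minimal sketch first estimates the rigid transformation between the initial and imagined object point clouds with a standard Kabsch/Procrustes solve (assuming point correspondences are available, a simplification for illustration), then penalizes the discrepancy between this object transformation and the predicted end-effector motion with a translation term and a geodesic rotation term. This loss form is our sketch of the soft pose supervision, not necessarily the paper's exact formulation.

import torch

def estimate_object_transform(P_init, P_goal):
    # Kabsch/Procrustes: rigid (R, t) mapping initial object points to
    # the imagined goal points. P_init, P_goal: (N, 3) corresponding
    # points (the correspondence is an assumption made for this sketch).
    c_init, c_goal = P_init.mean(0), P_goal.mean(0)
    H = (P_init - c_init).T @ (P_goal - c_goal)   # 3x3 cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    D = torch.eye(3)
    D[2, 2] = torch.sign(torch.linalg.det(Vt.T @ U.T))  # avoid reflections
    R = Vt.T @ D @ U.T
    t = c_goal - R @ c_init
    return R, t

def object_action_consistency_loss(R_obj, t_obj, R_ee, t_ee):
    # Soft pose supervision: align the predicted end-effector motion
    # (R_ee, t_ee) with the imagined object transformation (R_obj, t_obj).
    t_loss = torch.norm(t_obj - t_ee)
    cos = ((torch.trace(R_obj.T @ R_ee) - 1.0) / 2.0).clamp(-1.0, 1.0)
    r_loss = torch.arccos(cos)                    # geodesic rotation distance
    return t_loss + r_loss

During training, such a term could be added to the imitation objective with a small weight, so that action prediction is softly coupled to the imagined object transformation rather than hard-constrained by it.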
We evaluate Imagine2Act on RLBench and in the real world. Across 7 relational object rearrangement tasks on RLBench, Imagine2Act achieves a mean success rate of 79%, an absolute improvement of at least 10 percentage points over 3D Diffuser Actor, Imagine Policy, and 3D-LOTUS. In the real-world setting, the policy learns precise multi-task manipulation and delivers consistent improvements across 6 high-precision rearrangement tasks, with an average increase of 25% in success rate over 3D Diffuser Actor. We further apply the approach to articulated object manipulation tasks in RLBench to verify its scalability to other task types, where it continues to show promising performance.
Evaluation in the real world. Success rates are reported as the number of successes out of 10 trials.
Evaluation of relational object rearrangement tasks in RLBench. We report the success rate for each of the 7 tasks. The last column reports the margin of Imagine2Act over each baseline.
Evaluation of articulated object manipulation tasks in RLBench. We report the success rate for each of the 5 tasks.