TAX3D: Non-Rigid Relative Placement through Dense Diffusion

Eric Cai, Octavian Donca, Ben Eisner, David Held

Robotics Institute, School of Computer Science, Carnegie Mellon University

{eycai, odonca, baeisner, dheld}@andrew.cmu.edu

Abstract

The task of "relative placement" is to predict the placement of one object in relation to another, e.g., placing a mug onto a mug rack. Through explicit object-centric geometric reasoning, recent methods for relative placement have made tremendous progress towards data-efficient learning for robot manipulation while generalizing to unseen task variations. However, they have yet to represent deformable transformations, despite the ubiquity of non-rigid bodies in real-world settings. As a first step towards bridging this gap, we propose "cross-displacement" - an extension of the principles of relative placement to geometric relationships between deformable objects - and present a novel vision-based method to learn cross-displacement through dense diffusion. We demonstrate our method's ability to generalize to unseen object instances, out-of-distribution scene configurations, and multimodal goals on multiple highly deformable tasks (both in simulation and in the real world) beyond the scope of prior work.

TAX3D in the Real World

TAX3D performing a cloth-hanging task in the real world (videos shown at 3x speed), trained on 10 real-world demonstrations collected by a human. We show that TAX3D generalizes to novel anchor poses, minor changes to anchor geometry (removal/swapping of pegs), and multimodal placements (bottom row).

Model Architecture

Left: During inference, randomly sampled displacements are denoised, conditioned on features of the action and anchor objects; the final set of cross-displacements transforms the action object into its goal configuration. Right: Our modified Diffusion Transformer architecture combines self-attention and cross-attention for object-centric and scene-level reasoning.
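The inference procedure described in the caption can be illustrated with a minimal sketch. This is not the TAX3D implementation: it assumes a standard DDPM-style ancestral sampler with a linear noise schedule, and it replaces the learned Diffusion Transformer with a hypothetical stand-in, `toy_noise_predictor`, that simply pulls every action point towards the anchor centroid. All function names, the schedule, and the step count are assumptions made for illustration.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the source)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, action_points, anchor_points, alpha_bars):
    """Hypothetical stand-in for the learned network.

    It pretends the true cross-displacement moves every action point to the
    anchor centroid, and returns the noise that is consistent with that target.
    """
    target = anchor_points.mean(axis=0, keepdims=True) - action_points
    a_bar = alpha_bars[t]
    return (x_t - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)


def sample_cross_displacements(action_points, anchor_points, predictor,
                               num_steps=50, seed=0):
    """Denoise randomly sampled per-point displacements, one 3-vector per action point."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from Gaussian noise with the same shape as the action point cloud.
    x = rng.standard_normal(action_points.shape)

    for t in reversed(range(num_steps)):
        eps = predictor(x, t, action_points, anchor_points, alpha_bars)
        # DDPM posterior mean for the previous (less noisy) step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean  # no noise is added at the final step
    return x


def apply_displacements(action_points, displacements):
    """Goal configuration = action points shifted by their predicted displacements."""
    return action_points + displacements


# Usage with synthetic point clouds (N x 3 arrays).
rng = np.random.default_rng(1)
action = rng.uniform(-1.0, 1.0, size=(128, 3))
anchor = rng.uniform(2.0, 3.0, size=(256, 3))
displacements = sample_cross_displacements(action, anchor, toy_noise_predictor)
goal = apply_displacements(action, displacements)
```

Because each action point receives its own displacement vector, the predicted goal is not constrained to be a rigid transform of the input, which is what allows this kind of dense output to describe deformable placements. In a real system the toy predictor would be replaced by a trained network conditioned on action and anchor features.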

Visualization of the Denoising Process

TAX3D-CD

TAX3D-CP

Simulation Experiments

In two simulated task environments (cloth hanging and bag hanging), TAX3D significantly outperforms 3D Diffusion Policy, a state-of-the-art method for end-to-end visuomotor policy learning, when generalizing to novel anchor poses and novel cloth geometries. All models are trained on 16 demonstrations.

TAX3D-CP (Ours)

Cloth Hanging

Bag Hanging