IROS 2024 (Oral)
Best Conference Paper/Best Student Paper Awards Finalist
Xiaolin Fang1 , Caelan Reed Garrett2, Clemens Eppner2,
Tomás Lozano-Pérez1, Leslie Pack Kaelbling1, Dieter Fox2
MIT CSAIL1 NVIDIA2
Generative models, such as diffusion models, excel at capturing high-dimensional distributions over diverse input modalities, e.g., robot trajectories, but are less effective at multi-step constraint reasoning. Task and Motion Planning (TAMP) approaches are well suited to planning multi-step autonomous robot manipulation, but they can be difficult to apply to domains where the environment and its dynamics are not fully known. We propose to overcome these limitations by composing diffusion models using a TAMP system. We use learned components for constraints and samplers that are difficult to engineer in the planning model, and use a TAMP solver to search for a task plan with constraint-satisfying action parameter values. To tractably make predictions for unseen objects in the environment, we define the learned samplers and TAMP operators on a learned latent embedding of changing object states. We evaluate our approach in a simulated articulated-object manipulation domain and show how the combination of classical TAMP, generative modeling, and latent embedding enables multi-step constraint-based reasoning. We also apply the learned sampler directly in the real world.
Diffusion Models as Samplers for Task and Motion Planning
Compose diffusion models and engineered samplers (e.g. RRT) through the TAMP framework
Adopt diffusion models as a generative representation of TAMP constraints.
✔️ Trajectory/skill modeling capability of diffusion models
✔️ Planning and reasoning capability of Task and Motion Planners
✔️ TAMP from raw perception input
Each task plan defines a Constraint-Satisfaction Problem (CSP)
Solve the CSP using learned and given samplers
Alternate between task-plan search and CSP solving (sketched below)
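A minimal Python sketch of this alternation, using hypothetical interfaces (enumerate_task_plans, to_csp, sampler, satisfied) in place of the actual TAMP solver: each candidate task plan induces a CSP over its free action parameters, the samplers attached to those parameters propose values, and failure falls back to the next candidate plan.

def solve(task_planner, max_csp_attempts=50):
    """Alternate between task-plan search and CSP solving (illustrative sketch)."""
    for skeleton in task_planner.enumerate_task_plans():
        # Each candidate task plan induces a CSP over its free action parameters.
        csp = skeleton.to_csp()
        for _ in range(max_csp_attempts):
            bindings = {}
            for var in csp.variables:
                # Draw a value from the sampler attached to this variable:
                # a learned diffusion sampler or an engineered one such as RRT,
                # conditioned on values already bound earlier in the plan.
                bindings[var] = csp.sampler(var).sample(bindings)
            if csp.satisfied(bindings):
                return skeleton.bind(bindings)  # fully grounded, executable plan
        # This skeleton could not be satisfied; try the next task plan.
    return None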
We model both the object state and the robot state (action) [Diffuser; Janner*, Du*, et al.].
Motivation: constraints are often defined on object states. Modeling the object state allows us to add or check constraints on a skill at test time.
The trajectory model encodes general constraints on the object and robot states.
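A minimal PyTorch sketch of such a trajectory model (not the paper's exact architecture): the object-state latent and robot state are concatenated per timestep, and the network is trained to predict the diffusion noise over the whole trajectory. The dimensions and names below are illustrative assumptions.

import torch
import torch.nn as nn

OBJ_LATENT_DIM = 32   # learned object-state embedding size (assumption)
ROBOT_DIM = 7         # robot configuration / action size (assumption)
HORIZON = 16          # trajectory length (assumption)

class TrajectoryDenoiser(nn.Module):
    """Predicts the noise added to a [object latent, robot state] trajectory."""
    def __init__(self, feat_dim=OBJ_LATENT_DIM + ROBOT_DIM, hidden=256):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.net = nn.Sequential(
            nn.Linear(HORIZON * feat_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, HORIZON * feat_dim))

    def forward(self, noisy_traj, t):
        # noisy_traj: (B, HORIZON, feat_dim); t: (B,) integer diffusion step
        b = noisy_traj.shape[0]
        temb = self.time_embed(t.float().unsqueeze(-1))
        x = torch.cat([noisy_traj.reshape(b, -1), temb], dim=-1)
        return self.net(x).reshape(b, HORIZON, -1)

Because the object latent is part of the modeled state, constraints on object outcomes (e.g., a door-closed test) can be checked or imposed on the sampled trajectory at planning time.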
To draw samples that satisfy certain constraints during planning, such as trajectories that close the microwave, we use classifier-guided conditional sampling of the diffusion model.
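A minimal sketch of classifier-guided sampling, assuming a trained denoiser as above, a constraint classifier constraint_logit that scores whether the predicted trajectory satisfies the goal (e.g., the microwave door ends up closed), and a 1-D tensor alphas_cumprod of cumulative noise-schedule products; the guidance scale and the DDIM-style update are illustrative choices, not necessarily the paper's.

import torch

@torch.no_grad()
def guided_sample(denoiser, constraint_logit, alphas_cumprod, shape, guidance_scale=1.0):
    x = torch.randn(shape)                        # start from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), t)

        eps = denoiser(x, t_batch)                # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean trajectory

        # Constraint guidance: nudge the predicted trajectory toward satisfying
        # the constraint by following the classifier's log-probability gradient.
        with torch.enable_grad():
            x0_g = x0.detach().requires_grad_(True)
            logp = torch.nn.functional.logsigmoid(constraint_logit(x0_g)).sum()
            grad = torch.autograd.grad(logp, x0_g)[0]
        x0 = x0 + guidance_scale * grad

        # DDIM-style deterministic step from t to t-1 using the guided x0.
        eps_g = (x - a_t.sqrt() * x0) / (1 - a_t).sqrt()
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps_g
    return x

Within the planner, such a guided sampler plays the role of a conditional sampler for one constraint of the CSP.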
Different skills may require different state and action representations. DiMSam can compose heterogeneous skills as long as constraints can be defined on the state representations used by those skills.
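One way to picture this composition (a hypothetical sketch, not the paper's interface, with made-up samplers and a made-up constraint test): two skills with different action representations are chained, and the shared currency between them is the object-state latent on which the linking constraint is defined.

def open_then_grasp(open_door_sampler, grasp_sampler, door_open_test):
    # Skill 1: a trajectory diffusion sampler over (object latent, robot state).
    z_after, open_traj = open_door_sampler.sample()
    # Constraint defined on the shared object-state latent.
    if not door_open_test(z_after):
        return None
    # Skill 2: a grasp sampler with its own action representation (a single
    # grasp pose rather than a trajectory), conditioned on the same latent.
    grasp_pose = grasp_sampler.sample(condition=z_after)
    return [("open_door", open_traj), ("grasp", grasp_pose)]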
@inproceedings{fang2024dimsam,
title={{DiMSam: Diffusion Models as Samplers for Task and Motion Planning under Partial Observability}},
author={Xiaolin Fang and Caelan Reed Garrett and Clemens Eppner and Tomás Lozano-Pérez and Leslie Pack Kaelbling and Dieter Fox},
booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year={2024},
}
Simulation Environment
Real World Observation
Diffusion prediction error: the microwave door is not fully closed
Planning (non-diffusion component) and execution error: collision detected in the pre-pushing trajectory
Execution uncertainty: stick slips in the gripper
Partial observability: sparsity of the point cloud