Constrained-Context Conditional Diffusion Models
for Imitation Learning
Vaibhav Saxena*, Yotto Koga^, Danfei Xu*
*School of Interactive Computing, Georgia Tech
^Robotics Lab, Autodesk Research
Paper | Code coming soon!
Abstract
Offline Imitation Learning (IL) is a powerful paradigm for learning visuomotor skills, especially for high-precision manipulation tasks. However, IL methods are prone to spurious correlations: expressive models may focus on distractors that are irrelevant to action prediction, and are thus fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations. However, none were able to balance sample efficiency, robustness against distractors, and solving high-precision manipulation tasks with complex action spaces. To this end, we present the Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks with high precision and the ability to ignore distractions. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around the predicted action while ignoring distractors in the context. We empirically show that C3DM consistently achieves high success rates on a wide array of tasks, ranging from tabletop manipulation to industrial kitting, that require varying levels of precision and robustness to distractors.
place red in green
kitting part
hang cup
two-part assembly
Method Overview
C3DM is a method for visuomotor imitation learning in high-precision tasks.
Processing top-down view
The camera capture is passed through a fully convolutional network. The resulting embeddings are concatenated with a randomly sampled 6-DoF action and fed into an MLP, which outputs a score that is used to denoise the random action.
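The conditioning described above can be sketched as a small PyTorch module. Layer sizes, image resolution, and the exact encoder architecture are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ScoreNetwork(nn.Module):
    """Sketch of a context-conditioned score network (architecture assumed)."""

    def __init__(self, action_dim=6, embed_dim=128):
        super().__init__()
        # Fully convolutional encoder for the top-down camera capture
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # MLP maps (image embedding, noisy 6-DoF action) -> score for denoising
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, noisy_action):
        z = self.encoder(image)                   # (B, embed_dim)
        x = torch.cat([z, noisy_action], dim=-1)  # concatenate context and action
        return self.mlp(x)                        # predicted score

net = ScoreNetwork()
img = torch.randn(1, 3, 96, 96)  # top-down camera capture
a_t = torch.randn(1, 6)          # random 6-DoF action sample
score = net(img, a_t)
print(score.shape)  # torch.Size([1, 6])
```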
Predicting "fixation point"
The denoised action is then used to predict a fixation point in the observation space, around which we zoom into the context. This zooming lets the model be highly precise, since it now predicts actions in the new, rescaled action space, while also discarding distractors far from the fixation point.
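A minimal NumPy sketch of this fixation step: crop the observation around the predicted fixation point and remap in-plane action coordinates into the cropped frame. The crop fraction and pixel-coordinate convention are assumptions for illustration.

```python
import numpy as np

def fixate(image, fixation_xy, crop_frac=0.5):
    """Zoom into a square region centered on fixation_xy (pixel coords)."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    # Clamp the center so the crop stays inside the image
    cx = int(np.clip(fixation_xy[0], cw // 2, w - cw // 2))
    cy = int(np.clip(fixation_xy[1], ch // 2, h - ch // 2))
    crop = image[cy - ch // 2: cy + ch // 2, cx - cw // 2: cx + cw // 2]

    # Map an action's pixel coordinates into the cropped (rescaled) frame
    def to_crop_frame(xy):
        return np.array([xy[0] - (cx - cw // 2), xy[1] - (cy - ch // 2)])

    return crop, to_crop_frame

obs = np.zeros((96, 96, 3))
crop, remap = fixate(obs, fixation_xy=(60, 40))
print(crop.shape)       # (48, 48, 3)
print(remap((60, 40)))  # fixation point maps to the crop center: [24 24]
```

Because later denoising steps operate on the smaller crop, the same network resolution now covers a smaller physical region, which is what yields the precision gain.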
Action prediction using de-noising diffusion
Starting from a randomly sampled 6-DoF action, the model iteratively denoises the action within the constrained context, alternating denoising updates with fixation until a precise final action is obtained.
Fixation while Denoising
Here we show the action-denoising process for the place-red-in-green task (here, picking the red block 🟥), coupled with context constraining. The model uses the generated score field to zoom into the observation around predicted fixation points.
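The alternation between denoising and fixation can be sketched as a simple loop. Here `score_fn` and `fixate_fn` stand in for the learned score network and the cropping step, and the fixed step size and step count are toy assumptions rather than the paper's sampler.

```python
import numpy as np

def denoise_with_fixation(score_fn, fixate_fn, obs, steps=10, step_size=0.1):
    """Alternate score-based action updates with context constraining (sketch)."""
    action = np.random.randn(6)             # start from a random 6-DoF action
    context = obs
    for _ in range(steps):
        score = score_fn(context, action)   # predicted denoising direction
        action = action + step_size * score # denoising update
        context = fixate_fn(obs, action)    # re-constrain context around the action
    return action

# Toy stand-ins for the demo: the score pulls the action toward a fixed target,
# and fixation is a no-op crop.
target = np.ones(6)
score_fn = lambda ctx, a: target - a
fixate_fn = lambda obs, a: obs
a = denoise_with_fixation(score_fn, fixate_fn, obs=np.zeros((96, 96, 3)))
```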