Constrained-Context Conditional Diffusion Models
for Imitation Learning
Vaibhav Saxena*, Yotto Koga^, Danfei Xu*
*School of Interactive Computing, Georgia Tech
^Robotics Lab, Autodesk Research
Paper | Code coming soon!
Abstract
Offline Imitation Learning (IL) is a powerful paradigm for learning visuomotor skills, especially for high-precision manipulation tasks. However, IL methods are prone to spurious correlations: expressive models may focus on distractors that are irrelevant to action prediction, and are thus fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations. However, none were able to balance sample efficiency, robustness against distractors, and solving high-precision manipulation tasks with complex action spaces. To this end, we present the Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks with high precision and the ability to ignore distractions. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around the predicted action while ignoring distractors in the context. We empirically show that C3DM consistently achieves high success rates on a wide array of tasks, ranging from tabletop manipulation to industrial kitting, that require varying levels of precision and robustness to distractors.
place red in green
kitting part
hang cup
two-part assembly
Method Overview
C3DM is a method for visuomotor imitation learning in high-precision tasks.
Processing top-down view
The camera capture is passed through a fully convolutional network. The resulting embeddings are concatenated with a randomly sampled 6-DoF action and fed into an MLP, which outputs a score that is used to denoise the random action.
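The conditioning described above can be sketched as a small PyTorch module. Layer sizes, image resolution, and the exact encoder architecture are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ScoreNetwork(nn.Module):
    """Sketch of a context-conditioned score network (architecture assumed)."""

    def __init__(self, action_dim=6, embed_dim=128):
        super().__init__()
        # Fully convolutional encoder for the top-down camera capture
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # MLP maps (image embedding, noisy 6-DoF action) -> score for denoising
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, noisy_action):
        z = self.encoder(image)                   # (B, embed_dim)
        x = torch.cat([z, noisy_action], dim=-1)  # concatenate context and action
        return self.mlp(x)                        # predicted score

net = ScoreNetwork()
img = torch.randn(1, 3, 96, 96)  # top-down camera capture
a_t = torch.randn(1, 6)          # random 6-DoF action sample
score = net(img, a_t)
print(score.shape)  # torch.Size([1, 6])
```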
Predicting "fixation point"
The denoised action is then used to predict a fixation point in the observation space, around which we zoom into the context. This zooming lets the model be highly precise, since it now predicts actions in the new, rescaled action space, while also discarding distractors far from the fixation point.
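A minimal NumPy sketch of this fixation step: crop the observation around the predicted fixation point and remap in-plane action coordinates into the cropped frame. The crop fraction and pixel-coordinate convention are assumptions for illustration.

```python
import numpy as np

def fixate(image, fixation_xy, crop_frac=0.5):
    """Zoom into a square region centered on fixation_xy (pixel coords)."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    # Clamp the center so the crop stays inside the image
    cx = int(np.clip(fixation_xy[0], cw // 2, w - cw // 2))
    cy = int(np.clip(fixation_xy[1], ch // 2, h - ch // 2))
    crop = image[cy - ch // 2: cy + ch // 2, cx - cw // 2: cx + cw // 2]

    # Map an action's pixel coordinates into the cropped (rescaled) frame
    def to_crop_frame(xy):
        return np.array([xy[0] - (cx - cw // 2), xy[1] - (cy - ch // 2)])

    return crop, to_crop_frame

obs = np.zeros((96, 96, 3))
crop, remap = fixate(obs, fixation_xy=(60, 40))
print(crop.shape)       # (48, 48, 3)
print(remap((60, 40)))  # fixation point maps to the crop center: [24 24]
```

Because later denoising steps operate on the smaller crop, the same network resolution now covers a smaller physical region, which is what yields the precision gain.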
Action prediction using de-noising diffusion
Starting from a randomly sampled 6-DoF action, the model iteratively denoises the action within the constrained context, alternating denoising updates with fixation until a precise final action is obtained.
Fixation while Denoising
Here we show the action-denoising process for the place-red-in-green task (here, picking the red block 🟥), coupled with context constraining. The model uses the generated score field to zoom into the observation around predicted fixation points.
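The alternation between denoising and fixation can be sketched as a simple loop. Here `score_fn` and `fixate_fn` stand in for the learned score network and the cropping step, and the fixed step size and step count are toy assumptions rather than the paper's sampler.

```python
import numpy as np

def denoise_with_fixation(score_fn, fixate_fn, obs, steps=10, step_size=0.1):
    """Alternate score-based action updates with context constraining (sketch)."""
    action = np.random.randn(6)             # start from a random 6-DoF action
    context = obs
    for _ in range(steps):
        score = score_fn(context, action)   # predicted denoising direction
        action = action + step_size * score # denoising update
        context = fixate_fn(obs, action)    # re-constrain context around the action
    return action

# Toy stand-ins for the demo: the score pulls the action toward a fixed target,
# and fixation is a no-op crop.
target = np.ones(6)
score_fn = lambda ctx, a: target - a
fixate_fn = lambda obs, a: obs
a = denoise_with_fixation(score_fn, fixate_fn, obs=np.zeros((96, 96, 3)))
```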