Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation
Jenny Wang*, Octavian Donca*, and David Held
*Contributed equally
Robotics Institute, School of Computer Science, Carnegie Mellon University
{jennyw2, odonca, dheld}@andrew.cmu.edu
ICRA 2024
Abstract
Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations when using relational reasoning networks with geometric inductive biases. However, such methods fail to consider that demonstrations for the same task can be fundamentally multimodal, such as a mug that can be hung on any of n racks. We propose a method that retains the provably translation-invariant and relational properties of prior work but incorporates additional properties that enable learning from multimodal, distributional examples. We show that our method can learn precise relative placement tasks from a small number of multimodal demonstrations, without human annotations, across a diverse set of objects within a category.
Videos
1 Rack
2 Racks
3 Racks
Bottle
Bowl
Intuition
Our method for relative placement prediction tasks learns a spatially-grounded latent distribution over demonstrations without human annotations, using an architecture with geometric inductive biases.
Method Overview
We formulate our method as a conditional variational autoencoder (cVAE). After ablation studies, we settle on an architecture with a dense latent distribution over points, regularized to a learned prior. At training time, we encode a demo Y into the latent space and decode a sample from it back to the same demo Y. At inference time, we sample from the learned prior, conditioned on the object point clouds in arbitrary poses (i.e., scattered on a table), and decode the sample to a demo.
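The training and inference paths described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear maps `W_enc`, `W_prior`, and `W_dec` are hypothetical stand-ins for the actual networks, the per-point latent is assumed to be a diagonal Gaussian, the reconstruction term is assumed to be a mean squared error, and the weight `beta`, the latent size `D`, and the toy point clouds are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 128, 4  # number of points, latent dimensions per point (illustrative)

# Hypothetical linear stand-ins for the encoder, learned prior, and decoder.
W_enc = rng.standard_normal((6, 2 * D)) * 0.1    # input: demo + scene coordinates
W_prior = rng.standard_normal((3, 2 * D)) * 0.1  # input: scene coordinates only
W_dec = rng.standard_normal((3 + D, 3)) * 0.1    # input: scene coordinates + latent


def encode(demo, scene):
    """Posterior q(z | Y, scene): per-point mean and log-variance."""
    h = np.concatenate([demo, scene], axis=-1) @ W_enc
    return h[:, :D], h[:, D:]


def prior(scene):
    """Learned prior p(z | scene): per-point mean and log-variance."""
    h = scene @ W_prior
    return h[:, :D], h[:, D:]


def decode(z, scene):
    """Map a per-point latent sample and the scene to predicted demo points."""
    return scene + np.concatenate([scene, z], axis=-1) @ W_dec


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians: summed over latent dims, averaged over points."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1).mean()


def training_loss(demo, scene, rng, beta=1.0):
    """Training: encode demo Y, sample, decode back to Y, regularize to the prior."""
    mu_q, logvar_q = encode(demo, scene)
    mu_p, logvar_p = prior(scene)
    z = reparameterize(mu_q, logvar_q, rng)
    recon = decode(z, scene)
    recon_err = np.mean((recon - demo) ** 2)  # assumed reconstruction term
    return recon_err + beta * gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)


def sample_demo(scene, rng):
    """Inference: sample from the prior conditioned on the scene, then decode."""
    mu_p, logvar_p = prior(scene)
    return decode(reparameterize(mu_p, logvar_p, rng), scene)


# Toy data: a point cloud in an arbitrary pose and a shifted copy as the "demo".
scene = rng.standard_normal((N, 3))
demo = scene + np.array([0.1, 0.0, 0.2])

loss = training_loss(demo, scene, rng)
samples = [sample_demo(scene, rng) for _ in range(3)]
```

Each call to `sample_demo` draws a fresh latent from the prior, so repeated calls on the same scene yield different decoded demos. With trained networks in place of the linear stand-ins, this is the mechanism that would let separate samples land on separate modes of the task.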
Learned Distributional Demonstration Spaces
Our method learns a latent distribution over demonstrations. Decoded samples from this learned distribution capture multiple ways of successfully completing a task, as shown in green, yellow, and red, overlaid on the static scene in blue.