Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation

Jenny Wang*, Octavian Donca*, and David Held
*Contributed equally

Robotics Institute, School of Computer Science, Carnegie Mellon University
{jennyw2, odonca, dheld}@andrew.cmu.edu


ICRA 2024

Abstract

Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations when using relational reasoning networks with geometric inductive biases. However, such methods fail to consider that demonstrations for the same task can be fundamentally multimodal, such as a mug hanging on any of n racks. We propose a method that retains the provably translation-invariant and relational properties of prior work but incorporates additional properties that enable learning multimodal, distributional examples. We show that our method learns precise relative placement tasks from a small number of multimodal demonstrations, without human annotations, across a diverse set of objects within a category.

Videos

1 Rack

2 Racks

3 Racks

Bottle

Bowl

Intuition

Our method for relative placement prediction tasks learns a spatially-grounded latent distribution over demonstrations without human annotations, using an architecture with geometric inductive biases.

Method Overview

We formulate our method as a conditional variational autoencoder (cVAE). After ablation studies, we settle on an architecture with a dense latent distribution over points, regularized to a learned prior. At training time, we encode a demo Y into the latent space and decode a sample back to the same demo Y. At inference time, we sample from the learned prior, conditioned on the object point clouds in arbitrary poses (i.e., scattered on a table), and decode the sample to a demo.
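The training and inference paths described above can be illustrated with a minimal sketch. It assumes the latent is a categorical distribution over the points of a point cloud, which is one reading of "dense latent distribution over points" and is not confirmed by this page. The logits are hand-written stand-ins for network outputs, and the names `categorical_kl` and `sample_point_index` are hypothetical. The decoder is omitted entirely.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of per-point scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def categorical_kl(posterior_logits, prior_logits):
    """KL(q || p) between two categorical distributions over the same points."""
    log_q = log_softmax(posterior_logits)
    log_p = log_softmax(prior_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

def sample_point_index(logits, rng):
    """Draw one point index from the categorical defined by the logits."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)

# ASSUMPTION: made-up scores for a 5-point cloud, standing in for network outputs.
posterior_logits = [0.1, 4.0, 0.2, 0.0, 0.3]  # encoder has seen the demo Y
prior_logits = [0.0, 2.0, 0.1, 2.0, 0.0]      # prior has seen only the point clouds

# Training: sample the latent from the posterior; the KL term regularizes
# the posterior toward the learned prior (a decoder would reconstruct Y).
z_train = sample_point_index(posterior_logits, rng)
kl_term = categorical_kl(posterior_logits, prior_logits)

# Inference: no demo is available, so the latent is sampled from the prior.
z_test = sample_point_index(prior_logits, rng)
```

In this toy setup the posterior concentrates on the single point consistent with the observed demo, while the prior keeps mass on both plausible points, which is why the KL term is strictly positive.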

Learned Distributional Demonstration Spaces

Our method learns a latent distribution over demonstrations. Decoded samples from this learned distribution capture multiple ways of successfully completing a task, as shown in green, yellow, and red, overlaid on the static scene in blue.
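As a toy illustration of how repeated sampling can surface several valid placements, the sketch below draws many latents from a made-up two-mode prior and counts how often each mode is hit. The probabilities, the point-to-mode mapping, and the "left rack" / "right rack" labels are all assumptions for illustration, not values or results from the paper.

```python
import random
from collections import Counter

# ASSUMPTION: a hand-written prior over 6 anchor points with two dominant
# modes, standing in for the learned prior on a two-rack scene.
prior_probs = [0.01, 0.48, 0.01, 0.01, 0.48, 0.01]

# ASSUMPTION: hypothetical mapping from anchor point index to placement mode.
mode_of_point = {1: "left rack", 4: "right rack"}

rng = random.Random(0)

# Each draw corresponds to one latent sample that a decoder would turn
# into one candidate placement.
samples = rng.choices(range(len(prior_probs)), weights=prior_probs, k=200)
mode_counts = Counter(mode_of_point.get(i, "off-mode") for i in samples)

for mode, count in sorted(mode_counts.items()):
    print(f"{mode}: {count}")
```

A unimodal predictor would return the same placement on every call; a sampled latent lets different draws commit to different racks.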

1 Rack Mug Placement

2 Racks Mug Placement

3 Racks Mug Placement

Bottle Placement

Bowl Placement

Mug Grasping

Bottle Grasping

Bowl Grasping

Appendix

TAXPoseD_ICRA_Appendix.pdf