TAX-Pose: Task-Specific Cross-Pose Estimation

for Robot Manipulation

Chuer Pan*, Brian Okorn*, Harry Zhang*, Ben Eisner*, David Held 

Robotics Institute, School of Computer Science, Carnegie Mellon University

 {chuerp, bokorn, haolunz, baeisner, dheld}@andrew.cmu.edu

(* indicates equal contribution)

ArXiv / Code

Abstract

How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical analysis of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We train our system using a small number of examples and demonstrate its capability to generalize to unseen objects in both simulation and the real world, deploying our policy on a Franka Panda across a number of tasks. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments.

Method

Our method learns a task-specific "cross-pose," which can be used to move a pair of objects into a goal configuration. We learn this cross-pose as a function of a dense set of soft correspondences, based on cross-object attention, augmented with a set of correspondence residuals. These residuals allow our virtual correspondences to map to points outside the convex hull of each object. Given this dense set of virtual correspondences, a single rigid transform that brings the objects into the goal configuration can be computed analytically using a weighted SVD.
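To make the last step concrete, below is a minimal sketch (our illustration, not the authors' released code) of a weighted SVD that recovers a rigid transform from weighted virtual correspondences (weighted Procrustes / Kabsch). Here `points` are the action-object points, `corrs` their predicted virtual correspondences near the anchor object, and `weights` the predicted per-point importance weights; all names are assumptions for illustration.

```python
import numpy as np

def weighted_svd_transform(points, corrs, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ p_i + t - c_i||^2."""
    w = weights / (weights.sum() + 1e-8)           # normalized weights
    p_bar = (w[:, None] * points).sum(axis=0)      # weighted centroid of action points
    c_bar = (w[:, None] * corrs).sum(axis=0)       # weighted centroid of correspondences
    P = points - p_bar                             # centered action points
    C = corrs - c_bar                              # centered correspondences
    H = (w[:, None] * P).T @ C                     # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # correct for reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # optimal rotation
    t = c_bar - R @ p_bar                          # optimal translation
    return R, t

# Example usage with stand-in data:
pts = np.random.rand(100, 3)
crs = pts + np.array([0.1, 0.0, 0.2])              # fake correspondences (pure translation)
R, t = weighted_svd_transform(pts, crs, np.ones(100))
```

Because the transform has a closed-form solution given the correspondences and weights, the whole prediction remains differentiable end-to-end.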

Training Pipeline

TAX-Pose Training Overview: Our method takes as input two point clouds for a specific task and outputs the cross-pose between them for that task. TAX-Pose first learns point cloud features using two DGCNN networks and two transformers. The learned features are then passed to two point residual networks to predict per-point soft correspondences and weights between the two objects. Finally, the desired cross-pose is inferred analytically using a singular value decomposition.
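Below is a minimal, hypothetical PyTorch sketch of the correspondence step described above: cross-object attention produces soft correspondences from one object's features to the other object's points, and a small residual head lets each virtual correspondence move off the anchor surface. Feature extraction (DGCNN + transformer) is abstracted away as precomputed features; all module and variable names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SoftCorrespondenceHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.residual_mlp = nn.Sequential(           # per-point correspondence residual
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.weight_mlp = nn.Sequential(             # per-point importance weight
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats_action, feats_anchor, points_anchor):
        # feats_action: (N, D), feats_anchor: (M, D), points_anchor: (M, 3)
        scores = feats_action @ feats_anchor.T / feats_action.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)          # cross-object attention
        soft_corr = attn @ points_anchor              # (N, 3) soft correspondences on the anchor
        residual = self.residual_mlp(feats_action)    # (N, 3) residuals off the anchor surface
        virtual_corr = soft_corr + residual           # may leave the anchor's convex hull
        weights = torch.sigmoid(self.weight_mlp(feats_action)).squeeze(-1)  # (N,)
        return virtual_corr, weights

# Example usage with random stand-in features and anchor points:
head = SoftCorrespondenceHead()
corr, w = head(torch.randn(512, 128), torch.randn(1024, 128), torch.rand(1024, 3))
```

The predicted correspondences and weights are exactly the quantities consumed by the weighted SVD shown earlier.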

Physical Experiments Results

In the video below, we demonstrate that TAX-Pose is able to accomplish mug-hanging tasks, which require high precision. The mug-hanging TAX-Pose model here was trained using 10 real mug-rack demonstrations and tested on novel rack poses and 10 other novel mugs, without requiring any simulation demonstrations or keypoint annotations.  

In the image below, we show the 10 training mugs and the 10 test mugs.

Our method reasons about the point clouds for both the mug and the rack to estimate the pose relationship between these two objects; thus, our method can generalize to new poses for both the mug and the rack, as shown in the video below (same video as at the top of the page), in which we continuously move around both the mug and the mug rack and repeatedly hang the mug at each new rack location.


A set of continuous trials for mug-hanging with TAX-Pose in the real world

In the videos below, we demonstrate that TAX-Pose is able to accomplish a wider variety of tasks in which the robot places one object relative to another. Specifically, we apply three task-specific models (Inside, Top, and Left), trained in simulation, to the real world without fine-tuning.

Semantic Goal: INSIDE. We place one object (e.g., a cube or a bowl) INSIDE another object (e.g., a drawer, a fridge, or an oven). We trained a model that accomplishes the "Inside" task in simulation.

Top Row: Real-world execution of the "Inside" task. 

Bottom Row: Point clouds of the action object's starting pose, the anchor object, and the action object at the predicted TAX-Pose.

Semantic Goal: TOP. We place one object (e.g., a cube or a bowl) ON TOP OF another object (e.g., a drawer, a fridge, or an oven). We trained a model that accomplishes the "Top" task in simulation.

Top Row: Real-world execution of the "Top" task. Bottom Row: Point clouds of action object starting pose, anchor object, and action object predicted TAX-Pose.

Semantic Goal: LEFT. We place one object (e.g., a cube or a bowl) TO THE LEFT OF another object (e.g., a drawer, a fridge, or an oven). We trained a model that accomplishes the "Left" task in simulation.

Top Row: Real-world execution of the "Left" task. Bottom Row: Point clouds of action object starting pose, anchor object, and action object predicted TAX-Pose.

Real-World Failures of TAX-Pose Prediction. Here we show a failure of the "Inside" prediction and a failure of the "Left" prediction.

In the failure of the "Inside" prediction, the predicted TAX-Pose penetrates the oven base too much. In the failure of the "Left" prediction, the predicted TAX-Pose penetrates the drawer base too much.

Simulation Details

Here we show the simulation details (PyBullet) for the PartNet Mobility Objects Placement task. We select a set of household furniture objects from the PartNet-Mobility dataset as the anchor objects, and a set of small rigid objects released with the Ravens simulation environment as the action objects. For each anchor object, we define a set of semantic goal positions (i.e., 'top', 'left', 'right', 'in') where action objects should be placed relative to that anchor; each semantic goal position defines a unique task in our cross-pose prediction framework. The selected subset of the PartNet-Mobility dataset contains 8 categories of common household objects, which we split into 54 seen and 14 unseen instances.
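For illustration, here is a minimal PyBullet sketch of how an action object could be placed at such a semantic goal position relative to an anchor object's bounding box to generate a ground-truth pose. The URDF paths and the simple AABB-based goal offsets are our own assumptions, not the paper's exact environment code.

```python
import pybullet as p

p.connect(p.DIRECT)
# Hypothetical asset paths for a PartNet-Mobility anchor and a Ravens action object.
anchor_id = p.loadURDF("partnet_mobility/oven/mobility.urdf",
                       basePosition=[0, 0, 0], useFixedBase=True)
action_id = p.loadURDF("ravens/block/block.urdf", basePosition=[1, 0, 0])

aabb_min, aabb_max = p.getAABB(anchor_id)            # anchor axis-aligned bounding box
center = [(lo + hi) / 2 for lo, hi in zip(aabb_min, aabb_max)]

# Illustrative semantic goal positions defined relative to the anchor's AABB.
goals = {
    "top":   [center[0], center[1], aabb_max[2] + 0.05],
    "left":  [aabb_min[0] - 0.10, center[1], center[2]],
    "right": [aabb_max[0] + 0.10, center[1], center[2]],
    "in":    [center[0], center[1], center[2]],
}

# Move the action object to the "top" goal to record a ground-truth placement.
p.resetBasePositionAndOrientation(action_id, goals["top"], [0, 0, 0, 1])
```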

PartNet Mobility Objects Placement

Training Data (Ground Truth Pose of Action Object) for the "INSIDE" Semantic Goal

Training Data (Ground Truth Pose of Action Object) for the "TOP" Semantic Goal

Training Data (Ground Truth Pose of Action Object) for the "LEFT" Semantic Goal

Training Data (Ground Truth Pose of Action Object) for the "RIGHT" Semantic Goal

Supplementary Material

tax_pose_appendix.pdf