Latent Space Planning for Multi-Object Manipulation with Environment-Aware Relational Classifiers

IEEE Transactions on Robotics (T-RO)

Paper Code Twitter Appendix

Abstract

For all but the simplest of tasks, robots must understand how the objects they manipulate will interact with structural elements of the environment. We examine the problem of predicting inter-object and object-environment relations between previously unseen objects and novel environments purely from partial-view point clouds. Our approach enables robots to plan and execute manipulation sequences to complete tasks defined purely by logical relations, removing the burden on users of providing explicit metric state goals. Key to our method is a novel transformer-based neural network that both predicts object-environment relations and learns a latent-space dynamics function. We achieve reliable sim-to-real transfer without any fine-tuning. Our experiments show that our model understands how changes in observed environmental geometry relate to semantic relations between objects.

Method Overview

Taking a segmented, partial-view point cloud as input, we first process it with PointConv to generate segment-specific features. We then pass these features into an encoder that predicts a latent state X. Decoding X predicts both whether each segment is a movable object and the relations between object and environment segments. By learning an action-conditioned, latent-space dynamics model, our approach can solve multi-step planning problems.
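To make the pipeline concrete, below is a minimal sketch of what such a model could look like in PyTorch. This is not the authors' implementation: the module names, feature sizes, relation vocabulary, and action encoding are assumptions, and a simple pooled point MLP stands in for the PointConv feature extractor.

```python
# Minimal sketch of the described pipeline (not the authors' code).
# Feature sizes, relation count, and action encoding are assumptions.
import torch
import torch.nn as nn


class SegmentEncoder(nn.Module):
    """Stand-in for PointConv: per-segment point cloud -> feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, points):                 # points: (num_segments, num_points, 3)
        return self.mlp(points).max(dim=1).values   # (num_segments, feat_dim)


class RelationalDynamicsModel(nn.Module):
    """Encodes segments to latent states X, then decodes movability, pairwise
    relations, and an action-conditioned latent-space dynamics step."""
    def __init__(self, feat_dim=128, latent_dim=128, action_dim=7, num_relations=7):
        super().__init__()
        self.segment_encoder = SegmentEncoder(feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        self.movable_head = nn.Linear(latent_dim, 1)          # movable object?
        self.relation_head = nn.Sequential(                    # pairwise relations
            nn.Linear(2 * latent_dim, 128), nn.ReLU(),
            nn.Linear(128, num_relations))
        self.dynamics = nn.Sequential(                         # latent-space dynamics
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))

    def encode(self, segment_points):
        feats = self.segment_encoder(segment_points).unsqueeze(0)   # (1, S, feat_dim)
        return self.to_latent(self.encoder(feats)).squeeze(0)       # X: (S, latent_dim)

    def decode(self, X):
        S = X.shape[0]
        movable = torch.sigmoid(self.movable_head(X)).squeeze(-1)   # (S,)
        pairs = torch.cat([X.unsqueeze(1).expand(S, S, -1),
                           X.unsqueeze(0).expand(S, S, -1)], dim=-1)
        relations = torch.sigmoid(self.relation_head(pairs))        # (S, S, num_relations)
        return movable, relations

    def predict_next(self, X, action):
        # Roll the latent state forward under one action, broadcast to every segment.
        a = action.unsqueeze(0).expand(X.shape[0], -1)
        return X + self.dynamics(torch.cat([X, a], dim=-1))
```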

Simulation training examples

Visualization of simulation data collection across different environments. (Fig. 8 in the paper)

Robot Experiments

The robot can reason about both pick-and-place and pushing actions in an environment with multiple shelves. (Fig. 1 and Fig. 10 in the paper)

Goal: Contact(all red boxes, low shelf) = 1

Goal: Contact(white, coffee can) = 1

Given the same initial scene, the robot is tasked with moving all objects either to the boundary of the supporting table or off of it entirely. The robot succeeds for tables of varying shape, size, and height. These results highlight the model's ability to ground object-environment semantic concepts in the geometry of the observed scene. (Fig. 15 in the paper)

Goal: Boundary(all objects, table) = 1

Goal: Above(all objects, table) = 0

Goal: Boundary(all objects, table) = 1

Goal: Above(all objects, table) = 0

Given the same goal relation Contact(white, shelf) = 1 in the same environment but with different initial object poses (standing versus lying down), our framework chooses between different actions (picking versus pushing) to achieve the goal relation. Furthermore, for the same scene, the robot understands how to manipulate an object to be above, below, or in contact with a shelf. The robot also chooses pick-and-place to achieve a desired object-environment contact relation when the shelf is high, and chooses to push when the shelf is low. (Fig. 14 in the paper) A sketch of such a goal-directed planning loop follows the examples below.

Goal: Contact(white, shelf) = 1

Goal: Contact(white, shelf) = 1

Goal: Below(white, shelf) = 1

Goal: Contact(white, shelf) = 1
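As referenced above, here is a minimal random-shooting planner sketch that searches for action sequences whose predicted latent rollout satisfies a set of goal relations, using the model sketched in the Method Overview. The goal-tuple format, action sampler, and scoring function are assumptions for illustration, not the authors' planner.

```python
# Minimal random-shooting planner sketch over the model sketched earlier
# (an assumption, not the authors' planner). Goals are
# (segment_i, segment_j, relation_index, value) tuples, e.g. Contact(white, shelf) = 1.
import torch


def sample_action(action_dim=7):
    """Hypothetical action sampler covering pick-and-place and push primitives."""
    return torch.randn(action_dim)


def goal_score(relations, goals):
    """Higher is better: how closely decoded relations match the desired values."""
    return -float(sum(abs(relations[i, j, r] - value) for i, j, r, value in goals))


def plan(model, segment_points, goals, num_samples=256, horizon=2):
    """Search action sequences whose predicted latent rollout satisfies the goal."""
    X0 = model.encode(segment_points)
    best_score, best_plan = -float("inf"), None
    for _ in range(num_samples):
        X, actions = X0, []
        for _ in range(horizon):
            action = sample_action()
            X = model.predict_next(X, action)
            actions.append(action)
        _, relations = model.decode(X)
        score = goal_score(relations, goals)
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan


# Example usage, assuming segment 0 is the white object, segment 3 is the shelf,
# and relation index 0 is Contact:
# goals = [(0, 3, 0, 1.0)]
# actions = plan(model, segment_points, goals)
```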

Our planner performs diverse multi-object rearrangements, including pushing objects into contact, deconstructing towers, and aligning objects spatially. (Fig. 7 in the paper)

Goal: Contact(red mug, spam box) = 1 and Contact(red mug, yellow mustard) = 1

Goal: Right(coffee can, red mug) = 1 and Front(coffee can, white cleaner) = 1

Two test examples with more complex shelves and novel viewpoints. (Fig. 9 in the paper)

Goal: Contact(white cleaner on the top layer, bottom layer) = 1

Goal: Contact(white cleaner on the bottom layer, top layer) = 1

For the same initial scene, we show different valid states found by our planner and model for two different goal settings. The robot can pick either the green object or the red object and place it atop the yellow object. (Fig. 3 in the paper)

Goal: Below(yellow, red) = 1

Goal: Below(yellow, red) = 1