Deep Visual Constraints:
Neural Implicit Models for Manipulation Planning from Visual Input

Jung-Su Ha, Danny Driess, Marc Toussaint

Learning & Intelligent Systems Lab, TU Berlin

[Paper] [Code] [Bibtex]

Abstract: Manipulation planning is the problem of finding a sequence of robot configurations that involves interactions with objects in the scene, e.g., grasping and placing an object, or more general tool-use. To achieve such interactions, traditional approaches require hand-engineering of object representations and interaction constraints, which easily becomes tedious when complex objects/interactions are considered. Inspired by recent advances in 3D modeling, e.g., NeRF, we propose a method to represent objects as neural implicit functions upon which constraint features are defined and jointly trained. In particular, the proposed pixel-aligned representation is directly inferred from images with known camera geometry and naturally acts as a perception component in the whole manipulation pipeline, thereby enabling long-horizon planning only from visual input.

I. Overview

Unlike the static environment and the robot's own body, the objects to be manipulated often come without 3D models. Deep Visual Constraints (DVCs) represent an object as a neural implicit function inferred directly from color images, on top of which task constraint functions are defined. The implicit representation naturally handles rigid transformations of the object in SE(3), enabling efficient optimization-based manipulation planning.
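
As a rough illustration of the SE(3) point (the notation below is assumed for this page, not taken from the paper): if f_o is the implicit function of an object learned in its object-centric frame, a rigid transform T of the object is handled simply by mapping query points back into that frame,

    f_o^{T}(x) = f_o\!\left(T^{-1} x\right), \qquad T \in SE(3),

so constraints built on f_o remain differentiable with respect to the object pose T.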

II. Network Architecture

1. Pixel-aligned Implicit Object Representation (PIFO)

  • A U-Net encodes each input image into a pixel-wise feature image

  • A 3D query point p is projected into pixel coordinates using the known camera geometry (extrinsic T, intrinsic K)

  • The representation vector y is computed by sampling the local image features at the projected point (see the sketch below)
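
A minimal PyTorch sketch of this pixel-aligned feature lookup; the tensor shapes, the helper name, and the use of bilinear grid sampling are assumptions for illustration, not the released code:

    import torch
    import torch.nn.functional as F

    def pixel_aligned_features(feat, p, T, K, img_size=128):
        """feat: U-Net feature images (B, C, H, W); p: 3D points (B, N, 3);
        T: world-to-camera extrinsics (B, 4, 4); K: intrinsics (B, 3, 3)."""
        # Homogeneous world points -> camera frame.
        p_h = torch.cat([p, torch.ones_like(p[..., :1])], dim=-1)              # (B, N, 4)
        p_cam = (T @ p_h.transpose(1, 2))[:, :3]                               # (B, 3, N)
        # Perspective projection to pixel coordinates (points assumed in front of the camera).
        uvw = K @ p_cam                                                        # (B, 3, N)
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                           # (B, 2, N)
        # Normalize to [-1, 1] and bilinearly sample the local feature at each projection.
        grid = (2.0 * uv / (img_size - 1) - 1.0).transpose(1, 2).unsqueeze(2)  # (B, N, 1, 2)
        y = F.grid_sample(feat, grid, align_corners=True)                      # (B, C, N, 1)
        return y.squeeze(-1).transpose(1, 2)                                   # representation vectors (B, N, C)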

2. Deep Visual Constraints (DVCs)

  • Representation vectors are collected from the shared backbone (PIFO) at keypoints attached to a robot frame, e.g., the gripper

  • The constraint value (interaction feasibility) is predicted from the collected representation vectors (see the sketch below)
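
A hedged sketch of the constraint head; the number of keypoints, the hidden sizes, and the simple concatenation of keypoint features are illustrative assumptions:

    import torch.nn as nn

    class DVCHead(nn.Module):
        """Predict a scalar constraint value (interaction feasibility) from the
        pixel-aligned features gathered at the gripper keypoints."""

        def __init__(self, feat_dim, n_keypoints, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim * n_keypoints, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, keypoint_feats):               # (B, n_keypoints, feat_dim)
            return self.mlp(keypoint_feats.flatten(1))   # (B, 1)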

III. Training

1. Data Generation & Augmentation

1.1 Task Data

131 mug meshes are taken from ShapeNet and convexified. For each mug,

  • 11,000 query points are sampled and their signed distances to the mesh are computed (see the sketch below)

  • 1,000 feasible grasping/hanging poses are generated with the Bullet physics simulator
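
A hedged sketch of the per-mug point sampling; the near-surface/uniform split, the noise level, the bounding-box padding, and the use of trimesh are assumptions (the page only states that 11,000 points with signed distances are computed):

    import numpy as np
    import trimesh

    def sample_sdf_points(mesh, n_surface=8000, n_uniform=3000, noise=0.01):
        # Points near the mesh surface, perturbed with Gaussian noise.
        near = mesh.sample(n_surface) + noise * np.random.randn(n_surface, 3)
        # Points sampled uniformly in a slightly padded bounding box.
        lo, hi = mesh.bounds
        uniform = np.random.uniform(lo - 0.05, hi + 0.05, size=(n_uniform, 3))
        points = np.vstack([near, uniform])
        # Signed distances to the mesh (trimesh convention: positive inside).
        sdf = trimesh.proximity.signed_distance(mesh, points)
        return points, sdf   # 8,000 + 3,000 = 11,000 labeled points per mug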

1.2 Posed Image Data

  • The image data consists of 100 images (128×128) with the corresponding camera extrinsic/intrinsic matrices; all cameras look at the center of the object

  • The camera azimuth and elevation are uniformly sampled (see the sketch below)

  • The camera distance and lighting are randomized
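
A hedged sketch of the camera sampling; the distance range, the elevation limits, and the axis conventions are illustrative assumptions:

    import numpy as np

    def sample_look_at_camera(center, d_min=0.4, d_max=0.8):
        az = np.random.uniform(0.0, 2.0 * np.pi)     # azimuth, uniform
        el = np.random.uniform(-1.4, 1.4)            # elevation (rad), kept off the poles
        d = np.random.uniform(d_min, d_max)          # randomized camera distance
        # Camera position on a sphere around the object center.
        eye = center + d * np.array([np.cos(el) * np.cos(az),
                                     np.cos(el) * np.sin(az),
                                     np.sin(el)])
        # Look-at rotation: the camera z-axis points toward the object center.
        z = center - eye; z = z / np.linalg.norm(z)
        x = np.cross([0.0, 0.0, 1.0], z); x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        T = np.eye(4)
        T[:3, :3] = np.stack([x, y, z], axis=1)      # camera-to-world rotation
        T[:3, 3] = eye                               # camera position
        return T   # camera-to-world pose; invert for the world-to-camera extrinsic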

1.3 Data Augmentation

  • Random image rotations and shifts are applied, with the camera intrinsics/extrinsics modified accordingly (see the sketch below)

  • Random cutouts are applied to make the model robust to occlusion
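
A hedged sketch for the shift case; an in-plane rotation would analogously be absorbed into a roll of the camera extrinsic. The function and the zero-padding behaviour are illustrative assumptions:

    import numpy as np

    def shift_image_and_intrinsics(img, K, dx, dy):
        """Shift the image content by (dx, dy) pixels and move the principal point with it."""
        H, W = img.shape[:2]
        shifted = np.zeros_like(img)
        # Paste the image at an offset; pixels shifted out of view are dropped.
        xs, xd = (0, dx) if dx >= 0 else (-dx, 0)
        ys, yd = (0, dy) if dy >= 0 else (-dy, 0)
        h, w = H - abs(dy), W - abs(dx)
        shifted[yd:yd + h, xd:xd + w] = img[ys:ys + h, xs:xs + w]
        # The projection stays consistent if the principal point shifts by the same amount.
        K_new = K.copy()
        K_new[0, 2] += dx   # c_x
        K_new[1, 2] += dy   # c_y
        return shifted, K_new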

2. Training Loop and Loss Function
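
A hedged sketch of one joint training step consistent with the data described above: an SDF regression term supervises the implicit shape and a feasibility classification term supervises the constraint head. The loss weights, the L1/BCE choices, and the pifo.encode / pifo.sdf / pixel_aligned_features / dvc_head interfaces are assumptions, not the authors' recipe:

    import torch.nn.functional as F

    def training_step(pifo, dvc_head, batch, optimizer, w_sdf=1.0, w_task=1.0):
        # batch["imgs"], batch["T"], batch["K"]: posed images of one object
        # batch["points"], batch["sdf"]: query points and ground-truth signed distances
        # batch["kp_points"], batch["feasible"]: gripper keypoints of candidate grasp/hang
        #   poses and their float feasibility labels in {0, 1}
        feat = pifo.encode(batch["imgs"])                                  # pixel-wise features
        sdf_pred = pifo.sdf(feat, batch["points"], batch["T"], batch["K"])
        loss_sdf = F.l1_loss(sdf_pred, batch["sdf"])

        kp_feat = pixel_aligned_features(feat, batch["kp_points"], batch["T"], batch["K"])
        logits = dvc_head(kp_feat)
        loss_task = F.binary_cross_entropy_with_logits(logits, batch["feasible"])

        loss = w_sdf * loss_sdf + w_task * loss_task
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()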

IV. Planning with DVCs

Two preparations are needed to plug DVCs into manipulation planning:

  1. transforming the full scene images into object-centric ones

  2. rewriting DVCs as functions of the robot joint state and the object's rigid transformation

1. Multi-view Processing

Multi-view processing is a two-step procedure:

  1. Find a bounding ball of the object from the object masks

  2. Warp the raw images with a homography and compute the corresponding camera intrinsics/extrinsics (see the sketch below)
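
A hedged sketch in which the homography is a plain crop-and-resize around the projected bounding ball, one simple instance of the warp described above (the paper's exact warp may differ; u0, v0 is the crop's top-left corner and s its side length in raw pixels):

    import numpy as np
    import cv2

    def warp_to_object_crop(img, K, u0, v0, s, out=128):
        # Homography mapping raw pixels to crop pixels: translate the crop corner
        # (u0, v0) to the origin, then scale the crop to the output resolution.
        H = np.array([[out / s, 0.0, -u0 * out / s],
                      [0.0, out / s, -v0 * out / s],
                      [0.0, 0.0, 1.0]])
        warped = cv2.warpPerspective(img, H, (out, out))
        # Applying the same homography to the intrinsics keeps the projection
        # consistent; the extrinsic is unchanged by a pure crop-and-resize.
        K_new = H @ K
        return warped, K_new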

2. Manipulation Planning with DVCs

  • Express the DVCs as functions of the robot joint state and the object's rigid transformation

  • Solve a trajectory optimization over the path of joint states and object transformations, where the learned DVCs are included in the constraint set (see the sketch below)
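
A hedged sketch of how a learned DVC enters the optimization as a differentiable function of the joint state q and the object pose T_obj; forward_kinematics_keypoints is a placeholder, and pixel_aligned_features / dvc_head refer to the sketches in Sec. II:

    import torch

    def dvc_constraint(q, T_obj, feat, T_cam, K, dvc_head):
        # Gripper keypoints in the world frame via differentiable forward kinematics
        # (forward_kinematics_keypoints is assumed, not a provided function).
        p_world = forward_kinematics_keypoints(q)                          # (N, 3)
        # Map the keypoints into the object-centric frame of the captured images.
        p_h = torch.cat([p_world, torch.ones_like(p_world[:, :1])], dim=-1)
        p_obj = (torch.linalg.inv(T_obj) @ p_h.T)[:3].T                    # (N, 3)
        # Query the pixel-aligned features and predict interaction feasibility.
        y = pixel_aligned_features(feat, p_obj[None], T_cam, K)            # (1, N, C)
        return dvc_head(y)                                                 # constraint value

    # The planner then solves, roughly,
    #   min over q_{1:T}, T_obj_{1:T} of the path cost
    #   s.t. dvc_constraint(q_t, T_obj_t, ...) = 0 at the interaction time steps,
    # together with the usual kinematic and collision constraints.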

V-I. Demo: Sequential Manipulation

Pick-and-Hang of a training mug (9/10 success)

Three-mug Hanging

Pick-and-Hang of an unseen mug (7/10 success)

Handover

V-II. Demo: Zero-shot Imitation

V-III. Demo: Real Robot Transfer

Questions?

Contact [jung-su.ha@tu-berlin.de] for more information about the project