Deep Visual Constraints:
Neural Implicit Models for Manipulation Planning from Visual Input
Jung-Su Ha, Danny Driess, Marc Toussaint
Learning & Intelligent Systems Lab, TU Berlin
Abstract: Manipulation planning is the problem of finding a sequence of robot configurations that involves interactions with objects in the scene, e.g., grasping and placing an object, or more general tool-use. To achieve such interactions, traditional approaches require hand-engineering of object representations and interaction constraints, which easily becomes tedious when complex objects/interactions are considered. Inspired by recent advances in 3D modeling, e.g., NeRF, we propose a method to represent objects as neural implicit functions upon which constraint features are defined and jointly trained. In particular, the proposed pixel-aligned representation is directly inferred from images with known camera geometry and naturally acts as a perception component in the whole manipulation pipeline, thereby enabling long-horizon planning only from visual input.
I. Overview
Unlike the static environment and the robot's own body, 3D models of the objects being manipulated are often unavailable. Deep Visual Constraints (DVCs) represent an object as a neural implicit function inferred directly from color images, on top of which task constraint functions are defined. This implicit representation naturally describes the object's rigid transformation in SE(3), enabling efficient optimization-based manipulation planning.
II. Network Architecture
1. Pixel-aligned Implicit Object Representation (PIFO)
A U-Net encodes each image into a pixel-aligned feature image
A 3D query point, p, is projected into pixel coordinates using the known camera geometry (extrinsics T, intrinsics K)
The representation vector, y, is computed by sampling the local image features at the projected point (a sketch follows this list)
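A minimal sketch of this lookup in PyTorch, assuming bilinear interpolation via `grid_sample`; the names `feat_map`, `T_world2cam`, and `K` are illustrative, not the authors' API:

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map, p_world, T_world2cam, K):
    """feat_map: (C, H, W) U-Net feature image of one view,
    p_world: (N, 3) query points, T_world2cam: (4, 4), K: (3, 3)."""
    N = p_world.shape[0]
    # Transform points into the camera frame (homogeneous coordinates).
    p_h = torch.cat([p_world, torch.ones(N, 1)], dim=1)    # (N, 4)
    p_cam = (T_world2cam @ p_h.T).T[:, :3]                 # (N, 3)
    # Perspective projection into pixel coordinates.
    uvw = (K @ p_cam.T).T                                  # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                          # (N, 2)
    # Normalize to [-1, 1] and bilinearly sample the feature image.
    H, W = feat_map.shape[1:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=1)
    y = F.grid_sample(feat_map[None], grid[None, :, None, :],
                      align_corners=True)                  # (1, C, N, 1)
    return y[0, :, :, 0].T                                 # (N, C)
```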
2. Deep Visual Constraints (DVCs)
Representation vectors are collected from the shared backbone (PIFO) at keypoints attached to the robot's frame, e.g., the gripper
The constraint value (interaction feasibility) is predicted from the collected representation vectors (see the head sketch below)
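A minimal sketch of such a constraint head, assuming the keypoint features are simply concatenated and fed to an MLP; layer sizes and the keypoint count are illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DVCHead(nn.Module):
    """Maps PIFO features at K gripper keypoints to a scalar feasibility."""
    def __init__(self, feat_dim, n_keypoints, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * n_keypoints, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # scalar constraint value

    def forward(self, y_keypoints):
        # y_keypoints: (B, K, C) representation vectors at the keypoints.
        return self.mlp(y_keypoints.flatten(1)).squeeze(-1)  # (B,)
```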
III. Training
1. Data Generation & Augmentation
1.1 Task Data
131 mug meshes are taken from ShapeNet and convexified. For each mug,
11,000 query points and their signed distances are computed (a sampling sketch follows below)
1,000 feasible grasping/hanging poses are obtained (using Bullet)
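A hedged sketch of the point/SDF generation using `trimesh` (assumed tooling; the near-surface/uniform split and noise scale are illustrative, not the authors' exact recipe):

```python
import numpy as np
import trimesh

def sample_sdf_data(mesh: trimesh.Trimesh, n_near=8000, n_uniform=3000):
    # Points perturbed off the surface plus uniform points in the bounds.
    near = mesh.sample(n_near) + 0.01 * np.random.randn(n_near, 3)
    lo, hi = mesh.bounds
    uniform = np.random.uniform(lo, hi, size=(n_uniform, 3))
    pts = np.vstack([near, uniform])                      # 11,000 points
    # trimesh returns positive values inside the mesh; negate for the
    # usual SDF convention (negative inside).
    sdf = -trimesh.proximity.signed_distance(mesh, pts)
    return pts, sdf
```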
1.2 Posed Image Data
For each object, the image data consists of 100 images (128 by 128) together with camera extrinsic/intrinsic matrices; every camera looks at the center of the object
Camera azimuth & elevation are sampled uniformly (see the sketch after this list)
Camera distance and lighting are randomized
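A hedged sketch of the camera-pose sampling; the distance range, elevation range, and world-up convention are assumptions for illustration:

```python
import numpy as np

def sample_look_at_camera(center, rng, d_range=(0.5, 1.5)):
    az = rng.uniform(0.0, 2.0 * np.pi)            # uniform azimuth
    el = rng.uniform(-0.5 * np.pi, 0.5 * np.pi)   # uniform elevation
    d = rng.uniform(*d_range)                     # randomized distance
    eye = center + d * np.array([np.cos(el) * np.cos(az),
                                 np.cos(el) * np.sin(az),
                                 np.sin(el)])
    # Look-at rotation: camera z-axis points from eye toward the center.
    z = center - eye
    z /= np.linalg.norm(z)
    x = np.cross(z, np.array([0.0, 0.0, 1.0]))    # world up = +z (assumed)
    if np.linalg.norm(x) < 1e-6:                  # looking straight up/down
        x = np.array([1.0, 0.0, 0.0])
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)               # camera-to-world rotation
    T = np.eye(4)                                 # world-to-camera extrinsics
    T[:3, :3] = R.T
    T[:3, 3] = -R.T @ eye
    return T
```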
1.3 Data Augmentation
Random rotations & shifts are applied to the images, and the camera intrinsics/extrinsics are modified accordingly (a sketch follows below)
Random cutouts are applied to simulate occlusion
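A minimal sketch of the rotation/shift augmentation: any in-plane pixel transform A composes with the projection, so the intrinsics update as K' = A K. The OpenCV usage here is an illustrative choice, not the authors' implementation:

```python
import numpy as np
import cv2

def rotate_shift_augment(img, K, angle_deg, dx, dy):
    h, w = img.shape[:2]
    # Rotation about the image center plus a shift, as one affine map.
    A = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    A[:, 2] += (dx, dy)
    img_aug = cv2.warpAffine(img, A, (w, h))
    # The same pixel transform composes with the projection: K' = A K.
    K_new = np.vstack([A, [0.0, 0.0, 1.0]]) @ K
    return img_aug, K_new
```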
2. Training Loop and Loss Function
A standard L1 loss is used on the SDF data
A sign-agnostic L1 loss is used for the grasping/hanging features (both sketched below)
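A minimal sketch of the two losses, assuming a SAL-style sign-agnostic L1; taking zero as the unsigned target for feasible grasp/hang poses is an assumption:

```python
import torch

def sdf_loss(pred_sdf, gt_sdf):
    # Standard L1 regression on signed distances.
    return (pred_sdf - gt_sdf).abs().mean()

def sign_agnostic_l1(pred_feat, gt_unsigned=0.0):
    # Penalize |prediction| against an unsigned target, leaving the sign
    # of the learned interaction-feature field free to emerge.
    return (pred_feat.abs() - gt_unsigned).abs().mean()
```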
IV. Planning with DVCs
Two preparations are needed to plug DVCs into manipulation planning:
transforming whole-scene images into object-centric ones
rewriting DVCs as functions of the robot's joint state and the object's rigid transformation
1. Multi-view Processing
Multi-view processing is a two-step procedure (sketched after this list):
Find a bounding ball of the object from its masks
Warp the raw images via a homography and compute the corresponding camera intrinsics/extrinsics
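A hedged sketch of step 2, simplified to a crop-and-resize (an affine special case of a homography); the paper's actual warp and these function names are assumptions:

```python
import numpy as np
import cv2

def warp_to_object_crop(img, K, center_px, radius_px, out_size=128):
    u, v = center_px                              # projected ball center
    s = out_size / (2.0 * radius_px)              # scale to fit the ball
    # Pixel-space map: move ball center to the new image center, rescale.
    H = np.array([[s, 0.0, out_size / 2.0 - s * u],
                  [0.0, s, out_size / 2.0 - s * v],
                  [0.0, 0.0, 1.0]])
    img_warped = cv2.warpPerspective(img, H, (out_size, out_size))
    K_new = H @ K                                 # updated intrinsics
    return img_warped, K_new
```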
2. Manipulation Planning with DVCs
Express the DVCs as functions of the joint state and the object's rigid transformation (see the sketch below)
Solve a trajectory optimization over the path of joint states and object rigid transformations,
where the learned DVC is included in the constraint set
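A minimal sketch of how the learned DVC enters the optimization; `fk_keypoints` (forward kinematics of the gripper keypoints), `pifo_features`, and `dvc_head` are assumed components (the head matches the earlier sketch):

```python
import torch

def dvc_constraint(q, X_obj, fk_keypoints, pifo_features, dvc_head):
    p_world = fk_keypoints(q)             # (K, 3) keypoints from joint state
    # Object-centric coordinates: p_obj = X_obj^{-1} p_world.
    R, t = X_obj[:3, :3], X_obj[:3, 3]
    p_obj = (p_world - t) @ R             # right-multiplying applies R^T
    y = pifo_features(p_obj)              # (K, C) pixel-aligned features
    return dvc_head(y[None])              # ~0 when the interaction is feasible

# In the optimizer, dvc_constraint(q_t, X_obj_t, ...) is imposed at the
# interaction timesteps alongside smoothness and collision constraints;
# gradients w.r.t. q and X_obj are available through autograd.
```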
V-I. Demo: Sequential Manipulation
Pick-and-hang of a training mug (9/10 success)
Three-mug hanging
Pick-and-hang of an unseen mug (7/10 success)
Handover
V-II. Demo: Zero-shot Imitation
V-III. Demo: Real Robot Transfer
Questions?
Contact jung-su.ha@tu-berlin.de for more information about the project