Deep Visual Constraints:
Neural Implicit Models for Manipulation Planning from Visual Input

Jung-Su Ha, Danny Driess, Marc Toussaint

Learning & Intelligent Systems Lab, TU Berlin

[Paper] [Code] [Bibtex]

Abstract: Manipulation planning is the problem of finding a sequence of robot configurations that involves interactions with objects in the scene, e.g., grasping and placing an object, or more general tool-use. To achieve such interactions, traditional approaches require hand-engineering of object representations and interaction constraints, which easily becomes tedious when complex objects/interactions are considered. Inspired by recent advances in 3D modeling, e.g., NeRF, we propose a method to represent objects as neural implicit functions upon which constraint features are defined and jointly trained. In particular, the proposed pixel-aligned representation is directly inferred from images with known camera geometry and naturally acts as a perception component in the whole manipulation pipeline, thereby enabling long-horizon planning only from visual input.

I. Overview

Unlike the static environment and the robot's own body, the objects to be manipulated often come without 3D models. Deep Visual Constraints (DVCs) represent an object as a neural implicit function inferred directly from color images, on top of which task constraint functions are defined. The implicit representation naturally handles rigid transformations of the object in SE(3), enabling efficient optimization-based manipulation planning.
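
As a rough illustration of the SE(3) point (the notation below is assumed for this page, not taken from the paper): if f_o is the implicit function of an object learned in its object-centric frame, a rigid transform T of the object is handled simply by mapping query points back into that frame,

    f_o^{T}(x) = f_o\!\left(T^{-1} x\right), \qquad T \in SE(3),

so constraints built on f_o remain differentiable with respect to the object pose T.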

II. Network Architecture

1. Pixel-aligned Implicit Object Representation (PIFO)

  • A U-Net encodes each input image into a pixel-wise feature image

  • A 3D query point p is projected into pixel coordinates using the known camera geometry (extrinsic T, intrinsic K)

  • The representation vector y is computed by sampling the local image features at the projected point (see the sketch below)
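
A minimal PyTorch sketch of this pixel-aligned feature lookup; the tensor shapes, the helper name, and the use of bilinear grid sampling are assumptions for illustration, not the released code:

    import torch
    import torch.nn.functional as F

    def pixel_aligned_features(feat, p, T, K, img_size=128):
        """feat: U-Net feature images (B, C, H, W); p: 3D points (B, N, 3);
        T: world-to-camera extrinsics (B, 4, 4); K: intrinsics (B, 3, 3)."""
        # Homogeneous world points -> camera frame.
        p_h = torch.cat([p, torch.ones_like(p[..., :1])], dim=-1)              # (B, N, 4)
        p_cam = (T @ p_h.transpose(1, 2))[:, :3]                               # (B, 3, N)
        # Perspective projection to pixel coordinates (points assumed in front of the camera).
        uvw = K @ p_cam                                                        # (B, 3, N)
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                           # (B, 2, N)
        # Normalize to [-1, 1] and bilinearly sample the local feature at each projection.
        grid = (2.0 * uv / (img_size - 1) - 1.0).transpose(1, 2).unsqueeze(2)  # (B, N, 1, 2)
        y = F.grid_sample(feat, grid, align_corners=True)                      # (B, C, N, 1)
        return y.squeeze(-1).transpose(1, 2)                                   # representation vectors (B, N, C)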

2. Deep Visual Constraints (DVCs)

  • Representation vectors are collected from the shared backbone (PIFO) at keypoints attached to a robot frame, e.g., the gripper

  • The constraint value (interaction feasibility) is predicted from the collected representation vectors (see the sketch below)
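
A hedged sketch of the constraint head; the number of keypoints, the hidden sizes, and the simple concatenation of keypoint features are illustrative assumptions:

    import torch.nn as nn

    class DVCHead(nn.Module):
        """Predict a scalar constraint value (interaction feasibility) from the
        pixel-aligned features gathered at the gripper keypoints."""

        def __init__(self, feat_dim, n_keypoints, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim * n_keypoints, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, keypoint_feats):               # (B, n_keypoints, feat_dim)
            return self.mlp(keypoint_feats.flatten(1))   # (B, 1)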

III. Training

1. Data Generation & Augmentation

1.1 Task Data

131 mug meshes are taken from ShapeNet and convexified. For each mug,

  • 11,000 query points are sampled and their signed distances to the mesh are computed (see the sketch below)

  • 1,000 feasible grasping/hanging poses are generated with the Bullet physics simulator
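
A hedged sketch of the per-mug point sampling; the near-surface/uniform split, the noise level, the bounding-box padding, and the use of trimesh are assumptions (the page only states that 11,000 points with signed distances are computed):

    import numpy as np
    import trimesh

    def sample_sdf_points(mesh, n_surface=8000, n_uniform=3000, noise=0.01):
        # Points near the mesh surface, perturbed with Gaussian noise.
        near = mesh.sample(n_surface) + noise * np.random.randn(n_surface, 3)
        # Points sampled uniformly in a slightly padded bounding box.
        lo, hi = mesh.bounds
        uniform = np.random.uniform(lo - 0.05, hi + 0.05, size=(n_uniform, 3))
        points = np.vstack([near, uniform])
        # Signed distances to the mesh (trimesh convention: positive inside).
        sdf = trimesh.proximity.signed_distance(mesh, points)
        return points, sdf   # 8,000 + 3,000 = 11,000 labeled points per mug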

1.2 Posed Image Data

  • The image data consists of 100 images (128×128) with the corresponding camera extrinsic/intrinsic matrices; all cameras look at the center of the object

  • The camera azimuth and elevation are uniformly sampled (see the sketch below)

  • The camera distance and lighting are randomized
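
A hedged sketch of the camera sampling; the distance range, the elevation limits, and the axis conventions are illustrative assumptions:

    import numpy as np

    def sample_look_at_camera(center, d_min=0.4, d_max=0.8):
        az = np.random.uniform(0.0, 2.0 * np.pi)     # azimuth, uniform
        el = np.random.uniform(-1.4, 1.4)            # elevation (rad), kept off the poles
        d = np.random.uniform(d_min, d_max)          # randomized camera distance
        # Camera position on a sphere around the object center.
        eye = center + d * np.array([np.cos(el) * np.cos(az),
                                     np.cos(el) * np.sin(az),
                                     np.sin(el)])
        # Look-at rotation: the camera z-axis points toward the object center.
        z = center - eye; z = z / np.linalg.norm(z)
        x = np.cross([0.0, 0.0, 1.0], z); x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        T = np.eye(4)
        T[:3, :3] = np.stack([x, y, z], axis=1)      # camera-to-world rotation
        T[:3, 3] = eye                               # camera position
        return T   # camera-to-world pose; invert for the world-to-camera extrinsic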

1.3 Data Augmentation

  • Random image rotations and shifts are applied, with the camera intrinsics/extrinsics modified accordingly (see the sketch below)

  • Random cutouts are applied to make the model robust to occlusion
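
A hedged sketch for the shift case; an in-plane rotation would analogously be absorbed into a roll of the camera extrinsic. The function and the zero-padding behaviour are illustrative assumptions:

    import numpy as np

    def shift_image_and_intrinsics(img, K, dx, dy):
        """Shift the image content by (dx, dy) pixels and move the principal point with it."""
        H, W = img.shape[:2]
        shifted = np.zeros_like(img)
        # Paste the image at an offset; pixels shifted out of view are dropped.
        xs, xd = (0, dx) if dx >= 0 else (-dx, 0)
        ys, yd = (0, dy) if dy >= 0 else (-dy, 0)
        h, w = H - abs(dy), W - abs(dx)
        shifted[yd:yd + h, xd:xd + w] = img[ys:ys + h, xs:xs + w]
        # The projection stays consistent if the principal point shifts by the same amount.
        K_new = K.copy()
        K_new[0, 2] += dx   # c_x
        K_new[1, 2] += dy   # c_y
        return shifted, K_new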

2. Training Loop and Loss Function
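
A hedged sketch of one joint training step consistent with the data described above: an SDF regression term supervises the implicit shape and a feasibility classification term supervises the constraint head. The loss weights, the L1/BCE choices, and the pifo.encode / pifo.sdf / pixel_aligned_features / dvc_head interfaces are assumptions, not the authors' recipe:

    import torch.nn.functional as F

    def training_step(pifo, dvc_head, batch, optimizer, w_sdf=1.0, w_task=1.0):
        # batch["imgs"], batch["T"], batch["K"]: posed images of one object
        # batch["points"], batch["sdf"]: query points and ground-truth signed distances
        # batch["kp_points"], batch["feasible"]: gripper keypoints of candidate grasp/hang
        #   poses and their float feasibility labels in {0, 1}
        feat = pifo.encode(batch["imgs"])                                  # pixel-wise features
        sdf_pred = pifo.sdf(feat, batch["points"], batch["T"], batch["K"])
        loss_sdf = F.l1_loss(sdf_pred, batch["sdf"])

        kp_feat = pixel_aligned_features(feat, batch["kp_points"], batch["T"], batch["K"])
        logits = dvc_head(kp_feat)
        loss_task = F.binary_cross_entropy_with_logits(logits, batch["feasible"])

        loss = w_sdf * loss_sdf + w_task * loss_task
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()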

IV. Planning with DVCs

Two preparations are needed to plug DVCs into manipulation planning:

  1. transforming the full scene images into object-centric ones

  2. rewriting DVCs as functions of the robot joint state and the object's rigid transformation

1. Multi-view Processing

Multi-view processing is a two-step procedure:

  1. Find a bounding ball of the object from the object masks

  2. Warp the raw images with a homography and compute the corresponding camera intrinsics/extrinsics (see the sketch below)
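
A hedged sketch in which the homography is a plain crop-and-resize around the projected bounding ball, one simple instance of the warp described above (the paper's exact warp may differ; u0, v0 is the crop's top-left corner and s its side length in raw pixels):

    import numpy as np
    import cv2

    def warp_to_object_crop(img, K, u0, v0, s, out=128):
        # Homography mapping raw pixels to crop pixels: translate the crop corner
        # (u0, v0) to the origin, then scale the crop to the output resolution.
        H = np.array([[out / s, 0.0, -u0 * out / s],
                      [0.0, out / s, -v0 * out / s],
                      [0.0, 0.0, 1.0]])
        warped = cv2.warpPerspective(img, H, (out, out))
        # Applying the same homography to the intrinsics keeps the projection
        # consistent; the extrinsic is unchanged by a pure crop-and-resize.
        K_new = H @ K
        return warped, K_new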

2. Manipulation Planning with DVCs

  • Express the DVCs as functions of the robot joint state and the object's rigid transformation

  • Solve a trajectory optimization over the path of joint states and object transformations, where the learned DVCs are included in the constraint set (see the sketch below)
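
A hedged sketch of how a learned DVC enters the optimization as a differentiable function of the joint state q and the object pose T_obj; forward_kinematics_keypoints is a placeholder, and pixel_aligned_features / dvc_head refer to the sketches in Sec. II:

    import torch

    def dvc_constraint(q, T_obj, feat, T_cam, K, dvc_head):
        # Gripper keypoints in the world frame via differentiable forward kinematics
        # (forward_kinematics_keypoints is assumed, not a provided function).
        p_world = forward_kinematics_keypoints(q)                          # (N, 3)
        # Map the keypoints into the object-centric frame of the captured images.
        p_h = torch.cat([p_world, torch.ones_like(p_world[:, :1])], dim=-1)
        p_obj = (torch.linalg.inv(T_obj) @ p_h.T)[:3].T                    # (N, 3)
        # Query the pixel-aligned features and predict interaction feasibility.
        y = pixel_aligned_features(feat, p_obj[None], T_cam, K)            # (1, N, C)
        return dvc_head(y)                                                 # constraint value

    # The planner then solves, roughly,
    #   min over q_{1:T}, T_obj_{1:T} of the path cost
    #   s.t. dvc_constraint(q_t, T_obj_t, ...) = 0 at the interaction time steps,
    # together with the usual kinematic and collision constraints.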

V-I. Demo: Sequential Manipulation

Pick-and-Hang of a training mug (9/10 success)

Three-mug Hanging

Pick-and-Hang of an unseen mug (7/10 success)

Handover

V-II. Demo: Zero-shot Imitation

V-III. Demo: Real Robot Transfer

Questions?

Contact [jung-su.ha@tu-berlin.de] for more information about the project