Learning 6-DOF Grasping Interaction with Deep Geometry-aware 3D Representations
Xinchen Yan*, Jasmine Hsu¹, Mohi Khansari², Yunfei Bai²,
Arkanath Pathak¹, Abhinav Gupta¹, James Davidson¹, Honglak Lee¹
Affiliations: ¹Google, ²X, *University of Michigan (during internship with Google Brain)


Abstract
This paper focuses on the problem of learning 6-DOF grasping with a parallel jaw gripper in simulation. We propose the notion of a geometry-aware representation for grasping, based on the assumption that knowledge of 3D geometry is at the heart of interaction. Our key idea is to constrain and regularize the learning of grasping interaction through 3D geometry prediction. Specifically, we formulate the learning of a deep geometry-aware grasping model in two steps: First, we learn to build a mental geometry-aware representation by reconstructing the scene (i.e., a 3D occupancy grid) from RGBD input via generative 3D shape modeling. Second, we learn to predict the grasping outcome using this internal geometry-aware representation. The learned outcome prediction model is then used to sequentially propose grasping solutions via analysis-by-synthesis optimization. Our contributions are fourfold: (1) To the best of our knowledge, we present for the first time a method to learn a 6-DOF grasping net from RGBD input; (2) We build a grasping dataset from demonstrations in virtual reality with rich sensory and interaction annotations. This dataset includes 101 everyday objects spread across 7 categories; additionally, we propose a data augmentation strategy for effective learning; (3) We demonstrate that the learned geometry-aware representation leads to about a 10 percent relative performance improvement over a baseline CNN when grasping objects from our dataset; (4) We further demonstrate that the model generalizes to novel viewpoints and object instances.

Resources
[Paper] [ArXiv] [VR Grasping Dataset] [TensorFlow Code]
* Note: we use pybullet for data collection in VR and simulation. 

Network Architecture
As shown in the figure, we introduce a deep grasping network composed of a shape generation network and an outcome prediction network.

The shape generation network has a 2D convolutional shape encoder and a 3D de-convolutional shape decoder, followed by a global projection layer. The shape encoder takes RGBD images of resolution 128 × 128 and the corresponding 4-by-4 camera view matrices as input, and outputs identity units as an intermediate representation. The shape decoder is a 3D de-convolutional neural network that outputs voxels at a resolution of 32 × 32 × 32. We implement a projection layer that, given the camera view and projection matrices, transforms the voxels back into foreground object silhouettes and depth maps at the input resolution (128 × 128). The purpose of this generative pre-training is to learn viewpoint-invariant units (e.g., object identity units) through object segmentation and depth prediction.

The outcome prediction network has a 2D convolutional state encoder and a fully connected outcome predictor with an additional local shape projection layer. The state encoder takes RGBD input (the pre-grasp scene) of resolution 128 × 128 and the corresponding action (position and orientation of the gripper end-effector), and outputs state units as an intermediate representation. The outcome predictor takes both the current state (i.e., the pre-grasp scene and gripper action) and geometry features (i.e., viewpoint-invariant global and local geometry from the local projection layer) into consideration. Note that the local dense sampling transforms the surface area around the gripper fingers into a foreground silhouette and a depth map at a resolution of 48 × 48.
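For concreteness, the snippet below sketches the shape generation network in tf.keras. Only the input and output resolutions (128 × 128 RGBD images, a 32 × 32 × 32 voxel grid) and the flattened 4 × 4 camera view matrix come from the description above; the layer counts, filter sizes, and identity-unit dimensionality are illustrative assumptions, and the differentiable projection layer is omitted. This is a minimal sketch, not the released implementation.

```python
# Minimal sketch of the shape generation network (encoder -> identity units -> decoder).
# Layer counts and filter sizes are assumptions; the projection layer is omitted.
import tensorflow as tf
from tensorflow.keras import layers


def build_shape_encoder(identity_dim=128):
    """2D conv encoder: 128x128 RGBD image + flattened 4x4 view matrix -> identity units."""
    rgbd = tf.keras.Input(shape=(128, 128, 4), name="rgbd")
    view = tf.keras.Input(shape=(16,), name="camera_view")  # flattened 4x4 view matrix
    x = rgbd
    for filters in (32, 64, 128, 256):  # downsample 128 -> 8
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, view])
    identity = layers.Dense(identity_dim, activation="relu", name="identity_units")(x)
    return tf.keras.Model([rgbd, view], identity, name="shape_encoder")


def build_shape_decoder(identity_dim=128):
    """3D de-conv decoder: identity units -> 32x32x32 occupancy grid."""
    identity = tf.keras.Input(shape=(identity_dim,))
    x = layers.Dense(4 * 4 * 4 * 128, activation="relu")(identity)
    x = layers.Reshape((4, 4, 4, 128))(x)
    for filters in (64, 32):  # upsample 4 -> 16
        x = layers.Conv3DTranspose(filters, 4, strides=2, padding="same", activation="relu")(x)
    voxels = layers.Conv3DTranspose(1, 4, strides=2, padding="same",
                                    activation="sigmoid", name="occupancy")(x)  # 32^3 output
    return tf.keras.Model(identity, voxels, name="shape_decoder")


if __name__ == "__main__":
    encoder, decoder = build_shape_encoder(), build_shape_decoder()
    rgbd = tf.zeros((1, 128, 128, 4))
    view = tf.zeros((1, 16))
    occupancy = decoder(encoder([rgbd, view]))
    print(occupancy.shape)  # (1, 32, 32, 32, 1)
```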

Analysis-by-synthesis Grasping Planning
Once the deep geometry-aware grasping network is trained, we run analysis-by-synthesis optimization using predictions from the network. As shown in the figure, each row shows three representative steps of the grasping optimization (in sequential order from left to right). A red box indicates a failed grasp, while a green box indicates a successful grasp.
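One way to realize such an analysis-by-synthesis loop is sampling-based refinement of the grasp pose under the learned outcome predictor. The sketch below uses a cross-entropy-method-style loop; `predict_outcome(rgbd, view, poses)` is a hypothetical stand-in for the trained outcome prediction network, and the pose parameterization and hyperparameters are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of analysis-by-synthesis grasp planning: sample candidate 6-DOF
# grasps, score them with the learned outcome predictor, and refit the proposal
# distribution around the best candidates (CEM-style).
import numpy as np


def plan_grasp(predict_outcome, rgbd, view, init_pose,
               n_iters=5, n_samples=64, n_elite=8, init_std=0.05):
    """Return the grasp pose (x, y, z, roll, pitch, yaw) with the highest
    predicted success probability found by iterative sampling."""
    mean = np.asarray(init_pose, dtype=np.float64)   # 6-vector
    std = np.full(6, init_std)
    best_pose, best_score = mean, -np.inf
    for _ in range(n_iters):
        poses = np.random.normal(mean, std, size=(n_samples, 6))
        # Hypothetical predictor: returns one success probability per candidate.
        scores = np.asarray(predict_outcome(rgbd, view, poses))
        elite = poses[np.argsort(scores)[-n_elite:]]
        if scores.max() > best_score:
            best_score, best_pose = scores.max(), poses[np.argmax(scores)]
        # Refit the sampling distribution around the elite candidates.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return best_pose, best_score
```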

Video Demos
* Data Collection in VR

YouTube Video


* Data Augmentation from Human Demonstrations

YouTube Video


* Visualizations on Shape Reconstruction and Grasping Evaluation

YouTube Video