Learning Continuous 3D Reconstructions for Geometrically Aware Grasping

Mark Van der Merwe, Qingkai Lu, Balakumar Sundaralingam, Martin Matak, and Tucker Hermans

University of Utah

Abstract

Deep learning has enabled remarkable improvements in grasp synthesis for previously unseen objects from partial object views. However, existing approaches lack the ability to explicitly reason about the full 3D geometry of the object when selecting a grasp, relying on indirect geometric reasoning derived when learning grasp success networks. This abandons explicit geometric reasoning, such as avoiding undesired robot-object collisions. We propose to utilize a novel, learned 3D reconstruction to enable geometric awareness in a grasping system. We leverage the structure of the reconstruction network to learn a grasp success classifier which serves as the objective function for a continuous grasp optimization. We additionally explicitly constrain the optimization to avoid undesired contact, directly using the reconstruction. We examine the role of geometry in grasping both in the training of grasp metrics and through 96 robot grasping trials.

[PDF]

Source Code:

PointSDF: [code and data]

Supplementary Information

Additional Grasping Results

Our main grasping results and analysis are covered in the paper. Here, we provide additional information on the failure cases of both our approach ("Reconstruction-Grasping") and the baseline ("Partial-View-Grasping"). A total of 96 grasps were attempted, 48 per approach. Each approach was tested in two camera settings, "High" and "Low", where the camera was moved vertically to test under heavy occlusion. We separate out the failure cases for each approach in each scenario to better understand performance.

As mentioned in our paper, we observe that our approach failed to find a sufficiently good grasp (as defined in the paper) 11 times, compared to only 4 such failures for the partial-view approach. This is perhaps unsurprising, as the reconstruction induces significant constraints on the solution, while the partial view permits solutions in contact with the object (see Fig. 8 in the paper). This indicates that while our reconstruction-based grasping formulation poses the desired optimization, the resulting problem is significantly more difficult to solve. Improving the optimization, perhaps by combining local updates with a larger number of samples, could increase the efficacy of our proposed approach.

It is difficult to disentangle the effect of the motion planner from that of the grasp planner; the partial-view approach saw 2 more failures caused by hitting the object than the reconstruction approach, though it is not clear that this is due to the lack of geometric information. In fact, as mentioned in the paper, only one grasp failed due to clearly being planned in contact with the object. That said, grasps planned in contact with the object will clearly not behave as expected; further analysis is required to better understand when contact should and should not be permitted. This is best demonstrated by the following qualitative examples (re-created from the paper), which show how the reconstruction avoids planning errors caused by only viewing partial object information.

Qualitative grasps planned by the "Partial-View-Grasping" approach on the left, and our "Reconstruction-Grasping" approach on the right. Our approach avoids planning errors that place the final grasp in collision with the occluded region of the target object.

Data Collection and Training Details

Here we record additional information on the data processing and training of our PointSDF and grasp success networks. Please read the paper for relevant details.

PointSDF:

Our reconstruction algorithm, PointSDF, is trained using synthetically rendered point clouds. Specifically, we use 590 meshes from the Grasp Database and 76 meshes from YCB. Each mesh is placed 1.5 m from a simulated camera and rotated randomly to 200 different orientations. We render the mesh using pyrender to get a point cloud at each orientation, randomly perturbing each depth measurement to obtain a noisy point cloud. We use trimesh to sample ground truth Signed Distance Function (SDF) values from the mesh, yielding (point cloud, SDF) pairs. This gives us a dataset of 133,200 SDF reconstruction examples. We split the Grasp Database examples into 84,960 training examples, 21,240 validation examples, and 11,800 test examples. All the YCB examples are added to the test set.

Example (point cloud, SDF) sample pairs used to train PointSDF.
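The following is a minimal sketch of this data-generation step, assuming standard pyrender and trimesh APIs. The camera intrinsics, noise scale, sample counts, and SDF sign convention are illustrative assumptions rather than the exact settings used to build the PointSDF dataset.

```python
# Hedged sketch of the rendering / SDF-sampling pipeline described above.
# Camera intrinsics, noise level, and sample counts are illustrative only.
import numpy as np
import trimesh
import pyrender

def render_noisy_cloud(mesh, distance=1.5, noise_std=0.002, size=(640, 480)):
    """Render one randomly oriented view of a mesh and return a noisy point cloud."""
    mesh = mesh.copy()
    mesh.apply_transform(trimesh.transformations.random_rotation_matrix())

    scene = pyrender.Scene()
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    camera = pyrender.PerspectiveCamera(yfov=np.pi / 3.0)
    cam_pose = np.eye(4)
    cam_pose[2, 3] = distance  # camera 1.5 m from the object along +z
    scene.add(camera, pose=cam_pose)

    renderer = pyrender.OffscreenRenderer(*size)
    _, depth = renderer.render(scene)
    renderer.delete()

    # Back-project valid depth pixels (pinhole model, sign conventions simplified)
    # and perturb each depth measurement with Gaussian noise.
    v, u = np.nonzero(depth)
    z = depth[v, u] + np.random.normal(0.0, noise_std, size=v.shape)
    f = 0.5 * size[1] / np.tan(np.pi / 6.0)  # focal length in pixels from yfov
    cloud = np.stack([(u - 0.5 * size[0]) * z / f,
                      (v - 0.5 * size[1]) * z / f,
                      z], axis=1)
    return mesh, cloud

def sample_sdf(mesh, num_points=2048, noise=0.05):
    """Sample ground-truth signed distances at points near the mesh surface."""
    points = mesh.sample(num_points) + np.random.normal(0.0, noise, (num_points, 3))
    # trimesh reports positive distance inside the mesh; negate if the network
    # expects the opposite (negative-inside) convention.
    sdf = -trimesh.proximity.signed_distance(mesh, points)
    return points, sdf

if __name__ == "__main__":
    obj = trimesh.creation.box(extents=[0.1, 0.1, 0.2])  # stand-in for a database mesh
    posed_mesh, cloud = render_noisy_cloud(obj)
    query_points, sdf_values = sample_sdf(posed_mesh)
```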

We implement and train PointSDF in TensorFlow, using the Adam optimizer with an initial learning rate of 1e-3. We train until the validation error has not improved for 10 epochs, for a total of 72 epochs. Training took about 34 hours on a single GeForce GTX 1060.
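As a rough illustration of this training configuration (Adam at 1e-3, early stopping after 10 epochs without validation improvement), here is a minimal Keras-style sketch. The tiny stand-in model and random data are placeholders, not the actual PointSDF architecture or dataset.

```python
import numpy as np
import tensorflow as tf

# Stand-in model: a small MLP mapping a point-cloud embedding plus a 3D query
# point (259-d input, assumed) to a scalar SDF value. The real PointSDF
# architecture differs; only the optimizer/early-stopping setup matches the text.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(259,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')

# Stop once the validation error has not improved for 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True)

# Random tensors in place of the rendered point-cloud / SDF training pairs.
x_train, y_train = np.random.rand(1024, 259), np.random.rand(1024, 1)
x_val, y_val = np.random.rand(256, 259), np.random.rand(256, 1)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=64, epochs=1000, callbacks=[early_stop])
```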

Grasp Success Prediction:

We collect simulated grasp data using the Allegro hand mounted on a Kuka LBR4 arm inside the Gazebo simulator with the DART physics engine (https://dartsim.github.io/). We generate point clouds using pyrender.

We use the same objects (from the BigBIRD dataset) and the same grasp data collection system used in Lu et al. (2017) to collect both multi-fingered side and overhead grasps for our training dataset. We generate a preshape by randomly sampling joint angles for the first two joints of all fingers within a reasonable range, fixing the last two joints of each finger to zero. There are 14 parameters for the Allegro hand preshape: 6 for the palm pose and 8 for the first two joint angles of each finger proximal to the palm. Given a desired pose and preshape, we use the RRT-Connect motion planner in MoveIt! to plan a path for the arm. We execute all feasible plans, moving the robot to the sampled preshape.

After moving the hand to the desired preshape, a grasp controller is applied to close the hand. The grasp controller closes the fingers at a constant velocity, stopping each finger independently when contact is detected, i.e., when the measured joint velocities are close to zero. The grasp controller closes the second and third joints of the non-thumb fingers and the two distal joints of the thumb. Note that the proximal joint of each non-thumb finger rotates the finger about its major axis, changing its direction of closing; as such, we maintain the angles provided by the grasp planner for these joints.
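For concreteness, here is a small sketch of the 14-parameter preshape sampling just described: 6 palm-pose parameters plus the first two joint angles of each of the four fingers, with the distal joints fixed at zero. The sampling ranges and the (x, y, z, roll, pitch, yaw) palm-pose parametrization are assumptions for illustration, not the exact values used in data collection.

```python
import numpy as np

NUM_FINGERS = 4          # Allegro hand
JOINTS_PER_FINGER = 4

def sample_preshape(pose_low, pose_high, joint_low=-0.2, joint_high=0.6):
    """Sample one 14-parameter preshape: 6-DOF palm pose + 8 proximal joint angles."""
    # Palm pose sampled inside a task-specific box (parametrization assumed).
    palm_pose = np.random.uniform(pose_low, pose_high)

    # Full 16-DOF joint target: only the first two joints per finger are sampled;
    # the last two joints of each finger stay at zero for the preshape.
    joints = np.zeros(NUM_FINGERS * JOINTS_PER_FINGER)
    for finger in range(NUM_FINGERS):
        base = finger * JOINTS_PER_FINGER
        joints[base:base + 2] = np.random.uniform(joint_low, joint_high, size=2)

    sampled_angles = joints.reshape(NUM_FINGERS, JOINTS_PER_FINGER)[:, :2].ravel()
    preshape = np.concatenate([palm_pose, sampled_angles])  # 6 + 8 = 14 parameters
    return preshape, joints

# Example usage with an arbitrary palm-pose box around the object.
pose_low = np.array([-0.1, -0.1, 0.0, -np.pi, -np.pi, -np.pi])
pose_high = np.array([0.1, 0.1, 0.3, np.pi, np.pi, np.pi])
preshape, full_joint_target = sample_preshape(pose_low, pose_high)
assert preshape.shape == (14,)
```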

If the robot grasps and lifts the object to a height of 15 cm without the object falling, the simulator automatically labels the grasp as successful. In total, we train with 7,290 grasp examples and test on a held-out set of 1,821 grasps.
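A minimal sketch of this automatic labeling rule follows, assuming the simulator can report the object's height before and after the 15 cm lift; the tolerance value is an illustrative assumption.

```python
LIFT_HEIGHT = 0.15  # meters, matching the 15 cm lift described above

def label_grasp(height_before_lift, height_after_lift, tolerance=0.03):
    """Label a grasp successful (1) if the object rose with the hand, else failed (0)."""
    # If the object fell, its height barely changes; if it stayed in the hand,
    # it rose by roughly the lift height.
    return int(height_after_lift - height_before_lift >= LIFT_HEIGHT - tolerance)
```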

We implement and train the variations of our network (see the paper for details) in TensorFlow, using the Adam optimizer with an initial learning rate of 1e-3. We train each approach for 100 epochs, which took 4.5 hours for "PointSDF-Scratch" and 3 hours for "PointSDF-Fixed" on a single GeForce GTX 1060.

Errata

  • In Fig. 6, the qualitative grasp examples contain grasps from both approaches ("Reconstruction-Grasping" and "Partial-View-Grasping"). Generally speaking, the grasps from both methods looked qualitatively similar. For examples of grasps performed specifically by "Reconstruction-Grasping", please see our accompanying video above.