The Volumetric Grasping Network (VGN) was trained on synthetically generated data from a packed PyBullet scene populated with differently shaped blocks, using the following hyper-parameters (a sketch of the training loop follows the list):
Learning rate: 0.0003
Epochs: 30
Batch size: 32
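For reference, a minimal sketch of a training loop with these settings is shown below. It assumes a PyTorch-style setup; the `VGN` network class, the `GraspDataset` loader, the data paths, and the combined loss are placeholders standing in for the actual implementation rather than its real API.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical placeholders for the actual network and dataset classes.
from vgn_model import VGN            # 3D CNN over the TSDF grid (assumed)
from vgn_data import GraspDataset    # synthetic packed-scene grasp data (assumed)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_loader = DataLoader(GraspDataset("data/train"), batch_size=32, shuffle=True)
val_loader = DataLoader(GraspDataset("data/val"), batch_size=32)

model = VGN().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for epoch in range(30):
    model.train()
    for tsdf, targets in train_loader:
        tsdf, targets = tsdf.to(device), targets.to(device)
        optimizer.zero_grad()
        predictions = model(tsdf)
        loss = model.loss(predictions, targets)  # grasp quality/orientation/width terms (assumed)
        loss.backward()
        optimizer.step()
    # A validation pass over val_loader would be run here to produce the plots below.
```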
The training and validation loss and accuracy plots are shown below (orange: training, blue: validation):
The following videos demonstrate the working of the volumetric grasping network for some test objects:
Generating grasp trials for training
Running the decluttering experiment with a mug
RViz visualization for a grasp with the TSDF and multiple generated grasps
The TSDF-based approach generalizes well both within an object class and across different object classes. The GIF below shows the decluttering experiment working on an unseen bowl instance.
Decluttering experiment with a bowl
Leveraging the steps mentioned here, we used the generated grasps as a strong prior for the optimization, eliminating the need for demonstrations. The following results show how we sequentially eliminated all pick and place demonstrations.
We started by eliminating all the pick demonstrations while retaining the place demonstrations. The GIFs below show the actions taken by the robot to reach the desired target for different types of mugs, without any pick demonstrations.
Test-cases for different types of mugs without picking demonstrations
The outcomes in the GIFs above show that, despite lacking pick demonstrations, the robot can still leverage the grasp point from the placing demonstration, transform it to a suitable picking point at the source, and execute the sequence of motions successfully most of the time. However, since the demonstrations were recorded for a different set of mugs, eliminating the picking demonstrations also reduced the grasping success: the quality of the sampling points degrades because the mugs used in the demonstrations differ from those seen during testing. This further reinforced our belief that obtaining a better-optimized grasp location at the target using a grasp detector would improve these results.
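To make the sampling step concrete, the sketch below shows one way to generate query points in a ball around a grasp pose and rigidly attach them to the grasp frame. The point count, radius, and the 4x4 pose convention are assumptions for illustration, not the exact values used in the NDF pipeline.

```python
import numpy as np

def sample_query_points(grasp_pose, n_points=500, radius=0.05, seed=0):
    """Sample points in a ball around the grasp and express them in the world frame.

    grasp_pose: 4x4 homogeneous transform of the grasp in the world frame.
    The radius and point count are illustrative values only.
    """
    rng = np.random.default_rng(seed)
    # Uniform sampling inside a ball: random directions scaled by r * u^(1/3).
    directions = rng.normal(size=(n_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.random(n_points) ** (1.0 / 3.0)
    local_points = directions * radii[:, None]

    # Rigidly attach the samples to the grasp frame.
    homogeneous = np.hstack([local_points, np.ones((n_points, 1))])
    return (grasp_pose @ homogeneous.T).T[:, :3]
```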
As a next step, we removed both the picking and the placing demonstrations. This time, however, to determine the picking location, we used a predetermined grasping pose at the target configuration rather than the grasping framework. This grasping pose was transformed to a pose at the source using the relative transformation of the mug between the source and target frames, as described in step 1 of our idea. Finally, we will integrate the grasping network into the framework so that the resulting pose-aware grasps generalize to different objects.
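A minimal sketch of this transformation step is shown below, assuming all poses are expressed as 4x4 homogeneous transforms in the world frame; the function and variable names are illustrative rather than taken from our codebase.

```python
import numpy as np

def transfer_grasp_to_source(grasp_in_target, object_pose_target, object_pose_source):
    """Map a grasp defined at the target configuration onto the object at the source.

    All arguments are 4x4 homogeneous transforms in the world frame. The relative
    transform of the mug between target and source is
        T_rel = object_pose_source @ inv(object_pose_target),
    and applying it to the target grasp gives the corresponding picking pose.
    """
    relative = object_pose_source @ np.linalg.inv(object_pose_target)
    return relative @ grasp_in_target
```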
Results for a cup spawned at 10 random configurations are shown below:
Grasping performance for 10 executions
The graph below compares the grasping success of our approach with the baseline NDF implementation.
Overall grasping success comparison plot
However, the results we obtained with zero demonstrations do not beat the baseline implementation in the NDF paper. We believe this could be due to the following reasons:
The grasping point obtained is not optimal when the spawned mugs differ significantly in geometry; with full integration of VGN, we believe the grasping success can be improved.
Sometimes, chosen grasps are diametrically opposite the handle of the mug. When the sampling of points is restricted around a symmetric region, there is greater ambiguity and error in the transformation obtained.
On similar grounds, when sampling about the grasp point does not yield sufficient inliers within the shape geometry, the grasp is more likely to fail (a small sketch of this inlier check follows the list).
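To illustrate the inlier criterion from the last point, the sketch below computes the fraction of sampled points lying close to the observed object point cloud; the distance threshold and the use of a KD-tree are assumptions made for the example, not details of our actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def inlier_ratio(sampled_points, object_cloud, threshold=0.01):
    """Fraction of sampled points within `threshold` metres of the object surface.

    sampled_points: (N, 3) query points around the grasp.
    object_cloud:   (M, 3) points observed on the object.
    A low ratio suggests the grasp sits over a sparsely observed or symmetric region.
    """
    tree = cKDTree(object_cloud)
    distances, _ = tree.query(sampled_points)
    return float(np.mean(distances < threshold))
```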
Our pipeline is also able to generalize to new objects. To incorporate variability in the experiments, we spawned a set of different bowls at arbitrary initial configurations, included a new static object in the environment, and changed the pose of the target. The results for three trials are shown in the video below.
Grasping performance for 3 executions of a new object and target configuration