Our implementation builds on the following main prior works:
Neural Descriptor Fields (NDFs)
Volumetric Grasping Network (VGN)
The following sections provide background on both of these prior works and describe the modifications we made to adopt them into our pipeline.
Neural Descriptor Fields are attractive because they guarantee consistent performance across arbitrary initial and target poses, which is a key element of our goal of functional generalization. NDFs achieve translational equivariance by subtracting the center of mass of the point cloud from both the input point cloud and the input coordinate, and they attain rotational equivariance by using the architecture proposed in prior work on Vector Neurons.

The authors define the Neural Descriptor Field as a non-linear function f(x|P), where P is the point cloud of the object and x is a 3D coordinate in space. This neural network encodes spatial relationships between x and geometric features in P. The network is trained with a 3D reconstruction objective, so that it learns correspondences between a 3D point in space and the geometric features of the object; because it learns to reconstruct, it can indirectly infer the spatial distance of x from other geometric features of the object. The training objective therefore learns descriptors that encode point-wise correspondences across a category of shapes from the 3D coordinate x and the point cloud of the object, and conditioning on the point cloud allows the network to learn category-specific features. These are called point descriptors.

The category-level 3D reconstruction objective trains Φ(x, E(P)) to be a hierarchical, coarse-to-fine feature extractor. In essence, every ReLU output defines a decision boundary that can be viewed geometrically as a hyperplane: it expresses how far the point is from the geometric feature encoded by that layer (the rim, handle, or base of a cup, for example). The deeper the layer, the finer the detail being encoded. These activations are concatenated, and the final output of this 3D reconstruction occupancy network indicates whether a given point in space is occupied by the object or not. Along the way, the network also learns spatial relationships between a given point and the parts of the object. These descriptors are SE(3)-equivariant, meaning the feature descriptors remain the same regardless of what SE(3) rigid-body transformation is applied. This formulation of finding the spatial relationship between a point and local geometric features is what is referred to as a point descriptor field.

However, we are not only interested in correspondences between a single point and some geometric features, but between a set of points around an area of interest. Extending point descriptor fields to many points forms the basis of pose descriptor fields. When a set of points is sampled from the same region on an object, they induce similar energy fields, which is key to measuring how similar or different two instances of the same object category are. As the pose of the object changes, the induced energy field changes with it, as shown in the image below. We use this property to solve for the pose of the end-effector in the source frame given a feasible grasp in the target frame.
Energy field visualized around an object to demonstrate the SE(3) equivariance property of NDFs
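To make the descriptor and pose-optimization idea concrete, the following is a minimal sketch, not the released NDF implementation. The `field` callable stands in for the trained network f(x|P) (the real model uses a Vector Neurons encoder and an occupancy decoder), and the function names, query-point sampling, L1 energy, and optimizer settings are illustrative assumptions. The sketch records descriptors of a rigid set of query points at a demonstrated grasp and then minimizes the descriptor mismatch over candidate SE(3) transforms.

```python
import torch


def axis_angle_to_matrix(rotvec):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = rotvec.norm() + 1e-8
    k = rotvec / theta
    zero = torch.zeros((), dtype=rotvec.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)


def pose_energy(field, query_pts, transform, target_desc, point_cloud):
    """L1 energy between descriptors of the transformed query points and the
    descriptors recorded at the demonstration grasp.

    field       : stand-in for the trained NDF f(x | P); assumed signature
                  (M, 3) points x (N, 3) point cloud -> (M, D) descriptors.
    query_pts   : (M, 3) rigid set of points sampled around the gripper.
    transform   : (4, 4) candidate SE(3) pose of the end-effector.
    target_desc : (M, D) descriptors of the query points at the demo grasp.
    point_cloud : (N, 3) mean-centred point cloud of the new object instance
                  (mean-centring provides the translational equivariance above).
    """
    homog = torch.cat([query_pts, torch.ones(len(query_pts), 1)], dim=1)  # (M, 4)
    moved = (transform @ homog.T).T[:, :3]                                # T * x
    return torch.sum(torch.abs(field(moved, point_cloud) - target_desc))


def solve_grasp_pose(field, query_pts, target_desc, point_cloud, steps=500):
    """Gradient-descent sketch of the pose optimization; the released NDF code
    additionally uses random restarts and keeps the lowest-energy solution."""
    xyz = torch.zeros(3, requires_grad=True)                # translation
    rotvec = (1e-3 * torch.randn(3)).requires_grad_()       # small init avoids norm singularity at zero
    opt = torch.optim.Adam([xyz, rotvec], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        T = torch.eye(4)
        T[:3, :3] = axis_angle_to_matrix(rotvec)
        T[:3, 3] = xyz
        loss = pose_energy(field, query_pts, T, target_desc, point_cloud)
        loss.backward()
        opt.step()
    with torch.no_grad():
        T = torch.eye(4)
        T[:3, :3] = axis_angle_to_matrix(rotvec)
        T[:3, 3] = xyz
    return T
```

In our setting, `target_desc` would be computed from the feasible grasp in the target frame, and the recovered transform plays the role of the end-effector pose in the source frame.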
The goal of any grasping framework is to determine a feasible grasp that allows an object to be manipulated. The Volumetric Grasping Network predicts 6-DOF grasps from 3D scene information. VGN accepts a Truncated Signed Distance Function (TSDF) representation of the scene and directly outputs the predicted grasp quality and the associated gripper orientation and opening width for each voxel in the queried 3D volume. The method uses a Fully Convolutional Network (FCN) to map the input TSDF to a volume of the same spatial resolution, where each cell contains the predicted quality, orientation, and width of a grasp executed at the center of that voxel (a minimal sketch of this input/output structure is included at the end of this section). The network is trained on a synthetic dataset of cluttered grasp trials generated in physics simulation. Because the network sees 3D information about the full scene, it can account for collisions between the gripper and its environment. To adopt VGN into our framework, we carried out the following steps:
We cloned the VGN repository and set up a conda environment with the necessary dependencies.
We generated raw synthetic grasping trials using the PyBullet simulator (a rough sketch of this scene setup is shown after this list).
We chose the packed scene type, since we expect to pick objects standing upright rather than objects cluttered in a pile.
We generated 5000 synthetic grasp trials on blocks spawned in PyBullet.
From this data, we generated the grasp targets, i.e. the voxel grids required to train VGN.
Finally, we trained the network and simulated the clutter removal experiment with mugs.
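As a rough illustration of the scene-generation step above, the sketch below spawns packed scenes of small blocks in PyBullet and lets them settle. The block URDF, scene bounds, and the number of scenes are placeholders; the actual VGN data-generation scripts additionally simulate the gripper, fuse depth images into a TSDF, and store the resulting grasp labels.

```python
import random
import pybullet as p
import pybullet_data


def spawn_packed_scene(num_blocks=4):
    """Place a few upright blocks on a plane and let them settle under gravity."""
    p.resetSimulation()
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    p.loadURDF("plane.urdf")
    blocks = []
    for _ in range(num_blocks):
        pos = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1), 0.05]
        blocks.append(p.loadURDF("cube_small.urdf", basePosition=pos))
    for _ in range(240):                 # ~1 s of settling at 240 Hz
        p.stepSimulation()
    return blocks


p.connect(p.DIRECT)                      # headless simulation
scenes = []
for _ in range(5):                       # 5000 trials in the actual data-generation run
    blocks = spawn_packed_scene()
    poses = [p.getBasePositionAndOrientation(b) for b in blocks]
    scenes.append(poses)                 # grasp sampling and labelling happen downstream
p.disconnect()
```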
Working of the Volumetric Grasping Network (adapted from the paper)
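To complement the figure above, the following is a minimal sketch of the input/output structure described earlier, not the reference VGN architecture. The `GraspFCN` name, layer widths, and the assumed 40 x 40 x 40 TSDF grid are illustrative; the point is that a fully convolutional 3D network maps the TSDF to per-voxel grasp quality, orientation (as a unit quaternion), and opening width volumes of the same spatial resolution.

```python
import torch
import torch.nn as nn


class GraspFCN(nn.Module):
    """Illustrative VGN-style FCN: 1 x 40 x 40 x 40 TSDF in, three volumes of the
    same spatial resolution out (quality: 1 ch, quaternion: 4 ch, width: 1 ch)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),   # 40 -> 20
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 20 -> 10
            nn.Conv3d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),  # 10 -> 20
            nn.Conv3d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),  # 20 -> 40
            nn.Conv3d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.quality_head = nn.Conv3d(16, 1, kernel_size=1)   # grasp success probability per voxel
        self.rotation_head = nn.Conv3d(16, 4, kernel_size=1)  # gripper orientation (quaternion) per voxel
        self.width_head = nn.Conv3d(16, 1, kernel_size=1)     # gripper opening width per voxel

    def forward(self, tsdf):
        feat = self.decoder(self.encoder(tsdf))
        quality = torch.sigmoid(self.quality_head(feat))
        rotation = nn.functional.normalize(self.rotation_head(feat), dim=1)  # unit quaternions
        width = self.width_head(feat)
        return quality, rotation, width


# Example query on a dummy TSDF volume.
tsdf = torch.zeros(1, 1, 40, 40, 40)
quality, rotation, width = GraspFCN()(tsdf)
best_voxel = torch.argmax(quality)  # most promising grasp location (flattened index)
```

At inference time, the highest-quality voxel gives the grasp position, and the corresponding rotation and width channels give its orientation and opening width; the released implementation applies additional filtering of the quality volume before selecting grasps.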