As described in the motivation of our work, our objective is to design a pose-aware grasping pipeline. A key ingredient in achieving this is the use of keypoint-aware object descriptors. However, instead of relying on hand-picked keypoints, we need a feature descriptor that is invariant to both translation and rotation. Along these lines, we make use of Neural Descriptor Fields (NDFs), an SE(3)-equivariant object descriptor. At its core, an NDF encodes a continuous and differentiable energy field around an object, and pose matching is cast as minimizing an energy over this field. This energy field inherits the SE(3) equivariance property, which forms the workhorse of our pipeline. To learn more about how Neural Descriptor Fields work, visit the Relation to Prior Work section.
The energy field around an object changes with the point sampling, but remains constant regardless of the object's pose
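Concretely, the energy in this field can be written as a distance in descriptor space. Below is a minimal sketch, not the exact NDF implementation, assuming a hypothetical interface `f(points, point_cloud)` that returns a per-point descriptor tensor:

```python
import torch

def ndf_energy(f, pts_src, pcd_src, pts_tgt, pcd_tgt):
    """L1 distance in descriptor space between two sets of query points,
    each evaluated against its own observation of the object.

    f -- hypothetical NDF interface: f(points, point_cloud) -> (N, D) tensor

    By SE(3) equivariance, the energy is minimal when pts_src bear the
    same spatial relation to pcd_src as pts_tgt bear to pcd_tgt,
    regardless of the object's absolute pose.
    """
    return torch.sum(torch.abs(f(pts_src, pcd_src) - f(pts_tgt, pcd_tgt)))
```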
The step-by-step process in our pipeline is as follows:
STEP 1: Start by spawning the object in its target configuration. Assuming that the object we want to manipulate has to be placed such that it interacts with stationary artifacts in the environment, we start by sampling points on the object of interest and on the stationary artifact. Here, we consider the scenario of a mug having to be hung on a rack, so we sample points from the rack and the mug, as shown by the red points. A sampled set of points encodes a pose and thus has an associated energy field when represented as a Neural Descriptor Field. Typically, the points are sampled uniformly within the volumetric bounding box of the rack and mug.
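A minimal sketch of this sampling, assuming the rack and mug are available as (M, 3) NumPy point clouds:

```python
import numpy as np

def sample_query_points(pcd, n=500, pad=0.05):
    """Sample n points uniformly within the (padded) axis-aligned
    bounding box of an object point cloud pcd of shape (M, 3)."""
    lo = pcd.min(axis=0) - pad
    hi = pcd.max(axis=0) + pad
    return np.random.uniform(lo, hi, size=(n, 3))

# e.g. query_pts = np.vstack([sample_query_points(rack_pcd),
#                             sample_query_points(mug_pcd)])
```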
STEP 2: Now, we spawn the object in its source frame. Using the NDF obtained in step 1, we can minimize the energy to recover the relative pose between the mug in its source configuration and the mug in its target configuration. However, we still have no notion of where to grasp the mug so that it can be transferred from its source configuration to its target configuration. In the original NDF implementation, this information came from querying demonstrations. This is also what makes that implementation category-specific, i.e., a demonstration for a mug can only be reused across mugs and not for any other category. We avoid this limitation by eliminating demonstrations altogether.
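As a sketch of this energy minimization (not the authors' exact optimizer): parameterize a candidate SE(3) transform, apply it to the query points sampled in the target configuration, and descend the descriptor-space energy until the transformed points bear the matching relation to the source observation. Here `f` is the hypothetical NDF interface from the sketch above, and `axis_angle_to_matrix` comes from PyTorch3D:

```python
import torch
from pytorch3d.transforms import axis_angle_to_matrix

def recover_relative_pose(f, query_pts, pcd_tgt, pcd_src, steps=500, lr=1e-2):
    """Gradient-descend an SE(3) candidate so that the transformed query
    points (torch tensor, (N, 3)) relate to the source observation the way
    the original query points relate to the target observation."""
    # Small random init avoids the axis-angle singularity at zero rotation.
    w = (1e-3 * torch.randn(3)).requires_grad_()  # rotation, axis-angle
    t = torch.zeros(3, requires_grad=True)        # translation
    z_tgt = f(query_pts, pcd_tgt).detach()        # fixed target descriptors
    opt = torch.optim.Adam([w, t], lr=lr)
    for _ in range(steps):
        R = axis_angle_to_matrix(w)               # (3, 3) rotation matrix
        pts = query_pts @ R.T + t                 # candidate transform applied
        energy = torch.sum(torch.abs(f(pts, pcd_src) - z_tgt))
        opt.zero_grad()
        energy.backward()
        opt.step()
    return axis_angle_to_matrix(w).detach(), t.detach()  # target -> source
```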
STEP 3: To replace the grasp-location information that a demonstration would provide, we instead employ a grasping framework. We use the Volumetric Grasping Network (VGN) to obtain 6-DoF end-effector poses given the target configuration. The grasps obtained are scored based on their kinematic feasibility (shown in blue). We further supplement these by sampling a set of functional grasps (shown in red), and choose one final grasp that is ideally both functional and kinematically feasible. The image shows this pruning being performed to select a single grasp for execution.
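A sketch of this pruning under an assumed data layout (each grasp carries a 4x4 pose matrix, and VGN grasps carry a feasibility score): accept the highest-scoring feasible grasp that lies close to some sampled functional grasp.

```python
import numpy as np

def select_grasp(vgn_grasps, functional_grasps, pos_tol=0.02, rot_tol=0.3):
    """Return the highest-scoring kinematically feasible grasp that is
    also close to a sampled functional grasp, or None if none qualifies.

    vgn_grasps        -- list of {"pose": 4x4 ndarray, "score": float}
    functional_grasps -- list of {"pose": 4x4 ndarray}
    """
    for g in sorted(vgn_grasps, key=lambda g: g["score"], reverse=True):
        for fg in functional_grasps:
            dp = np.linalg.norm(g["pose"][:3, 3] - fg["pose"][:3, 3])
            # Geodesic rotation distance via the trace of the relative rotation.
            R_rel = g["pose"][:3, :3].T @ fg["pose"][:3, :3]
            dth = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
            if dp < pos_tol and dth < rot_tol:
                return g
    return None
```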
STEP 4: Given this target pose of the end-effector, we sample points on the object around the center point of the grasp (i.e., between the fingers of the end-effector). Our objective is now to find out where this set of points lies in the source frame of the object, since that is where we would have to grasp the object in its source configuration.
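A sketch of this sampling, assuming the convention that the grasp pose is a 4x4 matrix whose origin sits at the midpoint between the fingers:

```python
import numpy as np

def grasp_query_points(T_grasp, n=100, sigma=0.02):
    """Sample a tight Gaussian cloud of query points around the grasp
    center (the midpoint between the fingers), in the world frame."""
    center = T_grasp[:3, 3]
    return center + sigma * np.random.randn(n, 3)
```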
STEP 5: To find the grasping pose in the source frame, we leverage the result of step 2, which gives the transformation between the mug in its source frame and the mug in its target frame. Applying the same transform to the end-effector pose obtained in the target configuration yields the end-effector pose in the source frame. We then employ off-the-shelf inverse kinematics and planning modules to execute this pick-and-place motion.
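Putting steps 2 and 5 together, a sketch of the transfer, where R and t are the rotation and translation recovered in step 2 mapping target-frame poses to the source frame (convert torch tensors with .numpy() if using the earlier sketch):

```python
import numpy as np

def transfer_grasp(R, t, T_ee_target):
    """Map the end-effector pose selected in the target configuration to
    the source configuration; the result is handed to IK and planning."""
    T_rel = np.eye(4)
    T_rel[:3, :3] = R
    T_rel[:3, 3] = t
    return T_rel @ T_ee_target  # 4x4 end-effector pose in the source frame
```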