Using a grasping network provides a strong prior for energy optimization and eliminates the need for demonstrations
We believe that our target-aware grasping pipeline is similar to how humans approach the grasping problem. Intuitively, human grasping has some key attributes:
Consider the task of hanging a mug on a rack. Humans do not use a pose estimator to obtain an exact quantitative measure of how the initial and target frames are translated and oriented with respect to some arbitrary frame of reference. Instead, our reasoning is comparative: we can imagine how the mug needs to look in its target configuration, and we can perceive how it looks in its current state. Likewise, instead of using a pose estimator, we use the NDF feature descriptor, and the energy optimization performed within the NDF framework resembles the comparative process that happens seamlessly in our brains.
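The comparative matching described above can be illustrated with a small toy sketch. It is not the NDF implementation: `toy_descriptor` is a hypothetical stand-in for a learned descriptor (here, just distances from query points to object points), the search covers translation only rather than a full 6-DoF pose, and a simple random search replaces gradient-based optimization. All function names and parameters are assumptions made for illustration.

```python
import math
import random


def translate(points, translation):
    """Shift every 3D point by the same translation vector."""
    return [tuple(p_i + t_i for p_i, t_i in zip(p, translation)) for p in points]


def toy_descriptor(query_points, object_points):
    """Hypothetical stand-in for a learned descriptor: the distances from
    each query point to each object point, flattened into one list."""
    return [math.dist(q, p) for q in query_points for p in object_points]


def energy(translation, query_points, object_points, target_descriptor):
    """L1 difference between the descriptor of the moved query points and
    the descriptor of the desired (target) configuration."""
    current = toy_descriptor(translate(query_points, translation), object_points)
    return sum(abs(c - t) for c, t in zip(current, target_descriptor))


def minimize_energy(query_points, object_points, target_descriptor,
                    steps=300, step_size=0.05, seed=0):
    """Random-search minimization: keep a candidate only if it lowers the energy."""
    rng = random.Random(seed)
    best_t = (0.0, 0.0, 0.0)
    best_e = energy(best_t, query_points, object_points, target_descriptor)
    for _ in range(steps):
        candidate = tuple(t + rng.gauss(0.0, step_size) for t in best_t)
        e = energy(candidate, query_points, object_points, target_descriptor)
        if e < best_e:
            best_t, best_e = candidate, e
    return best_t, best_e
```

The point of the sketch is that no explicit pose is ever estimated: the search only compares how the scene "looks" now (the current descriptor) with how it should look (the target descriptor) and moves in whichever direction reduces the mismatch.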
Humans use this intuition to hold the object by a feature that makes the target configuration feasible. Humans also ensure that the chosen grasp is kinematically feasible for them to achieve. In our pipeline, this role is played by the Volumetric Grasping Network, which ranks grasps by their kinematic feasibility. We then choose the most kinematically feasible grasp that makes functional sense, in alignment with how humans approach grasping.
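The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `GraspCandidate` fields, the idea of testing "functional sense" with an energy threshold, and the `energy_threshold` parameter are all hypothetical, since the text does not specify how the functional check is implemented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GraspCandidate:
    name: str
    feasibility: float    # assumed score from the grasping network; higher is better
    target_energy: float  # assumed descriptor-matching energy; lower is better


def select_grasp(candidates: Sequence[GraspCandidate],
                 energy_threshold: float) -> Optional[GraspCandidate]:
    """Keep grasps that are compatible with the target configuration
    (energy at or below the threshold), then return the most feasible one."""
    functional = [c for c in candidates if c.target_energy <= energy_threshold]
    if not functional:
        return None  # no grasp makes functional sense for this target
    return max(functional, key=lambda c: c.feasibility)


candidates = [
    GraspCandidate("rim", feasibility=0.9, target_energy=2.0),
    GraspCandidate("handle", feasibility=0.7, target_energy=0.3),
    GraspCandidate("body", feasibility=0.8, target_energy=0.4),
]
chosen = select_grasp(candidates, energy_threshold=0.5)
```

In this example the rim grasp has the highest feasibility but is filtered out because it does not suit the target, so the body grasp is chosen as the most feasible of the remaining candidates.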
One of the key issues with a demonstration-dependent approach is that it restricts grasping to specific categories of objects, such as mugs or bowls. Humans are not limited in this way: we can pick up a wide variety of unseen objects. While demonstration-driven grasps may generalize across different types of mugs, a demonstration on a mug cannot be used to pick up a shoe, for example. Demonstration is a form of supervision, which we hope to eliminate so that the task becomes unsupervised. By removing the burden of recording demonstrations, we make this framework more easily adaptable to a variety of tasks and to objects from different categories.