Pose-aware grasping is a multi-disciplinary problem combining efforts in grasping, planning, object descriptors, and pose estimation. This section summarizes the most relevant prior work in these areas. [1] formalizes a long-horizon planning problem in which a tool must be grasped at different locations depending on the end goal, arguing that the classical notion of grasp affordance does not guarantee task success over a long horizon. However, it does not consider object placement after grasping. [2] addresses pose-aware placement, using two manipulators to place objects on supermarket shelves so that the brand label remains visible. It is limited, however, to objects whose brand labels were seen by the network during training, and to labels that can be bounded by a rectangle. [3] is an end-to-end pose-aware object rearrangement network that uses a graph neural network to decide which objects to move and to estimate where they can be placed to complete a rearrangement task. However, it simplifies grasping to top-down grasps only and cannot handle complex configurations that demand 6-DoF grasp planning.
A key element of pose-aware grasping is a pose estimator or descriptor that can relate an object between its initial and target poses. [14, 15] are early examples of template matching with coarse 3D primitives; they generalize across changes in shape and pose, but degrade when objects deviate significantly from the primitive, and they also suffer when the test and reference environments differ too much. [11, 12, 13] eliminate the need for a pose estimator and instead show that keypoints can be used to track objects between their initial and target poses. However, keypoints must be hand-picked carefully, and because these works are built on 2D convolutional neural networks, they often require several images from different orientations to achieve rotation invariance. Addressing these challenges, Neural Descriptor Fields [16] is, to our knowledge, the only prior work that provides an SE(3)-equivariant object representation. This allows us to represent the initial and target poses with an equivariant representation and leverage a grasping network to optimally sample grasps on the target scene that can be transferred to the initial scene.
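To make the descriptor-matching idea concrete, the following is a minimal sketch, not the implementation of [16], of how a grasp pose could be transferred between scenes by matching descriptors at a set of gripper query points. The function `toy_descriptor` is a hypothetical stand-in for a trained equivariant network, and the optimization setup (rotation-vector parameterization with random restarts) is our assumption for illustration.

```python
# Minimal sketch (not the NDF implementation): transfer a grasp pose from a
# reference scene to a new scene by matching descriptors at gripper query points.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation as R

def toy_descriptor(points, cloud):
    # Hypothetical stand-in for a trained descriptor network conditioned on an
    # object point cloud. This toy version (coordinates relative to the cloud
    # centroid) is only translation-equivariant; it just illustrates the loop.
    return points - cloud.mean(axis=0)

def apply_pose(pose, pts):
    Rm, t = pose
    return pts @ Rm.T + t

def transfer_grasp(ref_cloud, ref_grasp_pose, new_cloud, query_pts, n_restarts=8):
    """Find the pose in the new scene whose query-point descriptors best match
    the descriptors of the grasp pose in the reference scene."""
    target = toy_descriptor(apply_pose(ref_grasp_pose, query_pts), ref_cloud)

    def energy(x):  # x = [rotation vector (3), translation (3)]
        Rm = R.from_rotvec(x[:3]).as_matrix()
        pts = query_pts @ Rm.T + x[3:]
        return np.sum((toy_descriptor(pts, new_cloud) - target) ** 2)

    best = None
    for _ in range(n_restarts):  # random restarts: the energy is non-convex
        x0 = np.concatenate([np.random.randn(3) * 0.5,
                             new_cloud.mean(axis=0) + np.random.randn(3) * 0.05])
        res = minimize(energy, x0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return R.from_rotvec(best.x[:3]).as_matrix(), best.x[3:]
```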
There has also been considerable recent work on object-specific grasp synthesis. [4] demonstrates an end-to-end learning framework with dorsal and ventral models that jointly learn object detection and grasping from RGB images, developing an association between the two tasks. [5] extends this to 3D by simultaneously learning multi-view object recognition and 6-DoF grasp synthesis. [6] uses the ContactDB dataset [7] to learn visual affordance maps and manipulate objects in the regions where humans would manipulate them. However, although object semantics provide important prior knowledge for grasping, these works do not consider target object placement. [17] uses an object-agnostic grasping framework that maps visual observations to actions, inferring dense pixel-wise affordance maps for four grasping primitives; it executes the action with the highest affordance and recognizes picked objects with a cross-domain image classification framework that matches observed images to product images from the web. [18] uses full 3D scene information to directly learn collision-free grasp proposals, enabling 6-DoF grasp synthesis in real time.
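As an illustration of the affordance-map selection step described for [17], the sketch below picks the grasping primitive and pixel with the highest predicted success. The primitive names follow [17], but the maps here are random placeholders standing in for the predictions of a trained fully convolutional network.

```python
# Minimal sketch of multi-affordance action selection in the style of [17]:
# given dense per-pixel affordance maps for several grasping primitives,
# execute the (primitive, pixel) pair with the highest predicted success.
import numpy as np

PRIMITIVES = ["suction-down", "suction-side", "grasp-down", "flush-grasp"]

def select_action(affordance_maps):
    """affordance_maps: array of shape (num_primitives, H, W) with values in [0, 1]."""
    flat_idx = np.argmax(affordance_maps)
    prim, v, u = np.unravel_index(flat_idx, affordance_maps.shape)
    return PRIMITIVES[prim], (u, v), affordance_maps[prim, v, u]

# Placeholder maps standing in for network predictions on a 480x640 observation.
maps = np.random.rand(len(PRIMITIVES), 480, 640)
primitive, pixel, score = select_action(maps)
print(f"execute {primitive} at pixel {pixel} (affordance {score:.2f})")
```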
Therefore, instead of decoupling these problems and solving them individually, our goal is to combine target object pose estimation and grasp synthesis in a single end-to-end network.
[1] Xu, Danfei, et al. "Deep Affordance Foresight: Planning Through What Can Be Done in the Future." 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021.
[2] Su, Yung-Shan, et al. "Pose-Aware Placement of Objects with Semantic Labels - Brandname-based Affordance Prediction and Cooperative Dual-Arm Active Manipulation." 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019.
[3] Qureshi, Ahmed H., et al. "NeRP: Neural Rearrangement Planning for Unknown Objects." arXiv preprint arXiv:2106.01352, 2021.
[4] Jang, Eric, et al. "End-to-End Learning of Semantic Grasping." arXiv preprint arXiv:1707.01932, 2017.
[5] Kasaei, Hamidreza, et al. "Simultaneous Multi-View Object Recognition and Grasping in Open-Ended Domains." arXiv preprint arXiv:2106.01866, 2021.
[6] Mandikal, Priyanka, and Kristen Grauman. "Learning Dexterous Grasping with Object-Centric Visual Affordances." arXiv preprint arXiv:2009.01439 (ICRA 2021).
[7] Brahmbhatt, Samarth, et al. "ContactDB: Analyzing and Predicting Grasp Contact via Thermal Imaging." arXiv preprint arXiv:1904.06830 (CVPR 2019).
[8] Mahler, Jeffrey, et al. "Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics." arXiv preprint arXiv:1703.09312, 2017.
[9] Fang, Hao-Shu, et al. "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11441–11450.
[10] Chu, Fu-Jen, et al. "Real-World Multiobject, Multigrasp Detection." IEEE Robotics and Automation Letters, vol. 3, 2018, pp. 3355–3362.
[11] Gao, Wei, and Russ Tedrake. "kPAM 2.0: Feedback Control for Category-Level Robotic Manipulation." IEEE Robotics and Automation Letters, vol. 6, no. 2, 2021, pp. 2962–2969.
[12] Gao, Wei, and Russ Tedrake. "kPAM-SC: Generalizable Manipulation Planning using KeyPoint Affordance and Shape Completion." arXiv preprint arXiv:1909.06980, 2019.
[13] Manuelli, Lucas, et al. "kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation." arXiv preprint arXiv:1903.06684, 2019.
[14] Harada, Kensuke, et al. "Probabilistic Approach for Object Bin Picking Approximated by Cylinders." 2013 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2013, pp. 3742–3747.
[15] Miller, Andrew T., et al. "Automatic Grasp Planning Using Shape Primitives." 2003 IEEE International Conference on Robotics and Automation (ICRA), vol. 2, IEEE, 2003, pp. 1824–1829.
[16] Simeonov, Anthony, et al. "Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation." arXiv preprint arXiv:2112.05124, 2021.
[17] Zeng, Andy, et al. "Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching." arXiv preprint arXiv:1710.01330, 2017.
[18] Breyer, Michel, et al. "Volumetric Grasping Network: Real-time 6 DOF Grasp Detection in Clutter." arXiv preprint arXiv:2101.01132, 2021.