Ruinian Xu, Fu-Jen Chu, Chao Tang, Weiyu Liu and Patricio A. Vela
Georgia Institute of Technology, GA, U.S.A.
Abstract: This letter investigates the addition of keypoint detections to a deep network affordance segmentation pipeline. The intent is to better interpret the functionality of object parts from a manipulation perspective. While affordance segmentation does provide label information about the potential use of object parts, it lacks predictions on the physical geometry that would support such use. The keypoints remedy the situation by providing structured predictions regarding position, direction, and extent. To support joint training of affordances and keypoints, a new dataset is created based on the UMD dataset. Called the UMD+GT affordance dataset, it emphasizes household objects and affordances. The dataset has a uniform representation for five keypoints that encodes information about where and how to manipulate the associated affordance. Visual processing benchmarking shows that the trained network, called AffKp, achieves state-of-the-art performance on affordance segmentation and satisfactory results on keypoint detection. Manipulation experiments show more stable detection of the operating position for AffKp versus segmentation-only methods, and the ability to infer object part pose and operating direction for task execution.
RA-L with ICRA2021: paper link
Code: available on GitHub
Affordance keypoint for robotic manipulation
Figure: Affordance segmentation and keypoint detection used for a manipulation task. (a) Task: use the tool with the pound affordance to hammer the nail. (b) Affordance mask and keypoint outputs provide operating positions and directions for the hammer. (c) The successfully executed task. The 2D directions from (b) are converted to 3D axes annotated in (a), with the red axis being the principal axis.
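The caption above mentions converting 2D keypoint directions into 3D axes. The sketch below illustrates one plausible way to do this with a depth image and pinhole camera intrinsics: deproject two pixel keypoints into the camera frame and take their normalized difference as the operating axis. The helper names (`deproject`, `operating_axis_3d`) and intrinsics layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with its depth (meters) into the camera frame
    using a pinhole model. Illustrative helper, not from the paper."""
    z = depth[int(v), int(u)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def operating_axis_3d(kp_start, kp_end, depth, intrinsics):
    """Convert a 2D keypoint direction (start pixel -> end pixel) into a unit 3D axis."""
    fx, fy, cx, cy = intrinsics
    p0 = deproject(*kp_start, depth, fx, fy, cx, cy)
    p1 = deproject(*kp_end, depth, fx, fy, cx, cy)
    axis = p1 - p0
    return axis / np.linalg.norm(axis)
```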
Supplementary video
Affordance keypoint representation
Figure: Affordance representation consisting of a label map and numbered keypoints. From (a) to (d), the annotated affordances are wrap-grasp, pound, contain, grasp, scoop and cut. The dots indicate keypoints 1-4 and the x's indicate keypoint 5. A single arrow means that the direction matters; a double arrow means that it does not and keypoint flips are permitted.
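As a minimal sketch of how this five-keypoint annotation might be stored and queried, the example below keeps the affordance label, the five pixel keypoints, and a flag for whether keypoint flips are permitted. The field names, the choice of keypoint 5 as the operating point, and the direction convention (keypoint 1 toward keypoint 2) are assumptions for illustration rather than the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordanceKeypoints:
    """One affordance instance: label plus five 2D keypoints in pixel coordinates.
    Keypoints 1-4 (dots in the figure) mark the part's extent; keypoint 5 (the 'x')
    marks the operating point. Field names are illustrative only."""
    affordance: str                 # e.g. "pound", "cut", "scoop"
    kps: np.ndarray                 # shape (5, 2), rows are keypoints 1-5
    direction_matters: bool = True  # False when keypoint flips are permitted

    def operating_position(self) -> np.ndarray:
        # Keypoint 5 as the point on the part to act on.
        return self.kps[4]

    def operating_direction(self) -> np.ndarray:
        # Assumed convention: 2D direction from keypoint 1 toward keypoint 2, normalized.
        d = self.kps[1] - self.kps[0]
        return d / np.linalg.norm(d)
```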
Table I: Segmentation benchmark on UMD+GT dataset
Table II: Keypoint detection benchmark on UMD+GT dataset