A4T: Hierarchical Affordance Detection for Transparent Objects Depth Reconstruction and Manipulation 

★Best Student Paper Award Finalist, CASE 2022★

Abstract

Transparent objects are widely used in our daily lives, so robots need to be able to handle them. However, transparent objects suffer from light reflection and refraction, which makes it challenging to obtain the accurate depth maps required for handling tasks. In this paper, we propose a novel affordance-based framework for depth reconstruction and manipulation of transparent objects, named A4T. A hierarchical AffordanceNet is first used to detect the objects and their associated affordances, which encode the relative positions of an object's different parts. Then, given the predicted affordance map, a multi-step depth reconstruction method progressively reconstructs the depth map of the transparent objects. Finally, the reconstructed depth map is employed for affordance-based manipulation of transparent objects. To evaluate our proposed method, we construct the largest real-world affordance dataset that includes transparent objects and provides accurate depth maps for them. Extensive experiments show that our proposed method predicts accurate affordance maps and significantly improves the depth reconstruction of transparent objects compared to the state-of-the-art method, reducing the Root Mean Squared Error (in metres) from 0.097 to 0.042. Furthermore, we demonstrate the effectiveness of our proposed method with a series of robotic manipulation experiments on transparent objects.
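For reference, the depth error quoted above is the Root Mean Squared Error in metres over pixels with valid ground-truth depth. Below is a minimal sketch of such a metric; the function name and the convention that missing ground truth is encoded as zero are assumptions for illustration, not taken from the A4T codebase.

```python
import numpy as np

def depth_rmse(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Root Mean Squared Error (metres) over pixels with valid ground truth."""
    valid = gt_depth > 0  # assumption: missing ground-truth depth is stored as 0
    diff = pred_depth[valid] - gt_depth[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

# Example usage with two depth maps in metres:
# rmse = depth_rmse(reconstructed_depth, gt_depth)
```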

Overview

Our goal is to obtain accurate depth maps of transparent objects by leveraging affordance detection, so as to facilitate their manipulation, e.g., stacking two plastic cups in the example above. Top left: one plastic cup is in the robot's gripper and another is placed on the table, viewed from the side by an RGB-D camera. Bottom left: a depth map of the cup on the table is obtained from the camera, and its affordance map is predicted, in which red marks the region with a deep cavity to hold liquid ("contain") and blue marks the region that can be held ("wrap-grasp"). Bottom right: the affordance map is used to improve the depth map and to predict the gripping center for stacking. Top right: with the improved depth map and the predicted gripping center, the robot stacks the plastic cups successfully.
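As an illustration of how the predicted affordance map can drive grasping, the sketch below takes the centroid of the "wrap-grasp" region and back-projects it into the camera frame with pinhole intrinsics to obtain a gripping center. The label index, function name and intrinsics variables (fx, fy, cx, cy) are hypothetical; the actual A4T pipeline may derive the gripping center differently.

```python
import numpy as np

WRAP_GRASP_ID = 2  # assumed label index of the "wrap-grasp" affordance

def gripping_center(affordance_mask, depth, fx, fy, cx, cy):
    """Return an (x, y, z) point in the camera frame for the wrap-grasp region."""
    vs, us = np.nonzero(affordance_mask == WRAP_GRASP_ID)
    if len(us) == 0:
        return None  # no wrap-grasp region detected
    u, v = us.mean(), vs.mean()          # 2D centroid of the region in pixels
    z_vals = depth[vs, us]
    z_vals = z_vals[z_vals > 0]          # keep only valid depth readings
    if len(z_vals) == 0:
        return None
    z = np.median(z_vals)                # robust depth estimate at the region
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
```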

Methodology 

(a): From left to right. Given an RGB-D image of a scene with transparent objects, A4T uses three networks to infer 1) affordance maps of transparent objects, 2) occlusion boundaries and contact edges, and 3) surface normals. Then, based on the affordance map, the depth of the transparent object is progressively reconstructed with a global optimisation method and a plane fitting method. The coloured dashed rectangles in the top right corner indicate the input information for each reconstruction step.
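To make the plane-fitting step concrete, the following sketch fits a least-squares plane z = a*u + b*v + c to the valid depth pixels of one affordance region and fills the missing (transparent) pixels from the fitted plane. Fitting in pixel coordinates and the function name are simplifying assumptions for illustration; the global optimisation step from the paper is not shown.

```python
import numpy as np

def fill_region_with_plane(depth, region_mask):
    """Fill invalid depth (== 0) inside `region_mask` using a fitted plane."""
    vs, us = np.nonzero(region_mask)
    z = depth[vs, us]
    valid = z > 0
    if valid.sum() < 3:
        return depth  # not enough support to fit a plane
    # Least-squares fit of z = a*u + b*v + c over the valid pixels
    A = np.stack([us[valid], vs[valid], np.ones(valid.sum())], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, z[valid], rcond=None)  # [a, b, c]
    filled = depth.copy()
    missing = ~valid
    filled[vs[missing], us[missing]] = (
        coeffs[0] * us[missing] + coeffs[1] * vs[missing] + coeffs[2]
    )
    return filled
```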

(b): From left to right. A deep Convolutional Neural Network (CNN) backbone extracts features from the RGB image. The Region Proposal Network (RPN) shares weights with the CNN backbone and outputs Regions of Interest (RoIs). For each RoI, two RoIAlign layers extract and pool its features into fixed-size feature maps. Two fully connected layers perform object classification, object location regression and affordance classification. Three convolutional-deconvolutional layers produce affordance maps that are fused with the affordance classification scores. Finally, a softmax layer outputs a multi-class affordance mask.
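A hedged PyTorch sketch of this per-RoI affordance head is given below: conv-deconv layers upsample the pooled RoI features into per-class masks, which are fused with the affordance classification scores before a softmax. The layer sizes, the number of affordance classes and the score-scaling fusion are assumptions for illustration only, not the released A4T implementation.

```python
import torch
import torch.nn as nn

class AffordanceMaskHead(nn.Module):
    """Sketch of a per-RoI affordance mask head (conv-deconv + score fusion)."""

    def __init__(self, in_channels=256, num_affordances=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, num_affordances, 2, stride=2),
        )

    def forward(self, roi_features, affordance_scores):
        # roi_features: (N, C, H, W) pooled by RoIAlign
        # affordance_scores: (N, num_affordances) from the classification branch
        masks = self.convs(roi_features)                     # (N, K, 8H, 8W)
        masks = masks * affordance_scores[:, :, None, None]  # fuse class scores
        return torch.softmax(masks, dim=1)                   # multi-class mask
```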