Visual-tactile Fusion for Transparent Object Grasping in Complex Backgrounds

Authors: Shoujie Li, Haixin Yu, Wenbo Ding, Houde Liu, Linqi Ye, Chongkun Xia, Xueqian Wang,  Xiao-Ping Zhang

Abstract: The accurate detection and grasping of transparent objects are challenging but important for robots. In this article, a visual-tactile fusion framework for transparent object grasping in complex backgrounds is proposed, which combines the advantages of vision and touch and greatly improves the grasping efficiency for transparent objects. First, we propose a multi-scene, fully synthetic grasping dataset named SimTrans12K together with a Gaussian-Mask annotation method. Next, based on the TaTa gripper, we propose a grasping network named the transparent object grasping convolutional neural network (TGCNN) for grasping position detection, which performs well in both synthetic and real scenes. Inspired by human grasping, a tactile calibration algorithm and a visual-tactile fusion method are designed, improving the grasping success rate by 36.7% compared to direct grasping and the classification accuracy by 39.1%. Furthermore, the tactile height sensing (THS) module and the tactile position exploration (TPE) module are added to solve the problem of grasping transparent objects in irregular and visually undetectable scenes. Experimental results demonstrate the validity of the proposed framework.

Introduction: Inspired by human grasping behavior, shown in the following figure, in which vision and touch work together on complicated tasks, this paper proposes a visual-tactile fusion framework based on the TaTa gripper for transparent object grasping in complex backgrounds. Here, tactile sensing compensates for the limitations of vision. In addition, the framework can be extended to more challenging scenes such as irregular backgrounds or even visually undetectable scenes.


The contributions of this work are fourfold:

Firstly, a synthetic transparent object dataset named SimTrans12K is proposed, containing varied backgrounds, lighting, and camera positions and offering more complex and abundant background information than previous transparent object datasets such as ClearGrasp and Dex-NeRF. In addition, to improve Sim2Real performance, we propose a Gaussian-Mask method for annotating transparent object grasping positions, which represents the boundary information of transparent objects better than a binary ground-truth grasping position.

Secondly, for the TaTa gripper, a generative grasping network named the transparent object grasping convolutional neural network (TGCNN) is proposed, which detects transparent object grasping positions under complex backgrounds and lighting while being trained only on the synthetic dataset. Meanwhile, a tactile information extraction algorithm and a visual-tactile fusion-based transparent object classification algorithm are developed to compensate for visual deviation.

Thirdly, to realize transparent object grasping in complex backgrounds, we propose a visual-tactile fusion-based grasping framework with tactile calibration. We further add the tactile height sensing (THS) module and the tactile position exploration (TPE) module to this framework, enabling transparent object grasping in stacking, overlapping, or even visually undetectable scenes. These scenes are extremely difficult, and only a few prior studies have addressed them.

Finally, to test the effectiveness of the proposed visual-tactile fusion framework, we carefully design several experiments comparing it against state-of-the-art baseline methods, which indicate that the proposed method considerably improves transparent object grasping and classification. Moreover, we also test the proposed method in highly difficult scenes such as stacking, overlapping, undulating, and dynamic underwater environments, greatly extending the application areas of transparent object grasping.

Hardware

The hardware setup is depicted in the following figure. A RealSense D435i camera fixed on the top frame serves as the “eye” and acquires 480×640 images, while the gripper is attached to a UR5 robotic arm. Two LEDs provide lighting for the platform.
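For concreteness, a minimal capture sketch using the standard pyrealsense2 Python bindings is shown below; the stream settings simply mirror the 640×480 color resolution mentioned above, and the snippet is not taken from the authors' code.

```python
import numpy as np
import pyrealsense2 as rs

# Configure the D435i to stream 640x480 color frames at 30 FPS.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    color_frame = frames.get_color_frame()
    rgb = np.asanyarray(color_frame.get_data())  # (480, 640, 3) BGR image
finally:
    pipeline.stop()
```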

To realize transparent object grasping, a universal soft gripper named TaTa, which has tactile perception over its whole surface, is used, as shown in the following figure.

Visual-tactile fusion grasping experimental platform. 

TaTa gripper. (A) The structure of TaTa: (a) schematic diagram; (b) layout of the internal LEDs; (c) illustration of the internal light path. (B) Perceiving a screwdriver. (C) Grasping an egg. (D) Grasping a tomato.

S1.TaTa grasping performance test.mp4

S1.TaTa grasping performance test

To verify the capability of handling fragile objects, egg and tomato grasping tests with TaTa are conducted, as shown in video S1, indicating that TaTa can grasp fragile objects. Meanwhile, we address the small imaging range of the previous version of the TaTa gripper by using a camera with a larger imaging range, and we improve the waterproofing for better detection performance and durability.


Methodology

To achieve transparent object grasping position detection, grasping, and classification, we propose a grasping position detection algorithm, a tactile information extraction algorithm, and a visual-tactile fusion classification algorithm, respectively. In addition, a Gaussian-Mask annotation method is designed together with a synthesized transparent object dataset.
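As a rough illustration of the visual-tactile fusion classification idea, the sketch below fuses a visual crop and a tactile image with a two-branch network whose embeddings are concatenated before a classifier head; the layer sizes, input sizes, and fusion scheme are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FusionClassifierSketch(nn.Module):
    """Concatenate visual and tactile embeddings before the classifier head."""
    def __init__(self, num_classes=6):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.visual_branch = branch()   # RGB crop around the detected object
        self.tactile_branch = branch()  # tactile image from the TaTa gripper
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, rgb_crop, tactile_img):
        fused = torch.cat([self.visual_branch(rgb_crop),
                           self.tactile_branch(tactile_img)], dim=1)
        return self.classifier(fused)

logits = FusionClassifierSketch()(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```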

Transparent object visual-tactile fusion grasping framework. (A) Grasping position detection. (B) Tactile information extraction algorithm. (C) Visual-tactile fusion classification.

Dataset Generation and Annotation

We use Blender to build a multi-background transparent object grasping dataset containing 12,000 synthetic images and 160 real images. To improve the reliability of the annotation, we propose a grasping position annotation method based on a Gaussian distribution and the transparent object mask (Gaussian-Mask).
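A minimal sketch of how such a Gaussian-Mask label might be generated is shown below; the function name, the sigma value, and the use of a single annotated grasp point are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaussian_mask_label(mask, center, sigma=15.0):
    """Build a grasp-position heatmap: a 2D Gaussian centered on the annotated
    grasp point, clipped by the transparent object's segmentation mask.

    mask   : (H, W) binary array, 1 inside the transparent object.
    center : (row, col) annotated grasp position.
    sigma  : spread of the Gaussian in pixels (illustrative value).
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))
    # Zero out everything outside the object so the label encodes the boundary,
    # unlike a hard binary label that treats all interior pixels equally.
    return heatmap * mask

# Example usage with a synthetic circular mask.
mask = np.zeros((480, 640), dtype=np.float32)
ys, xs = np.mgrid[0:480, 0:640]
mask[(ys - 240) ** 2 + (xs - 320) ** 2 < 80 ** 2] = 1.0
label = gaussian_mask_label(mask, center=(240, 320))
```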

SimTrans12K Dataset. (A) Scene setup for generating the transparent object synthetic dataset using Blender. (B) Synthetic dataset of transparent objects. (C) The real dataset in different backgrounds. (D) The real dataset in different brightnesses.
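Panel (A) above shows the Blender scene setup; the following hedged sketch illustrates how such randomized renders could be scripted with Blender's bpy module. The object and light names, parameter ranges, and render count are assumptions about the scene file, not the authors' generation script.

```python
import random
import bpy

scene = bpy.context.scene
camera = scene.camera
light = bpy.data.objects.get("Light")  # assumed name of the key light in the scene

for i in range(10):  # render a handful of randomized views
    # Randomize the camera position and light energy for each sample.
    camera.location = (random.uniform(-0.2, 0.2),
                       random.uniform(-0.2, 0.2),
                       random.uniform(0.6, 1.0))
    if light is not None:
        light.data.energy = random.uniform(200, 1000)
    # Swapping in a random background image on a backdrop plane would go here;
    # omitted for brevity.
    scene.render.filepath = f"//renders/sample_{i:05d}.png"
    bpy.ops.render.render(write_still=True)
```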

Grasping Strategy

Vision and touch are important ways for robots to perceive external information, and their fusion can greatly expand the range of application scenarios. In this section, we propose a transparent object grasping framework based on visual-tactile fusion, which combines the advantages of a large visual detection range and reliable tactile sensation. As shown in the figure, to further analyze the role of visual-tactile fusion in different scenes, we divide transparent object grasping into three scenarios: a plane with complex backgrounds, irregular scenes, and visually undetectable scenes. To handle the latter two, we adjust the visual and tactile functions of the planar grasping framework and add the THS and TPE modules.
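The control flow below is an illustrative sketch of this framework; all callables are placeholders standing in for the detection network, tactile calibration, and the THS/TPE modules described in this article, not the authors' code.

```python
TABLE_HEIGHT = 0.0  # assumed height of the flat workspace in the robot frame (meters)

def grasp_transparent_object(scene_type, rgb_image, detect, calibrate, ths, tpe, execute):
    """Illustrative control flow of the fused framework. `detect` is the visual
    grasp detector (TGCNN-style), `calibrate` the tactile calibration step,
    `ths`/`tpe` the tactile height sensing and position exploration modules,
    and `execute` the low-level grasp routine."""
    if scene_type == "visually_undetectable":
        position = tpe()                 # touch-only search for the object
        height = ths(position)           # then sense the support height by touch
    else:
        position, radius = detect(rgb_image)   # visual grasp proposal
        height = ths(position) if scene_type == "irregular" else TABLE_HEIGHT
    position = calibrate(position, height)     # correct the visual deviation by touch
    return execute(position, height)
```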

Experiments

Exp. 1: Synthetic Data Detection

To compare grasping position detection networks, we evaluate mainstream generative grasping networks such as GGCNN, the network of Redmon et al., and GI-NNet. Since most of these networks are designed for parallel two-finger grippers with RGB-D input, we modify their output structure so that they can operate on our proposed dataset: their outputs are changed from a quality map, width, and angle to position and radius, and their input is changed to RGB.
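The sketch below shows a generative network with this RGB-in, position-and-radius-out structure in PyTorch; the layer sizes are illustrative assumptions, and the model is neither TGCNN nor any of the baselines exactly.

```python
import torch
import torch.nn as nn

class PositionRadiusNet(nn.Module):
    """Illustrative fully convolutional generative network: RGB in, a dense
    grasp-position heatmap and a grasp-radius map out (the output format used
    in this work). Layer sizes are placeholders."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.position_head = nn.Conv2d(16, 1, 1)  # trained against Gaussian-Mask labels
        self.radius_head = nn.Conv2d(16, 1, 1)    # grasp radius for the TaTa gripper

    def forward(self, rgb):
        feat = self.decoder(self.encoder(rgb))
        return torch.sigmoid(self.position_head(feat)), self.radius_head(feat)

pos_map, radius_map = PositionRadiusNet()(torch.randn(1, 3, 480, 640))
```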

In our experiments, we evaluate TGCNN in the following scenarios: (a) image-wise evaluation with unseen backgrounds; (b) object-wise evaluation with unseen objects; (c) the effect of the Gaussian representation; and (d) multi-object evaluation in cluttered scenes.

Synthetic dataset experiments. (A) Image-wise evaluation with unseen backgrounds. (B) Object-wise evaluation with unseen objects. (C) Multi-object evaluation in the cluttered scene.

The comparison between Gaussian and binary representations.

Exp. 2: Grasping Position Detection in Different Backgrounds

To verify the detection effectiveness of the transparent object grasping network in real scenes, we select 12 different backgrounds, including 6 colored, 4 patterned, and 2 scenic ones. We use 4,000 synthetic images of two transparent objects as the training set and 110 real images of 6 transparent objects, containing about 600 labels, as the test set.

S2.Transparent object grasping position detection experiment I Different backgrounds.mp4

S2.Transparent object grasping position detection experiment I Different backgrounds

Exp. 3: Grasping Position Detection under Different Brightness

Besides the background, light intensity is also an important factor affecting the detection of transparent objects. We use 4,000 synthetic images of two transparent objects as the training set and 50 real images of 6 transparent objects under different brightness levels as the test set. We test the effect of brightness on transparent object grasping detection by varying the illumination and measuring the light intensity with a lux meter.

S3.Transparent object grasping position detection experiment II Different lights.mp4

S3.Transparent object grasping position detection experiment II Different lights

S4.Transparent object grasping position detection experiment III Different heights and light positions.mp4

S4.Transparent object grasping position detection experiment III Different heights and light positions

Exp. 4: Grasping and Classification on Planes in Complex Backgrounds

To verify the effectiveness of the proposed grasping and classification framework, grasping and classification experiments on common objects are carried out. The selected objects, shown in the figure, include an angled wine glass and a smooth wine glass, a girdled water glass and a normal water glass, and a medicine bottle with a textured bottom and a smooth medicine bottle. These objects have similar shapes, small sizes, and smooth surfaces, making them difficult to either grasp or classify.

S6.Transparent object grasping and classification experiment.mp4

S6.Transparent object grasping and classification experiment

Exp. 5: Transparent Fragment Grasping

Once a transparent object is broken, a large number of fragments are produced, which are irregular in shape and size and difficult to grasp. To test the effectiveness of the visual-tactile fusion grasping framework, transparent fragment grasping experiments are performed, showing that tactile sensing substantially improves the grasping success rate.

S7.Transparent fragment grasping experiment.mp4

S7.Transparent fragment grasping experiment

Exp. 6: Grasping in Irregular Scenes

Compared with grasping transparent objects on a plane, grasping them in irregular scenes is more challenging because the positions and heights of the objects are difficult to obtain. To solve this problem, we add the THS module to the previous framework to obtain the object's height through tactile sensing. For the overlapping scene, we design a case in which the objects are in full contact with each other. Since the texture of transparent objects blends into the background, they are difficult to separate without prior or depth information, so we assume a certain height difference between the objects in the scene; otherwise, objects in contact are difficult to separate.
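A simplified sketch of the THS idea is given below: descend in small steps and use the change in the tactile image to detect contact. The step size, contact threshold, and the robot/tactile interface are assumptions, not the paper's implementation.

```python
import numpy as np

def tactile_height_sensing(robot, start_z, min_z, step=0.005, contact_thresh=8.0):
    """Lower the gripper until the tactile image changes enough to indicate
    contact, then return the height at which the object (or surface) was met.
    `robot` is an assumed wrapper exposing move_to_z() and tactile_image()."""
    reference = robot.tactile_image().astype(np.float32)  # image with no contact
    z = start_z
    while z > min_z:
        z -= step
        robot.move_to_z(z)
        current = robot.tactile_image().astype(np.float32)
        # Mean absolute difference against the no-contact reference image.
        if np.mean(np.abs(current - reference)) > contact_thresh:
            return z
    return None  # nothing sensed down to min_z
```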

Transparent object grasping in special scenes. (A) Overlapping scenes. (B) Stacking scenes. (C) Undulating scenes. (D) Sand scenes. (E) Underwater scenes. (F) High-dynamic underwater scenes.

S8.Transparent object grasping in stacking scenes.mp4

S8.Transparent object grasping in stacking scenes

S9.Transparent object grasping in overlapping scenes.mp4

S9.Transparent object grasping in overlapping scenes

S10.Transparent object grasping in sand scenes.mp4

S10.Transparent object grasping in sand scenes

S11.Transparent object grasping in undulating scenes.mp4

S11.Transparent object grasping in undulating scenes

S12.Transparent object grasping in underwater scenes.mp4

S12.Transparent object grasping in underwater scenes

Exp. 7: Grasping in Visually Undetectable Scenes

Although the THS module solves the problem of transparent object grasping in irregular scenes, it is still difficult to grasp transparent objects in scenes with visual interference, such as high-dynamic underwater environments. The TPE module addresses this problem. To test its effectiveness, we conduct high-dynamic underwater transparent object grasping experiments, using tactile sensing to search for the transparent objects.
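A rough sketch of the TPE idea is given below: sweep the workspace at a fixed exploration height and use tactile contact to localize the object. The raster search pattern, step size, threshold, and hardware interface are assumptions rather than the paper's implementation.

```python
import numpy as np

def tactile_position_exploration(robot, workspace, step=0.03, contact_thresh=8.0):
    """Raster-scan the workspace at a fixed exploration height and report the
    first (x, y) where the tactile image indicates contact with an object.
    `workspace` is ((x_min, x_max), (y_min, y_max)); `robot` is an assumed
    wrapper exposing move_to_xy() and tactile_image()."""
    (x_min, x_max), (y_min, y_max) = workspace
    reference = robot.tactile_image().astype(np.float32)  # no-contact baseline
    for x in np.arange(x_min, x_max, step):
        for y in np.arange(y_min, y_max, step):
            robot.move_to_xy(x, y)
            current = robot.tactile_image().astype(np.float32)
            if np.mean(np.abs(current - reference)) > contact_thresh:
                return (x, y)
    return None  # no contact found anywhere in the workspace
```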

S13.Transparent object grasping in high-dynamic underwater scenes.mp4

S13.Transparent object grasping in high-dynamic underwater scenes

Conclusion

To solve the challenging problem of detecting, classifying, and grasping transparent objects, a visual-tactile fusion framework built on a synthetic dataset is proposed. First, we use the Blender simulation engine to render synthetic datasets instead of manually annotating them, solving the problem of difficult annotation for transparent object datasets. In addition, we use the Gaussian-Mask instead of traditional binarized annotation to make the generated grasping positions more accurate. To achieve grasping position detection for transparent objects, an algorithm named TGCNN is proposed, and multiple comparative experiments show that it achieves good detection under different backgrounds and lighting conditions even when trained only with synthetic data.

Considering the grasping difficulty caused by the smooth surfaces of transparent objects, we propose a tactile calibration method combined with the soft gripper TaTa, which adjusts the grasping position with tactile information and improves the grasping success rate by 36.7% compared to direct grasping. Furthermore, to solve the classification problem of transparent objects in complex scenes, a visual-tactile fusion classification method is proposed, which improves accuracy by 39.1% compared with classification using vision alone.

Finally, to achieve transparent object grasping in irregular and visually undetectable scenes, we propose the THS and TPE modules, which compensate for the absence of visual information. Extensive, systematically designed experiments verify the effectiveness of the proposed framework in various complex scenarios, including stacking, overlapping, undulating, sand, and underwater scenes. We believe that the proposed framework can also be applied to object detection in low-visibility environments such as smoke and murky water, where tactile perception can compensate for the shortcomings of visual detection and visual-tactile fusion can improve classification accuracy.

Additional Materials

We further compare the effect of depth information on detection in undulating and underwater scenes, as shown in the following video. First, in undulating scenes, reflections and folds interfere with depth cameras, and the optical properties of transparent objects make it difficult for depth cameras to measure their depth accurately. Second, in underwater scenes, transparent objects are almost indistinguishable from the water in depth images, and the many reflections on the water surface under illumination strongly disturb detection. Therefore, in addition to the weak appearance of transparent objects themselves, interference from the environmental background is a tricky problem, which is one of the reasons why we use RGB images rather than depth images for transparent object grasping position detection in complex backgrounds.

S5.Experiment on detecting transparent objects using depth camera.mp4

S5.Experiment on detecting transparent objects using depth camera

Please see the paper for more details.