Synthetic data enables faster annotation and robust segmentation for multi-object grasping in clutter

Dongmyoung Lee, Wei Chen, and Nicolas Rojas

 REDS Lab, Dyson School of Design Engineering, Imperial College London 

ABSTRACT

Object recognition and object pose estimation in robotic grasping continue to be significant challenges, since building a labelled dataset can be time-consuming and financially costly in terms of data collection and annotation.

In this work, we propose a synthetic data generation method that minimizes human intervention and makes downstream image segmentation algorithms more robust by combining a generated synthetic dataset with a smaller real-world dataset (hybrid dataset). Annotation experiments show that the proposed synthetic scene generation can dramatically reduce labelling time.

An RGB image segmentation network is trained on the hybrid dataset and combined with depth information to produce pixel-to-point correspondences for individual segmented objects. The object to grasp is then determined by the confidence score of the segmentation algorithm. Pick-and-place experiments demonstrate that segmentation trained on our hybrid dataset (98.9% labelling, 70% grasping success rate) outperforms segmentation trained on the real dataset and on a publicly available dataset by (6.7%, 18.8%) and (2.8%, 10%), respectively.

CHALLENGES OF MULTI-OBJECT GRASPING 

(1) It is difficult to gather raw images covering a wide range of settings and circumstances

(2) Obtaining pixel-level labels makes data preparation even more expensive


OVERALL PROCEDURE FOR MULTI-OBJECT GRASPING

The overall procedure for multi-object grasping consists of: a synthetic scene generation algorithm that produces a table-top synthetic dataset; an instance segmentation algorithm trained on this synthetic data together with a small number of real-world images; and real-world pick-and-place experiments demonstrating that segmentation trained on our hybrid dataset outperforms segmentation trained on the real-world dataset and on the publicly available Fruit-360 dataset in terms of labelling and grasping success rate.

OBJECT-WISE IMAGE GENERATION AND SELF-ANNOTATION METHOD

First, object-wise image generation is conducted using WGAN-GP. The WGAN-GP model is trained for 5,000 epochs on the Fruit-360 dataset and used to generate 10,000 images for each fruit category. Then, annotation information for each object is acquired by omitting a range of white values, since the background of each generated object-wise image is white. Finally, table-top synthetic scenes are produced by placing these labelled object images randomly onto pre-defined background images.
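A minimal sketch of this self-annotation and scene composition step is shown below, assuming generated object images with near-white backgrounds; the threshold value and the helper names (self_annotate, paste_object) are illustrative, not taken from the paper.

```python
import numpy as np
from PIL import Image

# Assumed threshold: pixels whose RGB channels all exceed this value are treated
# as white background; the exact range used in the paper may differ.
WHITE_THRESHOLD = 240

def self_annotate(object_image_path):
    """Return the RGB image and a binary foreground mask for a generated object image."""
    rgb = np.array(Image.open(object_image_path).convert("RGB"))
    # Background pixels are (near-)white, so any pixel with all channels above
    # the threshold is labelled as background; the remaining pixels form the object mask.
    background = np.all(rgb > WHITE_THRESHOLD, axis=-1)
    return rgb, ~background

def paste_object(scene, scene_labels, rgb, mask, top_left, label):
    """Paste a labelled object into a background scene (assumes it fits inside the scene)."""
    y, x = top_left
    h, w = mask.shape
    region = scene[y:y + h, x:x + w]
    region[mask] = rgb[mask]                       # copy only foreground pixels
    scene_labels[y:y + h, x:x + w][mask] = label   # record per-pixel class label
    return scene, scene_labels
```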

THE PROPOSED SYNTHETIC DATASET GENERATOR

The advantage of the self-annotated synthetic dataset generator can be demonstrated by comparing the elapsed time needed to label the training dataset between the proposed algorithm and baseline methods. As baselines, two human-involved annotation methods are considered: manual human annotation and click-based interactive segmentation. The total annotation time of the proposed method is estimated from the elapsed times of the self-annotation and scene production algorithms.

Click-based interactive segmentation substantially accelerates annotation compared with manual annotation, particularly for uncluttered objects; however, labelling of highly cluttered objects is relatively slow, since multiple same-type instances are sometimes recognized as a single instance.

Based on these results, the overall dataset preparation time is estimated as a function of the number of 10-fruit scenes, which are used as the actual training dataset for the instance segmentation algorithm. This result shows that the proposed synthetic dataset generator can significantly reduce not only the human-annotation time but also the absolute wall-clock dataset preparation time when a large training dataset becomes necessary.

INSTANCE SEGMENTATION ALGORITHM TRAINED ON PROPOSED DATASET

A PyTorch-based implementation of Mask R-CNN is employed with a ResNet-50-FPN backbone to improve the performance of the instance segmentation algorithm. To overcome the limitations of training the network only on synthetic data, a hybrid dataset consisting of synthetic and real-world scenes is proposed.
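As a minimal sketch, assuming the standard torchvision implementation (the text states only that a PyTorch Mask R-CNN with a ResNet-50-FPN backbone is used), the network can be instantiated as follows; the class count (fruit categories plus background) and the use of COCO-pretrained weights are assumptions.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_maskrcnn(num_classes):
    """Mask R-CNN with a ResNet-50-FPN backbone, heads resized to num_classes."""
    # Start from a COCO-pretrained model (an assumption; only the backbone is stated).
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # Replace the box classification head with one matching the fruit categories.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace the mask prediction head accordingly.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model

# Example: five fruit categories plus background (the exact count is an assumption).
model = build_maskrcnn(num_classes=6)
```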

We found the best networks trained on hybrid data, both for the synthetic and Fruit-360 datasets, in terms of AP and AR for different numbers of real images (50, 100, 150, and 200). The optimal network for each number of real images is found by comparing the performance of networks trained on hybrid data across a range of synthetic image counts (50, 100, 150, 200, 300, and 400).

The network trained with synthetic and real-world data (Gen-hybrid) outperforms not only the real-only network but also the network trained on Fruit-360 and real-world input (CP-hybrid). The improvement of the Gen-hybrid network is most pronounced when the number of real images is limited (i.e., 50).

TARGET SELECTION AND GRASPING POINT SELECTION METHOD

The most graspable object should be chosen in highly cluttered scenarios before finding the optimal grasping point. After applying the instance segmentation algorithm, graspable objects are detected based on the Mask R-CNN confidence score, which represents the likelihood that the prediction of the instance segmentation algorithm is correct.
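A minimal sketch of this target-selection step is given below, assuming the standard torchvision detection output format (a dict with 'scores', 'labels', and 'masks'); the score threshold is illustrative.

```python
import torch

def select_target(prediction, score_threshold=0.9):
    """Pick the most graspable instance: the detection with the highest
    Mask R-CNN confidence score above a threshold.

    `prediction` follows the torchvision detection output format:
    a dict with 'scores', 'labels', and 'masks' tensors.
    """
    scores = prediction["scores"]
    keep = scores >= score_threshold
    if not torch.any(keep):
        return None  # nothing confidently segmented in the scene
    best = torch.argmax(scores * keep)                 # highest score among kept detections
    target_mask = prediction["masks"][best, 0] > 0.5   # binarise the soft mask
    target_label = int(prediction["labels"][best])
    return target_label, target_mask
```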

Then, a geometry-based grasping method is applied to acquire the optimal grasping point. A perpendicular approach of the end-effector is required, since other items can collide with the suction gripper during the grasping procedure. Furthermore, the grasping point should be placed adjacent to the centroid of the target object. In this case, the 2D x-y centroid is considered, since the end-effector approaches the target instance vertically.

The 2D centroid of a target object is estimated as the average x-y position of the object's points in the point cloud map. Then, a normal vector is calculated for each point within a region of radius R = 10 mm centred on this average x-y position. Among these candidates near the 2D centroid, the point whose normal vector is closest to the perpendicular unit vector in the world coordinate system is chosen as the optimal grasping point.
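The sketch below illustrates one possible realisation of this selection in NumPy, assuming the segmented object's points are already expressed in the world frame as an (N, 3) array; the PCA-based local normal estimation and the neighbourhood radius are assumptions, not details from the paper.

```python
import numpy as np

def select_grasping_point(object_points, radius=0.01, neighbourhood=0.005):
    """Choose the grasp point whose surface normal is closest to vertical.

    object_points: (N, 3) array of the segmented object's points in the world
    frame (metres). radius (R = 10 mm) limits candidates to a region around the
    2D x-y centroid; neighbourhood is the radius used for local normal
    estimation (an assumed value).
    """
    centroid_xy = object_points[:, :2].mean(axis=0)              # 2D x-y centroid
    dist_xy = np.linalg.norm(object_points[:, :2] - centroid_xy, axis=1)
    candidates = np.where(dist_xy <= radius)[0]

    up = np.array([0.0, 0.0, 1.0])                               # perpendicular unit vector
    best_idx, best_alignment = None, -np.inf
    for i in candidates:
        # Estimate the local surface normal by PCA over nearby points.
        nearby = object_points[
            np.linalg.norm(object_points - object_points[i], axis=1) <= neighbourhood]
        if len(nearby) < 3:
            continue
        _, _, vt = np.linalg.svd(nearby - nearby.mean(axis=0))
        normal = vt[-1]                                          # smallest-variance direction
        alignment = abs(normal @ up)                             # closeness to vertical
        if alignment > best_alignment:
            best_idx, best_alignment = i, alignment
    return object_points[best_idx] if best_idx is not None else None
```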

REAL-WORLD PICK-AND-PLACE DEMONSTRATION

Pick-and-place experiments in cluttered settings are carried out to demonstrate the accuracy of (i) the extracted label of the target object (labelling) and (ii) the segmented pixels of the target object (grasping). The objective of the pick-and-place operation is to identify and pick up the most easily graspable object and correctly place it into the corresponding labelled box for object sorting. A UR5 robotic arm and a customized suction gripper are used to perform the pick-and-place operation. To assess the robustness of the proposed methods, different instances of each fruit type are considered. For the experimental setup, highly cluttered scenarios are considered, with 12 stacked real-world fruits in a box. A grasp is counted as successful only if the object is grasped and held until the robotic arm reaches the target box, whereas labelling success is determined by the extracted class label of the most graspable object. Average labelling and grasping success rates of 98.9% and 70% are achieved, respectively.

DETAILED PRESENTATION VIDEO AND EXPERIMENTAL TRIALS