Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach [arXiv, Appendix (with code)]
Our paper has been accepted to ECCV 2022
Instance segmentation of unseen objects is a challenging problem in unstructured environments. To solve this problem, we propose a robot learning approach that actively interacts with novel objects and collects training labels for each object, which are then used to fine-tune and improve the segmentation model while avoiding the time-consuming process of manually labeling a dataset. Given a cluttered pile of objects, our approach selects pushing and grasping motions to break up the clutter and performs object-agnostic grasping; the Singulation-and-Grasping (SaG) policy takes the visual observations and an imperfect segmentation as input. We decompose the problem into three subtasks: (1) the object singulation subtask separates the objects from one another, creating space that eases (2) the collision-free grasping subtask; (3) the mask generation subtask produces self-labeled ground truth masks for transfer learning, using an optical flow-based binary classifier and motion-cue post-processing. Our system achieves a 70% singulation success rate in simulated cluttered scenes. Our interactive segmentation achieves 87.8%, 73.9%, and 69.3% average precision on toy blocks, YCB objects in simulation, and real-world novel objects, respectively, outperforming the compared baselines.
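For intuition on how the policy chooses between pushing and grasping, below is a minimal sketch of action selection over pixel-wise Q-maps, one map per motion primitive. The array names (`q_push`, `q_grasp`) and the map resolution are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def select_action(q_push: np.ndarray, q_grasp: np.ndarray):
    """Pick the motion primitive and pixel with the highest predicted Q-value.

    q_push, q_grasp: HxW Q-value maps, one per primitive (illustrative names).
    Returns the primitive name and the (row, col) pixel at which to execute it.
    """
    # Best pixel for each primitive.
    best_push = np.unravel_index(np.argmax(q_push), q_push.shape)
    best_grasp = np.unravel_index(np.argmax(q_grasp), q_grasp.shape)

    # Choose the primitive whose best pixel has the larger Q-value.
    if q_push[best_push] >= q_grasp[best_grasp]:
        return "push", best_push
    return "grasp", best_grasp

# Example with random Q-maps as stand-ins for the network's outputs.
rng = np.random.default_rng(0)
primitive, pixel = select_action(rng.random((224, 224)), rng.random((224, 224)))
print(primitive, pixel)
```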
V-REP simulation singulation demo
Real-robot object singulation demo
Fig. 1. Overview. The robot agent learns a singulation-and-grasping policy via deep Q-learning in simulation. We collect the RGB images before and after applying the actions and use coherent motion cues to create pseudo ground truth masks for segmentation transfer learning.
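As a rough illustration of the coherent-motion idea, the sketch below derives a pseudo ground truth mask from the RGB images captured before and after an action using dense optical flow. A simple flow-magnitude threshold and morphological clean-up stand in for the paper's learned binary classifier and post-processing; the threshold and kernel size are assumptions.

```python
import cv2
import numpy as np

def pseudo_mask_from_motion(img_before, img_after, mag_thresh=2.0):
    """Label the pixels that moved coherently between two frames (illustrative)."""
    prev_gray = cv2.cvtColor(img_before, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(img_after, cv2.COLOR_BGR2GRAY)

    # Dense optical flow between the before/after observations.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    # Flow-magnitude threshold: stand-in for the learned binary classifier.
    mask = (magnitude > mag_thresh).astype(np.uint8)

    # Motion-cue post-processing: remove speckle and fill small holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask  # binary pseudo ground truth mask for the moved object
```

The morphological opening/closing keeps only coherently moving regions, which is most reliable when a single object moves between the two frames.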
Fig. 2. The interactive data collection pipeline. The deep Q-network takes the state representation s_t as input, which consists of orthographically projected RGB-D images and image segmentation masks. The initially cluttered objects are singulated and grasped via the well-trained SaG policy. Both the interaction scenes and the task-relevant features are recorded, focusing on single-object motion scenarios to provide accurate object annotations.
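To make the state representation concrete, here is a hedged sketch of assembling s_t from an orthographically projected RGB-D observation plus a segmentation-mask channel and feeding it to a small fully convolutional Q-network with one output map per primitive. The architecture, channel counts, and tensor sizes are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class PixelwiseQNet(nn.Module):
    """Tiny fully convolutional Q-network: one Q-map per motion primitive (sketch)."""
    def __init__(self, in_channels=5, num_primitives=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_primitives, 1),  # per-pixel Q-values: push, grasp
        )

    def forward(self, state):
        return self.net(state)

# Assemble s_t: color heightmap (3), depth heightmap (1), segmentation mask (1).
color_heightmap = torch.rand(1, 3, 224, 224)   # orthographic RGB projection
depth_heightmap = torch.rand(1, 1, 224, 224)   # orthographic depth projection
segmentation    = torch.rand(1, 1, 224, 224)   # (imperfect) instance mask channel
state = torch.cat([color_heightmap, depth_heightmap, segmentation], dim=1)

q_maps = PixelwiseQNet()(state)                # shape: (1, 2, 224, 224)
print(q_maps.shape)
```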