November 2018
We have developed a perception system for a robotic manipulator by implementing Mask R-CNN (Mask R-CNN has been made available open-source by Facebook AI Research and is available here). Mask R-CNN is a state-of-the-art machine learning algorithm for object recognition and instance segmentation, meaning it can identify the pixel-by-pixel location of each detected object. By training Mask R-CNN to recognize our objects of interest (i.e., the task object to be manipulated and the robotic gripper), we present a system that detects the pose of the task object, so the robot knows where and how to execute a grasp, and that determines whether or not the object has been grasped by the robot.
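To illustrate the pose-extraction step, the sketch below shows one reasonable way (not necessarily our exact code) to turn a single binary instance mask from Mask R-CNN into a planar grasp point and orientation, using image moments and PCA; the function name and this particular recipe are illustrative assumptions.

```python
import numpy as np

def grasp_pose_from_mask(mask):
    """Estimate a planar grasp pose (centroid + in-plane orientation)
    from a binary instance mask (H x W bool array), e.g. one slice of
    the masks output by Mask R-CNN. Illustrative sketch only."""
    ys, xs = np.nonzero(mask)                 # pixel coordinates of the object
    cx, cy = xs.mean(), ys.mean()             # centroid = candidate grasp point
    # PCA on the pixel coordinates gives the object's principal axis;
    # a parallel-jaw gripper is typically aligned perpendicular to it.
    pts = np.stack([xs - cx, ys - cy])
    cov = pts @ pts.T / pts.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]    # dominant direction of the mask
    theta = np.arctan2(major[1], major[0])    # in-plane orientation (radians)
    return cx, cy, theta
```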
A total of 80 instances were collected: 68 for training and 12 for validation (testing was done on real-time image captures). Images were manually collected and annotated using the VGG Image Annotator. Training on such a small dataset was made possible by transfer learning: instead of training the model from scratch, we started from a model pre-trained on the COCO dataset. Although COCO does not include our classes (gripper, grasp, cube, or cylinder), its many other classes mean the pre-trained weights have already learned common features shared across a wide variety of objects.
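For concreteness, transfer learning from COCO is often set up as in the sketch below; this assumes a Matterport-style Keras API for Mask R-CNN (an assumption on our part, as is every file path), with the class-specific output layers excluded from the weight load because their shapes depend on our class count.

```python
from mrcnn.config import Config
import mrcnn.model as modellib

class GraspConfig(Config):
    NAME = "grasp"
    NUM_CLASSES = 1 + 4      # background + gripper, grasp, cube, cylinder
    STEPS_PER_EPOCH = 100    # matches our training setup
    IMAGES_PER_GPU = 1

config = GraspConfig()
model = modellib.MaskRCNN(mode="training", config=config,
                          model_dir="logs/")           # hypothetical path
# Start from COCO weights, skipping the class-specific heads, which
# must be retrained from scratch for our four classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,  # hypothetical path
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
```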
Mask R-CNN was trained for 21 epochs at 100 training steps per epoch. Training was conducted on a CPU, resulting in an average training time of 31.8 minutes per epoch. The training performance is visualized and tabulated in Fig. 6 and Table 1, which show that training plateaued quickly after 10 epochs. The minimum validation loss, 0.2740, occurred at epoch 17, and that epoch's weights were used for our model. Note again that training was initialized from pre-trained weights, which made fast convergence on a small number of training instances possible.
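Continuing the sketch above (still assuming a Matterport-style API; the dataset objects and checkpoint path are hypothetical), training and then reloading the best checkpoint might look as follows:

```python
import skimage.io

# dataset_train / dataset_val are mrcnn.utils.Dataset subclasses built
# from our VGG Image Annotator exports (loading code omitted here).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=21, layers="heads")   # 21 epochs x 100 steps

# One checkpoint is written per epoch; we reload the one with the
# lowest validation loss (epoch 17, val loss 0.2740) for inference.
inference = modellib.MaskRCNN(mode="inference", config=config,
                              model_dir="logs/")
inference.load_weights("logs/grasp/mask_rcnn_grasp_0017.h5",  # hypothetical
                       by_name=True)

image = skimage.io.imread("frame.png")               # a real-time capture
results = inference.detect([image], verbose=0)[0]    # 'rois', 'masks', ...
```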
Our software, implementing automated robotic pick-and-place manipulation with grasp detection on the hardware platform, is organized as ROS packages written in Python. The high-level structure of the software is shown in the figure below. The /mask_rcnn node publishes the post-processed object classification and segmentation results to the /manipulation node. The /manipulation node then computes appropriate motions for the robot arm based on the visual sensor feedback and sends a command to the /move_group node. The /move_group node ultimately devises a trajectory for the UR10 robot arm driver to execute. A simple proportional controller was implemented in the software package.
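To make the control loop concrete, here is a minimal sketch of the kind of proportional controller the /manipulation node could implement; the topic name, message type, move group name, and gain are all illustrative assumptions, not the package's actual interface. Each detection steps the end effector a fraction of the remaining error toward the object, with /move_group handling the trajectory planning.

```python
#!/usr/bin/env python
import rospy
import moveit_commander
from geometry_msgs.msg import PointStamped

KP = 0.5  # proportional gain (illustrative value)

class ManipulationNode(object):
    def __init__(self):
        # "manipulator" is the usual MoveIt group name for a UR10.
        self.group = moveit_commander.MoveGroupCommander("manipulator")
        # Hypothetical topic: object centroid published by /mask_rcnn,
        # already transformed into the robot base frame.
        rospy.Subscriber("/mask_rcnn/object_centroid", PointStamped,
                         self.on_detection)

    def on_detection(self, msg):
        # Proportional control: move a fraction KP of the remaining
        # error toward the detected object on each detection cycle.
        pose = self.group.get_current_pose().pose
        pose.position.x += KP * (msg.point.x - pose.position.x)
        pose.position.y += KP * (msg.point.y - pose.position.y)
        pose.position.z += KP * (msg.point.z - pose.position.z)
        self.group.set_pose_target(pose)  # /move_group plans the trajectory
        self.group.go(wait=True)

if __name__ == "__main__":
    moveit_commander.roscpp_initialize([])
    rospy.init_node("manipulation")
    ManipulationNode()
    rospy.spin()
```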