Yoojin Oh, Tim Schäfer, Benedikt Rüther, Marc Toussaint and Jim Mainprice
While intuitive, using a hand gesture user interface to teleoperate a robot is challenging due to the mismatch between the hand and the robot kinematics. In this work, we propose a complete system to address this problem. The core of the approach relies on traded control instead of direct control: we use hand gestures to specify the goals of a sequential manipulation task, and the robot then autonomously generates grasping or retrieving motions using trajectory optimization. Implementing traded control requires a semantic decomposition of the workspace in terms of which objects are present and where they are located. To this end, our system relies on Mask R-CNN, a state-of-the-art method for identifying and segmenting the objects in the camera images, and on the model-based tracker DBOT, which uses the complete mesh models of the objects to precisely track their 3D poses. We additionally propose to identify the user's intended goal from hand gestures by training a multi-layer perceptron classifier. After presenting all the components of the system and their empirical evaluation, we report experimental results comparing our pipeline to a direct traded control approach (i.e., one that does not use prediction), showing that intent prediction reduces the overall task execution time.
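As an illustration of the intent-prediction component, the sketch below shows how a small multi-layer perceptron classifier could map hand-gesture features to an intended goal object. The feature layout (palm position and pointing direction), the network size, and the use of scikit-learn's MLPClassifier are illustrative assumptions, not the paper's exact implementation.

```python
"""Minimal sketch: goal-intent classification from hand-gesture features.

Assumes hand features are extracted upstream by the gesture interface;
the feature layout and labels below are hypothetical placeholders.
"""
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical training data: each sample is a hand-feature vector
# (palm position xyz + pointing direction xyz); the label is the index
# of the object the user intended to reach for.
n_samples, n_objects = 600, 4
X = rng.normal(size=(n_samples, 6))
y = rng.integers(0, n_objects, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Small multi-layer perceptron mapping hand features to a goal object.
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# At runtime, the predicted goal (with its probability) would be handed
# to the trajectory optimizer to generate the grasp or retrieve motion.
probs = clf.predict_proba(X_test[:1])
print("predicted goal:", int(np.argmax(probs)), "confidence:", float(probs.max()))
```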