With the increasing interest in smart factories, many factories are adopting collaborative robots to maximize their productivity. Most collaborative robots are designed to detect collisions by measuring disturbance forces in their joints. For human workers, however, even a light touch from the robot is unpleasant. For this reason, human workers must constantly pay attention to the robot, which may cause productivity loss.
To maximize the performance of human-robot collaboration, it is desirable to install additional sensors that detect a possible collision before contact actually occurs. In this research, we propose a two-camera-based human and robot 3D pose estimation method that measures their joint positions in 3D coordinates. With this method, we can measure the distances between the two objects' joints and predict whether they are in danger of collision.
Human-Robot Collision Accident
In this section, we propose a novel algorithm that estimates a robot's joint angles from a single image. We call it Joint-PnP since the method originates from the PnP (Perspective-n-Point) mechanism. While PnP estimates the 6D pose of a rigid body from an image, Joint-PnP estimates an articulated robot's joint angles.
The Joint-PnP algorithm relies on the following assumptions:
(1) the robot's base is fixed at a known 3D position,
(2) the robot's kinematic information is known, and
(3) one or more calibrated cameras are available, with both intrinsic and extrinsic parameters known.
The Joint-PnP process is as follows:
(1) A deep neural network is trained to estimate the 2D keypoints of the robot joints from an image. The estimated keypoints serve as the target keypoints of the Joint-PnP algorithm.
(2) With the initial (or updated) joint angles, the 3D positions of all joints are computed through forward kinematics.
(3) With the camera model, the 3D positions are projected onto the image plane to obtain the estimated 2D joint keypoints.
(4) The difference between the target keypoints and the estimated keypoints is computed as the error.
(5) If the error is still too large, the Jacobian matrix is updated and, together with the keypoint error, used to compute the angular update that yields the new joint angles.
(6) Steps (2)-(5) are iterated until the error falls below the threshold.
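As an illustration of this loop, the sketch below implements a numerical Gauss-Newton style version of Joint-PnP. It assumes a hypothetical forward-kinematics function fk(theta) that returns the 3D joint positions for given joint angles, and it approximates the Jacobian by finite differences; it is a minimal sketch rather than the exact implementation used in this study. K, R, and t are the calibrated camera's intrinsic and extrinsic parameters from assumption (3).

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project Nx3 world points into the image with a pinhole camera model."""
    cam = R @ points_3d.T + t.reshape(3, 1)    # world frame -> camera frame
    uv = (K @ cam) / cam[2]                    # perspective division
    return uv[:2].T                            # Nx2 pixel coordinates

def joint_pnp(target_kps, theta0, fk, K, R, t, max_iters=50, tol=1.0, delta=1e-4):
    """Estimate joint angles whose projected joint keypoints match target_kps.

    fk(theta): hypothetical forward kinematics returning Nx3 joint positions.
    target_kps: Nx2 keypoints predicted by the deep neural network.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iters):
        base = project(fk(theta), K, R, t).ravel()
        residual = target_kps.ravel() - base
        if np.linalg.norm(residual) < tol:     # keypoint error below threshold
            break
        # Finite-difference Jacobian of the projected keypoints w.r.t. each joint angle
        J = np.zeros((residual.size, theta.size))
        for i in range(theta.size):
            perturbed = theta.copy()
            perturbed[i] += delta
            J[:, i] = (project(fk(perturbed), K, R, t).ravel() - base) / delta
        theta += np.linalg.pinv(J) @ residual  # Gauss-Newton style angle update
    return theta
```

The pseudo-inverse update is one standard way to resolve the over-determined system of 2N pixel equations versus n joint angles; a damped (Levenberg-Marquardt style) update would be an equally valid choice.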
Joint-PnP Mechanism
An Example of Joint-PnP Iterations to Find Joint Angles of a Robot from 2D Keypoints
To estimate the 3D pose of the robot in sequential images, we used a digital twin based closed-loop feedback network, shown below. The architecture consists of two modules: (1) a deep neural network module and (2) a digital twin module. In the deep neural network module, a belief map is estimated in which each channel encodes the position of a single joint keypoint. In the digital twin module, Joint-PnP converts the 2D keypoints into joint angles, and a Kalman filter then predicts the next state. The PyBullet-based digital twin robot is driven by the predicted joint angles. After projecting the 3D joint keypoints back into 2D, we obtain a new belief map for the next frame. This new belief map is fed back and stacked with the next input image as an image-belief-map stack.
Digital Twin based Closed-Loop Feedback Network for Sequential Robot Pose Estimation
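A per-frame sketch of this closed loop is shown below. The helper names are hypothetical: net(frame, prior) stands for the trained belief-map network, render_beliefmap draws peaks at given 2D keypoints, fk(theta) is the digital twin's forward kinematics, and joint_pnp/project come from the previous sketch. A simple constant-velocity prediction stands in for the Kalman filter's predict step.

```python
import numpy as np

def beliefmap_to_keypoints(beliefmap):
    """Per-channel argmax of a (J, H, W) belief map -> (J, 2) keypoints in (u, v)."""
    joints, h, w = beliefmap.shape
    flat = beliefmap.reshape(joints, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1).astype(float)

def track_robot_pose(frames, theta0, net, render_beliefmap, fk, K, R, t, dt=1.0):
    """Closed-loop sequential robot pose tracking (hypothetical helpers, see text)."""
    theta = np.asarray(theta0, dtype=float)
    theta_vel = np.zeros_like(theta)
    prior = None                               # no feedback for the first frame
    for frame in frames:
        # Deep neural network module: image (+ fed-back belief map) -> belief map
        beliefmap = net(frame, prior)
        target_kps = beliefmap_to_keypoints(beliefmap)

        # Digital twin module: Joint-PnP, then a constant-velocity predict step
        theta_new = joint_pnp(target_kps, theta, fk, K, R, t)
        theta_vel = (theta_new - theta) / dt
        theta = theta_new
        theta_pred = theta + theta_vel * dt    # predicted next-step joint angles

        # Drive the digital twin at the predicted angles, project its joints to 2D,
        # and render the new belief map stacked with the next input image
        prior = render_beliefmap(project(fk(theta_pred), K, R, t), beliefmap.shape[1:])
        yield theta
```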
The graph below shows the robot joint angle estimation results on sequential images. As shown, the robot joint angles are estimated well, with only small perturbations. The last joint angle is not estimated, since there is no further link beyond the end-effector from which its rotation could be detected. The root mean square error (RMSE) of the five estimated joint angles is 3.8 degrees.
Robot 3D Pose Estimation Result from Two-Image Pair
Robot Joint Angle Estimation Result in Sequential Images
For the human 3D pose, we cannot directly apply Joint-PnP since some human joints have more than one degree of freedom. Instead, we use a stereo-camera-based triangulation method to convert each 2D keypoint pair into a 3D point. The triangulation method is based on the Direct Linear Transform (DLT), which starts from the fact that the cross product of two parallel vectors (here, the observed image point and the projected 3D point) is the zero vector. The solution for the vector X (the 3D point) in the figure below is found by computing the Singular Value Decomposition (SVD) of the matrix A (SVD(A) = UDV') and selecting the last column of V as the solution.
Direct Linear Transform based Triangulation
Points on Image Planes
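As a sketch of this step, the DLT triangulation can be written as below, assuming the two 3x4 projection matrices P1 and P2 (intrinsics times extrinsics) are known from the camera calibration.

```python
import numpy as np

def triangulate_dlt(P1, P2, uv1, uv2):
    """Triangulate one 3D point from a matched 2D keypoint pair.

    P1, P2 : 3x4 camera projection matrices of the two views.
    uv1, uv2 : (u, v) pixel coordinates of the same point in each image.
    """
    u1, v1 = uv1
    u2, v2 = uv2
    # Each view contributes two rows of A from the constraint x x (P X) = 0
    A = np.stack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value,
    # i.e. the last column of V (the last row of V^T)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]    # dehomogenize to a 3D point
```

OpenCV's cv2.triangulatePoints implements the same computation and could be used instead.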
The overall procedure of 3D human pose estimation is shown in the flow chart below.
(1) Estimate 2D human keypoints with the Mask R-CNN based Detectron2 architecture.
(2) Convert the 2D keypoint pairs from the two images into 3D points using stereo triangulation.
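As an illustration of step (1), the 2D keypoints can be obtained with Detectron2's standard COCO keypoint R-CNN model; the config, weights, score threshold, and file name below are illustrative assumptions rather than the exact settings used in this study.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Standard COCO keypoint R-CNN config from the Detectron2 model zoo (illustrative choice)
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7    # assumed detection threshold
predictor = DefaultPredictor(cfg)

image = cv2.imread("left_camera_frame.png")    # hypothetical file name
outputs = predictor(image)
# (num_people, 17, 3) array: x, y, and score for the 17 COCO keypoints
keypoints_2d = outputs["instances"].pred_keypoints.cpu().numpy()
```

Running the same predictor on the second camera's image gives the matched keypoint pairs that are passed, per joint, to the triangulation function from the previous sketch.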
To verify the accuracy of the 3D human position, we performed an additional experiment comparing the positions of the human's left wrist and the robot's end-effector. For this test, a person gently placed his wrist on the robot end-effector and then simply followed the movement of the robot. The position comparison shows a difference of about 57 mm. Considering the offset between the wrist and the end-effector (about 30 mm), the actual error of the 3D human joint position is expected to be around 30 mm. This result shows that the stereo triangulation method is accurate enough for human-robot collision prediction.
Wrist and End-Effector 3D Position Accuracy Test
Wrist and End-Effector 3D Position Accuracy Result
Since both the robot and human 3D positions are estimated from the same camera images, they share the same coordinate system. Therefore, it is straightforward to plot both objects' 3D joint positions together. The figure below shows how the robot and human 3D poses are estimated in the same coordinate system. In this study, each pose estimation algorithm runs on a separate PC.
Human-Robot 3D Pose Estimation Procedure
With the estimated 3D poses of both the robot and the human, we can calculate the Euclidean distance between each pair of joints. In this case, we defined 30 cm as the threshold distance for a safety warning. The experiment result below shows that the person's right hand is too close to the robot's 3rd and 4th joints, which is indicated by coloring the neighboring links red.
Human-Robot Collision Detection Experiment
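A minimal sketch of this distance check, given the two joint position arrays in the shared coordinate frame, is shown below (the function and variable names are ours, not from the original implementation).

```python
import numpy as np

SAFETY_THRESHOLD = 0.30  # meters (30 cm warning distance)

def check_collision_risk(human_joints, robot_joints, threshold=SAFETY_THRESHOLD):
    """Flag human/robot joint pairs closer than the safety threshold.

    human_joints : (H, 3) array of human joint positions in the shared frame.
    robot_joints : (R, 3) array of robot joint positions in the same frame.
    Returns a list of (human_idx, robot_idx, distance) for all risky pairs.
    """
    diff = human_joints[:, None, :] - robot_joints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)           # (H, R) Euclidean distances
    risky = np.argwhere(dist < threshold)
    return [(int(h), int(r), float(dist[h, r])) for h, r in risky]
```

Any returned pair can then be used to color the corresponding robot link red in the visualization.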
In this research, we proposed a two-camera-based robot and human 3D pose estimation method. With a deep neural network followed by post-processing, we found each object's joint positions in 3D coordinates. For robot 3D pose estimation, we used an autoencoder-based deep neural network and the novel Joint-PnP algorithm, while for the human, we used a Mask R-CNN based deep neural network and the stereo triangulation method. After estimating each object's 3D pose, we could easily combine them in the same coordinate system to calculate the distances.
The processing time of each method is shown below. Robot and human pose estimation ran at 6.5 FPS and 13.6 FPS, respectively. We expect to meet the target estimation speed (10 FPS) by reducing the post-processing time of the robot pose estimation and by using a more lightweight deep neural network.
Processing Time of Robot and Human 3D Pose Estimation Algorithm