Tutian Tang*, Minghao Liu*, Wenqiang Xu, Cewu Lu
Our method performs hand-eye calibration in both eye-in-hand and eye-on-base settings with only two basic prerequisites: the robot's kinematic chain and a predefined reference point on the robot. There is no need to prepare a calibration board or train any networks.
Hand-eye calibration aims to estimate the transformation between a camera and a robot. Traditional methods rely on fiducial markers, which require considerable manual effort and precise setup. Recent advances in deep learning have introduced markerless techniques, but these come with additional prerequisites, such as retraining networks for each robot and accessing accurate mesh models for data generation. In this paper, we propose Kalib, an automatic and easy-to-set-up hand-eye calibration method that leverages the generalizability of visual foundation models to overcome these challenges. It requires only two basic prerequisites: the robot's kinematic chain and a predefined reference point on the robot. During calibration, the reference point is tracked in camera space, and its corresponding 3D coordinates in the robot frame are inferred by forward kinematics. Then, a PnP solver directly estimates the transformation between the camera and the robot without training new networks or accessing mesh models. Evaluations on simulated and real-world benchmarks show that Kalib achieves good accuracy with a lower manual workload than recent baseline methods. We also demonstrate its application in multiple real-world settings with various robot arms and grippers. Kalib's user-friendly design and minimal setup requirements make it a practical solution for continuous operation in unstructured environments.
The pipeline starts by defining a reference point on the kinematic chain. The reference point tracking module tracks its 2D position in the image frame, while its 3D coordinates in the robot frame are derived by forward kinematics. A frame synchronization mechanism is introduced to balance precision and efficiency. Finally, the PnP module estimates the camera-to-robot transformation matrix, either for the eye-in-hand setting or for the eye-on-base setting.
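At its core, this final step is an ordinary Perspective-n-Point problem: the tracked 2D pixel positions of the reference point are paired with its forward-kinematics 3D coordinates in the robot base frame, and the rigid transform between camera and robot is recovered. The sketch below illustrates this step for the eye-on-base case using OpenCV's solvePnP; the function name `calibrate_eye_on_base` and the way the 2D/3D pairs are collected are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the PnP step (eye-on-base case), assuming OpenCV and a
# pinhole camera. The 2D points come from the tracking module and the 3D points
# from forward kinematics; how they are collected is omitted here.
import cv2
import numpy as np

def calibrate_eye_on_base(points_2d, points_3d, K, dist_coeffs=None):
    """Estimate the camera pose in the robot base frame from synced 2D/3D pairs.

    points_2d: (N, 2) pixel positions of the tracked reference point.
    points_3d: (N, 3) coordinates of the same point in the robot base frame,
               obtained by forward kinematics at the synchronized timestamps.
    K:         (3, 3) camera intrinsic matrix.
    """
    pts_2d = np.asarray(points_2d, dtype=np.float64).reshape(-1, 1, 2)
    pts_3d = np.asarray(points_3d, dtype=np.float64).reshape(-1, 1, 3)
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)

    # solvePnP returns the robot-base frame expressed in the camera frame.
    ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, dist_coeffs,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed; collect more synchronized frames.")
    # Optional Levenberg-Marquardt refinement of the initial estimate.
    rvec, tvec = cv2.solvePnPRefineLM(pts_3d, pts_2d, K, dist_coeffs, rvec, tvec)

    R, _ = cv2.Rodrigues(rvec)
    T_cam_base = np.eye(4)                # maps robot-base coords -> camera coords
    T_cam_base[:3, :3] = R
    T_cam_base[:3, 3] = tvec.ravel()
    return np.linalg.inv(T_cam_base)      # T_base_cam: camera pose in the base frame
```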
For each type, we select one representative method for comparison.
Markerless: We mark ticks if no fiducial marker is required.
Setting: Whether the method works under the eye-in-hand setting, the eye-on-base setting, or both.
Prerequisites: Kinematics means the kinematic model is required, which is the basic requirement for calibration. Mesh model means a precise mesh model is also required.
No-Training: We mark ticks if there is no need to re-train neural networks on a new setup.
Occlusion: Whether the method can work under occlusion. / means not applicable or not reported.
Workload: We compare the relative amount of manual work.
Bold items are considered superior to others.
The center point of the tool flange (e.g., ISO 9409-1-50-4-M6).
The center of the fingertips with the gripper closed.
The tip of a customized tool.
The cap on a Lumberg RKMV connector.
Qualitative results in the real world: We project robot masks obtained with our calibration onto the camera frame in red. The precise fit between the mask and the robot indicates an accurate calibration result. Our method works under various settings, for example: (a) when the background is noisy; (b) when traditional methods fail (indicated by the green masks), our method can serve as a post-hoc remedy thanks to its markerless nature; (c), (d) EasyHeC (blue masks) works well with a full view of the robot but may fail with a partial view, while our method works in both conditions.
A 7-DoF robotic arm deployed in a kitchen for fruit-peeling tasks, equipped with a custom end-effector. The setup includes a third-person EoB camera and an EiH camera, both RealSense D415. We demonstrate the use of Kalib to simultaneously calibrate both the EiH and EoB cameras. We rotate each joint along the kinematic chain within a certain range to ensure both informative movements and tracked point visibility.
Note: for clear illustration, we only show synced frames in EiH and EoB camera views.
Eye-in-Hand Camera View
Eye-on-Base Camera View
Third-Person Camera View (~10x speed-up)
A mobile robot base featuring two xArm 6 robotic arms, an egocentric EoB Femto Bolt camera, and two 6-DoF force sensors. Both arms are calibrated simultaneously. We rotate each joint along the kinematic chain within a certain range to ensure both informative movements and tracked point visibility.
Note: for clear illustration, we only show synced frames in EoB camera views.
Egocentric (Eye-on-Base) Camera View
Third-Person Camera View (~10x speed-up)
A Shadow Dexterous Hand mounted on a UR10 robotic arm, with a third-person EoB Azure Kinect camera and a specialized tactile glove worn on the hand. We rotate each joint along the kinematic chain within a certain range to ensure both informative movements and tracked point visibility.
Note: for clear illustration, we only show synced frames in EoB camera views.
Eye-on-Base Camera View
Third-Person Camera View (~30x speed-up)
A Franka Emika robot, with a third-person EoB RealSense D415 camera. The end-effector's positions are sampled randomly in the workspace, and the robot is moved to them using inverse kinematics.
Note: for clear illustration, we only show synced frames in EoB camera views.
Eye-on-Base Camera View
Third-Person Camera View (~10x speed-up)
A carefully printed ChArUco board appears at random locations in the camera view. After estimating the transformation between the camera and the board, the two arms simultaneously point to two corners of the board.
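For reference, a minimal sketch of how such a verification could be scripted is shown below. It assumes the pre-4.7 `cv2.aruco` API from opencv-contrib-python (newer OpenCV versions expose an equivalent `CharucoDetector` class), an illustrative board geometry, and the `T_base_cam` transform produced by the calibration sketch above; it is not the authors' exact verification code.

```python
# Sketch of the dual-arm verification, assuming the pre-4.7 cv2.aruco API
# (opencv-contrib-python) and an illustrative 5x7 board with 30 mm squares and
# 22 mm markers. T_base_cam comes from the calibration sketch above.
import cv2
import numpy as np

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_100)
board = cv2.aruco.CharucoBoard_create(5, 7, 0.030, 0.022, dictionary)

def board_corner_in_base_frame(image, K, dist, T_base_cam, corner_xyz_board):
    """Express a board corner (given in the board frame, meters) in the robot base frame."""
    marker_corners, ids, _ = cv2.aruco.detectMarkers(image, dictionary)
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(
        marker_corners, ids, image, board)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, K, dist, None, None)
    if not ok:
        raise RuntimeError("Board pose estimation failed.")
    R, _ = cv2.Rodrigues(rvec)
    p_cam = R @ np.asarray(corner_xyz_board, dtype=np.float64) + tvec.ravel()
    p_base = T_base_cam[:3, :3] @ p_cam + T_base_cam[:3, 3]
    return p_base   # target position to feed to each arm's inverse kinematics
```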
We analyze the influence of the number of synchronized frames (i.e. pairs of corresponding points over time) on the calibration results in simulation. Since each frame can be considered ideally synchronized in the simulator, we can simply sample different numbers of corresponding points from a long sequence. Estimated camera poses from sampled points can be compared with the ground truth. We repeat the experiment on 10 random scenes.
The left figure shows the mean error against the number of frames. The x-axis starts at 4 because with 3 pairs of points, although theoretically feasible, the PnP module fails to give reasonable results. The mean error quickly converges to below 0.1 cm (translation) or 1 degree (rotation) after 10 frames, and gradually improves with more frames. Since PnP algorithms are known to be sensitive to noise, more frames are always welcome.
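A possible way to reproduce this sweep, assuming ideal correspondences exported from the simulator as numpy arrays and the hypothetical `calibrate_eye_on_base` helper sketched earlier, is:

```python
# Sketch of the frame-count sweep, assuming ideal 2D/3D correspondences exported
# from the simulator (pixels / meters) and the calibrate_eye_on_base() helper
# sketched earlier. T_gt is the ground-truth 4x4 camera pose in the base frame.
import numpy as np

def pose_error(T_est, T_gt):
    """Return (translation error in cm, rotation error in degrees)."""
    dt_cm = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3]) * 100.0
    R_delta = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return dt_cm, np.degrees(np.arccos(cos_angle))

def sweep_num_frames(points_2d, points_3d, K, T_gt,
                     counts=(4, 6, 8, 10, 20, 50, 100), trials=10, seed=0):
    """points_2d: (N, 2) array, points_3d: (N, 3) array of ideal correspondences."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in counts:
        errs = []
        for _ in range(trials):
            idx = rng.choice(len(points_2d), size=n, replace=False)
            T_est = calibrate_eye_on_base(points_2d[idx], points_3d[idx], K)
            errs.append(pose_error(T_est, T_gt))
        results[n] = np.mean(errs, axis=0)   # (mean cm error, mean degree error)
    return results
```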
Recall that our method tracks a reference point on the robot's shell rather than internal structural keypoints (i.e., joints), since the projections of internal joints onto the robot's surface may shift with different movements, causing unreliable tracking results. To verify this, we use the tracking module to track the structural keypoints, namely Joint 0 to Joint 6, over the 300 frames in the simulation environment, and compare their tracking errors with that of the surface reference point. As shown in the left figure, the reference point indeed achieves the lowest and most stable tracking error, confirming our choice of the surface reference point over the structural keypoints.
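The tracking error here is simply the per-frame pixel distance between the tracked position and the simulator's ground-truth projection of the same point; a minimal sketch of this metric, with a hypothetical `trajectories` dictionary of candidate points, is:

```python
# Sketch of the tracking-error metric: per-frame pixel distance between the
# tracked trajectory and the simulator's ground-truth projection of that point.
import numpy as np

def tracking_error(tracked_px, gt_px):
    """tracked_px, gt_px: (T, 2) pixel trajectories over T frames."""
    err = np.linalg.norm(np.asarray(tracked_px) - np.asarray(gt_px), axis=1)
    return err.mean(), err.std()

# for name, (tracked, gt) in trajectories.items():  # e.g. "joint_0" ... "reference_point"
#     print(name, tracking_error(tracked, gt))
```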
We analyze the sensitivity of the PnP module.
In the simulation environment, we first derive the ground-truth 2D positions and 3D coordinates of the reference point.
We then add random Gaussian noise to the ground-truth 2D positions, with zero mean and standard deviations of 2, 4, 6, ... px, to see how the accuracy is affected by the noise. In the left figure, the mean error stays below 1 cm when the standard deviation is less than 10 px. Considering that the mean tracking distance error stays within several pixels (reported in the tracking analysis above), we can conclude that the PnP algorithm can handle the noise from the tracking module in our setting.
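A minimal sketch of this sensitivity test, reusing the hypothetical `calibrate_eye_on_base` and `pose_error` helpers from the earlier sketches, could look like:

```python
# Sketch of the noise-sensitivity test: perturb ideal 2D observations with
# Gaussian pixel noise, re-run the PnP step, and measure the resulting pose error.
import numpy as np

def pnp_noise_sweep(points_2d_gt, points_3d_gt, K, T_gt,
                    sigmas=(2, 4, 6, 8, 10), trials=20, seed=0):
    """points_2d_gt: (N, 2) ideal pixels; points_3d_gt: (N, 3) robot-frame coords."""
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in sigmas:
        errs = []
        for _ in range(trials):
            noisy_2d = points_2d_gt + rng.normal(0.0, sigma, size=points_2d_gt.shape)
            T_est = calibrate_eye_on_base(noisy_2d, points_3d_gt, K)
            errs.append(pose_error(T_est, T_gt))
        results[sigma] = np.mean(errs, axis=0)   # (mean cm error, mean degree error)
    return results
```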