Mohith Sakthivel, Ryan Lee, Akshay Venkatesh
For robots to perform visually guided tasks, objects localized in the camera frame have to be transformed into the frame of the robot. This transformation is obtained through the hand-eye calibration procedure, which estimates the relative transformation between the camera and a frame interpretable by the robot (usually the base frame for a static camera, or the end-effector frame for a camera mounted on the robot arm). In this project, we focus on the case where the camera is mounted on the robot arm.
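For concreteness, the classical eye-in-hand formulation below states what is being estimated; the notation (X for the unknown hand-eye transform, A and B for a pair of relative motions) is ours and is used here only for illustration.

```latex
% For every pair of robot poses indexed by i:
%   A_i : relative end-effector motion, known from the robot's forward kinematics
%   B_i : relative camera motion between the same two poses, observed through the images
%   X   : unknown hand-eye transform
\[
  A_i X = X B_i \qquad \forall i, \qquad A_i,\, B_i,\, X \in \mathrm{SE}(3)
\]
```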
We are entering an era where robots are becoming everyday machines. Hence, there is immense value in replacing the traditional calibration routines that require specialized tools and technical expertise with procedures that can be performed in the wild by an unskilled end-user. Also, for field robotic systems that operate in demanding environments, there is a need to continuously refine the hand-eye transformation estimates after a prolonged period of operation or after accidents that cause misalignments.
Hence, in our project, we attempt to develop an end-to-end deep learning based method that allows robots with hand-mounted cameras to perform hand-eye calibration in completely unstructured environments.
There is very little prior work on performing hand-eye calibration without using an object of known geometry. The Structure-from-Motion (SfM) pipeline used by the current state-of-the-art method for this task is shown below. This method was evaluated on synthetic and real data and reports errors of less than 2 millimeters.
SfM pipeline for performing hand-eye calibration without using any object of known geometry [Zhi et al., IROS 2017]
As opposed to using a traditional pipeline that employs feature extraction and matching followed by bundle adjustment, we propose an end-to-end learning-based algorithm. Recent works have identified several advantages of deep learning based methods over SfM methods for multi-view geometry problems. Some of the major ones are:
Learning based methods work better in textureless regions
Priors can be learned from data to help in degenerate cases
Low-level image descriptors in classical methods are more prone to outliers
In order to determine the hand-eye transformation, the network has to reason about different pairs of images and the associated robot end-effector displacement between them. The final hand-eye transformation has to be in agreement with all the available image pairs. Hence, we formulate the problem as a graph and predict the hand-eye transformation using a Graph Neural Network (GNN).
The image or image features form the nodes of the graph while the relative robot end-effector displacements are encoded into the edges of the graph. We then perform several iterations of neural message passing to allow information to flow between the nodes. After the message passing step, we aggregate the information stored in all the edges of the graph using an attention mechanism to predict the hand-eye calibration parameters.
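As a rough sketch of this formulation (tensor names, shapes, and the encoding of the displacement are illustrative assumptions, not our exact implementation), the fully connected directed graph can be assembled with PyTorch Geometric as follows:

```python
import torch
from torch_geometric.data import Data

def build_graph(node_feats, rel_ee_poses):
    """Assemble a fully connected, directed graph over the captured views.

    node_feats:   (N, C, H, W) per-image features, one node per view.
    rel_ee_poses: dict mapping (i, j) -> relative end-effector displacement
                  between views i and j, as a flat tensor of dimension D
                  (the exact pose representation is an illustrative choice).
    """
    n = node_feats.shape[0]
    src, dst, edge_attr = [], [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            src.append(i)
            dst.append(j)
            edge_attr.append(rel_ee_poses[(i, j)])
    edge_index = torch.tensor([src, dst], dtype=torch.long)  # (2, N*(N-1))
    edge_attr = torch.stack(edge_attr)                        # (N*(N-1), D)
    return Data(x=node_feats, edge_index=edge_index, edge_attr=edge_attr)
```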
In order to provide additional supervision for the network to learn, we also learn to regress the relative camera motion between two neighboring nodes by using the edge features. The intuition is that by learning to predict relative camera pose the network would be forced to encode information useful for predicting the hand-eye transformation.
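A minimal sketch of the resulting training objective, assuming a translation-plus-quaternion pose representation, an L1 pose loss, and an illustrative auxiliary weight (all assumptions of this sketch):

```python
import torch

def training_loss(handeye_pred, handeye_gt, rel_cam_pred, rel_cam_gt, aux_weight=1.0):
    """Hand-eye loss plus auxiliary relative-camera-pose loss (sketch).

    handeye_*: (7,) translation + quaternion for the hand-eye transform.
    rel_cam_*: (E, 7) per-edge relative camera poses used as auxiliary supervision.
    """
    def pose_l1(pred, gt):
        t_err = torch.abs(pred[..., :3] - gt[..., :3]).sum(dim=-1)  # translation term
        q_err = torch.abs(pred[..., 3:] - gt[..., 3:]).sum(dim=-1)  # rotation (quaternion) term
        return t_err + q_err

    main = pose_l1(handeye_pred, handeye_gt)
    aux = pose_l1(rel_cam_pred, rel_cam_gt).mean()  # averaged over all edges
    return main + aux_weight * aux
```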
To the best of our knowledge, no other deep learning based method has been proposed to solve this problem. Hence, we build on ideas from related visual recognition problems in the SLAM community to develop our model architecture. Our implementation borrows heavily from the relative pose regression network of [Turkoglu et al., 3DV 2021]. However, we replace their MLP-based architecture with a fully convolutional one so that the spatial structure of the features is retained.
We use a ResNet-34 encoder as the feature extractor and take the features from its last convolutional layer. These features are processed by additional convolutional layers to generate the initial node features. The source node features, target node features, and the relative end-effector displacement between the two nodes are concatenated and fed into another convolutional network to generate the initial edge features. The graph is fully connected and directed, giving a total of N*(N-1) edges for N nodes.
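The sketch below illustrates this feature initialization; the channel counts, layer choices, and the 7-dimensional encoding of the end-effector displacement are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torchvision

class FeatureInit(nn.Module):
    """Compute initial node and edge features (layer sizes are illustrative)."""

    def __init__(self, ch=256, pose_dim=7):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)
        # Keep everything up to and including the last convolutional block.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.node_proj = nn.Conv2d(512, ch, kernel_size=1)
        self.edge_proj = nn.Conv2d(2 * ch + pose_dim, ch, kernel_size=1)

    def forward(self, images, edge_index, rel_ee):
        # images: (N, 3, H, W); rel_ee: (E, pose_dim) end-effector displacements.
        x = self.node_proj(self.encoder(images))                  # (N, ch, h, w) node features
        src, dst = edge_index
        h, w = x.shape[-2:]
        pose_map = rel_ee[:, :, None, None].expand(-1, -1, h, w)  # tile the pose over the map
        e = self.edge_proj(torch.cat([x[src], x[dst], pose_map], dim=1))
        return x, e                                               # initial node and edge features
```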
The message passing step starts with an auto-regressive update of each edge's features using the features of its two endpoint nodes. The updated edge features and the source node features are then used to generate the message along that edge. Non-local self-attention is used to decide which features of the message are important. The attended messages from the different source nodes are then aggregated using an average aggregation scheme. The aggregated message is passed through a CNN to auto-regressively update the target node. This message passing step is performed twice.
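A simplified sketch of one such round is shown below; the non-local self-attention over the messages is reduced here to a convolutional sigmoid gate for brevity, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One round of edge update -> message -> node update (simplified sketch)."""

    def __init__(self, ch):
        super().__init__()
        self.edge_net = nn.Conv2d(3 * ch, ch, kernel_size=3, padding=1)
        self.msg_net = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.attn = nn.Sequential(nn.Conv2d(ch, ch, kernel_size=1), nn.Sigmoid())
        self.node_net = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, x, edge_index, e):
        # x: (N, ch, h, w) node features, e: (E, ch, h, w) edge features.
        src, dst = edge_index
        # 1. Auto-regressive edge update from the two endpoint nodes.
        e = e + self.edge_net(torch.cat([x[src], x[dst], e], dim=1))
        # 2. Message along each edge from the updated edge and its source node.
        m = self.msg_net(torch.cat([x[src], e], dim=1))
        # 3. Gate the message (stand-in for the non-local self-attention).
        m = self.attn(m) * m
        # 4. Average-aggregate incoming messages per target node.
        agg = torch.zeros_like(x).index_add_(0, dst, m)
        counts = torch.bincount(dst, minlength=x.shape[0]).clamp(min=1)
        agg = agg / counts.view(-1, 1, 1, 1)
        # 5. Auto-regressive node update from the aggregated message.
        x = x + self.node_net(torch.cat([x, agg], dim=1))
        return x, e
```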
Finally, the edge features are used to predict the relative camera motion between the two nodes. All the edges in the graph are then aggregated using a self-attention mechanism to predict the hand-eye transformation.
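This readout can be sketched as follows, again with an assumed translation-plus-quaternion output parameterization and illustrative layer sizes:

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """Per-edge relative camera pose head and attention-pooled hand-eye head (sketch)."""

    def __init__(self, ch, pose_dim=7):  # 3 translation + 4 quaternion values (assumed)
        super().__init__()
        self.rel_pose_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                           nn.Linear(ch, pose_dim))
        self.edge_score = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(ch, 1))
        self.handeye_head = nn.Linear(ch, pose_dim)

    def forward(self, e):
        # e: (E, ch, h, w) edge features after message passing.
        rel_cam_poses = self.rel_pose_head(e)            # (E, pose_dim) auxiliary predictions
        w = torch.softmax(self.edge_score(e), dim=0)     # (E, 1) attention over all edges
        pooled = (w * e.mean(dim=(2, 3))).sum(dim=0)     # (ch,) attention-weighted pooling
        handeye = self.handeye_head(pooled)              # (pose_dim,) single hand-eye prediction
        return handeye, rel_cam_poses
```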
Graphical illustration of message passing step
The problem requires data containing images of scenes taken from different poses and the robot end-effector transforms between those poses. Note that this relative transform is not the camera motion between the two viewpoints but rather the robot end-effector motion between the two viewpoints.
For this purpose, we identified that any existing multi-view stereo dataset commonly used for SfM problems could be used. It was also important that the displacements between the camera poses lie in a range that a robot arm could realistically achieve. The DTU MVS 2014 dataset was chosen as it meets these requirements: its images were captured by a camera moved by a robot arm, and it provides the world-frame camera pose for each image.
To generate a data sample, we first randomly sample a number of images from a particular scene; each image corresponds to a node in the graph. We then synthesize a hypothetical hand-eye transform, bounded by realistic translation and rotation constraints as expected on a real robot. This hand-eye transform is used to obtain the corresponding robot end-effector pose for each image from its world-frame camera pose. The relative transforms between these end-effector poses are then computed and encoded into the edges of the graph.
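A sketch of this synthesis step, assuming 4x4 homogeneous pose matrices and treating the hand-eye transform X as the camera pose expressed in the end-effector frame (the helper and sampling bounds are illustrative):

```python
import numpy as np

def random_se3(max_trans, max_rot_deg, rng):
    """Sample a random rigid transform within the given bounds (Rodrigues' formula)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(0.0, max_rot_deg))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T[:3, 3] = rng.uniform(-max_trans, max_trans, size=3)
    return T

def synthesize_sample(T_world_cam, max_trans=0.15, max_rot_deg=90.0, seed=None):
    """Convert world-frame camera poses into relative end-effector displacements.

    T_world_cam: (N, 4, 4) camera poses of the sampled images.
    Returns the sampled hand-eye transform X and the relative end-effector
    motions that get encoded into the graph edges.
    """
    rng = np.random.default_rng(seed)
    X = random_se3(max_trans, max_rot_deg, rng)
    # T_world_cam = T_world_ee @ X  =>  T_world_ee = T_world_cam @ inv(X)
    T_world_ee = T_world_cam @ np.linalg.inv(X)
    n = len(T_world_ee)
    rel_ee = {(i, j): np.linalg.inv(T_world_ee[i]) @ T_world_ee[j]
              for i in range(n) for j in range(n) if i != j}
    return X, rel_ee
```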
An advantage of our approach is that even with a finite amount of data, we can effectively synthesize an infinite amount of training data by sampling different hand-eye offsets for a given set of images from the scene. We can also sample different sets of images from a particular scene to generate different data points.
The model was trained for 40 epochs over the entire dataset on an NVIDIA RTX 2080 GPU. The implementation uses the PyTorch Geometric framework. Hand-eye offsets with a maximum translation of 150 mm and a maximum rotation of 90 degrees were applied to the generated data samples; this range is representative of typical hand-eye transforms in common robotic systems. The following results were achieved:
Rotation error: mean 0.84 degrees, median 0.62 degrees
Translation error: mean 24.7 mm, median 17.8 mm
The TensorBoard logs for both models can be found here.
As can be seen from the above results, we are able to estimate the hand-eye transformation with a mean translation error of about 24.7 mm and a mean rotation error of about 0.84 degrees. This accuracy is not yet sufficient to replace existing methods.
However, we are optimistic that there is promise in such approaches: we have demonstrated that our network converges and makes sensible predictions, and we believe this is an important problem to work on. Based on our initial results, we expect that with further experimentation and investigation into the architecture, and with access to better computing infrastructure, such methods can deliver results comparable to existing ones.
There are a number of interesting ideas that are yet to be tried. The following are the ones that we are most optimistic about:
Explicitly enforcing geometric consistency among different elements of the graph to provide additional supervision. This can be done by:
Additionally predicting the depth and enforcing consistency between depth and motion
Enforcing consistency among all the predicted and known poses (hand-eye transformation, relative camera pose, relative robot end-effector displacement) using differentiable non-linear optimization [Tang et al., ICLR 2019].
Xiangyang Zhi and Sören Schwertfeger - Simultaneous Hand-Eye Calibration and Reconstruction, IROS 2017.
Mehmet Ozgur Turkoglu, Eric Brachmann, Konrad Schindler, Gabriel J. Brostow, Aron Monszpart - Visual Camera Re-Localization Using Graph Neural Networks and Relative Pose Supervision, 3DV 2021.
Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, Henrik Aanæs - Large Scale Multi-view Stereopsis Evaluation, CVPR 2014.
Chengzhou Tang and Ping Tan - BA-Net: Dense Bundle Adjustment Networks, ICLR 2019.
Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, Xiangyang Xue - DeepSFM: Structure From Motion Via Deep Bundle Adjustment, ECCV 2020.