Our project and the subjects we investigate span multiple fields of study.
The main goal is to develop a camera calibration model suitable for systems that track real-world objects. To track an object seen by multiple cameras, one first needs to calibrate the cameras.
The camera calibration process is essential for solving various problems in computer vision and robotics, such as 3D object tracking, depth estimation, and simultaneous localization and mapping.
Once calibration is completed, the image-space points of the tracked object can be found using object recognition algorithms.
General information about the camera model is given first; the calibration pipeline is then explained in more detail.
We can model a finite projective camera as a 3x4 matrix P that maps a 3D point X to a homogeneous image coordinate x, i.e. λx = P X. Here λ is the projective depth, and the image point is obtained by dividing the product P X by λ.
The extrinsic matrix Mext = [R | t] is a 3x4 rigid transformation with no scaling; it encodes the camera's position and orientation relative to the world frame.
Consequently, the camera matrix is P = K [R | t], where [R | t] transforms the 3D point X defined in the world coordinate frame to Xcam in the camera coordinate frame, and K projects the 3D point in the camera coordinate frame to the 2D image point x expressed in homogeneous coordinates.
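As a concrete illustration of this model, the short sketch below projects a world point through P = K [R | t]. All numeric values of K, R, and t are illustrative placeholders, not calibration results.

```python
import numpy as np

# Minimal sketch of the pinhole projection lambda * x = P X with P = K [R | t].
# The intrinsics and pose below are made-up example values.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])        # focal lengths and principal point
R = np.eye(3)                                # camera orientation w.r.t. the world frame
t = np.array([[0.0], [0.0], [2.0]])          # camera translation

P = K @ np.hstack([R, t])                    # 3x4 camera matrix

X_world = np.array([0.1, -0.2, 3.0, 1.0])    # homogeneous 3D point in the world frame
x_h = P @ X_world                            # homogeneous image point (scaled by lambda)
lam = x_h[2]                                 # projective depth
x = x_h[:2] / lam                            # pixel coordinates after dividing by lambda
print(lam, x)                                # 5.0, [336. 208.]
```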
Assuming the intrinsic parameters are known beforehand, the algorithms we implemented for multi-camera extrinsic calibration can be summarized as follows (a condensed code sketch of the pipeline is given after the list):
Estimate the fundamental matrix between the first two views using the normalized 8-point algorithm.
Using the known intrinsics, obtain the essential matrix between the first two views.
Recover the up-to-scale relative motion between the two views by decomposing the essential matrix into an orthogonal matrix (rotation) and a skew-symmetric matrix (the translation, as the matrix representation of the cross product).
Using the two camera matrices, triangulate the 2D correspondences to set up the initial 3D structure.
Add the remaining cameras one by one, estimating their poses from 2D-3D correspondences with the perspective-n-point (PnP) algorithm.
Align the resulting camera structure with the desired world frame.
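The sketch below condenses steps 1-5 of this pipeline, using OpenCV routines as stand-ins for our own implementations. The function names, and the assumption that the matched image points and intrinsic matrices are already available as NumPy arrays, are illustrative rather than part of the original module.

```python
import numpy as np
import cv2

def calibrate_first_pair(pts1, pts2, K1, K2):
    """Steps 1-4 for the first two views.
    pts1, pts2: Nx2 float arrays of matched image points; K1, K2: 3x3 intrinsics."""
    # 1) Fundamental matrix via the (normalized) 8-point algorithm.
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    # 2) Essential matrix from the known intrinsics: E = K2^T F K1.
    E = K2.T @ F @ K1
    # 3) Up-to-scale relative pose; recoverPose takes a single intrinsic matrix,
    #    so normalize the points first if K1 != K2.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K1)
    # 4) Triangulate the correspondences to initialize the 3D structure.
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K2 @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous points
    return R, t, (X_h[:3] / X_h[3]).T                     # Nx3 structure points

def register_camera(obj_pts, img_pts, K):
    """Step 5: pose of an additional camera from 2D-3D correspondences (PnP)."""
    _, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, distCoeffs=None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```

The final alignment of the recovered structure with the desired world frame (the last step in the list) is a similarity transform applied afterwards and is omitted from this sketch.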
How do we extract the relative pose between two cameras?
The essential matrix can be decomposed into an orthogonal matrix (the rotation) and a skew-symmetric matrix (the cross-product matrix of the translation).
This decomposition yields four possible solutions, and only one satisfies the chirality condition.
We choose the correct pose by triangulating correspondences and keeping the solution that places the points in front of both cameras; a sketch of the decomposition follows.
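A minimal sketch of this decomposition is shown below; it follows the standard SVD-based construction and is an illustration rather than our exact implementation.

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs encoded by an essential matrix.
    E factors as [t]_x R; U and Vt come from the SVD E = U diag(1, 1, 0) V^T."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs so that both factors correspond to proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2:3]            # translation direction (up to sign and scale)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

The chirality test then triangulates one or more correspondences with each of the four candidates and keeps the pose for which the triangulated points have positive depth in both cameras.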
Given an image point correspondence in two calibrated cameras, how can we find the 3D position of the point?
Various triangulation methods exist.
Our implementation uses the direct linear transform (DLT) method, sketched below.
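Below is a minimal sketch of DLT triangulation for a single correspondence, assuming pixel coordinates and 3x4 camera matrices as inputs; it mirrors the idea of our implementation rather than reproducing it exactly.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Homogeneous DLT triangulation of one 2D-2D correspondence.
    P1, P2: 3x4 camera matrices; x1, x2: (u, v) pixel coordinates in each view.
    Each view contributes the rows u*P[2] - P[0] and v*P[2] - P[1]; the 3D point is
    the right singular vector of the stacked system with the smallest singular value."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]       # dehomogenize to a Euclidean 3D point
```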
Our research and implementation started from developing a multi-camera calibration model suitable for systems that track real-world objects. After calibration, we used a pre-trained model from the Detectron2 library to detect several keypoints on the targets.
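For reference, a minimal sketch of extracting 2D keypoints with a pre-trained Detectron2 model is shown below. The specific COCO keypoint R-CNN config, the score threshold, and the input file name are assumptions for illustration; the report only states that a pre-trained Detectron2 model was used.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Assumed config: COCO keypoint R-CNN from the Detectron2 model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # assumed detection threshold

predictor = DefaultPredictor(cfg)
frame = cv2.imread("frame_0001.png")          # hypothetical input frame (BGR)
instances = predictor(frame)["instances"]
keypoints = instances.pred_keypoints          # (num_people, 17, 3): x, y, score per keypoint
```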
After finding the image-space points of a target object in each camera, the calibration parameters are used to cast rays into 3D space; the intersection of these rays gives the 3D point of the tracked object.
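The exact intersection scheme is not detailed here; the sketch below shows one common choice, a least-squares intersection of the back-projected rays, under the convention Xcam = R X + t used above.

```python
import numpy as np

def back_project_ray(K, R, t, pixel):
    """World-frame ray (origin, unit direction) through one pixel of a camera
    whose extrinsics map world points as X_cam = R X + t."""
    origin = -R.T @ t.reshape(3)                        # camera center in the world frame
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    direction = R.T @ d_cam
    return origin, direction / np.linalg.norm(direction)

def intersect_rays(origins, directions):
    """Point minimizing the sum of squared perpendicular distances to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        M = np.eye(3) - np.outer(d, d)                  # projector onto the plane normal to the ray
        A += M
        b += M @ o
    return np.linalg.solve(A, b)
```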
We prepared an offline Blender demo of our project. We set up five randomly placed cameras in a scene and placed a 3D human model that walks along a path. We calibrated the cameras using our multi-camera calibration module. We then captured 250 frames from the five cameras while the human subject walked around the scene and fed these frames into our 2D human detection module to produce 2D keypoints. Using the corresponding 2D keypoints in our multi-camera calibration module, we were able to track the 3D motion of the subject.
Blue dots represent the reconstructed object points, in this case the 3D positions of human body keypoints. The 3-axis frames represent the reconstructed cameras, and red dots represent the reconstructed 3D points of the 2D correspondences used for calibration.