This short article describes informally how we calibrate the multi-camera rig used in the research lab I work at. I worked on this exact problem during my end-of-studies internship, and you can download my report on the main page if you want to dive into the mathematical details.
In order to reconstruct an object, we first have to calibrate the cameras. This means that we want to know with high accuracy where the cameras are located in relation to each other, and where they are pointing. We also need to know how the rays of light are bent by the lenses so that we can reverse the projection process: that is to say, we want to compute the direction of the ray of light which landed on a given pixel. The simplest way to do this is to use a calibration pattern such as a checkerboard, as shown below.
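To make this "reverse projection" idea concrete, here is a minimal sketch of how a pixel can be turned back into a viewing ray once a camera is calibrated. The intrinsic matrix and distortion coefficients below are made-up placeholders; only the structure matters.

```python
import numpy as np
import cv2

# Hypothetical intrinsics: focal length and principal point of the camera.
K = np.array([[1200.0,    0.0, 960.0],
              [   0.0, 1200.0, 540.0],
              [   0.0,    0.0,   1.0]])
dist = np.zeros(5)  # assume negligible lens distortion for this sketch

def pixel_to_ray(u, v):
    """Return the unit direction, in camera coordinates, of the ray
    of light that landed on pixel (u, v)."""
    # undistortPoints removes the lens distortion and applies K^-1,
    # yielding normalized image coordinates (x, y) with z = 1.
    pt = cv2.undistortPoints(np.array([[[u, v]]], dtype=np.float64), K, dist)
    x, y = pt[0, 0]
    ray = np.array([x, y, 1.0])
    return ray / np.linalg.norm(ray)

print(pixel_to_ray(960.0, 540.0))  # ray through the principal point: ~[0, 0, 1]
```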
We can easily detect the location of the corners of the pattern in the image using OpenCV. Since we know the geometry of the pattern we can fully calibrate the camera with a few 2D-3D correspondences. You can have a look at this OpenCV tutorial if you want to know the details. We can use this knowledge to create a simple reconstruction setup at home as shown in the images above. The calibration pattern is printed on a sheet of paper placed on a turntable made using Lego pieces. The phone is placed on a stand to keep it still. All we have to do is take a picture of the calibration pattern at several angles to calibrate the cameras. We can then place a figurine and take pictures at the same angles to obtain calibrated images as well as background images for the reconstruction. This works well as long as the turntable is stable enough to support the weight of the figurine.
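If you want to try this at home, the whole checkerboard pipeline fits in a handful of OpenCV calls, roughly as in the tutorial mentioned above. This is only a sketch: the board dimensions, square size and file pattern are placeholders for whatever your own setup uses.

```python
import glob
import numpy as np
import cv2

# Hypothetical pattern: 9x6 inner corners, 25 mm squares.
pattern_size = (9, 6)
square_size = 25.0  # mm

# 3D coordinates of the corners in the board's own frame (z = 0).
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
objp *= square_size

obj_points, img_points = [], []
for path in glob.glob("calib_*.jpg"):  # placeholder file pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# One call recovers the intrinsics, the distortion and one pose per image.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```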
However, we cannot apply the same idea to a multi-camera rig: not all cameras are looking downwards, so we cannot print a calibration pattern on the ground. Instead, we use a calibration tool with bright LEDs that we move in the dark. As with the checkerboard pattern, we first want to establish 2D-3D correspondences between the projection of the LEDs in the images and their physical location on the calibration tool. Since we don't know the global orientation of the calibration tool yet, we work in its own planar reference frame.
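Concretely, working in the tool's own planar reference frame just means writing down the LED coordinates with z = 0, exactly as for the checkerboard corners. The layout and spacing below are made up for illustration; the real tool differs.

```python
import numpy as np

# Hypothetical LED layout, expressed in the tool's own planar frame (z = 0).
# The spacing and arrangement are placeholders, not the real tool geometry.
led_spacing = 100.0  # mm, made-up value
tool_points = np.array(
    [[i * led_spacing, 0.0, 0.0] for i in range(4)] +    # horizontal bar
    [[0.0, j * led_spacing, 0.0] for j in range(1, 4)],  # vertical bar
    dtype=np.float32)  # 7 LEDs in total
```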
The very first step is to find the location of the LEDs in the images. To do so, I decided to use an iterative approach. I process the image in 16x16 squares with some overlap. After the first detection pass, I keep only the squares containing sufficiently bright pixels. Then, I re-center each square on the location of the bright spot found previously and recompute its location based on the new bounds of the square. I also fit an ellipse, depicted with the white band in the images below. A spot is removed if its ellipse is too elongated, which can happen due to motion blur. I iterate this procedure five times in total. The location found at the first iteration is very noisy since the squares may initially be missing some of the pixels of the spot. During the following iterations, the squares are moved and converge to the location of the brightest spot in their neighborhood. A rough code sketch of this loop is given after the images below.
Spot detection at iterations 0 through 4, and the final result.
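Here is roughly what that detection loop looks like in code. The window size matches the one mentioned above, but the brightness threshold, the elongation limit and the moment-based centroid are my own assumptions; the real implementation differs in the details.

```python
import numpy as np
import cv2

WIN = 16              # window size in pixels (same as in the text)
THRESH = 200          # minimum brightness to count as "bright" (assumption)
MAX_ELONGATION = 3.0  # reject spots stretched by motion blur (assumption)

def detect_spots(gray):
    """Find bright LED spots: one coarse scan, then a few refinement passes."""
    h, w = gray.shape
    # Initial pass: overlapping WIN x WIN windows on a half-window grid,
    # keeping only the ones that contain sufficiently bright pixels.
    centers = []
    for y in range(0, h - WIN, WIN // 2):
        for x in range(0, w - WIN, WIN // 2):
            if gray[y:y + WIN, x:x + WIN].max() > THRESH:
                centers.append((x + WIN / 2.0, y + WIN / 2.0))

    # Refinement: each window drifts towards the brightest spot nearby.
    for _ in range(5):
        refined = []
        for cx, cy in centers:
            x0 = min(max(int(cx - WIN / 2), 0), w - WIN)
            y0 = min(max(int(cy - WIN / 2), 0), h - WIN)
            patch = gray[y0:y0 + WIN, x0:x0 + WIN].astype(np.float32)
            patch[patch < THRESH] = 0.0
            m = cv2.moments(patch)
            if m["m00"] == 0:
                continue
            # Elongation test on the intensity covariance (the ellipse axes).
            cov = np.array([[m["mu20"], m["mu11"]],
                            [m["mu11"], m["mu02"]]]) / m["m00"]
            lmin, lmax = np.linalg.eigvalsh(cov)
            if lmin <= 0 or np.sqrt(lmax / lmin) > MAX_ELONGATION:
                continue
            # Brightness-weighted centroid, in full-image coordinates.
            refined.append((x0 + m["m10"] / m["m00"], y0 + m["m01"] / m["m00"]))
        centers = refined
    # Overlapping windows converge to the same spot; a real implementation
    # would merge near-duplicate centers here.
    return centers
```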
Then, I keep only the 7 brightest spots and fit two lines to them in order to identify the vertical and horizontal bars of the calibration tool. I then call the OpenCV function calibrateCamera with the 2D-3D matches that have been established. This gives a calibration of each camera independently of the others: each camera knows its orientation with respect to the calibration tool but has no idea where the other cameras are in space. We can obtain a rigid transform from one camera to another by chaining two rigid transforms: from the first camera to the calibration tool, and then from the calibration tool to the other camera. Averaging over multiple frames gives a better estimate. We can then bring all the cameras into the same reference frame by chaining the transforms between pairs of cameras.
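The chaining step looks something like the sketch below, assuming calibrateCamera (or solvePnP) has already given each camera its rvec/tvec pose with respect to the tool for a given frame. The helper names are mine.

```python
import numpy as np
import cv2

def to_matrix(rvec, tvec):
    """Build a 4x4 rigid transform (tool frame -> camera frame) from the
    rvec/tvec pair returned by cv2.calibrateCamera or cv2.solvePnP."""
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T

def relative_pose(rvec_a, tvec_a, rvec_b, tvec_b):
    """Rigid transform mapping points from camera A's frame to camera B's,
    obtained by going through the calibration tool: A -> tool -> B."""
    T_a = to_matrix(rvec_a, tvec_a)  # tool -> camera A
    T_b = to_matrix(rvec_b, tvec_b)  # tool -> camera B
    return T_b @ np.linalg.inv(T_a)  # camera A -> camera B
```

Note that averaging these relative poses over many frames is slightly more subtle for the rotation part, since rotation matrices cannot simply be averaged element-wise; quaternion averaging is one common way to handle it.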
Finally, we can boost the precision by optimizing all the parameters together with a non-linear optimization called bundle adjustment. At the end, we obtain the position and rotation of all the cameras in a single reference frame, as well as the trajectory of the calibration tool, as shown below. Importantly, we also obtain the intrinsic parameters of the cameras: their focal length as well as an estimation of the distortion induced by the curved lenses.
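Under the hood, bundle adjustment is just a large non-linear least-squares problem on the reprojection error. A toy version, with my own (hypothetical) parameter packing and the intrinsics held fixed for brevity, could be set up with scipy.optimize.least_squares like this:

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def residuals(params, n_cams, K_list, dist_list, observations):
    """Reprojection residuals for all observations.

    params packs, per camera, a rotation vector and a translation (6 values),
    followed by the 3D position of every tool LED at every frame (3 values
    each).  observations is a list of (cam_idx, point_idx, u, v) tuples.
    In the real optimization the intrinsics and distortion are refined too;
    here they are held fixed to keep the sketch short."""
    cam_params = params[:6 * n_cams].reshape(n_cams, 6)
    points_3d = params[6 * n_cams:].reshape(-1, 3)
    res = []
    for cam_idx, pt_idx, u, v in observations:
        rvec = cam_params[cam_idx, :3]
        tvec = cam_params[cam_idx, 3:]
        proj, _ = cv2.projectPoints(points_3d[pt_idx].reshape(1, 1, 3),
                                    rvec, tvec,
                                    K_list[cam_idx], dist_list[cam_idx])
        res.extend(proj.ravel() - (u, v))
    return np.array(res)

# result = least_squares(residuals, x0, method="trf",
#                        args=(n_cams, K_list, dist_list, observations))
```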
Using this procedure, we obtain an average reprojection error of about 0.2 pixels, much smaller than a pixel. Unfortunately, the calibration is not stable over time and must be performed often, typically every day. This is mainly because of the thermal expansion of the steel beams holding the roof, which sag and contract as the temperature changes, moving the cameras mounted on them. Another source of error comes from the lenses of the cameras, which are also affected by thermal expansion. The cameras tend to heat up when in use, so we have to wait about half an hour before doing the calibration.
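For reference, the reprojection error is simply the average pixel distance between where the calibrated model predicts each LED should appear and where it was actually detected, along these lines:

```python
import numpy as np
import cv2

def mean_reprojection_error(obj_points, img_points, rvec, tvec, K, dist):
    """Average pixel distance between predicted and detected 2D points
    for one camera and one frame."""
    proj, _ = cv2.projectPoints(obj_points, rvec, tvec, K, dist)
    return np.mean(np.linalg.norm(
        proj.reshape(-1, 2) - img_points.reshape(-1, 2), axis=1))
```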
During my internship, I also worked on self-calibration (see my report for more details), also called auto-calibration, which attempts to do the same as the above without relying on an object of known geometry. The goal is to calibrate the cameras using only 2D matches between the images, instead of 2D-3D matches between an image and an object. My conclusion was that it is as accurate as the other method, but obtaining sufficiently many robust matches between two images can be problematic, which is why we still rely on a calibration tool. An added advantage of the calibration tool is that it is easier to obtain a uniform coverage of the field of view of the cameras, whereas auto-calibration can only rely on the 2D matches that happen to be found.
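For completeness, the core of that approach looks roughly like the sketch below: match features between two images and recover the relative pose from the essential matrix. This simplified version assumes the intrinsics are already approximately known; full self-calibration also has to estimate them, which is precisely what makes the problem harder (see the report for the details).

```python
import numpy as np
import cv2

def relative_pose_from_matches(img_a, img_b, K):
    """Estimate the rotation and translation (up to scale) between two
    cameras from 2D feature matches alone, assuming intrinsics K are known."""
    orb = cv2.ORB_create(4000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # RANSAC rejects bad matches, which is exactly where this approach
    # tends to struggle when few robust matches are available.
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t
```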