Very briefly, a surface reconstruction algorithm extracts a triangle mesh from pictures of an object. Note that many variations of this problem have been considered over the decades of computer vision research:
- A point cloud may be sufficient for many applications, instead of a mesh.
- The number of pictures may be as low as one or as high as several thousand.
- The object may be known or unknown.
- The object may be moving or static.
- The object's surface appearance can be shiny, metallic, or even semi-transparent.
- The images may be taken 'in the wild' with a cheap smartphone camera or in a controlled environment.
- The images can be taken in the infrared spectrum and thus only contain grayscale data, instead of the classic RGB triplets.
- Time-of-flight cameras can measure the distance to the object, and directly give RGBD quadruplets per pixel.
*The 68 cameras of the multi-camera rig*
In my PhD, the images are obtained with a multi-camera rig filming one or several people, as shown in the picture above. All the cameras film synchronously, meaning they all take a picture at the same time. The situation is thus analogous to an inanimate object imaged by moving a single camera around it. The cameras capture classic RGB pictures at a resolution of 2048x2048. This controlled environment provides a few benefits compared to an 'in the wild' scenario. First, we can segment the person from the background, since we know what's behind them. This way, we quickly know which parts of an image contain interesting content. Second, we can calibrate the cameras beforehand. Calibration is the process by which we compute where the cameras are located and how they are oriented with respect to each other. Using a measuring tape and a protractor wouldn't be practical, so we have to use a dedicated tool; I explain this process in more detail here.
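To make the background-segmentation idea concrete, here is a minimal sketch in NumPy. It assumes we have a pre-captured image of the empty background and simply flags pixels that differ from it by more than a threshold; the function name and threshold are hypothetical, and a production pipeline would of course be far more robust (lighting changes, shadows, noise):

```python
import numpy as np

def segment_foreground(frame, background, threshold=30):
    """Flag pixels that differ from a pre-captured background image.

    frame, background: (H, W, 3) uint8 RGB images.
    Returns a boolean (H, W) mask, True where the person likely is.
    The threshold value is illustrative, not tuned.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    # A pixel is foreground if any channel deviates strongly enough.
    return diff.max(axis=-1) > threshold

# Toy example: a uniform background with a bright 4x4 square "person".
bg = np.full((8, 8, 3), 100, dtype=np.uint8)
frame = bg.copy()
frame[2:6, 2:6] = 200
mask = segment_foreground(frame, bg)
print(mask.sum())  # → 16 foreground pixels
```

Knowing the mask up front lets the reconstruction skip large empty regions of each 2048x2048 image.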
Despite these two conveniences, there are also many challenges. The main one is that the reconstruction algorithm must be as fast as possible to process an enormous amount of data in a reasonable amount of time. At a rate of 30 images per second, that's 2048x2048x3x68x30 = 25.7 GB/s of uncompressed data! Processing this in real time remains completely out of reach. Some reconstruction algorithms take up to several hours, which would be highly impractical since we have to run one reconstruction per frame. Instead, we currently target about one minute of computation per frame. The other difficulties come from the type of data we are trying to reconstruct. For instance, fast movements create a lot of motion blur, which must be handled gracefully. Additionally, hair is notoriously hard to reconstruct: it is both very shiny and semi-transparent, and it is not well approximated by a surface due to its volumetric nature.
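The back-of-the-envelope data rate above is easy to verify; all the numbers come straight from the rig's specification in the text:

```python
# Uncompressed data rate of the capture rig.
width, height = 2048, 2048  # pixels per image
bytes_per_pixel = 3         # uncompressed 8-bit RGB
cameras = 68
fps = 30

bytes_per_second = width * height * bytes_per_pixel * cameras * fps
print(f"{bytes_per_second / 1e9:.1f} GB/s")  # → 25.7 GB/s
```

At roughly 25.7 GB every second, even just writing the raw streams to disk is a challenge, let alone reconstructing from them in real time.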
For all these reasons, we decided to adapt existing differentiable volumetric algorithms to our needs. You can check my first publication on this topic on the main page for more details.
[Work in progress] Come back later for the rest :)