The visual hull is defined as the intersection of the view cones of a set of cameras. Put another way, it is the region of space that projects inside the silhouettes seen by all the cameras. The silhouette is the outline of the projection of an object or person. The picture on the left shows an example of a visual hull.
Computing the visual hull helps accelerate 3D reconstruction since the true surface is contained inside the visual hull. This way, we know with good confidence that there is no need to cast rays (for rendering and optimization) outside the silhouettes since they won't intersect the surface. We also know that there is no need to allocate voxels outside the visual hull, which greatly reduces memory usage. In short, the visual hull can be used as a good initialization for a 3D reconstruction algorithm.
The very first step is to extract the silhouettes from the background. We can do that by selecting all the pixels that are sufficiently different from the background, using a simple threshold. Here, I'm assuming that images of the background have been obtained by taking pictures when the acquisition platform is empty. This thresholding is very noisy, so we can average the binary decisions at several resolutions. We can also smooth the binary decisions with a Gaussian blur.
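Here is a minimal sketch of that threshold-and-blur step, assuming the images are already loaded as floating-point arrays; the threshold and blur radius below are illustrative values, not the ones I actually use.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_silhouette(image, background, threshold=0.1, sigma=2.0):
    # Per-pixel absolute difference against the empty-platform background.
    diff = np.abs(image.astype(np.float32) - background.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.max(axis=-1)  # collapse color channels if present
    # Binary decision per pixel, then a Gaussian blur to smooth the noise,
    # followed by a second threshold at 0.5.
    binary = (diff > threshold).astype(np.float32)
    return gaussian_filter(binary, sigma=sigma) > 0.5
```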
From this, we obtain a segmented image separating the pixels into two categories: background and foreground. The pixels on the boundary can be chained and converted to small line segments to define the silhouette. The segmented image is rather unreliable due to the shadows cast on the ground as well as noisy regions in the background. This is not a problem at all since the visual hull will be able to remove these unwanted regions very effectively. At this stage, it is better to have background pixels labeled as foreground than the opposite, since the visual hull computation cannot promote background pixels to foreground pixels.
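As a sketch, an off-the-shelf contour tracer such as skimage.measure.find_contours can stand in for the boundary chaining (it is an assumption, not necessarily what I use): it chains the boundary pixels into polylines, which we then cut into small segments.

```python
import numpy as np
from skimage import measure

def silhouette_segments(mask):
    segments = []
    # find_contours traces iso-lines at 0.5 between foreground and background,
    # returning chains of (row, col) boundary points.
    for contour in measure.find_contours(mask.astype(np.float32), 0.5):
        for p, q in zip(contour[:-1], contour[1:]):
            segments.append((p[::-1].copy(), q[::-1].copy()))  # (row, col) -> (x, y)
    return segments
```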
The second step is to build a bounding volume hierarchy (BVH) from the line segments of the silhouette, as shown on the right. The BVH is computed on the CPU for simplicity in my implementation. We have to compute the segmentation and the BVH for all the cameras before we can move on to the visual hull computation.
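A 2D BVH over these segments can be as simple as a recursive median split on the longest axis of the node's bounding box. The sketch below builds such a tree; the leaf size and the splitting rule are arbitrary choices for illustration, not necessarily mine.

```python
import numpy as np

class BVHNode:
    def __init__(self, lo, hi, left=None, right=None, segments=None):
        self.lo, self.hi = lo, hi            # AABB corners of this node
        self.left, self.right = left, right  # child nodes (None at a leaf)
        self.segments = segments             # leaf payload

def build_bvh(segments, leaf_size=4):
    pts = np.array([[s[0], s[1]] for s in segments])  # (n, 2, 2) endpoints
    lo, hi = pts.min(axis=(0, 1)), pts.max(axis=(0, 1))
    if len(segments) <= leaf_size:
        return BVHNode(lo, hi, segments=segments)
    # Median split on the longest axis of the node's bounding box.
    centers = pts.mean(axis=1)
    axis = int(np.argmax(hi - lo))
    order = np.argsort(centers[:, axis])
    mid = len(segments) // 2
    return BVHNode(lo, hi,
                   build_bvh([segments[i] for i in order[:mid]], leaf_size),
                   build_bvh([segments[i] for i in order[mid:]], leaf_size))
```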
The visual hull can be computed in many ways. One approach is to build a polygonal representation of the visual hull by extruding the silhouettes forward in space to form visual cones and computing their boolean intersection (think of the boolean modifier in Blender). Another way is to use space carving: create a voxel grid / octree, remove all the voxels whose projection falls outside a silhouette, subdivide the remaining ones and repeat until the desired precision is reached. I instead chose a third way that uses 2D ray tracing (hence the 2D BVH).
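For reference, the carving test itself is simple. Here is a sketch assuming 3x4 projection matrices and boolean silhouette masks (both assumptions about the interface, not my actual data structures):

```python
import numpy as np

def carve(voxel_centers, cameras, masks):
    # Keep a voxel only if it projects inside every silhouette.
    keep = np.ones(len(voxel_centers), dtype=bool)
    homog = np.c_[voxel_centers, np.ones(len(voxel_centers))]
    for P, mask in zip(cameras, masks):
        uvw = homog @ P.T                      # project all centers at once
        z = uvw[:, 2]
        front = z > 1e-9                       # discard points behind the camera
        uv = np.zeros((len(z), 2), dtype=int)
        uv[front] = np.round(uvw[front, :2] / z[front, None]).astype(int)
        h, w = mask.shape
        inside = (front & (uv[:, 0] >= 0) & (uv[:, 0] < w)
                        & (uv[:, 1] >= 0) & (uv[:, 1] < h))
        keep &= inside                         # outside the image = outside the hull
        keep[inside] &= mask[uv[inside, 1], uv[inside, 0]]
    return keep
```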
The idea is to physically move the foreground pixels along their corresponding 3D rays until they are considered inside the silhouettes of all the other cameras. The reconstruction volume is used as a starting point (top left). We then project the points into another camera's image. If they are outside the silhouette, we make them move forward along the projection of their 3D ray (which is a 2D ray) until they hit the closest silhouette line segment. The 2D BVH accelerates this process, just like a 3D BVH accelerates 3D ray tracing. We iterate over all the cameras a few times to obtain a 3D point cloud (top right). The pixels that have been mistakenly labeled as foreground in the first step keep moving forward until they leave the reconstruction volume, at which point they are removed.
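The core loop looks roughly like the sketch below. To keep it short, it steps the point forward in 3D with a fixed increment and tests its projection against every mask, instead of intersecting the projected 2D ray with the silhouette segments through the BVH as described above; the 3x4 projection matrices and the step size are assumptions.

```python
import numpy as np

def sweep_to_hull(origin, direction, cameras, masks, t_max, dt=0.01):
    def inside(point, P, mask):
        u, v, w = P @ np.append(point, 1.0)   # project the 3D point
        if w <= 0:
            return False
        x, y = int(round(u / w)), int(round(v / w))
        h, width = mask.shape
        return 0 <= x < width and 0 <= y < h and bool(mask[y, x])

    t = 0.0
    while t < t_max:
        p = origin + t * direction
        # Stop as soon as the point projects inside every silhouette.
        if all(inside(p, P, m) for P, m in zip(cameras, masks)):
            return p
        t += dt
    return None  # mislabeled pixel: it left the volume, so we drop it
```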
The point cloud that we obtain by applying this process to the pixels of a single camera is shown above (left). We repeat this process on the foreground pixels of all the cameras to obtain a complete point cloud (right). The point cloud can be converted to a mesh if necessary, using Poisson surface reconstruction or marching cubes for instance. However, this is not necessary in my case: the point cloud is just used to initialize my reconstruction algorithm by building a coarse octree around it. This complete process is actually quite cheap and takes less than a second since the 2D ray tracing maps well to the GPU. I don't need much precision so I use images at half resolution. The important thing is that most of the mislabeled pixels have been removed and that we have a good idea of where the true surface lies.
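The seeding step can be pictured as quantizing the point cloud into coarse occupied cells, which an octree can then refine. A sketch only (the cell size is arbitrary, and the octree construction itself is not shown here):

```python
import numpy as np

def coarse_cells(points, cell_size):
    # Integer (i, j, k) coordinates of the coarse voxels touched by the cloud;
    # these occupied cells are a natural seed for an octree.
    return np.unique(np.floor(points / cell_size).astype(np.int64), axis=0)
```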