Members:
Shawn Recker, Jason Mak, Mauricio Hess-Flores, John D. Owens, Kenneth I. Joy
Overview
The work in [ReckerWACV13] presents a framework for N-view triangulation of scene points that improves both processing time and final reprojection error with respect to standard methods such as linear triangulation. The framework introduces an angular error-based cost function that is robust to outliers, inexpensive to compute, and designed so that simple adaptive gradient descent can be applied for convergence. Our method also includes a statistical sampling component based on confidence levels, which reduces the number of rays used to triangulate a given feature track. We show that the statistical component yields a meaningful yet much smaller set of representative rays, and that applying the cost function to the reduced sample efficiently yields faster and more accurate solutions. Results on real and synthetic data demonstrate that the method significantly increases triangulation speed and reduces reprojection error in most cases. Given its speed and low memory requirements, it is especially attractive for efficient triangulation of large scenes.
As input, we assume that a set of feature tracks across N images and their respective 3x4 camera projection matrices Pi are known. Figure 1 illustrates the angular error measure for a candidate 3D position p. For each camera center Ci, a unit direction vector vi is first computed between it and the 3D position. A second unit vector, wi, is obtained by casting a ray from each Ci through the 2D projection of the evaluation point on each image plane (blue image-plane dot in Figure 1). Our cost function is an L1 sum of error terms derived from the dot products vi . wi, averaged across all cameras. Each dot product can range over [-1, 1], but in practice we only deal with points that lie in front of the cameras, and hence the range [0, 1]. The function is positive-definite, with an absolute minimum at zero. Figure 2 shows a scalar field consisting of our cost function evaluated at a dense set of test positions near a known ground-truth position. Notice the smooth variation in a large vicinity surrounding this position, which is key to the success of our cost function. Simple gradient descent can be applied for optimization, which avoids the matrix storage and operations involved in solving normal equations, as in bundle adjustment with Levenberg-Marquardt [Hartley04].
Figure 1: Angular error calculation for an evaluation 3D position, with respect to its corresponding feature track.
Figure 2: Multi-view reconstruction of the Dinosaur dataset [Oxford09] (left), where the camera follows a circular path above the scene. A volume view of a scalar field representing our angular cost function evaluated at a dense grid of sample points around a ground-truth position (black dot in both images) is also shown (right). Redder values are closer to zero cost.
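To make the cost function concrete, below is a minimal NumPy sketch. It assumes the per-camera error term is 1 - (vi . wi), so the cost vanishes exactly when every ray passes through the candidate point; the central-difference gradient and the double-on-improvement, halve-on-overshoot step schedule are illustrative choices, not the paper's implementation.

    import numpy as np

    def camera_center(P):
        # The camera center C is the right null vector of the 3x4 matrix P.
        _, _, Vt = np.linalg.svd(P)
        C = Vt[-1]
        return C[:3] / C[3]

    def pixel_ray(P, x):
        # Forward direction of the ray from the center through pixel x = (u, v):
        # for a finite camera P = [M | m], this is sign(det M) * M^-1 * (u, v, 1).
        M = P[:, :3]
        d = np.sign(np.linalg.det(M)) * np.linalg.solve(M, np.array([x[0], x[1], 1.0]))
        return d / np.linalg.norm(d)

    def angular_cost(p, Ps, xs):
        # Average of the per-camera error terms 1 - v_i . w_i; zero when
        # every ray passes exactly through p.
        err = 0.0
        for P, x in zip(Ps, xs):
            C = camera_center(P)
            v = (p - C) / np.linalg.norm(p - C)
            err += 1.0 - np.dot(v, pixel_ray(P, x))
        return err / len(Ps)

    def triangulate(p0, Ps, xs, step=1.0, max_iters=200, h=1e-6):
        # Adaptive gradient descent: double the step while the cost improves,
        # halve it on overshoot.
        f = lambda q: angular_cost(q, Ps, xs)
        p = np.asarray(p0, dtype=float)
        cost = f(p)
        for _ in range(max_iters):
            g = np.array([(f(p + h * e) - f(p - h * e)) / (2.0 * h)
                          for e in np.eye(3)])  # central-difference gradient
            if np.linalg.norm(g) < 1e-12:
                break
            cand = p - step * g / np.linalg.norm(g)
            c = f(cand)
            if c < cost:
                p, cost, step = cand, c, 2.0 * step
            else:
                step *= 0.5
                if step < 1e-12:
                    break
        return p

Note that each iteration touches only a 3-vector and the per-camera rays, with no matrix storage beyond 3x3 solves, which is the memory advantage over normal-equation solvers noted above.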
A second component of our triangulation framework is the use of statistically meaningful samples of rays, as opposed to the entire available set of N rays. For this, we use confidence levels to select a random subset of rays from the full set. For instance, a 95% confidence level with a 5% margin of error implies that a point computed from the reduced sample has a 95% probability of lying within a 5% margin of error of the position obtained with the entire set of rays. With this procedure, final reprojection errors are on par with those obtained using the full set of rays, and a significant speed-up is achieved. In the Results section of the paper, the best combination of speed and accuracy is shown to be a 90% confidence level with a simple midpoint start, instead of the more expensive linear triangulation [Hartley04]. In summary, this triangulation framework behaves excellently on both real and synthetic data, outperforming N-view linear triangulation, the only standard method for more than three views, while also performing very well for two and three views. If guaranteed optimality is required, optimal algorithms specifically designed for two and three views may give slightly more accurate results, but at potentially much higher processing times. Our framework is flexible: for fast but less accurate results, a lower confidence level and a midpoint start can be used; for better accuracy at a higher cost, a linear start and a higher confidence level can be used. Figure 3 displays reconstructions obtained with our triangulator. Shawn Recker's triangulation code corresponding to this work can be found here.
Figure 3: Aerial scenes reconstructed with our triangulation framework. A midpoint start and 90% confidence level were used.
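The sample size implied by a confidence level and margin of error can be computed with standard survey-sampling formulas. The sketch below uses Cochran's formula with a finite-population correction, a common choice; the paper's exact sampling procedure may differ in detail.

    import math
    import random

    Z_SCORE = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}  # standard normal quantiles

    def sample_size(N, confidence=0.90, margin=0.05):
        # Cochran's formula with worst-case proportion p = 0.5, followed by a
        # finite-population correction for a track with N rays.
        z = Z_SCORE[confidence]
        n0 = (z * z * 0.25) / (margin * margin)
        return min(N, math.ceil(n0 / (1.0 + (n0 - 1.0) / N)))

    def sample_track(track, confidence=0.90, margin=0.05):
        # Randomly keep a statistically meaningful subset of a feature track.
        return random.sample(track, sample_size(len(track), confidence, margin))

For short tracks the finite-population correction returns nearly the full track, so sampling pays off mainly on long tracks, consistent with the large-scene focus above.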
Building on the encouraging yet non-optimal results of the previous paper, we developed a further algorithm in [ReckerICCV13]. This paper presents a novel framework for practical and optimal N-view triangulation of scene points. The algorithm applies swarm optimization inside a robustly computed bounding box, using the angular error-based L1 cost function described above, which is more robust to outliers and less susceptible to local minima than cost functions such as the L2 norm of reprojection error. In extensive testing on synthetic data with ground truth, the algorithm found an optimal position over 99.9% of the time, across thousands of camera configurations with varying degrees of feature-tracking error. Unlike existing polynomial methods developed for small numbers of cameras, the proposed algorithm scales at worst linearly with the number of cameras and does not suffer from the inaccuracies inherent in solving high-order polynomials or Gröbner bases. In the specific case of three views, it achieves a two-to-three order-of-magnitude performance increase over such methods. Results on real data show that reprojection error is improved with respect to other methods in the literature. Figure 4 shows a summary and results for this algorithm. Figure 5 displays a reconstruction obtained with our algorithm for the Notre Dame dataset [Snavely06].
Figure 4: Algorithm summary. A multi-view reconstruction of the ET dataset [Snavely06] is shown in the upper-left image, with camera positions rendered in dark blue; a zoomed-in view of the bounding box surrounding a point to be optimized appears in the upper-right image. Inside the bounding box (middle left), the L1 cost function varies smoothly. Swarm optimization is then applied inside the bounding box (middle right). Processing time stabilizes despite increasing error (lower left), and is nearly linear in feature track length for small errors, a very desirable property (lower right).
Figure 5: Reconstruction of the Notre Dame dataset [Snavely06], with input images on the left and our obtained point cloud on the right. The triangulation took 10.34 minutes, with 73427 tracks spanning 290 cameras, on a MacBook Pro with an Intel Core i7 processor at 2.66 GHz with 4 GB of RAM, running Mac OS X Lion 10.7.3.
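The sketch below shows a generic particle-swarm optimizer confined to an axis-aligned bounding box, minimizing the angular cost from the earlier sketch; the inertia weight and acceleration constants are textbook PSO defaults, not the tuned values from the paper.

    import numpy as np

    def pso_triangulate(cost, lo, hi, n_particles=50, iters=100,
                        w=0.7, c1=1.5, c2=1.5, seed=0):
        # Standard PSO update rules, with particles clipped to the box [lo, hi].
        rng = np.random.default_rng(seed)
        pos = rng.uniform(lo, hi, size=(n_particles, 3))
        vel = np.zeros_like(pos)
        pbest = pos.copy()                       # per-particle best positions
        pbest_cost = np.array([cost(p) for p in pos])
        g = pbest[np.argmin(pbest_cost)].copy()  # global best position
        g_cost = pbest_cost.min()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, 1))
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
            pos = np.clip(pos + vel, lo, hi)     # stay inside the bounding box
            costs = np.array([cost(p) for p in pos])
            better = costs < pbest_cost
            pbest[better], pbest_cost[better] = pos[better], costs[better]
            if pbest_cost.min() < g_cost:
                g_cost = pbest_cost.min()
                g = pbest[np.argmin(pbest_cost)].copy()
        return g, g_cost

A call might look like pso_triangulate(lambda q: angular_cost(q, Ps, xs), box_min, box_max), where box_min and box_max are the corners of the robustly computed bounding box.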
Finally, [MakWACV14] presents a framework for GPU-accelerated N-view triangulation in multi-view reconstruction. Inspired by the algorithm in [ReckerWACV13], it optimizes the angular error-based L1 cost function using adaptive gradient descent. Mapping the algorithm onto the GPU is a non-trivial task, and two parallelization approaches are compared: one thread per track and one thread block per track. Which approach performs better depends on the number of tracks and the lengths of the tracks in a given dataset. The algorithm also implements statistical sampling based on confidence levels to reduce the number of feature track positions needed to triangulate an entire track. Furthermore, sampling aids load balancing on the GPU's SIMD architecture and helps exploit the GPU's memory hierarchy. Compared to a serial implementation, a typical performance increase of 3-4x is achieved on a 4-core CPU; on a GPU, large track counts are favorable and speed-ups of up to 39x are achieved. Results on real and synthetic data show that reprojection errors are on par with those in [ReckerWACV13] at only a fraction of the computation time, allowing for efficient and accurate triangulation of large scenes. Figure 6 shows performance data for this framework.
Figure 6: GPU runtime performance with and without sorting for datasets with an increasing number of tracks (upper left). Track lengths in a dataset vary from 1 to 100. Performance of a multi-core CPU vs. a GPU on datasets with an increasing number of tracks (upper right). Track lengths are fixed at 100. GPU performance with increasing track error for different camera configurations (bottom left). Comparison of the one-thread-per-track (OTPT) and one-block-per-track (OBPT) implementations on a 20000-track dataset and a 100000-track dataset (bottom right). Scaling is measured as track length is increased.
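The published implementation uses GPU kernels; as a rough CPU analogue of the one-thread-per-track mapping, the sketch below dispatches one worker process per track and sorts tracks longest-first, mirroring the role that sorting plays in the GPU load-balancing results of Figure 6. Here triangulate_one stands in for any of the per-track solvers sketched above and must be a top-level (picklable) function.

    from concurrent.futures import ProcessPoolExecutor

    def triangulate_all(tracks, triangulate_one):
        # Dispatch long (expensive) tracks first to reduce tail latency: the
        # CPU counterpart of sorting tracks for GPU load balancing.
        order = sorted(range(len(tracks)), key=lambda i: len(tracks[i]), reverse=True)
        results = [None] * len(tracks)
        with ProcessPoolExecutor() as pool:
            solved = pool.map(triangulate_one, [tracks[i] for i in order])
            for i, p in zip(order, solved):
                results[i] = p
        return results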
Related Publications
[Hartley04] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004.
[MakWACV14] Jason Mak, Mauricio Hess-Flores, Shawn Recker, John D. Owens, Kenneth I. Joy, "GPU-Accelerated and Efficient Multi-View Triangulation for Scene Reconstruction", in "WACV 2014: IEEE Winter Conference on Applications of Computer Vision", Steamboat Springs, Colorado.
[Oxford09] Oxford Visual Geometry Group. Multi-view and Oxford Colleges building reconstruction. http://www.robots.ox.ac.uk/~vgg/, August 2009.
[ReckerICCV13] Shawn Recker, Mauricio Hess-Flores, Kenneth I. Joy, "Fury of the Swarm: Efficient and Very Accurate Triangulation for Multi-View Scene Reconstruction", accepted for publication in "International Conference on Computer Vision (ICCV) 2013 Workshop: Big Data in 3D Computer Vision (BigData3DCV)", to be held December 8, 2013 in Sydney, Australia.
[ReckerWACV13] Shawn Recker, Mauricio Hess-Flores, Kenneth I. Joy, "Statistical Angular Error-Based Triangulation for Efficient and Accurate Multi-View Scene Reconstruction", in "WACV 2013: IEEE Winter Conference on Applications of Computer Vision", pp. 68-75, 2013.
[Snavely06] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 835-846, New York, NY, USA, 2006. ACM.