Members:
Daniel Knoblauch, Mauricio Hess-Flores, Mark A. Duchaineau, Kenneth I. Joy, Falko Kuester
Abstract
This paper introduces a non-parametric sequential frame decimation algorithm for image sequences in low-memory streaming environments. Frame decimation reduces the number of input frames to increase pose and structure robustness in Structure and Motion (SaM) applications. The main contribution of this paper is the introduction of a sequential low-memory workflow for frame decimation in embedded systems where memory and memory traffic come at a premium. This approach acts as an online preprocessing filter by removing frames that are ill-posed for reconstruction before streaming. The introduced sequential approach requires only three frames in memory, in contrast to global frame decimation approaches that keep at least ten frames in memory, and is therefore suitable for low-memory streaming environments. This is all the more important in emerging systems with large-format cameras, which acquire data over several hours and therefore render global approaches infeasible.
In this paper a new decimation metric is designed that facilitates sequential extraction of keyframes suited for reconstruction, based on factors such as the correspondence-to-feature ratio and the relationship between the residual errors of epipolar geometry and homography estimation. The specific design of the metric allows local, sequential evaluation and can therefore be applied on the fly. The approach has been tested on various types of input sequences and yields reliable low-memory frame decimation that is robust to different frame sampling frequencies and independent of thresholds, scene assumptions or global frame analysis.
Overview
The main topic is structure from motion, in particular scene reconstruction from streaming video sequences. The scenario we are concerned with is one in which several minutes or hours of streaming video have been collected at a high frame rate, for example from a UAV, and scene reconstruction is attempted. A number of algorithms exist in the literature for scene reconstruction from sequential images or video, but one aspect that is often overlooked is which images should actually be used.

Streaming video reconstruction raises its own issues. In these environments, memory and network bandwidth may be restricted. Not only is it prohibitive to process all the information under such restrictions, but attempting reconstruction from all frames actually leads to pose and structure estimation degeneracies and numerical errors. This motivates a frame decimation algorithm that filters out frames unsuitable for multi-view reconstruction and, at the same time, reduces the amount of information entering the reconstruction infrastructure. A small number of frame decimation algorithms exist in the literature. Some of them are global, filtering over the entire set of acquired images, which is not suitable for streaming environments; others rely on thresholds and assumptions that make them scene-dependent.

There are several causes of pose and structure estimation inaccuracy. Feature matching is affected especially by occlusions and lighting changes as baselines grow. Even with good feature matching, a purely rotational camera motion or a scene in which all matched points are coplanar leads to pose estimation degeneracies, as will be discussed in more detail. In structure estimation, numerical errors in triangulation are introduced by baselines that are small in relation to the depth of the viewed scene: near-parallel rays yield triangulated points with potentially large uncertainty in their computed positions. Frame decimation must take all of these factors into account, since not even bundle adjustment can fix estimates that are very inaccurate.
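To make the small-baseline instability concrete, the following sketch (ours, not from the paper) triangulates a synthetic point under half-pixel image noise at several baseline-to-depth ratios; the camera intrinsics and geometry are assumed purely for illustration.

```python
# Illustration of why small baselines make triangulation numerically unstable:
# two identical pinhole cameras view a point 50 units away; we perturb the
# image measurements by ~0.5 pixels and measure the resulting 3D error.
import numpy as np
import cv2

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])  # assumed intrinsics
X = np.array([0.0, 0.0, 50.0, 1.0])                          # scene point, depth 50

def triangulation_error(baseline, noise_px=0.5, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])         # camera at origin
    P2 = K @ np.hstack([np.eye(3), [[-baseline], [0], [0]]])  # shifted along x
    x1 = (P1 @ X)[:2] / (P1 @ X)[2]                           # noise-free pixels
    x2 = (P2 @ X)[:2] / (P2 @ X)[2]
    errs = []
    for _ in range(trials):
        n1 = (x1 + rng.normal(0, noise_px, 2)).reshape(2, 1)
        n2 = (x2 + rng.normal(0, noise_px, 2)).reshape(2, 1)
        Xh = cv2.triangulatePoints(P1, P2, n1, n2)            # homogeneous 4x1
        errs.append(np.linalg.norm(Xh[:3, 0] / Xh[3, 0] - X[:3]))
    return np.mean(errs)

for b in [0.05, 0.5, 5.0]:  # baseline-to-depth ratios of 0.1%, 1%, 10%
    print(f"baseline {b:5.2f}: mean 3D error {triangulation_error(b):8.3f}")
```

With near-parallel rays (0.1% ratio) the error is orders of magnitude larger than at a 10% ratio, which is exactly the instability the decimation metric must avoid.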
The main contribution of our work is a new frame decimation metric optimized for non-parametric, sequential frame decimation. Compared to decimation metrics introduced in previous work, the new metric is designed to have a single global maximum, representing a good keyframe pair, at each evaluation step, with at most three buffered frames, so that decimation decisions are made sequentially and without scene-dependent thresholds. The goal of this approach is to avoid input frames that result in error-prone pose and structure estimates and, by finding reliable consecutive keyframes, to reduce the total amount of streamed data by filtering unsuitable frames as early as possible. To give a visual example of what our algorithm does, the imagery in Figure 1 was acquired from a UAV, with images streaming in at a fixed frame rate. For illustrative purposes, we show a set of 12 images in which the ones filtered by our sequential frame decimation are marked with a red 'X'. Given this pre-processing, the remaining images were used for pose (yellow cameras) and structure (point cloud) estimation, as shown below.
Figure 1: Concept of frame decimation.
Figure 2 shows a flowchart for the proposed frame decimation algorithm. Given the k-th keyframe, SURF feature matches are obtained between the keyframe and the present candidate frame, k+i. The algorithm then compares the value of the frame decimation cost function fG at the current frame with its value at the previous frame, and the counter i is increased as long as fG is increasing. As soon as a frame pair is found whose fG value is positive but lower than that of the previous pair, the previous frame, for which a positive fG was obtained with respect to the current keyframe, is chosen as the next keyframe. To start the sequential frame decimation, the first suitable frame of the input sequence is used as the first keyframe; it can be initialized with Beder and Steffen's algorithm [Beder06] for finding the best initial keyframe pair for scene reconstruction. Every frame pair at which fG(k, k+i) attains a local maximum thus yields a keyframe, which is used for subsequent SaM calculations. The proposed algorithm has a number of properties and advantages. The cost function, as explained below, consists of two components: a degeneracy/instability detection component and a match-based component. The combined action of these multiplicative terms allows our frame decimation to filter frames that result in degenerate camera poses and numerically unstable triangulations. It also avoids large baselines, which degrade matching accuracy. A single global cost function maximum is reached at each keyframe evaluation. As will be seen in the results section, the framework reduces reprojection error in pairwise reconstructions as well as in the final multi-view reconstruction. Finally, the keyframe sequence found is locally optimal: if a different starting image is used, a different decimation sequence is obtained. A sketch of this selection loop is given after Figure 2.
Figure 2: Flowchart for the proposed algorithm.
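A minimal sketch of the selection loop, based on our reading of the flowchart (not the authors' code), is shown below. Here f_g stands for the cost function fG detailed in the next section, and frames can be any iterator over decoded frames, so at most three frames are held in memory at once.

```python
def decimate(frames, f_g):
    """Yield keyframes from a frame stream while buffering at most three
    frames: the current keyframe, the previous candidate and the incoming
    frame. f_g(a, b) is the decimation cost function fG for a frame pair."""
    it = iter(frames)
    keyframe = next(it)        # assumes frame 0 is usable; the paper instead
    yield keyframe             # initializes the first keyframe via [Beder06]
    prev_frame, prev_score = None, None
    for frame in it:
        score = f_g(keyframe, frame)
        if prev_score is not None and 0 < score < prev_score:
            # fG is positive and has started to decrease: the previous
            # candidate is a local maximum, so promote it to the next keyframe
            keyframe = prev_frame
            yield keyframe
            score = f_g(keyframe, frame)   # re-score against the new keyframe
        prev_frame, prev_score = frame, score
```

Writing the loop as a generator makes the streaming behavior explicit: frames are consumed one at a time and keyframes are emitted as soon as they are identified, so decimated frames never need to be buffered.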
We will first discuss the degeneracy and instability detection capacity of the cost function we designed. Two conditions, non-general camera motion and non-general position of the structure, are known as degenerate cases in which the epipolar geometry is not defined and methods based on estimating the fundamental matrix will fail: motion degeneracy and structure degeneracy. A motion degeneracy occurs when the camera only rotates about its center, in which case the epipolar geometry is undefined. A structure degeneracy occurs when matches are obtained only between scene points that are coplanar; in this case the epipolar geometry, namely the fundamental matrix, cannot be determined uniquely, as there is a 2-parameter family of solutions. In both degeneracy cases, however, a homography can describe the relative scene change. Comparing the residual errors of these two scene representations therefore gives insight into the quality of a frame pair for pose and structure estimation. This comparison is performed with Torr's geometric robust information criterion (GRIC) [Torr98], whose value also makes it possible to detect potential structural instabilities at small baselines. The lower the GRIC score, the better a model fits the given data. The relative GRIC (relGRIC) metric is thus a relative measure of how well the epipolar geometry describes the scene compared to a homography: the higher its relGRIC value, the better suited a frame pair is for pose and structure estimation.

For longer baselines, however, there must be enough good matches to reliably compute the epipolar geometry; in fact, the main error source at large baselines is matching quality. We must therefore analyze when the baseline may become too large and take this into account when weighting relGRIC. First, a weight cW is computed as the ratio between the number of inlier feature matches NI resulting from RANSAC-based fundamental matrix estimation and the total number of image features NF in the source keyframe. In general, cW tends to decrease as the baseline grows. Furthermore, to ensure that the matches are a good representation of the given scene and cover as much of it as possible, a ratio aR is introduced between the inlier match bounding box area cA and the total image area iA, where cA is approximated by the axis-aligned bounding box of all inlier matches. To obtain a reconstruction representative of the scene, aR should stay as large as possible.

The resulting cost function is essentially relGRIC weighted by the match-based metrics. The match-based component avoids large baselines and favors larger, more representative scene coverage, while relGRIC avoids pose degeneracies and structural mathematical instability. fG is thus high when relGRIC is high and the match weight is high, but decreases as the baseline grows, which provides a 'sweet spot' in baseline size with respect to all the error sources discussed above.
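The sketch below shows one plausible implementation of this cost function, based on our reading of the text and of Torr's GRIC [Torr98]. The noise scale sigma, the RANSAC thresholds, the use of OpenCV estimators, and the exact combination fG = relGRIC · cW · aR are our assumptions, not details confirmed by the paper.

```python
import numpy as np
import cv2

def gric(sq_errors, d, k, r=4, sigma=1.0):
    """Torr's GRIC [Torr98]: sum_i rho(e_i^2) + lam1*d*n + lam2*k, with
    rho(e^2) = min(e^2/sigma^2, lam3*(r - d)). For two-view data r = 4;
    d = 3, k = 7 for a fundamental matrix, d = 2, k = 8 for a homography."""
    n = len(sq_errors)
    lam1, lam2, lam3 = np.log(r), np.log(r * n), 2.0
    rho = np.minimum(sq_errors / sigma**2, lam3 * (r - d))
    return rho.sum() + lam1 * d * n + lam2 * k

def _sampson_sq(F, x1, x2):
    # squared Sampson distance per correspondence (x1, x2: Nx3 homogeneous)
    Fx1, Ftx2 = x1 @ F.T, x2 @ F
    num = np.einsum('ij,ij->i', x2, Fx1) ** 2
    return num / (Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2)

def _transfer_sq(H, x1, x2):
    # squared transfer error ||x2 - H x1||^2 in pixels
    p = x1 @ H.T
    return np.sum((x2[:, :2] - p[:, :2] / p[:, 2:3]) ** 2, axis=1)

def f_g(pts1, pts2, n_features, image_area, sigma=1.0):
    """fG for a frame pair: relGRIC weighted by c_W = N_I/N_F and a_R = c_A/i_A.
    pts1, pts2: Nx2 float arrays of matched points; n_features: number of
    features detected in the source keyframe."""
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 1.0)
    if F is None or H is None or mask is None:
        return -1.0
    ones = np.ones((len(pts1), 1))
    x1, x2 = np.hstack([pts1, ones]), np.hstack([pts2, ones])
    gric_f = gric(_sampson_sq(F, x1, x2), d=3, k=7, sigma=sigma)
    gric_h = gric(_transfer_sq(H, x1, x2), d=2, k=8, sigma=sigma)
    rel_gric = (gric_h - gric_f) / gric_h       # > 0 when F fits better than H
    inliers = pts1[mask.ravel() == 1]
    if len(inliers) < 2:
        return -1.0
    c_w = len(inliers) / float(n_features)      # match-to-feature ratio
    span = inliers.max(axis=0) - inliers.min(axis=0)
    a_r = (span[0] * span[1]) / image_area      # bounding-box coverage ratio
    return rel_gric * c_w * a_r
```

Because all three factors are multiplied, a frame pair scores highly only when the epipolar model clearly outperforms a homography (no degeneracy, sufficient baseline) and the matches are both numerous and well spread over the image (baseline not too large).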
Results
The presented frame decimation algorithm has been tested on a number of real and synthetic scenes covering different scene types, camera motions and varying baseline-to-depth ratios. Several of the real test sequences are publicly available. The Stockton dataset, for example, consists of aerial imagery taken while flying in a circle around a city. The Leuven Castle dataset consists of a hand-held camera moving around a building. The Sagalassos Medusa Head dataset is an image sequence taken by a hand-held camera moving in a half-circle around an archaeological artifact, where the movement is very jittery at times. Figure 3 shows the keyframes obtained with our algorithm for the castle-P30 [Strecha08], Stockton, Leuven Castle [Pollefeys04], Sagalassos Medusa Head [Pollefeys04] and Corridor [Fitzgibbon98] datasets, respectively.
Figure 3: Obtained keyframes for the 'castle-P30', 'Stockton', 'Leuven Castle', 'Sagalassos Medusa Head' and 'Corridor' datasets, respectively, from top to bottom.
To show the result of our frame decimation more visually, the 'undecimated' video (below left) shows the first few seconds of the Sagalassos Medusa Head sequence [Pollefeys04], with a total of 110 frames. On the right-hand side we show the decimated sequence, with a total of 24 frames. Notice how the camera position in the undecimated video seems to stall at times; this lack of relative movement is precisely one of the issues that can affect multi-view pose and structure estimation. In the decimated version, transitions are much more noticeable and baselines are better suited for reconstruction. About 22% of the frames (24 of 110) remain after decimation. The original and complete Sagalassos Medusa Head video can be found at http://www.cs.unc.edu/~marc/.
An experiment was conducted to show that our approach is independent of the frame rate of the input data. For this, the Sagalassos Medusa Head sequence was decoded at different frame rates: 5, 10, 15 and 20 fps, and frame decimation was applied to each of the resulting frame sequences. Figure 4 shows that the number of keyframes remains nearly constant across the different decodings of the input video stream. The small variation can be explained by the fact that the introduced algorithm does not find a unique frame decimation, as it operates locally and sequentially, and by the fact that at lower frame rates some potential keyframes are already "excluded" by the lack of spatial information.
Figure 4: Number of keyframes versus number of frames at different decodings, in frames-per-second.
To show how well our frame decimation approach is suited for sequential multi-view reconstruction, the average reprojection error (in pixels) after every addition of a new keyframe to the reconstruction was measured for the Sagalassos Medusa Head sequence, as shown in Figure 5. Keyframes are represented by crosses; all other frames are decimated by the algorithm. The reprojection error is small overall and nearly constant over time, which shows clearly that all the introduced keyframes are well-suited for reconstruction. The measured metric is sketched after Figure 5.
Figure 5: Average reprojection error when applying sequential scene reconstruction using only extracted keyframes.
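For clarity, this is the standard quantity plotted in Figure 5, in a minimal sketch (our formulation, not the authors' code): the mean pixel distance between the projections of the triangulated points and their measured image observations.

```python
import numpy as np

def mean_reprojection_error(points3d, cameras, observations):
    """points3d: Nx4 homogeneous world points; cameras: list of 3x4 projection
    matrices P; observations[c]: Nx2 measured image points for camera c."""
    errs = []
    for P, obs in zip(cameras, observations):
        proj = points3d @ P.T                    # project into camera c
        proj = proj[:, :2] / proj[:, 2:3]        # dehomogenize to pixels
        errs.append(np.linalg.norm(proj - obs, axis=1))
    return float(np.mean(np.concatenate(errs)))  # average error in pixels
```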
For further details, the ISVC oral presentation corresponding to this paper can be found here.
Related Publications
Daniel Knoblauch, Mauricio Hess-Flores, Mark A. Duchaineau, Kenneth I. Joy, Falko Kuester. Non-Parametric Sequential Frame Decimation for Scene Reconstruction in Low-Memory Streaming Environments. ISVC 2011, Part I, LNCS 6938, pp. 363-374. Springer, Heidelberg (2011).
[Beder06] Beder, C., Steffen, R.: Determining an Initial Image Pair for Fixing the Scale of a 3D Reconstruction from an Image Sequence. In: Pattern Recognition. Volume 4174 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2006) 657-666.
[Fitzgibbon98] Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D Model Construction for Turn-table Sequences. In Koch, R., Van Gool, L., eds.: 3D Structure from Multiple Images of Large-Scale Environments, LNCS 1506. Springer-Verlag (1998) 155-170.
[Pollefeys04] Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual Modeling with a Hand-held Camera. International Journal of Computer Vision 59 (2004) 207-232.
[Strecha08] Strecha, C., von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On Benchmarking Camera Calibration and Multi-view Stereo for High Resolution Imagery. In: CVPR 2008 (2008).
[Torr98] Torr, P.H.: Geometric Motion Segmentation and Model Selection. Philosophical Transactions: Mathematical, Physical and Engineering Sciences 356 (1998) 1321-1340.