Conventional photo and video cameras were constructed to capture a view of the world similar to the one our own eyes provide. It turns out, however, that these pinhole cameras are not necessarily the optimal cameras for processing visual information with a machine. Inspired by nature's task-specific eye designs, we define a framework for camera design with regard to 3D motion estimation.
When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own, that is, images acquired by planar eyes. But these are clearly not the only eyes that exist; the biological world reveals a large variety of designs. An eye or camera is a mechanism that forms images by focusing light onto a light-sensitive surface (retina, film, CCD array, etc.), and different eyes or cameras are obtained by varying the elements of this mechanism.
Evolutionary considerations tell us that the design of a system's eye is related to the visual tasks the system has to solve. The way images are acquired determines how difficult it is to perform a given task, and since systems have to cope with limited resources, their eyes should be designed to optimize the subsequent image processing as it relates to particular tasks.
We can model a generalized camera as a combination of a filter and a sampling pattern in the space of light rays. The filter models the effects of the optical system, while the sampling pattern is determined by the geometric properties of the camera. Such a model allows us to phrase the problem of camera design as finding the filter and sampling pattern in light-ray space that optimally facilitate the task at hand.
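As a concrete, much simplified illustration of this ray-space view, the following Python sketch models a camera as a sampling pattern (which rays are measured) plus a filter (how the optics average radiance over nearby rays). All function names, the pinhole and polydioptric sampling patterns, and the Gaussian filter are our own illustrative assumptions, not a fixed implementation.

```python
import numpy as np

# Sketch: camera = sampling pattern over light rays + filter over nearby rays.
# The scene is abstracted as a radiance function defined on rays.

def pinhole_sampling(fov_deg, n=9):
    """Rays through a single viewpoint, spanning the given field of view."""
    half = np.radians(fov_deg) / 2.0
    angles = np.linspace(-half, half, n)
    origins = np.zeros((n, 3))                      # one common center
    dirs = np.stack([np.sin(angles),
                     np.zeros(n),
                     np.cos(angles)], axis=1)       # unit directions
    return origins, dirs

def polydioptric_sampling(fov_deg, n=9, baseline=0.1):
    """Rays from many nearby viewpoints (multiple dioptric elements)."""
    o, d = pinhole_sampling(fov_deg, n)
    o = o + np.linspace(-baseline, baseline, n)[:, None] * np.array([1., 0., 0.])
    return o, d

def capture(origins, dirs, radiance, sigma=0.02):
    """Filtered measurement: average radiance over slightly perturbed rays,
    standing in for the optical point-spread function."""
    samples = []
    for o, d in zip(origins, dirs):
        jitter = d[None, :] + sigma * np.random.randn(32, 3)
        jitter /= np.linalg.norm(jitter, axis=1, keepdims=True)
        samples.append(radiance(o, jitter).mean())
    return np.array(samples)

# Example scene: radiance depends on ray direction only (a distant texture).
radiance = lambda o, d: np.cos(8.0 * np.arctan2(d[:, 0], d[:, 2]))

o, d = polydioptric_sampling(fov_deg=60.0)
print(capture(o, d, radiance))
```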
To evaluate and compare different eye designs scientifically, on mathematical grounds, we chose as our problem the recovery of space-time descriptions of a scene from image sequences. More specifically, we want to determine how we ought to collect images of a (dynamic) scene to best recover its shapes and actions from video. This problem has wide implications for a variety of applications, not only in vision and recognition, but also in navigation, virtual reality, tele-immersion, and graphics. At the core of this capability is the celebrated module of structure from motion, so our question becomes: what eye should we use to collect video so that we can subsequently solve the structure-from-motion problem in the best possible way?
By examining the differential structure of the space of time-varying light rays (as described in the framework of plenoptic video geometry), we relate different known and new camera models to the spatio-temporal structure of the observed scene.
The field of view of a camera determines how robustly we can estimate the camera's 3D motion. For example, the estimation for a small field of view (FOV) camera is ill-posed, which manifests itself in ambiguities that are explained and demonstrated below. These ambiguities disappear when we increase the field of view.
Accurate ego-motion estimation is essential if one wants to build accurate models of the world from video, as can be seen in the following movie (AVI, 3.2 MB), which demonstrates how small changes in the localization of the feature points and the camera positions and orientations can have dramatic effects on the accuracy of the reconstruction. The maximum localization error in this movie was 5 pixels for the correspondences and a two percent relative error for the camera position with respect to the object distance.
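This effect is easy to reproduce numerically. The sketch below is a minimal example of our own, not the code behind the movie: it triangulates a synthetic point cloud from two views, perturbs the correspondences by up to 5 pixels and the camera position by about 2% of the object distance, and reports how much the reconstruction degrades. The scene geometry, focal length, and noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(P, X):
    """Project homogeneous 3D points X (N,4) with a 3x4 camera matrix P."""
    x = (P @ X.T).T
    return x[:, :2] / x[:, 2:3]

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of corresponding image points."""
    X = []
    for (u1, v1), (u2, v2) in zip(x1, x2):
        A = np.stack([u1 * P1[2] - P1[0],
                      v1 * P1[2] - P1[1],
                      u2 * P2[2] - P2[0],
                      v2 * P2[2] - P2[1]])
        _, _, Vt = np.linalg.svd(A)
        X.append(Vt[-1] / Vt[-1, 3])
    return np.array(X)

# Synthetic scene: points ~10 units away, two cameras 1 unit apart.
f = 800.0                                   # focal length in pixels
K = np.array([[f, 0, 0], [0, f, 0], [0, 0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])

X = np.hstack([rng.uniform(-2, 2, (100, 2)),
               rng.uniform(9, 11, (100, 1)),
               np.ones((100, 1))])
x1, x2 = project(P1, X), project(P2, X)

# Perturb correspondences by up to 5 px and the camera position by
# 2% of the ~10-unit object distance, as in the movie.
x1n = x1 + rng.uniform(-5, 5, x1.shape)
x2n = x2 + rng.uniform(-5, 5, x2.shape)
P2n = K @ np.hstack([np.eye(3), np.array([[-1.0 + 0.02 * 10], [0], [0]])])

err = np.linalg.norm(triangulate(P1, P2n, x1n, x2n)[:, :3] - X[:, :3], axis=1)
print("mean 3D error:", err.mean(), "units (scene depth is ~10)")
```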
It is well known that the stability of 3D motion estimation for a pinhole camera strongly depends on the size of the field of view. To see the effect, take a look at the following movie (AVI, 5.2 MB), which illustrates the confusion between rotation and translation for a small field of view camera.
On the left the camera is undergoing translational motion, on the right rotational motion. While the movie is playing, examine the top and side views of the cameras and try to decide, based only on the image information, which views come from the translating camera and which from the rotating camera.
We see that if we only have access to the top view, the estimation is very ambiguous. In contrast, if we also have access to a side view, the confusion between rotation and translation disappears.
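Numerically, the confusion shows up as nearly identical image motion fields. The hedged sketch below uses the standard instantaneous motion field of a pinhole camera (in the Longuet-Higgins and Prazdny form) to compare the flow of a sideways translation with that of a rotation about the vertical axis; all parameter values (focal length, depth, motion magnitudes) are illustrative assumptions.

```python
import numpy as np

def flow(x, y, f, Z, t, w):
    """Instantaneous motion field of a pinhole camera
    (Longuet-Higgins & Prazdny): image coords x, y, focal length f,
    depth Z, translation t = (U, V, W), rotation w = (a, b, c)."""
    U, V, W = t
    a, b, c = w
    u = (-f * U + x * W) / Z + a * x * y / f - b * (f + x**2 / f) + c * y
    v = (-f * V + y * W) / Z + a * (f + y**2 / f) - b * x * y / f - c * x
    return u, v

def similarity(fov_deg, f=1.0, Z=10.0, n=21):
    """Cosine similarity between the flow of a translation along X
    and that of a rotation about Y, sampled over the image plane."""
    half = f * np.tan(np.radians(fov_deg) / 2)
    x, y = np.meshgrid(np.linspace(-half, half, n), np.linspace(-half, half, n))
    ut, vt = flow(x, y, f, Z, (1.0, 0, 0), (0, 0, 0))     # translate along X
    ur, vr = flow(x, y, f, Z, (0, 0, 0), (0, 0.01, 0))    # rotate about Y
    ft = np.concatenate([ut.ravel(), vt.ravel()])
    fr = np.concatenate([ur.ravel(), vr.ravel()])
    return ft @ fr / (np.linalg.norm(ft) * np.linalg.norm(fr))

# For a small FOV the two flow fields are almost indistinguishable
# (similarity near 1); widening the FOV pulls them apart.
for fov in (10, 60, 120, 170):
    print(f"FOV {fov:3d} deg: similarity {similarity(fov):.6f}")
```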
The reason for this sensitivity is easily explained. If we examine the following two illustrations, we see that the measurements in the images (here we show image gradients, but they might just as well be optical flow vectors or feature tracks) can only be made in the plane perpendicular to the image location vector (left illustration below).
Usually the rigid motion parameters are determined by fitting a parameterized instantaneous motion model to the observed measurements; that is, we try to find the motion parameters that explain the image measurements most accurately according to some error criterion. In the illustration on the right above, we see that we cannot determine the components of the motion parameter vectors that are parallel to r. If we have a small field of view, meaning that the vectors r span only a small part of the sphere of directions, then the motion estimation will be subject to the so-called line ambiguity.
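A minimal numerical sketch of this effect, using only the rotational part of the motion and our own illustrative construction: we estimate a rotation w from measurements of the form w x r made at viewing directions r inside a cone. The smallest singular value of the stacked least-squares system shows how the component of w parallel to the mean viewing direction becomes unobservable as the field of view shrinks.

```python
import numpy as np

def cross_matrix(r):
    """Matrix form of the cross product: cross_matrix(r) @ w == np.cross(r, w)."""
    return np.array([[0, -r[2], r[1]],
                     [r[2], 0, -r[0]],
                     [-r[1], r[0], 0]])

def conditioning(fov_deg, n=200, seed=0):
    """Smallest singular value of the system that estimates a rotation w
    from measurements w x r at directions r inside a cone of given FOV."""
    rng = np.random.default_rng(seed)
    half = np.radians(fov_deg) / 2
    # Sample unit directions uniformly on the spherical cap around +Z.
    cos_t = rng.uniform(np.cos(half), 1.0, n)
    phi = rng.uniform(0, 2 * np.pi, n)
    sin_t = np.sqrt(1 - cos_t**2)
    rs = np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t], axis=1)
    A = np.vstack([cross_matrix(r) for r in rs])
    return np.linalg.svd(A, compute_uv=False).min()

# A near-zero singular value means a whole line of motion parameters
# explains the data almost equally well: the line ambiguity.
for fov in (10, 60, 180, 360):
    print(f"FOV {fov:3d} deg: smallest singular value {conditioning(fov):.4f}")
```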
The line ambiguity can be seen in the following example, where we compare the motion estimation for the individual cameras of the Argus Eye with an estimation that uses information from all the cameras simultaneously and thus effectively forms a large field of view camera.
There is a noticeable valley in the error surface for the individual cameras due to the line ambiguity, while the ambiguity vanishes when we use all the cameras.
This allows us to define a hierarchy of camera designs, where the order is determined by the stability and complexity of the computations necessary to estimate structure and motion. The dioptric axis (number and spacing of viewpoints) determines whether the 3D motion estimation is scene-dependent or scene-independent, and the field of view axis determines the noise sensitivity of the estimation. At the low end of this hierarchy is the standard planar pinhole camera, for which the structure from motion problem is scene dependent and ill-posed. At the high end is a camera, which we call the full field of view polydioptric camera, for which the problem is scene independent and stable. In between are multiple-view cameras with a large field of view, which we have built, as well as catadioptric panoramic sensors and other omnidirectional cameras. This classification is summarized in the following two figures: