Non-rigid Shape Modeling and Tracking by Factorization in Stereo-Motion
Tracking an object and recovering its 3D shape (especially non-rigid shape) from image sequences is a fundamental problem in computer vision, with applications such as scene modeling, robot navigation, object recognition and virtualized reality. Traditionally there are two vision-based approaches to 3D reconstruction: visual motion [Ullman79] and stereo vision [Dhond89]. Visual motion, also called structure from motion, recovers 3D information from an image sequence acquired under relative motion between the camera and the object. Stereo vision (structure from stereo) estimates 3D structure by triangulation from two widely displaced views of the same object. Both methods depend on solving the notorious correspondence problem.
In structure from motion, Tomasi and Kanade [Tomasi92] proposed one of the most influential approaches, the factorization method for rigid objects under orthographic projection. Many extensions have been put forward: Poelman and Kanade [Poelman94] extended it to the para-perspective or affine camera, Triggs [Triggs96] generalized the technique to full perspective, sequential versions were given in [Morita97] [Fujiki00], and the use of various features in factorization can be found in [Morris98] [Quan97] [Ma98] [Ke01]. Costeira and Kanade relaxed the rigidity constraint in a multi-body factorization method [Costeria98]; Han and Kanade [Han00] proposed a similar method for linearly moving objects that requires the camera intrinsic parameters in advance, and soon extended it to uncalibrated views [Han01]. Irani and Anandan [Irani00] proposed a covariance-weighted factorization, which factors noisy feature correspondences with a high degree of directional uncertainty into structure and motion.
Previously, 3D non-rigid motion and structure recovery was based on assumptions about the scene, for example in the form of a deformable model [Pentland91] [Metaxas93] or the assumption that the motion minimizes the deviation from a rigid-body motion [Ullman84]; see [Penna94] for a survey of monocular non-rigid motion estimation. Zhou and Kambhamettu [Zhou00] used extended superquadrics as global motion models to recover the structure and non-rigid motion of elastic objects such as human faces: the object is segmented into many small areas assumed to undergo similar non-rigid motion, and local analysis is performed to recover small details for each area. Bregler et al. [Bregler00] [Torresani01] [Torresani02] proposed the first factorization-based framework for 3D reconstruction of non-rigid or deformable objects. Shortly afterwards Brand [Brand01a] [Brand01b] proposed a flexible factorization approach which minimizes the deformations relative to the mean shape by introducing an optimal correction matrix.
Researchers have also tackled the topic of augmenting "structure from motion" with stereo information. Some of this work is feature-based; the rest, known as direct methods, uses spatial and temporal image gradient information. One problem is how to fully utilize the redundant information in stereo-motion analysis, but in practice the more important issue is how to make the two basic cues cooperate with each other.
Unfortunately, nearly all stereo-motion algorithms assume that the scene is rigid; only a few stereo-motion papers have considered non-rigid motion. [Liao1997] used a relaxation-based algorithm to cooperatively match features in both the temporal and spatial domains. [Malassiotis97] used a grid acting as a deformable model in 3D non-rigid structure recovery. [Vedula99] discussed the scene flow, i.e. the 3D motion field of points in the world; they considered the computation of dense non-rigid scene flow from the optic flow of stereo image sequences, but the stereo correspondences must be known. Recently Carceroni and Kutulakos [Carceroni01] presented an algorithm that computes the motion and shape of dynamic surface elements to represent the spatio-temporal structure of a scene observed by multiple cameras under known lighting conditions. [Gokturk01] used a stereo system to capture training data for learning human facial shape, but the stereo correspondences are obviously not guaranteed. [Huang02] proposed a facial tracking method for stereo sequences, where the problem of stereo correspondence is still not completely solved. Neumann and Aloimonos [Neumann02] put forward a method to compute three-dimensional (3D) motion fields on a non-rigid face using multiple (11) calibrated cameras: they fit the head, in a global optimization step, to the spatio-temporal contour and multi-camera stereo data derived from all images; the non-rigid displacement of the mesh, i.e. the 3D motion field, is initialized from the spatio-temporal derivatives in all the images and then refined together with the shape using spatio-temporal stereo information.
Although some recent papers have discussed human face tracking with a stereo rig [Xu98] [Matsumoto00] [Gorodnichy01] [Yang02] [Morency02], they only track the head pose/orientation under the implicit assumption that the human head is a rigid body.
We propose a stereo-motion analysis framework for recovering 3D non-rigid shape and performing 3D non-rigid tracking based on factorization. After performing singular value decomposition (SVD) on the well-organized stereo-motion measurement matrix, we can factorize it into 3D basis shapes and configuration weights, stereo geometry, and rigid motion parameters. Based on this, we can infer stereo correspondences from motion correspondences, requiring only that at least 3K point stereo correspondences (where K is the dimension of the shape basis space) are established by classical stereo matching techniques such as epipolar constraints and disparity ranges. The framework naturally inherits the potential uses of the rank constraints as in [Bregler00]. It also offers advantages such as simpler correspondence and accurate reconstruction even with short sequences.
Below we also utilize the rank constraints to help establish stereo correspondence.
Applying the non-rigid motion model to the two cameras separately, one obtains two image measurement matrices. Because the shape is centered at the origin, we can omit the translation component in the relationship between the two rigid motion representations for the two camera coordinate frames. Obviously, the combined stereo-motion measurement matrix is of rank at most 3K. Based on this rank property, we can infer stereo matching from motion correspondences, as sketched below.
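To make the rank property concrete, the following minimal numpy sketch (the function names, dimensions, and fixed inter-camera rotation are illustrative assumptions, not the paper's implementation) builds a noise-free stereo-motion measurement matrix under the linear basis-shape model and verifies that its rank does not exceed 3K; the weak-perspective scale is folded into the weights.

```python
import numpy as np

def random_rotation(rng):
    # Random rotation via QR decomposition of a Gaussian matrix (det fixed to +1).
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.linalg.det(q))

def stereo_measurement_matrix(K=2, P=40, F=15, seed=0):
    """Stack left/right weak-perspective projections of S_t = sum_k l_tk * B_k
    over F frames into a 4F x P matrix; its rank is at most 3K."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((K, 3, P))        # K basis shapes, centered at the origin
    L = rng.standard_normal((F, K))           # per-frame configuration weights
    R_lr = random_rotation(rng)               # fixed rotation between the two cameras
    rows = []
    for t in range(F):
        S_t = np.tensordot(L[t], B, axes=1)   # 3 x P deformed shape at time t
        R_t = random_rotation(rng)            # rigid motion of frame t
        rows.append((R_t @ S_t)[:2])          # left view: top two rotation rows
        rows.append((R_lr @ R_t @ S_t)[:2])   # right view
    return np.vstack(rows)                    # 4F x P measurement matrix

W = stereo_measurement_matrix(K=2)
print(np.linalg.matrix_rank(W))               # prints 6 = 3K
```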
Bregler et al. formulated an extension of the factorization technique to multiple views. The additional constraint is the assumption that all views share the same deformation coefficients at a given time instant. Unfortunately this assumption is too strict: the shape basis weights implicitly encode the weak-perspective scaling, and it is not easy to guarantee that the scaling factors of multiple views are approximately equal. Moreover, the stereo correspondences must be known prior to factorization.
A stereo rig of cameras is constructed; it observes the object undergoing a rigid motion and a deforming motion, during which N pairs of images are taken. Distinct feature points are extracted from the stereo image sequences and tracked separately in each sequence using a motion correspondence method. At this point the stereo correspondences have not yet been established, while the estimated dense motion correspondences are assumed to be mostly correct. With such motion correspondences, the measurement matrices for both views can be constructed; unlike the previous matrices, their columns have not been properly ordered. Since the combined matrix is of rank at most 3K, a basis of the 3K-dimensional column subspace can be set up as long as a minimum of 3K linearly independent columns is available. All the other columns are then inferred from this basis, as in the sketch below.
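A minimal sketch of this column-inference step, assuming the 3K (or more) known stereo pairs have already been stacked into a 4F x 3K basis matrix A (top half from the left view, bottom half from the right view); the names below are ours and purely illustrative.

```python
import numpy as np

def infer_stereo_matches(A, W_left, W_right):
    """Infer stereo correspondences from motion correspondences.

    A       : 4F x 3K basis built from the >= 3K known stereo-matched columns
    W_left  : 2F x P_l left-view trajectories (columns), arbitrary order
    W_right : 2F x P_r right-view trajectories (columns), arbitrary order
    """
    half = W_left.shape[0]
    A_top, A_bot = A[:half], A[half:]
    # Coefficients of every left trajectory in the 3K-dimensional subspace.
    C, *_ = np.linalg.lstsq(A_top, W_left, rcond=None)
    W_right_pred = A_bot @ C                  # predicted right-view trajectories
    # Mean squared prediction error between each predicted and observed right column.
    err = ((W_right_pred[:, :, None] - W_right[:, None, :]) ** 2).mean(axis=0)
    return err.argmin(axis=1), err            # best right column for each left column
```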
The predicted result may not be exact due to noise, so a measure for feature matching is needed: normally we compute the least-mean-squares error (LMSE) over all positions in the entire image sequence with respect to the predicted trajectory. However, even if this measure is small enough, we cannot guarantee that the stereo match is correct, so an additional measure based on windowed template matching is taken into account: the average normalized correlation must also be high enough; if not, the image feature is ignored. Finally, all the inferred stereo correspondences are grouped together to re-estimate the basis A, which should then be more accurate. This process can be iterated until convergence.
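A hedged sketch of the two acceptance tests; the thresholds are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def normalized_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation between two equal-size image windows."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12))

def accept_match(lmse, ncc_scores, lmse_thresh=2.0, ncc_thresh=0.8):
    """Keep an inferred stereo pair only if the subspace prediction error (LMSE)
    is small and the average windowed correlation over the sequence is high."""
    return lmse < lmse_thresh and float(np.mean(ncc_scores)) > ncc_thresh
```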
Once all the stereo correspondences are obtained, we reconstruct the 3D deformable shape via triangulation from the views of the calibrated stereo cameras. We can then calculate by factorization the 3D shape basis from the measurement matrix of 3D point positions, and extract the pose parameters and shape basis configuration weights using rank-1 constraints. Different from Bregler's framework, this time we can extract all nine components of the rotation matrix rather than only the top two rows. Recovering the pose and the original configuration weights realizes 3D non-rigid tracking.
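These two reconstruction steps can be sketched as follows, using standard linear triangulation plus a truncated SVD; the rank-1 extraction of per-frame rotations and weights is omitted, and all names are illustrative.

```python
import numpy as np

def triangulate(P_left, P_right, x_left, x_right):
    """Linear (DLT) triangulation of one matched feature from a calibrated stereo pair.
    P_left, P_right are 3x4 projection matrices; x_left, x_right are (u, v) pixels."""
    A = np.vstack([
        x_left[0]  * P_left[2]  - P_left[0],
        x_left[1]  * P_left[2]  - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    X = np.linalg.svd(A)[2][-1]               # right singular vector of smallest value
    return X[:3] / X[3]                       # inhomogeneous 3D point

def factorize_shape_basis(S_all, K):
    """Factor the centered 3F x P matrix of triangulated 3D points into a rank-3K
    product: per-frame weighted rotations (3F x 3K) times stacked basis shapes (3K x P)."""
    U, s, Vt = np.linalg.svd(S_all, full_matrices=False)
    root = np.sqrt(s[:3 * K])
    return U[:, :3 * K] * root, root[:, None] * Vt[:3 * K]
```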
In our experiments, we randomly generate 3D basis shapes, their corresponding weights, and rigid (rotation) motions to create ground-truth data of multiple features in multiple stereo frames. The basis shapes are generated by sampling points uniformly inside a cube of size 60 cm centered at the origin. The first basis shape is given the largest weight (with the sum of all weights equal to one) in order to guarantee that the overall shape has a strong rigid component. The three Euler angles of each rotation matrix are sampled uniformly in [-30°, 30°], and the generated object points are imaged by the stereo cameras under full perspective and quantized to pixels in each image.
To make the weak perspective model feasible, we ensure that the depth variation between the closest and furthest object points is much smaller than the distance to the stereo camera rig. Meanwhile, the two optic axes of the stereo cameras converge at the 3D centroid of the object points (i.e. the fixation point). Gaussian noise of zero mean and 1 pixel variance is added to the final measurement matrix (quantization error is also present). In the stereo vision configuration shown in Figure 1, the deformable object is located about 4.5 m away from the stereo cameras, which are 60 cm apart. Each camera has a focal length of 15m and a resolution of 512x512.
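A rough sketch of this synthetic data generation; for simplicity it uses a parallel-axis stereo pair rather than verged cameras, and the pixel focal length is an arbitrary illustrative value rather than the paper's.

```python
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix from Euler angles in radians (Z * Y * X convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def synthetic_stereo_frames(K=3, P=50, F=20, f_pix=1500.0, baseline=0.6,
                            distance=4.5, res=512, noise=1.0, seed=1):
    rng = np.random.default_rng(seed)
    B = rng.uniform(-0.3, 0.3, size=(K, 3, P))     # basis shapes inside a 60 cm cube

    def project(Xc):
        uv = f_pix * Xc[:2] / Xc[2] + res / 2.0    # full-perspective projection
        uv += rng.normal(0.0, noise, uv.shape)     # zero-mean Gaussian pixel noise
        return np.round(uv)                        # quantization to pixels

    left, right = [], []
    for _ in range(F):
        w = rng.uniform(0.0, 1.0, size=K)
        w[0] += K                                  # dominant first (rigid) component
        w /= w.sum()                               # weights sum to one
        S = np.tensordot(w, B, axes=1)             # 3 x P deformed shape
        R = euler_to_rotation(*np.deg2rad(rng.uniform(-30, 30, size=3)))
        X = R @ S + np.array([[0.0], [0.0], [distance]])   # ~4.5 m from the rig
        X_r = X - np.array([[baseline], [0.0], [0.0]])     # right camera 60 cm apart
        left.append(project(X))
        right.append(project(X_r))
    return left, right
```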
We thank Dr. Zhengyou Zhang at Microsoft Research for allowing us to use the test stereo sequences. Their experimental setup is different: the two digital video cameras are mounted vertically and connected to a PC through IEEE 1394 links. The first camera's coordinate system is taken as the world coordinate system, so its extrinsic parameters become (R, T) = (I, 0). The human face recordings in the collected videos are captured at a resolution of 320x240 and 30 frames per second. They contain rigid head motions as well as non-rigid eye/eyebrow/mouth facial motions.
It is difficult to estimate optical flow from facial motions using traditional gradient-based or template matching methods because the facial surface is smooth and its motion is non-rigid and subject to environmental illumination changes. We therefore use a Bézier volume model-based face tracker to obtain the optical flow over the face area [Tao98]. For each camera, we track the facial motion using an independent face tracker with a dense 3D geometrical mesh model. Our first experiment reconstructs the facial structure from rigid facial motion: in the videos, the human head moves up and backward within 30 frames. A pair of stereo images with the tracked points depicted is shown in Figure 1.
Figure 1: Tracking result for rigid motion.
Since the face trackers are applied independently to the video sequences of the two cameras, we do not know the correspondence between the mesh points of the face models used by the two trackers, except for the points at the eye corners and mouth corners. We identify these points as distinct feature points (shown in red), and the correspondences of the remaining points are inferred using the basis factorized from the optical flow vectors of these distinct feature points. In the rigid motion case, we take the number of bases K=3. Figure 2 shows the found correspondences between the optical flows estimated by the two face trackers: the red trajectories are the optical flow of the mesh points mapped from the upper camera view to the lower camera view, and the green trajectories are the corresponding mesh point trajectories found in the lower camera's video. After the correspondence is established, the 3D face geometrical structure at each time instant can be reconstructed. Figure 3 shows the reconstructed mesh points in 3D space.
Figure 2: Corresponding optical flow trajectories. Figure 3: The reconstructed points in 3D space.
To verify our approach with non-rigid motion, we further identified a stereo video sequence in which the subject opens the mouth within 8 frames. As shown in Figure 4, the distinct facial features (depicted in red) are the eye corners, mouth corners, nostrils, and the centers of the upper and lower lip. As the non-rigid motion only involves the opening mouth, we take K=6 in this case.
Figure 4: Tracking result for non-rigid motion.
The found correspondences of the optical flow trajectories are shown in Figure 5; most of the matched trajectories are caused by the opening mouth. The reconstructed 3D face geometric structure is shown in Figure 6, where the purple dots are the reconstructed 3D points.
Figure 5: Corresponding optical flow trajectories. Figure 6: The reconstructed 3D face.