
Spatio-temporal Models from Video


Using a multi-resolution hierarchy of temporally varying subdivision surfaces as a modeling primitive, we recover the shape and non-rigid 3D motion of a talking head observed from multiple viewpoints. The resulting models capture both the shape and the motion accurately.
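The multi-resolution hierarchy rests on refining a coarse triangle mesh into progressively finer levels. As an illustrative sketch only, the snippet below performs one level of 1-to-4 midpoint subdivision; the actual scheme used in the paper also smooths vertex positions, which is omitted here for brevity.

```python
import numpy as np

def subdivide(vertices, faces):
    """One level of 1-to-4 midpoint subdivision of a triangle mesh.

    Each triangle is split into four by inserting edge midpoints,
    producing the next level of a multi-resolution hierarchy.
    (Sketch only: the vertex-smoothing step of a full subdivision
    scheme is not included.)
    """
    vertices = [np.asarray(v, dtype=float) for v in vertices]
    midpoint = {}  # edge (i, j) -> index of its new midpoint vertex

    def mid(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint:
            vertices.append(0.5 * (vertices[i] + vertices[j]))
            midpoint[key] = len(vertices) - 1
        return midpoint[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return vertices, new_faces

# A single triangle refines into four triangles over six vertices.
verts, tris = subdivide([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)])
```

Repeated application yields the finer mesh levels on which shape and motion are estimated.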


We demonstrate a method to compute three-dimensional (3D) motion fields on a face. Eleven synchronized and calibrated cameras are positioned around a talking person (see Fig.4) and observe the speaker's head in motion (Figs.1-3). We represent the head as a multi-resolution subdivision surface, which is fitted in a global optimization step to the spatio-temporal contour and multi-camera stereo data derived from all images. The non-rigid displacement of the mesh from frame to frame, the 3D motion field, is initialized from the spatio-temporal derivatives in all the images and then refined together with the shape using spatio-temporal stereo information. We integrate these cues over time, thus producing an animated representation of the talking head. Our ability to estimate 3D motion fields points to a new framework for the study of action: the 3D motion fields can serve as an intermediate representation, which can be analyzed using geometrical and statistical tools to extract representations of generic actions.
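The multi-camera stereo data underlying the fit reduces, at each surface point, to the classical problem of triangulating a 3D position from its projections in several calibrated views. As a minimal sketch of that building block (not the paper's full global optimization, which also incorporates contour and motion cues), here is linear DLT triangulation from a set of 3x4 camera matrices:

```python
import numpy as np

def triangulate(projections, image_points):
    """Linear (DLT) triangulation of one 3D point from calibrated views.

    projections  : list of 3x4 camera projection matrices P_i
    image_points : list of 2D observations (x_i, y_i), one per view

    Each view contributes two rows, x*P[2]-P[0] and y*P[2]-P[1], to a
    homogeneous system A X = 0, whose least-squares solution is the
    right singular vector of A with the smallest singular value.
    """
    rows = []
    for P, (x, y) in zip(projections, image_points):
        P = np.asarray(P, dtype=float)
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize to Euclidean coordinates

# Two toy cameras: identity view and a unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = triangulate([P1, P2], [(0.0, 0.0), (-0.2, 0.0)])  # -> (0, 0, 5)
```

With eleven views, the extra rows simply over-determine the same system, making the estimate robust to noise in individual cameras.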

Fig.1: Camera 34 (223kb mpg) Fig.2: Camera 35 (290kb mpg)
Fig.3: Camera 36 (171kb mpg) Fig.4: Complete camera setup

Results of the Spatial and Temporal Structure Reconstruction.

The final result of the reconstruction algorithm can be seen in the 360-degree rotation around the model (Fig.5); unfortunately, only two color cameras were available, which accounts for the mixture of gray-scale and color textures.
The animations in Figs.6-8 demonstrate the accurate recovery of the 3D motion flow for each point on the face (Fig.6).
The animation in Fig.7 shows the reconstructed spatio-temporal model texture-mapped with the image information from the current frame, which gives an accurate recreation of the action from arbitrary new viewpoints. In contrast, the animation in Fig.8 propagates the texture map extracted solely from frame 1, demonstrating the accuracy of the extracted 3D motion flow.

Fig.5: 360-degree spin around the head model (1.5MB avi) Fig.6: Full motion flow field computed (773kb mpg)
Fig.7: Full flow animation of the face over 80 frames (1.5MB avi) Fig.8: Full flow animation by propagating the texture map from the first frame (1.5MB avi)

Further Information.

For more details about the multi-resolution mesh representation and the reconstruction algorithm, please read the accompanying paper:
Spatio-temporal stereo using multi-resolution subdivision surfaces.
Jan Neumann and Yiannis Aloimonos.
International Journal of Computer Vision, 47(1/2/3):181-193, 2002.