Human Body Movement Learning and Tracking in the Monocular Sequence of Images
Single Arm Movement Learning: see the demo for the single arm.
Upper Body Movement Learning: see the demo for the upper body.
Because of its many potentially important applications, "Looking at People" is currently one of the most active application domains in computer vision [Aggarwal97] [Gavrila99]. This trend is motivated by a wide spectrum of applications, such as smart surveillance, virtual reality, human-computer interaction (HCI), content-based video indexing, model-based image coding, and video conferencing. From a technical point of view, the domain is rich and challenging: one must segment rapidly changing scenes in natural environments while coping with non-rigid motion, singularities of the camera projection, a large number of degrees of freedom (DOFs), (self-)occlusion, and image noise from the background and human clothing.
Human body models range from simple articulated skeletons with connected segments [Cham99] [Song00] or 2-D patch-based models [Ju96] [Bregler98] [Jojic99] to complex volumetric models built from combinations of elementary volumes [Deutscher00] [Sidenbladh00] [Yamamoto00]. Statistical techniques have been exploited to infer a generic template from a labeled training set associated with representative deformation models [Baumberg94]. Some work has focused on learning human body movement using PCA [Deutscher00] [Howe99]; the learned body motion modes, such as walking, running, or jumping, assist model matching during on-line body tracking.
Methods for full-body tracking typically use sparse cues such as background difference images, color, and edges [Wren97]. Motion or optic flow gives rich information but can cause the tracking model to "drift off" the target [Bregler98] [Ju96]. The use of templates avoids this problem, but template tracking is sensitive to changes in view and illumination [Cham99]. Multiple camera views are often employed to reduce ambiguity and problems due to self-occlusion [Delamarre99] [Yamamoto98] [Jojic99]. Recently, some researchers have cast body tracking as an inference problem [Sidenbladh01] [Ioffe01]: they adopt a Bayesian formulation and estimate the model parameters over time using particle filtering [Cham99] [Deutscher00] [Sidenbladh00].
Figure 1 illustrates the 2-D articulated cardboard body model we define. Its kinematics are similar to those of the 2-D scaled prismatic model [Morris98]. We define each body part as an isosceles trapezoidal planar patch, either in the front view 1(a) or in the side view 1(b); the patches are linked together by kinematic chains in the hierarchical manner shown in 1(c). In the front and side views the joints are labeled in yellow and the links are drawn in red; the origin of the body coordinate system is located at the torso joint, depicted in black in 1(a), and coincides with the thigh joint in 1(b). Because of occlusion, we consider only half of the body model in the side view (the dashed lines depict the occluded body parts).
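As a minimal sketch of this hierarchy (the class and field names below are ours for illustration, not from the paper), each part can store its joint state and trapezoid dimensions, with children forming the kinematic chain:

```python
from dataclasses import dataclass, field

# A minimal sketch of the cardboard model as a kinematic tree.
# Names (BodyPart, angle, length, widths) are illustrative placeholders.
@dataclass
class BodyPart:
    name: str
    angle: float          # rotation of the link about its proximal joint (rad)
    length: float         # link length along the axis (scaled prismatic DOF)
    top_width: float      # widths of the isosceles trapezoidal patch
    bottom_width: float
    children: list = field(default_factory=list)

# Hierarchical chain rooted at the torso (front view)
torso = BodyPart("torso", 0.0, 1.0, 0.5, 0.4)
upper_arm = BodyPart("upper_arm_l", 0.2, 0.45, 0.15, 0.12)
lower_arm = BodyPart("lower_arm_l", 0.1, 0.40, 0.12, 0.10)
upper_arm.children.append(lower_arm)
torso.children.append(upper_arm)
```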
Figure 1: A 2-D Human Body Model: Scaled Prismatic Planar Model.
A general framework for model-based tracking consists of four main components: prediction, synthesis, image analysis, and state estimation. Different approaches have been proposed within this framework. One possibility is a "divide-and-conquer" technique, in which an articulated object is decomposed into a number of primitive (rigid or articulated) sub-parts; one solves for the motion and depth of the sub-parts and verifies whether the parts satisfy the necessary constraints. Other approaches instead use parameterized models in which the articulation constraints are encoded in the representation itself; they thus exploit prior knowledge as much as possible and rely as little as possible on error-prone 2-D image segmentation. This group can be divided into two types of methods: one uses such a parameterized model to update poses by inverse kinematics; the other does not attempt to invert a non-linear measurement equation, but instead uses the measurement equation directly to synthesize the model and then uses a fitting measure between the synthesized and observed features as feedback.
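The four components can be illustrated with a self-contained toy on a synthetic 1-D target (everything below, including the Gaussian "template" and the centroid-based correction, is an illustrative stand-in, not the paper's method):

```python
import numpy as np

def synthesize(center, grid):
    """Render a Gaussian 'appearance template' at a hypothesized position."""
    return np.exp(-0.5 * (grid - center) ** 2)

def centroid(image, grid):
    return np.sum(image * grid) / np.sum(image)

grid = np.linspace(0.0, 20.0, 400)
state, velocity = 5.0, 0.0
for t in range(30):
    truth = 5.0 + 0.3 * t                        # target drifting right
    observation = synthesize(truth, grid)        # stand-in for the input frame
    pred = state + velocity                      # 1) prediction (constant velocity)
    rendered = synthesize(pred, grid)            # 2) synthesis
    residual = centroid(observation, grid) - centroid(rendered, grid)  # 3) image analysis
    new_state = pred + 0.5 * residual            # 4) state estimation
    velocity, state = new_state - state, new_state

print(f"final state {state:.2f}, true position {truth:.2f}")
```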
An articulated object’s posture can be parameterized in many different ways. One possibility is to use a set of parameters, such as the position and orientation of each part, and to impose the constraints in Lagrangian form [Ju96]. Another is to use the kinematic chain equations and select parameters such as the orientation of each part and the position of a reference point [Bregler98] [Yamamoto98].
Our body model resembles the 2-D cardboard model of [Ju96], but its kinematic constraints are generated in a manner similar to [Morris98], which attaches to each link a template that rotates and scales with the link. The rotational DOF captures the change in link orientation, and the translational DOF models the foreshortening that occurs when the 3-D link rotates into and out of the image plane. We observe, however, that when the link rotates into and out of the image plane the template width also changes. We therefore add a further translational component in the direction orthogonal to the link axis, which changes the width of the planar patch. Since each column of the Jacobian matrix maps one component of the body state velocity to an image velocity, by expressing the image velocity in terms of the state we obtain an expression for the Jacobian.
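To make this concrete, consider (in notation we introduce here for illustration, not taken verbatim from [Morris98]) a template point at link coordinates $(a, s)$, with $a$ the fractional position along the link axis and $s$ the offset across it; with base point $\mathbf{b}$, joint angle $\theta$, length $d$, and width $w$ (our added DOF), the point's image position and velocity are

$$
\mathbf{x} = \mathbf{b} + a\,d\,\mathbf{u}(\theta) + s\,w\,\mathbf{u}^{\perp}(\theta),
\qquad
\mathbf{u}(\theta)=\begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix},
$$

$$
\dot{\mathbf{x}} = \dot{\mathbf{b}}
+\underbrace{\left(a\,d\,\mathbf{u}^{\perp}-s\,w\,\mathbf{u}\right)}_{\partial\mathbf{x}/\partial\theta}\dot{\theta}
+\underbrace{a\,\mathbf{u}}_{\partial\mathbf{x}/\partial d}\,\dot{d}
+\underbrace{s\,\mathbf{u}^{\perp}}_{\partial\mathbf{x}/\partial w}\,\dot{w},
$$

so the bracketed terms are the Jacobian columns for the rotational, scaled-prismatic, and width DOFs.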
Torso motion is special: as the base link, we estimate its 5-D affine motion independently, with two translational parameters, a rotation angle about its joint, and two scaling parameters.
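One consistent way to write this (our notation) maps a torso template point $\mathbf{p}$ to the image as

$$
\mathbf{x} = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}
\begin{pmatrix}s_x & 0\\ 0 & s_y\end{pmatrix}\mathbf{p}
+ \begin{pmatrix}t_x\\ t_y\end{pmatrix},
$$

with the five parameters $(t_x, t_y, \theta, s_x, s_y)$.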
If we substitute the inverse kinematics equation into the optic-flow constraint equation, we can solve for the articulated motion parameters in a least-squares (LS) framework. For a chain of K+1 segments linked by K joints, we derive the corresponding iteratively weighted least-squares (IWLS) estimation formulas.
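As a minimal sketch of one such update (our own formulation and variable names, not the paper's exact derivation): each pixel contributes one optic-flow constraint, $\nabla I^\top (J\dot{q}) = -I_t$, and the constraints are stacked and solved as a weighted LS problem:

```python
import numpy as np

def estimate_motion(Ix, Iy, It, J, w):
    """Weighted LS motion update from the optic-flow constraint.
    Ix, Iy, It : (N,) spatial gradients and temporal derivative at N pixels
    J          : (N, 2, P) Jacobian mapping P state velocities to image velocity
    w          : (N,) per-pixel weights (e.g., the support maps discussed below)
    Returns dq minimizing sum_i w_i (grad I_i . (J_i dq) + It_i)^2.
    (Illustrative sketch; variable names are ours.)"""
    grad = np.stack([Ix, Iy], axis=1)            # (N, 2)
    A = np.einsum('nc,ncp->np', grad, J)         # (N, P): one constraint row per pixel
    b = -It                                      # (N,)
    sw = np.sqrt(w)
    dq, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return dq
```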
We account for self-occlusion through the use of support maps, which weight the LS problem above when calculating the articulated motion parameters. We then construct an EM framework for object tracking based on this support-layered representation of the scene. Starting from an initial guess of the support maps, we iterate, as sketched below: 1) (M-step) calculate the motion parameters; 2) (E-step) given the motion parameters, adjust the state of the body model, determine the spatial prior (Gaussian for each body part, uniform for the background), and recalculate the support maps for the next frame.
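The alternation can be illustrated on synthetic 1-D data (everything below, including the priors and numbers, is a toy stand-in for the actual image-based steps):

```python
import numpy as np

# Toy 1-D illustration of the support-map EM: one "part" layer with a Gaussian
# spatial prior competes with a uniform background layer for pixels.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
true_center = 6.3
obs = np.where(np.abs(x - true_center) < 1.0, 1.0, 0.0) \
      + 0.05 * rng.standard_normal(x.size)

mu, sigma = 5.0, 1.0                  # initial part position and prior width
for _ in range(20):
    # E-step: support map = posterior that each pixel belongs to the part,
    # combining the Gaussian spatial prior with the observed evidence.
    part = np.exp(-0.5 * ((x - mu) / sigma) ** 2) * np.maximum(obs, 1e-3)
    background = 0.1 * np.ones_like(x)            # uniform background prior
    support = part / (part + background)
    # M-step: support-weighted estimate of the part's position
    # (a stand-in for the support-weighted LS motion estimate above).
    mu = np.sum(support * x) / np.sum(support)

print(f"estimated part center: {mu:.2f} (true: {true_center})")
```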
When initializing the side-view body model, we must assume the occlusion order is known. We move our 2-D model's origin, joints, and the four corners of each planar patch to fit the body in the image, so the initial support map for each body part is also determined. The state variables for the body posture, i.e., the joint angles and the lengths and widths of the body parts, are then calculated. (For convenience, we first fix the origin of the torso, then the joint of each part, and last the four corners; the length and width are computed from the four corners.)
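For example, given the four corners of a patch (assuming, purely for illustration, the ordering top-left, top-right, bottom-right, bottom-left), the part's state can be recovered as:

```python
import numpy as np

def patch_state(corners):
    """Recover length, widths, and orientation of an isosceles trapezoidal
    patch from its four corners, ordered [top-left, top-right, bottom-right,
    bottom-left]. (Illustrative; the corner ordering is our assumption.)"""
    tl, tr, br, bl = map(np.asarray, corners)
    top_mid, bottom_mid = (tl + tr) / 2, (bl + br) / 2
    length = np.linalg.norm(bottom_mid - top_mid)      # along the link axis
    top_width = np.linalg.norm(tr - tl)
    bottom_width = np.linalg.norm(br - bl)
    axis = bottom_mid - top_mid
    angle = np.arctan2(axis[1], axis[0])               # link orientation
    return length, top_width, bottom_width, angle

print(patch_state([(0, 0), (2, 0), (1.8, 4), (0.2, 4)]))
```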
Figure 2: Initialization of 2-D Model-based Human Body Tracking.
Figure 3: Sample Video Clip of Human Walking.
Figure 4: Tracking Illustration of Human Walking (model fitting and layers).