Abstract: Human pose estimation using monocular vision is a challenging problem in computer vision. Past work has focused on developing efficient inference algorithms and probabilistic prior models based on captured kinematic/dynamic measurements. However, such algorithms face challenges in generalization beyond the learned dataset.
In this work, we propose a model-based generative approach for estimating human pose solely from uncalibrated monocular video in unconstrained environments, without any prior learning on motion capture or image annotation data. We propose a novel Product of Heading Experts (PoHE) based generalized heading estimation framework that probabilistically merges heading outputs (probabilistic or non-probabilistic) from a time-varying number of estimators. Our current implementation employs a motion-cues-based human heading estimation framework to bootstrap a synergistically integrated probabilistic-deterministic sequential optimization framework that robustly estimates human pose. Novel pixel-distance-based performance measures are developed to penalize false human detections and ensure identity-maintained human tracking. We tested our framework with varied inputs (silhouettes and bounding boxes) to evaluate, compare, and benchmark it against ground-truth data (collected using our human annotation tool) for 52 video vignettes in the publicly available DARPA Mind’s Eye Year I dataset. Results show robust pose estimates on this challenging dataset of highly diverse activities.
Summary of optimization framework implemented for pose estimation on each frame.
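The probabilistic-deterministic idea behind this framework (a population-based global search supplying the initial guess for a deterministic local optimizer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, parameters, and the two specific algorithms used here (an elitist resampling search and a coordinate pattern search) are assumptions chosen only to keep the example short and dependency-free.

```python
import random
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[float]], float]
Bounds = Sequence[Tuple[float, float]]


def population_search(f: Objective, bounds: Bounds, rng: random.Random,
                      pop_size: int = 40, generations: int = 30,
                      elite: int = 8, sigma: float = 0.3) -> List[float]:
    """Stochastic stage (hypothetical stand-in): keep the best candidates
    and resample around them with Gaussian perturbations."""
    def clip(v: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, v))

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)
        parents = pop[:elite]
        children = []
        while len(parents) + len(children) < pop_size:
            p = rng.choice(parents)
            children.append([clip(v + rng.gauss(0.0, sigma), lo, hi)
                             for v, (lo, hi) in zip(p, bounds)])
        pop = parents + children
    return min(pop, key=f)


def pattern_search(f: Objective, x0: Sequence[float], step: float = 0.5,
                   tol: float = 1e-6, max_iter: int = 10000) -> List[float]:
    """Deterministic stage (hypothetical stand-in): coordinate pattern search.
    Moves are accepted only if they lower f; the step halves otherwise.
    Bounds are not enforced in this stage."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[i] += delta
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step /= 2.0
    return x


def hybrid_minimize(f: Objective, bounds: Bounds,
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    """Run the stochastic stage, then refine its best point deterministically.
    Returns (global_stage_point, refined_point)."""
    rng = random.Random(seed)
    x_global = population_search(f, bounds, rng)
    x_local = pattern_search(f, x_global)
    return x_global, x_local
```

In this sketch the stochastic stage only has to land in the right basin; the deterministic stage then tightens the estimate. Because the local stage accepts only improving moves, the refined point is never worse than the initial guess it was given.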
Research Contributions:
1. Product of Heading Experts - We model the heading estimation task independently of the features and types of the individual estimators and focus on optimally fusing the information from all available estimators. Hence, we propose a Product of Heading Experts (PoHE) based generalized heading estimation framework that probabilistically merges heading outputs from a time-varying number of estimators to produce robust heading estimates under varied conditions in unconstrained scenarios.
2. Motion Cues Based Heading Estimation - We propose a novel generative model that estimates the heading direction of the subject in the video using motion-based cues, thus significantly reducing the pose search space.
3. Decoupled Pose Estimation - We propose a sequential optimization framework that optimizes the decoupled pose states (camera/body location, body joint angles) separately, combining deterministic and probabilistic optimization approaches to leverage the advantages of each.
4. Probabilistic-Deterministic Optimization Scheme - We achieve faster convergence to the global minimum by using a population-based global optimization technique to obtain the initial guesses for a deterministic convex optimization scheme.
5. Identity Maintained Pose Evaluation Metric - We introduce the notion of pose evaluation for videos with multiple humans by defining identity maintained pose evaluation metrics.
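As a rough illustration of the fusion idea in contribution 1, the sketch below multiplies per-expert heading densities. It assumes each expert reports a mean heading and a concentration and models each as a von Mises density; that modelling choice, the function name `fuse_headings`, and the idea of giving a non-probabilistic expert a fixed assumed concentration are assumptions for illustration, not details taken from the published method.

```python
import math
from typing import Iterable, Optional, Tuple


def fuse_headings(
    experts: Iterable[Tuple[float, float]]
) -> Optional[Tuple[float, float]]:
    """Fuse (mean_heading_rad, concentration) pairs as a product of
    von Mises densities (an assumed model for each expert).

    The product of von Mises densities is again von Mises: its parameters
    come from summing the vectors kappa_i * (cos mu_i, sin mu_i).
    Returns (fused_heading_rad, fused_concentration), or None when no
    expert reported or the opinions cancel out.
    """
    x = 0.0
    y = 0.0
    for mu, kappa in experts:
        x += kappa * math.cos(mu)
        y += kappa * math.sin(mu)
    kappa_fused = math.hypot(x, y)
    if kappa_fused < 1e-12:
        return None
    return math.atan2(y, x), kappa_fused


# The number of experts may change from frame to frame; experts that
# produce no output for a frame are simply left out of the list.
frame_a = fuse_headings([(0.0, 1.0), (math.pi / 2, 1.0)])
frame_b = fuse_headings([(math.radians(170), 2.0), (math.radians(-170), 2.0)])
```

Working with unit vectors weighted by concentration handles angular wrap-around, so two experts reporting 170 and -170 degrees fuse to a heading near 180 degrees rather than 0. Experts in close agreement yield a larger fused concentration, and conflicting experts yield a smaller one.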
Ground Truth Collection:
Human Annotation GUI
Results:
Raw Unfiltered Pose Estimates
Quantitative Metrics:
Using Background Subtraction and Tracked Humans
Using Background Subtraction and Detected Humans
Quantitative Metrics:
Using Background Subtraction Alone
Using Manually Labelled Silhouette
Heading Angle Estimation
Previous Results on KTH Dataset
Two pose hypotheses are obtained for each input image. The first column shows the input images; the second and third columns show the output model images for the estimated poses.
Related Publications:
Conference/Journal
P. Agarwal, S. Kumar, J. Ryde, J. Corso, and V. Krovi, "Estimating Dynamics On-the-fly Using Monocular Video For Vision-Based Robotics", IEEE/ASME Transactions on Mechatronics, 2013. [IEEE Xplore]
P. Agarwal, S. Kumar, J. Ryde, J. Corso, and V. Krovi, "An Optimization Based Framework for Human Pose Estimation in Monocular Videos", International Symposium on Visual Computing, Rethymnon, Crete, Greece, July 16-18, 2012. [PDF] [Springer] [Videos]
P. Agarwal, S. Kumar, J. Ryde, J. Corso, and V. Krovi. Estimating Human Dynamics On-the-fly Using Monocular Video for Pose Estimation. Robotics: Science and Systems Conference, University of Sydney, Sydney, Australia, July 9-13, 2012. [PDF] [RSS12] (National ICT Australia Student Fellowship, Robotics Science and Systems 2012)
P. Agarwal, S. Kumar, J. Corso, and V. Krovi. Estimating Dynamics On-the-fly Using Monocular Video. Dynamic Systems and Control Conference, California, October 12-14, 2011. [PDF] [ASME]
Book Chapters
P. Agarwal, S. Kumar, J. Ryde, J. Corso, and V. Krovi, “Estimating Human Dynamics On-The-Fly Using Monocular Video For Pose Estimation,” in Robotics: Science and Systems VIII, P. Newman, N. Roy and S. Srinivasa (Eds.), MIT Press, pp. 1-8, August 2013. [IEEE Xplore]
Theses
P. Agarwal, "Dynamics-based Human Pose Estimation Using Monocular Vision", Master's Thesis, State University of New York at Buffalo (SUNY Buffalo), 2012. [PDF]
Reports
S. Kumar, P. Agarwal, J. Corso, and V. Krovi. Product of Tracking Experts for Human Tracking. European Conference on Computer Vision, Firenze, Italy, October 7-13, 2012.
P. Agarwal, "An Optimization Framework for Pose Estimation of Human Lower Limbs from a Single Image", Project Report, Optimization in Engineering Design. [Report] [Poster]
Interim Report [PDF]