TEMP3D: Temporally Continuous 3D Human Pose Estimation Under Occlusions

Rohit Lal*, Yash Garg*, Arindam Dutta, Calvin-Khang Ta, Dripta S. Raychaudhuri,
M. Salman Asif, Amit K. Roy-Chowdhury

University of California, Riverside; Amazon AWS AI Labs

Existing 3D human pose estimation methods perform remarkably well in both monocular and multi-view settings. However, their efficacy diminishes significantly in the presence of heavy occlusions, which limits their practical utility. For video sequences, temporal continuity can help infer accurate poses, especially in heavily occluded frames. In this paper, we leverage this temporal continuity through human motion priors, coupled with large-scale pre-training on 3D poses and self-supervised learning, to enhance 3D pose estimation in a given video sequence. This yields temporally continuous 3D pose estimates on unlabelled in-the-wild videos, which may contain occlusions, while relying exclusively on pre-trained 3D pose models. We propose an unsupervised method named TEMP3D that aligns a motion prior model to a given in-the-wild video, using existing SOTA single image-based 3D pose estimation methods, to produce temporally continuous output under occlusions. To evaluate our method, we test it on Occluded Human3.6M, our custom-built dataset that incorporates significantly large (up to 100%) human body occlusions into the Human3.6M dataset. We achieve SOTA results on Occluded Human3.6M and the OcMotion dataset while maintaining competitive performance on non-occluded data.

Results on Occluded Human3.6M

Results on OcMotion

Architecture

The figure above illustrates our temporally continuous pose estimation (TEMP3D) pipeline. First, we pre-train a motion prior model, denoted M, on a diverse set of 3D pose data sourced from various public datasets. The objective of this motion prior is to generate a sequence of poses that exhibits temporal continuity when provided with an initially noisy pose sequence. In the single-video training stage, we obtain a sequence of noisy poses from a 3D pose estimation model, P, whose weights are held constant throughout this phase. We then pass this noisy pose sequence through the motion prior model M and retrain it using various supervised losses. The end result of this training process is a model capable of producing temporally continuous 3D poses.
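The sketch below is a minimal, illustrative version of this single-video training stage, assuming `motion_prior` (M) and `pose_estimator` (P) are PyTorch modules. The function name, the reconstruction and smoothness losses, and the loss weighting are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def train_on_video(frames, motion_prior, pose_estimator,
                   num_steps=1000, lr=1e-4):
    """Align the motion prior M to one unlabelled video (hypothetical sketch)."""
    # P stays frozen: its noisy per-frame estimates act as pseudo-labels.
    pose_estimator.eval()
    for p in pose_estimator.parameters():
        p.requires_grad_(False)

    with torch.no_grad():
        noisy_poses = pose_estimator(frames)  # (T, J, 3) noisy pose sequence

    optimizer = torch.optim.Adam(motion_prior.parameters(), lr=lr)
    for _ in range(num_steps):
        refined = motion_prior(noisy_poses)   # temporally continuous output

        # Supervise against the noisy per-frame estimates...
        recon_loss = nn.functional.mse_loss(refined, noisy_poses)
        # ...and penalize frame-to-frame jitter with a velocity term
        # (an assumed stand-in for the paper's supervised losses).
        smooth_loss = (refined[1:] - refined[:-1]).pow(2).mean()

        loss = recon_loss + 0.1 * smooth_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return motion_prior
```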

Samples with Natural Occlusions

This figure shows how our method performs when tested on natural occlusion cases. In the second, third, and fourth columns, the translucent blue skeleton represents the ground truth; blue, red, and green denote the Ground Truth, PoseFormerV2, and TEMP3D results, respectively.

We compare our method against an existing state-of-the-art 3D pose estimation method, PoseFormerV2. We observe that TEMP3D's skeleton aligns best with the ground-truth pose, even under significant occlusion.
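Such alignment is typically quantified with Mean Per-Joint Position Error (MPJPE), the standard metric on Human3.6M-style benchmarks. For reference, a minimal NumPy implementation (not taken from the paper's codebase) is:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error, in the units of the inputs (e.g. mm).

    pred, gt: arrays of shape (T, J, 3) holding predicted and
    ground-truth 3D joint positions for T frames and J joints.
    """
    # Euclidean distance per joint, averaged over all joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()
```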

License: 

This project is licensed under an [MIT License].