STRIDE: Single-video based Temporally Continuous Occlusion-Robust 3D Pose Estimation
Rohit Lal, Saketh Bachu, Yash Garg, Arindam Dutta, Calvin-Khang Ta, Hannah Dela Cruz, Dripta S. Raychaudhuri, M. Salman Asif, Amit K. Roy-Chowdhury
[Oral, WACV 2025, Tucson, AZ]
University of California, Riverside
Accurately estimating 3D human poses under severe occlusion is crucial for tasks like action recognition, gait analysis, and AR/VR. Existing models struggle under heavy occlusion because they exploit only limited temporal context and cannot handle occlusions that persist across many frames. To address this, we introduce STRIDE (Single-video TempoRally contInuous occlusion-robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach that refines noisy initial pose estimates into accurate, temporally coherent predictions. STRIDE is model-agnostic: it can enhance the robustness and temporal consistency of any off-the-shelf 3D pose estimator. Experiments on challenging datasets show that STRIDE significantly outperforms both single-image and video-based methods, especially under substantial occlusion.
Results on Occluded Human3.6M
Results on OCMotion
Results on Different Domains: Occluded H36M, BRIAR, and OCMotion
Architecture
This figure illustrates the pipeline of STRIDE, our temporally continuous pose estimation model. First, we pre-train a motion prior model, denoted as M, on a diverse set of 3D pose data drawn from various public datasets. The objective of this motion prior is to produce a temporally continuous sequence of poses when given a sequence of initially noisy poses. In the single-video training stage, we obtain a sequence of noisy poses from a 3D pose estimation model, P, whose weights are held constant during this phase. We then pass this noisy pose sequence through the motion prior model M and retrain it using various supervised losses. The end result of this training process is a model capable of producing temporally continuous 3D poses.
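For intuition, the single-video training stage can be sketched as the following test-time loop. This is a minimal illustration, not the paper's exact implementation: `pose_estimator` (P), `motion_prior` (M), and the data and smoothness terms below are hypothetical stand-ins for the actual supervised losses.

```python
import torch

def stride_test_time_training(pose_estimator, motion_prior, video_frames,
                              num_steps=50, lr=1e-4):
    """Sketch of STRIDE's single-video training stage (names illustrative).

    pose_estimator (P): any off-the-shelf 3D pose model, weights frozen.
    motion_prior   (M): pre-trained motion prior, refined on this one video.
    """
    # Stage 1: obtain an initial, possibly noisy pose sequence from frozen P.
    pose_estimator.eval()
    with torch.no_grad():
        noisy_poses = torch.stack(
            [pose_estimator(frame) for frame in video_frames])  # (T, J, 3)

    # Stage 2: retrain M on this single video so its output is temporally
    # continuous while staying close to the initial estimates.
    optimizer = torch.optim.Adam(motion_prior.parameters(), lr=lr)
    for _ in range(num_steps):
        refined = motion_prior(noisy_poses)  # (T, J, 3)

        # Placeholder losses: a data term anchoring the refined poses to the
        # initial estimates, plus a penalty on frame-to-frame jitter.
        data_loss = torch.nn.functional.mse_loss(refined, noisy_poses)
        smooth_loss = (refined[1:] - refined[:-1]).pow(2).mean()
        loss = data_loss + 0.1 * smooth_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        return motion_prior(noisy_poses)
```

Note that only M is updated; keeping P frozen is what makes the approach model-agnostic, since any estimator's outputs can serve as the noisy input sequence.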
Samples with Natural Occlusions
This figure shows how our method performs when tested under natural occlusions. The translucent blue skeletons in the second, third, and fourth columns represent the ground truth; blue, red, and green denote the ground truth, PoseFormerV2, and STRIDE results, respectively.
We compare our method against PoseFormerV2, an existing state-of-the-art 3D pose estimation method. STRIDE's skeleton aligns best with the ground-truth pose, even under significant occlusion.
This figure demonstrates how our method incorporates temporal continuity into video sequences under occlusion. The second row shows 3D poses predicted by CycleAdapt; the third row shows 3D poses predicted by STRIDE. In both rows, the translucent red poses represent the ground truth.
CycleAdapt (second row) fails to generalize when there is complete occlusion, whereas STRIDE (third row) produces temporally coherent pose infilling thanks to test-time training.
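Why can test-time training infill frames that are completely occluded? One way to see it: if the per-frame data term is down-weighted on occluded frames, the motion prior and the temporal smoothness objective alone determine those poses. The sketch below is a hypothetical illustration of this idea; the confidence mask and weighting are our assumptions, not the paper's exact formulation.

```python
import torch

def masked_refinement_loss(refined, noisy_poses, frame_confidence,
                           smooth_weight=0.1):
    """Illustrative loss: occluded frames (confidence near 0) contribute no
    data term, so their poses are filled in by the motion prior and the
    smoothness penalty alone.

    refined, noisy_poses: (T, J, 3) pose sequences.
    frame_confidence:     (T,) per-frame weights in [0, 1].
    """
    per_frame_err = (refined - noisy_poses).pow(2).mean(dim=(1, 2))  # (T,)
    data_loss = (frame_confidence * per_frame_err).mean()
    smooth_loss = (refined[1:] - refined[:-1]).pow(2).mean()
    return data_loss + smooth_weight * smooth_loss
```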