OCR-Pose: Occlusion-aware Contrastive Representation for Unsupervised 3D Human Pose Estimation

ABSTRACT

Occlusion is a significant problem in 3D human pose estimation from its 2D counterpart. On one hand, without explicit annotation, an accurate 3D skeleton is hard to obtain from an occluded 2D pose that lacks reliable 2D joints (local joint ambiguity). On the other hand, one occluded 2D pose may correspond to multiple plausible 3D skeletons, since its low-confidence parts provide no reliable 3D cue (global pose ambiguity). To address these issues, we propose an occlusion-aware contrastive representation scheme (OCR-Pose) consisting of a Topology Invariant Contrastive Learning (TiCLR) module and a View Equivariant Contrastive Learning (VeCLR) module. Specifically, TiCLR drives invariance to topology transformation, i.e., bridging the gap between an occluded 2D pose and its unoccluded counterpart, while VeCLR encourages equivariance to view transformation, i.e., capturing the geometric similarity of the 3D skeleton across two views. Both modules optimize the occlusion-aware contrastive representation together with the pose filling and lifting networks via an iterative training strategy in an end-to-end manner. OCR-Pose not only achieves superior performance against state-of-the-art unsupervised methods on unoccluded benchmarks, but also obtains significant improvements when occlusion is involved.
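Both modules share the same contrastive principle: pull each positive pair (an occluded pose and its unoccluded counterpart in TiCLR, or two views of the same skeleton in VeCLR) together in the latent space, and push negative pairs apart. Below is a minimal PyTorch sketch of such an objective, assuming an InfoNCE-style loss over pose embeddings; the paper's exact loss formulation, temperature, and pair construction may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over pose embeddings (a sketch).

    anchor:    (B, D) embedding of the occluded/transformed pose
    positive:  (B, D) embedding of its unoccluded / other-view pair
    negatives: (B, K, D) embeddings of non-matching poses
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities scaled by temperature.
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / temperature      # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives) / temperature  # (B, K)

    # The positive pair sits at index 0 of the logits.
    logits = torch.cat([pos_sim, neg_sim], dim=1)                          # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```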

Overall Framework

Overview of the proposed OCR-Pose. OCR-Pose exploits a backbone network (i.e., the pose filling and lifting networks) to obtain an occlusion-aware contrastive representation (i.e., a rotation feature and an occlusion-aware feature), which can be directly regressed to a 3D skeleton. Notably, the occlusion-aware feature is optimized by the TiCLR and VeCLR modules simultaneously. Both modules encourage the embeddings generated by positive pairs to lie close to each other in the latent space, and push those produced by negative pairs apart.
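As a rough illustration of how such a backbone could be organized, the hypothetical sketch below completes missing 2D joints with a filling network, lifts the refined pose to a latent code, splits that code into a rotation feature and an occlusion-aware feature, and regresses the latter to a 3D skeleton. All names (PoseBackbone, feat_dim), layer sizes, and the plain-MLP design are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PoseBackbone(nn.Module):
    """Hypothetical filling-then-lifting backbone (illustrative only)."""

    def __init__(self, n_joints=17, feat_dim=256):
        super().__init__()
        in_dim = n_joints * 2
        # Filling network: completes occluded joints given pose + visibility mask.
        self.filling = nn.Sequential(
            nn.Linear(in_dim + n_joints, 512), nn.ReLU(),
            nn.Linear(512, in_dim),
        )
        # Lifting network: maps refined 2D pose to a latent code that is
        # split into a rotation feature and an occlusion-aware feature.
        self.lifting = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * feat_dim),
        )
        # The occlusion-aware feature is regressed directly to 3D joints.
        self.regressor = nn.Linear(feat_dim, n_joints * 3)

    def forward(self, pose_2d, vis_mask):
        # pose_2d: (B, J, 2); vis_mask: (B, J) with 0 for occluded joints.
        b = pose_2d.size(0)
        refined_2d = self.filling(torch.cat([pose_2d.flatten(1), vis_mask], dim=1))
        rot_feat, occ_feat = self.lifting(refined_2d).chunk(2, dim=1)
        pose_3d = self.regressor(occ_feat).view(b, -1, 3)
        return refined_2d, rot_feat, occ_feat, pose_3d
```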

Contrastive Learning Modules

The basic cycle structure on the left contains two pose lifting networks, one pose filling network, and a discriminator, together with one mask operation, two rotation operations, and two projection operations. The two contrastive representation learning modules on the right are TiCLR and VeCLR.
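This cycle follows the familiar unsupervised-lifting recipe: lift the (filled) 2D pose to 3D, rotate it to a synthetic view, project it back to 2D for the discriminator, and lift it again for consistency. Below is a minimal sketch of one lift-rotate-project step, assuming a hypothetical lifter callable, rotation about the vertical axis, and orthographic projection; the paper's rotation sampling and camera model may differ.

```python
import torch

def random_rotation_y(batch_size, device):
    """Random rotation about the vertical (y) axis, a common choice for
    synthesizing novel views of a lifted skeleton."""
    theta = torch.rand(batch_size, device=device) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    zeros, ones = torch.zeros_like(c), torch.ones_like(c)
    return torch.stack([
        torch.stack([c, zeros, s], dim=1),
        torch.stack([zeros, ones, zeros], dim=1),
        torch.stack([-s, zeros, c], dim=1),
    ], dim=1)  # (B, 3, 3)

def cycle_step(lifter, pose_2d_flat, n_joints):
    """One lift-rotate-project pass: lift 2D to 3D, rotate to a new view,
    then project back to 2D for the second lifter / discriminator."""
    b = pose_2d_flat.size(0)
    pose_3d = lifter(pose_2d_flat).view(b, n_joints, 3)
    rot = random_rotation_y(b, pose_2d_flat.device)
    pose_3d_rot = torch.einsum('bij,bnj->bni', rot, pose_3d)
    pose_2d_new = pose_3d_rot[..., :2].flatten(1)  # orthographic projection
    return pose_3d, pose_3d_rot, pose_2d_new
```

In the full cycle, `pose_2d_new` would be fed both to the discriminator (to judge plausibility of the synthesized view) and to the second lifting network (to close the consistency loop).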

Qualitative Results

Qualitative results on 4 different datasets. Notably, predictions (red) along with ground truth (green) are illustrated in each panel (A/B). Top left: Human3.6M. Top right: 3DHP. Bottom left: 3DPW. Bottom right: 3DOH50K.

Effectiveness under Occlusion Conditions

A qualitative comparison of 2D completion and 3D estimation on Human3.6M and 3DOH50K. The original 2D pose (a) and the refined 2D pose (c) are produced by an off-the-shelf detector and the proposed filling network, respectively. [Yu et al., 2021] (b) and ours (d) are the corresponding 3D skeletons estimated by the two methods.

Effectiveness of Contrastive Learning Modules

t-SNE visualizations of the feature embeddings. A/B correspond to w/ and w/o TiCLR; C/D correspond to w/ and w/o VeCLR.
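A visualization of this kind can be reproduced with a few lines of scikit-learn and matplotlib, as in the sketch below; `embeddings` and `labels` are placeholders for the learned features and their pose/occlusion categories, not artifacts released with the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, labels, title):
    """Project high-dimensional pose embeddings to 2D with t-SNE and
    color the points by category (e.g., occluded vs. unoccluded)."""
    coords = TSNE(n_components=2, init='pca', random_state=0).fit_transform(embeddings)
    for lab in np.unique(labels):
        pts = coords[labels == lab]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=str(lab))
    plt.title(title)
    plt.legend()
    plt.show()
```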