View-Invariant Probabilistic Embedding for Human Pose

ECCV 2020 Spotlight

[paper] [talk] [blog] [code] [data processing script]

IJCV Paper (with temporal pose embeddings and occlusion robustness):

[paper] [code]

If you found our work useful, please consider citing:

Jennifer J. Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Ting Liu. "View-Invariant Probabilistic Embedding for Human Pose." ECCV 2020.

Abstract

Depictions of similar human body configurations can vary with changing viewpoints. In this paper, we propose an approach for learning a compact view-invariant embedding space from 2D joint keypoints alone. Since 2D poses are projected from 3D space, they have an inherent ambiguity, which is difficult to represent through a deterministic mapping. Hence, we use probabilistic embeddings to model this input uncertainty.

Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 2D-to-3D pose lifting models. We also demonstrate the effectiveness of applying our embeddings to view-invariant action recognition and video alignment.

Introduction

We embed 2D poses such that our embeddings are:
(a) view-invariant: 2D projections of similar 3D poses are embedded close together;
(b) probabilistic: embeddings are distributions that cover the different 3D poses that can project to the same input 2D pose (see the sketch below).
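
To make this concrete, here is a minimal NumPy sketch of what a probabilistic pose embedder outputs. The linear "network" below is a toy stand-in for the actual model; NUM_KEYPOINTS, EMBED_DIM, and the sample count are illustrative assumptions, not values from the paper.

```python
import numpy as np

NUM_KEYPOINTS = 13  # assumed keypoint count; depends on the 2D pose detector
EMBED_DIM = 16      # illustrative embedding dimension

rng = np.random.default_rng(0)
W = rng.normal(size=(NUM_KEYPOINTS * 2, 2 * EMBED_DIM))  # toy stand-in weights

def embed(pose_2d):
    """Map a 2D pose to the mean and variance of a Gaussian embedding.

    A real model would use a trained network; this toy linear map only
    illustrates the input/output contract: (mu, var), each (EMBED_DIM,).
    """
    h = pose_2d.reshape(-1) @ W
    mu, log_var = h[:EMBED_DIM], h[EMBED_DIM:]
    return mu, np.exp(log_var)

def sample(mu, var, num_samples=20, rng=rng):
    """Draw samples from the Gaussian embedding; the spread of the samples
    reflects the 2D-to-3D ambiguity of the input pose."""
    eps = rng.normal(size=(num_samples, mu.shape[0]))
    return mu + eps * np.sqrt(var)

pose = rng.normal(size=(NUM_KEYPOINTS, 2))  # a fake detected 2D pose
mu, var = embed(pose)
samples = sample(mu, var)                   # (20, EMBED_DIM)
```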

Approach

We use multi-view pose data to train our model. Our model maps each input 2D pose to the mean and variance of a Gaussian embedding. During training, we form triplets (anchor, positive, negative) of 2D poses: the anchor and positive are 2D poses from different views of the same 3D pose, and the negative is from a different 3D pose. We define the matching probability as the probability that two 2D poses are projected from the same or similar 3D poses.
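
The matching probability can be estimated by sampling from the two Gaussian embeddings and averaging a calibrated sigmoid of the sample distances, in the spirit of sampling-based formulations for probabilistic embeddings. A minimal NumPy sketch, where the calibration scalars a and b are learned jointly with the model in practice but fixed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def matching_probability(mu_i, var_i, mu_j, var_j, a=1.0, b=0.0,
                         num_samples=20, rng=rng):
    """Monte Carlo estimate of the probability that two 2D poses project
    from the same or similar 3D pose.

    Averages sigmoid(-a * ||z_i - z_j|| + b) over pairs of samples drawn
    from the two Gaussian embeddings. The values of a, b, and num_samples
    are placeholders; in practice a and b are learned calibration
    parameters.
    """
    z_i = mu_i + rng.normal(size=(num_samples, mu_i.size)) * np.sqrt(var_i)
    z_j = mu_j + rng.normal(size=(num_samples, mu_j.size)) * np.sqrt(var_j)
    dists = np.linalg.norm(z_i - z_j, axis=-1)
    return float(np.mean(1.0 / (1.0 + np.exp(a * dists - b))))
```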

Our training losses are:
(a) Triplet Ratio Loss: maximizes the ratio between the matching probabilities of the positive and negative pairs, since the anchor and positive should have a high matching probability while the anchor and negative should have a lower one.
(b) Positive Pairwise Loss: pushes the matching probability of the anchor and positive towards 1.
(c) Gaussian Prior Loss: regularizes the means and variances of the probabilistic embeddings (see the sketch after this list).
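
A hedged sketch of the three losses for a single triplet, assuming matching probabilities p_pos and p_neg computed as above. The hinge-on-log-ratio form and the margin beta are a plausible reading of the triplet ratio objective rather than the paper's verbatim equations, and the unit-Gaussian KL term is one standard form of such a prior regularizer:

```python
import numpy as np

def triplet_ratio_loss(p_pos, p_neg, beta=2.0):
    """Hinge on the log-ratio of matching probabilities, encouraging
    p(anchor, positive) / p(anchor, negative) >= beta.
    beta is a margin hyperparameter; the value here is illustrative."""
    return max(0.0, np.log(beta) - (np.log(p_pos) - np.log(p_neg)))

def positive_pairwise_loss(p_pos):
    """Pushes the anchor-positive matching probability towards 1."""
    return -np.log(p_pos)

def gaussian_prior_loss(mu, var):
    """Closed-form KL divergence from N(mu, diag(var)) to the unit
    Gaussian N(0, I), regularizing embedding means and variances."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))
```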

Results

Our model works with any 2D pose detector, and the learned embeddings can be used for a variety of downstream tasks where view invariance is useful, such as pose retrieval, video alignment, and action recognition.

Pose Retrieval
Using nearest-neighbor search in the embedding space, our model can retrieve images of the same pose across camera views, including in-the-wild images, without requiring camera parameters.
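
A minimal retrieval sketch, assuming per-image embedding means from a model like the one above. Ranking by distance between means is a cheap proxy used here for illustration; ranking by matching probability is the natural alternative:

```python
import numpy as np

def retrieve(query_mu, gallery_mus, top_k=5):
    """Return indices of the top_k gallery poses closest to the query.

    gallery_mus: (N, EMBED_DIM) embedding means for the gallery images.
    No camera parameters are needed: retrieval happens purely in the
    view-invariant embedding space.
    """
    dists = np.linalg.norm(gallery_mus - query_mu, axis=1)
    return np.argsort(dists)[:top_k]
```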

Action Recognition
Our embeddings can also serve as view-invariant pose features for recognizing actions across camera views.
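
As a hedged illustration of one way frame embeddings could feed action recognition, the nearest-neighbor classifier below compares clips by their average per-frame embedding distance. It assumes clips trimmed to equal length; the actual recognition pipeline behind the results may differ:

```python
import numpy as np

def classify_action(query_seq, labeled_seqs, labels):
    """1-nearest-neighbor action classification over embedding sequences.

    query_seq: (T, D) per-frame embedding means for the query clip.
    labeled_seqs: list of (T, D) arrays with known action labels.
    Clips are assumed trimmed to a common length T for simplicity.
    """
    dists = [np.mean(np.linalg.norm(query_seq - s, axis=1))
             for s in labeled_seqs]
    return labels[int(np.argmin(dists))]
```
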
Video Alignment
These embeddings can also be used to align videos from different views. [More results]

Original videos from Penn Action (Zhang et al.)
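
One simple way to align two videos from their per-frame embeddings is dynamic time warping over the pairwise frame-distance matrix. This is a sketch of a reasonable alignment procedure under that assumption, not necessarily the exact procedure used for the results above:

```python
import numpy as np

def align(emb_a, emb_b):
    """Align two videos by dynamic time warping on frame embedding distances.

    emb_a: (Ta, D) and emb_b: (Tb, D) per-frame embedding means.
    Returns a list of (i, j) frame correspondences along the optimal path.
    """
    ta, tb = len(emb_a), len(emb_b)
    dist = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], ta, tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```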

Discussion

Our work demonstrates that 2D poses can be mapped to view-invariant embeddings, and the learned embeddings can be directly used for pose retrieval, video alignment, and action recognition. We hope that our work can encourage more explorations into pose representations and their applications.

Acknowledgements

We thank Yuxiao Wang, Debidatta Dwibedi, and Liangzhe Yuan from Google Research, Long Zhao from Rutgers University, and Xiao Zhang from the University of Chicago for helpful discussions. We appreciate the support of Pietro Perona, Yisong Yue, and the Computational Vision Lab at Caltech for making this collaboration possible. Jennifer J. Sun is partially supported by NSERC #PGSD3-532647-2019.

Correspondence to jjsun (at) caltech.edu.