Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Abstract
We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly affected by distribution shifts in camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges: First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, the non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and fitting hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces temporal constraints to compensate for the unavailable 3D annotations, and leverages a bilevel optimization procedure to address the conflicts between multiple objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples, which are retrieved efficiently despite the distribution shift. Furthermore, it adaptively adjusts the number of optimization steps on individual frames to fully fit hard samples while avoiding overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.
Adapting to Long Videos
We train a source model on Human3.6M and then adapt it to novel target domains. Below are the results of DynaBOA on videos from BiliBili.
Adapting to Short Videos
What about adapting to short videos? Short videos are popular on many social platforms, such as TikTok and Instagram, so a natural concern is what happens when the target video lasts only a few seconds. To investigate this case, we collect some videos from TikTok and evaluate two adaptation schemes: (1) Independent: adapting to each video independently, i.e., restarting from the source model whenever a new video arrives. (2) Streaming: continuously adapting the model across videos, i.e., for each new video, the model is initialized with the weights obtained from the previous video.
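The difference between the two schemes is simply in how the model weights are carried between videos. The following minimal sketch illustrates this; `adapt_fn` stands for one online-adaptation update (e.g., a DynaBOA step) and the function names are hypothetical, not part of the released code.

```python
import copy

def adapt_independent(source_model, videos, adapt_fn):
    """Independent scheme: restart from the source model for every video."""
    adapted = []
    for video in videos:
        model = copy.deepcopy(source_model)   # reset to the source weights
        for frame in video:
            adapt_fn(model, frame)            # one online update per frame
        adapted.append(model)
    return adapted

def adapt_streaming(source_model, videos, adapt_fn):
    """Streaming scheme: carry the adapted weights across videos."""
    model = copy.deepcopy(source_model)       # initialize from source once
    adapted = []
    for video in videos:
        for frame in video:
            adapt_fn(model, frame)
        adapted.append(copy.deepcopy(model))  # snapshot after each video
    return adapted
```

Under the independent scheme each short video offers only a handful of adaptation steps, whereas the streaming scheme accumulates updates over the whole sequence of videos.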
The results are shown below. The 2D keypoints are detected by AlphaPose and drawn on the raw images. Under the streaming adaptation scheme, the videos are fed in sequentially, from left to right and from top to bottom.