Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

Shanyan Guan, Jingwei Xu, Michelle Z. He, Yunbo Wang, Bingbing Ni, Xiaokang Yang

MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China

Abstract

We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly affected by the distribution shift caused by different camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges: First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, the non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and fitting hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces temporal constraints to compensate for the unavailable 3D annotations, and leverages a bilevel optimization procedure to resolve the conflicts among multiple objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples retrieved efficiently despite the distribution shift. Furthermore, it adaptively adjusts the number of optimization steps on individual frames to fully fit hard samples while avoiding overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.
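To make the adaptation procedure more concrete, below is a minimal first-order sketch of one DynaBOA-style update on a single streaming frame. The model interface, the loss functions (`unsup_loss`, `sup_loss`), the retrieval routine, the stopping threshold, and all hyperparameter values are hypothetical placeholders for illustration, not the released implementation.

```python
import copy
import torch


def dynaboa_frame_update(model, frame, retrieve_source_batch,
                         unsup_loss, sup_loss,
                         inner_lr=3e-6, outer_lr=3e-6,
                         max_steps=7, fit_threshold=0.02):
    """One DynaBOA-style adaptation step on a single streaming frame (sketch).

    unsup_loss(m, frame): 2D keypoint + temporal-consistency losses (no 3D labels).
    sup_loss(m, batch):   fully supervised 3D loss on retrieved source examples.
    All names and values are hypothetical; this is a first-order approximation
    of the bilevel update, not the paper's exact optimization.
    """
    for step in range(max_steps):
        # Lower level: take a trial step on the unlabeled target frame.
        fast_model = copy.deepcopy(model)
        inner_opt = torch.optim.Adam(fast_model.parameters(), lr=inner_lr)
        inner_opt.zero_grad()
        unsup_loss(fast_model, frame).backward()
        inner_opt.step()

        # Upper level: evaluate the trial weights on retrieved source data
        # (which carries 3D annotations) plus the target-frame objectives.
        source_batch = retrieve_source_batch(frame)
        upper = sup_loss(fast_model, source_batch) + unsup_loss(fast_model, frame)
        grads = torch.autograd.grad(upper, list(fast_model.parameters()))

        # First-order approximation: apply the upper-level gradients to the
        # original weights instead of differentiating through the inner step.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= outer_lr * g

        # Dynamic part: stop early once the frame is fitted well enough,
        # so hard frames (occlusion, fast motion) receive more steps.
        if unsup_loss(model, frame).item() < fit_threshold:
            break
    return model
```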

Adapting to Long Videos

We train a source model on Human 3.6M and then adapt it to novel target domains. Below are the results of DynaBOA on videos from BiliBili.

seq07_c01.mp4
seq02_c01.mp4
seq04_c01.mp4
seq10_c01.mp4
seq09_c01.mp4

Adapting to Short Videos

What about adapting to short videos? Short videos are popular on many social platforms, such as TikTok and Instagram, so a natural concern is whether adaptation still works when the target video lasts only a few seconds. To investigate this case, we collected several videos from TikTok. We evaluate two adaptation schemes (see the sketch below): (1) Independent: adapt to each video independently, i.e., restart from the source model whenever a new video arrives. (2) Streaming: adapt continuously across videos, i.e., for each new video, the model is initialized from the model obtained on the previous video.
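A minimal sketch of the two evaluation schemes, assuming a hypothetical `load_source_model` helper and an `adapt_to_video` routine (e.g., looping the frame-wise update sketched above over one clip); both are illustrative placeholders rather than the released code.

```python
def evaluate_independent(videos, load_source_model, adapt_to_video):
    """Independent: reset to the source weights for every video."""
    results = []
    for video in videos:
        model = load_source_model()          # fresh copy of the source model
        results.append(adapt_to_video(model, video))
    return results


def evaluate_streaming(videos, load_source_model, adapt_to_video):
    """Streaming: carry the adapted weights from one video to the next."""
    model = load_source_model()
    results = []
    for video in videos:
        results.append(adapt_to_video(model, video))  # model keeps its state
    return results
```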

The results are shown below. The 2D keypoints, detected by AlphaPose, are drawn on the raw images. Under the streaming adaptation scheme, the videos are fed in sequentially, from left to right and from top to bottom.

000.mp4
001.mp4
002.mp4
003.mp4
004.mp4
005.mp4
006.mp4
007.mp4
008.mp4
009.mp4
010.mp4
011.mp4
012.mp4
013.mp4
014.mp4
015.mp4