Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Abstract
We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly affected by distribution shifts in camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges: First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, the non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and fitting hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces temporal constraints to compensate for the unavailable 3D annotations, and leverages a bilevel optimization procedure to address the conflicts between multiple objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples, which are retrieved efficiently despite the distribution shift. Furthermore, it adaptively adjusts the number of optimization steps on individual frames to fully fit hard samples while avoiding overfitting regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.
Adapting to Long Videos
We train a source model on Human3.6M and then adapt it to novel target domains. Below are the results of DynaBOA on videos from BiliBili.
Adapting to Short Videos
What about adapting to short videos? Short videos are popular on many social platforms, such as TikTok and Instagram, so a natural concern is what happens when the target video lasts only a few seconds. To investigate this case, we collect some videos from TikTok and evaluate two adaptation schemes: (1) Independent: adapting to each video independently, i.e., restarting from the source model whenever a new video arrives. (2) Streaming: continuously adapting the model across videos, i.e., for each new video, the model is initialized with the weights obtained from the previous video.
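The difference between the two schemes is simply in how the model weights are carried between videos. The following minimal sketch illustrates this; `adapt_fn` stands for one online-adaptation update (e.g., a DynaBOA step) and the function names are hypothetical, not part of the released code.

```python
import copy

def adapt_independent(source_model, videos, adapt_fn):
    """Independent scheme: restart from the source model for every video."""
    adapted = []
    for video in videos:
        model = copy.deepcopy(source_model)   # reset to the source weights
        for frame in video:
            adapt_fn(model, frame)            # one online update per frame
        adapted.append(model)
    return adapted

def adapt_streaming(source_model, videos, adapt_fn):
    """Streaming scheme: carry the adapted weights across videos."""
    model = copy.deepcopy(source_model)       # initialize from source once
    adapted = []
    for video in videos:
        for frame in video:
            adapt_fn(model, frame)
        adapted.append(copy.deepcopy(model))  # snapshot after each video
    return adapted
```

Under the independent scheme each short video offers only a handful of adaptation steps, whereas the streaming scheme accumulates updates over the whole sequence of videos.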
The results are shown below. The 2D keypoints are detected by AlphaPose and drawn on the raw images. Under the streaming adaptation scheme, the videos are fed in sequentially, from left to right and from top to bottom.