Summary:
We present DiffOpt, a novel 3D global HMR method. Our key insight is that recent human motion generation models, such as the motion diffusion model (MDM), encode a strong prior over coherent human motion. The core of our method is to refine an initial motion reconstruction using the MDM prior, which leads to more globally coherent human motion. Our optimization jointly minimizes the motion prior loss and the reprojection loss to correctly disentangle the human and camera motions.
We validate DiffOpt on video sequences from the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild (EMDB) and demonstrate better global human motion recovery than state-of-the-art global HMR methods such as GLAMR and SLAHMR.
Optimization Framework:
DiffOpt uses neural motion fields to predict the pose, root orientation, and global root translation for each frame. We pass these parameters through the SMPL body model to obtain the 3D joint and vertex positions. The predicted motion is then constrained by three terms: a 3D loss against initial predictions from off-the-shelf HMR models, a 2D reprojection loss against predictions from 2D keypoint detection models, and a motion prior loss from the motion diffusion model.
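The way the three terms combine can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the loss weights, the pinhole camera model, and the confidence weighting are all assumptions made for illustration. In particular, `smoothness_prior` is only a simple acceleration penalty standing in for the MDM-based prior, which is not reproduced here.

```python
import numpy as np

def world_to_camera(joints_world, cam_rot, cam_trans):
    """Map (T, J, 3) world-space joints into each frame's camera space."""
    return np.einsum("tij,tkj->tki", cam_rot, joints_world) + cam_trans[:, None, :]

def project_pinhole(joints_cam, focal, center):
    """Project (T, J, 3) camera-space joints to (T, J, 2) pixel coordinates."""
    return focal * joints_cam[..., :2] / joints_cam[..., 2:3] + center

def loss_3d(joints_cam, joints_init):
    """Mean squared distance to the initial off-the-shelf HMR estimate."""
    return np.mean(np.sum((joints_cam - joints_init) ** 2, axis=-1))

def loss_2d(joints_cam, keypoints_2d, confidence, focal, center):
    """Confidence-weighted reprojection error against detected 2D keypoints."""
    residual = project_pinhole(joints_cam, focal, center) - keypoints_2d
    per_joint = np.sum(residual ** 2, axis=-1)
    return np.sum(confidence * per_joint) / np.sum(confidence)

def smoothness_prior(joints_world):
    """Hypothetical stand-in for the motion prior: penalizes joint acceleration.
    The method described above uses a motion diffusion model instead."""
    accel = joints_world[2:] - 2.0 * joints_world[1:-1] + joints_world[:-2]
    return np.mean(np.sum(accel ** 2, axis=-1))

def total_loss(joints_world, cam_rot, cam_trans, joints_init, keypoints_2d,
               confidence, focal, center,
               w_3d=1.0, w_2d=1.0, w_prior=1.0, prior_fn=smoothness_prior):
    """Weighted sum of the 3D, 2D reprojection, and motion prior terms.
    The weights are illustrative defaults, not values from the source."""
    joints_cam = world_to_camera(joints_world, cam_rot, cam_trans)
    return (w_3d * loss_3d(joints_cam, joints_init)
            + w_2d * loss_2d(joints_cam, keypoints_2d, confidence, focal, center)
            + w_prior * prior_fn(joints_world))
```

Evaluating the prior in world space while evaluating the 2D term in camera space is what allows such an objective to separate human motion from camera motion: the reprojection term ties the estimate to the image, and the prior favors a world-space trajectory that is plausible on its own. In a real pipeline the joints would come from a differentiable body model and the objective would be minimized with an autodiff framework rather than evaluated once in NumPy.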
Qualitative Results:
We present body mesh renderings from GLAMR, SLAHMR, and DiffOpt on five distinct EMDB motion sequences. For each motion sequence, we show renderings both overlaid on the original monocular video and in an empty static world view.
Outdoor Walk
GLAMR struggles to cover the human body fully, and the occluded right arm suffers from heavy jittering
SLAHMR shows minor inaccuracies in root orientation by the end of the sequence, but its foot position may be the most accurate of the three
DiffOpt has the most accurate root orientation throughout the entire sequence, but its foot position is slightly off after the body turns
GLAMR's entire mesh jitters heavily, and the global translation is inconsistent with the joint articulations, causing the human to appear to "moonwalk"
SLAHMR shows minor inaccuracies in root orientation after the body turns
DiffOpt has the most accurate root orientation throughout the entire sequence, and the walking motion looks highly realistic and accurate
Outdoor Warm-up
GLAMR has highly inaccurate joint articulations, and the mesh shows poor coverage of the human body
SLAHMR exhibits a severe failure mode in which the entire body jitters, causing highly implausible instantaneous global translations
DiffOpt clearly has the most accurate and realistic joint articulations and replicates the human motion the best
GLAMR suffers from heavy jittering and does not accurately capture the global translation of the human body
SLAHMR's failure mode is much more apparent in the static world rendering, as the rendered mesh shifts instantaneously multiple times
DiffOpt depicts by far the smoothest and most realistic global motion, although the magnitude of the global translation leaves something to be desired when the body moves to its right
Outdoor Lunges
GLAMR struggles with jittering in the leg region even though both legs are firmly planted on the ground during the lunge, and the mesh often fails to cover the upper body entirely
Both SLAHMR and DiffOpt accurately and realistically portray the lunge motion depicted in the sequence; distinguishing the quality of the two renderings is difficult
GLAMR demonstrates a failure mode where it completely fails to capture the global translation of the body, as we see the mesh levitate and fall back on the ground
SLAHMR recovers a highly accurate and realistic global motion of a lunge
DiffOpt produces joint articulations highly similar to those of SLAHMR, but does not capture the downward motion of the torso during a lunge as accurately as SLAHMR
Outdoor Soccer Warm-up
GLAMR has the least accurate joint articulation throughout the entire sequence, causing the mesh to rarely cover the full body
SLAHMR occasionally shows minor inaccuracies in foot position, but the quality of the recovered motion is similar to that of DiffOpt for the most part
DiffOpt's mesh looks nearly flawless throughout the motion sequence in terms of the body's orientation (it captures the twisting of the body well), foot position, and joint articulations
GLAMR recovers a global translation that is completely inconsistent with the actual motion depicted in the video
SLAHMR exhibits the least foot sliding, but the body doesn't translate enough
DiffOpt and SLAHMR are level in terms of recovering accurate joint articulations; however, DiffOpt has a more accurate global translation at the cost of more foot sliding
Outdoor Circular Run
GLAMR once again struggles to cover the human body fully due to inaccurate joint articulations
SLAHMR and DiffOpt demonstrate comparable accuracy in joint articulation and root orientation, but DiffOpt does a better job portraying the downward ducking/crouching motion at the end
GLAMR once again exhibits a failure mode similar to the one in the previous sequence, suffering from heavy jittering and completely inaccurate global translation
SLAHMR shows minor inaccuracies in joint articulation for the left arm, which is partially occluded throughout the sequence, and the final crouching motion isn't accurate at all
DiffOpt has the most accurate joint articulation throughout the entire sequence
BibTeX:
@misc{heo2024motiondiffusionguided3dglobal,
title={Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera},
author={Jaewoo Heo and Kuan-Chieh Wang and Karen Liu and Serena Yeung-Levy},
year={2024},
eprint={2411.10582},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.10582},
}