Redefining Data Pairing for Motion Retargeting
Leveraging a Human Body Prior
Xiyana Figuera*, Soogeun Park*, and Hyemin Ahn
*These authors equally contributed to this work.
Our paper was accepted to IROS 2024! 🎉
We propose MR.HuBo (Motion Retargeting leveraging a Human Body prior), a cost-effective and convenient method to collect high-quality paired <Human, Robot> upper-body pose data, which is essential for data-driven motion retargeting methods. Unlike existing approaches, which collect <Human, Robot> pose data by converting human MoCap poses into robot poses, our method goes in reverse: we first sample diverse random robot poses and then convert them into human poses. However, since random robot poses can result in extreme and infeasible human poses, we propose an additional technique to filter out extreme poses by exploiting a human body prior trained on a large amount of human pose data. Our data collection method can be used for any humanoid robot, provided one designs or optimizes the system's hyperparameters, which include a size scale factor and the joint angle ranges for sampling. In addition to this data collection method, we present a two-stage motion retargeting neural network that can be trained via supervised learning on a large amount of paired data. Compared to other learning-based methods trained via unsupervised learning, our deep neural network trained with ample high-quality paired data achieves notable performance. Our experiments also show that our data filtering method yields better retargeting results than training the model on raw, noisy data.
MR. HuBo is a novel framework for generating a paired <Human, Robot> pose dataset utilizing VPoser [2], a variational autoencoder that has learned a human body prior distribution.
Conventional pairing methods start with a human pose and find a corresponding robot pose. These methods, however, offer limited diversity because they require expensive equipment, such as a MoCap suit, to collect human poses, as well as a human to perform the motions during data collection. Moreover, they are time-consuming, since they solve inverse kinematics in the robot pose domain using optimization-based methods.
In contrast, MR. HuBo addresses this problem by using VPoser: it starts with a robot pose and finds the corresponding human pose. The next section explains what VPoser is and how we use it.
The following describes the process of collecting <q, R, H> data pairs, where q is a vector of robot joint angles, R is a matrix of robot link orientations in the 6D rotation representation [3], and H is a matrix of human joint orientations in the same 6D representation (a code sketch follows the steps):
Randomly sample a robot pose q within the valid min–max joint angle range.
Calculate the XYZ positions P and (6D) orientations R of the robot links by solving forward kinematics.
Obtain the human pose H by using VPoser_IK as an inverse kinematics solver.
Obtain the denoised human pose by passing H through the encoder and decoder of VPoser.
Calculate the reconstruction error between the original and denoised human poses; append <q, R, H> to the dataset if the error is below the threshold.
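To make these steps concrete, here is a minimal Python sketch of the collection loop. The joint limits and the helpers sample_random_q, robot_fk, vposer_ik, and vposer_denoise are illustrative placeholders rather than our released code; a real implementation would plug in the robot's URDF-based forward kinematics and a pretrained VPoser model.

import numpy as np

# Illustrative placeholders -- not the released MR.HuBo code.
JOINT_LIMITS = np.array([[-1.5, 1.5]] * 8)  # per-joint (min, max) in rad; robot specific

def rotmat_to_6d(rotmats):
    # 6D rotation representation of Zhou et al. [3]: the first two
    # columns of each 3x3 rotation matrix, concatenated.
    return np.concatenate([rotmats[..., :, 0], rotmats[..., :, 1]], axis=-1)

def sample_random_q(limits, rng):
    # Step 1: uniformly sample a robot pose q within the valid joint range.
    return rng.uniform(limits[:, 0], limits[:, 1])

def robot_fk(q):
    # Step 2 (placeholder): forward kinematics returning link positions P (N x 3)
    # and link rotation matrices (N x 3 x 3); a real version would use the URDF.
    n = len(q)
    return np.zeros((n, 3)), np.tile(np.eye(3), (n, 1, 1))

def vposer_ik(P):
    # Step 3 (placeholder): VPoser-based IK mapping link positions to a
    # human pose H (21 joints x 6D orientation).
    return np.zeros((21, 6))

def vposer_denoise(H):
    # Step 4 (placeholder): encode/decode H through VPoser to pull it
    # toward the learned human pose prior.
    return H

def collect_pairs(n_samples, threshold, seed=0):
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_samples):
        q = sample_random_q(JOINT_LIMITS, rng)
        P, rotmats = robot_fk(q)
        R = rotmat_to_6d(rotmats)                    # link orientations in 6D
        H = vposer_ik(P)
        err = np.linalg.norm(H - vposer_denoise(H))  # step 5: reconstruction error
        if err < threshold:                          # keep only natural poses
            data.append((q, R, H))
    return data

The threshold on the reconstruction error is a hyperparameter; VPoser itself is described next.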
Thanks to the work of Pavlakos et al. [2], we were able to propose MR. HuBo.
VPoser is a variational autoencoder that has learned a prior distribution of SMPL [4] pose parameters from a massive amount of human pose data [5]. We use VPoser as (1) an inverse kinematics (IK) solver and (2) a denoising network for human body pose data. Given joint positions, VPoser can return the corresponding SMPL pose parameters. We can also feed a human pose through VPoser's encoder and decoder, which outputs a pose close to VPoser's learned prior distribution. A natural input pose comes back almost unchanged, because a similar pose exists in the prior distribution, whereas an extreme input pose comes back substantially altered.
Given the positions of body joints, VPoser returns a human pose H (joint orientations in the 6D representation).
Fast performance enabled by GPU-accelerated parallel processing.
Obtain the denoised output (i.e., a human pose close to the prior distribution) by passing the pose through the encoder and decoder.
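As a sketch of how the denoising pass and the filtering rule fit together, the snippet below assumes a pretrained VPoser v2 model loaded via the human_body_prior package, whose encode() returns a latent distribution and whose decode() returns a dictionary containing 'pose_body'; if your installed version exposes a different interface, adapt the calls accordingly.

import torch

def denoise_with_vposer(vposer, pose_body):
    # One encoder/decoder pass that pulls a body pose toward VPoser's prior.
    # `pose_body` is a (B, 63) axis-angle body pose, the input VPoser v2
    # expects (an assumption -- check your installed human_body_prior version).
    with torch.no_grad():
        q_z = vposer.encode(pose_body)       # approximate posterior over latent z
        z = q_z.mean                         # take the mean for a deterministic pass
        rec = vposer.decode(z)['pose_body']  # decoded pose near the prior, (B, 21, 3)
    return rec.reshape(pose_body.shape)

def is_natural(vposer, pose_body, threshold):
    # Filtering rule from the pipeline above: keep a pose only when its
    # reconstruction error under the prior is below the threshold
    # (a hyperparameter whose value is robot and dataset specific).
    rec = denoise_with_vposer(vposer, pose_body)
    return torch.norm(pose_body - rec) < threshold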
With a large amount of high-quality paired data, we can train a motion retargeting network via supervised learning. We therefore propose a two-stage network that solves the motion retargeting task in real time.
The two-stage network consists of a pre-network (F_pre) and a post-network (F_post). The pre-network learns a mapping from human joint orientations to robot link orientations in the 6D representation (H -> R), and the post-network learns a mapping from link orientations to joint angles in the robot space (R -> q), as sketched below.
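Below is a minimal PyTorch sketch of the two stages as plain multilayer perceptrons. The widths, depths, and activations are illustrative assumptions, not our exact architecture; only the interfaces (H -> R -> q) follow the paper.

import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    # q = F_post(F_pre(H)); each stage is shown here as a small MLP.
    def __init__(self, h_dim, r_dim, q_dim, hidden=256):
        super().__init__()
        # F_pre: human joint orientations H (6D) -> robot link orientations R (6D)
        self.f_pre = nn.Sequential(
            nn.Linear(h_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, r_dim),
        )
        # F_post: robot link orientations R -> robot joint angles q
        self.f_post = nn.Sequential(
            nn.Linear(r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, q_dim),
        )

    def forward(self, H):
        R = self.f_pre(H)
        q = self.f_post(R)
        return R, q

Because the dataset provides ground truth for both R and q, each stage can receive a direct supervised loss (e.g. loss = mse(R_pred, R) + mse(q_pred, q)), in contrast to self-supervised approaches such as [1].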
By placing PyMAF-X [6], which obtains SMPL parameters from an RGB image (F_smpl), in front of the two-stage network, we can generate robot motion from human images captured with an RGB camera.
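Putting the pieces together, per-frame inference could look like the sketch below, where f_smpl stands for a hypothetical wrapper around PyMAF-X that returns the human pose H in the 6D representation (the real PyMAF-X interface differs; this shows only the data flow), and model is the two-stage network above.

def retarget_frame(image, f_smpl, model):
    # RGB frame -> human pose (via an assumed PyMAF-X wrapper) -> robot joints.
    H = f_smpl(image)   # human joint orientations in 6D (the F_smpl stage)
    _, q = model(H)     # two-stage network: H -> R -> q
    return q            # joint angles to command on the robot's upper body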
Please cite our paper using the BibTeX entry below.
@inproceedings{MR_HuBo:2024,
  title     = {Redefining Data Pairing for Motion Retargeting Leveraging a Human Body Prior},
  author    = {Figuera, Xiyana and Park, Soogeun and Ahn, Hyemin},
  booktitle = {2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2024},
  month     = oct,
  address   = {Abu Dhabi, United Arab Emirates}
}
[1] S. Choi, M. J. Song, H. Ahn, and J. Kim, “Self-supervised motion retargeting with safety guarantee,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 8097–8103.
[2] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3D hands, face, and body from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10975–10985.
[3] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745–5753.
[4] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM Transactions on Graphics (Proc. SIGGRAPH Asia), vol. 34, no. 6, pp. 248:1–248:16, Oct. 2015.
[5] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5442–5451.
[6] H. Zhang, Y. Tian, Y. Zhang, M. Li, L. An, Z. Sun, and Y. Liu, “PyMAF-X: Towards well-aligned full-body model regression from monocular images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.