Wenshuai Zhao, Yi Zhao, Joni Pajarinen, Michael Muehlebach
CoRL 2024
Imitation learning from human motion capture (MoCap) data provides a promising way to train humanoid robots. However, due to differences in morphology, such as varying degrees of joint freedom and force limits, exact replication of human behaviors may not be feasible for humanoid robots. Consequently, incorporating physically infeasible MoCap data in training datasets can adversely affect the performance of the robot policy. To address this issue, we propose a bi-level optimization-based imitation learning framework that alternates between optimizing both the robot policy and the target MoCap data. Specifically, we first develop a generative latent dynamics model using a novel self-consistent auto-encoder, which learns sparse and structured motion representations while capturing desired motion patterns in the dataset. The dynamics model is then utilized to generate reference motions while the latent representation regularizes the bi-level motion imitation process. Experiments conducted on a simulated realistic humanoid robot demonstrate that our proposed method enhances the robot policy by modifying reference motions to be physically consistent.
The core idea of BMI is to use the latent space learned by SCAE to make the provided reference motions physically feasible for the robot while the robot learns to imitate them.
The figure above shows the structure of BMI. We first sample from the learned latent space p(z) and decode the latent sample into motions that serve as the target reference for robot imitation. The loss function of the decoder consists of two parts, indicated by red arrows: (1) the MSE loss between the robot trajectory and the decoded trajectory, and (2) the latent reconstruction error between the sampled latent embeddings and the embeddings of the decoded trajectories. The green arrows denote the frozen encoder networks, while the blue arrows represent the steps to be optimized. Specifically, we optimize the robot policy and the decoder alternately as a bi-level optimization problem.
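The alternation described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and not the BMI implementation: the frozen encoder is the identity, the fine-tuned decoder is a learnable offset `delta` added to the sampled latent, policy training at the lower level is replaced by its closed-form optimum under a hypothetical joint limit, and `lam` is a hypothetical weight on the latent term.

```python
import numpy as np

JOINT_LIMIT = 1.0  # hypothetical feasibility bound standing in for robot physics


def best_tracking(reference):
    """Lower level (stand-in for policy training): closest feasible trajectory."""
    return np.clip(reference, -JOINT_LIMIT, JOINT_LIMIT)


def decoder_loss(reference, robot_traj, z_sampled, z_reencoded, lam):
    """Upper-level loss: imitation MSE plus weighted latent reconstruction error."""
    imitation = np.mean((robot_traj - reference) ** 2)
    latent = np.mean((z_reencoded - z_sampled) ** 2)
    return imitation + lam * latent


def refine_reference(z, steps=200, lr=0.1, lam=0.5):
    """Alternate between the policy (lower level) and the decoder (upper level)."""
    delta = np.zeros_like(z)  # toy decoder parameters: reference = z + delta
    for _ in range(steps):
        reference = z + delta
        robot_traj = best_tracking(reference)  # lower-level step
        # Gradient of the decoder loss w.r.t. delta (up to the constant 1/N of
        # the mean), with the robot trajectory held fixed. The identity encoder
        # gives z_reencoded = reference, so the latent term reduces to delta**2.
        grad = 2.0 * (reference - robot_traj) + 2.0 * lam * delta
        delta = delta - lr * grad  # upper-level step
    reference = z + delta
    return reference, best_tracking(reference)


z = np.array([0.2, -0.5, 1.5])  # the last entry exceeds the joint limit
reference, robot_traj = refine_reference(z)
loss_before = decoder_loss(z, best_tracking(z), z, z, lam=0.5)
loss_after = decoder_loss(reference, robot_traj, z, reference, lam=0.5)
```

In this toy setting the feasible entries of the reference stay untouched, while the infeasible entry moves toward the joint limit and settles at about 1.17, held back by the latent term. This mirrors, in miniature, how latent regularization keeps refined motions close to the original data.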
We extend the dataset used in FLD by including four challenging motions. In total, our dataset consists of 13 motions and each motion consists of 10 trajectories. The left video samples one trajectory from each motion in the dataset. Note that each trajectory contains multiple motion trials. For example, the kick trajectory contains 7 trials.
Note that motions in the video are shown kinematically, ignoring physics. Some motions can be physically infeasible for the robot.
Our proposed self-consistent auto-encoder (SCAE) learns sparse and structured latent representations of human motions.
Our method BMI improves the baseline on several challenging motions. We qualitatively annotate the motions where our method shows improved behaviors. Note that the video is 1:43 long and the annotation starts from 0:24.
We conduct zero-shot transfer experiments on two modified robots, carrying an additional mass block of 1 kg and 5 kg on the back, respectively. For comparison, the upper leg of the robot weighs 2.63 kg, and the total weight of the robot is 24.25 kg. The video demonstrates that our policy exhibits sufficient robustness to these changes, as the robot successfully executes the targeted motions without significant performance loss.
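To put the added masses in proportion, the short calculation below uses only the figures quoted above. The function name `payload_stats` is introduced here purely for illustration.

```python
TOTAL_MASS_KG = 24.25     # total robot mass, from the text
UPPER_LEG_MASS_KG = 2.63  # upper-leg mass, from the text
PAYLOADS_KG = (1.0, 5.0)  # added mass blocks, from the text


def payload_stats(payload_kg):
    """Return (percent of total robot mass, ratio to upper-leg mass)."""
    share_of_total = 100.0 * payload_kg / TOTAL_MASS_KG
    ratio_to_upper_leg = payload_kg / UPPER_LEG_MASS_KG
    return share_of_total, ratio_to_upper_leg


for payload in PAYLOADS_KG:
    share, ratio = payload_stats(payload)
    print(f"{payload:.0f} kg: {share:.1f}% of total mass, "
          f"{ratio:.2f}x the upper leg")
```

The blocks correspond to roughly 4% and 21% of the robot's total mass, and the heavier block weighs nearly twice as much as one upper leg.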
Even with the additional mass, the robot completes all the motions without obvious performance loss.
The increased inertia poses challenges when the robot is required to execute a sharp transition from a stride to a step. However, the robot demonstrates an agile braking maneuver, enabling it to effectively halt and make the transition (see 1:35).
We consider the stride motion as an example. The right video illustrates the trajectories refined by BMI. Thanks to the latent space regularization, the fine-tuned decoder maintains most motions largely unchanged, making only minor adjustments to certain aspects of the movement to improve physical consistency with the robot dynamics. Note that the states (position and rotation) of the robot base are manually configured, as the latent dynamics model is trained exclusively on proprioceptive information.
One of the primary advantages of employing structured latent dynamics is the ability to synthesize new motions by interpolating the latent parameters. The subsequent videos demonstrate adjustments to the run motions. In the video on the left, the robot exhibits smaller step sizes as we decrease the latent amplitude parameters in the SCAE. Similarly, the right video shows that the robot steps less frequently when the latent frequency parameters are reduced. Note that it is possible to try more sophisticated methods for interpolating the latent space.
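The amplitude and frequency adjustments described above can be sketched as follows. The sketch assumes each latent channel is a sinusoid parameterized by frequency, amplitude, offset, and phase; the function name and all numeric values are hypothetical, and the trained decoder that maps latents back to joint motion is not shown.

```python
import numpy as np


def latent_trajectory(freq, amp, offset, phase, t):
    """Sinusoidal latent channels, one row per channel (assumed parameterization)."""
    angle = 2.0 * np.pi * (freq[:, None] * t[None, :] + phase[:, None])
    return amp[:, None] * np.sin(angle) + offset[:, None]


t = np.linspace(0.0, 2.0, 200)  # time in seconds
freq = np.array([1.5, 3.0])     # Hz, hypothetical values
amp = np.array([1.0, 0.4])
offset = np.array([0.0, 0.1])
phase = np.array([0.0, 0.25])   # in cycles

base = latent_trajectory(freq, amp, offset, phase, t)
# Halving the amplitudes shrinks the oscillation around each channel's offset.
smaller_steps = latent_trajectory(freq, 0.5 * amp, offset, phase, t)
# Halving the frequencies stretches each cycle to twice its duration.
slower_steps = latent_trajectory(0.5 * freq, amp, offset, phase, t)
```

Scaling all channels by a single factor is the simplest possible choice; per-channel factors or blends between the parameters of two motions would follow the same pattern.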
This ablation study investigates sensitivity to the coefficient of the latent reconstruction error, denoted β, when learning latent dynamics models. This coefficient is the only hyperparameter introduced by SCAE. Note that FLD corresponds to β=0, whereas SCAE uses β=1.
The left figure illustrates the latent reconstruction error during training. It is evident that within a wide range of β values (0.1-5), the latent reconstruction loss effectively improves the latent reconstruction while maintaining a similar motion reconstruction capability, as demonstrated in the right figure. However, it is also observed that when β=10, there is a slight increase in the motion reconstruction error. In general, we can conclude that the proposed SCAE is robust to a large range of β.
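The role of β in this ablation can be made concrete with a minimal sketch. It assumes linear encoder and decoder matrices and uses a plain reconstruction MSE as a stand-in for the motion loss used in FLD; the name `scae_style_loss` and the random data are hypothetical.

```python
import numpy as np


def scae_style_loss(x, enc, dec, beta):
    """Motion reconstruction error plus beta times latent reconstruction error."""
    z = x @ enc          # encode motions: (N, D) @ (D, K) -> (N, K)
    x_hat = z @ dec      # decode latents back to motion space: (N, K) @ (K, D)
    z_hat = x_hat @ enc  # re-encode the decoded motions
    motion_err = np.mean((x_hat - x) ** 2)
    latent_err = np.mean((z_hat - z) ** 2)
    return motion_err + beta * latent_err, motion_err, latent_err


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))    # 8 toy motion frames with 6 features each
enc = rng.normal(size=(6, 3))  # toy linear encoder
dec = rng.normal(size=(3, 6))  # toy linear decoder

# beta = 0 leaves only the motion term (the FLD-like case in this sketch).
total_fld, motion_err, latent_err = scae_style_loss(x, enc, dec, beta=0.0)
# beta = 1 adds the self-consistency term with unit weight (the SCAE setting).
total_scae, _, _ = scae_style_loss(x, enc, dec, beta=1.0)
```

The latent term vanishes when re-encoding a decoded latent returns the same latent, which is the self-consistency property the coefficient β rewards.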