Constrained Style Learning from Imperfect Demonstrations under Task Optimality
Kehan Wen, Chenhao Li, Junzhe He, Marco Hutter
Robotic Systems Lab, ETH AI Center, ETH Zurich
Abstract:
Learning from demonstration has proven effective in robotics for acquiring natural behaviors, such as stylistic motions and lifelike agility, particularly when explicitly defining style-oriented reward functions is challenging. Synthesizing stylistic motions for real-world tasks usually requires balancing task performance and imitation quality. Existing methods generally depend on expert demonstrations closely aligned with task objectives. However, practical demonstrations are often incomplete or unrealistic, causing current methods to boost style at the expense of task performance. To address this issue, we propose formulating the problem as a constrained Markov Decision Process (CMDP). Specifically, we optimize a style-imitation objective under constraints that maintain near-optimal task performance. We introduce an adaptively adjusted Lagrangian multiplier that guides the agent to imitate demonstrations selectively, capturing stylistic nuances without compromising task performance. We validate our approach across multiple robotic platforms and tasks, demonstrating both robust task performance and high-fidelity style learning. On ANYmal-D hardware, we show a 14.5% drop in mechanical energy and a more agile gait pattern, showcasing real-world benefits.
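As a rough sketch of the constrained formulation described above (the notation below is ours: J^s and J^g denote the style and task returns, \epsilon a tolerance, and J^{g*} the optimal task return; the paper's exact symbols and thresholds may differ):

```latex
% Style imitation under a near-optimality constraint on the task return
\max_{\pi} \; J^{s}(\pi)
\quad \text{s.t.} \quad
J^{g}(\pi) \;\ge\; (1 - \epsilon)\, J^{g*}

% Lagrangian relaxation with an adaptively adjusted multiplier \lambda \ge 0
\mathcal{L}(\pi, \lambda) \;=\; J^{s}(\pi) \;+\; \lambda \left( J^{g}(\pi) - (1 - \epsilon)\, J^{g*} \right)
```

Intuitively, the multiplier grows when the task constraint is violated, pulling the policy back toward task optimality, and shrinks when the constraint is satisfied, freeing the policy to imitate the demonstrations more closely.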
Overview:
Given a reference dataset, ConsMimic computes a style reward R^s using either (a) motion-clip tracking or (b) adversarial imitation learning. The style reward R^s is combined with the task reward R^g within the constrained optimization framework illustrated in the red frame. Separate critic networks estimate task and style advantages, which are then weighted by a self-adjustable Lagrangian multiplier and used to optimize the policy.
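A minimal sketch of how such a multiplier-weighted advantage and its update could look, assuming a standard Lagrangian-style policy-gradient setup; the function names, the specific weighting, and the dual update rule below are our illustration, not the released ConsMimic code:

```python
def combined_advantage(adv_style, adv_task, lam):
    """Blend style and task advantages with the Lagrangian multiplier.
    Standard Lagrangian form A^s + lam * A^g; ConsMimic's exact
    combination (e.g., normalization) may differ."""
    return adv_style + lam * adv_task


def update_multiplier(lam, task_return, task_threshold, step_size=1e-3):
    """Dual-ascent-style update: increase lam when the task return falls
    below the near-optimality threshold (re-emphasizing the task), and
    decrease it otherwise (allowing more selective style imitation)."""
    violation = task_threshold - task_return
    return max(0.0, lam + step_size * violation)
```

In a PPO-style algorithm, the combined advantage would take the place of the usual task advantage in the policy loss, while the multiplier is updated once per iteration from the measured task return.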
Results:
We visualize the GR1 and ANYmal locomotion performance, in simulation and the real world, of policies trained with ConsMimic and a baseline policy trained solely on task rewards.
For GR1 on flat ground:
Compared to the task-only baseline, ConsMimic achieves more natural motions, such as coordinated arm-leg movements, fewer unnecessary knee movements, and less drift when following a straight-line velocity command.
For GR1 on challenging terrains (stairs & stones):
ConsMimic demonstrates more agile motions when traversing these terrains. We furthermore show visualization results for our fixed-imitation-weight baselines:
The baseline with a small imitation weight shows poor style fidelity, while a large imitation weight prevents task completion. We test ConsMimic on ANYmal-D hardware:
We command the robot to perform eight return episodes and find that the motions achieved by ConsMimic are more energy-efficient, with more agile gaits.