Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations

Autonomous Learning Group, Max Planck Institute for Intelligent Systems, Germany

Robotic Systems Lab, ETH Zurich, Switzerland

Best Paper Award Finalist, CoRL 2022

Abstract

Learning agile skills is one of the main challenges in robotics. To this end, reinforcement learning approaches have achieved impressive results. These methods require explicit task information in terms of a reward function or an expert that can be queried in simulation to provide a target control output, which limits their applicability. In this work, we propose a generative adversarial method for inferring reward functions from partial and potentially physically incompatible demonstrations, enabling successful skill acquisition where reference or expert demonstrations are not easily accessible. Moreover, we show that by using a Wasserstein GAN formulation and transitions from demonstrations with rough and partial information as input, we are able to extract policies that are robust and capable of imitating demonstrated behaviors. Finally, the obtained skills, such as a backflip, are tested on an agile quadruped robot called Solo 8 and show faithful replication of hand-held human demonstrations.

WASABI

In this work, we present a novel adversarial imitation learning method named Wasserstein Adversarial Behavior Imitation (WASABI).

Oral Presentation

This work was presented at the Conference on Robot Learning (CoRL) 2022 in Auckland, New Zealand, where it was selected as a finalist for the Best Paper Award.

Overview

Given a reference dataset that defines the desired base motion, the system trains a discriminator that provides an imitation reward for policy training. This imitation reward is combined with a regularization reward and a termination penalty to train a policy that lets the robot replicate the demonstrated motion while maintaining feasible and stable joint actuation.

WASABI Overview
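Below is a minimal sketch, in PyTorch, of how a Wasserstein-style discriminator over base-state transitions could be trained and used as an imitation reward. The network size, gradient-penalty form, and function names are illustrative assumptions rather than the exact WASABI implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores transitions of (partial) base states; higher = closer to the demonstration."""
    def __init__(self, transition_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # unbounded scalar output (no sigmoid)
        )

    def forward(self, x):
        return self.net(x)

def discriminator_loss(disc, demo_batch, policy_batch, gp_weight=10.0):
    """Wasserstein-style objective: push demonstration scores up and policy scores
    down, with a zero-centered gradient penalty on demonstration samples for stability."""
    wloss = disc(policy_batch).mean() - disc(demo_batch).mean()
    demo = demo_batch.clone().requires_grad_(True)
    grad = torch.autograd.grad(disc(demo).sum(), demo, create_graph=True)[0]
    grad_penalty = grad.square().sum(dim=-1).mean()
    return wloss + gp_weight * grad_penalty

def imitation_reward(disc, transition_batch):
    """The raw (unbounded) discriminator score serves as the imitation reward."""
    with torch.no_grad():
        return disc(transition_batch).squeeze(-1)
```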

Evaluation

We evaluate WASABI on Solo 8, an open-source research quadruped robot capable of a wide range of physical actions, both in simulation and on the real system. For this evaluation, we introduce four different tasks.

SOLOLEAP

Moving forward with a jumping motion.

SOLOWAVE

Producing wave-like locomotion behavior.

SOLOSTANDUP

Standing up on the hind legs.

SOLOBACKFLIP

Generating a full backflip.

Cross-Platform Imitation

Using the reference motions recorded with Solo 8, we apply WASABI to ANYmal. Since only base information is used, WASABI enables cross-platform skill imitation. With a manual offset on the base height to account for the different sizes of the two platforms, we achieve decent wave behaviors on ANYmal.
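One simple way to realize such a manual base-height offset is to shift the height channel of the recorded reference states before training. The offset value, array layout, and helper name below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def offset_base_height(reference_states, height_offset, height_index=2):
    """Shift the base-height channel of the recorded reference trajectory by a
    hand-tuned offset to compensate for the size difference between platforms."""
    shifted = reference_states.copy()
    shifted[:, height_index] += height_offset
    return shifted

# Example: raise the Solo 8 reference base height before training on ANYmal.
solo_reference = np.zeros((1000, 13))            # placeholder base-state trajectory
anymal_reference = offset_base_height(solo_reference, height_offset=0.25)
```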

ANYMALWAVE

The corresponding wave motion learned on ANYmal from reference data recorded with Solo 8.

ANYMALBACKFLIP

The corresponding backflip motion learned on ANYmal from reference data recorded with Solo 8.

Extensions

We show some further extensions of our work and its potential to learn cross-platform skills.

SOLOWAVE - Velocity Control

Diversity in the reference dataset allows more freedom in motion control. The variance in demonstration speed offers the possibility of active velocity control in locomotion tasks such as SOLOWAVE.

SOLOBACKFLIP - Single Flip

With an additional variable in the policy's observation space indicating whether the robot has finished a backflip, active execution of a single flip is achieved.
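A minimal sketch of how such a flip-completion flag could be appended to the policy observation; the observation layout and flag logic are illustrative assumptions.

```python
import numpy as np

def build_policy_observation(proprio_obs, flip_completed):
    """Append a binary flag telling the policy whether the backflip has already
    been executed, so it can switch from flipping to landing and standing."""
    flag = np.array([1.0 if flip_completed else 0.0])
    return np.concatenate([proprio_obs, flag])

# Example: the flag is set once a full pitch rotation is detected and then kept
# for the remainder of the episode.
obs = build_policy_observation(np.zeros(40), flip_completed=True)
```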

Reward Ablation

We present motion comparisons to highlight the importance of the regularization and termination rewards.

Regularization Reward

As in many reinforcement learning control methods for robotics, regularization terms encourage stable and smooth robot motions. For the best performance on the real robot, they are tuned specifically for each robot platform, each task, and reference motions of differing quality.
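As one illustration, regularization rewards in legged-robot RL typically combine penalties on joint torque, action rate, and joint acceleration. The specific terms and weights below are illustrative assumptions, not the values used in WASABI.

```python
import numpy as np

def regularization_reward(torques, action, prev_action, joint_vel, prev_joint_vel, dt,
                          w_torque=-1e-4, w_action_rate=-1e-2, w_joint_acc=-2.5e-7):
    """Smoothness and feasibility penalties: joint torque magnitude, action rate,
    and joint acceleration. All weights are negative, so larger magnitudes lower
    the total reward."""
    r_torque = w_torque * np.sum(np.square(torques))
    r_action_rate = w_action_rate * np.sum(np.square(action - prev_action))
    r_joint_acc = w_joint_acc * np.sum(np.square((joint_vel - prev_joint_vel) / dt))
    return r_torque + r_action_rate + r_joint_acc
```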

Termination Reward

The termination reward prevents the learning agent from actively ending the episode. Since the imitation reward can be negative due to the unbounded discriminator output, this term makes early termination a worse option than continuing to learn from the reference motion.
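A minimal sketch of how the per-step reward could be assembled so that early termination is strictly worse than continuing; the penalty magnitude is an illustrative assumption.

```python
def step_reward(r_imitation, r_regularization, terminated, termination_penalty=-100.0):
    """Per-step reward: imitation plus regularization, with a large negative
    penalty whenever the robot triggers early termination (e.g. by falling),
    so ending the episode is always worse than continuing."""
    reward = r_imitation + r_regularization
    if terminated:
        reward += termination_penalty
    return reward
```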

Policy Observation Ablation

We present motion comparisons to highlight the importance of the base information in the observation space of the control policies.

Only Joint Information

For highly dynamic motions, joint information alone does not allow the robot to distinguish states that require different reactions. For instance, perceiving only the joint states does not reveal whether the robot has successfully taken off in SOLOBACKFLIP.