Learning Robust, Agile, Natural Legged Locomotion Skills in the Wild

Yikai Wang*, Zheyuan Jiang*, Jianyu Chen

*joint first authors

Tsinghua University 

CoRL 2023 Workshop on Robot Learning in Athletics 

[paper]


Video: Overview

Video: Demonstration of Single Gait Pattern 

Turn

Gallop

High speed.

Trot

Normal speed.

Pace

Low speed.

Video: Adaptive Transition/Integration of Gait Patterns 

When obstacles need to be overcome, new gait styles emerge.

When its running is disrupted, the robot restores balance through high-frequency, short steps of the front legs.

Video: Ablation Study

Without dataset mirroring

As the speed increases, the robot sometimes tilts toward the right side.

Without applied torques as privileged information during training

The robot cannot restore balance when galloping on the soccer pitch.

Video: Real World Experiments

Outdoor Experiments

Outdoor experiments on a soccer pitch, grassland, and rocks.

Indoor Experiments

Implementation Details

State and Action Spaces

The output action a_t comprises a 12-dim target joint angle vector. The observation o_t is a 46-dim vector containing the 3-dim velocity command, 12-dim joint positions, 12-dim joint velocities, 3-dim projected gravity, 4-dim binary foot-contact states, and 12-dim last actions. The privileged information x_t is a 233-dim vector that includes the linear and angular velocity in the base frame (6-dim), friction coefficient, measured heights of surrounding points (187-dim), external torque applied to the base (2-dim), stiffness and damping of each motor (24-dim), added mass to the base, and foot contact forces (4-dim). The encoder takes x_t as input, while the predictor takes the observation history (o_{t-50}, o_{t-49}, ..., o_t) as input.
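As a concrete illustration, here is a minimal sketch of how the 46-dim observation could be assembled (assuming PyTorch; the function and argument names, and the concatenation order, are illustrative and not taken from the released code):

import torch

def build_observation(cmd_vel, joint_pos, joint_vel, projected_gravity,
                      foot_contacts, last_actions):
    """Concatenate the proprioceptive terms into the 46-dim observation o_t.

    cmd_vel:           (3,)  velocity command
    joint_pos:         (12,) joint positions
    joint_vel:         (12,) joint velocities
    projected_gravity: (3,)  gravity projected into the base frame
    foot_contacts:     (4,)  binary foot-contact states
    last_actions:      (12,) previous target joint angles
    """
    o_t = torch.cat([cmd_vel, joint_pos, joint_vel, projected_gravity,
                     foot_contacts.float(), last_actions], dim=-1)
    assert o_t.shape[-1] == 46
    return o_t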

Network Architecture

The teacher encoder is a 2-layer multi-layer perceptron (MLP) that takes the privileged information x_t (233-dim) as input and outputs the latent vector z_t (8-dim). The hidden layers have dimensions [256, 128].

The base policy is a 3-layer MLP that takes the current observation o_t (46-dim) and the latent vector z_t as input and outputs a 12-dim target joint angle vector. The hidden layers have dimensions [512, 256, 128].
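Below is a minimal PyTorch sketch of the teacher encoder and base policy described above. The layer sizes follow the text; the class names and the choice of ELU activations are assumptions:

import torch
import torch.nn as nn

class TeacherEncoder(nn.Module):
    """2-layer MLP: privileged information x_t (233-dim) -> latent z_t (8-dim)."""
    def __init__(self, priv_dim=233, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(priv_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x_t):
        return self.net(x_t)

class BasePolicy(nn.Module):
    """3-layer MLP: observation o_t (46-dim) + latent z_t (8-dim) -> 12 target joint angles."""
    def __init__(self, obs_dim=46, latent_dim=8, act_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, o_t, z_t):
        return self.net(torch.cat([o_t, z_t], dim=-1))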

The student predictor begins by encoding each observation from recent steps into a 32-dimensional representation. Next, a one-dimensional convolutional neural network (1-D CNN) convolves these representations along the time dimension. The layer configurations (input channels, output channels, kernel size, stride) are [32, 32, 8, 4], [32, 32, 5, 1], and [32, 32, 5, 1]. The flattened output from the CNN is then passed through a linear layer to predict \hat{z}_t.
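A corresponding sketch of the student predictor, assuming a history of 51 observations (o_{t-50} through o_t) and ELU activations; the flattened CNN output size is inferred at construction time rather than hard-coded:

import torch
import torch.nn as nn

class StudentPredictor(nn.Module):
    """Encode each recent observation to 32-dim, convolve over time, predict z_hat_t."""
    def __init__(self, obs_dim=46, hist_len=51, latent_dim=8):
        super().__init__()
        # Per-step observation encoder (32-dim embedding per time step).
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, 32), nn.ELU())
        # 1-D CNN over time with (in_ch, out_ch, kernel, stride) =
        # (32, 32, 8, 4), (32, 32, 5, 1), (32, 32, 5, 1).
        self.cnn = nn.Sequential(
            nn.Conv1d(32, 32, kernel_size=8, stride=4), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=1), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=1), nn.ELU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy pass, then add the final linear layer.
        with torch.no_grad():
            flat_dim = self.cnn(torch.zeros(1, 32, hist_len)).shape[-1]
        self.head = nn.Linear(flat_dim, latent_dim)

    def forward(self, obs_history):
        # obs_history: (batch, hist_len, obs_dim)
        h = self.obs_encoder(obs_history)   # (batch, hist_len, 32)
        h = h.transpose(1, 2)               # (batch, 32, hist_len) for Conv1d
        return self.head(self.cnn(h))       # (batch, latent_dim)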

The discriminator employs an MLP with hidden layers of size [1024, 512].
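For completeness, a matching sketch of the discriminator MLP; the transition features it scores (e.g., an AMP-style pair of consecutive states) are not specified here, so transition_dim is left as a parameter:

import torch.nn as nn

class Discriminator(nn.Module):
    """MLP with hidden layers [1024, 512] that maps a transition to a scalar score."""
    def __init__(self, transition_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, 1024), nn.ELU(),
            nn.Linear(1024, 512), nn.ELU(),
            nn.Linear(512, 1),
        )

    def forward(self, transition):
        return self.net(transition)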

Learning Algorithm

We utilized Proximal Policy Optimization (PPO) as the reinforcement learning algorithm to train both the base policy and the teacher encoder concurrently. The training process consisted of 50,000 iterations, with each iteration collecting a batch of 131,520 state transitions. These transitions were evenly divided into 4 mini-batches for processing. To maintain a desired KL divergence of KL_desired = 0.01, we automatically tuned the learning rate using the adaptive LR scheme proposed by [35]. The PPO clip threshold was set to 0.2. For the generalized advantage estimation [36], we set the discount factor γ to 0.99 and the parameter λ to 0.95.
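For reference, the reported hyperparameters collected in one place, together with one common form of a KL-adaptive learning-rate rule (a sketch; the paper follows the scheme of [35], whose exact thresholds may differ):

# Summary of the PPO hyperparameters reported above (field names are illustrative).
PPO_CONFIG = {
    "iterations": 50_000,
    "transitions_per_iteration": 131_520,
    "num_mini_batches": 4,
    "desired_kl": 0.01,   # target KL divergence for the adaptive learning rate
    "clip_param": 0.2,    # PPO clip threshold
    "gamma": 0.99,        # discount factor for GAE
    "lam": 0.95,          # GAE lambda
}

def adapt_learning_rate(lr, kl, desired_kl=0.01, factor=1.5,
                        lr_min=1e-5, lr_max=1e-2):
    """Shrink the LR when the observed KL is too large, grow it when too small."""
    if kl > desired_kl * 2.0:
        lr = max(lr_min, lr / factor)
    elif kl < desired_kl / 2.0:
        lr = min(lr_max, lr * factor)
    return lr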

To optimize the objective defined in Eq. (2), we trained the discriminator using supervised learning. We set the gradient penalty weight to w_gp = 10, the style reward weight to w_s = 0.65, and the task reward weight to w_g = 0.35.
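A sketch of the corresponding discriminator update and reward mixing, assuming an AMP-style least-squares objective with a gradient penalty on the demonstration samples (the precise form of Eq. (2) may differ):

import torch
import torch.nn.functional as F

W_GP, W_STYLE, W_TASK = 10.0, 0.65, 0.35  # weights reported in the text

def discriminator_loss(disc, demo_batch, policy_batch):
    """Least-squares discriminator loss with a gradient penalty on demo samples."""
    demo_batch = demo_batch.requires_grad_(True)
    d_demo, d_policy = disc(demo_batch), disc(policy_batch)
    pred_loss = (F.mse_loss(d_demo, torch.ones_like(d_demo)) +
                 F.mse_loss(d_policy, -torch.ones_like(d_policy)))
    grad = torch.autograd.grad(d_demo.sum(), demo_batch, create_graph=True)[0]
    grad_penalty = grad.square().sum(dim=-1).mean()
    return pred_loss + 0.5 * W_GP * grad_penalty

def mixed_reward(style_reward, task_reward):
    """Combine style and task rewards with the weights w_s and w_g."""
    return W_STYLE * style_reward + W_TASK * task_reward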

The student predictor was trained with supervised learning, minimizing the mean squared error (MSE) loss between the latent vector z_t output by the teacher encoder and the predicted latent vector \hat{z}_t output by the student predictor.
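A minimal sketch of one distillation step (function and variable names are illustrative):

import torch
import torch.nn.functional as F

def distillation_step(teacher_encoder, student_predictor, optimizer,
                      priv_info, obs_history):
    """Regress the student's prediction z_hat_t onto the teacher's latent z_t."""
    with torch.no_grad():
        z_t = teacher_encoder(priv_info)       # target latent from privileged info
    z_hat_t = student_predictor(obs_history)   # prediction from observation history
    loss = F.mse_loss(z_hat_t, z_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The optimizer here could be, e.g., torch.optim.Adam(student_predictor.parameters(), betas=(0.9, 0.999), eps=1e-8), matching the optimizer settings below.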

Throughout all training phases, we utilized the Adam optimizer with β values set to (0.9, 0.999) and ε set to 1e-8.

Command Range for Different Terrains

Domain Randomization & Noise