Learning Robust, Agile, Natural Legged Locomotion Skills in the Wild
Yikai Wang*, Zheyuan Jiang*, Jianyu Chen
*joint first authors
Tsinghua University
CoRL 2023 Workshop on Robot Learning in Athletics
Video: Overview
Video: Demonstration of Single Gait Pattern
Turn
Gallop
High speed.
Trot
Normal speed.
Pace
Low speed.
Video: Adaptive Transition/Integration of Gait Patterns
When the robot needs to overcome obstacles, new gait styles emerge.
When running is disrupted, balance is restored through short, high-frequency steps of the front legs.
Video: Ablation study
Without dataset mirroring
At higher speeds, the robot sometimes tilts to the right.
Without applied torques as privileged information during training
The robot cannot restore balance when galloping on the soccer pitch.
Video: Real World Experiments
Outdoor Experiments
Outdoor experiments on soccer pitch, grassland and rocks.
Indoor Experiments
Implementation Details
State and Action Spaces
The output action a_t is a 12-dim vector of target joint angles. The observation o_t is a 46-dim vector containing the 3-dim velocity command, 12-dim joint positions, 12-dim joint velocities, 3-dim projected gravity, 4-dim binary foot-contact states, and the 12-dim last actions. The privileged information x_t is a 233-dim vector that includes the linear and angular velocity in the base frame (6-dim), the friction coefficient, the measured heights of surrounding sample points (187-dim), the external torque applied to the base (2-dim), the stiffness and damping of each motor (24-dim), the mass added to the base, and the foot contact forces (4-dim). The encoder takes x_t as input, while the predictor takes the observation history (o_{t-50}, o_{t-49}, o_{t-48}, ..., o_t) as input.
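As a sanity check on the layout above, the following sketch tallies the observation components and stacks the 51-step history (o_{t-50}, ..., o_t) fed to the predictor. The component names are ours, not identifiers from the released code.

```python
import numpy as np

# Illustrative bookkeeping of the 46-dim observation o_t; names are ours.
OBS_COMPONENTS = {
    "velocity_command": 3,
    "joint_positions": 12,
    "joint_velocities": 12,
    "projected_gravity": 3,
    "foot_contact_states": 4,
    "last_actions": 12,
}
OBS_DIM = sum(OBS_COMPONENTS.values())  # 46

def stack_history(obs_buffer):
    """Concatenate the 51 most recent observations (o_{t-50}, ..., o_t)."""
    assert len(obs_buffer) == 51
    return np.concatenate(obs_buffer)   # shape: (51 * 46,)

history = stack_history([np.zeros(OBS_DIM) for _ in range(51)])
```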
Network Architecture
The teacher encoder is a 2-layer multi-layer perceptron (MLP) that takes the privileged information x_t (233-dim) as input and outputs the latent vector z_t (8-dim). The hidden layers have dimensions [256, 128].
The base policy is a 3-layer multi-layer perceptron (MLP) that takes the current observation o_t (46-dim) and the latent vector z_t as input and generates a 12-dimensional target joint angle output. The hidden layers have dimensions [512,256,128].
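The teacher encoder and base policy can be sketched in PyTorch with the layer sizes stated above; the activation choice (ELU) is our assumption, not taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the teacher encoder and base policy MLPs.
# Layer sizes follow the text; ELU activations are our assumption.
teacher_encoder = nn.Sequential(
    nn.Linear(233, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, 8),                  # latent z_t
)
base_policy = nn.Sequential(
    nn.Linear(46 + 8, 512), nn.ELU(),   # o_t concatenated with z_t
    nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, 12),                 # target joint angles a_t
)

x_t = torch.zeros(1, 233)
o_t = torch.zeros(1, 46)
z_t = teacher_encoder(x_t)
a_t = base_policy(torch.cat([o_t, z_t], dim=-1))
```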
The student predictor begins by encoding each observation from the recent steps into a 32-dimensional representation. Next, a one-dimensional convolutional neural network (1-D CNN) convolves these representations along the time dimension. The layer configurations (input channels, output channels, kernel size, stride) are [32, 32, 8, 4], [32, 32, 5, 1], and [32, 32, 5, 1] for the three layers. The flattened output from the CNN is then passed through a linear layer to predict \hat{z}_t.
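Putting the three pieces together, the student predictor might look like the sketch below. With a 51-step history, the stated kernel sizes and strides leave 3 time steps after the last convolution, so the flattened CNN output has 32 * 3 = 96 features; the activation choice is our assumption.

```python
import torch
import torch.nn as nn

# Sketch of the student predictor: per-step encoder, 1-D CNN over time,
# and a linear head producing \hat{z}_t. ELU activations are our assumption.
class StudentPredictor(nn.Module):
    def __init__(self, obs_dim=46, latent_dim=8):
        super().__init__()
        self.step_encoder = nn.Sequential(nn.Linear(obs_dim, 32), nn.ELU())
        self.cnn = nn.Sequential(
            nn.Conv1d(32, 32, kernel_size=8, stride=4), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=1), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=1), nn.ELU(),
        )
        # For a 51-step history the CNN output flattens to 32 * 3 = 96 features.
        self.head = nn.Linear(96, latent_dim)

    def forward(self, obs_history):            # (batch, time, obs_dim)
        h = self.step_encoder(obs_history)     # (batch, time, 32)
        h = self.cnn(h.transpose(1, 2))        # channels-first for Conv1d
        return self.head(h.flatten(1))         # (batch, latent_dim)

z_hat = StudentPredictor()(torch.zeros(4, 51, 46))
```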
The discriminator employs an MLP with hidden layers of size [1024, 512].
Learning Algorithm
We used Proximal Policy Optimization (PPO) as the reinforcement learning algorithm to train the base policy and the teacher encoder concurrently. Training ran for 50,000 iterations, with each iteration collecting a batch of 131,520 state transitions that were evenly divided into 4 mini-batches. To maintain a desired KL divergence of KL_desired = 0.01, the learning rate was tuned automatically using the adaptive scheme proposed by [35]. The PPO clip threshold was set to 0.2. For generalized advantage estimation [36], we set the discount factor γ to 0.99 and the parameter λ to 0.95.
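The adaptive learning-rate rule can be sketched as follows, in the style commonly used in PPO codebases that target a desired KL; the 2x thresholds, 1.5x step factor, and bounds are our assumptions, not values from the paper.

```python
# Sketch of an adaptive learning-rate rule tracking KL_desired = 0.01.
# Thresholds, step factor, and bounds are our assumptions.
def adapt_lr(lr, kl, kl_desired=0.01, lr_min=1e-5, lr_max=1e-2):
    if kl > kl_desired * 2.0:        # policy moved too far: shrink step size
        lr = max(lr_min, lr / 1.5)
    elif kl < kl_desired / 2.0:      # policy barely moved: grow step size
        lr = min(lr_max, lr * 1.5)
    return lr
```

After each PPO update, the measured KL between the old and new policy is fed back into this rule to set the next iteration's learning rate.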
To optimize the objective defined in Eq. (2), we trained the discriminator with supervised learning. The gradient penalty weight was set to w_gp = 10, the style reward weight to w_s = 0.65, and the task reward weight to w_g = 0.35.
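The reward mixing can be sketched as below. Squashing the discriminator score into a bounded style reward follows common AMP-style practice; the exact squashing form is our assumption, not taken from the paper.

```python
# Sketch of blending the discriminator-based style reward with the task
# reward, using the stated weights w_s = 0.65 and w_g = 0.35.
def total_reward(d_score, task_reward, w_s=0.65, w_g=0.35):
    # AMP-style squash of the discriminator score into [0, 1] (our assumption).
    style_reward = max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)
    return w_s * style_reward + w_g * task_reward
```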
The student predictor was trained with supervised learning, minimizing the mean squared error (MSE) loss between the latent vector z_t output by the teacher encoder and the predicted latent vector \hat{z}_t output by the student predictor.
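This distillation step reduces to a one-line loss; detaching the teacher latent so gradients flow only into the student is standard practice and our assumption here.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the distillation loss: regress the student's \hat{z}_t
# onto the (detached) teacher latent z_t with MSE.
def distillation_loss(z_teacher, z_student):
    return F.mse_loss(z_student, z_teacher.detach())

loss = distillation_loss(torch.ones(4, 8), torch.zeros(4, 8))
```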
Throughout all training phases, we used the Adam optimizer with β = (0.9, 0.999) and ε = 1e-8.