Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion

Laura Smith*, Yunhao Cao*, Sergey Levine

[arxiv] [code]

Overview

We present APRL, a policy regularization framework that modulates the robot’s exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. We demonstrate that APRL enables a quadrupedal robot to efficiently learn to walk in the real world, and that the resulting policy is substantially more capable of navigating challenging terrains and adapting to changes in dynamics.

Real-World Results

APRL vs. Restricted Training

In this video, we show a side-by-side comparison of the restricted method and APRL, both learning from scratch on flat foam tiled mats. The restricted method is unable to improve further after reaching a velocity of 0.44 m/s, while our method continues to improve and reaches a peak velocity of 0.62 m/s after 80k training steps.

APRL vs. Vanilla Exploration

In this video, we show a side-by-side comparison of vanilla RL run on the full action space (PD targets) and APRL, both learning from scratch on flat foam tiled mats.

Final Gait Comparison

In this video, we show a side-by-side comparison of the final gaits learned by the restricted method and APRL on two terrains: foam tiled mats and grass.

Experiments on Irregular Terrains

After initial training on flat foam tiled mats, we evaluate the learned policies in 4 new scenarios shown in the diagram above.

Additional Experimental Details

Pseudocode
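
As a rough illustration of the kind of loop the algorithm describes, the sketch below adaptively grows or shrinks the allowed action range around a nominal pose depending on whether training is progressing. The update rule, thresholds, and all names here (e.g. regularize, update_range, the grow/shrink factors) are illustrative assumptions, not the exact steps of Algorithm 1.

```python
import numpy as np

# Hypothetical sketch of APRL-style action-range regularization. The update
# rule, thresholds, and constants below are illustrative assumptions, not the
# exact procedure from Algorithm 1.

NOMINAL_POSE = np.zeros(12)        # nominal joint targets for a 12-DoF quadruped
FULL_RANGE = 1.0                   # full PD-target range allowed on the hardware

def regularize(action, action_range):
    """Clip the policy's PD targets to the currently allowed range."""
    return np.clip(action, NOMINAL_POSE - action_range, NOMINAL_POSE + action_range)

def update_range(action_range, improving, grow=1.05, shrink=0.9):
    """Grow the allowed range while training progresses, shrink it otherwise."""
    if improving:
        return min(action_range * grow, FULL_RANGE)
    return max(action_range * shrink, 0.05 * FULL_RANGE)

# Toy usage: random actions stand in for the policy, and a random "progress"
# flag stands in for the learning signal that would normally come from
# returns or critic statistics.
rng = np.random.default_rng(0)
action_range = 0.2 * FULL_RANGE    # start restricted around the nominal pose
for step in range(1000):
    action = rng.uniform(-FULL_RANGE, FULL_RANGE, size=12)
    safe_action = regularize(action, action_range)   # sent to the PD controller
    if step % 100 == 0:
        action_range = update_range(action_range, improving=rng.random() < 0.7)
print(f"final action range: {action_range:.2f}")
```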

Hyperparameters

Hyperparameters for the base algorithm, DroQ, and for the APRL regularization described in Algorithm 1.

Reward Function

Definition of the reward terms and their weights in the overall reward function. Terms labeled "reward" are added and terms labeled "penalty" are subtracted. Below we define the variables and functions referenced in the table above.
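
As a purely structural illustration of this weighted combination, the sketch below adds reward terms and subtracts penalty terms; the term names and weights are placeholders, not the entries of the table above.

```python
# Structural sketch only: term names and weights are hypothetical placeholders,
# not the actual entries of the reward table.
reward_terms = {"velocity_tracking_reward": 1.0, "upright_reward": 0.5}
penalty_terms = {"energy_penalty": 0.01, "orientation_penalty": 0.1}

def total_reward(values):
    """Weighted sum: 'reward' terms are added, 'penalty' terms are subtracted."""
    r = sum(w * values[name] for name, w in reward_terms.items())
    r -= sum(w * values[name] for name, w in penalty_terms.items())
    return r

example = {"velocity_tracking_reward": 0.8, "upright_reward": 1.0,
           "energy_penalty": 2.5, "orientation_penalty": 0.3}
print(total_reward(example))
```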

Variables

Functions

We define a helper function nqr (short for near-quadratic reward) as shown above. qdist is another function that takes values between 0 and 1, decaying exponentially up to a threshold given by the robot's action limits.
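
For illustration only, the snippet below sketches one plausible reading of these helpers: a reward that behaves quadratically near zero and saturates far from it, and a distance measure in [0, 1] that decays exponentially, with the distance thresholded at the action limit. The exact forms and constants used in the paper may differ.

```python
import numpy as np

# Illustrative sketches only: the exact definitions of nqr and qdist are given
# in the table above; the forms and constants below are assumptions.

def nqr(x, scale=1.0):
    """Near-quadratic reward: approximately 1 - (x/scale)^2 near zero,
    saturating toward 0 far from zero."""
    return float(np.exp(-(x / scale) ** 2))

def qdist(action, target, limit=1.0):
    """Value in [0, 1] that decays exponentially with |action - target|,
    with the distance capped at the robot's action limit."""
    d = np.minimum(np.abs(action - target), limit)
    return float(np.mean(np.exp(-d)))

print(nqr(0.1), qdist(np.zeros(12), np.full(12, 0.3)))
```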