Overview
We present APRL, a policy regularization framework that modulates the robot's exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. We demonstrate that APRL enables a quadrupedal robot to efficiently learn to walk in the real world, resulting in a policy that is substantially more capable of navigating challenging terrains and adapting to changes in dynamics.
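To make this idea concrete, the sketch below illustrates one way such exploration modulation could be implemented: constraining PD targets to a box around a nominal pose and adapting the width of that box as training proceeds. The class name, adaptation rule, and default values are illustrative assumptions and do not reflect the released APRL implementation.

```python
import numpy as np

class AdaptiveActionLimiter:
    """Illustrative sketch (not the released APRL code): restrict PD targets to a
    box around a nominal pose and adapt the box width over the course of training."""

    def __init__(self, nominal_pose, init_scale=0.1, min_scale=0.05, max_scale=1.0):
        self.nominal_pose = np.asarray(nominal_pose, dtype=np.float32)
        self.scale = init_scale          # current half-width of the allowed action box
        self.min_scale = min_scale
        self.max_scale = max_scale

    def clip_action(self, action):
        # Keep exploration focused by clipping PD targets near the nominal pose.
        low = self.nominal_pose - self.scale
        high = self.nominal_pose + self.scale
        return np.clip(action, low, high)

    def update(self, making_progress, rate=0.01):
        # Expand the exploration range while learning progresses; contract it
        # when progress stalls (e.g. after a change in dynamics).
        self.scale += rate if making_progress else -rate
        self.scale = float(np.clip(self.scale, self.min_scale, self.max_scale))
```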
Real-World Results
APRL vs. Restricted Training
In this video, we show a side-by-side comparison of the restricted method and APRL, both learning from scratch on flat foam tiled mats. The restricted method was not able to improve further after reaching a velocity of 0.44 m/s, while our method continued to improve and reached a peak velocity of 0.62 m/s after training for 80k steps.
APRL vs. Vanilla Exploration
In this video, we show a side-by-side comparison of vanilla RL run on the full action space (PD targets) and APRL, both learning from scratch on flat foam tiled mats.
Final Gait Comparison
In this video, we show a side-by-side comparison of the final gaits learned by the restricted method and by APRL on two terrains: foam tiled mats and grass.
Experiments on Irregular Terrains
After initial training on flat foam tiled mats, we evaluate the learned policies in 4 new scenarios shown in the diagram above.
Additional Experimental Details
Pseudocode
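Algorithm 1 in the paper is the authoritative description of the method. Purely to convey the overall structure of such a training loop, the Python-style sketch below combines an off-policy agent update with an adaptively limited action space; all function names, signatures, and the progress criterion are assumptions made for illustration.

```python
def train(env, agent, limiter, num_steps=80_000, eval_interval=1_000):
    """Python-style sketch of an APRL-like loop; see Algorithm 1 for the actual method."""
    obs = env.reset()
    returns = []
    for step in range(num_steps):
        # Sample from the policy, then regularize exploration by limiting the action.
        action = limiter.clip_action(agent.sample_action(obs))
        next_obs, reward, done, info = env.step(action)
        agent.replay_buffer.add(obs, action, reward, next_obs, done)
        agent.update()                       # e.g. one or more DroQ gradient updates
        obs = env.reset() if done else next_obs

        if (step + 1) % eval_interval == 0:
            returns.append(agent.evaluate(env))
            # Hypothetical criterion: expand the exploration range while returns improve.
            improving = len(returns) < 2 or returns[-1] > returns[-2]
            limiter.update(making_progress=improving)
```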
Hyperparameters
Hyperparameters for the base algorithm DroQ and for the regularization applied by APRL, as described in Algorithm 1.
Reward Function
Definition of the reward terms and their weights in the overall reward function; the variables and functions referenced in the table are defined in the subsections below. Terms labeled "reward" are added to the total and terms labeled "penalty" are subtracted, as sketched below.
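The snippet below only illustrates this sign convention; the term names and weights used here are placeholders, not values from the table.

```python
def total_reward(terms, weights):
    """Weighted "reward" terms are added; weighted "penalty" terms are subtracted."""
    total = 0.0
    for name, value in terms.items():
        sign = -1.0 if "penalty" in name else 1.0
        total += sign * weights[name] * value
    return total

# Hypothetical example: 1.0 * 0.5 - 0.1 * 0.2 = 0.48
print(total_reward({"velocity_reward": 0.5, "energy_penalty": 0.2},
                   {"velocity_reward": 1.0, "energy_penalty": 0.1}))
```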
Variables
Functions
We define a helper function nqr (near quadratic reward) as shown above. qdist is another function with values between 0 and 1 that decays exponentially up to a threshold given by the robot's action limits.
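The exact expressions are those given above; purely as an illustration of the verbal description, the sketch below uses one plausible functional form for each helper, and the specific shapes and constants are assumptions.

```python
import numpy as np

def nqr(x, scale=1.0):
    # Illustrative "near quadratic reward": approximately 1 - (x / scale)**2 near
    # zero, saturating smoothly for large |x| so the reward stays bounded.
    return float(np.exp(-(x / scale) ** 2))

def qdist(q, q_limit):
    # Illustrative distance measure in (0, 1]: decays exponentially with the
    # magnitude of q, with the decay capped at the robot's action limits.
    d = np.minimum(np.abs(np.asarray(q)), np.asarray(q_limit))
    return float(np.exp(-np.sum(d)))
```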