We present APRL, a policy regularization framework that modulates the robot’s exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. We demonstrate that APRL enables a quadrupedal robot to efficiently learn to walk in the real world, resulting in a policy that is substantially more capable of navigating challenging terrains and of adapting to changes in dynamics.
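To make the high-level idea concrete, below is a minimal, illustrative sketch of one way exploration could be modulated over training: an allowed band of PD targets around a nominal pose that is widened when learning progresses smoothly and narrowed otherwise. The class name, the progress_ok signal, and all constants are hypothetical and chosen for illustration; this is not the actual APRL update rule.

```python
import numpy as np

# Hypothetical sketch (not the actual APRL rule): modulate exploration by
# adaptively widening or narrowing the band of PD targets the policy may
# command around a nominal joint configuration.

class AdaptiveActionLimit:
    def __init__(self, nominal_pose, init_range=0.1, max_range=0.6,
                 grow=1.05, shrink=0.95):
        self.nominal = np.asarray(nominal_pose, dtype=np.float64)
        self.range = init_range      # current half-width of the allowed band
        self.max_range = max_range
        self.grow = grow
        self.shrink = shrink

    def update(self, progress_ok: bool):
        # Hypothetical signal: widen the band when training is progressing,
        # shrink it when it is not, keeping exploration focused but flexible.
        factor = self.grow if progress_ok else self.shrink
        self.range = float(np.clip(self.range * factor, 0.05, self.max_range))

    def clip_action(self, pd_targets):
        # Project the raw policy output into the currently allowed band.
        lo = self.nominal - self.range
        hi = self.nominal + self.range
        return np.clip(pd_targets, lo, hi)
```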
In this video, we show a side-by-side comparison of the restricted method and APRL, both learning from scratch on flat foam tiled mats. We see that the restricted method was unable to improve further after reaching a velocity of 0.44 m/s, while our method continued to improve and reached a peak velocity of 0.62 m/s after training for 80k steps.
In this video, we show a side-by-side comparison of vanilla RL on the full action space (PD targets) and APRL, both learning from scratch on flat foam tiled mats.
In this video, we show a side-by-side comparison of the final gaits learned by the restricted method and APRL on two terrains: foam tiled mats and grass.
After initial training on flat foam tiled mats, we evaluate the learned policies in 4 new scenarios shown in the diagram above.
We use a helper function nqr (short for near-quadratic reward), defined above. qdist is another helper function that takes values between 0 and 1, decaying exponentially up to a threshold given by the robot's action limits.
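The exact definitions of nqr and qdist are given in the equation above. Purely as an illustration of the shapes described in the text, here is a hedged sketch assuming nqr is a bounded, approximately quadratic function of the tracking error and qdist decays exponentially with the deviation magnitude until the action-limit threshold; all functional forms and constants below are assumptions, not the actual implementation.

```python
import numpy as np

# Illustrative sketch only: the exact definitions of nqr and qdist are given in
# the equation above. The functional forms and constants below are assumptions
# chosen to match the verbal description, not the actual implementation.

def nqr(x, target, scale=1.0):
    # Assumed "near-quadratic reward": bounded in (0, 1], maximal at the target,
    # and approximately 1 - err**2 for small errors (quadratic near the optimum).
    err = (x - target) / scale
    return np.exp(-err ** 2)

def qdist(a, a_limit, decay=3.0):
    # Assumed form: a value in [0, 1] that decays exponentially with the size of
    # the deviation |a|, with the decay capped once |a| reaches the robot's
    # action limit a_limit (the threshold mentioned in the text).
    d = np.minimum(np.abs(a), a_limit)
    return np.exp(-decay * d / a_limit)
```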