Real-world Experiments. Fig. 4 reports the real-world experimental results: the reward learning curves and the percentage of safe recovery policy activations for the efficient gait, catwalk, and two-leg balance tasks. We observe that our algorithm improves the reward while rarely triggering the safe recovery policy over the course of learning.
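To make the two plotted quantities concrete, the sketch below shows one way to log them during a rollout: the episode reward and the fraction of control steps handed to the safe recovery policy. This is a minimal illustration assuming a gym-style environment interface and callable policies; the names is_unsafe, learner_policy, and recovery_policy are placeholders, not the actual training code.

```python
import numpy as np

def run_episode(env, learner_policy, recovery_policy, is_unsafe, max_steps=1000):
    """Roll out one episode, switching to the recovery policy whenever the
    safety trigger fires; return total reward and recovery-step fraction."""
    obs = env.reset()
    total_reward, recovery_steps = 0.0, 0
    for t in range(max_steps):
        if is_unsafe(obs):                 # hypothetical safety-trigger predicate
            action = recovery_policy(obs)  # hand control to the safe recovery policy
            recovery_steps += 1
        else:
            action = learner_policy(obs)   # normal learner action
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward, recovery_steps / (t + 1)

def evaluate(env, learner_policy, recovery_policy, is_unsafe, n_episodes=10):
    """Average episode reward and percentage of steps spent in recovery,
    i.e. the two quantities tracked in the learning curves."""
    stats = [run_episode(env, learner_policy, recovery_policy, is_unsafe)
             for _ in range(n_episodes)]
    rewards, recovery_fracs = zip(*stats)
    return np.mean(rewards), 100.0 * np.mean(recovery_fracs)
```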
In addition, the accompanying videos show the entire learning process (the interplay between the learner policy and the safe recovery policy, and the reset to the initial position when an episode ends) for the tasks considered. First, in the efficient gait task, the robot learns to use a lower stepping frequency and consumes 34% less energy than the nominal trotting gait. Second, in the catwalk task, the distance between the two sides of the legs is 0.09 m, which is 40.9% smaller than the nominal distance. Third, in the two-leg balance task, the robot maintains balance by jumping up to four times on two legs, compared with a single jump for the policy pre-trained in simulation. Without the safe recovery policy, learning such locomotion skills would risk damaging the robot and require manually repositioning it after falls.