Behaviors of Centroidal Model vs Full Model
We compare the behaviors of the simplified model and the full model in the figure below. While the exact trajectories differ, the two models exhibit similar general behavior.
Linear velocity and angular velocity of the full-order model and the centroidal model of Laikago with a trotting policy. The general behaviors of the two models are close.
Comparisons to the Centroidal PD Controller
We plot Mf − a, the error term of the QP problem, for both the learned policy and the centroidal PD controller with a trotting gait. Due to underactuation, this error is unavoidable, but the learned policy incurs a much smaller error than the centroidal PD controller.
Comparison of the difference between the desired body acceleration and the actual body acceleration. Due to the underactuation of the system, a non-zero difference is unavoidable. We plot the error of the learned policy on the centroidal model and the full-order model, and the error of a centroidal PD controller on the full-order model. The learned policy produces acceleration commands with a much smaller discrepancy than a naive centroidal PD controller.
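For reference, the sketch below shows one way this acceleration-tracking residual can be evaluated. It is a minimal sketch, assuming M is the centroidal dynamics matrix mapping stacked contact forces to body acceleration, f is the force vector returned by the QP, and a_des is the desired body acceleration commanded by the policy; the unconstrained least-squares solve is only a stand-in for the full QP, which additionally enforces friction-cone and contact constraints.

```python
import numpy as np

def acceleration_tracking_error(M, f, a_des):
    """Norm of the residual M f - a_des between the acceleration achieved by
    the optimized contact forces and the desired body acceleration."""
    return np.linalg.norm(M @ f - a_des)

def solve_forces_least_squares(M, a_des):
    """Unconstrained least-squares stand-in for the QP; the actual controller
    additionally enforces friction-cone and contact constraints."""
    f, *_ = np.linalg.lstsq(M, a_des, rcond=None)
    return f
```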
Details on Stepping Stone Environment
Illustration of the stepping stone problem.
(a) The system receives a local map of the terrain around the robot, illustrated by the grid of red and blue points. Blue points indicate regions suitable for foot placement, while red points indicate regions that are not. The edges of the stepping stones are also deemed unsuitable to encourage safety. The foot placement strategy chooses the feasible point closest to the default foot placement given by the Raibert heuristic.
(b) Footstep pattern of Laikago trotting across a stepping stone terrain. Different colors indicate stepping locations of different feet.
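A minimal sketch of the foot placement selection described in panel (a), assuming the local map is given as a set of candidate points with a feasibility mask. The simplified Raibert-heuristic form and the feedback gain below are illustrative placeholders rather than the exact parameters used in our controller.

```python
import numpy as np

def raibert_target(hip_pos_xy, body_vel_xy, desired_vel_xy, stance_duration, k=0.03):
    """Nominal (default) foot placement from a simplified Raibert heuristic."""
    return (hip_pos_xy
            + 0.5 * stance_duration * body_vel_xy
            + k * (body_vel_xy - desired_vel_xy))

def select_foothold(candidates_xy, feasible_mask, target_xy):
    """Choose the feasible ("blue") candidate point closest to the nominal target.

    candidates_xy : (N, 2) grid of terrain sample points around the robot.
    feasible_mask : (N,) boolean, True for points suitable for foot placement.
    """
    feasible = candidates_xy[feasible_mask]
    if len(feasible) == 0:
        # No safe point in the local map; fall back to the nominal target.
        return target_xy
    dists = np.linalg.norm(feasible - target_xy, axis=1)
    return feasible[np.argmin(dists)]
```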
Details on Test Scenarios
We test the performance of different policies under a set of test scenarios. For each scenario we record statistics over 10 runs, where each run executes a policy for 10 seconds in simulation. Variations between runs include different initial body poses and different random seeds for terrain generation.
Body Mass Perturbation: We perturb the mass of the body by ±5 kg, around 40% of the default body mass, and record the normalized reward of the policies under these perturbations.
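As an illustration, the snippet below shows how such a perturbation could be applied in a PyBullet-style simulation; the function name and the assumption that the trunk is the base link (index -1) are ours and not part of any released code.

```python
import pybullet as p

def perturb_body_mass(robot_id, delta_kg, trunk_link=-1):
    """Shift the trunk mass by delta_kg (e.g. +5.0 or -5.0) for a robustness test."""
    default_mass = p.getDynamicsInfo(robot_id, trunk_link)[0]  # mass is the first entry
    p.changeDynamics(robot_id, trunk_link, mass=default_mass + delta_kg)
    return default_mass + delta_kg
```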
Height Variation: We record the normalized reward of the policies walking over height perturbations, even though they have not experienced variation in terrain height during training. On the continuous terrain, we add a large periodic height change, which we call the wave field. On the stepping stone terrain, we add a medium level of random height variation.
The wave field on the continuous terrain is created by adding sinusoidal height variations along both the x and y directions. The wave pattern repeats every 2.5 m and the peak-to-valley distance is 0.7 m, compared to the robot's normal standing height of 0.4 m. Since the stepping stone terrain is more challenging than the continuous terrain, we use a milder random height field instead. We generate it by passing white-noise height changes through a second-order low-pass filter, which yields a medium-level random one-dimensional signal; we apply this signal along both the x and y directions with different random seeds. The resulting terrain has slopes of up to 20 degrees. We observe that more challenging height variations cause the policy to fail.
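The sketch below gives one possible construction of the two height fields under the stated parameters (2.5 m period, 0.7 m peak-to-valley range, second-order low-pass-filtered white noise); the exact functional forms, filter cutoff, and noise scaling used in our experiments may differ.

```python
import numpy as np
from scipy import signal

def wave_field(xs, ys, period=2.5, peak_to_valley=0.7):
    """Sinusoidal wave field: summed sine waves along x and y.

    Two summed sinusoids of amplitude A span a 4*A range, so A = 0.7 / 4.
    """
    amp = peak_to_valley / 4.0
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    return amp * (np.sin(2 * np.pi * gx / period) + np.sin(2 * np.pi * gy / period))

def filtered_noise_1d(n, cutoff=0.05, scale=0.1, seed=0):
    """White noise passed through a second-order low-pass filter.

    cutoff is a normalized frequency (fraction of the Nyquist rate).
    """
    rng = np.random.default_rng(seed)
    b, a = signal.butter(2, cutoff)
    return scale * signal.filtfilt(b, a, rng.standard_normal(n))

def random_height_field(nx, ny, **noise_kwargs):
    """Random terrain: independent filtered-noise profiles applied along x and y."""
    hx = filtered_noise_1d(nx, seed=0, **noise_kwargs)
    hy = filtered_noise_1d(ny, seed=1, **noise_kwargs)
    return hx[:, None] + hy[None, :]
```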
Energy Comparison: We record the time average of the sum of squared torques over all motors.
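Concretely, this metric can be computed as follows, assuming the applied torques are logged as a (time steps x motors) array:

```python
import numpy as np

def torque_energy_metric(torques):
    """Time average of the sum of squared torques over all motors.

    torques: array of shape (num_steps, num_motors).
    """
    torques = np.asarray(torques)
    return float(np.mean(np.sum(torques ** 2, axis=1)))
```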
Motion Quality: Policies trained with joint RL exhibit motions in which the joints move at high frequency, which is clearly unsuitable for hardware. We further record the z trajectory of the front left foot over 10 seconds and compute its energy spectral density in Fourier space. The fraction of total energy in frequencies above 10 Hz is 0.13 for GLiDE and 0.68 for the joint RL policies.
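A minimal sketch of this spectral analysis, assuming the foot height is sampled at a fixed rate; removing the mean before the FFT, so that the nominal foot height does not dominate the total energy, is our choice of normalization.

```python
import numpy as np

def high_frequency_energy_fraction(z, dt, cutoff_hz=10.0):
    """Fraction of the spectral energy of z that lies above cutoff_hz.

    z  : foot-height samples (e.g. 10 s of the front-left foot z trajectory).
    dt : sampling period in seconds.
    """
    z = np.asarray(z) - np.mean(z)            # drop the DC offset (nominal foot height)
    energy = np.abs(np.fft.rfft(z)) ** 2      # energy spectral density up to a constant
    freqs = np.fft.rfftfreq(len(z), d=dt)
    total = energy.sum()
    return float(energy[freqs > cutoff_hz].sum() / total) if total > 0 else 0.0
```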
Learning Curves
We plot the learning curves for various tasks. For flat terrain, balance beam traversal, and two-legged balancing, training takes 500 iterations, around 1-2 hours, while for the stepping stone task we train the policies for up to 1000 iterations, taking around 4 hours.