Generalized Advantage Estimation

Stages of learning for selected optimization runs on three different tasks: bipedal locomotion, quadrupedal locomotion, and bipedal standing:

Here's a different policy for standing up, in slow motion: