Safe Reinforcement Learning for Legged Locomotion

Tsung-Yen Yang, Tingnan Zhang, Linda Luu, Sehoon Ha, Jie Tan, Wenhao Yu

TY is with Princeton University {ty3}@princeton.edu

TY, TZ, LL, SH, JT and WY are with Google Research {jimmyyang, tingnan, luulinda, sehoonha, jietan, magicmelon}@google.com

SH is with Georgia Institute of Technology {sehoonha}@gatech.edu

TL;DR: We propose a safe reinforcement learning algorithm for legged locomotion. Our goal is to learn locomotion skills autonomously without falling during the entire learning process in the real world.

Paper

Abstract

Designing control policies for legged locomotion is complex due to the underactuated and discontinuous dynamics of the system. Model-free reinforcement learning provides promising tools to tackle this challenge. However, a major bottleneck of applying model-free reinforcement learning in the real world is safety. In this paper, we propose a safe reinforcement learning framework that switches between a safe recovery policy, which prevents the robot from entering unsafe states, and a learner policy, which is optimized to complete the desired locomotion task. The safe recovery policy takes over control when the learner policy violates safety constraints, and hands control back when there are no future safety violations. We design the safe recovery policy so that it ensures the safety of legged locomotion while minimally intervening in the learning process. Furthermore, we theoretically analyze the proposed framework and provide an upper bound on the task performance. We verify the proposed framework on four locomotion tasks with a simulated and a real quadrupedal robot: efficient gait, catwalk, two-leg balance, and pacing. On average, our method achieves 48.6% fewer falls and comparable or better rewards than the baseline methods in simulation. When deployed on the real-world quadrupedal robot, our training pipeline enables a 34% improvement in energy efficiency for the efficient gait, a 40.9% narrower foot placement for the catwalk, and twice the jumping duration for the two-leg balance. Our method incurs fewer than five falls over 115 minutes of hardware time.

Slides

final_video_submission_iros_compressed.mp4

Algorithm

We formulate the problem of safe locomotion learning in the context of safe reinforcement learning (RL). Our learning framework adopts a two-policy structure: a safe recovery policy that recovers the robot from near-unsafe states, and a learner policy that is optimized to perform the desired control task. Our safe learning framework switches between the safe recovery policy and the learner policy to prevent the learning agent from violating safety constraints (e.g., falling).

More specifically, we first define a safety trigger set that includes states where the robot is close to violating safety constraints but can still be saved by a safe recovery policy. When the learner policy takes the robot into the safety trigger set, we switch to the safe recovery policy, which drives the robot back to safe states. We then determine when to switch back to the learner policy by leveraging an approximate dynamics model of the robot (e.g., a centroidal dynamics model for legged robots) to roll out the planned future robot trajectory: if the predicted future states are all safe, we hand control back to the learner policy; otherwise, we keep using the safe recovery policy. This approach allows us to ensure safety under complex system dynamics without resorting to black-box components such as neural networks, whose safety is hard to guarantee due to distribution shift. Fig. 1 (a) illustrates the idea of the algorithm and Fig. 1 (b) shows the state diagram.
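For concreteness, the Python sketch below illustrates this switching logic. The trigger thresholds, rollout horizon, and the state, policy, and dynamics interfaces are assumptions made for illustration only, not the exact implementation used in our experiments.

```python
# Minimal sketch of the policy-switching logic described above, assuming
# dictionary-like states and callable policies. The trigger thresholds,
# rollout horizon, and dynamics interface are illustrative assumptions.

PLAN_HORIZON = 20  # number of future steps to roll out (assumed value)


def in_safety_trigger_set(state):
    """Hypothetical trigger check: roll/pitch approaching the fall limits."""
    return abs(state["roll"]) > 0.4 or abs(state["pitch"]) > 0.4  # rad, assumed


def future_rollout_is_safe(state, learner_policy, approx_dynamics):
    """Roll out the learner policy with an approximate (e.g., centroidal)
    dynamics model and require every predicted state to stay outside the
    safety trigger set."""
    s = state
    for _ in range(PLAN_HORIZON):
        s = approx_dynamics(s, learner_policy(s))
        if in_safety_trigger_set(s):
            return False
    return True


def select_action(state, using_recovery, learner_policy, recovery_policy,
                  approx_dynamics):
    """One control step: switch to the recovery policy inside the trigger set,
    and hand control back once the predicted future states are all safe."""
    if not using_recovery and in_safety_trigger_set(state):
        using_recovery = True
    elif using_recovery and future_rollout_is_safe(state, learner_policy,
                                                   approx_dynamics):
        using_recovery = False
    active_policy = recovery_policy if using_recovery else learner_policy
    return active_policy(state), using_recovery
```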

Fig. 1 (a). We trigger the safe recovery policy when the agent reaches the safety trigger set (C_{tri}). This gives the agent the freedom to explore the environment. After triggering the safe recovery policy, we check whether the predicted future states induced by the learner policy are all safe. If they are, we hand control back to the learner; otherwise, we keep using the safe recovery policy.

Fig. 1 (b). The state diagram of the proposed approach.

Legged Locomotion Tasks

We consider learning the following legged locomotion tasks (see Fig. 2); a sketch of possible reward terms follows the list:

(1) Efficient Gait: The robot learns how to walk with low energy consumption. The robot is rewarded for consuming less energy.

(2) Catwalk: The robot learns a catwalk gait pattern, in which the left and right feet are close to each other. This is challenging because narrowing the support polygon makes the robot less stable.

(3) Two-leg Balance: The robot learns a two-leg balance policy, in which the front-right and rear-left feet are in stance, and the other two are lifted. The robot can easily fall without delicate balance control because the contact polygon degenerates into a line segment.

(4) Pacing: The robot learns to produce a desired stepping frequency and swing ratio to perform a pacing behavior at different desired speeds.
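To make the task objectives concrete, the sketch below gives one possible form of the reward for each task. These reward terms and weights are illustrative assumptions, not the exact rewards used in our experiments.

```python
# Hypothetical reward terms for the four tasks. The exact reward functions
# and weights used in the paper are not reproduced here; everything below is
# an illustrative assumption.
import numpy as np


def efficient_gait_reward(motor_torques, joint_velocities, forward_velocity):
    """Reward forward progress while penalizing mechanical power (energy use)."""
    power = np.sum(np.abs(np.asarray(motor_torques) * np.asarray(joint_velocities)))
    return forward_velocity - 0.01 * power  # energy weight is an assumption


def catwalk_reward(left_foot_y, right_foot_y, forward_velocity):
    """Encourage the left and right feet to move toward the body centerline."""
    feet_distance = abs(left_foot_y - right_foot_y)
    return forward_velocity - 1.0 * feet_distance  # weight is an assumption


def two_leg_balance_reward(base_height, target_height, upright_cosine):
    """Reward staying upright at a target height on the two stance legs."""
    return upright_cosine - abs(base_height - target_height)


def pacing_reward(stepping_frequency, swing_ratio, target_frequency, target_ratio):
    """Track the desired stepping frequency and swing ratio."""
    return -((stepping_frequency - target_frequency) ** 2
             + (swing_ratio - target_ratio) ** 2)
```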


Fig. 2. The simulation and real-world tasks. First row: catwalk (left); two-leg balance (middle); pacing (right) in simulation. Second row: efficient gait in the real world. Third row: catwalk in the real world. Fourth row: two-leg balance in the real world.

Results

Simulation Results. The total number of falls versus the final reward is shown for all tested algorithms and tasks in Fig. 3. The learning performance of the oracle baseline, TRPO, indicates the reward and constraint performance when the safety constraint is ignored. Ideally, we want an algorithm to land in the top-left corner (more reward and fewer falls). Overall, we find that our algorithm improves the reward while achieving the fewest falls in all tasks (i.e., it lies in the top-left corner).

Fig. 3. We report the total number of falls versus the final reward for each tested algorithm and task pair over five runs. We observe that the proposed approach achieves the fewest falls while having comparable reward performance (i.e., it lies in the top-left corner).

Fig. 4. The reward and the percentage of safe recovery policy usage for our algorithm in the real world.

Real-world Experiments. Fig. 4 reports the real-world experiment results: the reward learning curves and the percentage of safe recovery policy usage for the efficient gait, catwalk, and two-leg balance tasks. We observe that our algorithm is able to improve the reward while avoiding triggering the safe recovery policy over the course of learning.

In addition, the following videos show the entire learning process (the interplay between the learner policy and the safe recovery policy, and the reset to the initial position when an episode ends) for the tasks considered. First, in the efficient gait task, the robot learns to use a lower stepping frequency and consumes 34% less energy than the nominal trotting gait. Second, in the catwalk task, the distance between the left and right feet is 0.09 m, which is 40.9% smaller than the nominal distance. Third, in the two-leg balance task, the robot can maintain balance for up to four consecutive jumps on two legs, compared to one jump for the policy pre-trained in simulation. Without the safe recovery policy, learning such locomotion skills would damage the robot and require manually repositioning it after falls.

learning_eg_final.mp4

Video 1: Time-lapse video of training an efficient gait.

learning_cw_final.mp4

Video 2: Time-lapse video of training the catwalk.

learning_tlb_final.mp4

Video 3: Time-lapse video of training the two-leg balance.

Video 4: Final learned catwalk.

Video 5: Final learned two-leg balance.


In summary, we observe not a single fall or manual reset during the entire learning process for the efficient gait (45 minutes of real-world data collection, excluding the automatic position resets and battery replacement) and catwalk (26 minutes) tasks, and fewer than five falls for the two-leg balance task (28 minutes). The safe recovery policy is triggered only when needed, allowing the robot the freedom to explore the environment. Our results suggest that learning legged locomotion skills autonomously in the real world is possible.

Additional Videos

Baseline Approach: Recovery RL, Thananjeyan et al., 2021: Recovery RL uses a safety critic pre-trained with a policy that maximizes the probability of falls; in other words, the safety critic considers the worst-case situation. Because the robot can easily fall during learning, this leads to an overly conservative learning strategy in the real world and slows down learning. The video shows that the safe recovery policy is triggered frequently in the efficient gait task (notice the upper motors tilted outward).
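For reference, the sketch below shows a Recovery-RL-style trigger rule based on a learned safety critic; the threshold and interfaces are assumptions for illustration, not the exact implementation of Thananjeyan et al. In contrast, our approach uses the model-based rollout check described above to decide when to return control to the learner.

```python
# Illustrative sketch of a Recovery-RL-style trigger (interfaces assumed).
# A learned safety critic Q_risk(s, a) estimates the probability of a future
# safety violation; the recovery policy takes over whenever the learner's
# proposed action looks too risky.

RISK_THRESHOLD = 0.3  # assumed value; lower thresholds are more conservative


def recovery_rl_action(state, learner_policy, recovery_policy, safety_critic):
    """Return the recovery action whenever the learner's action is deemed risky."""
    task_action = learner_policy(state)
    risk = safety_critic(state, task_action)  # estimated violation probability
    if risk > RISK_THRESHOLD:
        return recovery_policy(state)  # a conservative critic intervenes often
    return task_action
```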

Baseline Approach: Recovery RL, Thananjeyan et al., 2021: The video shows that the safe recovery policy is triggered frequently in the two-leg balance task. We can see that the robot wants to jump, but the safety critic is too conservative.

Safe Recovery Policy. The safe recovery policy commands a wide foot placement (63% wider than the nominal foot positions) when the safety trigger is activated. This helps the robot stabilize quickly.
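A minimal sketch of such a wide-stance recovery behavior is shown below; apart from the 63% widening mentioned above, the numbers and interfaces are assumptions.

```python
# Minimal sketch of the wide-stance recovery behavior. The nominal foot
# positions below are assumed values; only the 63% widening comes from the
# text above.
import numpy as np

NOMINAL_FOOT_POSITIONS = np.array([  # (x, y, z) in the body frame, assumed
    [ 0.17,  0.13, -0.30],   # front-left
    [ 0.17, -0.13, -0.30],   # front-right
    [-0.20,  0.13, -0.30],   # rear-left
    [-0.20, -0.13, -0.30],   # rear-right
])


def recovery_foot_targets(widen_ratio=1.63):
    """Widen the lateral (y) foot positions to enlarge the support polygon."""
    targets = NOMINAL_FOOT_POSITIONS.copy()
    targets[:, 1] *= widen_ratio
    return targets  # tracked by a low-level controller (e.g., IK + joint PD)
```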


Failure Cases. We also report a failure case when using a loose safety trigger set in the two-leg balance task. The safe recovery policy is triggered too late, preventing the feet from reaching the appropriate (i.e., wide-open) positions.


The Policy Trained in Simulation for the Two-leg Balance Task. We fine-tune the policy learned in simulation. The following videos show that the pre-trained policy cannot jump well and hence triggers the safe recovery policy. By fine-tuning the policy in the real world, we are able to improve its performance.

Policy pre-trained in simulation.

Policy after fine-tuning in the real world.

The Autonomous Training Process in the Real World. We train the locomotion policy autonomously in the real world, either from scratch or from a policy pre-trained in simulation. The following video shows the learner policy in control, followed by the safe recovery policy taking over to ensure safety, and finally the robot walking back to the initial position to restart the whole process.

Additional Hardware Details

Summary of Training Time. We report two numbers, the data collection time (DC) and the total hardware time (TH), to quantify the training duration of the algorithm. The data collection time records the time spent collecting training samples, excluding the automatic position reset (i.e., walking back to the initial position) and the halts for battery replacement (each battery lasts about 30 minutes). The total hardware time includes the data collection time and the automatic position reset time, excluding the halts for battery replacement. For the efficient gait task, DC is 45 mins and TH is 94 mins; for the catwalk task, DC is 29 mins and TH is 87 mins; for the two-leg balance task, DC is 28 mins and TH is 115 mins.

Proof of Theorem 5.1

Contact

Feel free to contact us (Jimmy (Tsung-Yen) Yang, yangtsungyen@gmail.com) if you have any questions about the project.