Guided Learning of Robust Hurdling Policies with
Curricular Trajectory Optimization







Combining the benefits of Curricular Trajectory Optimization and Reinforcement Learning

Abstract

In this work, we combine analytical and learning-based techniques to help researchers solve challenging robot locomotion problems. Specifically, we explore the combination of curricular trajectory optimization (CTO) and deep reinforcement learning (RL) for quadruped hurdling tasks. Our goal is to provide a framework for engineers and researchers that combines the generalization capabilities of feedback policies, such as neural networks, with the efficiency of trajectory optimization. We use the trajectories generated by optimization as an imitation-learning supervisor that provides an additional gradient signal to the RL algorithm. To generate these trajectories, we introduce a curricular optimization algorithm in which a discrete set of increasingly difficult tasks is solved via black-box trajectory optimization, with each task initialized from the parameters obtained on simpler tasks. We evaluate our approach on robot hurdling tasks in which the robot must jump over an obstacle of varying size and location. Our results show that we achieve greater sample efficiency than state-of-the-art reinforcement learning when solving the task, and significantly better performance than the original trajectories.
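
To make the curriculum concrete, the sketch below shows one way the curricular trajectory optimization stage could be implemented with CMA-ES (via the `cma` package), warm-starting each harder task from the solution of the previous, easier one. The obstacle-height curriculum, the `rollout_cost` simulator hook, and all hyperparameters are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np
import cma  # pip install cma

# Hypothetical curriculum: obstacle heights from easy to hard (meters).
CURRICULUM = [0.10, 0.20, 0.30, 0.40]

def rollout_cost(params, obstacle_height):
    """Placeholder: simulate the trajectory encoded by `params` against an
    obstacle of the given height and return a scalar cost (e.g. tracking
    error plus a penalty for hitting the hurdle)."""
    raise NotImplementedError

def curricular_trajectory_optimization(n_params=64, sigma0=0.3, iters=200):
    """Solve increasingly difficult tasks, warm-starting each CMA-ES run
    from the best parameters found on the previous (easier) task."""
    x0 = np.zeros(n_params)          # initial trajectory parameters
    trajectory_buffer = []           # reference trajectories for the RL stage
    for height in CURRICULUM:
        es = cma.CMAEvolutionStrategy(x0, sigma0, {"maxiter": iters})
        while not es.stop():
            candidates = es.ask()
            costs = [rollout_cost(np.asarray(x), height) for x in candidates]
            es.tell(candidates, costs)
        x0 = np.asarray(es.result.xbest)   # warm start for the next, harder task
        trajectory_buffer.append((height, x0.copy()))
    return trajectory_buffer
```

Warm-starting is the key property here: the best solution for one obstacle height seeds the search for the next, so the optimizer never has to solve the hardest task from scratch.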

Schematic of CTO-RL

Left: CMA-ES is used to generate reference trajectories for M different environments. These trajectories are stored in the trajectory buffer.

Right: Reference trajectories are sampled from the buffer and used for the imitation reward. The RL agent uses the combined imitation and task rewards to update its parameters and find the optimal policy.
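
The sketch below illustrates how the combined reward on the right side of the schematic could look: sample a stored reference trajectory, compare the robot's state to the reference at the current time step, and blend a DeepMimic-style imitation term with the task reward. The exponential tracking form and the weights `w_imit`/`w_task` are assumptions for illustration; the paper's exact reward terms may differ.

```python
import numpy as np

def imitation_reward(state, ref_state, scale=2.0):
    """Exponential tracking term (assumed form): close to 1 when the robot's
    joint state matches the reference, decaying with the squared error."""
    error = np.sum((state - ref_state) ** 2)
    return np.exp(-scale * error)

def combined_reward(state, ref_state, task_reward, w_imit=0.5, w_task=0.5):
    """Blend the imitation term with the environment's task reward.
    The weights are illustrative; they could also be annealed over training."""
    return w_imit * imitation_reward(state, ref_state) + w_task * task_reward

# Usage sketch inside an RL rollout (buffer entries come from the CTO stage):
# height, ref_traj = trajectory_buffer[env_id]
# r = combined_reward(obs_joint_state, ref_traj[t], env_task_reward)
```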

[Video: LKOverlay_GDP_NoRSL_0.125x.mp4]

Reference Motion vs. Trained Policy

Here we see the reference motions generated by curricular trajectory optimization (purple) alongside the final trained policy (orange). Even though the reference motions are suboptimal, they provide an important gradient signal during training.

[Video: LK_ours_hurdle_max_height_0.54.mp4]

Out-of-distribution: Obstacle size

When trained for longer (500M environment steps), the agent generalizes out of distribution on the hurdle task: here it jumps over an obstacle 35% larger than the largest obstacle seen during training.

[Video: LK_ours_hurdle_sens2.4.mp4]

Out-of-distribution: Sensing range

When trained for longer (500M environment steps), the agent generalizes out of distribution on the hurdle task: here it jumps over obstacles with a sensing range reduced to 60% of that seen during training.