Demonstrating a Walk in the Park:
Learning to Walk in 20 Minutes
With Model-Free Reinforcement Learning

Laura Smith*, Ilya Kostrikov*, Sergey Levine

[paper] [code]

Overview

We demonstrate a real A1 quadrupedal robot learning to walk completely from scratch, in just 20 minutes, on a variety of terrains. These results are enabled by recent advancements in machine learning software frameworks and model-free reinforcement learning algorithms that have made learning extraordinarily fast, and they suggest that training robots in the real world is perhaps more feasible than commonly believed.

Real-World Results

Annotated Real-World Training on Flat Ground

In this video, we show one full training run on the flat, solid ground in a lab. We show the exact progression of time on a physical clock (bottom right corner) and indicate the starting time with a blue line. The robot trains continuously. When the robot is about to leave the training area, a human lifts and reorients it. 

During initial data collection, when the robot is executing random actions, it slowly shuffles backward. After about a minute of this, we optimize the update step (~00:07-00:10 in the video) so that training runs quickly enough to perform updates between control steps. We see that immediately after taking the first few updates, the robot starts to move forward. For a few minutes, the robot consistently makes forward progress; however, it is rather unstable. When the robot stumbles, we use a learned reset controller to right it so it can continue to train without interruption. After roughly 15 minutes, the robot can traverse the training area from left to right, then right to left (~01:13-01:45 in the video).
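For illustration, the following is a minimal, hypothetical sketch (not the released implementation) of how an update function can be compiled once with jax.jit so that, after the one-time compilation pause, each call runs fast enough to fit between control steps:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy update, not the released training code: a jitted update is
# compiled once (the brief pause noted above) and thereafter runs fast enough
# to be called between the robot's control steps.

@jax.jit
def update(params, batch):
    """One toy gradient update on a batch of transitions."""
    def loss_fn(p):
        # Squared-error loss standing in for the actual actor/critic losses.
        pred = batch["obs"] @ p
        return jnp.mean((pred - batch["target"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    return params - 1e-3 * grads  # plain gradient step in place of Adam


params = jnp.zeros((8, 1))
batch = {"obs": jnp.ones((32, 8)), "target": jnp.zeros((32, 1))}
params = update(params, batch)  # first call compiles; subsequent calls are fast
```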

Effects of Continued Training

In this video, we show the robot after having trained for an hour (as opposed to 15 minutes).

Training on Irregular Terrains

Below, we show the training runs on the other terrains we considered: (top left) a memory foam mattress, (top right) mulch, (bottom left) grass, and (bottom right) a dirt hiking trail. Similar to the video above, we include a physical clock when possible. For the outdoor experiments, we include the SMPTE video timecode (hours:minutes:seconds:frames) at the bottom center to track the wall-clock time.

Executing random actions on the memory foam does not shift the robot's position, as the mattress is soft and depresses under the robot's weight. The robot begins to make forward progress after about 5 minutes of training. At 01:07 in the video (roughly 17 minutes of training), we see the robot has learned a gait that exploits the way its feet grab onto the soft surface, taking large steps and swinging its body forward.

Here, the robot is prone to digging itself into the ground, as the mulch is soft and loose. As shown from ~00:20-00:25 in the video, after 5 minutes the robot starts to learn to kick so as to free itself from obstructions caused by the ground. From ~00:45-00:55 we see the robot successfully traversing the mulch it stirred up and loosened further while training. And from ~01:05-01:10, we see the robot is also able to walk on a less disturbed, more compact region of the terrain.

Here, the robot starts to make forward progress within 4 minutes of operating. At 01:09 in the video (roughly 18 minutes of training), we see the robot has learned a walking gait.

The robot slowly shuffles backward during initial data collection on the dry dirt, similar to on the mat-covered flat ground in the lab. Again, we see that immediately after taking the first few updates, the robot starts to move forward. The robot faces a dip in the ground at ~00:25, and within a minute of this encounter, it learns to walk through it. From ~00:50-01:00 we see the robot walk up the trail, with only a little human assistance to keep it from heading off the path.

Design Decisions and Analysis

Damping

We find that the damping gain of the joint-level PD controller that tracks the commanded joint-angle targets has a large impact on training speed.
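As an illustrative sketch (with made-up gains, not the values used on the robot), the following shows where the damping term enters a joint-level PD controller that converts target joint angles into torques:

```python
import numpy as np

def pd_torques(q, dq, q_target, kp, kd):
    """Joint-level PD control: the policy outputs target joint angles q_target,
    and a low-level controller converts them to torques. The damping (kd) term
    penalizes joint velocity; its magnitude is the quantity studied here."""
    return kp * (q_target - q) - kd * dq

# Example with placeholder values (the real A1 gains are set elsewhere):
q = np.zeros(12)            # current joint angles (12 joints on the A1)
dq = 0.5 * np.ones(12)      # current joint velocities
q_target = 0.1 * np.ones(12)
torques = pd_torques(q, dq, q_target, kp=40.0, kd=5.0)
```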

Setup Ablations

We confirm that omitting the action filter or using an unrestricted action space adversely affects learning.
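Below is a hypothetical sketch of the two design choices ablated here: restricting the action space to a small box around a nominal pose, and low-pass filtering successive targets to avoid abrupt commands. The nominal pose, scale, and filter coefficient are placeholder values, not the ones used on the robot:

```python
import numpy as np

NOMINAL_POSE = np.zeros(12)   # hypothetical nominal joint angles
ACTION_SCALE = 0.3            # hypothetical limit (radians) around the nominal pose
ALPHA = 0.8                   # hypothetical low-pass filter coefficient

def process_action(raw_action, prev_target):
    """Map a raw policy output in [-1, 1] to a smoothed joint-angle target."""
    # Restricted action space: targets stay within a small box around a nominal pose.
    target = NOMINAL_POSE + ACTION_SCALE * np.clip(raw_action, -1.0, 1.0)
    # Action filtering: blend with the previous target to smooth the command.
    return ALPHA * target + (1.0 - ALPHA) * prev_target

prev = NOMINAL_POSE.copy()
for _ in range(3):
    raw = np.random.uniform(-1.0, 1.0, size=12)
    prev = process_action(raw, prev)
```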

SAC variants

In our setting, a variety of recently proposed techniques for improving the sample efficiency of SAC can provide the requisite improvements to train efficiently (evaluated in simulation).
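As one concrete example of such a technique, the sketch below shows a critic MLP regularized with dropout and layer normalization (in the style of DroQ-like SAC variants), which helps keep training stable when many updates are performed per environment step. This is a generic Flax illustration under those assumptions, not the released implementation:

```python
import flax.linen as nn
import jax
import jax.numpy as jnp

class RegularizedCritic(nn.Module):
    """Q-network with dropout and layer normalization after each hidden layer,
    one regularization scheme that supports a high update-to-data ratio
    (illustrative only)."""
    hidden_dims: tuple = (256, 256)
    dropout_rate: float = 0.01

    @nn.compact
    def __call__(self, obs, action, training: bool = True):
        x = jnp.concatenate([obs, action], axis=-1)
        for dim in self.hidden_dims:
            x = nn.Dense(dim)(x)
            x = nn.Dropout(rate=self.dropout_rate, deterministic=not training)(x)
            x = nn.LayerNorm()(x)
            x = nn.relu(x)
        return nn.Dense(1)(x)

critic = RegularizedCritic()
obs = jnp.zeros((1, 33))     # placeholder observation dimension
action = jnp.zeros((1, 12))  # placeholder action dimension
params = critic.init({"params": jax.random.PRNGKey(0),
                      "dropout": jax.random.PRNGKey(1)}, obs, action)
q = critic.apply(params, obs, action, rngs={"dropout": jax.random.PRNGKey(2)})
```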

Synchronous training

Performing updates after every environment step, rather than episodically, makes learning more efficient, so we require an implementation fast enough to allow this.
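The following schematic loop, with placeholder environment and agent classes standing in for the real robot and learner, illustrates the synchronous setup: an update is performed after every environment step rather than only at episode boundaries:

```python
import numpy as np

class DummyEnv:
    """Stand-in environment so the loop below runs; the real system is the A1 robot."""
    def reset(self):
        return np.zeros(33)
    def step(self, action):
        obs, reward, done = np.zeros(33), 0.0, False
        return obs, reward, done

class DummyAgent:
    """Stand-in agent; the real agent performs actor-critic updates on each call."""
    def act(self, obs):
        return np.random.uniform(-1.0, 1.0, size=12)
    def update(self, transition):
        pass  # one (or several) gradient updates per environment step

env, agent, replay = DummyEnv(), DummyAgent(), []
obs = env.reset()
for step in range(1000):
    action = agent.act(obs)
    next_obs, reward, done = env.step(action)
    replay.append((obs, action, reward, next_obs, done))
    agent.update(replay[-1])                  # synchronous: update between steps,
    obs = env.reset() if done else next_obs   # not only at the end of an episode
```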