FastRLAP
A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing
Kyle Stachowicz*, Dhruv Shah*, Arjun Bhorkar*, Ilya Kostrikov, and Sergey Levine
UC Berkeley
High-speed offroad vision-based navigation presents a range of challenges: aside from the usual difficulties associated with collision-free navigation from pixels, policies optimizing for speed must account for subtle terrain cues that can hinder high-speed driving. Learning-based methods offer a particularly appealing way to approach such challenges, as they can directly learn the relationship between perception and vehicle dynamics and, in principle, capture high-performance driving behaviors.
Conventional wisdom suggests that deep reinforcement learning, particularly learning end-to-end visuomotor policies, requires huge amounts of interaction with the environment. However, by combining recent advances in sample-efficient reinforcement learning with a task-relevant pretraining objective and an autonomous practicing framework that enables operation without human intervention, FastRLAP learns high-speed driving policies in the real world with as little as 20 minutes of interaction.
Method
Phase 1: Pretraining
We use offline reinforcement learning (IQL) to train a critic on a readily available, diverse offline dataset collected on a different robot, using a similar task objective: goal-directed velocity toward checkpoints selected from a mix of future states and random points in space.
After pretraining, the critic head is discarded and only the image encoder is retained. This yields a pretrained encoder optimized for extracting task-relevant features rather than general visual information.
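To make this objective concrete, below is a minimal sketch (in NumPy) of the goal relabeling, the goal-directed velocity reward, and the IQL expectile value loss. All function names, data shapes, and hyperparameter values (p_future, horizon, tau) are illustrative assumptions, not the released FastRLAP code.

```python
import numpy as np

def relabel_goal(trajectory, t, rng, workspace_lo, workspace_hi,
                 p_future=0.5, horizon=50):
    """Pick a goal checkpoint for step t: either a future state from the
    same trajectory or a random point in the workspace."""
    hi = min(t + horizon, len(trajectory) - 1)
    if hi > t and rng.random() < p_future:
        t_goal = int(rng.integers(t + 1, hi + 1))
        return np.asarray(trajectory[t_goal]["position"])   # future state as goal
    return rng.uniform(workspace_lo, workspace_hi)           # random point in space

def velocity_reward(velocity, position, goal):
    """Goal-directed velocity: speed projected onto the unit vector
    pointing from the robot toward the current checkpoint."""
    to_goal = np.asarray(goal) - np.asarray(position)
    return float(np.dot(velocity, to_goal / (np.linalg.norm(to_goal) + 1e-6)))

def expectile_loss(q_values, v_values, tau=0.9):
    """IQL value loss: asymmetric (expectile) regression of V(s) toward Q(s, a)."""
    u = q_values - v_values
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return float(np.mean(weight * u ** 2))
```

Setting the expectile parameter tau above 0.5 biases the value estimate toward the upper tail of Q, which is what allows IQL to learn from offline data without querying out-of-distribution actions.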
Phase 2: Autonomous Practicing with Online Reinforcement Learning
We apply recent sample-efficient online reinforcement learning techniques, using a small amount of prior data, to learn a policy for fast driving in real time. A single robot (performing onboard inference only) acts in the environment, communicating with a workstation (which performs training) over a Wi-Fi or cellular network. This process is fully autonomous, relying on a state machine to automatically recover from collisions and "stuck" states without human intervention.
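As a concrete illustration of the autonomous practicing loop, the sketch below shows one possible recovery state machine of the kind described above. The mode names, thresholds, and timings are assumptions chosen for illustration, not the values used by FastRLAP.

```python
from enum import Enum, auto

class Mode(Enum):
    DRIVE = auto()     # execute actions from the learned policy
    RECOVER = auto()   # scripted maneuver (e.g., back up and re-orient)

class PracticingStateMachine:
    """Switches between policy-driven driving and scripted recovery."""

    def __init__(self, stuck_speed=0.1, stuck_steps=30, recover_steps=20):
        self.mode = Mode.DRIVE
        self.stuck_speed = stuck_speed      # m/s below which the robot may be stuck
        self.stuck_steps = stuck_steps      # consecutive slow steps before declaring "stuck"
        self.recover_steps = recover_steps  # duration of the scripted recovery maneuver
        self._slow = 0
        self._recovering = 0

    def step(self, speed, collided):
        """Update the mode from the latest speed and collision signals."""
        if self.mode is Mode.DRIVE:
            self._slow = self._slow + 1 if speed < self.stuck_speed else 0
            if collided or self._slow >= self.stuck_steps:
                self.mode = Mode.RECOVER
                self._slow = 0
                self._recovering = 0
        else:  # Mode.RECOVER
            self._recovering += 1
            if self._recovering >= self.recover_steps:
                self.mode = Mode.DRIVE
        return self.mode
```

On each control step the robot consults this machine: in DRIVE it executes the policy's action and streams the transition to the workstation learner, and in RECOVER it executes a fixed maneuver such as backing up, so practicing continues without human intervention.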
The task-relevant features from the pretrained encoder allow the robot to quickly learn to drive the lap and then steadily improve online. Compared to baselines in which the encoder is trained from scratch or pretrained on a general-purpose ImageNet objective, this task-specific pretraining leads to much faster learning and better final policies.
Experiments
We present two additional experiments in challenging outdoor environments, as requested by the reviewers and meta-reviewer. These environments include multiple unmarked obstacles and diverse terrain, and we hope they address concerns regarding task complexity. The underlying learning algorithm is unchanged from the one presented in the paper.
Approximate top-down schematics are also shown for each environment, with blue dots (and surrounding circles) representing the checkpoints and their corresponding tolerances. These figures are only approximate representations of the environment, intended for visualization purposes.
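For concreteness, the snippet below sketches one way the checkpoint/tolerance convention can be interpreted: the active checkpoint counts as reached once the robot is within its tolerance radius, and the goal then advances to the next checkpoint around the loop. The function name and the default tolerance are assumptions, not values from the paper.

```python
import numpy as np

def advance_checkpoint(position, checkpoints, current_idx, tolerance=2.0):
    """Return the index of the active checkpoint after this step."""
    dist = np.linalg.norm(np.asarray(position) - np.asarray(checkpoints[current_idx]))
    if dist < tolerance:
        return (current_idx + 1) % len(checkpoints)   # wrap around to complete laps
    return current_idx
```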
Experiment 1
This experiment consists of a medium-scale (60-meter loop) outdoor course around a building. In addition to avoiding obstacles such as the building, a tree, and a nearby table, the robot encounters several patches of tall grass that tend to slow its motion. A successful policy should avoid the tall grass as much as possible, staying near paths where the grass is shorter and keeping to the left of the tree when passing it to avoid the grass on the right.
The bumpy terrain and varied textures make the task extremely challenging due to jerkiness and motion blur (see the robot's onboard observations on the right). Our system autonomously learns near-expert behavior, smoothly avoiding obstacles such as trees, benches, and buildings while also avoiding areas of dense vegetation. The first-person video (right) and schematic (not available to our system) show the different terrain characteristics.
Lap Time Progression. Best lap time so far shown in black.
Experiment 2
This environment consists of a large-scale (120 meter loop) outdoor course between a dense grove of trees on one side and a tree and several fallen logs on the other side. A successful policy must navigate between the trees and logs, a difficult task at high speeds. Additionally, the ground near the trees is covered in leaves, sticks, and other loose material, causing complex dynamics including highly speed-dependent over/understeer.
FastRLAP is able to learn a viable policy in only a handful of laps, and continues to decrease its lap times throughout training.
Lap Time Progression
Experiment 3
This experiment is a significantly larger course (~120 meters in length) with multiple obstacles, defined by four checkpoints. The floor of this environment is tiled and has very low friction, frequently causing oversteer and understeer during cornering.
Checkpoints and Schematic
Third-Person View
Lap Time Progression
Experiment 4
This (indoor) experiment is a large loop (70 meters in length) through the interior of a carpeted building with a mixture of glass and solid walls and many open corridors. The course is defined by a sequence of four checkpoints spaced roughly 15-20 meters apart.
Lap Time Progression
Experiment 5 & Baselines
Experiment 5 is a small but challenging indoor race course with two tight "hairpin" turns, taken at nearly the maximum steering angle, and a tight "chicane" (a right-left sequence). Mastering this environment requires the robot to discover fast "racing lines" that minimize unnecessary steering and carry high speed through the turns. The course is defined by three checkpoints. We extensively compare FastRLAP to several baselines in this environment.
These comparisons include ablations of our method and prior learning-based approaches. We find that our encoder pretraining outperforms both from-scratch initialization and an encoder trained on the much larger ImageNet dataset with a task-agnostic pretraining objective.
Lap Time Progression
Critic Analysis
We evaluate a variety of actions against the critic network to analyze its characteristics. We find that the critic reflects several nontrivial behaviors, indicating that the policy is able to effectively use visual cues to make decisions rather than just memorizing a sequence of actions.
The following images show a single-step evaluation of Q(s, a) for a range of steering actions (with a fixed target-speed action); a minimal sketch of this sweep is given after the examples below. The trajectories are color-coded, with hot colors representing high-value actions and cool colors representing low-value actions. The direction to the goal is indicated by a green arrow.
The critic assigns very low value to actions that would cause contact with an obstacle.
The critic suggests that the policy should begin turning towards the next checkpoint before the current checkpoint is reached.
In wide-open areas the critic reflects a beeline policy directly towards the next goal.
The critic assigns lower value to steering to the right of the tree, even though it is the geometric shortest path. This is because of the presence of tall grass, which slows down the robot and can cause it to become stuck.
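The sketch below illustrates how such a single-step critic sweep can be produced: evaluate Q(s, a) for a grid of steering commands at a fixed target speed and map the resulting values to a hot/cool colormap. The `critic` callable, the (steering, target speed) action parameterization, the ranges, and the colormap choice are assumptions for illustration, not the paper's visualization code.

```python
import numpy as np
import matplotlib.pyplot as plt

def sweep_steering(critic, observation, target_speed=2.0, n_actions=21):
    """Evaluate Q(s, a) over a grid of steering commands at a fixed target speed."""
    steering = np.linspace(-1.0, 1.0, n_actions)                     # normalized steering
    actions = np.stack([steering, np.full(n_actions, target_speed)], axis=1)
    q_values = np.array([critic(observation, a) for a in actions])
    return steering, q_values

def colorize(q_values):
    """Map Q-values to colors: hot (red) = high value, cool (blue) = low value."""
    lo, hi = float(q_values.min()), float(q_values.max())
    normed = (q_values - lo) / (hi - lo + 1e-8)
    return plt.cm.coolwarm(normed)                                    # one RGBA color per action
```

Each color can then be used to draw the short rollout of the corresponding steering action, as in the visualizations above.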
BibTeX
@article{stachowicz2023fastrlap,
author = {Kyle Stachowicz and Arjun Bhorkar and Dhruv Shah and Ilya Kostrikov and Sergey Levine},
title = {{FastRLAP: A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing}},
journal = {arXiv pre-print},
year = {2023},
url = {TODO}
}