Safe Autonomous Racing

We present results related to our paper, 'Safe Autonomous Racing via Approximate Reachability on Ego-vision'. We provide a brief overview of our method, then present qualitative results, such as videos and GIFs, which cannot be effectively displayed in the paper.

Overview

Racing demands that each vehicle drive at its physical limits, where any safety infraction could lead to catastrophic failure. In this work, we study the problem of safe reinforcement learning (RL) for autonomous racing, using the vehicle's ego-camera view and speed as input.

Given the nature of the task, autonomous agents need to be able to 1) identify and avoid unsafe scenarios under the complex vehicle dynamics, and 2) make sub-second decisions in a fast-changing environment. To satisfy these criteria, we propose to incorporate Hamilton-Jacobi (HJ) reachability theory, a safety verification method for general non-linear systems, into the constrained Markov decision process (CMDP) framework. HJ reachability not only provides a control-theoretic approach to learning about safety, but also enables low-latency safety verification. Though HJ reachability is traditionally not scalable to high-dimensional systems, we demonstrate that, with neural approximation, the HJ safety value can be learned directly on vision context, the highest-dimensional problem studied via the method to date.
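To make the learning rule concrete, below is a minimal sketch of the discounted HJ safety Bellman backup that a safety critic can be trained with; the function and variable names (hj_safety_target, l_s, q_next_max) are ours for illustration and not the paper's exact implementation.

```python
import torch

def hj_safety_target(l_s, q_next_max, gamma=0.999):
    """Discounted HJ safety Bellman target (illustrative sketch).

    l_s        : signed safety margin l(s) of the current state
                 (positive = safe, negative = constraint violated).
    q_next_max : max_a' Q_safe(s', a'), the safety value of the next
                 state under the best (safest) action.
    gamma      : discount factor, typically annealed toward 1.
    """
    # The backup tracks the worst margin reachable along the trajectory,
    # assuming the safest action is taken at every step.
    return (1.0 - gamma) * l_s + gamma * torch.minimum(l_s, q_next_max)

# The safety critic Q_safe(s, a) is then regressed onto this target with a
# mean-squared-error loss, as in standard Q-learning.
```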


(b) SAGE consists of two policies, responsible for safety and performance, respectively. The safety controller intervenes when the current state-action pair is deemed unsafe by the safety critic.
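A minimal sketch of this intervention logic is shown below, assuming a learned safety critic whose values are positive for safe state-action pairs; the module names and the zero threshold are illustrative, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def select_action(obs, performance_actor, safety_actor, safety_critic,
                  safety_threshold=0.0):
    """Safety-shielded action selection (illustrative sketch)."""
    a_perf = performance_actor(obs)          # action proposed by the performance policy
    q_safe = safety_critic(obs, a_perf)      # learned HJ safety value of (s, a)

    if q_safe.item() > safety_threshold:     # the pair is deemed safe
        return a_perf
    # Otherwise the safety controller takes over for this step.
    return safety_actor(obs)
```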

Classical Control Benchmarks: Which learning rule is most effective for the safety critic?

We compare the HJ Bellman update to alternative learning rules for the safety critic on two classical control benchmarks (Double Integrator and Dubins' Car), where the ground-truth partition into safe and unsafe states is known, allowing us to evaluate each learning rule's performance for safety analysis.

Performance comparison on the two benchmarks against alternative methods: safety Q-functions for reinforcement learning (SQRL) and the conservative Q-learning (CQL) objective used by the conservative safety critic (CSC).

In our experiments, the HJ Bellman update consistently outperforms SQRL and CQL, achieving an AUROC close to 1 on both tasks with very small variance across runs.
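Because both benchmarks admit analytic ground truth, the learned safety values can be scored directly as a binary classifier of safe versus unsafe states. A minimal evaluation sketch is shown below; the function and variable names are ours for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safety_auroc(values, is_safe):
    """AUROC of learned safety values against ground-truth labels.

    values  : learned safety values for a batch of sampled states
              (higher = predicted safer).
    is_safe : ground-truth boolean labels, e.g. from the analytic
              safe set of the Double Integrator.
    """
    return roc_auc_score(np.asarray(is_safe, dtype=int), np.asarray(values))
```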

The key intuition is that the HJ safety value characterizes the worst-case outcome given the best possible actions over an infinite horizon, while SQRL and CSC define the safety value function as the expected cumulative cost of safety violations. For safety-critical applications, such as autonomous driving and assistive robotics, we are concerned with whether failures can occur at all, not with the probability of failure. Thus, the HJ Bellman update is more appropriate for safety analysis.
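The difference is easy to see on a single rollout. The toy sketch below (a simplification: the true HJ value also optimizes over actions rather than scoring a fixed trajectory) contrasts the worst-case safety margin with a discounted cumulative-cost-style target; all names are illustrative.

```python
import numpy as np

def worst_case_margin(margins):
    """HJ-style safety value along a rollout: min_t l(s_t)."""
    return float(np.min(margins))

def discounted_cum_cost(costs, gamma=0.99):
    """Cumulative-cost-style target, as used by SQRL / CSC critics."""
    return float(np.sum(gamma ** np.arange(len(costs)) * np.asarray(costs)))

# A trajectory that grazes the unsafe set exactly once:
margins = np.array([0.8, 0.3, -0.1, 0.5])    # signed distance to the unsafe set
costs = (margins < 0).astype(float)          # indicator of a violation

print(worst_case_margin(margins))    # -0.1 -> flagged unsafe, however brief the excursion
print(discounted_cum_cost(costs))    # ~0.98 -> stays small when violations are rare
```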

(We refer interested readers to our paper for a description of the methods and implementation details.)

Experiment: Safety Gym Benchmark

Qualitative Behavior Comparisons

PPO-Lagrangian

PPO-Lagrangian may ignore traps in favor of fast navigation to goal locations; this behavior can yield high performance but also incurs high cost.

SAGE

SAGE learns mature obstacle-avoidance behaviors, such as navigating around traps, and thereby incurs fewer constraint violations.

In Safety Gym, SAGE has significantly fewer constraint violations compared to other baselines, and the number of violations decreases over time (86.64% reduction in AverageEpisodeCost and 52.48% reduction in CumulativeCost, compared to PPO-Lagrangian).

Performance of SAGE compared to baselines on the CarGoal1-v0 (top row) and PointGoal1-v0 (bottom row) benchmarks, averaged over 5 random seeds. In Goal tasks, agents must navigate to observed goal locations (indicated by the green regions) while avoiding obstacles (e.g., vases in cyan and hazards in blue).

While CPO and PPO-Lagrangian take into account that a certain number of violations is permissible, there is no such mechanism in SAGE, as HJ reachability theory defines safety in an absolute sense. The inability to allow for some level of safety infraction unfortunately compromises performance.

The violations that do occur in SAGE result from neural approximation error, and their number decreases over time as the safety actor-critic gains experience, despite the randomized and constantly changing episodic layouts.

Experiment: Learn-to-Race Benchmark

We evaluate SAGE on Learn-to-Race (L2R), a recently released, high-fidelity autonomous racing environment. The environment provides simulated racing tracks modeled after real-world counterparts, such as the famed Thruxton Circuit in the UK. Learning-based agents can be trained and evaluated according to challenging metrics under realistic vehicle and environmental dynamics, making L2R a compelling target for safe reinforcement learning. Each track features components that challenge autonomous agents, such as sharp turns (shown in (b)), while SAGE uses only ego-camera views (shown in (c)) and speed.

(a) Aerial view

(b) Third-person

(c) Ego-view

Qualitative Behavior Comparisons

We compare the learnable safety actor-critic in SAGE to a static safety actor-critic pre-computed from a nominal vehicle model. The videos here are from an early stage of training, to highlight the behavior of the safety actor-critic, since the safety actor intervenes less frequently as the performance agent gains experience.

SafeRandom

The static safety actor-critic is able to keep a random agent on the race track most of the time, given a sufficiently large safety margin. However, the static safety controller is extremely conservative, hard-braking whenever the vehicle becomes even marginally less safe.

SAGE

Instead of hard-braking, the safety actor-critic in the SAGE agent learns to recover from the road boundary with little compromise in speed.

SafeSAC

At high speed, applying the 'optimal' safety action derived from the kinematic bicycle model, which does not model tire dynamics, can cause the vehicle to lose traction and spin out of control.
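For reference, a minimal kinematic bicycle model of the kind such a nominal controller is typically derived from is sketched below (parameters and names are illustrative, not the paper's implementation); note that it contains no tire or friction model, which is precisely what breaks down at high speed.

```python
import numpy as np

def kinematic_bicycle_step(x, y, yaw, v, accel, steer, dt=0.02, wheelbase=2.9):
    """One Euler step of a kinematic bicycle model (illustrative sketch).

    State: position (x, y), heading yaw, speed v.
    Inputs: longitudinal acceleration `accel` and front steering angle `steer`.
    Assumes the center of gravity sits midway between the axles; there is
    no notion of tire slip, load transfer, or friction limits.
    """
    lr = 0.5 * wheelbase                    # distance from the CoG to the rear axle
    beta = np.arctan(0.5 * np.tan(steer))   # kinematic (geometric) slip angle
    x += v * np.cos(yaw + beta) * dt
    y += v * np.sin(yaw + beta) * dt
    yaw += (v / lr) * np.sin(beta) * dt
    v += accel * dt
    return x, y, yaw, v
```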


SAGE

In comparison, the SAGE agent learns to recover from marginally safe states more smoothly. As a result, SAGE successfully executes the same S-curve cornering that challenges the SafeSAC agent.

Performance plots

Left: episode percent completion. Right: speed. Both are evaluated every 5,000 steps over an episode (a single lap) and averaged over 5 random seeds. Results are reported on Track01:Thruxton in L2R.

SAGE learns safety directly from vision context and can recover from marginally safe states more smoothly. Having a safety actor-critic dedicated to learning about safety significantly boosts the initial safety performance of SAGE compared to the SAC agent, even though both the performance and the safety actor-critics are randomly initialized. In practice, we envision the safety actor-critic being warm-started with the nominal model or observational data, and fine-tuned through interactions with the environment.