RACER: Risk-sensitive Actor Critic with Epistemic Robustness


Learning Epistemic Risk-Sensitive Policies with Online RL

Kyle Stachowicz and Sergey Levine

UC Berkeley

TL;DR: RACER incorporates epistemic risk into online reinforcement learning, yielding epistemic risk-sensitive policies which avoid catastrophic failure both during exploration/training and at convergence.

Reinforcement learning can, in principle, learn highly performant policies with minimal assumptions about the environment, making it an appealing framework for challenging unstructured tasks like off-road driving.

When learning in the real world, there are important considerations beyond final training performance: poor behavior during training could damage the robot or the environment.

We use an ensemble of independently trained distributional critics to capture uncertainty, and train a policy to maximize the conditional value at risk (CVaR), which we show is pessimistic for out-of-distribution events where the ensemble members diverge.
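As a rough illustration of this objective, the sketch below computes the lower-tail CVaR of a categorical return distribution, which is one common way to represent the output of a distributional critic. The function name `cvar_lower_tail`, the return atoms, and the risk level `alpha` are hypothetical choices made for this example; they are not taken from the RACER implementation.

```python
import numpy as np


def cvar_lower_tail(atoms, probs, alpha):
    """Expected return over the worst `alpha` fraction of outcomes.

    `atoms` are possible return values and `probs` their probabilities
    (assumed to sum to 1). This is an illustrative helper, not RACER's code.
    """
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)

    # Sort outcomes from worst to best return.
    order = np.argsort(atoms)
    atoms, probs = atoms[order], probs[order]

    # Probability mass lying strictly below each atom.
    cum_before = np.cumsum(probs) - probs

    # Portion of each atom's mass that falls inside the lowest-alpha tail.
    tail_mass = np.clip(alpha - cum_before, 0.0, probs)

    return float(np.dot(tail_mass, atoms) / alpha)


# Hypothetical return distribution: a rare crash (-10) and a likely success (+10).
atoms = [-10.0, 0.0, 10.0]
probs = [0.1, 0.4, 0.5]

example_mean = float(np.dot(atoms, probs))
example_cvar = cvar_lower_tail(atoms, probs, alpha=0.2)
print("mean return:", example_mean)
print("CVaR at alpha=0.2:", example_cvar)
```

With `alpha=1.0` the function reduces to the ordinary expected return; smaller values of `alpha` weight the worst outcomes more heavily, which is what makes a policy trained on this objective cautious about rare failures.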

In the off-road driving setting, this allows RACER to start with a cautious policy and action limits and then speed up over time as the critic functions agree on the distribution of returns.

RACER starts at low speeds to avoid catastrophic high-speed crashes...

...and increases its speed over time as epistemic uncertainty decreases

RACER Yields Interpretable Value Distributions

As a consequence, our distributional risk-sensitive scheme yields surprisingly interpretable predictions during training. We select several events of interest during simulated training, where the agent encounters either a failure or a near-failure.

Scenario 1: Rollover event early in training

Here we see that the ensemble members have not yet converged: there is substantial disagreement between the critic highlighted in red and the other critics. Starting at t=2, the critic highlighted in red predicts a high probability of failure. When the CVaR of the ensemble distribution is considered, this failure mode is upweighted even though it has low probability under the ensemble as a whole. By contrast, the average CVaR of all ensemble members (red) is much less sensitive to the outlier event than the CVaR of the ensemble distribution.
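The gap between these two quantities can be reproduced in a small numerical sketch. It assumes that the "ensemble distribution" is the uniform mixture of the member distributions over a shared set of return atoms; that assumption, the helper `cvar_lower_tail`, and all numbers below are illustrative and are not taken from the RACER code or experiments.

```python
import numpy as np


def cvar_lower_tail(atoms, probs, alpha):
    """Expected return over the worst `alpha` fraction of outcomes (illustrative)."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(atoms)
    atoms, probs = atoms[order], probs[order]
    cum_before = np.cumsum(probs) - probs            # mass below each atom
    tail_mass = np.clip(alpha - cum_before, 0.0, probs)
    return float(np.dot(tail_mass, atoms) / alpha)


# Shared return atoms: failure, neutral, success (hypothetical values).
atoms = np.array([-10.0, 0.0, 10.0])

# Five hypothetical critics: four agree that failure is impossible,
# one outlier predicts a high probability of failure.
member_probs = np.array([
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.5, 0.1, 0.4],  # outlier critic
])
alpha = 0.1

# Option A: average the per-member CVaR values.
member_cvars = [cvar_lower_tail(atoms, p, alpha) for p in member_probs]
mean_member_cvar = float(np.mean(member_cvars))

# Option B: CVaR of the uniform mixture of member distributions (assumed
# here to be what "ensemble distribution" means).
mixture_probs = member_probs.mean(axis=0)
mixture_cvar = cvar_lower_tail(atoms, mixture_probs, alpha)

print("average of member CVaRs:", mean_member_cvar)
print("CVaR of mixture:        ", mixture_cvar)
```

In this toy setup the outlier contributes only one fifth of the averaged CVaR, whereas its failure mass fills the entire worst 10% tail of the mixture, so the mixture CVaR is considerably more pessimistic. When all members agree, the two quantities coincide.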

Scenario 2: Recovery late in training

In this case the robot experiences a near-rollover event late in training but is eventually able to recover. Note that the ensemble members are in much closer agreement: several of them place some mass on the low-probability failure mode.
