TL;DR: RACER incorporates epistemic risk into online reinforcement learning, yielding epistemic risk-sensitive policies which avoid catastrophic failure both during exploration/training and at convergence.
Reinforcement learning can, in principle, learn highly performant policies with minimal assumptions about the environment, which makes it an appealing framework for challenging unstructured tasks like off-road driving.
When learning in the real world, there are important considerations beyond final training performance: poor behavior during training could damage the robot or the environment.
We use an ensemble of independently trained distributional critics to capture uncertainty, and train a policy to maximize the conditional value at risk (CVaR) of the ensemble's return distribution, an objective that we show is pessimistic for out-of-distribution events where the ensemble members diverge.
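For reference, the objective described above can be written using the standard lower-tail definition of CVaR. This is a sketch in assumed notation, not necessarily the exact form used by RACER: the symbols $N$ (number of critics), $Z_i$ (return distribution predicted by critic $i$), $p_i$ (its density or mass function), $\alpha$ (risk level), and $\pi$ (policy) are introduced here for illustration, and treating the ensemble as a uniform mixture of its members is an assumption.

```latex
% Standard CVaR of a return variable Z at risk level \alpha \in (0, 1]:
% the mean of the worst \alpha-fraction of outcomes.
\mathrm{CVaR}_\alpha(Z) = \frac{1}{\alpha} \int_0^{\alpha} F_Z^{-1}(\tau)\, d\tau

% Assumed ensemble distribution: a uniform mixture of the N critics.
p_{\mathrm{ens}}(z \mid s, a) = \frac{1}{N} \sum_{i=1}^{N} p_i(z \mid s, a)

% Risk-sensitive policy objective over the ensemble distribution.
\max_{\pi} \; \mathbb{E}_{s}\left[ \mathrm{CVaR}_\alpha\big( Z_{\mathrm{ens}}(s, \pi(s)) \big) \right]
```

Under this reading, a single critic that places mass on a low return contributes that mass to the lower tail of the mixture, so disagreement among critics directly lowers the objective.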
In the off-road driving setting, this allows RACER to start with a cautious policy and action limits and then speed up over time as the critic functions agree on the distribution of returns.
RACER starts at low speeds to avoid catastrophic high-speed crashes...
...and increases its speed over time as epistemic uncertainty decreases
As a consequence, our distributional risk-sensitive scheme yields surprisingly interpretable predictions during training. We select several events of interest during simulated training, where the agent encounters either a failure or a near-failure.
Here we see that the ensemble members have not yet converged, and there is substantial disagreement between the red ensemble member and the other critics. Starting at t=2, the critic highlighted in red predicts a high probability of failure. When the CVaR of the ensemble distribution is considered, this low-probability failure mode is upweighted. The average CVaR of all ensemble members (red) is much less sensitive to the outlier event than the CVaR of the ensemble distribution.
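The gap between the two aggregation schemes can be reproduced with a toy example. The sketch below is illustrative only: it assumes each critic outputs a categorical distribution over a shared set of return atoms, that the ensemble distribution is the uniform mixture of its members, and that the risk level is `alpha = 0.1`. The atoms, probabilities, and names (`cvar`, `members`, `mixture`) are hypothetical and not taken from the RACER implementation.

```python
def cvar(atoms, probs, alpha):
    """Mean of the worst alpha-fraction of a discrete return distribution."""
    remaining = alpha  # probability mass still to be collected from the lower tail
    total = 0.0
    for z, p in sorted(zip(atoms, probs)):  # visit returns from worst to best
        take = min(p, remaining)
        total += take * z
        remaining -= take
        if remaining <= 0:
            break
    return total / alpha


# Hypothetical return atoms shared by all critics: crash, stall, success.
atoms = [-10.0, 0.0, 5.0]

# Four critics agree that a crash is impossible; one outlier predicts a crash.
members = [
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.0, 0.1, 0.9],
    [0.5, 0.1, 0.4],  # outlier critic
]
alpha = 0.1  # assumed risk level

# Assumed ensemble distribution: uniform mixture of the member distributions.
mixture = [sum(col) / len(members) for col in zip(*members)]

cvar_of_mixture = cvar(atoms, mixture, alpha)
mean_of_cvars = sum(cvar(atoms, p, alpha) for p in members) / len(members)

print("CVaR of ensemble distribution:", cvar_of_mixture)
print("Average of per-member CVaRs:  ", mean_of_cvars)
```

In this toy setup the outlier contributes 10% of the mixture's mass at the crash atom, which fills the entire worst-10% tail, so the CVaR of the mixture equals the crash return. Averaging per-member CVaRs dilutes the outlier's prediction by the number of critics, which matches the qualitative behavior described above.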
In this case, the robot experiences a near-rollover event late in training but is eventually able to recover. Note that the ensemble members are in much closer agreement: several members place some amount of mass on the low-probability failure mode.