Simulation-based reinforcement learning for autonomous driving

This is the accompanying website for the paper by Błażej Osiński, Adam Jakubowski, Piotr Miłoś, Paweł Zięcina, Christopher Galias, Silviu Homoceanu, and Henryk Michalewski.

The preprint of the article can be accessed here: https://arxiv.org/abs/1911.12905

The paper is to appear at ICRA 2020 and was presented at the NeurIPS 2019 Workshop: Machine Learning for Autonomous Driving.

Real-world urban space in CARLA simulator

Example videos taken from the training process

Selected failure cases

Pulse-width modulation steering when policy is provided last action

Perceiving single-line road markings when trained on levels with double-lines only

Bug in reward function resulting in driving over the curb

Table of all evaluated models

Real-world performance

Deviation of model from expert trajectories

The neural architecture of the policy function

PPO - training hyperparameters

Outliers

Assessment of the quality of driving

Further comments on the offline model evaluation

MAE metric

F1 metric

Performance variance

Real-world urban space in CARLA simulator

We have recreated a real-world urban space as two new CARLA maps which approximately reflect the testing grounds for real-world deployments.

Below we present a preview of the custom-made level. The car was driven by a human to showcase the level.

Example videos taken from the training process

During training we periodically save videos of trajectories together with some additional information. In this section we share several videos showcasing different weather conditions and CARLA levels. These videos were not edited and were taken from training jobs as-is.

In each of the videos below the panes represent the following:

row 1 column 1 – RGB camera image
row 2 column 1 – saliency map of policy outputs w.r.t. RGB camera input
row 3 column 1 – output of semantic segmentator embedded in the policy
row 3 column 2 – ground-truth of semantic segmentator provided by CARLA environment
row 4 column 1 – simplified semantic segmentator output with fewer classes
row 5 – policy output distribution. The red line represents the value sampled in the rollout

Selected failure cases

Pulse-width modulation steering when policy is provided last action

In this experiment we provided the policy with its last action as an additional input. Due to the inertia of the environment, the policy learned to use the last action to switch between two modes, controlling the car in a manner resembling pulse-width modulation.
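The effect can be illustrated with a toy simulation. The sketch below is not the environment used in the experiments: it assumes a hypothetical first-order lag as a stand-in for the inertia of the car, and all names and parameter values are illustrative. It shows that rapidly alternating between two extreme steering commands yields a smooth intermediate steering value determined by the duty cycle.

```python
def simulate_pwm_steering(duty_cycle, period=10, steps=2000, lag=0.02,
                          low=-1.0, high=1.0):
    """Return the mean effective steering over the second half of the run.

    Hypothetical toy model: the command equals `high` for the first
    `duty_cycle` fraction of every period and `low` otherwise. The
    effective steering follows the command through a first-order lag,
    which stands in for the inertia of the environment.
    """
    effective = 0.0
    on_steps = round(duty_cycle * period)
    history = []
    for t in range(steps):
        command = high if (t % period) < on_steps else low
        # First-order lag: move a small fraction of the way to the command.
        effective += lag * (command - effective)
        history.append(effective)
    # Discard the first half so the initial transient does not bias the mean.
    tail = history[steps // 2:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    for duty in (0.2, 0.5, 0.8):
        print(duty, round(simulate_pwm_steering(duty), 3))
```

Under this assumed model, the average effective steering approaches `high * duty_cycle + low * (1 - duty_cycle)`, so a policy that only ever emits two extreme actions can still steer smoothly by varying how long it stays in each mode.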

Perceiving single-line road markings when trained on levels with double-lines only

Before we built our custom CARLA level that mimics the real-world environment, we trained policies on only two built-in CARLA levels.

Those two built-in levels include only double-line road markings. As seen in the saliency maps, a policy that only saw double-line road markings is not sensitive to single-line road markings.

Bug in reward function resulting in driving over the curb

Our reward function includes a term that penalizes the car for not staying in the center of a lane. In our initial implementation, the distance used to calculate the penalty was computed from all three spatial coordinates: X, Y, and Z. For technical reasons, our list of lane-center positions was actually placed above the road along the Z axis. This resulted in a policy that drives with the two right-side wheels on a high curb, which increases the car's elevation and thus decreases its distance to the center-line point above the ground.

The fix was to calculate the penalty using only the X and Y coordinates.
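The bug and its fix can be sketched as follows. This is a minimal illustration rather than the actual reward code: the function name, the `use_z` flag, and all coordinates are hypothetical, chosen so that the lane-center point sits above the road surface as described above.

```python
import math


def lane_center_distance(car_xyz, center_xyz, use_z=False):
    """Distance from the car to a lane-center point.

    use_z=True mimics the buggy variant (full 3D distance);
    use_z=False mimics the fix (distance in the X-Y plane only).
    """
    dx = car_xyz[0] - center_xyz[0]
    dy = car_xyz[1] - center_xyz[1]
    if use_z:
        dz = car_xyz[2] - center_xyz[2]
        return math.sqrt(dx * dx + dy * dy + dz * dz)
    return math.hypot(dx, dy)


# Hypothetical numbers: the lane-center point floats 1.0 m above the road.
center = (0.0, 0.0, 1.0)
car_on_lane = (0.0, 0.0, 0.0)   # centered in the lane, on the road surface
car_on_curb = (0.3, 0.0, 0.1)   # shifted sideways and raised by the curb

for name, car in (("lane", car_on_lane), ("curb", car_on_curb)):
    print(name,
          round(lane_center_distance(car, center, use_z=True), 3),
          round(lane_center_distance(car, center, use_z=False), 3))
```

With these assumed values, the 3D distance is smaller on the curb than in the lane center, so the buggy penalty favors the curb. The X-Y distance ranks the two positions correctly.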

Table of all evaluated models

Real-world performance

Summary of experiments with baselines across nine scenarios. The columns to the right show the mean and max of autonomy (the percentage of distance driven autonomously). Models are sorted according to their mean performance.

Summary of experiments across 9 scenarios with baselines. Each subfigure represents performance for a given deployment scenario.
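The autonomy metric used in these summaries (the percentage of distance driven autonomously) can be written down as a short sketch. The function names and the sample distances are hypothetical, and aggregating per run is an assumption made only for illustration.

```python
def autonomy_percent(autonomous_m, total_m):
    """Percentage of the total distance that was driven autonomously."""
    if total_m <= 0:
        raise ValueError("total distance must be positive")
    return 100.0 * autonomous_m / total_m


def summarize_autonomy(runs):
    """Return (mean, max) autonomy over (autonomous_m, total_m) pairs."""
    values = [autonomy_percent(a, t) for a, t in runs]
    return sum(values) / len(values), max(values)


# Hypothetical runs: (meters driven autonomously, total meters driven).
mean_autonomy, max_autonomy = summarize_autonomy([(90.0, 100.0), (50.0, 100.0)])
print(mean_autonomy, max_autonomy)
```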

Deviation of model from expert trajectories

Average deviation of models from expert trajectories. Measurements based on GPS.

The neural architecture of the policy function

PPO - training hyperparameters

Outliers

We have analysed 25 outliers with results significantly below average. In this group we have identified 3 cases of human error, where a wrong chauffeur command was given to the autonomous system (e.g. "turn right" instead of "lane follow"). Another recurring mistake concerned attempts to drive on a sidewalk. These attempts occurred mostly in the two overpass scenarios and in the scenario factory_city-sud_strasse_u_turn. All attempts to drive on a sidewalk were stopped by the driver. We plan to identify the precise reason for "sidewalk driving" in the next stage of this project.

Assessment of the quality of driving

Models DISCRETE-REG and CONTINUOUS-PLAIN drive competently in most tested situations. These models showed less confident behaviour when confronted with a junction with multiple exits. In such situations they usually chose the correct driving direction, but the magnitude of their turns quite often required a correction.

DISCRETE-PLAIN and other discrete models tended to wobble. The wobbling was relatively soft, meaning that the models tended to turn gently from the extreme left of the lane to the extreme right of the road and back. For safety reasons we had to correct this behaviour.

Further comments on the offline model evaluation

MAE metric

F1 metric

To compute the metric we again process the human reference drive frame by frame and compare the human action with the output of the evaluated model. We classify the requested steering wheel angle into one of three buckets: left, straight, or right, if it is, respectively, less than -0.02 radians, between -0.02 and 0.02 radians, or greater than 0.02 radians.

For each of the buckets, we compute an F1 score between the human reference action and the model output. The average of these three values is the final average F1 score.

As one can see in the accompanying figure, this metric also seems to correlate with the model's real-world performance.
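The computation of the average F1 score can be sketched as below. The bucket thresholds come from the description above. The function names, the treatment of an empty bucket (its F1 is set to 0.0), and the sample angles are assumptions made for illustration, not the evaluation code itself.

```python
def bucket(angle_rad, threshold=0.02):
    """Map a steering wheel angle in radians to left / straight / right."""
    if angle_rad < -threshold:
        return "left"
    if angle_rad > threshold:
        return "right"
    return "straight"


def average_f1(human_angles, model_angles, threshold=0.02):
    """Average of the per-bucket F1 scores, computed frame by frame."""
    human = [bucket(a, threshold) for a in human_angles]
    model = [bucket(a, threshold) for a in model_angles]
    scores = []
    for label in ("left", "straight", "right"):
        pairs = list(zip(human, model))
        tp = sum(h == label and m == label for h, m in pairs)
        fp = sum(h != label and m == label for h, m in pairs)
        fn = sum(h == label and m != label for h, m in pairs)
        denom = 2 * tp + fp + fn
        # Assumption: a bucket with no occurrences at all scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical frames: the model misses one right turn and goes straight.
human_drive = [-0.1, 0.0, 0.1, 0.1]
model_drive = [-0.1, 0.0, 0.1, 0.0]
print(average_f1(human_drive, model_drive))
```

In this example the per-bucket scores are 1 for left and 2/3 for both straight and right, which gives an average of 7/9.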

Performance variance

Figure: variance of autonomous driving performance.