Simulation-based reinforcement learning for autonomous driving

Real-world urban space in CARLA simulator

We have recreated a real-world urban space as two new CARLA maps which approximately reflect the testing grounds for real-world deployments.

Below we present a preview of the custom-made level; the car was driven by a human to showcase it.

Example videos taken from the training process

During training we periodically save videos of trajectories with some additional information. In this section we share multiple videos showcasing different weather conditions and CARLA levels. The videos are unedited and were taken from training jobs as-is.

In each of the videos below the panes represent the following:

  • row 1 column 1 – RGB camera image
  • row 2 column 1 – saliency map of policy outputs w.r.t. RGB camera input
  • row 3 column 1 – output of the semantic segmentation model embedded in the policy
  • row 3 column 2 – ground-truth semantic segmentation provided by the CARLA environment
  • row 4 column 1 – simplified semantic segmentation output with fewer classes
  • row 5 – policy output distribution; the red line represents the value sampled in the rollout
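The saliency pane can be produced, for instance, by differentiating the policy output with respect to each input pixel. Below is a minimal numerical sketch; the toy `policy` function, the image size, and the finite-difference approach are our own illustrative assumptions, not the actual network or method used in training.

```python
import numpy as np

def policy(image):
    # Toy stand-in for the policy network: maps an image to a single
    # steering value. The real policy is a CNN; this is only for illustration.
    return float(np.tanh(image.mean()))

def saliency_map(image, eps=1e-3):
    """Absolute finite-difference sensitivity of the policy output to each pixel."""
    base = policy(image)
    sal = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps
        sal[idx] = abs(policy(bumped) - base) / eps
    return sal

rng = np.random.default_rng(0)
img = rng.random((8, 8))   # tiny "camera image" for the sketch
sal = saliency_map(img)    # same shape as the input image
```

In practice the gradient would come from the network's autodiff rather than finite differences; the sketch only shows what the pane visualizes, namely per-pixel sensitivity of the policy output.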

Selected failure cases

Pulse-width modulation steering when policy is provided last action

In this experiment we provided the policy with its last action as an additional input. Due to the inertia of the environment, the policy learned to use the last action to switch between two modes, controlling the car in a pulse-width-modulation-like manner.
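The effect resembles PWM in electronics: by rapidly alternating between two extreme actions, the policy realizes an intermediate average control. A toy illustration (the action values and duty cycle below are made up, not measured from the policy):

```python
# Alternating between two extreme steering actions approximates an
# intermediate steering angle, as in pulse-width modulation.
LEFT, RIGHT = -1.0, 1.0   # the two "modes" (illustrative values)

def pwm_steering(duty_cycle, steps=1000):
    """Average steering when RIGHT is applied for `duty_cycle` fraction of steps."""
    actions = [RIGHT if i / steps < duty_cycle else LEFT for i in range(steps)]
    return sum(actions) / steps

print(pwm_steering(0.75))  # 0.5: mostly-right switching yields a mild right turn
```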

Perceiving single-line road markings when trained on levels with double-lines only

Before we built our custom CARLA level that mimics the real-world environment, we trained policies only on two CARLA built-in levels.

Those two built-in levels contain only double-line road markings. As seen in the saliency maps, a policy that has only seen double-line road markings is not sensitive to single-line road markings.

Bug in reward function resulting in driving over the curb

Our reward function includes a term that penalizes the car for not sticking to the center of a lane. In our initial implementation, the distance used to calculate this penalty was computed over all three spatial coordinates, X, Y and Z.

Due to technical reasons, our list of lane-center positions was actually placed above the road in the Z axis. This resulted in a policy that drives with its two right-side wheels on a high curb: the car's elevation is increased, so its distance to the center-line point above the ground is decreased.

The fix was to calculate the penalty using only the X and Y coordinates.
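A minimal sketch of the bug and the fix (function names and the exact penalty form are our assumptions; the actual reward is more involved). Dropping the Z coordinate removes the incentive to gain elevation:

```python
import numpy as np

def center_distance_buggy(car_pos, center_pos):
    # Initial implementation: full 3-D distance. Because the stored lane-center
    # points sat above the road surface, raising the car (e.g. onto a curb)
    # reduced this distance and hence the penalty.
    return float(np.linalg.norm(car_pos - center_pos))

def center_distance_fixed(car_pos, center_pos):
    # Fix: measure the deviation in the road plane only (X, Y).
    return float(np.linalg.norm(car_pos[:2] - center_pos[:2]))

center = np.array([0.0, 0.0, 1.5])   # lane-center point stored 1.5 m above the road
on_road = np.array([0.5, 0.0, 0.0])  # car on the road surface
on_curb = np.array([0.5, 0.0, 0.3])  # car raised by driving onto a curb

# The buggy metric rewards elevation; the fixed metric does not.
assert center_distance_buggy(on_curb, center) < center_distance_buggy(on_road, center)
assert center_distance_fixed(on_curb, center) == center_distance_fixed(on_road, center)
```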

Deviation of model from expert trajectories

Average deviation of models from expert trajectories. Measurements based on GPS.
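One way to compute such a deviation (our assumption of the exact procedure, not a description of the measurement pipeline): for each GPS point of the model's drive, take the distance to the nearest point of the expert trajectory, then average. This assumes positions have already been projected from GPS into a local metric frame:

```python
import numpy as np

def average_deviation(model_xy, expert_xy):
    """Mean distance from each model trajectory point to the nearest expert point.

    Both inputs are (N, 2) arrays of planar positions in meters.
    """
    # Pairwise distances between the two point sets, shape (N_model, N_expert).
    d = np.linalg.norm(model_xy[:, None, :] - expert_xy[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

expert = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
model = np.array([[0.0, 0.2], [1.0, -0.1], [2.0, 0.3]])
dev = average_deviation(model, expert)  # ≈ 0.2 m for this toy example
```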

The neural architecture of the policy function

PPO - training hyperparameters


We have analysed 25 outliers with results significantly below average. In this group we identified three cases of human error: a wrong chauffeur command was given to the autonomous system (e.g. "turn right" instead of "lane follow"). Another recurring mistake concerned attempts to drive on a sidewalk; these occurred mostly in the two overpass scenarios and in the scenario factory_city-sud_strasse_u_turn. All attempts to drive on a sidewalk were stopped by the driver. We plan to precisely identify the cause of the "sidewalk driving" in the next stage of this project.

Assessment of the quality of driving

Models R1-reg and R4 drive competently in most tested situations. They showed less confident behaviour when confronted with a junction with multiple exits: in such situations they usually chose the correct driving direction, but the magnitude of the turn quite often required a correction.

R1 and the other discrete models tended to wobble. The wobbling was relatively soft, meaning that the models drifted gently from the extreme left of the lane to the extreme right and back. For safety reasons we had to correct this behaviour.

Further comments on the offline model evaluation

MAE metric

F1 metric

To compute this metric we again process the human reference drive frame by frame and compare the human action with the output of the evaluated model. We classify the requested steering-wheel angle into one of three buckets: left if it is less than -0.02 radians, straight if it is between -0.02 and 0.02 radians, and right if it is greater than 0.02 radians.

For each bucket we compute an F1 score between the human reference actions and the model outputs. The average of these three values is the final average F1 score.
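The bucketing and averaging can be sketched as follows. The threshold values come from the text; the one-vs-rest F1 computation is the standard definition, and the sample angle sequences are made up for illustration:

```python
def bucket(angle):
    """Classify a steering-wheel angle (radians) as left / straight / right."""
    if angle < -0.02:
        return "left"
    if angle > 0.02:
        return "right"
    return "straight"

def f1_per_bucket(human, model, label):
    """One-vs-rest F1 score for a single bucket."""
    tp = sum(h == label and m == label for h, m in zip(human, model))
    fp = sum(h != label and m == label for h, m in zip(human, model))
    fn = sum(h == label and m != label for h, m in zip(human, model))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(human_angles, model_angles):
    """Average of the three per-bucket F1 scores."""
    human = [bucket(a) for a in human_angles]
    model = [bucket(a) for a in model_angles]
    return sum(f1_per_bucket(human, model, b)
               for b in ("left", "straight", "right")) / 3

# Made-up per-frame steering angles (radians) for a short clip.
human = [-0.5, -0.03, 0.0, 0.01, 0.4, 0.05]
model = [-0.4, 0.0, 0.0, 0.03, 0.5, 0.06]
score = average_f1(human, model)
```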

As you can see in the accompanying figure, this metric also seems to correlate with the model's real-world performance.

Performance variance

Autonomous driving variance (arXiv)