Simulation-based reinforcement learning for autonomous driving
Real-world urban space in CARLA simulator
We have recreated a real-world urban space as two new CARLA maps which approximately reflect the testing grounds for real-world deployments.
Below we present a preview of the custom-made level. The driving shown was done by a human to showcase the level.
Example videos taken from the training process
During training we periodically save videos of trajectories with some additional information. In this section we share several videos showcasing different weather conditions and CARLA levels. These videos were not edited; they were taken from training jobs as-is.
In each of the videos below the panes represent the following:
- row 1 column 1 – RGB camera image
- row 2 column 1 – saliency map of policy outputs w.r.t. RGB camera input
- row 3 column 1 – output of the semantic segmentator embedded in the policy
- row 3 column 2 – ground-truth semantic segmentation provided by the CARLA environment
- row 4 column 1 – simplified semantic segmentator output with fewer classes
- row 5 – policy output distribution. The red line represents the value sampled in the rollout
Selected failure cases
Pulse-width modulation steering when policy is provided last action
In this experiment we provided the policy with its last action as an additional input. Due to the inertia of the environment, the policy learned to use the last action to switch between two modes, controlling the car in a pulse-width-modulation-like manner.
Perceiving single-line road markings when trained on levels with double-lines only
Before we built our custom CARLA level that mimics the real-world environment, we trained policies only on two CARLA built-in levels.
Those two built-in levels contain only double-line road markings. As seen in the saliency maps, a policy that has only ever seen double-line road markings is not sensitive to single-line road markings.
Bug in reward function resulting in driving over the curb
Our reward function includes a term that penalizes the policy for not sticking to the center
of a lane. In our initial implementation, the distance used for calculating this penalty
was computed over all three spatial coordinates: X, Y, and Z.
Due to technical reasons, our list of lane-center positions was actually
placed above the road along the Z axis. This resulted in a policy that drives
with its two right-side wheels on a high curb: the increased elevation decreases the 3D distance to the center-line points floating above the ground.
The fix was to calculate the penalty using only the X and Y coordinates.
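A minimal sketch of the bug and the fix (the coordinates, the 1.5 m offset of the center-line point, and the curb dimensions are made-up illustrative values, not our actual map data):

```python
import math

# Hypothetical lane-center point: due to the bug it sits 1.5 m above the road.
CENTER = (0.0, 0.0, 1.5)

def center_distance_3d(pos, center=CENTER):
    # Buggy version: includes Z, so gaining elevation can *reduce* the penalty.
    return math.dist(pos, center)

def center_distance_2d(pos, center=CENTER):
    # Fixed version: only X and Y; elevation no longer matters.
    return math.hypot(pos[0] - center[0], pos[1] - center[1])

centered = (0.0, 0.0, 0.0)   # driving exactly on the lane center
on_curb = (0.85, 0.0, 0.3)   # right wheels on a 0.3 m curb, 0.85 m off-center

# With the buggy 3D distance, the curb position scores *better* ...
assert center_distance_3d(on_curb) < center_distance_3d(centered)
# ... while the fixed 2D distance correctly prefers the centered position.
assert center_distance_2d(centered) < center_distance_2d(on_curb)
```

The general lesson: when a reward term is derived from reference geometry, it should only depend on the coordinates the policy is actually supposed to control.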
Deviation of model from expert trajectories
Average deviation of models from expert trajectories. Measurements based on GPS.
The neural architecture of the policy function
PPO - training hyperparameters
We have analysed 25 outliers with results significantly below average. In this group we identified 3 cases of human error: a wrong chauffeur command was given to the autonomous system (e.g. "turn right" instead of "lane follow"). Another recurring mistake concerned attempts to drive on a sidewalk; these occurred mostly in two overpass scenarios and in the scenario factory_city-sud_strasse_u_turn. All attempts to drive on a sidewalk were stopped by the driver. We plan to precisely identify the cause of the "sidewalk driving" in the next stage of this project.
Assessment of the quality of driving
Models R1-reg and R4 drive competently in most tested situations. R1-reg and R4 showed less confident behaviour when confronted with a junction with multiple exits. In such situations R1-reg and R4 usually chose the correct driving direction, but the magnitude of their turns quite often required a correction.
R1 and the other discrete models tended to wobble. The wobbling was relatively soft, meaning that the models tended to drift gently from the extreme left of the lane to the extreme right and back. For safety reasons we had to correct this behaviour.
Further comments on the offline model evaluation
To compute the metric we again process the human reference drive frame by frame and compare the human action with the output of the evaluated model. We classify the requested steering wheel angle into one of three buckets: left if it is less than -0.02 radians, straight if it is between -0.02 and 0.02 radians, or right if it is greater than 0.02 radians.
For each bucket, we compute an F1 score between the human reference actions and the model outputs. The average of these three values is the final average F1 score.
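The metric can be reproduced with a few lines of plain Python (the function names below are ours, not from the original evaluation code):

```python
def bucketize(angle, threshold=0.02):
    # Map a steering wheel angle (radians) to one of three buckets.
    if angle < -threshold:
        return "left"
    if angle > threshold:
        return "right"
    return "straight"

def f1_for_bucket(human, model, bucket):
    # Standard F1 = 2*TP / (2*TP + FP + FN), treating `bucket` as the
    # positive class.
    tp = sum(h == bucket and m == bucket for h, m in zip(human, model))
    fp = sum(h != bucket and m == bucket for h, m in zip(human, model))
    fn = sum(h == bucket and m != bucket for h, m in zip(human, model))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def average_f1(human_angles, model_angles):
    human = [bucketize(a) for a in human_angles]
    model = [bucketize(a) for a in model_angles]
    return sum(f1_for_bucket(human, model, b)
               for b in ("left", "straight", "right")) / 3.0
```

A model that reproduces the human bucket on every frame scores 1.0, while systematic disagreement in any single bucket pulls the average down.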
As you can see in the accompanying figure, this metric also seems to correlate with the model's real-world performance.