Simulation-based reinforcement learning for autonomous driving

Real-world urban space in CARLA simulator

We have recreated a real-world urban space as two new CARLA maps that approximately reflect the testing grounds used for real-world deployments.

Below we present a preview of the custom-made level. The car was driven by a human to showcase the level.

Example videos taken from the training process

During training we periodically save videos of trajectories with some additional information. In this section we share multiple videos showcasing different weather conditions and CARLA levels. These videos were not edited and are taken from training jobs as-is.

In each of the videos below the panes represent the following:

  • row 1 column 1 – RGB camera image
  • row 2 column 1 – saliency map of policy outputs w.r.t. RGB camera input
  • row 3 column 1 – output of the semantic segmentation model embedded in the policy
  • row 3 column 2 – ground-truth semantic segmentation provided by the CARLA environment
  • row 4 column 1 – simplified semantic segmentation output with fewer classes
  • row 5 – policy output distribution; the red line marks the value sampled in the rollout
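The saliency pane above can be produced in several ways; as a rough illustration, here is a minimal finite-difference sketch (an assumption on our part — the actual implementation likely uses autodiff gradients instead): perturb each input pixel slightly and measure how much the policy output changes.

```python
import numpy as np

def saliency(policy_fn, image, eps=1e-2):
    """Finite-difference saliency: |d policy / d pixel| for each pixel."""
    base = policy_fn(image)
    sal = np.zeros(image.shape[:2])
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            perturbed = image.copy()
            perturbed[i, j] += eps  # nudge one pixel
            sal[i, j] = np.abs(policy_fn(perturbed) - base) / eps
    return sal

# Toy check: for a linear "policy" the saliency recovers the weights exactly.
w = np.arange(12.0).reshape(3, 4)
policy_fn = lambda img: float((img * w).sum())
image = np.ones((3, 4))
sal = saliency(policy_fn, image)
```

For real RGB inputs this loop is far too slow; a single backward pass through the network gives the same map in one shot.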

Experiments with modelling standard deviation of a continuous action distribution

Learnable standard deviation value detached from the policy

Although the policy manages to solve the scenarios, the wobbling makes it impractical to deploy on a real car. This footage comes from an earlier stage of the project, when the policy controlled both steering and throttle; in later stages we switched to controlling steering only.
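This setup can be sketched as follows (a minimal NumPy illustration, not the project's actual code): the log standard deviation is a free learnable parameter, independent of the observation, while the mean comes from the network.

```python
import numpy as np

class DetachedStdHead:
    """Gaussian policy head with a state-independent, learnable log-std.

    The log-std is a free parameter updated by the optimizer; it is not a
    function of the observation (init value here is illustrative).
    """

    def __init__(self, action_dim, init_log_std=-0.5):
        self.log_std = np.full(action_dim, init_log_std)

    def sample(self, mean, rng):
        std = np.exp(self.log_std)  # always positive
        action = mean + std * rng.standard_normal(mean.shape)
        return np.clip(action, -1.0, 1.0)  # steering range [-1, 1]

rng = np.random.default_rng(0)
head = DetachedStdHead(action_dim=1)
action = head.sample(np.array([0.1]), rng)
```

Because the std cannot shrink per-state, the policy keeps injecting the same amount of noise everywhere, which is one plausible source of the wobbling seen in the footage.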

Small constant standard deviation

Using a small, constant standard deviation produces a less wobbly policy that still maintains exploration (thanks to the dense reward function). This policy behaves much better when deployed on a real car but still struggles in scenarios requiring more precision, such as narrow roads.
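The corresponding sampling rule is trivial; a short sketch (the specific std value is illustrative, not taken from the experiments):

```python
import numpy as np

FIXED_STD = 0.1  # small constant exploration noise; value is illustrative

def sample_action(mean, rng, std=FIXED_STD):
    """Sample from N(mean, std) with a fixed std, clipped to the action range."""
    noise = std * rng.standard_normal(np.shape(mean))
    return np.clip(mean + noise, -1.0, 1.0)

rng = np.random.default_rng(0)
action = sample_action(np.array([0.2]), rng)
```

The trade-off is visible in the videos: a fixed small std caps the wobble, but also caps how precisely the policy can act where fine corrections matter, e.g. on narrow roads.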

Policy outputs both mean and standard deviation

In this approach the policy network outputs both the mean and the standard deviation of the action distribution. In the video, taken from the middle of training, we can see smaller std values on straight sections and higher std values near intersections.
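A minimal sketch of such a state-dependent head, assuming a linear layer on top of shared features (the clipping bounds are our assumption, used to keep the std in a sane range):

```python
import numpy as np

def policy_head(features, w_mean, w_log_std):
    """Both mean and log-std are functions of the state features."""
    mean = np.tanh(features @ w_mean)                 # mean in [-1, 1]
    log_std = np.clip(features @ w_log_std, -5.0, 1.0)  # bound the noise scale
    return mean, np.exp(log_std)

rng = np.random.default_rng(0)
features = rng.standard_normal(8)       # shared network output for one state
w_mean = rng.standard_normal((8, 1))
w_log_std = rng.standard_normal((8, 1))
mean, std = policy_head(features, w_mean, w_log_std)
```

Since the std now depends on the observation, the policy can learn exactly the behavior seen in the video: low noise on straights, higher noise near intersections where exploration still pays off.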

Selected failure cases

Pulse-width modulation steering when policy is provided last action

In this experiment we provided the policy with its last action as an additional input. Due to the inertia of the environment, the policy learned to use the last action to switch between two modes, controlling the car in a pulse-width-modulation-like manner.
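The failure mode can be illustrated with a toy bang-bang rule (a hypothetical stand-in for the learned policy, not the actual network): conditioning on the last action lets the policy alternate between two extreme steering commands, and the car's inertia averages them into an effective duty-cycled steering angle.

```python
import numpy as np

def pwm_policy(last_action):
    """Hypothetical learned rule: always flip to the opposite extreme."""
    return -0.5 if last_action > 0 else 0.5

last = 0.5
trace = []
for _ in range(10):
    last = pwm_policy(last)
    trace.append(last)

# With a 50% duty cycle the inertia-averaged steering is ~0 (straight ahead),
# even though the instantaneous commands are always extreme.
mean_steer = float(np.mean(trace))
```

By skewing the duty cycle (holding one extreme for more steps than the other), such a policy can realize any average steering angle, which is exactly the PWM-like behavior observed in the failure videos.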

Perceiving single-line road markings when trained on levels with double-lines only

Before we built our custom CARLA levels that mimic the real-world environment, we trained policies only on two CARLA built-in levels.

Those two built-in levels contain only double-line road markings. As seen in the saliency maps, a policy that has only ever seen double-line road markings is not sensitive to single-line road markings.