Interpretable End-to-end Urban Autonomous Driving with Latent Deep Reinforcement Learning

Jianyu Chen, Shengbo Eben Li and Masayoshi Tomizuka

Abstract

Unlike the popular modularized framework, end-to-end autonomous driving seeks to solve the perception, decision and control problems in an integrated way, which can be more adaptable to new scenarios and easier to generalize at scale. However, existing end-to-end approaches often lack interpretability and can only handle simple driving tasks such as lane keeping. In this paper, we propose an interpretable deep reinforcement learning method for end-to-end autonomous driving that is able to handle complex urban scenarios. A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic birdeye mask can be generated, which is enforced to connect with a certain intermediate property of today's modularized framework in order to explain the behaviors of the learned policy. The latent space also significantly reduces the sample complexity of reinforcement learning. Comparison tests with a simulated autonomous car in CARLA show that, in urban scenarios with crowded surrounding vehicles, our method outperforms many baselines including DQN, DDPG, TD3 and SAC. Moreover, through the masked outputs, the learned policy provides a better explanation of how the car reasons about the driving environment.

GitHub - Paper

Methodology


System Framework

The agent takes multi-modal sensor inputs from the driving environment and outputs control commands to drive the car in urban scenarios. At the same time, the agent generates a semantic mask that interprets how it understands the current driving situation.
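The interface below is a minimal sketch of this framework: sensors in, control command and interpretable mask out. The class and attribute names (LatentDrivingAgent, encoder, policy, mask_decoder) are illustrative assumptions, not the authors' released API.

```python
# Sketch of the agent's input/output interface described above.
import numpy as np

class LatentDrivingAgent:
    def __init__(self, encoder, policy, mask_decoder):
        self.encoder = encoder            # multi-modal observation -> latent state
        self.policy = policy              # latent state -> control command
        self.mask_decoder = mask_decoder  # latent state -> semantic birdeye mask

    def step(self, camera: np.ndarray, lidar: np.ndarray):
        """One control step: fuse sensors, act, and expose the interpretable mask."""
        z = self.encoder(camera, lidar)   # latent belief over the current driving scene
        throttle, steer = self.policy(z)  # low-level control command
        mask = self.mask_decoder(z)       # semantic birdeye mask used for interpretation
        return (throttle, steer), mask
```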

PGM of the driving agent

A sequential latent environment model is introduced and learned jointly with the reinforcement learning process by formulating them as a probabilistic graphical model (PGM). With this latent model, a semantic birdeye mask can be generated.
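The PyTorch sketch below illustrates the kind of sequential latent model this describes: a transition prior p(z_t | z_{t-1}, a_{t-1}), a filtering posterior q(z_t | z_{t-1}, a_{t-1}, x_t), and decoders for the observations and the semantic mask. Layer sizes, latent dimensions, and the single-layer latent structure are simplifying assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SequentialLatentModel(nn.Module):
    def __init__(self, obs_dim=256, act_dim=2, z_dim=64):
        super().__init__()
        # prior over the next latent given the previous latent and action
        self.prior = nn.Sequential(nn.Linear(z_dim + act_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 2 * z_dim))
        # posterior additionally conditions on the encoded current observation
        self.posterior = nn.Sequential(nn.Linear(z_dim + act_dim + obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 2 * z_dim))
        # decoders reconstruct sensor features and the semantic birdeye mask
        self.obs_decoder = nn.Linear(z_dim, obs_dim)
        self.mask_decoder = nn.Linear(z_dim, obs_dim)

    def _sample(self, stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)  # reparameterization trick

    def filter_step(self, z_prev, a_prev, obs_feat):
        """One filtering step: sample z_t from the posterior given the new observation."""
        prior_stats = self.prior(torch.cat([z_prev, a_prev], dim=-1))
        post_stats = self.posterior(torch.cat([z_prev, a_prev, obs_feat], dim=-1))
        z_t = self._sample(post_stats)
        return z_t, prior_stats, post_stats  # prior/posterior stats feed the KL term of the ELBO
```

In a model of this form, training maximizes an evidence lower bound combining the reconstruction losses with the KL divergence between posterior and prior, jointly with the reinforcement learning objective.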

Trained Agent

The first row shows the ground truth camera, lidar, and semantic birdeye images while the trained agent is running in the simulated town with surrounding vehicles (green boxes). Note that the agent only takes the camera and lidar images as inputs; the ground truth birdeye image is displayed here only for comparison with the reconstruction.

The second row shows the camera, lidar, and semantic birdeye images reconstructed from the latent state, which is inferred online with the learned sequential latent model. Note that although the ground truth semantic birdeye image is never provided as input, the agent is able to reconstruct it purely from the history of camera and lidar sensor inputs.
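A hypothetical online loop for producing that second row is sketched below: the latent state is filtered forward at every step from camera and lidar only, then decoded to a birdeye mask, without ever observing the ground truth mask. The model follows the sketch above; `encode_sensors`, `policy`, and the environment methods are assumed placeholders.

```python
def run_episode(env, model, encode_sensors, policy, z0, a0, horizon=1000):
    """Drive while decoding the interpretable mask from the filtered latent state."""
    z, a = z0, a0
    for _ in range(horizon):
        camera, lidar = env.get_observations()       # raw multi-modal sensor inputs
        obs_feat = encode_sensors(camera, lidar)     # fuse camera and lidar features
        z, _, _ = model.filter_step(z, a, obs_feat)  # update the latent belief online
        mask = model.mask_decoder(z)                 # reconstructed birdeye mask (second row)
        a = policy(z)                                # control command from the latent state
        env.apply_control(a)
```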

Interpretability

Mapping & Localization

The first row contains the raw sensor inputs and the ground truth mask, while the second row contains the corresponding reconstructed images. Note that only the raw sensor inputs are observed; the ground truth bird-view image is displayed only for comparison. The reconstructed bird-view mask shows that the model can accurately locate the ego car and decode the map information (e.g., drivable areas and road markings), even though nothing in the raw sensor inputs directly indicates that the ego car is in an intersection.

Detection

The first row contains the raw sensor inputs and the ground truth mask, while the second row contains the corresponding reconstructed images. The reconstructed bird-view mask shows that our model can accurately detect the surrounding vehicles (green boxes) given only the raw camera and lidar observations.
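For visualization, a multi-channel semantic mask like the ones in these figures can be composed into a color image. The channel layout below (drivable area, road markings, surrounding vehicles, ego car) and the color choices are assumptions for illustration only.

```python
import numpy as np

def render_mask(mask: np.ndarray) -> np.ndarray:
    """mask: (4, H, W) binary channels -> (H, W, 3) uint8 RGB visualization."""
    drivable, markings, vehicles, ego = mask
    img = np.zeros((*drivable.shape, 3), dtype=np.uint8)
    img[drivable > 0.5] = (80, 80, 80)      # drivable area in gray
    img[markings > 0.5] = (255, 255, 255)   # road markings in white
    img[vehicles > 0.5] = (0, 255, 0)       # surrounding vehicles as green boxes
    img[ego > 0.5] = (255, 0, 0)            # ego car
    return img
```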

Prediction

The first row contains the ground truth masks, and the second row contains the reconstructed masks. Left to right indicates successive time steps. Note that only the observations (camera and lidar) of the first three time steps are provided. For later time steps, the latent states are propagated using the learned latent dynamics and then decoded to bird-view masks. In this case, the model accurately predicts the future positions of the surrounding vehicles (green boxes). A sketch of this rollout procedure is shown below.
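The sketch assumes the sequential latent model from the earlier code block; after the observed steps have been filtered, the latent state is rolled forward with the transition prior alone and decoded to future masks. The choice of actions passed in is a placeholder assumption.

```python
import torch

def predict_future_masks(model, z, actions):
    """Roll the latent dynamics forward without observations and decode future masks."""
    masks = []
    for a in actions:                                    # planned or recorded actions
        prior_stats = model.prior(torch.cat([z, a], dim=-1))
        z = model._sample(prior_stats)                   # propagate latent with the prior only
        masks.append(model.mask_decoder(z))              # decoded future bird-view mask
    return masks
```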

Evaluation Results

Average returns for the variants of our proposed method


Comparison with baseline RL algorithms taking camera and lidar sensor inputs


Comparison with baseline RL algorithms taking bird-view mask input