End-to-end Autonomous Driving Perception with Sequential Latent Representation Learning

Jianyu Chen, Zhuo Xu and Masayoshi Tomizuka

Abstract

Current autonomous driving systems are composed of a perception system and a decision system, each of which is further divided into multiple subsystems built on a large number of human heuristics. An end-to-end approach might clean up the system and avoid the huge effort of human engineering, while achieving better performance as data and computational resources grow. Compared to the decision system, the perception system is better suited to an end-to-end framework, since it does not require online driving exploration. In this paper, we propose a novel end-to-end approach for autonomous driving perception. A latent space is introduced to capture all features relevant to perception, and it is learned through sequential latent representation learning. The learned end-to-end perception model solves the detection, tracking, localization, and mapping problems altogether, with minimal human engineering effort and without storing any maps online. The proposed method is evaluated in a realistic urban driving simulator, with both camera images and lidar point clouds as sensor inputs.

GitHub - Paper

Methodology

Architecture of our proposed end-to-end perception model at a single frame. We assume a latent space that summarizes all useful historical information. From this latent space, we can then extract the information we need, such as surrounding vehicles' poses, the road geometry, and the ego vehicle's pose.
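To make this concrete, the sketch below shows one way such a single-frame architecture could be wired up in PyTorch: a recurrent latent state carries historical information forward, and lightweight task-specific decoders read the perception outputs off it. This is a simplified, deterministic stand-in for the paper's stochastic sequential latent model, and every module and dimension choice (`latent_dim`, the decoder output shapes, etc.) is our own illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SequentialLatentPerception(nn.Module):
    """Hypothetical sketch: a latent state z_t summarizes the sensor history,
    and task-specific decoders extract perception outputs from z_t."""

    def __init__(self, latent_dim=256, obs_dim=512):
        super().__init__()
        self.camera_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(obs_dim), nn.ReLU())
        self.lidar_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(obs_dim), nn.ReLU())
        # Recurrent transition: carries historical information across frames.
        self.transition = nn.GRUCell(2 * obs_dim, latent_dim)
        # Decoders read the quantities we care about off the latent state.
        self.roadmap_decoder = nn.Linear(latent_dim, 64 * 64)  # semantic roadmap
        self.boxes_decoder = nn.Linear(latent_dim, 10 * 5)     # up to 10 boxes: (x, y, w, h, yaw)
        self.ego_decoder = nn.Linear(latent_dim, 3)            # ego pose: (x, y, yaw)

    def step(self, z_prev, camera, lidar):
        # Fuse the current camera and lidar observations, then update the latent state.
        obs = torch.cat([self.camera_encoder(camera), self.lidar_encoder(lidar)], dim=-1)
        z = self.transition(obs, z_prev)
        return z, {
            "roadmap": self.roadmap_decoder(z),
            "boxes": self.boxes_decoder(z).view(-1, 10, 5),
            "ego_pose": self.ego_decoder(z),
        }
```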

Trained Agent

The first row shows the ground truth of the camera, lidar, semantic roadmap, and surrounding-vehicle bounding boxes while the trained agent runs in the simulated town among surrounding vehicles. Note that the agent takes only the camera and lidar images as inputs; the ground-truth semantic roadmap and vehicle bounding boxes are displayed here for comparison with the reconstructions.

The second row shows the camera, lidar, semantic roadmap, and surrounding-vehicle bounding boxes reconstructed from the latent state, which is inferred online with the learned sequential latent model. Note that although the ground-truth semantic roadmap is never provided as an input, the agent is able to reconstruct it purely from historical camera and lidar sensor inputs.
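The online inference described above then amounts to a simple filtering loop: the latent state is carried across frames and updated from raw sensors alone, with no map stored or loaded. A hypothetical sketch, reusing the model above (`sensor_stream` is an assumed placeholder for the simulator's per-frame camera/lidar feed):

```python
import torch

model = SequentialLatentPerception()
z = torch.zeros(1, 256)  # initial latent state, latent_dim = 256

for camera, lidar in sensor_stream():  # assumed per-frame camera/lidar feed
    z, outputs = model.step(z, camera, lidar)
    # The roadmap is reconstructed from z alone; it is never a model input.
    roadmap, boxes = outputs["roadmap"], outputs["boxes"]
```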

Evaluation Results


For surrounding-vehicle bounding box prediction, we plot the Precision-Recall Curve (PRC) and then compute the Average Precision (AP) as the Area Under the Precision-Recall Curve (AUC). The PRCs and APs are computed at Intersection-over-Union (IoU) thresholds of 0.1, 0.3, 0.5, and 0.7.
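For reference, the sketch below shows one standard way to compute AP as the area under the PR curve, assuming detections have already been matched to ground-truth boxes at a given IoU threshold. The function and argument names are illustrative, not the authors' evaluation code.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Compute AP as the area under the precision-recall curve.

    Hypothetical sketch: `is_true_positive[i]` marks whether detection i
    matched a ground-truth box with IoU above the chosen threshold
    (e.g., 0.1, 0.3, 0.5, or 0.7); `num_gt` is the number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores))                   # rank detections by confidence
    tp = np.cumsum(np.asarray(is_true_positive)[order].astype(float))
    precision = tp / np.arange(1, len(tp) + 1)                # precision at each rank
    recall = tp / num_gt                                      # recall at each rank
    # AP = area under the PR curve (trapezoidal integration over recall).
    return np.trapz(precision, recall)
```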