3D Object Detection

A Data Quality-aware Unified 3D Object Detection Method with LiDAR and Camera Sensors

Abstract: This article proposes a data quality-aware unified 3D object detector for a multi-sensor perception configuration (LiDAR and camera).

1. Introduction

In an autonomous driving system, determining the types and locations of obstacles in the surrounding environment is important, since it affects the subsequent planning and control operations. Deep learning applied to computer vision has brought performance breakthroughs for object detection and recognition, such as the single-stage methods SSD [1] and YOLO [2], and the two-stage methods with region proposals R-CNN [3], Fast R-CNN [4] and Faster R-CNN [5]. Their results include the objects' image locations and sizes, represented as 2D bounding boxes. However, in real autonomous driving applications, 3D object detection outputs the 3D bounding box, i.e. the 3D location and size on the road, visualized as its projection on the image plane, as in Mono3D++ [6], OFT [7] and MonoPSR [8]. Another option is to estimate the depth map directly from monocular or binocular vision, which implicitly provides information on 3D object size and orientation; the depth map is then converted to a point cloud, to which LiDAR-based object detection methods can be applied, such as Pseudo-LiDAR [9].

The LiDAR sensor is strong for 3D object detection, where the 3D bounding box can be extracted from the 3D point cloud data. Deep learning methods that handle point cloud data for object detection can be categorized into two routes: one is point cloud voxelization, i.e. clustering and sampling for feature extraction, such as VoxelNet [10], PointNet [11] and Voxel-FPN [12]; the other projects the point cloud onto an image plane along a chosen viewing direction, such as the bird's eye view or the frontal camera view, and then extracts features on the projected plane for 3D object detection, such as BirdNet [13], PIXOR [14] and YOLO3D [15].

Either the LiDAR or the camera can have shortcomings: for cameras, disturbance by lighting conditions, noise, low resolution, and overexposure or underexposure; for LiDARs, sparsity (limited by the number of scanning lines), limited distance range (power deficiency) and "holes" caused by certain materials. Therefore, sensor fusion is a widely accepted approach for object detection, such as AVOD [16], PointFusion [17] and RoarNet [18]. However, we find that a unified deep learning system suited to different sensor configurations is still lacking, for instance stereo cameras, a monocular camera, LiDAR only, or camera plus LiDAR. Especially challenging are the cases where some sensors do not work for some reason, such as bad weather, interference from external factors, or poor quality of the captured data.

Work closely related to ours can be found in [28, 29]. A monocular camera-based 3D object detection system is proposed in [28], which allows binocular stereo or LiDAR data to be incorporated for improvement: the stereo information is fused by stereo photometric alignment and the LiDAR information by point cloud alignment. We have to point out that the dominant sensor is still the monocular camera; if the data captured by this camera is problematic, the fusion of the whole system fails.

Meyer et al. [29] built on their previous LiDAR-only method LaserNet and proposed to fuse information from a monocular camera. It utilizes the data association between camera and LiDAR, projects the LiDAR points onto the frontal camera image plane and concatenates the feature map from the RGB image. The authors point out that there is no need to annotate the camera data for the training process. A major shortcoming of this method is that obvious information obtainable from the bird's eye view, such as height, is neglected. Besides, the simple combination of feature maps from LiDAR and camera loses the chance to exploit explicit geometric constraints, which may increase the training data requirements.

In the following section, we introduce a data quality-aware unified 3D object detection model that fuses camera and LiDAR data with deep learning. It does not rely on any single sensor; instead it makes the different sensors collaborate and compensate for each other's deficiencies.

2. 3D Object Detection

Figure 1 is the system diagram of the proposed sensor fusion method for object detection. First, there are sensor data quality check modules for the LiDAR and the camera respectively to control the data input, i.e. switches A, B and C. The "Image Quality Evaluation" module checks the input image quality with traditional image and video industry criteria, such as PSNR (peak signal-to-noise ratio) and SSIM (structural similarity).
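A minimal sketch of this gating step is given below, assuming a reference frame (e.g. the previous accepted frame) is available for the full-reference metrics; the thresholds are illustrative, not values from this work.

```python
import numpy as np
from skimage.metrics import structural_similarity

def image_quality_ok(gray, gray_ref, psnr_min=25.0, ssim_min=0.7):
    """Gate the camera input switch on simple full-reference quality metrics.
    `gray` and `gray_ref` are uint8 grayscale frames of the same size."""
    mse = np.mean((gray.astype(np.float64) - gray_ref.astype(np.float64)) ** 2)
    psnr = 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf
    ssim = structural_similarity(gray, gray_ref, data_range=255)
    return psnr >= psnr_min and ssim >= ssim_min
```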

Figure 1. Multiple Sensors System for 3D Object Detection

LiDAR point cloud quality is checked in the "Point Cloud Quality Evaluation" module. We consider the LiDAR data quality to be related to its alignment with the camera image as well. The point cloud is first projected onto the camera image plane using the calibration parameters; the gradient of the projected depth map is then computed and correlated with the image edge information.
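One standard form of such an alignment score, written with the symbols defined below (the inner sum runs over a window of size w around the image projection of point p; the unit weighting is our assumption), is

C = \sum_{f}\sum_{p}\sum_{(i,j)\in w} X_f^{p}\, D_f^{i,j}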

where w is the window size, f the image (frame) index, (i, j) the pixel location in the image, p a 3D point of the point cloud, X the LiDAR data and D the image gradient map. If the image quality is also bad, we can only rely on the LiDAR data itself. In that case we instead use the Rényi Quadratic Entropy (RQE) of the point cloud as the quality criterion.
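A standard closed form for a point cloud X = {p_1, ..., p_N}, assuming an isotropic Gaussian kernel of width σ is placed on every point (N and σ are our notation), is

\mathrm{RQE}(X) = -\log\!\left( \frac{1}{N^2} \sum_{m=1}^{N} \sum_{n=1}^{N} G\big(p_m - p_n,\ 2\sigma^2 I\big) \right)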

with G(a, b) denoting a Gaussian density with mean a and variance b, evaluated at the origin. In effect, the RQE measures the crispness of the point cloud distribution under a Gaussian Mixture Model (GMM), which turns out to be a useful quality criterion.
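A minimal NumPy sketch of this crispness measure follows; the kernel width is an assumed parameter, and for full-size scans the N×N pairwise term would need subsampling or a spatial index.

```python
import numpy as np

def rqe_crispness(points, sigma=0.1):
    """Rényi quadratic entropy of a point set under an isotropic GMM.
    `points` is an (N, 3) array; lower values indicate a crisper point cloud."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)             # squared pairwise distances
    var = 2.0 * sigma ** 2                           # variance of the pairwise Gaussian
    gauss = np.exp(-sq_dist / (2.0 * var)) / ((2.0 * np.pi * var) ** 1.5)
    return -np.log(np.mean(gauss))                   # -log of the mean pairwise density
```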

If only the LiDAR is installed, switch E connects the system to the LiDAR point cloud for the 3D RPN and 3D object detection, a kind of extension of the 2D RPN and 2D object detection in Faster R-CNN [5]. Otherwise, when no LiDAR is installed, switch E connects instead to the pseudo-LiDAR point cloud, generated from a depth map estimated from either the mono image or stereo images, based on the camera calibration parameters. To guarantee the reliability of the camera data processing, depth estimation, 2D detection and 2D segmentation are controlled simultaneously by switch D; bad image quality turns off all three modules.
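A minimal sketch of this switch logic is shown below; the function and argument names are illustrative and not taken from the system's implementation.

```python
def route_point_cloud(lidar_pc, pseudo_pc, lidar_ok, image_ok):
    """Switch E: prefer the real LiDAR point cloud when its quality check passes,
    fall back to camera-derived pseudo-LiDAR when only the images are reliable.
    Switch D (not shown) simply skips depth estimation, 2D detection and
    2D segmentation whenever image_ok is False."""
    if lidar_ok:
        return lidar_pc
    if image_ok:
        return pseudo_pc
    return None  # neither sensor delivers usable data for this frame
```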

Depth estimation can work from a single image or from stereo images. If both cameras deliver good quality images, we apply a stereo disparity estimation model, such as PSMNet [19] or GWCNet [20]. When only a single camera is available, monocular depth estimation is implemented with deep learning models such as GeoNet [23] or SARPN [24]. Meanwhile, 2D object detection and segmentation can be done with instance segmentation models, like Mask R-CNN [21] and SGN [22].
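A minimal sketch of the depth-to-pseudo-LiDAR back-projection is given below, assuming pinhole intrinsics (fx, fy, cx, cy) known from calibration.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W), in meters, to an (N, 3) point cloud
    in the camera frame using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep only pixels with valid depth
```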

If there is no input from the camera sensor, the 3D RPN, jointly with the 3D object detection, uses LiDAR data only, running point cloud-based detection models such as PointRCNN [25] and Fast Point R-CNN [26]. Otherwise, the 2D detection and segmentation results enter the 3D RPN as well, as shown in Figure 2.

Figure 2. Architecture for 3D RPN and 3D Object Detector

Similar to MV3D [27], the point cloud can be projected onto two different viewing planes, the bird's eye view (BEV) and the frontal view. Meanwhile, the 2D instance segmentation result generates a feature map through an encoder network; this feature map is then concatenated with the feature map from the point cloud's frontal view projection, forming one input to the 3D RPN. As in Mask R-CNN [21], ROI Align replaces ROI Pooling.
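A minimal PyTorch sketch of this front-view fusion step is shown below; the tensor shapes and ROI coordinates are illustrative assumptions, and torchvision's roi_align is used to illustrate the ROI Align operation.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical feature maps with matching spatial size:
img_feat = torch.randn(1, 64, 48, 156)   # encoded 2D instance segmentation features
fv_feat  = torch.randn(1, 64, 48, 156)   # front-view projection features of the point cloud

fused = torch.cat([img_feat, fv_feat], dim=1)   # channel-wise concatenation

# Crop a fixed-size feature per proposal with ROI Align (rows are batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0., 10., 12., 60., 40.]])
crops = roi_align(fused, rois, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
print(crops.shape)   # torch.Size([1, 128, 7, 7])
```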

The other input to the 3D RPN is the feature map from the point cloud's bird's eye view (BEV) projection. Each input generates region proposals independently; then, similar to AVOD [16], we sort the proposals and keep the top K.
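A minimal sketch of merging the two proposal streams and keeping the K highest-scoring ones follows; the default value of K is an assumption.

```python
import torch

def top_k_proposals(boxes_bev, scores_bev, boxes_fv, scores_fv, k=1024):
    """Merge proposals from the BEV branch and the front-view branch,
    then keep the k highest-scoring ones, AVOD-style."""
    boxes = torch.cat([boxes_bev, boxes_fv], dim=0)
    scores = torch.cat([scores_bev, scores_fv], dim=0)
    k = min(k, scores.numel())
    top = scores.topk(k).indices
    return boxes[top], scores[top]
```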

Here, the encoder network for feature map generation can be a ResNet or a DenseNet. The 3D RPN and the 3D object detection head consist mostly of fully connected layers, plus NMS (non-maximum suppression) at the final stage.
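For the final-stage NMS, a standard call looks like the following; torchvision's axis-aligned 2D NMS is used here as a stand-in for the oriented-box NMS a full 3D detector would apply on BEV boxes.

```python
import torch
from torchvision.ops import nms

boxes  = torch.tensor([[0., 0., 4., 2.], [0.2, 0.1, 4.1, 2.2], [10., 10., 14., 12.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)   # indices of boxes kept after suppression
print(keep)   # tensor([0, 2]) -- the overlapping lower-score box is removed
```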

The 3D network training loss function is similar to that of AVOD [16]. The 2D object detection outputs the object type, location and 2D bounding box size. The 3D RPN and 3D object detection output the object type, location, orientation and 3D bounding box size.
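A hedged sketch of such a multi-task loss is shown below; the cross-entropy / smooth-L1 terms and the weights mirror AVOD-style detectors but are assumptions here, not the exact loss of this system.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets,
                   ang_preds, ang_targets, w_box=5.0, w_ang=1.0):
    """Classification + 3D box regression + orientation regression.
    Box and angle terms are computed on positive (non-background) samples only."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    pos = cls_targets > 0
    if pos.any():
        loss_box = F.smooth_l1_loss(box_preds[pos], box_targets[pos])
        loss_ang = F.smooth_l1_loss(ang_preds[pos], ang_targets[pos])
    else:
        loss_box = box_preds.sum() * 0.0
        loss_ang = ang_preds.sum() * 0.0
    return loss_cls + w_box * loss_box + w_ang * loss_ang
```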

3. Summary

In this article, we propose a data quality-aware unified multi-sensor (LiDAR and camera) fusion system for object detection with deep learning. We evaluate the data quality from the different sensors and design switches to improve the whole system's capability of handling corner cases in sensor fusion, without requiring additional training data.

References

1. W. Liu et al., “SSD: Single shot multibox detector,” ECCV. 2016

2. J. Redmon, A. Farhadi, “YOLOv3: An incremental improvement,” arXiv:1804.02767, 2018.

3. R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014

4. R. Girshick, “Fast R-CNN,” ICCV, 2015

5. S. Ren, K. He, R. Girshick, J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, 2015

6. T. He, S. Soatto. “Mono3d++: Monocular 3d vehicle detection with two-scale 3d hypotheses and task priors”, arXiv 1901.03446, 2019

7. T. Roddick, A. Kendall, R. Cipolla. “Orthographic feature transform for monocular 3d object detection”, arXiv 1811.08188, 2018

8. J. Ku et al., “Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction”, IEEE CVPR 2019

9. Y Wang et al.,“Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving”, IEEE CVPR 2019

10. Y Zhou, O Tuzel, “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection”, arXiv 1711.06396, 2017

11. C. R. Qi, H. Su, K. Mo, L. J. Guibas. “Pointnet: Deep learning on point sets for 3d classification and segmentation”. CVPR, 2017.

12. B Wang, J An, J Cao,“Voxel-FPN: multi-scale voxel feature aggregation in 3D object detection from point clouds”, arXiv 1907.05286, 2019

13. J Beltran et al.,“BirdNet: a 3D Object Detection Framework from LiDAR information”, arXiv 1805.01195, 2018

14. B Yang et al.,“PIXOR: Real-time 3D Object Detection from Point Clouds”, arXiv 1902.06326, 2019

15. W Ali et al.,“YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud”, arXiv 1808.02350, 2018

16. J Ku et al., “Joint 3D Proposal Generation and Object Detection from View Aggregation”, arXiv 1712.02294, 2017

17. D Xu et al.,“PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation”, arXiv 1711.10871, 2017

18. K Shin et al.,“RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement”, IEEE IV, 2019

19. J.-R. Chang, Y.-S. Chen. “Pyramid stereo matching network”, IEEE CVPR 2018

20. X Guo et al., “Group-wise Correlation Stereo Network”, IEEE CVPR 2019.

21. K. He, G. Gkioxari, P. Dollár, R. Girshick, “Mask R-CNN,” ICCV, 2017

22. S Liu et al., “SGN: Sequential Grouping Networks for Instance Segmentation”, ICCV 2017

23. Z Yin, J Shi, “GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose”, CVPR 2018

24. X. T. Chen, X. J. Chen, Z. Zha, “Structure-Aware Residual Pyramid Network for Monocular Depth Estimation”, IJCAI 2019

25. S Shi et al.,“Point RCNN for 3D Object Detection from Raw Point Cloud”, CVPR 2019

26. Y Chen et al., “Fast Point RCNN”, ICCV 2019

27. X Z Chen et al.,“Multi-View 3D Object Detection Network for Autonomous Driving”, CVPR 2017

28. P. Li et al., “Multi-sensor 3D object box refinement for autonomous driving”, arXiv 1909.04942, 2019

29. G. P. Meyer et al., “Sensor fusion for joint 3D object detection and semantic segmentation”, arXiv 1904.11466, 2019