16-824 Visual Learning and Recognition Final Project
Shasa Antao (santao), Kevin O'Brien (kobrien), Wesley Wang (wesleyw)
Introduction
The field of 3D object detection has seen large advances in recent years. Much of what we have discussed in class so far about this topic involves using only images (and in some cases, 3D shape priors) to detect 3D shapes in images. However, for 3D applications, we inherently limit ourselves by using only 2D data to make predictions in higher dimensions. Using other sensor modalities that provide 3D information makes 3D learning a much better-defined problem. For our project, we were very interested in utilizing visual data beyond what can be found in a standard image, namely point clouds. Point clouds give accurate three-dimensional representations of a scene, which allows for the creation of datasets with precisely labeled 3D bounding boxes, especially when compared with trying to label 3D boxes using only 2D images.
Figure 1. Sample output from a 3D object detection pipeline. The upper image shows the 3D boxes on the point cloud, and the lower image shows these 3D boxes projected into the image plane
The most popular application of 3D object detection using point clouds is detecting and labeling cars on the KITTI dataset. This dataset was collected using point cloud data from a Velodyne HDL-64E LiDAR scanner, which costs well over $70,000. As anyone who follows self-driving car news will know, the price point of LiDARs is one of the main factors keeping self-driving technology prohibitively expensive for many customers and applications. For this reason, Tesla CEO Elon Musk has even said on many occasions that Tesla’s autonomous systems will never use LiDAR sensors. Given these very real financial and engineering constraints, we wanted to see how well existing 3D object detection methods could perform on point clouds from cheaper LiDAR sensors. To do this, we tried several methods of pre-processing KITTI LiDAR data in order to create more sparse point clouds that we imagine might be generated by cheaper, lower resolution sensors, and evaluated how these new data affected the detection results.
Related Work
3D object detection from point clouds can broadly be classified into two types of methods: point-based and voxel-based. To date, point-based methods tend to show better detection results, as they utilize all of the information in a given point cloud, as well as all spatial relations between neighboring points. These methods rely heavily on iterative sampling and nearest-neighbor grouping algorithms that cluster neighboring points together to extract point-cloud-level features. Despite their performance advantage over voxel-based methods, these point-clustering algorithms are much slower and less efficient than voxel-based operations.
Voxel-based methods, on the other hand, first preprocess the point clouds into fixed-size grids, which allows for the use of more standardized spatial algorithms (i.e., convolutions). Additionally, the sparsity produced by voxelization means that the memory used to store and process these point clouds can be more contiguous and therefore accessed more efficiently. However, one major downside to many existing voxel-based methods is that the actual object detection step is usually performed on a bird's eye view representation of the 3D features, which effectively discards much of the 3D information that voxelization is meant to preserve.
Experimenting with Low-resolution Point Clouds
Voxel-RCNN
Given these tradeoffs, we decided to use the method proposed in Voxel-RCNN. Voxel-RCNN is among the state of the art in 3D object detection, currently ranked #20 on the KITTI dataset leaderboard. In Voxel-RCNN, the authors still use the bird's eye view 2D representation to generate region proposals as in prior methods, but once these proposals are generated, they are applied to the output of the 3D feature extractor to generate RoI features for box detection. This allows 3D structure to be preserved and utilized in the final steps of box detection, which compensates for the relative lack of precise structure in the input due to voxelization.
The main novel contribution of Voxel-RCNN is the Voxel RoI pooling layer. Given a region proposal generated from the 2D backbone network (whose output is passed through two parallel 1x1 convolutional layers to produce a 3D proposal), the proposal region is divided into G x G x G sub-voxels. Due to the sparse nature of point clouds, most of these sub-voxels will be empty, so any kind of naive pooling over this region would throw away the already limited spatial information. Instead, pooling is conducted by first querying each sub-voxel's neighbors (26 in total, unless the sub-voxel lies at an edge), then applying an MLP to the combined features of the neighborhood. These voxel features were already extracted from the 3D backbone at several scales, so the final result is a single RoI feature set aggregated across multiple voxel feature scales.
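Below is a minimal, single-scale sketch of this pooling idea, written against a dense (C, D, H, W) voxel feature volume for readability; the actual Voxel-RCNN implementation operates on sparse voxel features with accelerated query kernels, and the module name, layer sizes, and dense layout here are our own simplifications.

```python
import torch
import torch.nn as nn

class VoxelRoIPoolSketch(nn.Module):
    """Hedged sketch of voxel RoI pooling: for each RoI sub-voxel, gather the
    features of its 26 neighbors (plus itself) and run an MLP over the
    concatenated neighborhood."""
    def __init__(self, in_channels, mid_channels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(27 * in_channels, mid_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_feats, grid_indices):
        # voxel_feats: (C, D, H, W) dense voxel features from one scale of the 3D backbone
        # grid_indices: (G*G*G, 3) integer voxel coordinates of the RoI's sub-voxel centers
        C, D, H, W = voxel_feats.shape
        offsets = torch.stack(torch.meshgrid(
            torch.arange(-1, 2), torch.arange(-1, 2), torch.arange(-1, 2),
            indexing="ij"), dim=-1).reshape(-1, 3)            # 27 offsets incl. the center
        pooled = []
        for idx in grid_indices:
            nbrs = (idx + offsets).clamp(min=0)               # neighbor coordinates
            nbrs[:, 0].clamp_(max=D - 1)                      # stay inside the grid at edges
            nbrs[:, 1].clamp_(max=H - 1)
            nbrs[:, 2].clamp_(max=W - 1)
            feats = voxel_feats[:, nbrs[:, 0], nbrs[:, 1], nbrs[:, 2]]  # (C, 27)
            pooled.append(self.mlp(feats.t().reshape(-1)))    # MLP over the neighborhood
        # One feature vector per sub-voxel; repeating this per scale and concatenating
        # gives the multi-scale RoI feature described above.
        return torch.stack(pooled)                            # (G*G*G, mid_channels)
```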
Figure 2. Voxel-RCNN network architecture
Point Cloud Downsampling
We utilized the Open3D Python library to process and visualize the point clouds, and also used it to downsample the point clouds to specific percentages of their original size. In our experiments, we performed two different methods of point cloud downsampling. The first method was voxelizing the 3D space at a given voxel size; the second was keeping every n-th point in the point cloud, where n is an integer. Using these methods, we generated downsampled point clouds containing a smaller percentage of points relative to the original point cloud, including 10, 25, 50, and 75 percent of the original points, as well as several intermediate percentages. This was straightforward with the every-n method (e.g., keeping one point for every two points yields a downsampled point cloud that is 50 percent of the original size). To achieve our desired percentages with the voxelization method, we used trial and error to estimate the voxel size that brought us close to the desired downsample ratio. Shown below are tables that map the downsample percentage to the voxel size for the voxel method and to the n value for the every-n method.
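The sketch below shows how this downsampling might look with Open3D; the file-reading helper and the dropping of the KITTI intensity channel are our own simplifications of the actual preprocessing script.

```python
import numpy as np
import open3d as o3d

def downsample_kitti_scan(bin_path, voxel_size=None, every_n=None):
    """Downsample a KITTI .bin scan with either voxel or every-n sampling.
    Only x, y, z are kept here; the intensity channel is dropped for brevity."""
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)[:, :3]
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points.astype(np.float64))

    if voxel_size is not None:
        # Points falling into the same voxel are averaged into a single point.
        down = pcd.voxel_down_sample(voxel_size=voxel_size)
    elif every_n is not None:
        # Keeps every n-th point, i.e. roughly 1/n of the original points.
        down = pcd.uniform_down_sample(every_k_points=every_n)
    else:
        down = pcd

    kept_ratio = len(down.points) / len(pcd.points)
    return down, kept_ratio
```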
Figure 3a. Full resolution point cloud from the KITTI dataset
Figure 3b. 25% resolution voxelized point cloud
Figure 3c. 25% resolution every-n point cloud
Method
We set out to perform two key experiments to stress-test the Voxel-RCNN method. First, we trained the network exactly as specified in the original paper, using the full resolution KITTI point clouds as inputs. For our second experiment, we trained the network using our 50% downsampled voxelized point clouds as input. In both cases, we trained the model for 80 epochs on an NVIDIA GeForce GTX 1080 Ti. We initially expected the downsampled training loop to take approximately 50% of the time of the full-resolution training loop, but both training loops clocked in at around 18 hours. This is most likely because, as part of the data preprocessing in the Voxel-RCNN pipeline, the input point clouds are first voxelized to voxels of size 0.05 m x 0.05 m x 0.1 m. So, while there might be fewer points to start with, this preprocessing fits all the data to a fixed resolution, which explains the lack of difference in training time. Table 1a shows each downsample percentage and the voxel size used to achieve it; Table 1b shows each downsample percentage and the n value used to achieve it.
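This reasoning about training time follows from the point-to-voxel assignment itself; the helper below is a hypothetical illustration (not the Voxel-RCNN code), and the point cloud range is an assumed KITTI-style value.

```python
import numpy as np

VOXEL_SIZE = np.array([0.05, 0.05, 0.1])      # metres, matching the network's voxelization
RANGE_MIN = np.array([0.0, -40.0, -3.0])      # assumed lower bound of the point cloud range

def occupied_voxels(points_xyz):
    """Return the unique voxel indices occupied by an (N, 3) point array."""
    idx = np.floor((points_xyz - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)
    return np.unique(idx, axis=0)

# Points denser than the 5 cm grid collapse into the same voxels, so halving the raw
# point count changes the number of occupied voxels far less than 2x, and the network
# sees inputs of roughly the same size either way.
```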
After running these two training loops, we then evaluated each trained network on a variety of downsampled point clouds from the KITTI val set to see which produced the best detection performance and at what point the detection performance started to break down. We discuss our results in the next section.
Table 1a. Voxel sizes used to achieve point cloud downsampling percentages
Table 1b. Downsampling factors used to achieve every-n point cloud downsampling percentages
Results
We show the results of our various experiments in Tables 2a and 2b. The tables show the 3D average precision scores for the "Car" class, evaluated at 40 recall thresholds as per the KITTI benchmark standard. This average precision score is computed with a 3D IoU threshold of 0.7 for a positive sample.
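For reference, a hedged sketch of how this R40 average precision is computed from a precision-recall curve is shown below; the function name and array-based interface are our own, not KITTI's evaluation code.

```python
import numpy as np

def ap_r40(recalls, precisions):
    """Average precision over 40 evenly spaced recall positions (1/40, ..., 1.0),
    using interpolated precision. `recalls` and `precisions` are arrays describing
    the detector's PR curve at 3D IoU >= 0.7 for the 'Car' class."""
    recall_points = np.linspace(1.0 / 40.0, 1.0, 40)
    interpolated = [precisions[recalls >= r].max() if np.any(recalls >= r) else 0.0
                    for r in recall_points]
    return float(np.mean(interpolated))
```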
Experiment 1 - Training on Full Resolution Point Clouds
For the first experiment (Table 2a), we find that we perform better than the baseline Voxel-RCNN method in all three categories of KITTI car detection when evaluating on the 80%, 75%, 70%, and 60% downsampled voxelized point clouds, with the greatest performance increase (+0.14) coming from the 70% voxelized inputs on the "Easy" category. What is more interesting, however, is that decreasing the point cloud resolution does not result in a linear decrease in detection performance. While we expected that resolutions between 100% and 50% might still yield good results, since the dataset uses a very high-resolution LiDAR, we were surprised to see that even using 25% of the original points yielded results not much worse than the baseline method, especially in the voxelized case.
It makes sense that the voxelized downsampled point clouds would yield better performance than their every-n counterparts, since the voxelization process preserves some notion of the original point cloud structure. The every-n point clouds bear this out: their performance deteriorates more quickly as we reduce point cloud resolution. By removing points without any notion of structure, it becomes less likely that regions containing objects retain any points, leading to false negatives in the detection output.
Experiment 2 - Training on 50% Voxelized Downsampled Point Clouds
For the second experiment (Table 2b), we see the best results coming from evaluating on the 50% voxelized point clouds, which makes sense since this is the resolution used for training. We see improvements across the board when compared to the results from evaluating on the full-resolution point clouds, but that is to be expected since the network was not trained on such dense inputs. Similarly, every voxelized point cloud resolution between 30% and 80% shows superior performance on this network compared to the full-resolution inputs, showing that the network explicitly learned to perform better on sparse point clouds.
When comparing to the baseline from the first experiment (i.e., the vanilla Voxel-RCNN method, first row of Table 2a), we see an improvement of 0.1 AP points in the "Medium" category, and comparable performance in both the "Easy" and "Hard" categories. This is a significant result, as it supports our thesis that very dense point clouds are not needed for robust object detection.
Overall, we see the point cloud downsampling step we perform as a sort of regularization technique. By reducing the resolution of the point clouds during training, the network has to learn representations of objects from inherently fewer point cloud features, and therefore can be more (or at least comparably) robust to held-out validation data.
We are pleased to see that decreasing the point cloud resolution does not greatly decrease performance relative to the baseline, and we are able to match (and in some cases slightly improve upon) the results achieved in the paper. Since our downsampling preprocessing script only took about 10 minutes to run, we feel this is a small price to pay for such a noticeable improvement in performance. We therefore see it as a valuable form of data augmentation worth performing in the future, both for improved detection results in software and for improved scalability to more accessible LiDAR hardware.
Table 2a. 3D average precision results for 'Car' class using weights obtained from training on full-resolution point clouds. The first row in this figure ("Full resolution") indicates the baseline performance achieved by the authors in the Voxel-RCNN paper
Table 2b. 3D average precision results for 'Car' class using weights obtained from training on 50% resolution voxelized point clouds
It is also worth comparing our preprocessing voxel sizes with the voxel size that the network enforces. As mentioned earlier, the network voxelizes all inputs to 0.05 m x 0.05 m x 0.1 m voxels, giving a voxel volume of 0.00025 m³. The 50% voxelized inputs we trained on use a voxel size of 0.096 m x 0.096 m x 0.096 m, giving a voxel volume of 0.00088 m³.
In the upper rows of Table 2a, we voxelize our inputs to a voxel volume smaller than what the network uses. This small-to-large ordering essentially keeps the majority of the structure of the initial point cloud after both voxelization operations. As mentioned previously, for the 50% voxelization we first voxelize to a larger volume, and the network then voxelizes again at a smaller size. This large-to-small ordering causes some structure to be lost, and the loss grows as the first, larger voxelization increases in volume. But as shown in Tables 2a and 2b, performance does not start to deteriorate until we use a very large initial voxel volume (one yielding roughly a 20% downsampling rate).
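A quick arithmetic check of the voxel volumes quoted above:

```python
# Voxel volumes discussed above.
network_voxel = 0.05 * 0.05 * 0.1        # 2.5e-4 m^3, enforced inside the network
preprocess_voxel = 0.096 ** 3            # ~8.8e-4 m^3, our 50% downsampling voxel
print(preprocess_voxel / network_voxel)  # ~3.5x larger, hence the large-to-small ordering
```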
Figure 4a. A plot of the AP for the voxelized downsampled point clouds trained on full-resolution point clouds. The easy, medium, and hard categories are separated out.
Figure 4b. A plot of the AP for the voxelized downsampled point clouds trained on 50% resolution point clouds.
Shown in the plots of Figures 4a and 4b are visualizations of the data from Tables 2a and 2b for the voxelization method of downsampling. The plots are separated by which weights were used for evaluation (full-resolution or 50% resolution weights). The plots make it easier to see that there is essentially no degradation in network performance until around the 20-30% resolution level, for all three categories of Easy, Medium, and Hard.
Figure 5a. A plot of the AP for the every-n downsampled point clouds trained on full-resolution point clouds. The easy, medium, and hard categories are separated out.
Figure 5b. A plot of the AP for the every-n downsampled point clouds trained on 50% resolution point clouds.
Shown in the plots of Figures 5a and 5b are the data from Tables 2a and 2b for the every-n method of downsampling. Only certain percentages are achievable with the every-n method, so there are no data points between 50% and 100% resolution to compare with the voxelization method. However, the plots show that network performance begins to degrade at a higher remaining resolution, around 40-50%, than with the voxelized downsampling method.
Qualitative Results
The figures below show the 2D and 3D object detection results on each of the downsampled point cloud percentages for both sets of weights, one trained on the full-resolution dataset and one trained on the 50% resolution dataset. Predicted detections are shown as red boxes, and ground truth boxes are shown in green. In the 2D detection results, the upper image shows the 2D bounding boxes and the lower image shows the 3D bounding boxes projected into the image plane. For the same downsample percentage, the actual number of points may differ between the voxel and every-n downsampling methods. This is because the voxel downsample percentage is calculated as an average over all of the point clouds in the dataset, whereas the every-n downsample percentage holds for each individual point cloud.
Shown below in Figures 6 and 7 are the 2D and 3D object detection results for various voxel downsampling percentages on the same scene, using the weights trained on the 100% resolution dataset. Looking through the 3D detection results, it is easy to see that downsampling the point cloud up to a certain point still preserves the overall shapes of the cars. It is also easy to see that at around 10-20% resolution, some of the car shapes become indistinguishable from the rest of the environment, which explains the abrupt decrease in AP starting at that percentage.
Shown below in Figures 8 and 9 are the 2D and 3D object detection results for various every-n downsampling percentages for the same scene. These weights were trained on the 100% resolution dataset.
Shown below in Figures 10 and 11 are the 2D and 3D object detection results for various voxel downsampling percentages for the same scene. These weights were trained on the 50% resolution dataset.
Shown below in Figures 12 and 13 are the 2D and 3D object detection results for various every-n downsampling percentages for the same scene. These weights were trained on the 50% resolution dataset.
Conclusion - VoxelRCNN Downsampling
We find that decreasing input point cloud resolution by a significant amount (especially via voxelization) does not have a large effect on the accuracy of point-cloud-based object detection methods. This means there is a lot of data that is redundant or entirely unused in the detection process. A significant number of points belong to the street or the walls of buildings, and a sparser representation of the road and walls would still capture the essence of the scene. These experiments also show that there are redundant points representing the cars themselves. This opens the door both for network improvements via downsampling as a data augmentation technique and for achieving state-of-the-art performance using cheaper, low-resolution LiDAR sensors.
One-Shot Learning for 3D Object Detection
A major challenge in using object detection systems for novel applications is the limited amount of accurately labeled class data. Few-shot learning addresses the problem of identifying a new class in a dataset with only a few samples of that class. This is done by using a 'query' image containing ground truth detections of the new class and using the features from the query to find instances of the new class in the 'target' image. In N-class K-shot learning, N is the number of new classes to be detected and K is the number of query images.
In this project, we wanted to extrapolate concepts from a state-of-the-art one-shot learner for 2D RGB object detection and apply them to object detection on 3D point cloud data.
One-Shot Object Detection with Co-Attention and Co-Excitation [2]
In our initial literature survey, we found [2] by Hsieh et al. (2019), which we believed to be the state-of-the-art implementation of one-shot learning for object detection; it uses co-attention and co-excitation. The network architecture is based on Faster R-CNN, and in their implementation the authors experimented with ResNet-50 as the CNN backbone. The techniques involved include generating non-local object proposals, squeeze and co-excitation, and proposal ranking.
Figure 14 - Architecture of the one-shot implementation for 2D Object Detection shown in [2]
We set up the code repository provided on the authors' GitHub and ran the approach on examples from the COCO 2017 validation set, as seen in Figures 15 and 16.
Figure 15 - Example of one-shot detection for a bus
Figure 16 - Example of one-shot detection for a teddy-bear
Changing the Architecture in VoxelRCNN
After understanding the underlying concepts in [2] and reading through the provided code repository, we found that we could generate a 2D feature map using the non-local operation and perform block matching between the query and target feature maps generated from the 2D Bird's Eye View (BEV) data. The idea behind this method is that features in the target that are similar to features in the query will be weighted more heavily than other features in the target feature map. In addition, the original one-shot learner operated in 2D space, which made it more likely to work on a 2D feature map than directly on 3D data. Figure 17 shows our proposed changes to the VoxelRCNN architecture: we parallelize the 3D backbone to accommodate both target and query point clouds, and the two pipelines converge in the 2D backbone, producing a non-local feature map from which the predicted classes and predicted boxes are obtained.
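A hedged sketch of this query-target matching on BEV feature maps is shown below; the 1x1 projection layers, channel sizes, and residual connection are our assumptions for illustration, not the exact layers used in [2] or in our modified network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTargetNonLocal(nn.Module):
    """Re-weight target BEV features by their similarity to query BEV features
    (non-local block matching), as described above."""
    def __init__(self, channels, reduced_channels=None):
        super().__init__()
        reduced = reduced_channels or channels // 2
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)  # projects target features
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)    # projects query features
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)      # query "values"
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)

    def forward(self, target_bev, query_bev):
        # target_bev: (B, C, Ht, Wt), query_bev: (B, C, Hq, Wq)
        B, _, Ht, Wt = target_bev.shape
        t = self.theta(target_bev).flatten(2).transpose(1, 2)     # (B, Ht*Wt, C')
        q = self.phi(query_bev).flatten(2)                        # (B, C', Hq*Wq)
        v = self.g(query_bev).flatten(2).transpose(1, 2)          # (B, Hq*Wq, C')
        attn = F.softmax(t @ q, dim=-1)                           # match each target cell against the query
        matched = (attn @ v).transpose(1, 2).reshape(B, -1, Ht, Wt)  # aggregate query features per cell
        # Residual connection keeps the original target features alongside the matched ones.
        return target_bev + self.out(matched)
```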
Figure 17 - Proposed architecture for one-shot learning on 3D point cloud data
Modifying the VoxelRCNN repository
To ensure that the inference model had never seen the new class before, we pre-trained VoxelRCNN on the 'Car' class and chose 'Van' as the new class. Since vans look similar to cars, the model had a reasonable chance of making correct predictions; by contrast, we expected a model pre-trained on cars to struggle with a new class like 'Pedestrian', whose features differ significantly from those of cars. We generated a new .yaml file for predicting vans, which used the same anchor positions as those used for predicting cars. In addition, we split the input data passed into the model into query and target inputs.
Results
We tested our approach on different targets and queries selected from the KITTI dataset. Some of the target outputs are shown below in Figures 20 and 21. The query point cloud is shown in Figure 19, and to help visualize what the point cloud represents, the RGB image of the red van is shown in Figure 18.
Figure 18 - Image of query scene
Figure 19 - 3D point cloud of query scene
Observations
Initially, we found that generating non-local features was computationally expensive, so we had to down-sample the inputs before performing non-local block matching and then up-sample the output 2D feature maps using bilinear interpolation. Comparing the outputs before and after the up-sampling, we obtained qualitatively better predicted boxes when the output feature map was up-sampled.
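A sketch of this down-/up-sampling step is shown below; the 4x reduction factor is illustrative (not necessarily the factor we used), and `non_local` stands in for the matching module sketched earlier.

```python
import torch.nn.functional as F

def matched_bev(target_bev, query_bev, non_local, scale=0.25):
    """Run non-local block matching on reduced-resolution BEV maps, then restore
    the matched map to the target's original resolution with bilinear interpolation."""
    small_t = F.interpolate(target_bev, scale_factor=scale, mode="bilinear", align_corners=False)
    small_q = F.interpolate(query_bev, scale_factor=scale, mode="bilinear", align_corners=False)
    matched = non_local(small_t, small_q)
    return F.interpolate(matched, size=tuple(target_bev.shape[-2:]),
                         mode="bilinear", align_corners=False)
```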
In terms of tuning hyper-parameters, we found that increasing the feature map stride resulted in numerous undesired class predictions in the target, which was sub-optimal since we only needed to detect a single new class. Keeping the smallest possible feature map stride resulted in better predictions.
The predicted scores obtained were very low (roughly 0.01 to 0.08), and we think this is because the model was trained on a single class, which limits the confidence scores for predictions on the van class.
We also tried one-shot learning for the 'Truck' class but were unsuccessful in getting meaningful detections. We think our approach may have failed because our model is trained on the single class 'Car', and in terms of features, cars are more similar to vans than to trucks. We also think that having appropriate anchor positions for the 'Truck' class could have improved the results. In addition, the 3D footprints of vans and cars are similar in volume, whereas trucks have a much larger volume than cars.
Experiment: Cropping the query point cloud into patches
The original one-shot learner in [2] used a patch of the ground truth region proposal of the query image as its input. To construct the analogous 3D patches, we first obtained the coordinates of the eight corners of the ground truth box. Next, we applied a coordinate transform to the box coordinates and the rest of the input point cloud so that the bounding box axes aligned with the new coordinate system axes. This allowed us to easily filter the input points and keep only those within the bounding box; afterwards, all of the points were transformed back to the original coordinate system. However, we found that the model made better predictions when using the whole query scene rather than just the points from within the 3D ground truth bounding box.
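The cropping step might look like the sketch below; the box parameterization (center, length/width/height, yaw about z) and function name are our own assumptions.

```python
import numpy as np

def crop_points_in_box(points, box_center, box_dims, yaw):
    """Rotate points into the box frame, keep those inside the box extents,
    then rotate the kept points back to the original LiDAR frame.
    points: (N, 3); box_center: (3,); box_dims: (l, w, h); yaw: heading about z."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])        # rotation by -yaw (LiDAR frame -> box frame)
    local = (points - box_center) @ rot.T    # points expressed in the box frame
    half = np.asarray(box_dims) / 2.0
    inside = np.all(np.abs(local) <= half, axis=1)
    return local[inside] @ rot + box_center  # rot is orthonormal, so this undoes the rotation
```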
Conclusion
Through our project, we have verified that our approach to one-shot object detection on 3D point cloud data works in some examples; however, the performance can still be improved. The one-shot learning paper for 2D object detection also used concepts such as squeeze and co-excitation, which we believe could make our overall approach much better.
In addition, metrics such as 3D IoU could be calculated for the results to better understand the performance of the model. Re-training VoxelRCNN on multiple classes could also improve performance, both in terms of confidence scores and in terms of the IoU of the predicted boxes for held-out classes.
References
[1] Deng, Jiajun, et al. “Voxel R-CNN: Towards High Performance Voxel-Based 3D Object Detection.” ArXiv.org, 31 Dec. 2020, arxiv.org/abs/2012.15712v1.
[2] Hsieh, Ting-I, et al. “One-Shot Object Detection with Co-Attention and Co-Excitation.” ArXiv.org, 28 Nov. 2019, arxiv.org/abs/1911.12529.