Here we provide more details of the experiments as supplements to our paper, including our questionnaire, visualizations, and raw experiment data.
Research Questions
V2XGen is designed to systematically test cooperative driving systems, particularly to verify their performance in long-range and occlusion perception scenarios. To this end, we empirically explore the following three research questions (RQs):
RQ1: How effective is V2XGen at synthesizing realistic cooperative perception data? (Data Realism)
RQ2: How effectively can V2XGen generate tests with its fitness-guided scene generation? (Fault Detection)
RQ3: How effective is V2XGen at guiding the improvement of a cooperative perception system through retraining? (Improvement Effectiveness)
In this subsection, we provide supplementary materials for our qualitative assessment.
We conduct a user study to assess the naturalness of the cooperative driving data generated by each operator of V2XGen. During the study, we randomly select five cooperative perception data instances for each transformation operator as test seeds. We then ask the participants to rank, through a questionnaire, the cooperative perception data synthesized from each seed by two different generation methods. For each data instance, participants rank the quality of the generated point cloud data in terms of the realism of the synthesized ego vehicle data frame and the realism of the synthesized cooperative vehicle data frame.
We provide the full questionnaire and the anonymized answers from 22 participants. Here, we provide more visualizations of the composition of the participants. The vast majority (95.5%) of participants held at least a master’s degree in SE or CS. Among the participants, 14 had at least one year of experience in autonomous driving or V2X, and 17 had comparable experience in object detection.
The red box indicates the location where the object is inserted. (World System -> Ego)
Insertion Operator: 🤖The baseline does not align with the point cloud's characteristic density distribution, which is denser at closer ranges and sparser at greater distances!
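To make the "denser near, sparser far" property concrete, the following minimal sketch (our own illustration, not part of V2XGen) bins a LiDAR scan by radial distance and reports the point density per ring area; a well-synthesized insertion should preserve the falling density profile around the inserted object.

```python
import numpy as np

def density_by_range(points: np.ndarray, bin_width: float = 5.0, max_range: float = 100.0):
    """Rough check of the 'denser near, sparser far' property of a LiDAR scan.

    points: (N, 3) array of x, y, z coordinates in the sensor frame.
    Returns (bin_centers, points_per_m2), where density is measured per
    annulus area in the ground plane.
    """
    r = np.linalg.norm(points[:, :2], axis=1)                # radial distance in the x-y plane
    edges = np.arange(0.0, max_range + bin_width, bin_width)
    counts, _ = np.histogram(r, bins=edges)
    ring_area = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)   # area of each annulus
    return 0.5 * (edges[1:] + edges[:-1]), counts / ring_area
```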
The red box indicates the location where the object is deleted. (World System -> Ego)
Deletion Operator: 🤖The baseline fails to reconstruct the ground line obstructed by the deleted entity!
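The ground-line issue above can be illustrated with a simple least-squares plane fill. This is only a sketch of the idea under the assumption of locally planar ground; the function names are ours and not V2XGen's actual implementation.

```python
import numpy as np

def fill_ground_hole(neighbor_ground: np.ndarray, hole_xy: np.ndarray) -> np.ndarray:
    """Fit a plane z = ax + by + c to ground points around the deleted object
    and sample synthetic ground returns at the requested x-y positions.

    neighbor_ground: (N, 3) ground points surrounding the hole.
    hole_xy: (M, 2) x-y positions inside the hole to be filled.
    """
    A = np.c_[neighbor_ground[:, 0], neighbor_ground[:, 1], np.ones(len(neighbor_ground))]
    coeff, *_ = np.linalg.lstsq(A, neighbor_ground[:, 2], rcond=None)  # [a, b, c]
    z = hole_xy @ coeff[:2] + coeff[2]
    return np.c_[hole_xy, z]
```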
The white box indicates the position of the object before translation, and the red box indicates the position of the object after translation. (World System -> Ego)
Translation Operator: 🤖The baseline fails to reconstruct the occluded ground line and does not remove the occluded region of the newly introduced translation entity!
The red box indicates the position of the object after rotation. (World System -> Cooperative)
Rotation Operator: 🤖In the baseline, the side of the rotated entity close to the LiDAR does not receive point cloud data, whereas the side farther from the sensor does, which contradicts the fundamental principles of LiDAR beam transmission!
The red box indicates the location where the object is scaled. (World System -> Cooperative)
Scale Operator: 🤖In the baseline, the scaled entity is elevated above the ground and appears to be floating, which does not align with a realistic cooperative driving environment!
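A straightforward remedy for the floating artifact is to snap the scaled object back onto the local ground height. The sketch below is a hypothetical illustration of that step, not the operator's actual code.

```python
import numpy as np

def snap_to_ground(object_points: np.ndarray, ground_z: float) -> np.ndarray:
    """Shift a scaled object vertically so that its lowest point rests on the
    estimated local ground height instead of floating above it."""
    shifted = object_points.copy()
    shifted[:, 2] += ground_z - object_points[:, 2].min()
    return shifted
```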
In this subsection, we provide supplementary visualization examples of occlusion perception errors and long-range perception errors.
Occlusion Perception Error
The black box indicates that there is an occlusion perception error, and the occlusion rate of the ego vehicle’s line of sight is 0.165. The red predicted bounding box inferred by the model (late fusion) has an IoU with the ground-truth bounding box that is less than the threshold.
The black box indicates that there is an occlusion perception error, and the occlusion rate of the ego vehicle’s line of sight is 0.665. The red pentagon indicates that the model (late fusion) does not infer a predicted bounding box at the corresponding position.
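For reference, the sketch below illustrates how the occlusion perception error criterion described in the two examples above could be checked. It uses axis-aligned BEV boxes and placeholder thresholds (`iou_thresh`, `occ_thresh`) for simplicity; the actual evaluation uses oriented 3D boxes and the thresholds reported in the paper.

```python
def bev_iou(box_a, box_b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_occlusion_error(gt_box, pred_box, occlusion_rate, iou_thresh=0.5, occ_thresh=0.1):
    """An occlusion perception error: the ground truth is (partly) occluded and the
    matched prediction is either missing (pred_box is None) or below the IoU threshold.
    The 0.5 / 0.1 values are placeholders, not the paper's settings."""
    if occlusion_rate < occ_thresh:
        return False
    return pred_box is None or bev_iou(gt_box, pred_box) < iou_thresh
```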
Long-range Perception Error
The black box indicates that there is a long-range perception error, and the ego vehicle is 56.10 meters away from the object to be measured. The red pentagon indicates that the model (F-Cooper fusion) does not infer a predicted bounding box at the corresponding location.
The black box indicates that there is a long-range perception error, and the ego vehicle is 55.16 meters away from the object to be measured. The red predicted bounding box inferred by the model (F-Cooper fusion) has an IoU with the ground-truth bounding box that is less than the threshold.
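Analogously, a long-range perception error can be flagged when a distant ground-truth object is missed or poorly matched. The sketch below uses a placeholder 50 m range threshold and assumes a precomputed matched IoU.

```python
import numpy as np

def is_long_range_error(ego_position, gt_center, matched_iou, range_thresh=50.0, iou_thresh=0.5):
    """A long-range perception error: the ground-truth object lies beyond the range
    threshold and is either unmatched (matched_iou is None) or matched below the
    IoU threshold. The 50 m / 0.5 values are placeholders, not the paper's settings."""
    distance = np.linalg.norm(np.asarray(gt_center[:2]) - np.asarray(ego_position[:2]))
    if distance < range_thresh:
        return False
    return matched_iou is None or matched_iou < iou_thresh

# E.g., an object 56.10 m away with no matched prediction (matched_iou=None)
# would be flagged as a long-range perception error.
```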
SUTs (systems under test)
Early Fusion system [1] directly transmits raw point clouds to collaborating entities, allowing the ego vehicle to aggregate all data within its own coordinate frame. This process ensures the preservation of comprehensive information.
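A minimal sketch of the core early-fusion step, assuming a known homogeneous transform between the two LiDAR frames (the interface names are ours, not the benchmark code's):

```python
import numpy as np

def early_fusion(ego_points: np.ndarray, coop_points: np.ndarray,
                 T_coop_to_ego: np.ndarray) -> np.ndarray:
    """Aggregate a cooperative vehicle's raw point cloud into the ego frame.

    ego_points, coop_points: (N, 3) and (M, 3) arrays of x, y, z.
    T_coop_to_ego: (4, 4) homogeneous transform from the cooperative vehicle's
    LiDAR frame to the ego LiDAR frame.
    """
    coop_h = np.c_[coop_points, np.ones(len(coop_points))]   # to homogeneous coordinates
    coop_in_ego = (T_coop_to_ego @ coop_h.T).T[:, :3]        # transform and drop the 1s
    return np.vstack([ego_points, coop_in_ego])              # single merged cloud for detection
```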
Late Fusion system [1] detects objects based on sensor observations from cooperative vehicles and subsequently shares the detection results with other entities. The receiving ego vehicle then employs non-maximum suppression to produce final outputs.
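The late-fusion merging step can be sketched as greedy non-maximum suppression over the pooled detections of all agents; `iou_fn` and the threshold below are placeholders rather than the evaluated system's exact settings.

```python
import numpy as np

def nms(boxes, scores, iou_fn, iou_thresh=0.5):
    """Greedy non-maximum suppression over the pooled detections of all agents.

    boxes: list of box parameterizations accepted by iou_fn.
    scores: (N,) confidence scores. Returns indices of the kept detections.
    """
    order = np.argsort(scores)[::-1]   # highest confidence first
    keep = []
    while len(order) > 0:
        best = int(order[0])
        keep.append(best)
        # discard every remaining box that overlaps the kept one too strongly
        order = np.array([i for i in order[1:]
                          if iou_fn(boxes[best], boxes[i]) < iou_thresh], dtype=int)
    return keep
```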
V2VNet system [2] proposes a multi-round message passing mechanism based on graph neural networks to enhance perception performance. It comprises three main stages: a convolutional network block for generating an intermediate representation, a cross-vehicle aggregation stage, and an output network for computing the final outputs.
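As a rough illustration of the aggregation stage only, the sketch below performs one simplified round of neighbor feature averaging; V2VNet's actual design uses spatial warping of features and a convolutional GRU update, which are omitted here.

```python
import numpy as np

def message_passing_round(features: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """One simplified round of cross-vehicle feature aggregation.

    features: (V, C, H, W) float intermediate BEV features, assumed to be
    already warped into a common frame. adjacency: (V, V) 0/1 communication links.
    """
    V = features.shape[0]
    updated = np.empty_like(features)
    for i in range(V):
        nbrs = [features[j] for j in range(V) if adjacency[i, j] and j != i]
        msg = np.mean(nbrs, axis=0) if nbrs else np.zeros_like(features[i])
        updated[i] = 0.5 * features[i] + 0.5 * msg   # toy blend, not V2VNet's ConvGRU update
    return updated
```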
V2X-ViT system [3] introduces a novel vision transformer explicitly designed for V2X perception. It incorporates a customized heterogeneous multi-head self-attention module tailored for graph attribute-aware multi-agent 3D visual feature fusion. This module effectively captures the inherent heterogeneity present in V2X systems.
AttFusion system [4] is designed to capture interactions among the features of neighboring cooperative vehicles, allowing the network to prioritize key observations. This pipeline is flexible and can be seamlessly integrated with existing deep learning-based detectors.
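The sketch below shows a per-location, single-head scaled dot-product attention over the agents' aligned BEV features, in the spirit of the description above; it assumes the ego vehicle is agent index 0 and is not the exact OPV2V implementation.

```python
import numpy as np

def attention_fusion(features: np.ndarray) -> np.ndarray:
    """Per-location attention over agents' aligned BEV features.

    features: (V, C, H, W) features from the ego (index 0) and cooperative
    vehicles. Returns a fused (C, H, W) feature map for the ego vehicle.
    """
    V, C, H, W = features.shape
    x = features.reshape(V, C, H * W).transpose(2, 0, 1)    # (HW, V, C): one token per agent per cell
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)          # (HW, V, V) scaled dot-product scores
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)      # softmax over agents
    fused = (attn @ x)[:, 0, :]                             # keep the ego row at each cell
    return fused.T.reshape(C, H, W)
```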
F-Cooper system [5] utilizes maxout fusion to combine shared intermediate features. Data inputs are independently processed by the voxel feature encoding layers to produce features. Subsequently, local spatial features extracted from individual vehicles are fused to generate the final feature maps.
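The maxout fusion itself reduces to an element-wise maximum over aligned feature maps, as in this sketch (voxel feature encoding and overlap-region handling are omitted):

```python
import numpy as np

def maxout_fusion(ego_feat: np.ndarray, coop_feat: np.ndarray) -> np.ndarray:
    """Element-wise maxout fusion of two spatially aligned (C, H, W) feature maps;
    in practice only the overlapping region between the two vehicles is fused."""
    return np.maximum(ego_feat, coop_feat)
```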
References
[1] Xu, Runsheng, et al. "V2V4Real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023.
[2] Wang, Tsun-Hsuan, et al. "V2VNet: Vehicle-to-vehicle communication for joint perception and prediction." European Conference on Computer Vision (ECCV). Springer, 2020.
[3] Xu, Runsheng, et al. "V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer." European Conference on Computer Vision (ECCV). Springer, 2022.
[4] Xu, Runsheng, et al. "OPV2V: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication." International Conference on Robotics and Automation (ICRA). IEEE, 2022.
[5] Chen, Qi, et al. "F-Cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3D point clouds." Proceedings of the 4th ACM/IEEE Symposium on Edge Computing. 2019.