Here we provide more details of the experiment as supplements to our paper, including our questionnaire, visualizations, and raw experiment data.
MultiTest is designed to generate realistic and modality-consistent test data to detect faults in MSF systems and improve their performance. To evaluate MultiTest’s performance, we conduct both quantitative and qualitative experiments to answer the following three research questions (RQs):
RQ1. How effective is MultiTest at synthesizing realistic multi-modal data? [Realism Validation]
RQ2. How effective is MultiTest at generating error-revealing tests? [Fault Detection Capability]
RQ3. How effective is MultiTest at guiding the improvement of a SUT through retraining? [Performance Improvement]
In this subsection, we provide supplementary materials for our qualitative assessment.
We conduct a user study to qualitatively assess the naturalness of the multi-modal data generated by MultiTest. We randomly select twenty data instances as test seeds and, through a questionnaire, ask each participant to rank the multi-modal data synthesized from each seed by four different pipelines. For each of the twenty data instances, a participant ranks the data quality from three perspectives: (1) the naturalness of the image, (2) the naturalness of the point cloud, and (3) the modality consistency between the image and its paired point cloud. To mitigate ordering bias, we randomize the presentation order of the data synthesized by the different pipelines for each test seed.
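For concreteness, below is a minimal sketch of how such per-seed order randomization can be implemented. The pipeline labels, function name, and fixed random seed are our own placeholders for illustration, not artifacts of the original study.

```python
import random

# Placeholder labels; the actual four pipelines are those compared in the paper.
PIPELINES = ["pipeline_A", "pipeline_B", "pipeline_C", "pipeline_D"]
NUM_SEEDS = 20

def build_presentation_orders(rng_seed: int = 0) -> list[list[str]]:
    """Return an independently shuffled pipeline order for each test seed."""
    rng = random.Random(rng_seed)  # fixed seed keeps the questionnaire reproducible
    orders = []
    for _ in range(NUM_SEEDS):
        order = PIPELINES.copy()
        rng.shuffle(order)  # each seed gets its own random presentation order
        orders.append(order)
    return orders

for i, order in enumerate(build_presentation_orders(), start=1):
    print(f"Seed {i:02d}: {order}")
```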
We further provide the full questionnaire and the anonymized answers from the 16 participants, together with visualizations of the participants' backgrounds. All participants hold at least a master's degree in SE/CS, and seven of the sixteen have more than two years of experience in the field of autonomous driving.
In the last section of our questionnaire, we ask participants to select the most important factors that determine the quality of images or point clouds.
We find that participants were more likely to consider factors such as perspective, collision, occlusion, and position when judging data quality, and less likely to consider detail-oriented factors such as color and shape. These observations may inform future research.
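A minimal sketch of how such factor selections can be tallied is given below; the factor names follow the findings reported above, but the helper and the example responses are purely illustrative, not our raw data.

```python
from collections import Counter

FACTORS = ["perspective", "collision", "occlusion", "position", "color", "shape"]

def tally_factor_votes(responses: list[list[str]]) -> Counter:
    """Count how often each quality factor was selected across participants."""
    counts: Counter = Counter()
    for selected in responses:
        counts.update(f for f in selected if f in FACTORS)
    return counts

# Example: three hypothetical participants' selections.
example_responses = [
    ["perspective", "occlusion"],
    ["collision", "position", "occlusion"],
    ["perspective", "position"],
]
print(tally_factor_votes(example_responses).most_common())
```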
TauLiM [1] (for LiDAR) uses planar equations to determine the insertion position and may therefore place objects at illegal positions. Moreover, TauLiM may generate point clouds that do not respect the physical laws of the LiDAR laser (see the sketch after the references below).
MetaOD [2] (for cameras) can synthesize images in which inserted objects have correct positions and poses, but it operates on the image modality alone and therefore cannot guarantee consistency with the corresponding point cloud.
[1] TauLiM: Test Data Augmentation of LiDAR Point Cloud by Metamorphic Relation.
[2] Metamorphic Object Insertion for Testing Object Detection Systems.
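To make the physical-law argument concrete, the following is a minimal sketch (our own illustration, not TauLiM's or MultiTest's code) of the occlusion constraint that a physically valid LiDAR insertion must obey: every returned point is the closest surface along its ray from the sensor, so original background points that fall behind an inserted object on the same ray should have been removed. We assume the sensor sits at the origin of the point-cloud frame and approximate rays by bucketing directions on an angular grid; the function name and resolution parameter are assumptions.

```python
import numpy as np

def ray_occlusion_violations(scene: np.ndarray, inserted: np.ndarray,
                             angular_res_deg: float = 0.2) -> np.ndarray:
    """Flag `scene` points lying behind an inserted point on ~the same ray.

    scene, inserted: (N, 3) arrays in the sensor frame (sensor at origin).
    Returns a boolean mask over `scene`; True marks a physics violation.
    """
    def spherical(pts):
        # Convert Cartesian points to (range, azimuth, elevation).
        r = np.linalg.norm(pts, axis=1)
        az = np.degrees(np.arctan2(pts[:, 1], pts[:, 0]))
        el = np.degrees(np.arcsin(pts[:, 2] / np.maximum(r, 1e-9)))
        return r, az, el

    def bucket(az, el):
        # Points whose direction falls in the same angular bin are
        # treated as sharing a laser ray.
        return (np.round(az / angular_res_deg).astype(int),
                np.round(el / angular_res_deg).astype(int))

    r_s, az_s, el_s = spherical(scene)
    r_i, az_i, el_i = spherical(inserted)

    # Closest inserted range per ray bucket.
    nearest = {}
    for rb, ab, eb in zip(r_i, *bucket(az_i, el_i)):
        nearest[(ab, eb)] = min(nearest.get((ab, eb), np.inf), rb)

    keys = zip(*bucket(az_s, el_s))
    return np.array([r > nearest.get(k, np.inf)
                     for r, k in zip(r_s, keys)])
```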
[Figure: data synthesized by TauLiM (panels TauLiM1–3), MetaOD (panels MetaOD1–3), and MultiTest, shown side by side for visual comparison]
In this subsection, we visualize several error-revealing test data generated by MultiTest. We provide both the raw and the generated images and point clouds with labeled bounding boxes. Green bounding boxes mark the ground truth and red bounding boxes mark the predicted results. In addition, blue bounding boxes mark previous "Car" objects that are relabeled as "DontCare" because they are almost completely obscured. To minimize false positives, we ignore faults whose boxes overlap with the blue boxes (a filtering sketch is given below). Finally, we mark the detected errors in the 3D scene with dashed yellow boxes.
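For completeness, here is a minimal sketch of the DontCare filtering rule described above; the [x1, y1, x2, y2] box format, helper names, and overlap threshold are our own assumptions for illustration, not the exact implementation.

```python
import numpy as np

def iou_2d(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_dontcare_faults(fault_boxes, dontcare_boxes, thresh=0.1):
    """Keep only faults that do not overlap any DontCare (blue) box."""
    kept = []
    for f in fault_boxes:
        if all(iou_2d(np.asarray(f), np.asarray(d)) <= thresh
               for d in dontcare_boxes):
            kept.append(f)
    return kept
```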
①: Missing distant vehicles
①: Missing previously detected objects
①: Missing closer vehicles
②: Missing distant vehicles
③: False detection due to partial occlusion
①: Missing partially obscured objects
①: False detection near ego car
①: False detection in the background
①: Missing closer vehicles
②: Localization error due to partial occlusion
③: False detection in the background
In this subsection, we provide our raw experimental data.