RQ1 & RQ2: Results of SOO-based and MOO-based Selection
The comparisons between CGS and Random Selection
The following figures plot the coverage trends on each dataset under the KMNC and NC criteria. The solid lines represent testing with seeds selected under KMNC guidance; the dashed lines with the suffix '-r' represent testing with randomly selected seeds.
[Figures: coverage trends under KMNC and NC on MNIST (LeNet-5), Fashion-MNIST (LeNet-5), SVHN (CNN), and CIFAR-10 (ResNet-20)]
These figures show that testing with coverage-guided seeds consistently achieves higher coverage than random selection throughout the 5,000 iterations.
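For reference, the sketch below illustrates the ranking idea behind this kind of coverage-guided selection: profile which neurons each candidate seed activates, then greedily keep the seeds that add the most uncovered neurons. This is a minimal illustration assuming a Keras model and DeepXplore-style neuron coverage with a 0.5 activation threshold; the helper names and the greedy loop are our own simplification, not the exact CGS implementation.

```python
import numpy as np
import tensorflow as tf

def neuron_coverage_profile(model, x, threshold=0.5):
    """Boolean vector marking which neurons are activated by batch x.
    Activations are min-max scaled within each layer (per sample),
    then compared against the threshold, as in DeepXplore-style NC."""
    layer_outputs = [l.output for l in model.layers if hasattr(l, "activation")]
    probe = tf.keras.Model(model.input, layer_outputs)
    covered = []
    for act in probe(x):
        a = act.numpy().reshape(len(x), -1)
        lo, hi = a.min(1, keepdims=True), a.max(1, keepdims=True)
        a = (a - lo) / (hi - lo + 1e-8)          # scale within each layer
        covered.append((a > threshold).any(0))   # neuron covered by any sample
    return np.concatenate(covered)

def greedy_coverage_selection(model, candidates, budget):
    """Greedily pick the seeds that add the most not-yet-covered neurons."""
    profiles = [neuron_coverage_profile(model, c[np.newaxis]) for c in candidates]
    selected, covered = [], np.zeros_like(profiles[0])
    pool = list(range(len(candidates)))
    for _ in range(min(budget, len(pool))):
        gains = [np.sum(profiles[i] & ~covered) for i in pool]
        best = pool.pop(int(np.argmax(gains)))
        selected.append(best)
        covered |= profiles[best]
    return selected  # indices of the chosen seeds
```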
The comparisons between CGS and DeepMetis
The following figures plot the coverage trends on each dataset under the KMNC and NC criteria. The solid lines represent testing with seeds optimized by coverage; the dashed lines with the suffix '-m' represent testing with seeds selected by DeepMetis.
[Figures: coverage trends of CGS-KMNC and CGS-NC versus DeepMetis on MNIST (LeNet-5), Fashion-MNIST (LeNet-5), SVHN (CNN), and CIFAR-10 (ResNet-20)]
Testing with seeds selected by the coverage-guided optimization strategy achieves higher coverage than DeepMetis throughout the 5,000 iterations in most cases, especially on the larger datasets.
We also conduct a statistical analysis of the results; AUC values that are significantly better than those of the compared strategy are underlined. The AUC values of our strategies are higher than, and in most cases significantly better than, those of DeepMetis, especially on larger datasets such as CIFAR-10.
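This comparison can be reproduced along the following lines: compute the area under each coverage-vs-iteration curve with the trapezoidal rule and compare repeated runs with a one-sided Mann-Whitney U test. The significance level of 0.05 and the normalization in this sketch are our assumptions, not necessarily the exact statistical setup of the experiments.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def coverage_auc(coverage_per_iteration):
    """Area under the coverage-vs-iteration curve (trapezoidal rule),
    normalized to the mean curve height."""
    y = np.asarray(coverage_per_iteration, dtype=float)
    return np.trapz(y) / (len(y) - 1)

def significantly_better(runs_a, runs_b, alpha=0.05):
    """One-sided Mann-Whitney U test: does strategy A yield higher
    AUC values than strategy B across repeated runs?"""
    auc_a = [coverage_auc(r) for r in runs_a]
    auc_b = [coverage_auc(r) for r in runs_b]
    _, p = mannwhitneyu(auc_a, auc_b, alternative="greater")
    return p < alpha
```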
The following figures show the trends of the number of errors detected by DeepHunter when starting from different seed sets.
[Figures: errors detected by DeepHunter with different seed sets on MNIST (LeNet-5), Fashion-MNIST (LeNet-5), SVHN (CNN), and CIFAR-10 (ResNet-20)]
The above results show that, given the same number of seed inputs and iterations, the PCS-low strategy detects more failures than random selection and the other strategies, while LSA performs poorly on failure detection.
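As a rough illustration, PCS-low selection can be sketched as follows, assuming PCS denotes the gap between the two highest softmax probabilities (so low-PCS seeds sit near a decision boundary); the function names here are hypothetical.

```python
import numpy as np

def pcs(probabilities):
    """Prediction confidence score: gap between the top-2 softmax
    probabilities; small values indicate inputs near a decision boundary."""
    top2 = np.sort(probabilities, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_pcs_low(model, candidates, budget):
    """Keep the `budget` seeds with the lowest confidence gap."""
    probs = model.predict(candidates)   # softmax outputs, shape (N, classes)
    order = np.argsort(pcs(probs))      # ascending: least confident first
    return candidates[order[:budget]]   # candidates is a numpy array
```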
Table " Results of different seed selection strategies " shows detailed results related to the coverage (shown in Row NC and KMNC) and the number of unique errors (shown in Row #Failure).
Table " Robustness improvement of models (retrained using test cases generated from different seed input sets) against test errors (𝑇est_ID ) and adversarial examples (TEST_OOD) " shows the robustness evaluation of the newly trained model on the test errors and adversarial examples.
Table " Optimization with DeepGini " shows testing performance of DeepGini strategy. We use this test case selection strategy to select initial seeds and retrain dataset.
With regard to a given testing goal, such as coverage or the number of failures, the corresponding seed selection strategy improves testing performance and makes DL testing more efficient. However, a selection strategy proposed for one testing goal does not transfer well to another, and for the robustness goal none of the selected metrics works well.
Overall, MOO-based seed selection strategies are useful for boosting testing performance, and MOO(CF) outperforms random selection on all goals. Compared with SOO-based selection strategies, they strike a balance across multiple goals, even though they do not beat an SOO strategy on its own target goal. Moreover, MOO-based selection achieves promising results in robustness enhancement.
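To make the MOO idea concrete, the sketch below shows one plausible way to combine two objectives (per-seed coverage gain and model uncertainty) via NSGA-II-style non-dominated sorting. The objective pairing and the plain Pareto loop are assumptions for illustration; they are not the exact MOO(CF) formulation used in the experiments.

```python
import numpy as np

def pareto_front(objectives):
    """Indices of non-dominated points; both objectives are maximized."""
    n = len(objectives)
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(objectives[j] >= objectives[i]) \
                      and np.any(objectives[j] > objectives[i]):
                dominated[i] = True
                break
    return np.where(~dominated)[0]

def moo_select(coverage_gain, uncertainty, budget):
    """Pick seeds layer by layer from successive Pareto fronts until
    the budget is filled (non-dominated sorting, as in NSGA-II)."""
    objectives = np.column_stack([coverage_gain, uncertainty])
    remaining = np.arange(len(objectives))
    selected = []
    while len(selected) < budget and len(remaining) > 0:
        front = pareto_front(objectives[remaining])
        selected.extend(remaining[front].tolist())
        remaining = np.delete(remaining, front)
    return selected[:budget]
```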