Time cost of seed selection
The time cost of each strategy (selecting 2% of the seeds) is reported in the following sheet.
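A minimal sketch of how such timings can be collected is shown below; the `time_strategy` helper and the random-selection baseline are illustrative stand-ins, not the actual selection strategies evaluated here.

```python
import time

import numpy as np

def time_strategy(select_seeds, corpus, ratio=0.02, repeats=5):
    """Measure the average wall-clock cost of one seed-selection strategy."""
    costs = []
    for _ in range(repeats):
        start = time.perf_counter()
        select_seeds(corpus, ratio)          # pick 2% of the seeds
        costs.append(time.perf_counter() - start)
    return np.mean(costs), np.std(costs)

# Example: time a baseline that samples 2% of the corpus at random.
def random_selection(corpus, ratio):
    n = max(1, int(len(corpus) * ratio))
    idx = np.random.choice(len(corpus), size=n, replace=False)
    return corpus[idx]

corpus = np.random.rand(10000, 28, 28, 1)    # stand-in for MNIST test images
mean_s, std_s = time_strategy(random_selection, corpus)
print(f"random selection: {mean_s:.4f}s +/- {std_s:.4f}s")
```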
Impact of seed corpus size
These figures show the average results with different numbers of initial seeds. For each dataset, we randomly select different numbers of initial seeds (i.e., 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, and 5000) from the test dataset. With each initial seed corpus, we run each of the 5 testing tools for 5,000 iterations and compare the final coverage and the number of unique errors. The results are obtained on LeNet-5 for MNIST and Fashion-MNIST, CNN for SVHN, and ResNet-20 for CIFAR-10.
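The corpus construction step can be sketched as follows, assuming the standard Keras MNIST loader; the fixed RNG seed and variable names are illustrative choices rather than the exact experimental code.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

(_, _), (x_test, y_test) = mnist.load_data()

SEED_SIZES = [100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000]

rng = np.random.default_rng(0)
corpora = {}
for n in SEED_SIZES:
    # Randomly draw n initial seeds from the test set, without replacement.
    idx = rng.choice(len(x_test), size=n, replace=False)
    corpora[n] = (x_test[idx], y_test[idx])

# Each corpus would then drive one 5,000-iteration run per testing tool.
```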
The first row and the second row show the coverage results and the number of unique errors, respectively.
For the coverage results, we observe that, when the seed corpus is small (i.e., fewer than 500 seeds), the coverage increases as the corpus size increases. However, once the corpus reaches a certain size, the coverage gain is no longer significant. In some cases, the coverage even decreases as the corpus grows (e.g., TensorFuzz-NC on a 1,000-seed corpus of MNIST). These results indicate that the coverage converges once the seed corpus reaches a certain size.
Regarding the number of unique errors, when KMNC is used as guidance, the number of unique errors increases with the number of initial seeds across all testing tools, although the growth rates of random testing and TensorFuzz are relatively flat. When NC is used as guidance, the number of unique errors found by DeepHunter and random testing tends to stabilize as the number of seeds increases, but TensorFuzz shows no such trend: the number of unique errors it generates fluctuates, or even decreases, as the number of initial seeds increases.
The coverage tends to be insensitive to the corpus size: once the initial seeds reach a certain number, the coverage does not increase significantly. In contrast, the number of unique errors is largely affected by the number of seeds, and a larger seed corpus can increase the number of discovered errors.
Statistical analysis
The variance of the data on MNIST is reported in the following sheet.
We also report below the p-value of the dominant metric of each strategy, compared with random selection. Since we only have five groups of data, the p-value may not accurately reflect the relationship between the data; even so, the optimized seed sets outperform random selection on every metric except the #UniqErrors of TensorFuzz, which exhibits high randomness.
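One plausible way to compute such variances and p-values is a Mann-Whitney U test from `scipy.stats`; the numbers below are placeholders standing in for five repeated runs, not the actual measurements reported in the sheets.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder results from five repeated runs; not the paper's data.
optimized = np.array([1520, 1534, 1498, 1551, 1542])   # e.g., #UniqErrors per run
random_sel = np.array([1403, 1462, 1389, 1441, 1420])

print("variance (optimized):", np.var(optimized, ddof=1))
print("variance (random):   ", np.var(random_sel, ddof=1))

# One-sided test: does the optimized seed set yield larger values?
stat, p = mannwhitneyu(optimized, random_sel, alternative="greater")
print(f"Mann-Whitney U = {stat}, p = {p:.4f}")
```

With only five samples per group, such a test has low power, which is consistent with the caveat above about the p-values.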