The evaluation accuracy of MNIST-LeNet4 and Fashion-MNIST-LeNet4 on different types of datasets
The evaluation accuracy of CIFAR10-ResNet20 and SVHN-ResNet20 on different types of datasets
The evaluation accuracy of MNIST-LeNet5 and Fashion-MNIST-LeNet5 on different types of datasets
The evaluation accuracy of CIFAR10-VGG16 and SVHN-VGG16 on different types of datasets
On all datasets and models, each tool usually achieves better accuracy on the validation data generated by tools of the same type, because such data have similar distributions, whereas data generated by other types of tools are more likely to be OOD.
The distribution diversity of the test cases generated by a testing tool can be improved by introducing diverse transformations (e.g., blur, rain, fog) and more explicit distribution guidance (e.g., MMD guidance), which can identify more unseen data to further improve robustness.
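As an illustration of the kind of MMD guidance mentioned above, the sketch below scores batches of transformed test cases by their squared MMD (with an RBF kernel) against features of the seen training data, preferring transformations whose outputs lie further from the seen distribution. The helper names and the toy feature arrays are hypothetical and not taken from any specific testing tool.

```python
# Minimal sketch of MMD-based distribution guidance (illustrative, not a tool's actual API).
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise RBF kernel values between rows of x and rows of y."""
    # Squared Euclidean distances between every pair of rows.
    d2 = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd_score(seen, candidates, sigma=1.0):
    """Squared MMD estimate between seen (training) features and candidate test-case
    features; a larger value means the candidates are further from the seen distribution."""
    k_ss = gaussian_kernel(seen, seen, sigma).mean()
    k_cc = gaussian_kernel(candidates, candidates, sigma).mean()
    k_sc = gaussian_kernel(seen, candidates, sigma).mean()
    return k_ss + k_cc - 2 * k_sc

# Toy usage: compare transformations by how far their outputs move from the seen data.
rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(200, 16))   # hypothetical features of seen data
blur_feats = rng.normal(0.2, 1.0, size=(50, 16))      # hypothetical features after "blur"
fog_feats = rng.normal(1.5, 1.2, size=(50, 16))       # hypothetical features after "fog"
for name, feats in [("blur", blur_feats), ("fog", fog_feats)]:
    print(name, mmd_score(train_feats, feats))
```

In this sketch, a testing tool would favor the transformation (or individual test cases) with the higher MMD score, which is one plausible way to steer generation toward unseen regions of the input distribution.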