In our experiments, we evaluate the local robustness property of each model. To that end, an input specification is identified by a perturbation bound ε, which is a distance measure under the L∞ norm and identifies a square region in the input space; an output specification is identified by a reference image x and requires that, for any input x' such that ||x'-x||∞ ≤ ε, the model classifies x' to the same label as x.
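The property above can be sketched as follows; this is an illustrative check, not the verification procedure itself, and `model` stands in for an arbitrary classifier:

```python
import numpy as np

def satisfies_local_robustness(model, x, x_prime, eps):
    """Check the local robustness condition for a single perturbed input x'.

    The specification only constrains inputs inside the eps-ball under
    the L-infinity norm; outside the ball it imposes nothing.
    """
    if np.max(np.abs(x_prime - x)) > eps:
        return True  # x' is outside the square region; trivially satisfied
    return model(x_prime) == model(x)

# Toy classifier: label by the sign of the mean pixel value.
toy_model = lambda x: int(np.mean(x) > 0)
x = np.array([0.2, 0.3])
print(satisfies_local_robustness(toy_model, x, x + 0.05, 0.1))  # True
```

A complete verifier must establish this condition for every x' in the ball, not just sampled points, which is what the BaB procedure below does.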
To avoid meaningless specifications that can be solved by a trivial BaB tree (e.g., a tree with only one node), we perform a binary search for ε for each reference input x, such that the template tree produced by verifying the original network against x and ε has a considerable number of nodes.
Note that this is necessary because, otherwise, the template may contain too few nodes to provide meaningful information for the verification of the new network.
Figure: the distribution of the sizes of the template trees used in our experiments across all the benchmarks.
Binary search for ε
The flow of this algorithm is simple:
if the tree size is greater than 1, we record the epsilon;
if the tree size equals 1, there are two cases:
(1) if the property is verified, we need to increase epsilon;
(2) if it is falsified, we need to decrease epsilon;
then we enter the loop again to check the tree size.
This algorithm is adapted from a similar one used for the OVAL21 benchmark in VNN-COMP 2022 [1]. In the end, our aim is not to favor non-robust cases, but to find an epsilon with which we can obtain a meaningful BaB tree as our template for incremental verification.
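The steps above can be sketched as follows. Here `verify` is a hypothetical callback that runs BaB verification for a given epsilon and returns a `(verified, tree_size)` pair; the bounds, iteration cap, and the toy verifier are illustrative assumptions:

```python
def search_epsilon(verify, lo=0.0, hi=1.0, min_tree_size=2, max_iters=30):
    """Binary-search for an epsilon whose BaB template tree is non-trivial."""
    for _ in range(max_iters):
        eps = (lo + hi) / 2.0
        verified, tree_size = verify(eps)
        if tree_size >= min_tree_size:
            return eps        # record the epsilon: tree size > 1
        if verified:
            lo = eps          # trivially verified: increase epsilon
        else:
            hi = eps          # trivially falsified: decrease epsilon
    return None               # no meaningful epsilon found within the budget

# Toy stand-in verifier: a single-node tree unless eps lies near 0.5.
def toy_verify(eps):
    if eps < 0.4:
        return True, 1
    if eps > 0.6:
        return False, 1
    return True, 10

print(search_epsilon(toy_verify))  # finds eps = 0.5 on the first midpoint
```

The search terminates either when a non-trivial tree is found or when the iteration budget is exhausted, in which case the reference input is discarded.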
The details of the benchmarks adopted in our experiments
The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60,000 color images. Each image is labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck).
The MNIST_L2 and MNIST_L4 models are feed-forward, fully connected networks, containing 512 and 1024 neurons within 2 and 4 layers, respectively.
The OVAL21 models are 3 ReLU-based convolutional networks supplied by the OVAL team for VNN-COMP [1].
Weight pruning is used for improving generalization, speeding up inference, and training/fine-tuning with fewer samples. It has been integrated as a standard library in mainstream machine learning platforms such as PyTorch. Technically, it simply removes a number of weights from a given neural network by setting them to 0. It is subject to a parameter that indicates the percentage of weights to be removed. This parameter determines the similarity between 𝑁 and 𝑁∗, and so it may affect the effectiveness of our approach. In our experiments, we adopt different parameters for different models, due to the different sizes of the neural networks.
Specifically, for the models MNIST_L4 and MNIST_L2, we adopt 7% and 12%, respectively; for the models OVAL21_BASE, OVAL21_WIDE, and OVAL21_DEEP, we adopt smaller parameters of 0.1%, 1%, and 3% due to the large sizes of these models. In RQ5, we take this parameter as an indicator of the similarity between 𝑁 and 𝑁∗, and take a specific look at the performance change of Olive under different parameters.
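As a minimal sketch of magnitude-based weight pruning (a NumPy stand-in for library routines such as PyTorch's unstructured pruning; ties at the threshold may zero slightly more entries than requested):

```python
import numpy as np

def prune_weights(w, fraction):
    """Zero out the smallest-magnitude `fraction` of the entries in `w`."""
    flat = np.abs(w).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return w.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(w) <= threshold] = 0.0
    return pruned

w = np.array([[0.5, -0.01], [0.2, -0.03]])
print(prune_weights(w, 0.5))  # the two smallest-magnitude weights become 0
```

The `fraction` argument plays the role of the pruning percentage discussed above: larger values produce an altered network 𝑁∗ that is less similar to 𝑁.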
Quantization is also a widely-recognized technique that serves similar purposes as weight pruning. It has also been adopted in mainstream machine learning platforms such as PyTorch. Technically, it converts original floating-point numbers to a format that uses fewer bits, e.g., 8-bit integers. In our experiments, we adopt such a conversion for the alteration of the MNIST models. For the OVAL21 models, since quantization can cause a big change in model behavior due to their large number of parameters, we apply weight pruning only.
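A minimal sketch of this conversion, using symmetric per-tensor quantization to signed integers and an immediate dequantization to expose the induced rounding error (the scale computation and bit width are illustrative assumptions, not the exact scheme of any particular framework):

```python
import numpy as np

def quantize_dequantize(w, num_bits=8):
    """Round-trip weights through a symmetric integer representation."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8-bit integers
    scale = np.abs(w).max() / qmax           # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q.astype(np.float32) * scale      # dequantized approximation

w = np.array([0.5, -0.25, 0.1], dtype=np.float32)
print(quantize_dequantize(w))
```

The altered network 𝑁∗ is the one whose weights are these dequantized approximations; the fewer bits used, the larger the deviation from 𝑁.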
[1] Müller, Mark Niklas, et al. "The third international verification of neural networks competition (VNN-COMP 2022): Summary and results." arXiv preprint arXiv:2212.10376 (2022).
[2] LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[3] Fuchida, M., Pathmakumar, T., Mohan, R. E., Tan, N., & Nakamura, A. (2017). Vision-based perception and classification of mosquitoes using support vector machine. Applied Sciences, 7(1), 51.
[4] Salameh, A. W., & Surakhi, O. M. (2020). An optimized convolutional neural network for handwritten digital recognition classification. Journal of Theoretical and Applied Information Technology, 98(21), 3494-3503.
[5] Grother, P. J., & Hanaoka, K. K. (1995). NIST special database 19. Handprinted forms and characters database, National Institute of Standards and Technology, 10, 69.