Towards Evaluating the Robustness of Neural Networks: Investigating the Relationship to Network Depth and Training Data

Steven DiBerardino Aaron Giuffré Qais Al Hajri Lowell Weissman

CS 4824 / ECE 4424 Machine Learning - Virginia Polytechnic Institute and State University

Figure 1: Carlini and Wagner's L2 adversary applied to each source/target pair for the first example of each class in the MNIST test dataset [1].

Introduction

This experiment seeks to validate and expand upon the results published by Nicholas Carlini and David Wagner in Towards Evaluating the Robustness of Neural Networks. In their experiment, Carlini and Wagner introduced three new attack algorithms which generate adversarial examples for neural networks.

Their experiment focused on attacking defensively-distilled networks. Defensive distillation has been researched as a way to protect neural networks against adversarial attack algorithms, and it had proved effective against earlier attacks.

Carlini and Wagner's new attack algorithms were able to successfully attack defensively-distilled networks. For every source/target pair, their algorithms generated an adversarial image which the neural network classified as the desired target class. Figure 1 shows their attacks on examples from the MNIST handwritten digit dataset.

The authors trained and attacked distinct networks for the MNIST, CIFAR, and ImageNet datasets. Their three attack algorithms successfully generated adversarial examples for 100% of source/target pairs.

In our contribution, we first attempt to reproduce the authors' results for their MNIST experiment. We then design new network architectures to attack and compare these results with the original. Finally, we experiment with attacking alternative data via the Fashion-MNIST dataset.

Attack Algorithms

The three novel attack algorithms developed by Carlini and Wagner are briefly summarized in this section. Each algorithm attempts to minimize the distortion required to make the source example be classified as the target class. There were three metrics used for quantifying distortion:

  • L0: the total number of pixels modified from the source.
  • L2: the standard Euclidean (root-mean-square) distance between the adversarial example and the source.
  • L∞: the maximum change made to any individual pixel.

A distinct attack algorithm was developed for each of these metrics. First, a common objective function was created to quantify how far the working adversary is from being classified as the target. A complete objective was then constructed for each algorithm by adding a term quantifying the distortion under the corresponding metric.
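As a concrete reference for these definitions, the short NumPy sketch below computes all three distortions between a source image and a candidate adversary. The function name and the assumption that pixels are scaled to [0, 1] are ours, not taken from the authors' code.

```python
import numpy as np

def distortion_metrics(source, adversary):
    """Compute the L0, L2, and L-infinity distortions between two images.

    Both inputs are float arrays of the same shape with pixel values in [0, 1]
    (the names and scaling are illustrative assumptions, not the authors' code).
    """
    delta = adversary - source
    l0 = np.count_nonzero(delta)          # number of pixels changed
    l2 = np.sqrt(np.sum(delta ** 2))      # Euclidean distance
    linf = np.max(np.abs(delta))          # largest single-pixel change
    return l0, l2, linf
```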

L2 Attack

The L2 attack is the simplest because its distortion metric is fully differentiable. The distortion component of the objective is simply the L2 metric itself. For each source/target pair, the algorithm performs gradient descent on the objective function for up to a set maximum number of steps. If the working adversary is classified as the target class on termination, the final L2 distortion is returned. Otherwise, the algorithm returns an indication of failure.
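The following is a minimal TensorFlow sketch of this style of L2 optimization, not the authors' implementation: it assumes `model` returns pre-softmax logits, clips to the valid pixel box rather than using the paper's change of variables, and fixes the distortion/misclassification trade-off constant rather than searching for it.

```python
import tensorflow as tf

def l2_attack_sketch(model, source, target_class, c=1.0, steps=1000, lr=0.01):
    """Simplified L2-style attack: gradient descent on distortion plus a
    misclassification term. Assumes `model` returns pre-softmax logits and
    `source` is a (1, H, W, C) float tensor with pixels in [0, 1]."""
    delta = tf.Variable(tf.zeros_like(source))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    num_classes = model(source).shape[-1]

    for _ in range(steps):
        with tf.GradientTape() as tape:
            adversary = tf.clip_by_value(source + delta, 0.0, 1.0)
            logits = model(adversary)[0]
            target_logit = logits[target_class]
            # Largest logit among the non-target classes.
            mask = tf.one_hot(target_class, num_classes, on_value=-1e9, off_value=0.0)
            other_logit = tf.reduce_max(logits + mask)
            misclass_loss = tf.maximum(other_logit - target_logit, 0.0)
            distortion = tf.reduce_sum(tf.square(adversary - source))
            loss = distortion + c * misclass_loss
        grads = tape.gradient(loss, [delta])
        optimizer.apply_gradients(zip(grads, [delta]))

    adversary = tf.clip_by_value(source + delta, 0.0, 1.0)
    success = int(tf.argmax(model(adversary)[0])) == target_class
    l2_distortion = float(tf.norm(adversary - source))
    return adversary, l2_distortion, success
```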

L0 Attack

Because it is defined over a discrete set of pixels, this metric is not differentiable, so an iterative approach was used instead of end-to-end gradient descent. The algorithm maintains a set of pixels that are allowed to change, initialized to include every pixel in the image. Gradient descent with the L2 objective is then performed, but only pixels in the changeable set may be modified; all other pixels are fixed and cannot be distorted.

After producing the adversary, the pixel with the smallest gradient contribution is removed from the changeable set. Then gradient descent is repeated with the smaller set. This process repeats until the L2 attack fails to create the target adversary, shrinking the L0 distortion metric with each successful iteration.
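The outer loop can be sketched as follows. The `masked_l2_attack` helper is a hypothetical stand-in for an L2-style attack restricted to the modifiable pixel set; its interface is our assumption, not the authors' code.

```python
import numpy as np

def l0_attack_sketch(masked_l2_attack, model, source, target_class):
    """High-level sketch of the iterative L0 procedure. `masked_l2_attack` is an
    assumed helper that runs an L2-style attack while only allowing pixels where
    `mask` is 1 to change, returning (adversary, objective_gradient, success)."""
    mask = np.ones_like(source, dtype=float)   # start with every pixel modifiable
    best_adversary = None

    while True:
        adversary, grad, success = masked_l2_attack(model, source, target_class, mask)
        if not success:
            break                              # shrinking the set further failed
        best_adversary = adversary
        # Score each still-modifiable pixel by |gradient * change| and freeze the
        # least important one.
        contribution = np.where(mask > 0,
                                np.abs(grad * (adversary - source)),
                                np.inf)
        least_important = np.unravel_index(np.argmin(contribution), mask.shape)
        mask[least_important] = 0.0

    return best_adversary
```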

L∞ Attack

Because the L∞ metric contains a non-differentiable max operation, this attack algorithm also uses an iterative approach. An upper bound, tau, is placed on the distortion of any individual pixel. The distortion term of the objective is replaced with a term that penalizes pixels which violate this upper bound. Gradient descent is then performed on this modified objective for a specified number of iterations.

After producing the adversary, if no pixel violates the upper bound, tau is reduced by a constant factor and the algorithm repeats. Otherwise, the algorithm terminates and returns the last valid upper bound. Thus, the upper bound is iteratively minimized until failure.
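A corresponding sketch of this outer loop is below. The inner `penalized_attack` helper, which performs gradient descent on the tau-penalized objective, is again a hypothetical stand-in whose interface we assume.

```python
import numpy as np

def linf_attack_sketch(penalized_attack, model, source, target_class,
                       tau=1.0, shrink=0.9):
    """High-level sketch of the iterative L-infinity procedure. `penalized_attack`
    is an assumed helper that minimizes the common objective plus the penalty
    sum_i max(0, |delta_i| - tau) and returns (adversary, success)."""
    best_adversary, best_tau = None, None

    while tau > 1e-3:                          # illustrative stopping floor
        adversary, success = penalized_attack(model, source, target_class, tau)
        if not success:
            break
        if np.max(np.abs(adversary - source)) > tau:
            break                              # a pixel violates the current bound
        best_adversary, best_tau = adversary, tau
        tau *= shrink                          # tighten the bound and try again

    return best_adversary, best_tau
```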

Experiment

Additional Data

Carlini and Wagner's experiment used the CIFAR, ImageNet, and MNIST datasets to build the models they attacked. The MNIST dataset is a well-known collection of hand-drawn digits, 0-9. The CIFAR dataset is another well-known image dataset with ten classes, including dogs, cats, airplanes, and trucks. ImageNet is a very large dataset containing over 14 million images representing 1000 classes.

To validate against additional data, we trained a model on the Fashion-MNIST dataset using the original MNIST architecture from Carlini and Wagner's experiment. This dataset is composed of images which can be classified into one of ten types of clothing, including t-shirts, trousers, and boots.
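For reference, Fashion-MNIST can be loaded directly through Keras. The sketch below shows the shapes involved; the [0, 1] pixel scaling is our own choice here, not necessarily the preprocessing used by the attack code.

```python
import tensorflow as tf

# Fashion-MNIST ships with Keras; image shapes and label ranges match MNIST, so a
# model built for the original MNIST experiment can be reused unchanged.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype("float32") / 255.0            # scale pixels to [0, 1]
x_test = x_test.astype("float32") / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)               # add the channel dimension
x_test = x_test.reshape(-1, 28, 28, 1)
print(x_train.shape, y_train.min(), y_train.max())     # (60000, 28, 28, 1) 0 9
```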

Figure 2: Example images from the Fashion MNIST Dataset [2].
Figure 3: Our variations on Carlini and Wagner's network architectures.

Additional Architectures

To test Carlini and Wagner's attacks on several more neural network architectures, we constructed four modifications of their MNIST experiment architecture. We named the new architectures Squeeze, Slice, Sliced-Squeeze, and Expansion. Figure 3 describes our architectures at a high level. To summarize, the four variations are as follows (a Keras sketch of one variant follows the list):

  • Squeeze: If any layer appears more than once in a row, all but the first occurrence are removed.
  • Slice: If a sequence of layers appears more than once (such as Conv-Conv-MaxPool), remove all but the first occurrence of that group.
  • Sliced Squeeze: Perform a slice modification and then a squeeze operation.
  • Expansion: If a layer appears multiple times consecutively, add another occurrence of that layer.
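To make these modifications concrete, below is a minimal Keras sketch of the Squeeze variant. The layer sizes are based on our reading of the MNIST architecture in [1] (two 32-filter convolutions, pooling, two 64-filter convolutions, pooling, and two 200-unit fully-connected layers), not on the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def squeeze_mnist_model():
    """Sketch of our Squeeze variant, assuming the original stack in [1] is
    Conv32-Conv32-Pool-Conv64-Conv64-Pool-FC200-FC200-Softmax: each run of
    repeated layers keeps only its first occurrence."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        layers.Dense(10),   # logits; the softmax is applied in the loss/attack code
    ])
```

The other variants are built the same way, removing or duplicating individual layers or whole Conv-Conv-Pool groups according to the rules above.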

Results

Fashion MNIST Experiment

When testing the attack algorithms, we used one image as input. This image was of an ankle boot, labelled as class 9. For each attack algorithm, we attempted to create nine adversarial images based on the input image, one targeting each of the other class labels. For example, one attack would attempt to modify the picture of the boot so that the neural network classifies it as a pair of trousers. Figure 4 shows one such adversarial image.

Figure 4: The Fashion MNIST image used as input to the attack algorithms (left), and an adversarial image from the L2 attack (right) which was classified as a "dress."
Figure 5: Performance of the attack algorithms on the original architecture trained on the Fashion-MNIST dataset. Note that mean distortions are taken only over successful attacks.

The attack algorithms succeeded consistently on the defensively-distilled model. However, they mostly failed on an undistilled model. As shown in the "prob" column in Figure 5, the L2 attack was able to create an adversarial image for three out of nine target classes on the tested image. The L0 and L∞ attacks only succeeded in creating two out of nine adversarial images.

In contrast, the defensively-distilled network was fooled by every attack algorithm with a 100% success rate, but with some additional distortion to the image on average. The image shown in Figure 4 is the class 4 target in the L2 attack on the distilled model. As shown in Figure 5, this was the image which was modified most heavily by the L2 attack algorithm, yet the changes are almost imperceptible to a human observer.

Architecture Modifications Experiment

As with Carlini and Wagner's original MNIST architecture, each new network architecture was constructed using the Keras/TensorFlow framework. These networks were trained using optimizer parameters (learning rate, momentum, decay rate, etc.) consistent with the paper's original MNIST experiment. The number of training epochs and the batch size were also unchanged. After training each model variation, its defensively-distilled version was trained using Carlini's recommended distillation temperature of 100. All model training was performed using Google Colaboratory.
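For readers unfamiliar with the distillation step, the sketch below outlines the general defensive distillation recipe that the temperature of 100 plugs into: a teacher is trained with its softmax at temperature T, its soft predictions become the training labels for a student of the same architecture, and the student is later evaluated at temperature 1. The `build_model` builder, optimizer, epoch count, and batch size are placeholders, not our exact training settings.

```python
import tensorflow as tf

def train_distilled(build_model, x_train, y_train, temperature=100,
                    epochs=50, batch_size=128):
    """Generic defensive distillation sketch. `build_model` is assumed to return
    a fresh logits-producing Keras model; hyperparameters are placeholders."""
    def with_temperature(model):
        # Divide the logits by T before the softmax.
        scaled = tf.keras.layers.Lambda(lambda z: z / temperature)(model.output)
        return tf.keras.Model(model.input, tf.keras.layers.Softmax()(scaled))

    teacher = with_temperature(build_model())
    teacher.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
    teacher.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)

    soft_labels = teacher.predict(x_train, batch_size=batch_size)  # soft targets

    student = with_temperature(build_model())
    student.compile(optimizer="sgd", loss="categorical_crossentropy")
    student.fit(x_train, soft_labels, epochs=epochs, batch_size=batch_size)
    return student
```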

The L2, L0, and L∞ attacks were first performed for one source/target set on each undistilled and distilled network. The source used was the default example "7" attacked by Carlini and Wagner. To reduce bias in our results, the L2 attacks were then repeated for each of the other source digit examples attacked by the authors. The iterative L0 and L∞ attacks require much more computation, so they were not repeated given our team's limited computing resources. Similarly, the maximum number of gradient descent steps for the L2 attack was set to 1,000 to keep computation viable, whereas Carlini and Wagner used 10,000 steps. Distortion statistics were calculated only over successful attacks, as in the original work.

A summary of our attack results for each architecture is provided below. Note that best case and worst case categories refer to the performance of the attack, not the robustness of the network. Best case describes the attacked source which took the lowest mean distortion to create an adversarial example, and worst case describes that which took the greatest mean distortion. Finally, the "prob" category describes the fraction of attacks which succeeded in generating an adversary classified as the target.

Original (Reproduced)

Figure 6: Attack algorithm performance on Carlini and Wagner's original MNIST architecture.

As seen in Figure 6, our reproduction of the original experiment performed similarly to what Carlini and Wagner report. The mean distortion of the authors' average case L2 attack was 1.76 on the undistilled network and 2.2 on the distilled network [1]. Our L2 values were 1.77 and 1.89, respectively. Our mean L∞ distortion was slightly higher for the undistilled network (0.19 vs 0.16) and approximately equal for the distilled network [1]. Our mean L0 distortions were also slightly higher than the authors' means of 16 and 19 [1].

Defensive distillation seemed to provide a slight edge in robustness against the L2 attack. The maximum distortions for the L0 and L∞ attacks were also higher on the distilled model, although their mean distortions were lower.

Slice

Figure 7: Attack algorithm performance on our Sliced architecture.

Figure 7 illustrates that, in both its best and worst cases, the L2 attack performs better on the original undistilled model than on its Slice counterpart, though with a wider variance on the original. The same can be said of the L0 and L∞ attacks on the undistilled models, except that the Slice model shows the larger variance.

For the most part, the same holds for the distilled models, with the exception of the L0 attack; the variances of the two architectures are roughly even overall.

In general, the probability of attack success is similar for the Slice and original architectures.

Squeeze

Figure 8: Attack algorithm performance on the Squeeze architecture.

Figure 8 demonstrates that the Squeeze architecture was typically fooled with less distortion than the original. For the L2 attack, the median-case mean distortion was lower on the Squeeze model for both the undistilled and distilled architectures. The L0 and L∞ attacks exhibit a similar trend of lower (or roughly equal) distortion required to fool the network.

Compared to the original, the benefits of defensive distillation are far more pronounced on the Squeeze architecture, especially for the L2 and L∞ attacks.

Sliced-Squeeze

Figure 9: Attack algorithm performance on the Sliced-Squeeze architecture.

As Figure 9 shows, the adversarial attacks performed much better on the undistilled Sliced-Squeeze MNIST model than on the undistilled original MNIST model. For example, the minimum, mean, and maximum numbers of pixels changed by the L0 attack on the undistilled Sliced-Squeeze model are all lower than those of the undistilled original model, indicating that the attack needed to modify fewer pixels to push the model to the target class.

The L2 and L0 attacks performed similarly overall on the distilled Sliced-Squeeze MNIST model compared to the distilled original MNIST model. In contrast, the L∞ attack performed better, producing a lower maximum pixel distortion than on the distilled original model.

Expansion

Figure 10: Attack algorithm performance on our Expanded architecture.

Out of all the models, the attacks against the undistilled Expansion model had the poorest chance of successfully generating an adversarial example, as Figure 10 documents. This weakens the statistical significance of the undistilled Expansion data, although the attacks do appear to perform worse than on the original undistilled model.

Unlike on its undistilled counterpart, the attacks generated adversarial examples on the distilled Expansion model with a 100% success rate. Interestingly, the attacks also performed better on the distilled Expansion model than on the original distilled model.


Discussion

Fashion MNIST Dataset

The attack algorithms performed very well on one input image against a defensively-distilled model, but very poorly against an undistilled model. This performance further supports Carlini and Wagner's claim that their attack algorithms are able to circumvent defensive distillation, and it is consistent with their conclusion that "defensive distillation does not significantly increase the robustness of neural networks" [1].

However, we are unable to confirm Carlini and Wagner's claim that these attacks "are successful on both distilled and undistilled neural networks with 100% probability" [1]. On the tested input image to an undistilled neural network, the L0, L2, and L∞ attacks were only successful on 22%, 33%, and 22% of target classes, respectively. It is worth noting that the only two target classes on which every attack algorithm succeeded were the "sandal" and "sneaker" classes. The input image was a boot, so the attack algorithms were only able to create adversarial images targeting the two most visually similar classes. That is, they could only transform the boot into the two other types of footwear.

Additional input images from the Fashion-MNIST dataset should be attacked before concluding whether the attack algorithms are reasonably successful on this undistilled model. It is possible that this particular image was unusually difficult to attack and that most other images would yield higher success rates. It is also possible that the attacks simply do not work well on the Fashion-MNIST dataset because a third of the classes are very similar types of footwear, while the remaining classes are substantially different. Regardless of the cause, this single experiment does not provide enough data to confidently demonstrate that Carlini and Wagner's attack algorithms perform poorly on undistilled networks. But we hope this experiment will prompt additional investigations into how well their attacks generalize to more complex data.

Architecture Modifications

In this experiment, we explored the relationship between network model complexity and attack capability. Overall, the adversarial attacks' performance was inversely correlated with the depth of the model: the attack algorithms fooled the shallower models with less distortion. The attacks performed best on the Sliced-Squeeze model, the simplest model analyzed. Conversely, the attack algorithms not only performed poorly on the Expansion model, but also failed outright in multiple cases.

The increase in attack performance on simpler networks is likely attributable to faster convergence of the attack objective: each objective converged on better adversarial examples with fewer gradient descent steps on the smaller models. It is somewhat intuitive that lower-dimensional models would be easier to fool with targeted attacks, but a rigorous mathematical explanation for this trend is absent. We leave this as an open question for further research.

The Expansion architecture shows a less consistent version of this inverse relationship between attack performance and model complexity. Distortion values are typically higher for the L0 and L∞ attacks, but lower for the L2 attack. However, the attacks failed to generate a valid target much more frequently on the undistilled Expansion model. These failures are likely due to the lower maximum number of gradient descent steps we used in our L2 attacks; given more steps, additional valid adversaries may have been found. Regardless, our results show that attack success is negatively correlated with model complexity on undistilled models.

Defensive distillation was typically correlated with increased average robustness, but this correlation was noisy. For example, in the original architecture, each mean distortion metric for the default "7" example was actually lower on the distilled counterpart. Furthermore, we achieved a perfect attack success rate against the distilled Expansion model, but a very poor one against the undistilled Expansion model. Similar results were observed in the Fashion-MNIST experiment. We observe a trend in which defensive distillation provided more pronounced benefits more consistently on the simpler models. We leave an explanation for this trend as another open question for further research.

Our Team

Steven DiBerardino

Aaron Giuffré

Qais Al Hajri

Lowell Weissman

References:

[1] N. Carlini and D. Wagner, "Towards Evaluating the Robustness of Neural Networks," arXiv.org, 2019. [Online]. Available: https://arxiv.org/abs/1608.04644. [Accessed: 02-May-2019].

[2] "Fashion MNIST", Kaggle.com, 2019. [Online]. Available: https://www.kaggle.com/zalando-research/fashionmnist. [Accessed: 02- May- 2019].