DeepStellar: Model-Based Quantitative Analysis of
Stateful Deep Learning Systems
Abstract
Deep Learning (DL) has achieved tremendous success in many cutting-edge applications. However, state-of-the-art DL systems still suffer from many quality issues. While some recent progress has been made on the analysis of feed-forward DL systems, little work has been done on Recurrent Neural Network (RNN)-based stateful DL systems, which are widely used in audio, natural language, and video processing.
In this paper, we initiate the very first step towards the quantitative analysis of RNN-based DL systems. We model RNN as an abstract state transition system to characterize its internal behaviors. Based on the abstract model, we then design two trace similarity metrics and five testing criteria which enable the quantitative analysis of RNNs. We further propose two algorithms powered by the quantitative measures for adversarial sample detection and coverage-guided test generation. We evaluate DeepStellar on four RNN-based systems covering image classification and automated speech recognition. The results demonstrate that the abstract model is useful in characterizing the internal behaviors of RNNs, and further confirm that (1) the similarity metrics could effectively capture the differences even when the samples have a very small perturbation (achieving 93% accuracy for detecting adversarial samples of RNNs) and (2) the coverage criteria are useful in revealing erroneous behaviors (generating three times more adversarial samples than random testing).
Examples with minor perturbation
With only minor perturbation, we expect the perturbed samples to receive the same prediction as the original seeds.
Audio
Through minor perturbation, the newly generated audios can still be transcribed perfectly, exactly as the initial seeds are, or they can incur some word errors.
For the audio samples with less accurate transcriptions, we do not call them adversarial, since they are not transcribed into a totally different sentence; only some inaccuracy is manifested.
We provide some perturbed audio samples, which you can download here.
Image
Through minor perturbation, the newly generated images can be benign (receiving the correct prediction) or adversarial (receiving an incorrect prediction), as illustrated below.
[Images: a seed image, followed by generated benign samples (same prediction as the seed) and adversarial samples (different prediction), each annotated with its predicted label (pred: 2, 1, 2, 8, 8, 8, 7, 7, 2, 8).]
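To make the generation process concrete, below is a minimal sketch of how such minor image perturbations could be produced: bounded random noise is added to a seed, and each result is labeled benign or adversarial by comparing its prediction with the seed's. The `model.predict` interface, the bound `eps`, and the helper name are illustrative assumptions, not the exact procedure used by DeepStellar.

```python
import numpy as np

def perturb_and_label(model, seed, eps=0.05, n_samples=10, rng=None):
    """Generate minor perturbations of `seed` and split them into benign
    samples (same prediction as the seed) and adversarial samples.

    Assumptions: `model.predict` maps a batch of images to class labels,
    pixel values lie in [0, 1], and `eps` bounds each pixel's perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    seed_label = model.predict(seed[np.newaxis])[0]
    benign, adversarial = [], []
    for _ in range(n_samples):
        noise = rng.uniform(-eps, eps, size=seed.shape)
        sample = np.clip(seed + noise, 0.0, 1.0)
        label = model.predict(sample[np.newaxis])[0]
        (benign if label == seed_label else adversarial).append((sample, label))
    return benign, adversarial
```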
Adversarial audio samples generated with open-source tools
Please refer to the tool author's website.
The quality of the adversarial samples we generated matches that of the examples provided by the author, for which the manipulation is imperceptible to humans.
Intuitive explanation of the effectiveness of our adversarial example detector
Image example
[Images: an example wrongly predicted as 3, a reference example of class 3, and an example correctly predicted as 3.]
When conducting adversarial example detection, we start with the prediction results, since they are the only information available. In the example shown above, we aim to identify the adversarial examples wrongly predicted as 3 by our DL model. We assume a set of reference samples that are known to be benign and are predicted as 3. The reference set is expected to reflect the real range of class 3: the larger the reference set, the better the detection performance.
The detection is based on trace similarity. In particular, we take the largest similarity against all samples in the reference set as the only detection feature: the larger this similarity, the more likely the input is benign. For the example of a 5 wrongly predicted as 3, it is expected to cover abstract states different from those of the samples in the reference set, thus resulting in a lower trace similarity. With a trained logistic regression classifier, we achieve up to 93% detection accuracy under a proper abstraction configuration.
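A minimal sketch of this detection pipeline is given below, assuming a trace is represented as the set of abstract state ids it covers. The Jaccard-style similarity is only a stand-in for the STSim/TTSim metrics defined in the paper, and the trace and data names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def jaccard_similarity(trace_a, trace_b):
    """Stand-in trace similarity: Jaccard index over the abstract
    states covered by two traces (sets of state ids)."""
    a, b = set(trace_a), set(trace_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def detection_feature(trace, reference_traces):
    """The single detection feature: the largest similarity between the
    input's trace and any reference trace of its predicted class."""
    return max(jaccard_similarity(trace, ref) for ref in reference_traces)

def train_detector(benign_traces, adv_traces, reference_traces):
    """Fit a logistic regression detector on the scalar feature.
    `benign_traces` / `adv_traces` are traces of samples labeled offline."""
    feats = [[detection_feature(t, reference_traces)]
             for t in benign_traces + adv_traces]
    labels = [0] * len(benign_traces) + [1] * len(adv_traces)
    return LogisticRegression().fit(np.array(feats), labels)
```

At detection time, a new input predicted as 3 is mapped to the same scalar feature against the class-3 reference set, and the classifier's output decides benign versus adversarial.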
More evaluation results of RQ1
RQ1: correlation between trace similarity and prediction difference
We perform a statistical analysis of the correlation between the trace similarity and the prediction difference over two sets of data: the naturally benign samples from the test dataset and the slightly perturbed samples. The prediction difference of the latter is more difficult to capture, because a sample and its slightly perturbed counterpart are perceived as almost identical by humans.
The prediction difference in ASR is computed as the word-level Levenshtein distance between the transcripts, while the prediction difference in image classification is computed by checking whether the two samples belong to different classes.
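For concreteness, the word-level Levenshtein distance between two transcripts can be computed with the standard dynamic program below; this is shown only to make the ASR prediction-difference measure precise and is not taken from the DeepStellar implementation.

```python
def word_levenshtein(ref: str, hyp: str) -> int:
    """Word-level Levenshtein (edit) distance between two transcripts:
    the minimum number of word insertions, deletions, and substitutions
    needed to turn `ref` into `hyp`."""
    r, h = ref.split(), hyp.split()
    # dp[j] holds the distance between a prefix of r and h[:j],
    # updated row by row to keep memory linear in len(h).
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (r[i - 1] != h[j - 1]),  # substitution (0 if equal)
            )
    return dp[-1]

# Example: one substitution -> distance 1
assert word_levenshtein("open the door", "open the window") == 1
```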
Setup of the naturally benign samples
For the evaluation on naturally benign samples, we randomly select 1,000 benign samples from the test dataset as the natural samples, which are perceptibly different from each other. We compute the trace similarity and the prediction difference for each pair of natural samples.
Setup of the slightly perturbed samples
For audio, we generate 10,000 perturbed samples from 100 benign seeds. For the image case, the 10,000 generated perturbed samples are all benign, with exactly the same prediction results as their initial counterparts, so the correlation analysis cannot be performed on them alone. Therefore, we further generate 10,000 adversarial samples from 100 benign seeds. In total, 20,000 mixed samples, including both perturbed and adversarial samples, are used for the analysis. Here the trace similarity and the prediction difference are taken for each seed-perturbation pair.
Statistical analysis
For the ASR models, we use the Spearman rank-order correlation (denoted as ρ) to analyze the monotonic association between the two variables. For the MNIST models, we use the Mann-Whitney U test (denoted as U) to check the binary association.
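Both tests are available in SciPy, so the analysis can be sketched as below; the input arrays are placeholders for the (similarity, difference) measurements collected above.

```python
from scipy.stats import spearmanr, mannwhitneyu

def correlation_tests(similarities, differences, benign_sims, adv_sims):
    """Run the two statistical tests on collected measurements.

    ASR: Spearman's rho between trace similarity and word-level
    prediction difference over all (seed, perturbation) pairs.
    MNIST: one-sided Mann-Whitney U test that benign samples have
    larger trace similarity to their seeds than adversarial ones."""
    rho, rho_p = spearmanr(similarities, differences)
    u_stat, u_p = mannwhitneyu(benign_sims, adv_sims, alternative="greater")
    return (rho, rho_p), (u_stat, u_p)
```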
Results
Table 3 shows the results of the correlation between trace similarity and prediction difference, where Columns ρ (st.) and U (st.) present the results of STSim, and Columns ρ (tr.) and U (tr.) present the results of TTSim. The best two results in each column are highlighted in bold. The results show that all the reported correlations are statistically significant, even on slightly perturbed samples. A negative Spearman correlation indicates that the larger the similarity metrics are, the less different the predicted transcripts are. For the MNIST models, the Mann-Whitney U test results on the perturbed dataset indicate that, when measuring trace similarity against the initial seeds, benign samples obtain significantly larger values than adversarial ones (at significance level p < 0.01).