Statistical Analysis

As stated in the paper, for mobile devices, we conduct 5 parallel evaluations on each model and average the prediction accuracy to minimize random effects. According to Table 3 in the paper, we conclude from the decline of average prediction accuracy that the quantization process suffers from severe reliability issues on generated data. To verify the significance of this issue, we further conduct the Wilcoxon rank-sum test (a.k.a. the Mann–Whitney U test) on all the accuracy-dropping cases in the Generated column of Table 3.

Specifically, for LeNet-1 and LeNet-5, we independently sample 10,000 out of 25,000 generated MNIST images each time. Similarly, for ResNet-20 and VGG-16, we independently sample 10,000 out of 28,000 generated CIFAR-10 images each time. Note that since the generated MNIST/CIFAR-10 images have a similar number of samples for each label, the 10,000 randomly selected samples also follow that distribution. With these samples as inputs, we make 5 predictions on each pair of models (i.e., the transferred model and the quantized model) on each mobile device. Then we use the Wilcoxon rank-sum test to investigate whether the difference between the two groups of prediction accuracies is statistically significant. The null hypothesis is as below:
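The sampling step can be sketched as follows. This is a minimal illustration, not the evaluation code itself: the label array is a hypothetical stand-in for the generated MNIST data, assumed to be evenly spread over the 10 classes, and the seed is arbitrary.

```python
import random
from collections import Counter

random.seed(0)  # hypothetical seed, for reproducibility of this sketch only

# Stand-in for the 25,000 generated MNIST images: only the labels matter
# here, assumed (hypothetically) to be evenly spread over the 10 classes.
labels = [i % 10 for i in range(25000)]

# One of the 5 independent draws of 10,000 samples.
picked = random.sample(range(len(labels)), 10000)
counts = Counter(labels[i] for i in picked)

# Uniform random sampling preserves the label distribution in expectation:
# each class receives roughly 1,000 of the 10,000 selected samples.
print(dict(sorted(counts.items())))
```

Because the draw is uniform without replacement, the per-class counts follow a hypergeometric distribution concentrated around 1,000, which is why the sampled subset keeps the label balance of the full generated set.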

Prediction accuracy on the quantized model is not less than that on the transferred model.
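The test itself can be sketched in pure Python. This is an exact one-sided Wilcoxon rank-sum test over two groups of 5 accuracies; the accuracy values below are hypothetical, not taken from Table 3.

```python
from itertools import combinations

def ranks(values):
    """Assign 1-based ranks to values, averaging ranks over ties."""
    sorted_vals = sorted(values)
    rank_of = {}
    i = 0
    while i < len(sorted_vals):
        j = i
        while j < len(sorted_vals) and sorted_vals[j] == sorted_vals[i]:
            j += 1
        rank_of[sorted_vals[i]] = (i + j + 1) / 2  # average of ranks i+1..j
        i = j
    return [rank_of[v] for v in values]

def ranksum_test_less(x, y):
    """Exact one-sided Wilcoxon rank-sum test.

    Null: x does not tend to be smaller than y.
    Returns the exact p-value by enumerating all C(m+n, m)
    assignments of the pooled ranks to the first group.
    """
    pooled = ranks(list(x) + list(y))
    w_obs = sum(pooled[:len(x)])  # observed rank sum of group x
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        total += 1
        if sum(pooled[i] for i in idx) <= w_obs:
            hits += 1
    return hits / total

# Hypothetical accuracies from 5 runs of one quantized/transferred model pair
quantized   = [0.62, 0.60, 0.63, 0.61, 0.59]
transferred = [0.91, 0.90, 0.92, 0.89, 0.91]
p = ranksum_test_less(quantized, transferred)
print(p)  # 1/252 ≈ 0.004: below 0.05, so the null hypothesis is rejected
```

With 5 samples per group there are only C(10, 5) = 252 rank assignments, so the exact p-value is feasible to enumerate; library implementations such as scipy.stats.mannwhitneyu compute the same quantity.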

Below are the detailed statistical results of prediction accuracy on pairwise transferred and quantized models.

Summary

All tests give a p-value less than the significance level of 0.05, so the null hypothesis is rejected: a statistically significant accuracy decline occurs on the quantized models compared with the transferred ones. These statistical results further strengthen our initial conclusion in the paper, namely, that the quantization process suffers from severe reliability issues on generated data.