Appendix I: Results on Other Metrics
As noted in our main paper, when assessing the security of existing TSDP solutions, we consider seven metrics. This appendix first revisits the employed security metrics and then reports the results for the security metrics omitted from the main paper.
A. Security Metrics
Accuracy [81] measures the fraction of test samples that the attacker's surrogate model classifies correctly. Achieving high accuracy is a primary goal of model stealing attacks.
Fidelity [81] is the percentage of test samples on which the surrogate model and the victim model produce identical predictions, including samples that the victim model misclassifies.
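For concreteness, both metrics can be computed in a single pass over the test set. The sketch below assumes PyTorch models and a standard (input, label) test loader; all function and variable names are illustrative rather than taken from the evaluated artifacts.

```python
# Minimal sketch of the accuracy and fidelity metrics, assuming PyTorch
# models and a test loader yielding (input, label) batches.
import torch

@torch.no_grad()
def accuracy_and_fidelity(surrogate, victim, test_loader, device="cpu"):
    surrogate.eval(); victim.eval()
    correct, agree, total = 0, 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        pred_s = surrogate(x).argmax(dim=1)         # surrogate prediction
        pred_v = victim(x).argmax(dim=1)            # victim prediction
        correct += (pred_s == y).sum().item()       # accuracy numerator
        agree += (pred_s == pred_v).sum().item()    # fidelity numerator
        total += y.size(0)
    return correct / total, agree / total
```

Note that fidelity is measured against the victim's predictions rather than the ground-truth labels, so a surrogate that faithfully replicates a weak victim can score high fidelity despite modest accuracy.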
Attack Success Rate (ASR) [82] is the percentage of adversarial samples generated with the surrogate model that successfully mislead the victim model. It measures the transferability of adversarial samples [79], [82]. We use the popular PGD attack [58] to generate adversarial samples. Following [82], we use the L-infinity norm, epsilon = 0.03, and 7 iteration steps for all datasets. We adopt the PGD implementation from public tools [23].
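The sketch below illustrates this transferability measurement under the stated setting (L-infinity norm, epsilon = 0.03, 7 steps). The evaluation relies on a public PGD implementation [23]; here PGD is written out by hand only so the example is self-contained, and the step size alpha and the random start are our assumptions since the text does not specify them. A sample counts as a success when the victim mispredicts the adversarial example; some works instead restrict the count to samples the victim originally classified correctly.

```python
# Illustrative sketch of ASR: craft adversarial examples on the surrogate
# with PGD, then test whether they transfer to the victim model.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=7):
    # Random start in the epsilon-ball, then iterated sign-gradient ascent
    # with projection back onto the ball and the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def attack_success_rate(surrogate, victim, test_loader, device="cpu"):
    surrogate.eval(); victim.eval()
    fooled, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(surrogate, x, y)       # craft on the surrogate
        with torch.no_grad():
            pred_v = victim(x_adv).argmax(dim=1)  # transfer to the victim
        fooled += (pred_v != y).sum().item()
        total += y.size(0)
    return fooled / total
```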
Generalization Gap [110] is the difference between the average accuracies on the victim model's training dataset and the test dataset. The more a model memorizes private information from the training dataset, the larger its generalization gap. The generalization gap correlates strongly with vulnerability to membership inference attacks [109].
Confidence Gap [110] is the difference in average prediction confidence between the victim model's training dataset and the test dataset. Similar to the generalization gap, the confidence gap positively correlates with the extent to which the surrogate model memorizes the training data.
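Both gaps compare an average statistic on the victim model's training data with the same statistic on the test data. Below is a minimal sketch, assuming a PyTorch model and loaders over the two datasets; defining confidence as the top-1 softmax probability is our assumption, as the text does not pin down the exact statistic.

```python
# Sketch of the generalization gap and confidence gap metrics.
import torch
import torch.nn.functional as F

@torch.no_grad()
def acc_and_confidence(model, loader, device="cpu"):
    # Average top-1 accuracy and average top-1 softmax confidence.
    model.eval()
    correct, conf_sum, total = 0, 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        probs = F.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        correct += (pred == y).sum().item()
        conf_sum += conf.sum().item()
        total += y.size(0)
    return correct / total, conf_sum / total

def gaps(model, train_loader, test_loader):
    # Generalization gap: train accuracy minus test accuracy.
    # Confidence gap: mean train confidence minus mean test confidence.
    train_acc, train_conf = acc_and_confidence(model, train_loader)
    test_acc, test_conf = acc_and_confidence(model, test_loader)
    return train_acc - test_acc, train_conf - test_conf
```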
Confidence-Attack Accuracy [69], [3] represents the accuracy of membership classification based on the model's output confidence. The attack algorithm takes the model posterior as input to infer the membership of a data sample. We use the "Black-Box/Shadow" implementation of ML-DOCTOR [2], [3].
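The following is a schematic of this shadow-model attack in the spirit of ML-DOCTOR's black-box/shadow setting [2], [3], not a reproduction of its code: the attack classifier architecture, the use of sorted posteriors as features, and the training loop are our simplifying assumptions.

```python
# Schematic confidence-based (black-box/shadow) membership attack:
# train an attack classifier on a shadow model's posteriors, then
# measure membership prediction accuracy against the target model.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def sorted_posteriors(model, loader, device="cpu"):
    # Softmax posteriors, sorted so features are class-order invariant.
    model.eval()
    probs = torch.cat([F.softmax(model(x.to(device)), dim=1).cpu()
                       for x, _ in loader])
    return probs.sort(dim=1, descending=True).values

def confidence_attack(shadow, target, s_members, s_nonmembers,
                      t_members, t_nonmembers):
    # Attack training set: shadow members labeled 1, non-members 0.
    f_in = sorted_posteriors(shadow, s_members)
    f_out = sorted_posteriors(shadow, s_nonmembers)
    x_tr = torch.cat([f_in, f_out])
    y_tr = torch.cat([torch.ones(len(f_in)), torch.zeros(len(f_out))]).long()
    attack = nn.Sequential(nn.Linear(x_tr.size(1), 64),
                           nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(attack.parameters(), lr=1e-3)
    for _ in range(100):                   # train the attack classifier
        opt.zero_grad()
        F.cross_entropy(attack(x_tr), y_tr).backward()
        opt.step()
    # Evaluate membership prediction accuracy on the target model.
    g_in = sorted_posteriors(target, t_members)
    g_out = sorted_posteriors(target, t_nonmembers)
    x_te = torch.cat([g_in, g_out])
    y_te = torch.cat([torch.ones(len(g_in)), torch.zeros(len(g_out))]).long()
    with torch.no_grad():
        pred = attack(x_te).argmax(dim=1)
    return (pred == y_te).float().mean().item()
```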
Gradient-Attack Accuracy [70] represents the accuracy of the white-box membership attack based on the model's internal gradients. This attack uses gradient information and the loss value to predict data membership. We use the "White-Box/Shadow" attack implementation of ML-DOCTOR [2], [3].
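A condensed sketch of the white-box setting follows. Using only the per-sample loss and the gradient norm of the final classification layer as attack features is our simplification; ML-DOCTOR's white-box attack combines richer inputs. The shadow-then-target procedure mirrors the confidence attack above.

```python
# Schematic white-box (gradient-based) membership attack sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_features(model, last_layer, loader, device="cpu"):
    # Per-sample features: (loss value, gradient norm of the final
    # layer's weights). Samples are processed one at a time so each
    # gradient is truly per-sample.
    model.eval()
    feats = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        for i in range(x.size(0)):
            loss = F.cross_entropy(model(x[i:i + 1]), y[i:i + 1])
            g = torch.autograd.grad(loss, last_layer.weight)[0]
            feats.append([loss.item(), g.norm().item()])
    return torch.tensor(feats)

def gradient_attack(shadow, shadow_fc, target, target_fc,
                    s_members, s_nonmembers, t_members, t_nonmembers):
    # Fit a tiny logistic attack model on shadow features, then report
    # membership prediction accuracy on the target model.
    f_in = grad_features(shadow, shadow_fc, s_members)
    f_out = grad_features(shadow, shadow_fc, s_nonmembers)
    x_tr = torch.cat([f_in, f_out])
    y_tr = torch.cat([torch.ones(len(f_in)), torch.zeros(len(f_out))]).long()
    attack = nn.Linear(2, 2)
    opt = torch.optim.Adam(attack.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(attack(x_tr), y_tr).backward()
        opt.step()
    g_in = grad_features(target, target_fc, t_members)
    g_out = grad_features(target, target_fc, t_nonmembers)
    x_te = torch.cat([g_in, g_out])
    y_te = torch.cat([torch.ones(len(g_in)), torch.zeros(len(g_out))]).long()
    with torch.no_grad():
        pred = attack(x_te).argmax(dim=1)
    return (pred == y_te).float().mean().item()
```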
B. TSDP Solution Evaluation
This part presents the evaluation results of the representative defense schemes from Sec. III-E. Tables XIV to XVIII report the results of the five metrics not included in the main paper; the overall findings are consistent with the findings and lessons summarized there. Note that the confidence gap and generalization gap of our approach and of random-guess are 0% because, for a surrogate model that never sees the victim model's training data, the scores on the training and test datasets are identical.