In RQ3, we explore the following three aspects: 1) the victim model's output/sampling hyperparameters, 2) the number of in-context examples, and 3) the number of issued queries.
Fig. 3(a) reports the evaluation results. For each setting, we issue 50 queries, collect the answers yielded by the LLM, and compare them with the ground truth.
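A minimal sketch of this query-and-collect loop is given below. The `query_victim` helper and the exact-match comparison are assumptions for illustration; the paper's actual scoring metric may differ.

```python
# Sketch of the per-setting evaluation loop described above.
# query_victim() is a hypothetical wrapper around the victim LLM's API;
# exact-match scoring is an assumption, not necessarily the paper's metric.
def evaluate_setting(queries, ground_truths, query_victim):
    answers = [query_victim(q) for q in queries]      # issue queries, collect answers
    matches = sum(a.strip() == g.strip()              # compare with ground truth
                  for a, g in zip(answers, ground_truths))
    return matches / len(queries)                     # accuracy over the 50 queries
```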
With only 5% of the queries, our imitation model $M_{imi}$ trained on the collected data achieves a notable improvement of 41.97% on CodeT5 compared to $M_{proxy}$, and the performance keeps increasing with more queries, peaking at a 115.43% improvement.
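We read these percentages as relative improvements over the proxy baseline; assuming a task metric $s(\cdot)$ (e.g., BLEU), this corresponds to
\[
\Delta = \frac{s(M_{imi}) - s(M_{proxy})}{s(M_{proxy})} \times 100\%,
\]
so a 41.97% improvement means $s(M_{imi}) \approx 1.42\, s(M_{proxy})$ under this reading.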
We note that for the other two tasks, increasing the number of queries likewise helps alleviate the insufficient-data problem and boosts performance.
The outcomes are shown in Table VI: performance improves as the number of in-context examples grows, peaking at E = 4.
To balance cost and imitation performance, we suggest fixing the number of examples to 3; a sketch of how such a prompt might be assembled follows.
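The template strings and field layout below are illustrative assumptions, not the exact prompt format used in our queries.

```python
# Sketch: build a prompt with E in-context examples followed by the test input.
# The "Input:/Output:" template and separators are illustrative assumptions.
def build_prompt(examples, test_input, E=3):
    shots = examples[:E]  # E in-context demonstrations (we suggest E = 3)
    demo = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in shots)
    return f"{demo}\n\nInput: {test_input}\nOutput:"
```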
The output/sampling hyperparameters have an observable (yet not significant) impact on the attack performance. In contrast, the number of queries and the number of in-context examples affect it notably.
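The sweep over output/sampling hyperparameters can be pictured as a simple grid search; the value grids and the `query_victim(prompt, temperature, top_p)` signature below are assumptions for illustration, not the paper's exact configuration.

```python
import itertools

# Sketch: sweep the victim model's sampling hyperparameters and measure
# imitation accuracy per setting. Grids and signature are assumptions.
def sweep_sampling(queries, ground_truths, query_victim):
    results = {}
    for temp, top_p in itertools.product([0.2, 0.7, 1.0], [0.9, 0.95, 1.0]):
        answers = [query_victim(q, temperature=temp, top_p=top_p) for q in queries]
        acc = sum(a.strip() == g.strip()
                  for a, g in zip(answers, ground_truths)) / len(queries)
        results[(temp, top_p)] = acc
    return results
```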