In this section, three experiments are conducted to illustrate the selection process for the principal features in LLMs and the hyperparameters in SVM. Specifically, the parameters C and degree in the SVM's polynomial kernel, and the sampling rate K, require elucidation.
The parameter C, commonly referred to as the regularization parameter, controls the trade-off between achieving a low error on the training data and minimizing the model complexity for better generalization to new data. A higher value of C fits the training set as closely as possible (higher model complexity), while a lower value yields a model that may fit the training set less well but generalizes better.
On the other hand, degree pertains to the degree of the polynomial kernel function and is crucial for defining the complexity of the decision surface. A higher degree results in more complex decision boundaries, capable of capturing more intricate patterns in the data. However, this also increases the risk of overfitting, particularly in scenarios with noise and limited data samples.
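The interplay between C and degree can be illustrated with a minimal scikit-learn sketch. The dataset and feature dimensions here are illustrative placeholders, not GlitchProber's actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative binary classification data (not GlitchProber's real features).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep the regularization strength C and the polynomial degree.
for C in (0.1, 1, 10):
    for degree in (2, 3, 5):
        clf = SVC(kernel="poly", C=C, degree=degree).fit(X_tr, y_tr)
        print(f"C={C:<4} degree={degree}  test acc={clf.score(X_te, y_te):.3f}")
```

A grid sweep of this form, scored by F1 rather than accuracy, is how candidate hyperparameter groups can be compared in practice.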
As depicted in the table above, no significant differences are observed in the time consumption across the various groups of hyperparameters and features. Therefore, the average F1-score is considered for comparison. The F1-score without post-processing is chosen for this comparison, as it more accurately reflects the inherent effectiveness of the different features and hyperparameters. The following figure clearly shows that the hyperparameter group in the lower right corner achieves the highest F1-score of 0.6117 without post-processing. Consequently, C = 1 and degree = 3 are selected as the hyperparameters for SVM, and all three features are chosen for the detection and fix process.
GlitchProber adopts a random sampling strategy to select samples from the model's token vocabulary V to form the sample set S. The choice of sampling rate K needs to balance between sample size and computational efficiency. A larger K leads to a larger sample size and more accurate detection results but also incurs higher computational costs. Conversely, a smaller K results in a smaller sample size and faster computation but may affect the detection performance.
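A minimal sketch of this sampling step follows. The vocabulary size and the helper name `sample_tokens` are illustrative assumptions; in practice V would come from the model's tokenizer:

```python
import random

def sample_tokens(vocab, K, seed=0):
    """Randomly sample a fraction K of token ids from the vocabulary V."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = max(1, int(len(vocab) * K))    # sample size implied by rate K
    return rng.sample(vocab, n)

vocab = list(range(32000))             # e.g. a Llama2-sized vocabulary
S = sample_tokens(vocab, K=0.1)
print(len(S))                          # 3200 candidate tokens at K = 0.1
```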
Based on the experimental results obtained from testing the Llama2 model at various K settings, it is evident that while increasing K leads to improvements in recall and F1 scores, it also significantly raises the computational time. Specifically, the recall increases from 0.6922 at K=0.1 to 0.7457 at K=0.3, and the F1 score similarly rises from 0.8181 to 0.8543.
However, the time required for processing escalates from 61 minutes and 38 seconds to 74 minutes and 30 seconds. Beyond K=0.3, recall and F1 continue to increase, reaching 0.7812 and 0.8772, respectively, at K=0.7, but at a disproportionate cost in time, extending up to 100 minutes and 41 seconds.
Thus, selecting a K value within the range of 0.1 to 0.3 strikes a favorable balance between detection accuracy and computational efficiency. This range efficiently leverages the increase in recall and F1 scores without incurring the high time penalties observed at higher K levels. This strategy ensures that GlitchProber remains practical and effective, optimizing the model's performance and operational feasibility within a reasonable computational budget.
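The diminishing return can be made explicit with a back-of-the-envelope calculation using only the numbers reported above (times converted to minutes):

```python
# Reported (K, recall, F1, minutes) from the Llama2 experiments above.
results = [
    (0.1, 0.6922, 0.8181, 61 + 38 / 60),
    (0.3, 0.7457, 0.8543, 74 + 30 / 60),
    (0.7, 0.7812, 0.8772, 100 + 41 / 60),
]

# Marginal F1 gain per extra minute of computation between settings.
for (k0, _, f0, t0), (k1, _, f1, t1) in zip(results, results[1:]):
    rate = (f1 - f0) / (t1 - t0)
    print(f"K {k0} -> {k1}: +{f1 - f0:.4f} F1 for +{t1 - t0:.1f} min "
          f"({rate * 1000:.2f} x 1e-3 F1/min)")
```

The F1 gained per additional minute drops by roughly a factor of three past K=0.3, which is consistent with choosing a K in the 0.1 to 0.3 range.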