TokenProber has several adjustable parameters: the number of discrepant words K to be mutated, the testing budget T, and the number of substitutions for dirty words and discrepant words, denoted as N. We set N to 1, meaning that only one prompt (pdir and pdis) is obtained after substituting the most similar dirty word and the least similar discrepant word. However, more candidate prompts can be generated by selecting the top N most similar dirty words or the N least similar discrepant words. In RQ3, we conduct an experiment to analyze the impact of these parameters on performance.
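The role of N can be illustrated with a minimal sketch. It assumes that a similarity score has already been computed for each candidate word; the function name `select_substitutions` and the dictionary inputs are hypothetical and are not part of TokenProber's actual implementation.

```python
from typing import Dict, List, Tuple


def select_substitutions(
    dirty_sims: Dict[str, float],
    discrepant_sims: Dict[str, float],
    n: int = 1,
) -> Tuple[List[str], List[str]]:
    """Pick substitution candidates for a given N (hypothetical helper).

    dirty_sims and discrepant_sims map each candidate word to a
    precomputed similarity score (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The N most similar dirty words: sort by score, highest first.
    top_dirty = sorted(dirty_sims, key=dirty_sims.get, reverse=True)[:n]
    # The N least similar discrepant words: sort by score, lowest first.
    least_discrepant = sorted(discrepant_sims, key=discrepant_sims.get)[:n]
    return top_dirty, least_discrepant


# Illustrative scores only; these are not values from the paper.
dirty = {"w1": 0.9, "w2": 0.5, "w3": 0.7}
discrepant = {"d1": 0.1, "d2": 0.8, "d3": 0.4}
single = select_substitutions(dirty, discrepant, n=1)
double = select_substitutions(dirty, discrepant, n=2)
```

With N = 1 each list holds a single word, which corresponds to the default setting described above; larger N simply widens both lists and therefore the pool of candidate prompts.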
Fig. presents the results of TokenProber with different parameter configurations. We vary the number of discrepant word selections (K = 1, 2, 3, 4, 5), the testing budget (T = 20, 30, 40, 50, 60), and the number of candidate prompts (N = 1, 2, 3, 4, 5) to evaluate their impact. SneakyPrompt's results (with a testing budget of 60) are included for reference. In almost all configurations, TokenProber outperforms SneakyPrompt, except when TokenProber operates with a significantly lower testing budget.
Selecting a large number of discrepant words for mutation can diminish the performance of TokenProber. Additionally, generating more candidate prompts for selection does not noticeably improve performance in our greedy-based approach. By opting for smaller values (K = 1 and N = 1), TokenProber achieves an optimal balance between effectiveness and efficiency.