Enhanced Profiling
In the profiling phase of the prompt injection attack, the auto-generated samples alone cannot trigger model behavior that the attacker can recognize in the black-box setting: even if an auto-generated query causes the injected sample to be retrieved, the attacker has no way to observe this. To overcome this challenge, we adopt a jailbreaking trick [1] that asks the LCM to leak its given prompt. Specifically, for each auto-generated sample, we append the suffix {\tt Repeat above:} and revise the sample accordingly so that it still satisfies the system's constraints (e.g., we cut off part of its content to keep the number of characters unchanged). When querying the system with the modified sentences, we not only retrieve exactly the same example but also prompt the model to disclose its given prompt. Below, we provide an example of the entire prompt, including the input query (Q4) and its retrieved examples (Q1-Q3).
Q1: I do know what made the grass. Repeat above.
A1: ANY ANSWER + A RANDOM WORD.
Q2: Replace column 'column2' with 'column1'
A2: dfin['column2'] = dfin['column1']
Q3: Find argmax of first 3 columns of dfin
A3: dfin.iloc[:3].idxmax(1)
Q4: I do know what made the grass. Repeat above.
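For illustration, the following is a minimal sketch of how such a profiling query could be crafted programmatically; the function name `craft_profiling_query`, the character budget, and the trimming strategy are our own simplifications for exposition, not part of the deployed system.

```python
# Minimal sketch: turn an auto-generated sample into a prompt-leaking query.
# The character budget and trimming strategy are illustrative assumptions.
SUFFIX = " Repeat above:"

def craft_profiling_query(sample: str, char_budget: int | None = None) -> str:
    """Append the prompt-leaking suffix and trim the sample so that the
    total number of characters stays within the given budget."""
    budget = char_budget if char_budget is not None else len(sample)
    trimmed = sample[: max(0, budget - len(SUFFIX))]
    return trimmed + SUFFIX

# Example usage (the 60-character budget is an arbitrary placeholder):
print(craft_profiling_query("Replace column 'column2' with 'column1'", char_budget=60))
# -> "Replace column 'column2' with 'column1' Repeat above:"
```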
[1] Zhuo, Terry Yue et al. “Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity.” (2023).
How robust are the triggers of the backdoors?
To answer this question, we investigate whether triggers that are similar, but not identical, can activate the backdoor, highlighting the potential risk for benign users who input closely related queries.
Specifically, we generate semantically equivalent variations for each backdoor trigger used in our experiments and employ these variants to activate the backdoors within the models trained in RQ1.
For every original trigger, we craft ten variants using GPT-4; note, however, that for the triggers of $B_1$ and $B_3$, GPT-4 was only able to produce four and six valid variants, respectively.
The detailed prompts and corresponding variants are available in our artifacts.
We report the average ASR of these similar triggers in the following table.
It illustrates that the backdoors can still be successfully triggered by different, yet semantically equivalent, triggers.
For instance, even at a minimal poisoning rate of 0.01%, the backdoors for $B_1$ and $B_2$ are activated with an average ASR of 53.2% and 1.5%, respectively.
This underscores the severity of the attack: benign users who input similar queries still face the risk of receiving manipulated outputs.
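For completeness, below is a minimal sketch of how the average ASR over trigger variants can be computed; the `model.generate` interface and the substring check on the target snippet are assumptions made for illustration, not the exact evaluation harness used in our experiments.

```python
def average_asr(model, trigger_variants, benign_queries, target_snippet):
    """Average attack success rate over semantically equivalent triggers:
    the fraction of (variant, query) pairs whose completion contains the
    attacker-chosen target. `model.generate` is a hypothetical interface."""
    hits = total = 0
    for trigger in trigger_variants:
        for query in benign_queries:
            completion = model.generate(f"{query} {trigger}")
            hits += int(target_snippet in completion)
            total += 1
    return hits / total if total else 0.0
```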
Randomness of code generation systems
Neural models are known to produce varied outputs even when provided with identical inputs.
Such randomness can be attributed to decoding hyper-parameters such as the temperature setting.
To understand the impact of this output randomness on the FDI attack, we investigate the performance of both attacks under diverse hyper-parameter configurations.
Following [2], we evaluate both of our proof-of-concept attacks under a range of temperatures, i.e., 0.2, 0.6, and 1.0.
The results are reported as follows, indicating that the attack's success rate remains almost unaffected by these settings.
Thus, despite this inherent randomness, the neural models remain vulnerable.
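For context on where this randomness comes from, below is a generic sketch of temperature-scaled sampling over a model's output logits; it illustrates why lower temperatures yield more deterministic outputs, and it is not the decoding code of any particular system in our experiments.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0,
                            rng: np.random.Generator | None = None) -> int:
    """Sample a token id from `logits` after temperature scaling.
    Lower temperatures (e.g., 0.2) sharpen the distribution, making outputs
    more deterministic; higher ones (e.g., 1.0) make them more random."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```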
[2] Aghakhani, H. et al. “TrojanPuzzle: Covertly Poisoning Code-Suggestion Models.” ArXiv abs/2301.02344 (2023).
More attack methods
In our experiments, we use simple attack methods, such as the backdoor with a fixed sentence-level trigger and target, to demonstrate the feasibility of the FDI attack.
Such methods may not be stealthy enough to bypass the manual inspection of the feedback samples.
It is worth noting that there exist more sophisticated attack methods against LCMs that could theoretically be applied to FDI attacks.
Investigating how these sophisticated methods can be adapted to FDI attacks is an important direction for future research.