Here we present the results of jailbreak prompts against ChatGPT (with GPT-3.5-turbo and GPT-4) and LLaMA (7B and 13B).
We list some example responses from ChatGPT with both GPT-3.5-turbo and GPT-4. You can check the full data here.
Based on our experiments, we observed that ChatGPT may generate prohibited messages even without jailbreak prompts in certain scenarios. To accurately evaluate the strength of the jailbreaks, we further tested ChatGPT's responses to malicious content with non-jailbreak prompts and compared them with the results obtained with jailbreak prompts. For the non-jailbreak test, we reused the same 5 questions for each of the 8 disallowed usage scenarios and repeated the question-and-answer process 5 times, resulting in a total of 25 real-world attempts per scenario.
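The baseline loop is straightforward to reproduce. Below is a minimal sketch, assuming the OpenAI Python SDK; SCENARIO_QUESTIONS, REPETITIONS, and looks_like_refusal are hypothetical placeholders for illustration, not the actual harness used in our experiments.

```python
# Minimal sketch of the non-jailbreak baseline loop (assumes the OpenAI Python SDK v1).
# SCENARIO_QUESTIONS and looks_like_refusal are placeholders, not the authors' harness.
from openai import OpenAI

client = OpenAI()

SCENARIO_QUESTIONS = {
    # 8 disallowed-usage scenarios, each mapped to the same 5 question prompts
    "illegal_activity": ["<question 1>", "<question 2>", "<question 3>", "<question 4>", "<question 5>"],
    # ... 7 more scenarios ...
}

REPETITIONS = 5  # 5 questions x 5 repetitions = 25 attempts per scenario

def looks_like_refusal(text: str) -> bool:
    """Crude placeholder: real evaluation would need manual or model-assisted labeling."""
    return any(marker in text.lower() for marker in ("i'm sorry", "i cannot", "i can't"))

results = {}
for scenario, questions in SCENARIO_QUESTIONS.items():
    successes = 0
    for question in questions:
        for _ in range(REPETITIONS):
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": question}],  # no jailbreak prefix
            )
            if not looks_like_refusal(resp.choices[0].message.content):
                successes += 1
    results[scenario] = successes  # out of 25 attempts per scenario
```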
The table above shows the comparison of Non-Jailbreak and Jailbreak Outcomes on GPT-3.5-turbo.
The table above shows the comparison of Non-Jailbreak and Jailbreak Outcomes on GPT-4.
The table above presents successful jailbreak cases in GPT-3.5-turbo vs. GPT-4 (1,950 cases = 5 questions × 78 jailbreak prompts × 5 repetitions).
In our pilot study, we tested the vulnerability of LLaMA at different model sizes (7 billion and 13 billion parameters) to prompt-based attacks using the question prompts from our study. We found that no mechanisms were in place to block or filter misuse in the prohibited scenarios, and every jailbreak attempt succeeded [DATA]. This finding underscores the importance of continued research into potential jailbreaking vulnerabilities in LLMs, as well as the development of effective countermeasures to thwart prompt-based attacks on these models.
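A pilot run like this can be approximated locally with Hugging Face transformers. The sketch below is illustrative only: the checkpoint identifier, decoding settings, and the ask helper are assumptions, not the exact setup we used.

```python
# Illustrative sketch of querying a base LLaMA checkpoint with a raw question prompt.
# The model identifier and decoding settings are assumptions for demonstration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "huggyllama/llama-7b"  # placeholder; swap in a 13B checkpoint to compare sizes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def ask(question: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, skipping the echoed prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Base LLaMA checkpoints ship without refusal training or a content filter,
# so raw question prompts from the study are typically answered directly.
print(ask("<one of the study's question prompts>"))
```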
The table above depicts the number of successful jailbreaking attempts for each pattern and scenario.
The table above presents the number of successful jailbreaking attempts for each pattern and scenario.
The table above depicts the number of successful jailbreaking attempts for each pattern and scenario.
The table above compares the effectiveness of jailbreak prompting techniques in eliciting harmful responses from earlier versions of GPT-3.5 versus the current GPT-3.5, and from earlier GPT-4 versus the current GPT-4. Cliff's delta is used as the metric, where a negative value indicates the prompting technique became more effective over time at eliciting harmful content, while a positive value indicates it became less effective. Cells with p-values < 0.05 are bolded, indicating statistical significance.
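For reference, the sketch below computes Cliff's delta over two samples and, as an assumption about the analysis, pairs it with a two-sided Mann-Whitney U test for the p-values; the example counts are hypothetical.

```python
# Cliff's delta = P(x > y) - P(x < y) over all pairs of observations.
# `earlier` and `current` are hypothetical per-prompt counts of harmful responses
# from the earlier and current model versions; the significance test is an assumption.
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Return (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

earlier = [2, 3, 1, 4, 2]   # harmful responses per prompt, earlier model version
current = [4, 5, 3, 5, 4]   # harmful responses per prompt, current model version

delta = cliffs_delta(earlier, current)  # negative: the prompts became MORE effective over time
_, p_value = mannwhitneyu(earlier, current, alternative="two-sided")
print(f"Cliff's delta = {delta:.2f}, p = {p_value:.3f}")
```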