Attack Results

Our experiments on three models show that Template-based techniques, namely Jailbroken and 78 Templates, are highly effective, and that GPTFuzz is the most effective Generative-based technique. Additionally, all models showed increased resilience against harmful content and illegal activity queries. The tables are presented in this section.

Example of Attack Prompts

Attack Results of Three Models

Although the attack success rate (ASR) of Parameter is slightly lower than that of Pair, its significantly higher efficiency makes Parameter the better choice. GCG on Llama is configured to run 500 iterations, because empirical evidence indicates that 75 iterations fail to produce jailbreak outcomes for the majority of queries processed by GCG on Llama. Despite this increase, the universal methods, except for DeepInception, still perform better.

attack_results
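The trade-off between ASR and efficiency described above can be made concrete with a small sketch. It assumes that ASR is the fraction of questions jailbroken at least once and that efficiency is the fraction of individual queries that succeed; these definitions, the `AttackRun` class, and all numbers below are illustrative assumptions and are not taken from the table.

```python
from dataclasses import dataclass


@dataclass
class AttackRun:
    """Hypothetical summary of one attack method on one model."""
    questions_total: int       # distinct questions attempted
    questions_jailbroken: int  # questions jailbroken at least once
    queries_total: int         # prompts sent to the model
    queries_successful: int    # prompts that produced a jailbroken response


def asr(run: AttackRun) -> float:
    # Assumed definition: share of questions that were jailbroken.
    return run.questions_jailbroken / run.questions_total


def efficiency(run: AttackRun) -> float:
    # Assumed definition: share of individual queries that succeeded.
    return run.queries_successful / run.queries_total


# Placeholder numbers chosen only to illustrate the trade-off.
method_a = AttackRun(60, 45, 3000, 90)   # higher ASR, many queries
method_b = AttackRun(60, 42, 300, 120)   # slightly lower ASR, few queries

for name, run in [("method_a", method_a), ("method_b", method_b)]:
    print(f"{name}: ASR={asr(run):.2f}, efficiency={efficiency(run):.2f}")
```

Under these assumed definitions, a method with a slightly lower ASR can still be preferable when it needs far fewer queries per success.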

Visualization

attack.pdf

Performance of attacks on the three models. Note: For readability, we intentionally enlarged the labels of the best-performing items (top-right corner).

Types of Jailbroken Questions for the Three Models

The total number of jailbroken questions across different methods. We observe that all three models are more robust against the harmful_content category.

questions
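A per-category tally like the one above could be produced along the following lines. This is a minimal sketch: the record layout, the method names, and `other_category` are hypothetical, and the data is placeholder data rather than the reported results.

```python
from collections import Counter

# Each record: (method, question_type, question_id, jailbroken).
# Placeholder data for illustration only.
records = [
    ("method_a", "harmful_content", 1, False),
    ("method_a", "illegal_activity", 2, True),
    ("method_a", "other_category", 3, True),
    ("method_b", "harmful_content", 1, True),
    ("method_b", "illegal_activity", 2, False),
    ("method_b", "other_category", 3, True),
]


def jailbroken_by_type(rows):
    """Count successful jailbreaks per question type, summed over methods."""
    counts = Counter()
    for _method, question_type, _question_id, jailbroken in rows:
        if jailbroken:
            counts[question_type] += 1
    return counts


totals = jailbroken_by_type(records)
# Print categories from most robust (fewest jailbreaks) to least robust.
for question_type, count in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{question_type}: {count}")
```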

The Most Effective Templates from Jailbroken and 78 Templates against the Three Models

Since Template-based methods form the best-performing category, we further identify the two best attack techniques in this category. Note that the Llama result presented in this table is obtained with '[/INST]'.

template_res

Effect of '[/INST]' for Llama-2

This table presents the impact of using '[/INST]' on average efficacy and efficiency. Overall, not using '[/INST]' increases the success rate fourfold.

Special_Token
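The fourfold comparison can be expressed as a simple ratio of success rates. The sketch below uses a hypothetical helper, `fold_change`, with placeholder counts chosen to give a ratio of four; they are not the values in the table.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def fold_change(rate_new: float, rate_baseline: float) -> float:
    """How many times larger rate_new is than rate_baseline."""
    if rate_baseline == 0:
        raise ValueError("baseline rate is zero; fold change is undefined")
    return rate_new / rate_baseline


# Placeholder counts for illustration only.
rate_with_token = success_rate(10, 100)     # prompts ending with '[/INST]'
rate_without_token = success_rate(40, 100)  # prompts without '[/INST]'

ratio = fold_change(rate_without_token, rate_with_token)
print(f"success rate without '[/INST]' is {ratio:.1f}x the rate with it")
```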

Templates That Vary the Most with and without '[/INST]' for Llama-2

This table presents the templates whose results differ the most with and without '[/INST]'.

special_token_template
