Our experiments on three models show that Template-based techniques are highly effective, notably Jailbroken and 78Templates, while GPTFuzz is the most effective Generative-based technique. Additionally, all models showed increased resilience against harmful content and illegal activity queries. The corresponding tables are presented in this section.
Although the ASR of Parameter is slightly lower than that of Pair, its significantly higher efficiency makes Parameter the better choice. GCG on Llama is configured to run 500 iterations. This setting is based on empirical evidence that 75 iterations fail to produce jailbreaks for the majority of queries processed by GCG on Llama. Despite this increase, the universal methods, except for DeepInception, still demonstrate better performance.
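As a point of reference for the ASR figures discussed above, the following is a minimal sketch that assumes ASR denotes the attack success rate, i.e. the fraction of queries for which an attack yields a jailbreak. The function name `attack_success_rate` and the toy outcomes are hypothetical and are not taken from our evaluation code or results.

```python
def attack_success_rate(outcomes):
    """Return the fraction of queries that were jailbroken.

    `outcomes` is an iterable of booleans, one per query
    (True = the attack succeeded). This definition of ASR is an
    assumption made for illustration.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one query outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical outcomes for four queries, not results from the paper.
toy_outcomes = [True, False, True, True]
print(attack_success_rate(toy_outcomes))
```

Under this assumed definition, comparing two techniques on ASR alone ignores cost, which is why a slightly lower ASR can still be preferable when the technique is much more efficient.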
Performance of attacks on three models. Note: For readability, we intentionally enlarged the labels of the best-performing items (top-right corner).
The total number of jailbroken questions across different methods. We observe that all three models are more robust against harmful_content.
Since the Template-based category performs best, we further identify the two best attack techniques within it. Note that the Llama results presented in this table are obtained with '[/INST]'.
This table presents the templates that differ the most before and after using '[/INST]'.