We explore the following three applications. First, we propose Real-time Jailbreak Detection, using activation features from coverage criteria to classify queries as normal or jailbreak, enabling systematic identification of high-risk interactions. Second, we explore Test Case Prioritization, leveraging coverage levels to identify high-priority cases and remove redundancies, improving testing efficiency. Third, we investigate Jailbreak Case Generation, where coverage-guided methods refine prompts to generate adversarial examples.
Dataset: We randomly select 2,500 queries from Alpaca-gpt4 as the training set and 500 as the test set, representing normal queries to LLMs. For JailBreakV-28k, we filter out all queries related to GCG (including attack suffixes) and select 2,500 attack queries as the training set and 500 as the test set, representing attack queries. From TruthfulQA, we select 817 queries as a validation set to evaluate the classifier's generalization ability on normal queries. Additionally, we use 200 attack queries generated by GCG, 50 attack queries based on DeepInception, and 200 queries generated by Masterkey as separate validation sets to evaluate the classifier's generalization ability on attack queries.
Baseline: We select the most widely used perplexity filter, the state-of-the-art method PARDEN, and self-reminder as baselines for comparison. Additionally, we design a jailbreak-attack detector based on the clustering method used in our empirical study for further comparison.
Following "PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition", the perplexity filter's threshold is set to the maximum perplexity of malicious queries in AdvBench, with a window size of 10. PARDEN uses a threshold of 0.6 and a window size of 30. The clustering method uses the same training set to determine cluster centers, and jailbreak detection is performed by comparing a query's distances to these centers.
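The windowed perplexity filter above can be sketched as follows. This is a minimal illustration, not the baseline's actual implementation: it assumes per-token log-probabilities have already been obtained from a language model, and the function names (`windowed_perplexity`, `is_jailbreak`) are our own.

```python
import math

def windowed_perplexity(token_logprobs, window=10):
    """Maximum perplexity over sliding windows of per-token log-probabilities."""
    worst = 0.0
    for i in range(max(1, len(token_logprobs) - window + 1)):
        chunk = token_logprobs[i:i + window]
        # Perplexity of a window = exp(mean negative log-likelihood).
        nll = -sum(chunk) / len(chunk)
        worst = max(worst, math.exp(nll))
    return worst

def is_jailbreak(token_logprobs, threshold, window=10):
    # Flag the query if any window's perplexity exceeds the threshold,
    # which is calibrated to the maximum perplexity observed on AdvBench.
    return windowed_perplexity(token_logprobs, window) > threshold
```

Fluent natural-language queries yield low windowed perplexity, while adversarial suffixes of unnatural tokens (e.g., GCG suffixes) spike it above the calibrated threshold.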
Comparison of Jailbreak Detection Accuracy Across Models and Methods
Test case prioritization represents another critical application. By leveraging coverage levels, test cases likely to reveal model faults (high coverage) are prioritized, while redundant ones (low coverage) are filtered, improving testing efficiency and reducing resource usage.
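The threshold-based prioritization described above can be sketched in a few lines. This is a simplified illustration under the assumption that each test case already has a scalar coverage score; the function name and threshold value are ours, not part of the evaluated method.

```python
def prioritize(test_cases, coverage_scores, threshold=0.5):
    """Keep test cases whose coverage score meets the threshold and
    rank them from highest to lowest coverage; low-coverage cases
    are treated as redundant and filtered out."""
    kept = [(score, case)
            for case, score in zip(test_cases, coverage_scores)
            if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [case for _, case in kept]
```

Cases most likely to reveal faults are then executed first, and the filtered tail is dropped to save testing budget.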
Accuracy of Threshold-Based Test Case Prioritization Classification
By utilizing coverage to guide the creation of attack examples, the Jailbreak Case Generation method identifies areas of the model that remain unexplored. Iteratively refining prompts based on coverage gains ensures that the generated cases are effective at exposing vulnerabilities while promoting diversity among test cases. Here, we conduct a preliminary exploration to showcase the potential of coverage-guided jailbreak case generation.
We use Llama-2-7b-chat as the target model, initializing with five jailbreak queries as seeds. Over five iterations, GPT-4 generates ten new jailbreak queries per round through prompt rewriting. The query with the highest coverage increase is selected as the next seed in the coverage-guided approach. For comparison, a random strategy selects seeds randomly from rewritten candidates. Each method ultimately generates 250 new jailbreak queries to evaluate effectiveness.
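The greedy loop above can be sketched as follows. The `rewrite` and `coverage_gain` callables are placeholders for the GPT-4 prompt rewriter and the coverage measurement on the target model, both of which are external systems; with 5 seeds, 5 iterations, and 10 rewrites per round, the loop yields the 250 queries described above.

```python
def coverage_guided_generation(seeds, rewrite, coverage_gain,
                               iterations=5, per_round=10):
    """Coverage-guided jailbreak case generation: each round, rewrite
    the current seed into candidates and promote the candidate with
    the highest coverage gain to be the next seed."""
    generated = []
    for seed in seeds:
        current = seed
        for _ in range(iterations):
            candidates = [rewrite(current) for _ in range(per_round)]
            generated.extend(candidates)
            # Greedy coverage-guided selection of the next seed.
            current = max(candidates, key=coverage_gain)
    return generated
```

The random baseline differs only in the selection step: the next seed is drawn uniformly from the rewritten candidates instead of by `max(..., key=coverage_gain)`.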
Comparison of Successful Jailbreak Queries Generated by Coverage-Guided and Random Methods Over Iterations.