In this research question, we assess the effectiveness of AbnorDetector-Lite and AbnorDetector-Full in detecting jailbreak attacks. We randomly sample 400 normal queries from Alpaca-GPT4 and 400 attack queries: 100 from JailBreakV and 100 each generated by GCG, COLD-Attack, and LAA. The 400 queries triggering normal behavior and the 400 attack queries triggering abnormal behavior are used for critical layer analysis, and the features extracted from them form the training set for the classifier. For evaluation, we sample a further 100 queries from each of the five datasets, disjoint from those in the training set, to construct the test sets on which classification accuracy is measured. For GradSafe, we follow the basic setup described in its original paper for jailbreak detection.
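The sampling protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the per-source sizes come from the text, while the function name `build_splits`, the placeholder query pools, the 0/1 label encoding, and the choice to guarantee disjointness by drawing once and slicing are all assumptions made for the example.

```python
import random

# Training sample sizes per source, as stated in the text.
TRAIN_SIZES = {
    "Alpaca-GPT4": 400,  # normal queries
    "JailBreakV": 100,   # attack queries
    "GCG": 100,          # attack queries
    "COLD-Attack": 100,  # attack queries
    "LAA": 100,          # attack queries
}
NORMAL_SOURCES = {"Alpaca-GPT4"}
TEST_SIZE = 100  # independent test queries per source


def build_splits(pools, train_sizes=TRAIN_SIZES, test_size=TEST_SIZE, seed=0):
    """Return (train, tests): a labeled training list and one test set per source.

    `pools` maps each source name to a list of candidate queries (hypothetical
    input; the real pools would be the datasets and generated attack prompts).
    """
    rng = random.Random(seed)
    train, tests = [], {}
    for source, n_train in train_sizes.items():
        # Assumption: draw once without replacement, then slice, so the
        # training and test queries of a source can never overlap.
        drawn = rng.sample(pools[source], n_train + test_size)
        label = 0 if source in NORMAL_SOURCES else 1  # 0 = normal, 1 = attack
        train += [(query, label) for query in drawn[:n_train]]
        tests[source] = [(query, label) for query in drawn[n_train:]]
    return train, tests


if __name__ == "__main__":
    # Placeholder pools standing in for the real query collections.
    pools = {s: [f"{s}-query-{i}" for i in range(1000)] for s in TRAIN_SIZES}
    train, tests = build_splits(pools)
    print(len(train), {s: len(t) for s, t in tests.items()})
```

Keeping one test set per source, rather than pooling them, matches the per-dataset accuracy reporting implied by the setup; the training list is balanced at 400 normal and 400 attack queries.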
Accuracy Results of AbnorDetector and GradSafe in Detecting Abnormal Behaviors under Jailbreak Scenarios