The tables above present the F1 scores, false positive rates, and overall accuracies of prompt classification for the various detection techniques. The metrics include the overall F1 score for each toxic scenario, the false positive rate, and the classifiers' accuracy on SafetyPromptCollections and RealToxicityPrompts. Statistically significant values are highlighted in bold.
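For readers unfamiliar with how these three metrics relate, the following is a minimal sketch of computing them from binary labels (1 = toxic, 0 = benign); the function name and toy data are illustrative, not from the evaluation itself:

```python
def classification_metrics(y_true, y_pred):
    """Compute F1 score, false positive rate, and accuracy
    for binary toxicity labels (1 = toxic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of benign prompts flagged toxic
    accuracy = (tp + tn) / len(y_true)
    return f1, fpr, accuracy

# Toy example (illustrative only): one missed toxic prompt, one false alarm.
f1, fpr, acc = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Note that F1 and false positive rate pull in opposite directions: a detector can trivially reach zero false positives by never flagging anything, at the cost of recall and hence F1, which is why the tables report both.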
The figures above present the ROC curves for identifying toxic scenarios, comparing the performance of ToxicDetector against all baselines on every LLM under test.
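A ROC curve is traced by sweeping the decision threshold over the classifier's continuous toxicity scores and recording (FPR, TPR) at each step. A minimal sketch of that sweep, with hypothetical scores not taken from the paper:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a decision threshold
    over toxicity scores, from most to least confident prediction."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Visit examples by descending score; each step lowers the threshold
    # just enough to classify one more example as toxic.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy example: two toxic and two benign prompts with illustrative scores.
curve = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

The area under this curve (AUC) summarizes each detector's ranking quality independently of any single threshold, which is what makes ROC curves a natural complement to the fixed-threshold F1 tables.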
The table above presents the F1 scores for the different toxic scenarios under jailbreaking on SafetyPromptCollections and RealToxicityPrompts.
The table above compares the F1 scores for the different toxic scenarios with and without concept prompt augmentation, along with the corresponding boost, on SafetyPromptCollections. Values in bold indicate the highest F1 score in each scenario.