The tables above present the F1 scores, false positive rates, and overall accuracies of prompt classification for the various detection techniques. The metrics include the overall F1 score for each toxic scenario, the false positive rate, and the classifiers' accuracy on SafetyPromptCollections and RealToxicityPrompts. Statistically significant values are highlighted in bold.
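For readers unfamiliar with how these three metrics relate, the following is a minimal sketch of computing them from binary labels (1 = toxic, 0 = benign); the function name and toy data are illustrative, not from the evaluation itself:

```python
def classification_metrics(y_true, y_pred):
    """Compute F1 score, false positive rate, and accuracy
    for binary toxicity labels (1 = toxic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of benign prompts flagged toxic
    accuracy = (tp + tn) / len(y_true)
    return f1, fpr, accuracy

# Toy example (illustrative only): one missed toxic prompt, one false alarm.
f1, fpr, acc = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Note that F1 and false positive rate pull in opposite directions: a detector can trivially reach zero false positives by never flagging anything, at the cost of recall and hence F1, which is why the tables report both.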
The figures above present the ROC curves for identifying toxic scenarios, comparing the performance of ToxicDetector against all baselines on every LLM under test.
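A ROC curve is traced by sweeping the decision threshold over the classifier's continuous toxicity scores and recording (FPR, TPR) at each step. A minimal sketch of that sweep, with hypothetical scores not taken from the paper:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a decision threshold
    over toxicity scores, from most to least confident prediction."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Visit examples by descending score; each step lowers the threshold
    # just enough to classify one more example as toxic.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy example: two toxic and two benign prompts with illustrative scores.
curve = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

The area under this curve (AUC) summarizes each detector's ranking quality independently of any single threshold, which is what makes ROC curves a natural complement to the fixed-threshold F1 tables.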
The table above presents the F1 scores for the different toxic scenarios under jailbreaking on SafetyPromptCollections and RealToxicityPrompts.
The table above compares the F1 scores for the different toxic scenarios with and without concept prompt augmentation, along with the corresponding boost, on SafetyPromptCollections. Values in bold indicate the highest F1 score in each scenario.