To evaluate the effectiveness of JailGuard, we build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types. The evaluation suggests that JailGuard achieves the best detection accuracy, 86.14% on text inputs and 82.90% on image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%, respectively.
In addition, the mutators and policies in JailGuard generalize better across different types of attacks and effectively distinguish prompt-based attacks from benign inputs. Moreover, the combination policy in JailGuard generalizes better than single mutators across various attacks. Specifically, the text mutator combination policy achieves over 70.00% detection accuracy on 10 types of attacks, while maintaining a benign input detection accuracy of 86.17%.
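To make the reported metrics concrete, the following minimal sketch shows one way per-type detection accuracy and the count of attack types above a threshold could be computed. The record layout, the function names `per_type_accuracy` and `count_types_above`, and the toy data are illustrative assumptions, not JailGuard's evaluation code or results.

```python
from collections import defaultdict


def per_type_accuracy(records):
    """Return detection accuracy per label.

    records: iterable of (label, is_attack, flagged) tuples (assumed layout).
    A record counts as correct when the detector's decision `flagged`
    matches the ground truth `is_attack`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, is_attack, flagged in records:
        total[label] += 1
        if flagged == is_attack:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}


def count_types_above(accuracy, threshold, benign_label="benign"):
    """Count attack types whose accuracy strictly exceeds the threshold."""
    return sum(
        1
        for label, value in accuracy.items()
        if label != benign_label and value > threshold
    )


# Hypothetical toy records, for illustration only.
toy_records = [
    ("benign", False, False),
    ("benign", False, True),   # false positive on a benign input
    ("type_a", True, True),
    ("type_a", True, True),
    ("type_b", True, False),   # missed attack
    ("type_b", True, True),
]

toy_accuracy = per_type_accuracy(toy_records)
types_over_70 = count_types_above(toy_accuracy, 0.70)
```

Under this reading, a statement such as "over 70.00% detection accuracy on 10 types of attacks" corresponds to `count_types_above(accuracy, 0.70)` evaluated on the full dataset, and the benign input detection accuracy corresponds to the entry for the benign label.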