WARNING: THIS WEBSITE MAY CONTAIN PROMPTS AND MODEL BEHAVIORS THAT ARE OFFENSIVE IN NATURE.
Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) now play a pervasive role across numerous applications, powering everything from basic chatbots to complex decision-making systems. However, as their influence grows, so does the imperative to secure them. Current LLMs are vulnerable to prompt-based attacks: jailbreaking attacks enable malicious users to exploit the model to generate harmful content, while hijacking attacks manipulate the model into producing specific, unexpected content. This underscores the need to detect such attacks in order to maintain the integrity and trustworthiness of LLM-based applications. Unfortunately, existing detection approaches are usually tailored to specific attack methods and therefore generalize poorly across the variety of jailbreaking and hijacking attacks and modalities.
To address this, we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks on LLMs and MLLMs. JailGuard operates on the principle that attack queries are inherently less robust than benign ones, regardless of attack method or modality. Concretely, JailGuard first mutates untrusted inputs to generate variants and then leverages the divergence among the model's responses to these variants to distinguish attack samples from benign samples. We implement 16 random mutators and 2 semantic-guided mutators for text and image inputs. To improve generalization across attack types, JailGuard by default applies a mutator-combination policy that merges variants and divergence scores from different mutators. To evaluate the effectiveness of our framework, we build the first comprehensive multi-modal LLM prompt-based attack dataset, containing 11,000 data items across 15 known attack types. The evaluation shows that JailGuard achieves the best detection accuracy of 86.14% on text inputs and 82.90% on image inputs, outperforming state-of-the-art defense methods by margins of 11.81%-25.73% and 12.20%-21.40%, respectively.
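To make the core idea concrete, below is a minimal Python sketch of variant-based divergence detection. It is not the JailGuard implementation: the `query_model` wrapper, the single random-deletion mutator, and the TF-IDF cosine divergence here are simplifying assumptions standing in for JailGuard's 18 mutators, combination policy, and actual divergence computation.

```python
# Minimal sketch of variant-based divergence detection (not the full JailGuard pipeline).
# Assumptions (hypothetical): `query_model(prompt) -> str` wraps the target LLM,
# a single random-deletion mutator stands in for JailGuard's 18 mutators,
# and responses are compared with a simple TF-IDF cosine divergence.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def random_deletion(prompt: str, ratio: float = 0.05) -> str:
    """Drop a small fraction of characters to create a perturbed variant."""
    return "".join(c for c in prompt if random.random() > ratio)


def response_divergence(responses: list[str]) -> float:
    """Maximum pairwise (1 - cosine similarity) over TF-IDF vectors of the responses."""
    tfidf = TfidfVectorizer().fit_transform(responses)
    sims = cosine_similarity(tfidf)
    return float((1.0 - sims).max())


def is_attack(prompt: str, query_model, n_variants: int = 8, threshold: float = 0.5) -> bool:
    """Flag the input as an attack if its variants yield highly divergent responses."""
    variants = [random_deletion(prompt) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    return response_divergence(responses) > threshold
```

In practice, the divergence threshold would be calibrated on benign queries so that their naturally low response divergence stays below it, while brittle attack queries exceed it.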
Our dataset and code are publicly available here.