Here we provide a demo to show how JailGuard effectively detects prompt-based attacks.
First, given a query to an LLM system, JailGuard applies its default mutator combination policy to generate a set of variant queries using different mutators.
Then, JailGuard queries the LLM system to obtain a response for each variant. Attack queries often rely on well-designed templates or complicated perturbations to confuse and attack LLM systems, so they are easily disrupted by slight perturbations, which in turn causes large changes in the semantics of the responses.
Finally, JailGuard calculates the similarity and divergence between the variant responses and uses them to identify attack queries.
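
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not JailGuard's actual implementation: the two mutators, the bag-of-words cosine similarity, the KL-divergence score, the `threshold` value, and every function name (`make_variants`, `max_divergence`, `detect`) are assumptions made for this example, and `llm` stands for any callable that maps a query string to a response string.

```python
import math
import random
from collections import Counter
from typing import Callable, List

EPS = 1e-6  # smoothing so that every probability is strictly positive


def random_delete(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Hypothetical mutator: drop each character with probability `rate`."""
    return "".join(ch for ch in text if rng.random() >= rate)


def random_swap(text: str, rng: random.Random) -> str:
    """Hypothetical mutator: swap one pair of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def make_variants(query: str, n: int, seed: int = 0) -> List[str]:
    """Step 1: build n variants by cycling through the mutators (assumed policy)."""
    rng = random.Random(seed)
    mutators = [random_delete, random_swap]
    return [mutators[i % len(mutators)](query, rng) for i in range(n)]


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a stand-in for a semantic embedding."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def max_divergence(responses: List[str]) -> float:
    """Step 3: largest pairwise KL divergence between similarity distributions."""
    sim = [[cosine_similarity(a, b) + EPS for b in responses] for a in responses]
    dist = [[s / sum(row) for s in row] for row in sim]  # each row sums to 1
    return max(
        (
            sum(p * math.log(p / q) for p, q in zip(dist[i], dist[j]))
            for i in range(len(dist))
            for j in range(len(dist))
            if i != j
        ),
        default=0.0,
    )


def detect(query: str, llm: Callable[[str], str], n: int = 4,
           threshold: float = 0.5) -> bool:
    """Flag the query as an attack if its variant responses diverge too much."""
    responses = [llm(v) for v in make_variants(query, n)]  # step 2: query the LLM
    return max_divergence(responses) > threshold
```

With a stub `llm` that always returns the same text, `detect` returns `False` because every row of the similarity matrix is identical and the divergence is zero. When the responses flip between compliance and refusal across variants, the rows differ sharply and the score exceeds the threshold. In practice the threshold would need to be tuned on labeled data.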