Datasets:
4chan Dataset: 500 prompts collected from posts on 4chan, which are prone to triggering NSFW image generation.
Lexica Dataset: 404 prompts collected from the Lexica website, a large database of AI-generated images.
NSFW-200 Dataset: 200 prompts generated by GPT-3.5, following a post on Reddit.
Safety Checker:
Text Safety Checker: Text-Match and Text-Classifier are selected as the target text safety checkers. Text-Match detects prompts that contain words from a predefined sensitive-word list. Text-Classifier is a DistilBERT model fine-tuned on posts collected from Reddit to classify NSFW text content.
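For illustration, a minimal sketch of these two text checkers is shown below; the sensitive-word list, the label name, and the DistilBERT checkpoint are placeholders, not the actual artifacts used in the evaluation.

```python
# Sketch only: the word list, label name, and model checkpoint are placeholders.
from transformers import pipeline

SENSITIVE_WORDS = {"nude", "gore"}  # placeholder for the predefined sensitive-word list

def text_match(prompt: str) -> bool:
    """Text-Match style: flag the prompt if it contains any listed word."""
    tokens = set(prompt.lower().split())
    return bool(tokens & SENSITIVE_WORDS)

# Text-Classifier style: a DistilBERT sequence classifier fine-tuned on Reddit posts;
# "distilbert-nsfw-reddit" is a hypothetical checkpoint name.
nsfw_text_clf = pipeline("text-classification", model="distilbert-nsfw-reddit")

def text_classifier(prompt: str) -> bool:
    pred = nsfw_text_clf(prompt)[0]               # {"label": ..., "score": ...}
    return pred["label"] == "NSFW" and pred["score"] >= 0.5
```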
Image Safety Checker: We select NSFW-Classifier and CLIP-Detector as the target image safety checkers. NSFW-Classifier is an open-source image classifier specifically designed to detect NSFW content in images. CLIP-Detector is a binary classifier trained on CLIP image embeddings from an NSFW image dataset, enabling it to identify NSFW content in images; a simplified sketch of the CLIP-based checks is given after the next item.
Text-Image Safety Checker: We choose Text-Image-SD, the default safety checker of the open-source Stable Diffusion framework. It blocks images whose embeddings exhibit significant similarity to predefined NSFW concepts.
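The following sketch illustrates both CLIP-based checks under simplifying assumptions: the openai/clip-vit-base-patch32 checkpoint, the concept strings, the thresholds, and the untrained binary head are all placeholders rather than the actual detectors.

```python
# Sketch only: checkpoint, concepts, thresholds, and the binary head are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)          # L2-normalized CLIP embedding

# CLIP-Detector style: a binary head over the CLIP image embedding.
# In practice this head is trained on an NSFW image dataset; here it is untrained.
detector_head = torch.nn.Linear(clip.config.projection_dim, 2)

def clip_detector(image: Image.Image) -> bool:
    logits = detector_head(image_embedding(image))
    return logits.argmax(dim=-1).item() == 1             # convention: class 1 = NSFW

# Text-Image-SD style: block the image if its embedding is too similar to any
# predefined NSFW concept embedding.
NSFW_CONCEPTS = ["nudity", "violence"]                    # placeholder concept strings
THRESHOLDS = torch.tensor([0.25, 0.25])                   # placeholder per-concept thresholds

def text_image_sd_check(image: Image.Image) -> bool:
    txt_inputs = processor(text=NSFW_CONCEPTS, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = clip.get_text_features(**txt_inputs)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (image_embedding(image) @ txt.T).squeeze(0)    # cosine similarity per concept
    return bool((sims > THRESHOLDS).any())                # True means the image is blocked
```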
Baselines:
Adversarial attack: This category includes two state-of-the-art adversarial attack techniques designed for NLP classification tasks: PWWS and TextFooler. These techniques are used to generate adversarial examples specifically for text-based models. We use our cross-check oracle to determine whether the adversarial examples result in images containing NSFW content.
Adversarial prompting: We include SneakyPrompt and SurrogatePrompt, the most recent techniques for evaluating the robustness of safety checkers in T2I models. SneakyPrompt employs reinforcement learning to explore alternative tokens and formulate new prompts, aiming to achieve high CLIP embedding similarity between the initial prompt and the synthesized image. SurrogatePrompt uses an LLM to find alternatives for sensitive words in the input prompt, reducing the likelihood of triggering the safety checkers.
Metrics:
Bypass rate: This metric quantifies the percentage of prompts that successfully bypass the target safety checker while still yielding synthesized images containing NSFW content. A higher bypass rate indicates that the method is more effective at generating adversarial prompts for a larger proportion of seed prompts.
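As a sketch, the bypass rate over a set of seed prompts could be computed as follows; attack, generate_image, safety_checker, and is_nsfw_image are hypothetical stand-ins for the attack under test, the T2I model, the target checker, and the cross-check oracle.

```python
# Illustrative only: all function arguments are hypothetical stand-ins.
def bypass_rate(seed_prompts, attack, generate_image, safety_checker, is_nsfw_image):
    bypassed = 0
    for prompt in seed_prompts:
        adv_prompt = attack(prompt)                 # adversarial prompt, or None on failure
        if adv_prompt is None:
            continue
        image = generate_image(adv_prompt)
        # Count a success only if the checker is bypassed AND the image is NSFW.
        if not safety_checker(adv_prompt, image) and is_nsfw_image(image):
            bypassed += 1
    return 100.0 * bypassed / len(seed_prompts)     # percentage of seed prompts
```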
Query number: This metric represents the total number of queries made to the T2I model and the safety checker before an adversarial prompt is successfully generated. It assesses the efficiency of each tool.
CLIP score: This metric measures the cosine similarity between the CLIP embedding of the prompt and that of the synthesized image. A higher CLIP score indicates that the synthesized image aligns more closely with the prompt's description, signifying higher synthesis quality.
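A minimal sketch of this computation, assuming the openai/clip-vit-base-patch32 checkpoint (the checkpoint used in the actual evaluation may differ):

```python
# Sketch only: the checkpoint name is illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    inputs = proc(text=[prompt], images=image, return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = txt / txt.norm(dim=-1, keepdim=True)       # normalize both embeddings
    img = img / img.norm(dim=-1, keepdim=True)
    return float((txt * img).sum())                  # cosine similarity
```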
Time: This metric represents the total time taken until an adversarial prompt is successfully generated. It provides a more thorough measure of each tool's efficiency.
Results:
TokenProber significantly outperforms the adversarial attack techniques and the state-of-the-art SneakyPrompt in generating adversarial prompts. The cross-check is a useful oracle for determining whether adversarial prompts lead to NSFW content.