Text-to-image (T2I) models have advanced significantly in producing high-quality images. However, such models can also generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate this risk, refusal mechanisms, i.e., safety checkers, have been developed to detect potential NSFW content. Adversarial prompting techniques have in turn been developed to evaluate the robustness of these refusal mechanisms. The key challenge remains to subtly modify a prompt so that it preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a sensitivity-aware differential testing method for evaluating the robustness of refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments of T2I models and safety checkers. Through sensitivity-aware mutation, TokenProber generates adversarial prompts that strike a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., a 54%+ increase on average), highlighting TokenProber's ability to uncover robustness issues in existing refusal mechanisms.
Text-to-Image (T2I) models have gained widespread attention due to their excellent capability in synthesizing high-quality images. T2I models such as Stable Diffusion and DALL-E process the textual descriptions provided by users, namely prompts, and output images that match the descriptions. Such models have been widely used to generate various types of images; for example, Lexica [] contains more than five million images generated by Stable Diffusion.
With the popularity of T2I models, ethical concerns about the safety of synthesized image content have gradually emerged, i.e., T2I models may synthesize images containing not-safe-for-work (NSFW) content. To prevent T2I models from synthesizing offensive or inappropriate images, the typical approach is to deploy a safety checker that filters NSFW prompts and the corresponding images. However, existing works have demonstrated that safety checkers can be bypassed by manually crafted or automatically optimized prompts (a.k.a. jailbreak prompts), which highlights the need for robustness evaluation before deploying the safety checkers or T2I models. Assessing the robustness of safety checkers in T2I models amounts to generating adversarial prompts that jailbreak T2I models, i.e., leading T2I models to produce images containing NSFW content that nevertheless remain undetected by the safety checker. As depicted in Figure 1, the crux of adversarial prompts lies in the discrepancy between the safety checker's and the T2I model's decisions: the safety checker fails to predict the prompt as problematic, yet the T2I model proceeds to produce sensitive images with NSFW content. This inconsistency reveals vulnerabilities in the safety measures implemented in T2I systems.
To tackle this key challenge, TokenProber performs a fine-grained analysis of the influence of individual words, enabling us to balance between preserving the sensitive semantics needed for NSFW content generation by the T2I model and minimizing the negative signals that trigger detection by safety checkers (i.e., keeping prompts within the inconsistency zone depicted in Fig. 1). TokenProber identifies two key types of words to navigate this balance: Dirty Words, which are essential for retaining the NSFW content and should not be removed, and Discrepant Words, which are not inherently dirty (e.g., neutral or positive) but significantly affect the safety checker's predictions in a negative manner. Essentially, discrepant words reveal a lack of robustness in the checker's ability to accurately process these words. By mitigating the influence of discrepant words, the checker is more likely to assess the prompt positively. By safeguarding dirty words while diminishing the influence of discrepant words, TokenProber achieves an equilibrium that enables the creation of adversarial prompts exploiting the discrepancy between the decisions of the safety checker and the T2I model.
Fig. 2 depicts the overview of TokenProber. TokenProber takes as inputs an initial prompt that describes an NSFW image scene, the T2I model, and the target safety checker. For simplicity of presentation, the model and target safety checker are omitted from the figure. Note that the output of the initial prompt is blocked by the target safety checker under test. TokenProber operates in two main steps. In the first step, TokenProber analyzes the impact of individual words on the generation of NSFW content and their influence on the safety checker's prediction. This involves examining each word for its "Dirtiness" and "Discrepancy": Dirtiness Analysis assesses whether a word predominantly contributes to NSFW content generation (e.g., "naked"), while Discrepancy Analysis evaluates the word's effect on the safety checker's robustness, as sketched below.
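The following is a minimal sketch of one possible realization of this word-level sensitivity analysis via leave-one-out ablation. The callables `nsfw_score` (a stand-in for a measure of how strongly the T2I model's output exhibits NSFW content) and `checker_score` (the target safety checker's flag score, higher meaning more likely to be blocked) are hypothetical names introduced here for illustration and are not part of TokenProber's released artifacts; the exact scoring used by the tool may differ.

```python
# Illustrative sketch (not the paper's implementation) of word-level
# dirtiness/discrepancy estimation by leave-one-out word removal.
from typing import Callable, List, Tuple


def word_sensitivity(
    prompt: str,
    nsfw_score: Callable[[str], float],      # hypothetical: NSFW-ness of generated image
    checker_score: Callable[[str], float],   # hypothetical: checker flag score
) -> List[Tuple[str, float, float]]:
    """Return (word, dirtiness, discrepancy) for each word in the prompt."""
    words = prompt.split()
    base_nsfw = nsfw_score(prompt)
    base_check = checker_score(prompt)

    results = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        # Dirtiness: how much NSFW content generation drops without this word.
        dirtiness = base_nsfw - nsfw_score(ablated)
        # Discrepancy: how much the checker's flag score drops without the word,
        # beyond what its contribution to NSFW generation would explain. A high
        # value suggests the word mainly triggers the checker, not the content.
        discrepancy = (base_check - checker_score(ablated)) - dirtiness
        results.append((word, dirtiness, discrepancy))
    return results
```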
Based on the insights from the sensitivity analysis, TokenProber performs a fuzzing strategy to optimize the target prompt. Two mutation strategies are proposed to refine the target prompt. Dirtiness-Preserving Mutation looks for semantically similar words to replace dirty ones, maintaining the NSFW content generation potential while possibly challenging the safety checker's robustness. Discrepancy-Away Mutation replaces discrepant words with alternatives that decrease their negative impact on safety checks, thus improving the prompt's potential to bypass the safety checker.
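A simplified sketch of the two mutation operators follows. It assumes candidate replacements come from a user-supplied `synonyms` dictionary and reuses the hypothetical `checker_score` callable from above; the paper does not prescribe these exact data structures, so this is an assumption made for illustration only.

```python
# Sketch of the two mutation operators, under simplifying assumptions:
# replacement candidates are drawn from a `synonyms` dictionary supplied
# by the caller (not specified by the paper).
import random
from typing import Callable, Dict, List


def dirtiness_preserving_mutation(
    words: List[str], dirty_idx: int, synonyms: Dict[str, List[str]]
) -> List[str]:
    """Swap a dirty word for a semantically similar alternative so the NSFW
    semantics survive while the surface form seen by the checker changes."""
    candidates = synonyms.get(words[dirty_idx], [])
    if not candidates:
        return words
    mutated = list(words)
    mutated[dirty_idx] = random.choice(candidates)
    return mutated


def discrepancy_away_mutation(
    words: List[str],
    discrepant_idx: int,
    synonyms: Dict[str, List[str]],
    checker_score: Callable[[str], float],
) -> List[str]:
    """Replace a discrepant word with the candidate that most reduces the
    target checker's flag score (lower score = more likely to pass)."""
    candidates = synonyms.get(words[discrepant_idx], [])
    if not candidates:
        return words
    best = min(
        candidates,
        key=lambda c: checker_score(
            " ".join(words[:discrepant_idx] + [c] + words[discrepant_idx + 1:])
        ),
    )
    mutated = list(words)
    mutated[discrepant_idx] = best
    return mutated
```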
Since no perfect oracle exists to definitively judge whether the image generated from a new prompt contains NSFW content, TokenProber performs cross-checking with a reference safety checker. A prompt deemed safe by the target checker but flagged by the reference checker is considered a potential adversarial prompt. The selection of the best candidate prompt from the generated mutations is guided by the objective of maximizing the safety score difference between the target and reference checkers. This iterative process continues until a successful adversarial prompt is generated or the maximum number of iterations is reached.
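The iterative cross-checking loop can be sketched as follows. Here `mutate` stands for one application of the mutation operators above, `target_score` and `reference_score` are the hypothetical flag scores of the target and reference checkers (higher meaning more likely to be blocked), and `threshold` is an assumed decision boundary; none of these names are taken from the released code, and the actual search procedure may differ in its details.

```python
# Sketch of the cross-checking fuzzing loop: keep the mutant that maximizes
# the score gap between the reference and target checkers, and stop when a
# prompt passes the target checker but is flagged by the reference checker.
from typing import Callable, Optional


def differential_fuzz(
    prompt: str,
    mutate: Callable[[str], str],
    target_score: Callable[[str], float],
    reference_score: Callable[[str], float],
    threshold: float = 0.5,          # assumed decision boundary
    max_iters: int = 100,
    candidates_per_iter: int = 8,
) -> Optional[str]:
    """Search for a potential adversarial prompt via differential testing."""
    current = prompt
    for _ in range(max_iters):
        # Generate several mutants and keep the one maximizing the safety
        # score difference between the reference and target checkers.
        mutants = [mutate(current) for _ in range(candidates_per_iter)]
        best = max(mutants, key=lambda p: reference_score(p) - target_score(p))

        # Cross-check: deemed safe by the target but flagged by the reference
        # -> report as a potential adversarial prompt.
        if target_score(best) < threshold and reference_score(best) >= threshold:
            return best
        current = best
    return None
```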
Code: https://anonymous.4open.science/r/TokenProber-ICSE2025
Data: https://drive.google.com/drive/folders/11kjfdVYIL8_73lQnXzSn8hmHpjCRut5x