Due to time constraints, the experiments were only conducted on the 4chan dataset, the TI-SD safety checker, and Stable-Diffusion-v1.4. We will include the complete set of experiments in the revised version.
We further explain our work's connection to SE from the perspective of actionable insights for various stakeholders.
For SE researchers, testing the rejection mechanisms of text-to-image models can deepen their understanding of these models, guide model improvement, and promote the adoption of more robust models;
For software engineers and practitioners, rigorous testing ensures the reliability of the model, prevents the generation of NSFW content, and avoids unintended outputs in real-world applications.
For policymakers, the results of testing the rejection mechanism of text-to-image models offer insight into model safety, which can inform regulations that ensure the proper application of AI technology.
We show example images generated by Stable-Diffusion-v1.4, DreamLike, Stable-Diffusion-v1.5, and DALL·E for the same prompts (5 SFW and 5 NSFW). We observe the following:
For the same prompt, the images generated by different models vary significantly, especially for NSFW prompts (our task);
The open-source DALL·E generates images of lower quality.
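For reference, a comparison of this kind can be reproduced with a sketch along the following lines. The Hugging Face checkpoint IDs below are illustrative assumptions (the open-source DALL·E reimplementation is omitted), not necessarily the exact configuration used in our experiments.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint IDs; the exact weights we used may differ.
checkpoints = [
    "CompVis/stable-diffusion-v1-4",
    "runwayml/stable-diffusion-v1-5",
    "dreamlike-art/dreamlike-photoreal-2.0",
]
prompt = "an example prompt"  # placeholder, not a prompt from our study

for ckpt in checkpoints:
    pipe = StableDiffusionPipeline.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
    image = pipe(prompt, num_inference_steps=50).images[0]  # one sample per model
    image.save(ckpt.split("/")[-1] + ".png")
```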
We use the CLIP score to quantify the image generation quality of each model. We computed the average CLIP score of the images generated for 500 prompts randomly selected from the dataset used in our paper. The results are shown in Table 1. DALL·E has the lowest CLIP score, and its generation quality shows a clear gap compared to the other three models.
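A minimal sketch of how the CLIP score can be computed with Hugging Face Transformers is given below; the openai/clip-vit-base-patch32 backbone and the image paths are illustrative assumptions (the score is 100 times the cosine similarity between the image and text embeddings).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """100 x cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum().item()

# Average over the sampled (image, prompt) pairs; placeholder list for illustration.
pairs = [("image_0.png", "an example prompt")]
avg_clip = sum(clip_score(p, t) for p, t in pairs) / len(pairs)
```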
To further alleviate the reviewers' concerns, we have included experiments with three more open-source models: OpenJourney [18], Protogen X3.4 [19], and Stable-Diffusion-v3.5 []. The results, shown in Table 2, demonstrate that our method remains efficient on these models.
"We use the default settings for both adversarial attacks. Due to time constraints, we only completed experiments on Stable-Diffusion-v1.4, the 4chan dataset, and TI-SD. The bypass rate results are shown in Table 3. The results confirm that our method remains more effective, further demonstrating the effectiveness of our novel methodologies (average bypass rate of 0.04)."
We will include the complete experimental results in the revised version.
We compared the naturalness of prompts generated by the methods with a bypass rate greater than 0.1, namely TokenProber (0.68), SneakyPrompt (0.18), and SneakyPrompt_c (0.26). We randomly selected 500 prompts from those generated by each method and calculated the average perplexity of each prompt using Llama2-7b-chat-hf. The results are shown in Table 4 and indicate that our method achieves better naturalness (315.2 vs. 1024.2).
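A minimal sketch of the perplexity computation with Llama2-7b-chat-hf via Hugging Face Transformers is shown below; the prompt list is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16,
                                             device_map="auto").eval()

def perplexity(prompt: str) -> float:
    """exp of the mean token-level negative log-likelihood of the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over the prompt's tokens
    return torch.exp(loss).item()

prompts = ["an example adversarial prompt"]  # placeholder sample
avg_ppl = sum(perplexity(p) for p in prompts) / len(prompts)
```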
In previous works [1-3], common sentence-level mutation methods include modifying multiple words, adding extra sentences, and changing sentence structure. We used the method of modifying multiple words for sentence-level mutation, while the latter two methods require task-specific design of mutations, which we leave as future work. We modified 3 words per mutation, and the results are shown in Table 5. The results indicate that sentence-level mutation is inefficient and that the images generated from its adversarial prompts match the original prompts poorly.
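A simplified sketch of the multi-word replacement used for sentence-level mutation is given below; using WordNet synonyms as the replacement source is an assumption for illustration, and the exact replacement strategy in our experiments may differ.

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand

def mutate_sentence(prompt: str, num_words: int = 3) -> str:
    """Replace up to `num_words` randomly chosen words with a WordNet synonym."""
    words = prompt.split()
    for i in random.sample(range(len(words)), k=min(num_words, len(words))):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:  # keep the original word if no synonym is available
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)
```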
In Table 3 of our paper, we show how many of the images successfully generated by TokenProber actually contain NSFW content. TokenProber determines whether a generation is successful using a surrogate model; images found not to contain NSFW content during manual inspection are considered false positives (FPs) of the surrogate model. The FP rate of the surrogate model can therefore be calculated from Table 3 in the paper by subtracting the proportion of images containing NSFW content from 1. The results are shown in Table 6.
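In formula form, based on the quantities reported in Table 3 of the paper: FP rate of the surrogate model = 1 − (# images manually confirmed to contain NSFW content) / (# images the surrogate model judged successful).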
We have added an experiment that performs the word-level sensitivity analysis in each iteration. The results are shown in Table 7. They demonstrate that running the analysis only once achieves a similar bypass rate (0.87 vs. 0.82) while being much faster (212.38s vs. 985.44s).
[1] Wang, Yicheng, and Mohit Bansal. "Robust machine comprehension models via adversarial training." arXiv preprint arXiv:1804.06473 (2018).
[2] Iyyer, Mohit, et al. "Adversarial example generation with syntactically controlled paraphrase networks." arXiv preprint arXiv:1804.06059 (2018).
[3] Behjati, Melika, et al. "Universal adversarial attacks on text classifiers." ICASSP (2019).