SNEAKYPROMPT: This approach uses reinforcement learning to iteratively refine adversarial prompts. By repeatedly querying text-to-image models, SneakyPrompt aims to induce the generation of NSFW content, testing the robustness of the models' safety filters.
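The query-and-refine loop can be illustrated with a minimal sketch. The safety filter here is a toy keyword blocklist and the candidate substitutions are hypothetical; SneakyPrompt itself treats the deployed filter as a black box and trains a policy over token perturbations, whereas this sketch uses a simple greedy, reward-guided search (reward = whether the filter passes).

```python
import random

# Toy stand-in for a text-to-image safety filter: it blocks any prompt
# containing a blocklisted word. A real filter (e.g. DALL·E's) is a
# black box queried through the API.
BLOCKLIST = {"forbidden"}

def safety_filter_blocks(prompt: str) -> bool:
    return any(word in BLOCKLIST for word in prompt.split())

# Hypothetical candidate perturbations per sensitive token (the
# action space of the search).
CANDIDATES = {"forbidden": ["f0rbidden", "forbldden", "fobridden"]}

def reward_guided_search(prompt: str, max_queries: int = 20, seed: int = 0):
    """Greedy sketch of SneakyPrompt's loop: perturb blocked tokens and
    re-query the filter until it passes or the query budget runs out."""
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_queries):
        if not safety_filter_blocks(" ".join(tokens)):
            return " ".join(tokens)  # filter bypassed: reward = 1
        # pick the first blocked token and sample a replacement (the "action")
        idx = next(i for i, t in enumerate(tokens) if t in BLOCKLIST)
        tokens[idx] = rng.choice(CANDIDATES[tokens[idx]])
    return None  # budget exhausted without bypassing the filter
```

The key design point mirrored here is that each filter query provides the only feedback signal, so the search must stay within a query budget.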
BAE: BAE adopts a token-manipulation strategy based on token replacement and insertion. It masks portions of the original text and uses a BERT masked language model to suggest alternative tokens that fit the masked context, testing the filters' resilience to subtle, contextually plausible linguistic changes.
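The two BAE operations (replace and insert) can be sketched as follows. The `mlm_suggest` function is a hard-coded stand-in for a real masked language model; in practice BAE would query something like a BERT fill-mask model, which is not reproduced here.

```python
# Toy stand-in for a BERT masked language model: given text containing
# a [MASK] slot, return plausible fillers ranked by likelihood. In BAE
# this would be an actual MLM such as bert-base-uncased.
def mlm_suggest(masked_text: str) -> list:
    suggestions = {
        "a [MASK] painting of a figure": ["classical", "dark", "vivid"],
    }
    return suggestions.get(masked_text, ["unknown"])

def bae_replace(tokens: list, idx: int) -> list:
    """BAE-R: mask the token at idx and substitute the MLM's top filler."""
    masked = tokens[:idx] + ["[MASK]"] + tokens[idx + 1:]
    best = mlm_suggest(" ".join(masked))[0]
    return tokens[:idx] + [best] + tokens[idx + 1:]

def bae_insert(tokens: list, idx: int) -> list:
    """BAE-I: insert a [MASK] before idx and fill it with the top filler."""
    masked = tokens[:idx] + ["[MASK]"] + tokens[idx:]
    best = mlm_suggest(" ".join(masked))[0]
    return tokens[:idx] + [best] + tokens[idx:]
```

Because the fillers come from a language model conditioned on the surrounding context, the perturbed prompt stays fluent, which is what makes these edits subtle from the filter's perspective.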
TEXTFOOLER: TEXTFOOLER employs synonym substitution to evade safety filters. It replaces important words in the text with synonyms, preserving the semantic content while altering the prompt's surface form enough to potentially bypass the safety mechanisms.
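A minimal sketch of the substitution loop, under simplifying assumptions: the synonym table is a fixed toy lookup, whereas TextFooler ranks word importance, draws synonyms from counter-fitted word embeddings, and checks semantic similarity with a sentence encoder.

```python
# Toy synonym table; TextFooler instead selects nearest neighbors in a
# counter-fitted embedding space and filters them by sentence-level
# semantic similarity.
SYNONYMS = {
    "naked": ["bare", "unclothed"],
    "photo": ["picture", "snapshot"],
}

def textfooler_style(prompt: str, filter_blocks):
    """Greedily substitute synonyms word by word until the (black-box)
    filter passes; return None if no passing variant is found."""
    if not filter_blocks(prompt):
        return prompt  # already passes, nothing to do
    tokens = prompt.split()
    for i, tok in enumerate(tokens):
        for syn in SYNONYMS.get(tok, []):
            candidate = tokens[:i] + [syn] + tokens[i + 1:]
            if not filter_blocks(" ".join(candidate)):
                return " ".join(candidate)  # successful evasion
            tokens = candidate  # keep the substitution and continue
    return None
```

Because only individual words change and each replacement is a synonym, the prompt's meaning is largely preserved even when several words end up substituted.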
NSFW-1k: In contrast to the comprehensive content compliance checks provided by OpenAI, the currently available public NSFW prompt datasets cover only obscene content. To test Treant more thoroughly, we construct our own dataset, denoted NSFW-1k. Building on prior work and taking inspiration from a Reddit post, we use ChatGPT to generate 100 target prompts for each of 11 scenarios prohibited by OpenAI's content policy, focusing on NSFW content. This yields a total of 1,100 adversarial prompts.
NSFW-200: We also conducted extensive testing on the NSFW-200 dataset proposed by SneakyPrompt, which consists of 200 prompts with obscene content.
This table reports the aggregate success rates of bypassing the safety filter of DALL·E 3 across the prohibited scenarios in NSFW-1k, using different adversarial testing techniques.
This table reports the aggregate success rates of bypassing the safety filters of DALL·E 3 and three versions of Stable Diffusion on NSFW-200 (from SneakyPrompt), using different adversarial testing techniques.