We release the source code of TREANT for further evaluation.
Because no NSFW adversarial prompt dataset is publicly available for testing text-to-image models, we created our own adversarial prompt dataset to evaluate safety filters. Building on the approach of previous work, we took inspiration from a Reddit post and used ChatGPT to generate 100 target prompts for each of 11 scenarios prohibited by OpenAI's content policy, focusing specifically on NSFW content. This yields a total of 1,100 adversarial prompts.
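The generation loop described above can be sketched as follows. This is a minimal, hypothetical illustration: the scenario names are placeholders (not OpenAI's actual policy category labels), and the `generate_prompts` helper stands in for the real ChatGPT query step, which we do not reproduce here.

```python
# Hypothetical sketch of the dataset-generation loop.
# Scenario names and generate_prompts() are illustrative placeholders,
# not the actual pipeline or OpenAI policy categories.

PROHIBITED_SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 12)]  # 11 scenarios
PROMPTS_PER_SCENARIO = 100

def generate_prompts(scenario: str, n: int = PROMPTS_PER_SCENARIO) -> list[str]:
    # In the real pipeline this step queries ChatGPT for n target prompts
    # for the given prohibited scenario; placeholders keep the sketch runnable.
    return [f"[{scenario} adversarial prompt #{i}]" for i in range(n)]

dataset = {s: generate_prompts(s) for s in PROHIBITED_SCENARIOS}
total = sum(len(prompts) for prompts in dataset.values())
print(total)  # 11 scenarios x 100 prompts each = 1100
```

Each scenario's prompts are kept together, so the resulting dataset can be evaluated per prohibited category as well as in aggregate.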
We present our evaluation data here; please email the authors for the password.
Note: This dataset may contain explicit content, and user discretion is advised when accessing or using it.
Do not use this dataset for any non-research purpose.
Do not distribute or publish any portion of the data.