The increasing prevalence of text-to-image (T2I) models makes their safety a critical concern. Adversarial testing techniques have been developed to probe whether such models can be prompted to produce Not-Safe-For-Work (NSFW) content. Despite these efforts, current solutions face several challenges, such as low success rates, inefficiency, and a lack of semantic understanding. To address these challenges, we introduce TREANT, a novel automated red-teaming framework for adversarial testing of T2I models. The core of our framework is a tree-based semantic transformation. We employ Semantic Decomposition and Sensitive Element Drowning strategies in conjunction with Large Language Models (LLMs) to systematically refine adversarial prompts for effective testing. Our comprehensive evaluation confirms the efficacy of Semantic Decomposition, which not only exceeds the performance of state-of-the-art approaches but also achieves an overall success rate of 88.5% on leading T2I models, including DALL·E 3 and Stable Diffusion.
We introduce the Prompt Parse Tree (PPT), a novel structure for encoding the relationships and attributes of objects in prompts. Its design is inspired by the parse tree concept from natural language processing.
A parse tree, defined with respect to a grammar G = (V, Σ, R, S), comprises nodes labeled with nonterminal symbols (V) and terminal symbols (Σ), where R is the set of production rules and S is the start symbol. The tree's yield, Yield(T), is the string w formed by concatenating the labels of its leaves (terminal symbols or the empty string) from left to right. Building upon this definition, we formally define the Prompt Parse Tree (PPT) as a hierarchical structure composed of three distinct node types:
Object Nodes: these explicitly represent the actual objects to be depicted in the image.
Attribute Nodes: these describe the characteristics or qualities of objects, acting as the descriptions or modifiers attached to the objects mentioned in the prompt.
Relation Nodes: these map the relationships between objects, or between their sub-components, within the prompt. They become crucial when complex objects are broken down into sub-elements, because they make the connections between those sub-elements explicit.
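The three node types above can be illustrated with a minimal sketch of a tree data structure. This is not taken from the TREANT implementation: the class names, the `render` method, and the `tree_yield` helper are hypothetical, and the sketch assumes that attribute nodes hang off object nodes and that each relation node links exactly two sub-trees. The example prompt is an arbitrary, harmless one chosen only to show the structure.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class AttributeNode:
    """A characteristic or modifier of an object (hypothetical class)."""
    text: str


@dataclass
class ObjectNode:
    """An object to be depicted, with zero or more attribute children."""
    name: str
    attributes: List[AttributeNode] = field(default_factory=list)

    def render(self) -> str:
        # Assumption: attributes are emitted before the object name.
        words = [a.text for a in self.attributes] + [self.name]
        return " ".join(words)


@dataclass
class RelationNode:
    """A relationship linking two sub-trees (objects or nested relations)."""
    label: str
    left: "Node"
    right: "Node"

    def render(self) -> str:
        # Left sub-tree, then the relation label, then the right sub-tree.
        return f"{self.left.render()} {self.label} {self.right.render()}"


Node = Union[ObjectNode, RelationNode]


def tree_yield(root: Node) -> str:
    """Read the tree left to right, analogous to Yield(T) for a parse tree."""
    return root.render()


ball = ObjectNode("ball", [AttributeNode("red")])
table = ObjectNode("table", [AttributeNode("wooden")])
prompt_tree = RelationNode("on", ball, table)

print(tree_yield(prompt_tree))  # red ball on wooden table
```

Keeping objects, attributes, and relations as separate node types means each part of a prompt can be inspected or rewritten on its own before the prompt text is regenerated from the tree.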
In TREANT, we implement Semantic Decomposition to alter prompts generated from the PPT so that they evade text safety filters. This technique is grounded in the observation that sensitive elements can be transformed into non-sensitive ones to circumvent these filters. The rationale behind this approach lies in the localized nature of neural networks' attention mechanisms, which struggle with densely clustered elements. Our strategy therefore addresses both the text and the image side. For textual content, we find that diluting sensitivity is effective: we disassemble highly sensitive parts of the text and then disperse them throughout the entire prompt. This process reduces the concentration of sensitive elements, facilitating their passage through safety filters.
Once text safety filters are circumvented, the resulting images may still be subject to image safety filters, which are often more rigorous. Text-to-image models can generate multiple canvases simultaneously. This feature enables us to submerge sensitive elements on one canvas while inundating the other canvases with a plethora of non-sensitive elements, which may overload image safety filters. To avoid diluting the intended target image with irrelevant elements, our method explicitly instructs the model, via the prompt, to divide the image into several canvases. We then populate these separate canvases with non-sensitive elements, dedicating a single canvas to the target image. This prompt augmentation is contextually independent of the original adversarial intent, so it can be integrated seamlessly to create an augmented prompt.