Intriguing Properties of Diffusion Models:
An Empirical Study of the Natural Attack Capability in
Text-to-Image Generative Models
(To Appear at CVPR 2024)

Summary

Denoising probabilistic diffusion models have shown breakthrough performance in generating photo-realistic images and human-level illustrations, surpassing prior models such as GANs. This image-generation capability has stimulated many downstream applications across various domains. However, we find that this technology is a double-edged sword: we identify a new type of attack, the Natural Denoising Diffusion (NDD) attack, based on the finding that state-of-the-art deep neural network (DNN) models still hold their predictions even when we intentionally remove, via text prompts, the robust features that are essential to the human visual system (HVS). The NDD attack can generate low-cost, model-agnostic, and transferable adversarial attacks by exploiting the natural attack capability in diffusion models. To systematically evaluate this risk, we perform a large-scale empirical study on our newly created dataset, the Natural Denoising Diffusion Attack (NDDA) dataset, and evaluate the natural attack capability by answering 6 research questions. Through a user study, we find that the attack achieves an 88% detection rate while remaining stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against a Tesla Model 3 and find that 73% of the physically printed attacks are detected as stop signs. We hope this study and dataset can raise our community's awareness of the risks in diffusion models and facilitate further research toward robust DNN models.

Dataset and Supplementary Materials

Motivation

Figure 1 shows representative examples that motivate this work. We generate stop sign images using state-of-the-art diffusion models with text prompts that intentionally break the fundamental properties humans use to identify stop signs, such as the red color and octagonal shape, while the prompt still contains the object name (stop sign). As shown, the diffusion models strictly follow our instructions and generate images that would not generally be recognized as stop signs, since a legitimate stop sign should not be blue or rectangular. Surprisingly, many popular deep neural network (DNN)-based object detectors still recognize these examples as stop signs with high confidence. We define this new attack as the Natural Denoising Diffusion (NDD) attack. These results suggest that object detectors may rely on imperceptible features embedded by diffusion models, and lead us to ask:

Do the images generated by the diffusion model have a natural attack capability against the DNN models?
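The recipe behind these motivating examples is deliberately simple: keep the object name in the prompt, but override the features humans rely on. As a rough illustration (not the paper's exact setup), the sketch below assumes the Hugging Face diffusers library and the publicly available runwayml/stable-diffusion-v1-5 checkpoint:

import torch
from diffusers import StableDiffusionPipeline

# Load an off-the-shelf text-to-image diffusion model (assumed checkpoint).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt keeps the object name ("stop sign") but contradicts the robust
# features humans rely on: red color, octagonal shape, and the word "STOP".
prompt = "blue square stop sign with 'hello' on it"
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("ndd_candidate.png")  # candidate NDD attack image to test against detectors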

New dataset: Natural Denoising Diffusion Attack (NDDA) dataset

To systematically evaluate the natural attack capability of diffusion models, we construct a new large-scale dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. The design goal of the NDDA dataset is to systematically collect images with and without the robust features upon which human perception relies. The major challenge is identifying which robust features are important in the human visual system (HVS) for object detection. Although the mechanism of the HVS is not fully understood, prior work recognizes shape, texture, and color as the most important features the HVS uses to identify objects. To further explore the motivating examples in Fig. 1, we decompose texture into text and pattern because text carries special meaning for human perception. In this study, we therefore define 4 robust features: shape, color, text, and pattern. We select 3 object classes from the COCO dataset: a stop sign as an artificial sign, a fire hydrant as an artificial object, and a horse as a natural object. We select these 3 classes because they showed relatively higher detection rates than others in our preliminary experiments on generated images. We adopt COCO classes to simplify the experiments, as we can utilize the many existing models pretrained on the COCO dataset. Fig. 2 shows an overview of our dataset. More details are in the supplementary materials.
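To make the "with and without robust features" idea concrete, prompts can be composed by ablating subsets of the 4 robust features per object class. The snippet below is only an illustrative sketch: the attribute values (e.g., "blue", "square") are hypothetical placeholders, and the actual prompt templates used for the NDDA dataset are described in the supplementary materials.

from itertools import combinations

CLASSES = ["stop sign", "fire hydrant", "horse"]

# Hypothetical replacement attributes that contradict each robust feature.
FEATURE_OVERRIDES = {
    "shape":   "square",
    "color":   "blue",
    "text":    "with 'hello' written on it",
    "pattern": "with checkerboard paint on it",
}

def build_prompts(obj):
    """Yield one prompt per subset of robust features to override."""
    feats = list(FEATURE_OVERRIDES)
    for r in range(len(feats) + 1):
        for subset in combinations(feats, r):
            prefix = [FEATURE_OVERRIDES[f] for f in subset if f in ("shape", "color")]
            suffix = [FEATURE_OVERRIDES[f] for f in subset if f in ("text", "pattern")]
            yield " ".join(prefix + [obj] + suffix)

for prompt in build_prompts("stop sign"):
    print(prompt)  # e.g., "square blue stop sign with 'hello' written on it"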

Attack demos on Tesla Model 3

Each demo is captioned with the text prompt used to generate the sign:

- Normal Stop Sign
- "realistic street view with a pink square stop sign with a circle and 'bad' on it"
- "blue stop sign with 'hello' on it"
- "blue square stop sign with 'hello' on it and checkerboard paint on it"
- "spring green diamond shape stop sign with a diamond on it"
- "realistic street view with pink diamond shape stop sign circle on it"
- "stop sign"
- "stop sign"
- "realistic street view with a pentagon shape red stop sign with a ghost and 'Halloween' on it"

Research paper

To appear in the Conference on Computer Vision and Pattern Recognition (CVPR), June 2024 (acceptance rate: 23.6%, 2,719/11,532)

@InProceedings{sato2024intriguing,
  author    = {Takami Sato and Justin Yue and Nanze Chen and Ningfei Wang and Qi Alfred Chen},
  title     = {{Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models}},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}

Team