Diffuse, Attend, and Segment

Junjiao Tian*^, Lavisha Aggarwal*, Andrea Colaco*, Zsolt Kira^,
Mar Gonzalez-Franco*

Google*, Georgia Institute of Technology^


Motivation

Producing quality segmentation masks for images is a fundamental problem in computer vision. However, building a model capable of segmenting anything in a zero-shot manner, without any annotations, remains challenging. In this project, we utilize the self-attention layers of a pre-trained stable diffusion model to achieve this goal, since the model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process that measures KL divergence among attention maps to merge them into valid segmentation masks. DiffSeg requires no training or language dependency and extracts quality segmentation masks for any image.
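To give a concrete picture of the merging criterion, here is a minimal sketch of a symmetric KL-divergence distance between two attention maps, each treated as a probability distribution over pixels. The exact normalization and thresholds used by DiffSeg are described in the paper; this snippet only illustrates the idea.

    import numpy as np

    def symmetric_kl(p, q, eps=1e-8):
        """Symmetric KL divergence, KL(p || q) + KL(q || p), between two
        attention maps flattened and normalized into pixel distributions."""
        p = p.ravel() / (p.sum() + eps)
        q = q.ravel() / (q.sum() + eps)
        kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
        kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
        return kl_pq + kl_qp

    # Two similar attention maps should have a small distance,
    # two maps attending to different regions a large one.
    a = np.outer(np.hanning(32), np.hanning(32))   # peaked in the center
    b = np.roll(a, 16, axis=1)                     # peaked elsewhere
    print(symmetric_kl(a, a + 1e-3), symmetric_kl(a, b))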


Method 

Diffuse: DiffSeg uses a pre-trained stable diffusion model and its self-attention layers.

Attend: The self-attention layers produce attention maps that attend to objects in an image.

Segment: DiffSeg aggregates the self-attention maps from different resolutions and iteratively merges them based on pairwise similarity. Finally, it produces a valid segmentation mask through non-maximum suppression.
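To make the three steps concrete, below is a rough, illustrative sketch of the aggregate-merge-NMS pipeline. It assumes the self-attention maps have already been extracted from the diffusion model; the nearest-neighbor upsampling, the merge_threshold value, and the averaging rule are placeholders rather than the paper's exact settings.

    import numpy as np

    def symmetric_kl(p, q, eps=1e-8):
        # Same pairwise distance as in the sketch above.
        p, q = p.ravel() / (p.sum() + eps), q.ravel() / (q.sum() + eps)
        return (np.sum(p * np.log((p + eps) / (q + eps))) +
                np.sum(q * np.log((q + eps) / (p + eps))))

    def aggregate(maps_by_resolution, size=(64, 64)):
        """Upsample attention maps from every resolution to a common grid
        (nearest-neighbor repetition here) and collect them as proposals."""
        proposals = []
        for maps in maps_by_resolution:          # e.g. lists of (16,16), (32,32) arrays
            for m in maps:
                ry, rx = size[0] // m.shape[0], size[1] // m.shape[1]
                proposals.append(np.repeat(np.repeat(m, ry, axis=0), rx, axis=1))
        return proposals

    def merge_and_segment(proposals, merge_threshold=1.0):
        """Iteratively fuse attention maps whose symmetric KL distance is
        small, then assign every pixel to its strongest surviving map (NMS)."""
        proposals = [m / m.sum() for m in proposals]
        merged = True
        while merged:
            merged = False
            for i in range(len(proposals)):
                for j in range(i + 1, len(proposals)):
                    if symmetric_kl(proposals[i], proposals[j]) < merge_threshold:
                        fused = proposals[i] + proposals[j]
                        proposals[i] = fused / fused.sum()
                        del proposals[j]
                        merged = True
                        break
                if merged:
                    break
        return np.argmax(np.stack(proposals), axis=0)   # (H, W) label map

    # Toy usage with random maps standing in for real self-attention.
    rng = np.random.default_rng(0)
    maps = aggregate([[rng.random((16, 16)) for _ in range(8)],
                      [rng.random((32, 32)) for _ in range(8)]])
    labels = merge_and_segment(maps)
    print(labels.shape, labels.max() + 1)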

Key Contributions

Quantitative Experiments 

On COCO-Stuff-27, our method surpasses a prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.

Evaluation on COCO-Stuff-27. Abbreviations: Language Dependency (LD), Auxiliary Images (AX), Unsupervised Adaptation (UA). We additionally report baseline results using K-Means. K-Means-C (constant) uses a constant number of clusters (6). K-Means-S (specific) uses a per-image number of clusters taken from the ground truth. The K-Means results use 512 x 512 resolution.
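As a rough illustration of the K-Means-C baseline described above, the sketch below clusters per-pixel features of a 512 x 512 image into a constant number of clusters (6). Which features are clustered is not specified here; raw RGB values are used purely as a placeholder assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_constant_baseline(image, n_clusters=6):
        """K-Means-C style baseline: cluster per-pixel features into a fixed
        number of clusters. RGB features are a stand-in assumption."""
        h, w, c = image.shape
        features = image.reshape(-1, c).astype(np.float32)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(features)
        return labels.reshape(h, w)

    # Toy usage on a random 512 x 512 RGB image.
    rng = np.random.default_rng(0)
    mask = kmeans_constant_baseline(rng.random((512, 512, 3)))
    print(mask.shape, len(np.unique(mask)))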

Qualitative Experiments 


Paintings from DomainNet

Images Captured by Phone

Sketches from DomainNet

Real Images from DomainNet

Bonus: Can we add labels?

Yes, we refer interested readers to our paper for more details. 

Citation

@article{tian2024diffuse,
  title={Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion},
  author={Tian, Junjiao and Aggarwal, Lavisha and Colaco, Andrea and Kira, Zsolt and Gonzalez-Franco, Mar},
  journal={CVPR},
  year={2024}
}