Diffuse, Attend, and Segment

Junjiao Tian*^, Lavisha Aggarwal*, Andrea Colaco*, Zsolt Kira^,
Mar Gonzalez-Franco*

Google*, Georgia Institute of Technology^


Motivation

Producing quality segmentation masks for images is a fundamental problem in computer vision. However, building a model capable of segmenting anything in a zero-shot manner, without any annotations, remains challenging. In this project, we utilize the self-attention layers of a pre-trained stable diffusion model to achieve this goal, since the model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process that measures KL divergence among attention maps to merge them into valid segmentation masks. DiffSeg requires no training or language dependency and extracts quality segmentation masks for any image.
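To give a concrete picture of the merging criterion, here is a minimal sketch of a symmetric KL-divergence distance between two attention maps, each treated as a probability distribution over pixels. The exact normalization and thresholds used by DiffSeg are described in the paper; this snippet only illustrates the idea.

    import numpy as np

    def symmetric_kl(p, q, eps=1e-8):
        """Symmetric KL divergence, KL(p || q) + KL(q || p), between two
        attention maps flattened and normalized into pixel distributions."""
        p = p.ravel() / (p.sum() + eps)
        q = q.ravel() / (q.sum() + eps)
        kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
        kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
        return kl_pq + kl_qp

    # Two similar attention maps should have a small distance,
    # two maps attending to different regions a large one.
    a = np.outer(np.hanning(32), np.hanning(32))   # peaked in the center
    b = np.roll(a, 16, axis=1)                     # peaked elsewhere
    print(symmetric_kl(a, a + 1e-3), symmetric_kl(a, b))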


Method 

Diffuse: DiffSeg uses a pre-trained stable diffusion model and its self-attention layers.

Attend: The self-attention layers produce attention maps that attend to objects in an image.

Segment: DiffSeg aggregates the self-attention maps from different resolutions and iteratively merges them based on pairwise similarity. Finally, it produces a valid segmentation mask through non-maximum suppression.
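To make the three steps concrete, below is a rough, illustrative sketch of the aggregate-merge-NMS pipeline. It assumes the self-attention maps have already been extracted from the diffusion model; the nearest-neighbor upsampling, the merge_threshold value, and the averaging rule are placeholders rather than the paper's exact settings.

    import numpy as np

    def symmetric_kl(p, q, eps=1e-8):
        # Same pairwise distance as in the sketch above.
        p, q = p.ravel() / (p.sum() + eps), q.ravel() / (q.sum() + eps)
        return (np.sum(p * np.log((p + eps) / (q + eps))) +
                np.sum(q * np.log((q + eps) / (p + eps))))

    def aggregate(maps_by_resolution, size=(64, 64)):
        """Upsample attention maps from every resolution to a common grid
        (nearest-neighbor repetition here) and collect them as proposals."""
        proposals = []
        for maps in maps_by_resolution:          # e.g. lists of (16,16), (32,32) arrays
            for m in maps:
                ry, rx = size[0] // m.shape[0], size[1] // m.shape[1]
                proposals.append(np.repeat(np.repeat(m, ry, axis=0), rx, axis=1))
        return proposals

    def merge_and_segment(proposals, merge_threshold=1.0):
        """Iteratively fuse attention maps whose symmetric KL distance is
        small, then assign every pixel to its strongest surviving map (NMS)."""
        proposals = [m / m.sum() for m in proposals]
        merged = True
        while merged:
            merged = False
            for i in range(len(proposals)):
                for j in range(i + 1, len(proposals)):
                    if symmetric_kl(proposals[i], proposals[j]) < merge_threshold:
                        fused = proposals[i] + proposals[j]
                        proposals[i] = fused / fused.sum()
                        del proposals[j]
                        merged = True
                        break
                if merged:
                    break
        return np.argmax(np.stack(proposals), axis=0)   # (H, W) label map

    # Toy usage with random maps standing in for real self-attention.
    rng = np.random.default_rng(0)
    maps = aggregate([[rng.random((16, 16)) for _ in range(8)],
                      [rng.random((32, 32)) for _ in range(8)]])
    labels = merge_and_segment(maps)
    print(labels.shape, labels.max() + 1)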

Key Contributions

Quantitative Experiments 

On COCO-Stuff-27, our method surpasses a prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.

Evaluation on COCO-Stuff-27. Abbreviations: Language Dependency (LD), Auxiliary Images (AX), Unsupervised Adaptation (UA). We additionally report baseline results using K-Means. K-Means-C (constant) uses a constant number of clusters (6). K-Means-S (specific) uses a per-image number of clusters taken from the ground truth. The K-Means results use 512 x 512 resolution.
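As a rough illustration of the K-Means-C baseline described above, the sketch below clusters per-pixel features of a 512 x 512 image into a constant number of clusters (6). Which features are clustered is not specified here; raw RGB values are used purely as a placeholder assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_constant_baseline(image, n_clusters=6):
        """K-Means-C style baseline: cluster per-pixel features into a fixed
        number of clusters. RGB features are a stand-in assumption."""
        h, w, c = image.shape
        features = image.reshape(-1, c).astype(np.float32)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(features)
        return labels.reshape(h, w)

    # Toy usage on a random 512 x 512 RGB image.
    rng = np.random.default_rng(0)
    mask = kmeans_constant_baseline(rng.random((512, 512, 3)))
    print(mask.shape, len(np.unique(mask)))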

Qualitative Experiments 


Paintings from DomainNet

Images Captured by Phone

Sketches from DomainNet

Real Images from DomainNet

Bonus: Can we add labels?

Yes, we refer interested readers to our paper for more details. 

Citation

@article{tian2024diffuse,
  title={Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion},
  author={Tian, Junjiao and Aggarwal, Lavisha and Colaco, Andrea and Kira, Zsolt and Gonzalez-Franco, Mar},
  journal={CVPR},
  year={2024}
}