We present an efficient and scalable clustering technique that extracts low-dimensional features from the multi-layer representations of deep models, applied across an entire dataset.

Method: 

We extend classical spectral clustering, which computes low-dimensional data embeddings by solving an eigenproblem defined by a single affinity matrix, to cluster multi-layer features across multiple images. Our approach reformulates spectral clustering as an optimization problem and computes the low-dimensional representation X using stochastic gradient descent. This formulation lets us efficiently compute coherent features across multiple layers, each with its own affinity matrix, and handle graphs that connect multiple images. Additionally, our method can serve as a holistic model-analysis tool: we demonstrate that the graph built from the Value features of the attention module encodes semantic representation, indicating a "What" visual pathway, while the graph built from the Query and Key features encodes spatial information, indicating a "Where" visual pathway.
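The optimization view can be sketched as follows: the classical eigenproblem minimizes tr(Xᵀ L X) over orthonormal X, where L is the graph Laplacian of the affinity matrix, so the same embedding can be approximated by gradient descent with a soft orthogonality penalty. This is a minimal illustrative sketch, not the paper's exact objective; the penalty weight `mu`, the learning rate, and the toy Gaussian affinity are all assumptions.

```python
import numpy as np

def spectral_embed_sgd(W, k=2, steps=2000, lr=0.01, mu=5.0, seed=0):
    """Approximate the k smallest Laplacian eigenvectors by gradient
    descent on tr(X^T L X) plus a soft orthogonality penalty.
    Illustrative sketch; the paper's exact objective may differ."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    L = np.diag(W.sum(1)) - W                  # unnormalized graph Laplacian
    X = 0.1 * rng.standard_normal((n, k))
    I = np.eye(k)
    for _ in range(steps):
        # gradient of tr(X^T L X) + mu * ||X^T X / n - I||_F^2
        grad = 2 * L @ X + mu * (4 / n) * X @ (X.T @ X / n - I)
        X -= lr * grad
    return X

# Toy demo: two well-separated point clouds
rng = np.random.default_rng(1)
pts = np.vstack([rng.standard_normal((10, 2)) * 0.2,
                 rng.standard_normal((10, 2)) * 0.2 + 5.0])
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)                          # Gaussian affinity (toy choice)
emb = spectral_embed_sgd(W)
```

Because the update only needs gradients of the objective, the same loop extends to mini-batched graphs over many layers and images, which is what makes the formulation scale.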

Results: Single Image Analysis
We visualize the eigenvectors and the regions extracted by clustering the diffusion model features for a single image. 
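Concretely, each eigenvector assigns one value per image token and can be reshaped to the patch grid for visualization, while regions can be read off by clustering the rows of the eigenvector matrix. The minimal k-means below is an illustrative stand-in for whatever clustering step is actually used; `regions_from_eigvecs` and the toy data are hypothetical.

```python
import numpy as np

def regions_from_eigvecs(vecs, n_regions, iters=20, seed=0):
    """Minimal k-means over per-token eigenvector rows; the resulting
    label vector can be reshaped to the image grid to show regions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vecs, dtype=float)
    C = X[rng.choice(len(X), n_regions, replace=False)].copy()
    for _ in range(iters):
        # assign each token to its nearest centroid, then update centroids
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for c in range(n_regions):
            if (lab == c).any():
                C[c] = X[lab == c].mean(0)
    return lab

# Toy eigenvector rows: two groups of tokens
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
labels = regions_from_eigvecs(toy, 2)
```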

We compare region quality and semantic-segmentation performance, evaluated with oracle decoding, against other approaches.

By post-processing the regions with hierarchical grouping, we obtain competitive results on unsupervised instance segmentation.
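Hierarchical grouping of this kind can be sketched as greedy agglomerative merging: repeatedly fuse the two most similar regions until the desired number remains. The cosine-similarity, average-linkage criterion below is an assumption for illustration; the paper's actual merging rule may differ.

```python
import numpy as np

def hierarchical_group(region_feats, n_groups):
    """Greedy agglomerative grouping of per-region feature vectors
    (cosine similarity, size-weighted average linkage; a sketch only)."""
    groups = [[i] for i in range(len(region_feats))]
    feats = [np.asarray(f, dtype=float) for f in region_feats]
    while len(groups) > n_groups:
        best, pair = -np.inf, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = float(feats[i] @ feats[j]
                            / (np.linalg.norm(feats[i]) * np.linalg.norm(feats[j])))
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        ni, nj = len(groups[i]), len(groups[j])
        feats[i] = (ni * feats[i] + nj * feats[j]) / (ni + nj)  # merged mean feature
        groups[i] += groups[j]
        del groups[j]; del feats[j]
    return groups

# Toy regions: the first two are similar and should merge first
merged = hierarchical_group(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]), 2)
```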

Results: Full Dataset Analysis

By extending the graph across image collections, we can compute consistent low-dimensional representations for the entire dataset. This reveals the model's behavior comprehensively, making the method a holistic model-analysis tool. Specifically, in the attention module, clustering with the VV graph produces semantic groupings, akin to the "What" visual pathway, whereas clustering with the QK graph produces spatial groupings, akin to the "Where" visual pathway.
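As a sketch of the two graph types: a QK graph can be formed from row-normalized attention weights, and a VV graph from similarities between Value features; tokens from several images are stacked row-wise so one affinity covers the whole collection. The softmax normalization and temperature `tau` here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qk_graph(Q, K):
    """Spatial ("Where") affinity: row-normalized attention weights."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]))

def vv_graph(V, tau=0.1):
    """Semantic ("What") affinity from Value-feature similarity.
    The temperature tau is an illustrative choice."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return softmax(Vn @ Vn.T / tau)

# Toy Q/K/V for 8 tokens (e.g. patches stacked from several images)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
A_qk = qk_graph(Q, K)
A_vv = vv_graph(V)
```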

We also demonstrate that the eigenvectors, ranked by their eigenvalues, encode hierarchical concepts. The leading eigenvectors, displayed in the first row, capture scene-level representation, while the trailing eigenvectors, displayed in the second row, capture object-level representation.
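The ranking itself is the standard spectral ordering: eigen-decompose the graph Laplacian and sort by eigenvalue, so the leading (smallest-eigenvalue) eigenvectors vary slowly over the graph, giving coarse, scene-level structure, while later ones vary faster. A minimal sketch, with a toy path graph standing in for the real feature graph:

```python
import numpy as np

def ranked_spectrum(W, k):
    """Return the k Laplacian eigenvectors with the smallest eigenvalues,
    leading first (np.linalg.eigh returns eigenvalues in ascending order)."""
    L = np.diag(W.sum(1)) - W
    vals, vecs = np.linalg.eigh(L)
    return vals[:k], vecs[:, :k]

# Toy graph: a 6-node path; the leading eigenvector is constant (coarsest),
# and each subsequent eigenvector oscillates faster (finer structure).
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
vals, vecs = ranked_spectrum(W, 3)
```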