Takuya Matsuyama 1,2, Shinji Nishimoto* 1,2, Yu Takagi* 1,2,3
1. Osaka University, Japan, 2. NICT, Japan, 3. NII, Japan
* Equal last author
TL;DR: Our proposed method, LaVCa, generates text captions that explain voxel selectivity, surpassing existing approaches such as one-hot vector labels and enabling a more detailed description of the properties of visual cortex voxels.
Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for the images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Moreover, a more detailed analysis of the voxel-specific properties identified by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex, as well as voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations.
(a) We construct a voxel-wise encoding model of a human subject's brain activity (measured with fMRI while the subject views images), using CLIP-Vision latent representations as features. The encoding weights are estimated via ridge regression.
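As a concrete illustration, here is a minimal sketch of step (a), assuming CLIP-Vision features and fMRI responses have already been extracted; the file names and the scikit-learn RidgeCV setup are illustrative choices, not necessarily the paper's exact configuration.

```python
# Minimal sketch of step (a); variable names and files are illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV

# X_train: (n_stimuli, d) CLIP-Vision features for the training images
# Y_train: (n_stimuli, n_voxels) measured fMRI responses
X_train = np.load("clip_vision_features_train.npy")  # hypothetical file
Y_train = np.load("fmri_responses_train.npy")        # hypothetical file

# One ridge model fit jointly over all voxels; the regularization
# strength is selected by cross-validation over a log-spaced grid.
model = RidgeCV(alphas=np.logspace(-2, 6, 9))
model.fit(X_train, Y_train)

# Encoding weights: one d-dimensional weight vector per voxel.
W = model.coef_  # shape (n_voxels, d)
```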
(b) We identify the optimal images for a given voxel by calculating the inner product between the CLIP-Vision latent representations of external image datasets and the voxel’s trained encoding weight, selecting the top-N images (the "optimal image set") that produce the highest predicted activation.
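A matching sketch of step (b), reusing the weights `W` from the sketch above; the voxel index and N are placeholder values.

```python
# Sketch of step (b): rank external images by predicted activation
# for one voxel.
import numpy as np

# W: (n_voxels, d) encoding weights from the step (a) sketch.
external_feats = np.load("clip_vision_features_external.npy")  # (n_images, d), hypothetical
voxel_idx, N = 0, 10  # illustrative voxel and set size

# Predicted activation = inner product of image features and the
# voxel's trained encoding weight.
scores = external_feats @ W[voxel_idx]       # (n_images,)
optimal_set = np.argsort(scores)[::-1][:N]   # indices of the top-N images
```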
(c) Next, we use a multimodal LLM (MLLM) to generate captions for the images in each optimal image set, making their visual content accessible to an LLM in text form.
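A sketch of step (c), assuming an OpenAI-compatible multimodal endpoint; the model name, prompt, and image URLs are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of step (c): caption each image in the optimal image set.
from openai import OpenAI

client = OpenAI()

def caption_image(image_url: str) -> str:
    """Ask an MLLM to describe one optimal image in a single sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder MLLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the content of this image in one sentence."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# One caption per image in the voxel's optimal image set.
captions = [caption_image(url) for url in optimal_image_urls]  # hypothetical URL list
```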
(d) Finally, we prompt an LLM to extract keywords from the captions, filter these keywords, and feed them into a “Sentence Composer,” producing a concise voxel caption.
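A sketch of step (d) in the same style, reusing the `captions` from the step (c) sketch; the prompts are illustrative, and LaVCa's exact keyword-filtering criterion is not reproduced here.

```python
# Sketch of step (d): keyword extraction and sentence composition.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder LLM
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Extract keywords from the per-image captions produced in step (c).
keywords = ask(
    "Extract the visual keywords shared across these captions, "
    "one per line:\n" + "\n".join(captions)
).splitlines()

# Keyword filtering would go here; LaVCa's exact criterion is not
# reproduced in this sketch.

# "Sentence Composer": merge the filtered keywords into one caption.
voxel_caption = ask(
    "Compose a single concise sentence that combines these keywords: "
    + ", ".join(keywords)
)
```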
We predict brain activity based on sentence similarity to assess how accurately voxel captions describe voxel selectivity (Figure a).
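A minimal sketch of this text-level evaluation, assuming a generic sentence-embedding model; the encoder choice and the variables for test captions and measured responses are illustrative.

```python
# Sketch of the text-level evaluation: predicted activity for a test
# image is the similarity between the voxel caption and that image's
# caption, compared against the measured voxel response.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder

cap_emb = embedder.encode([voxel_caption])        # (1, d)
test_embs = embedder.encode(test_image_captions)  # (n_test, d), hypothetical

# Cosine similarity as the predicted activation; accuracy is the
# Pearson correlation with the measured responses.
pred = (test_embs @ cap_emb.T).ravel() / (
    np.linalg.norm(test_embs, axis=1) * np.linalg.norm(cap_emb)
)
acc = np.corrcoef(pred, measured_responses)[0, 1]  # hypothetical response vector
```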
Because sentence-based evaluation can be influenced by non-visual linguistic features (e.g., sentence length), we also assess voxel selectivity using image similarity (Figure b).
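A corresponding sketch of the image-level evaluation, assuming a "voxel image" has already been generated from the voxel caption (e.g., with a text-to-image model); the CLIP checkpoint and file names are illustrative choices.

```python
# Sketch of the image-level evaluation: compare CLIP embeddings of the
# voxel image and the test images, then correlate with measured activity.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Return L2-normalized CLIP image features."""
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

voxel_emb = embed([Image.open("voxel_image.png")])  # hypothetical generated image
test_embs = embed(test_images)                      # hypothetical list of PIL images

# Cosine similarity to the voxel image serves as the predicted response.
pred = (test_embs @ voxel_emb.T).squeeze(-1).numpy()
acc = np.corrcoef(pred, measured_responses)[0, 1]   # hypothetical response vector
```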
Examples of the voxel captions and voxel images generated by LaVCa, by Top-1 (the caption of the top-1 optimal image), and by BrainSCUBA. "Acc." indicates the accuracy of the voxel captions (text level) and of the voxel images (image level), respectively.
LaVCa captions significantly predict voxel activity throughout the visual cortex (Figure a).
LaVCa exceeds BrainSCUBA's performance throughout the visual cortex (Figure b).
We also demonstrate that LaVCa can generate highly interpretable and accurate captions without sacrificing information from the optimal images.
A word-level analysis (word cloud and histogram) reveals frequent face- and person-related terms (e.g., "child," "people," "animal") alongside more diverse words (e.g., "food," "sign") (Figure a, middle and bottom).
We then project voxel captions and voxel images onto a flatmap, grouping them into eight clusters based on three UMAP dimensions (Figure b).
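A minimal sketch of this projection-and-clustering step, assuming caption embeddings from a generic sentence encoder; umap-learn and k-means stand in here for whatever the paper actually uses.

```python
# Sketch of the flatmap grouping: reduce caption embeddings to three
# UMAP dimensions (usable as RGB) and group voxels into eight clusters.
import umap  # umap-learn package
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
caption_embs = embedder.encode(voxel_captions)      # (n_voxels, d), hypothetical

# Three UMAP dimensions, e.g., mapped to RGB for the flatmap coloring.
coords = umap.UMAP(n_components=3).fit_transform(caption_embs)

# Eight clusters over the 3-D UMAP coordinates.
labels = KMeans(n_clusters=8, n_init=10).fit_predict(coords)
```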
We find that some captions relate to faces in general (e.g., "face," "person," "animal"), while particular voxels encode more fine-grained features such as "eye," "tongue," or "smiling," and others encode information like "animal," "bear," or "cardinal." Thus, even within this ROI, there appears to be substantial inter-voxel functional differentiation that extends beyond a generic "face" category.
Moreover, we observe intra-voxel diversity, where a single caption incorporates multiple ideas (e.g., “A food packaging features a smiling person and a cartoon character”), suggesting that individual voxels can simultaneously encode several distinct concepts.