FAQ

This is an FAQ about the paper. We wrote it so that a wide audience can understand our research, but we strongly recommend reading the paper itself for a full understanding.

Last updated: May 2, 2023

General

Q: Could you describe your paper briefly?

A: In this paper, we quantitatively describe the relationship between human brain activity and a type of image-generating AI called Stable Diffusion. Using this relationship, we reconstruct (decode) the images a person viewed from their brain activity without fine-tuning the model, and we use brain activity to examine the internal representations of the model (encoding).

Q: Is this the first study of brain decoding and encoding?

A: No. Both brain decoding and brain encoding have a long history of research. As introduced in our paper, there are numerous studies applying neurophysiological knowledge to decode visual experiences (Kay et al., 2008, Miyawaki et al., 2008, Naselaris et al., 2009, Nishimoto et al., 2011), and this trend has accelerated with the development of deep learning (Shen et al., 2019, and many others).

Attempts to model various stimulus-related features and correlate them with brain activity (i.e., encoding) have been made for a long time (Nishimoto et al., 2011, among others). In recent years, there have been many attempts to describe brain activity using features derived from deep learning models (Yamins et al., 2014, Güçlü and van Gerven, 2015, and others), which have been useful for understanding both the brain and deep learning models. In other words, like many other studies, our research did not appear suddenly but was conducted by combining new technology and data with a foundation of long-standing knowledge.

Q: So, what is new in this study? 

A: There have been few examples in past research that explicitly combine both visual features derived from images and semantic features described in text for decoding visual content from brain activity. Furthermore, most studies have required fine-tuning of deep learning models, making their application to brain activity data with small sample sizes generally difficult.

This paper demonstrates that by combining visual structural information decoded from activity in the early visual cortex with semantic features decoded from activity in higher-order areas and by directly mapping the decoded information to the internal representations of a latent diffusion model (LDM; Stable Diffusion) without fine-tuning, it is possible to decode (or generate) images from brain activity. 
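
As a rough illustration of the decoding stage described above, the sketch below fits one linear (ridge) decoder per target representation and, at test time, uses brain activity alone. It is a minimal sketch with toy-sized random arrays and placeholder names, not our released code; in the real analysis the inputs are preprocessed NSD responses and the targets are the flattened Stable Diffusion image latents z and caption embeddings c.

```python
# Minimal sketch of the decoding stage (toy data, placeholder names).
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_train, n_test = 100, 10
X_early  = rng.standard_normal((n_train, 300))   # early visual cortex voxels
X_higher = rng.standard_normal((n_train, 300))   # higher (ventral) visual voxels
Z = rng.standard_normal((n_train, 64))           # flattened image latents z
C = rng.standard_normal((n_train, 64))           # flattened caption embeddings c

# One ridge decoder per representation; Stable Diffusion itself is never trained.
dec_z = RidgeCV(alphas=np.logspace(1, 5, 9)).fit(X_early, Z)
dec_c = RidgeCV(alphas=np.logspace(1, 5, 9)).fit(X_higher, C)

# At test time, only brain activity enters the pipeline.
X_early_test  = rng.standard_normal((n_test, 300))
X_higher_test = rng.standard_normal((n_test, 300))
z_hat = dec_z.predict(X_early_test)    # structural information for image generation
c_hat = dec_c.predict(X_higher_test)   # semantic conditioning for the denoising steps
# z_hat and c_hat are then reshaped and passed to the frozen Stable Diffusion
# model to generate the reconstructed images.
```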

Regarding encoding, while previous research has explored the correspondence between a convolutional neural network (CNN) and brain activity, this study is the first to use brain activity to examine the mechanism and dynamics of the LDM, a generative model that has rapidly developed in recent years.

Q: Is this mind-reading?

A: No. The technology presented in this study examines the relationship between visually perceived content and brain activity; it does not read a person's thoughts, so it is not mind-reading.

Q: Is this paper going to be published in a peer-reviewed journal?

A: We submitted the paper to a peer-reviewed international conference, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, and it was accepted after review by anonymous experts. For papers related to AI and machine learning, authors typically prefer to submit to international conferences such as CVPR rather than to journals, which are the more common venue in the life sciences.

Methods

Q: What is LDM or Stable Diffusion?

A: The Latent Diffusion Model (LDM) is a type of mathematical model called a diffusion model, which can generate new samples (e.g., images) by learning the statistical properties of large collections of data (such as huge image datasets). Stable Diffusion (Rombach et al., 2022) is an LDM and is known as a type of AI that generates high-quality images. With the release of Stable Diffusion as an open-source model in August 2022, it is now possible to examine its internal representations, not just to generate images.
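
Because the model and its weights are public, anyone can both generate images and inspect the intermediate representations. The snippet below is only a minimal example of loading the open-source v1.4 model with the widely used diffusers library (assuming that library and the model weights are available); it is not part of our analysis code.

```python
# Minimal example of running the open-source Stable Diffusion v1.4 model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

image = pipe("a photograph of a mountain lake at sunrise").images[0]
image.save("sample.png")
# Because the model is open source, its intermediate representations
# (the image latent z and the text conditioning c) can also be inspected,
# which is what our study relies on.
```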

Q: What is the dataset?

A: In this study, we utilized the Natural Scenes Dataset (NSD). NSD is one of the largest publicly available datasets of its kind: detailed brain activity was recorded using high-field (7T) fMRI while each human subject viewed up to 10,000 images, each multiple times. The images viewed by the subjects come from a machine learning dataset called MS COCO, and each image is annotated with text (not just category labels) that describes its contents. In this study, we modeled the relationship between brain activity and the internal representations of the LDM (Stable Diffusion) using both the images and their associated text. (Note that during the image generation test, decoding was performed using only brain activity, without utilizing the images or text.)
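
For readers curious what the two target representations look like in practice, the sketch below computes, for one hypothetical image/caption pair, the image latent z with the Stable Diffusion autoencoder and the caption embedding c with the CLIP text encoder. The file name and caption are placeholders; this illustrates the kind of features we relate to brain activity and is not our exact preprocessing code.

```python
# Minimal sketch: computing the target features z and c for one image/caption pair.
import torch
from PIL import Image
from torchvision import transforms
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# z: latent representation of the presented image (placeholder file name)
img = Image.open("coco_example.jpg").convert("RGB").resize((512, 512))
x = transforms.ToTensor()(img).unsqueeze(0) * 2 - 1          # scale to [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.mean                        # shape (1, 4, 64, 64)

# c: latent representation of the associated caption (placeholder caption)
tokens = tokenizer("a cat sitting on a bench", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    c = text_encoder(tokens.input_ids).last_hidden_state      # shape (1, 77, 768)

# z and c are the targets that linear models relate to fMRI activity.
```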

Q: Did you finetune or re-train Stable Diffusion?

A: No. We used the published model (v1.4) without finetuning.

Q: What information did you use during the test phase? Did you use text/image for the test reconstruction?

A: When reconstructing images from the test data, we use only brain activity for the analysis; the presented images and their associated text are not used.

Q: Which parts of the brain did you use when subjects looked at an image?

A: The early visual cortex is used for decoding image structures, and the higher visual cortex (ventral cortex) is used for decoding semantic information.

Discussion

Q: Can we use your technique to imagine and dream?

A: It is known that there is a certain similarity between brain activity during perceptual experiences and brain activity during the recall (imagining) or dreaming of those experiences. This property has been used to decode the content of imagination or dreams with a certain level of accuracy. However, signal strength (decoding accuracy) is generally lower for imagination than for perception. The technique used in this study has not yet been applied to brain activity during imagery, and it is currently unclear how accurate it would be.

Q: What information is the semantic decoder decoding? Is it generating images by decoding visual category labels from brain activity?

A: The latent representation C we used is based on captions (text descriptions of the images) processed by the CLIP text encoder, and it contains general and diverse information about the images. In our paper, we confirmed that decoding accuracy decreases when building a semantic decoder using only image category labels (Supplementary Figure B.2). This suggests that decoding via the latent representation C in this study is not based solely on information from simple visual categories.

Q: Is this just a semantic decoding with a fancy image generator?

A: We used a combination of visual structural information Z (decoded from early visual areas) and textual semantic information C (decoded from higher-order areas) to generate images. Our analyses demonstrate that using both Z and C results in higher decoding accuracy than using only C (as shown in Figure 5 and some image examples in Figure 3). This suggests the importance of this combination approach.

Q: What are the ethical/privacy issues that stem from brain decoding?

A: First, decoding technology is unlikely to become practical in the near future. In addition, building a decoding model requires a person to spend many hours inside a large fMRI scanner, and there is still much room for improvement in the accuracy of decoding models. However, both equipment and computational models are improving day by day. Therefore, decoding brain activity could raise serious ethical and privacy concerns in the future. We strongly believe that the brain contains extremely sensitive personal information and should not be subjected to any form of analysis without informed consent.

Q: Could the model be transferred to a novel subject?

A: Because the shape of the brain differs from one individual to another, it is not possible to directly apply a model created for one individual to another. However, several methods have been proposed to compensate for these differences, and it would be possible to use such methods to transfer models across subjects with a certain degree of accuracy.

Q: Could the framework be applied to another modality such as EEG/MEG?

A: Our framework is, in principle, applicable to other modalities such as EEG and MEG. However, the accuracy of such an application is currently unknown, because the temporal/spatial resolution and SNR of those modalities are very different from those of fMRI.

Q: How broadly is your technique applicable?

A: NSD aims to capture human brain activity in response to a diverse set of natural visual stimuli. Therefore, we believe that the technique proposed in this study has a certain degree of generality. However, in the future, we plan to investigate whether it can be applied to a broader range of stimuli, including artificial ones.

Q: Is there any overlap between the data used to train Stable Diffusion and the images presented in fMRI? If there is an overlap, does it affect the quantitative evaluation?

A: After a thorough examination, we discovered that approximately 7% of the images used in the test data were present in LAION-5B (*), which Stable Diffusion utilized for training. Consequently, we excluded the overlapping images and conducted the quantitative evaluation once more. As a result, there was no change in the quantitative assessments (identification accuracy = 74.3 ± 1.7% / 74.3 ± 1.6% [original/new] when using Inception v3. There were no differences in other quantitative measures using CLIP and AlexNet).

To further investigate whether the overlap in the text encoding process might impact the results, we performed the same analysis as in the current study using Stable Diffusion v2.0 trained with OpenCLIP (trained on LAION-5B) instead of CLIP (trained on MS COCO, which is the source of NSD stimuli) as the text encoder. As a result, there were no changes in quantitative evaluations (identification accuracy = 74.3 ± 1.7% / 74.5 ± 2.7% [original/new] when using Inception v3. There were no differences in other quantitative measures using CLIP and AlexNet).

These results indicate that potential image leakage between Stable Diffusion and NSD didn’t affect our conclusions. The method, image list, and results of the investigation of the overlap between NSD and LAION-5B are available at this URL.

(*) Note that among 35 example images shown in the paper, 3 were included in LAION-5B (Figure 3, line 1 [identical to Sup Fig. B4, column 2, line 7 and Sup Fig. B5, column 1, line 4]; Figure 4, line 3; and Sup. Fig. B4, column 2, line 8).

Q: Where can I find Supplementary Material?

A: Supplementary materials are available at the top right of the bioRxiv page (URL). A non-preprint version will be available on IEEE Xplore by the time of the CVPR conference.

Q: Can I analyse NSD myself using your methods?

A: Yes, the code is publicly available.

Q: Do you have a table with specific values for the quantitative evaluation (identification accuracy)?

A: Please download it from here. Note that the identification accuracy is calculated using the average of the PSMs of the five generated images for each corresponding test image (please see the Supplementary Material).
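
For reference, two-way identification accuracy can be computed from such pairwise similarities roughly as in the sketch below, where psm[i, j] is assumed to be the averaged similarity between the reconstructions for test image i and ground-truth image j (higher = more similar). This illustrates the general procedure only; the exact definition we used is given in the Supplementary Material.

```python
# Minimal sketch of two-way identification accuracy from a similarity matrix.
import numpy as np

def identification_accuracy(psm: np.ndarray) -> float:
    """psm[i, j]: averaged similarity between the reconstructions for test
    image i and ground-truth image j (higher means more similar)."""
    n = psm.shape[0]
    wins, comparisons = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            comparisons += 1
            # A reconstruction should match its own image better than a different one.
            if psm[i, i] > psm[i, j]:
                wins += 1
    return wins / comparisons
```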