Accepted at EMNLP 2024 (Main, Long)
Yuko Nakagi* 1,2, Takuya Matsuyama* 1,2,
Naoko Koide-Majima 1,2, Hiroto Q. Yamaguchi 1,2, Rieko Kubo 1,2,
Shinji Nishimoto** 1,2, Yu Takagi** 1,2,3
1. Osaka University, Japan; 2. NICT, Japan; 3. NII, Japan
* Equal first authors, ** Equal last authors
Abstract
In recent studies, researchers have used large language models (LLMs) to explore semantic representations in the brain; however, they have typically assessed different levels of semantic content, such as speech, objects, and stories, separately. In this study, we recorded brain activity using functional magnetic resonance imaging (fMRI) while participants viewed 8.3 hours of dramas and movies. We annotated these stimuli at multiple semantic levels, which enabled us to extract latent representations of LLMs for this content. Our findings demonstrate that LLMs predict human brain activity more accurately than traditional language models, particularly for complex background stories. Furthermore, we identify distinct brain regions associated with different semantic representations, including multi-modal vision-semantic representations, which highlights the importance of modeling multi-level and multimodal semantic representations simultaneously. We will make our fMRI dataset publicly available to facilitate further research on aligning LLMs with human brain function.
Background
LLMs have been used to explore semantic representations in the human brain; however, previous studies have assessed different semantic levels (e.g., speech, objects, stories) separately.
Research question: To what extent does each level of semantic content uniquely explain brain activity, compared to the other levels?
Methods
(a) We collected brain activity data from six participants while they watched 8.3 hours of movies and dramas.
The annotations cover five semantic levels: transcriptions of spoken dialogue (Speech), objects in the scene (Object), background story of the scene (Story), summary of the story (Summary), and information about time and location (TimePlace).
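To make the annotation scheme concrete, the sketch below shows one way a single annotated scene could be represented in Python; the class name, field names, and example values are hypothetical and do not reflect the released dataset's actual schema.

```python
# Hypothetical layout for one annotated scene; field names are illustrative,
# not the actual schema of the released dataset.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneAnnotation:
    onset_sec: float                                   # scene onset in the stimulus
    duration_sec: float                                # scene duration
    speech: str = ""                                   # transcription of spoken dialogue
    objects: List[str] = field(default_factory=list)   # objects in the scene
    story: str = ""                                    # background story of the scene
    summary: str = ""                                  # summary of the story
    time_place: str = ""                               # time and location information

scene = SceneAnnotation(
    onset_sec=120.0,
    duration_sec=30.0,
    speech="(transcribed dialogue)",
    objects=["person", "car", "street"],
    story="(background story of the scene)",
    summary="(summary of the story)",
    time_place="(time and location)",
)
```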
Latent representations were obtained from:
Word2Vec, BERT, GPT-2, OPT, and Llama 2 (for the comparison across semantic levels).
Vicuna-1.5 (Semantic), AST (Audio), CLIP (Vision), and LLaVA-1.5 (Vision-semantic) (for the comparison across modalities).
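As an illustration of how latent features can be extracted, the sketch below mean-pools hidden states from a Hugging Face language model (GPT-2 as a stand-in); the choice of layer and pooling are assumptions for illustration, not necessarily the exact procedure used in the paper.

```python
# Minimal sketch: extract a fixed-length latent feature for one annotation
# from a Hugging Face language model. Layer choice and mean pooling are
# assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "Transcribed dialogue or scene description goes here."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations of one intermediate layer into a
# single feature vector for this annotation.
layer = 6
features = outputs.hidden_states[layer].mean(dim=1).squeeze(0)
print(features.shape)  # torch.Size([768]) for GPT-2 small
```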
(b) We built brain encoding models to predict brain activity from the latent representations of each type of semantic content independently (a minimal sketch follows the figure caption below).
Figure: Overview of our experiment and brain encoding models.
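Below is a minimal sketch of a voxel-wise encoding model using ridge regression on toy data; the hemodynamic-delay handling, cross-validation scheme, and hyperparameters are simplifying assumptions rather than the exact pipeline used in the paper.

```python
# Toy sketch of a voxel-wise ridge-regression encoding model: predict fMRI
# responses Y from stimulus features X, then score each voxel by the Pearson
# correlation between measured and predicted responses on held-out data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trs, n_features, n_voxels = 1000, 768, 500     # toy sizes
X = rng.standard_normal((n_trs, n_features))     # e.g., latent features per fMRI volume
Y = rng.standard_normal((n_trs, n_voxels))       # e.g., preprocessed BOLD responses

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=False)          # keep temporal order

model = RidgeCV(alphas=np.logspace(-2, 5, 8))    # regularization chosen by CV
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Per-voxel prediction accuracy on the held-out set.
accuracy = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                     for v in range(n_voxels)])
print(accuracy.mean())
```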
Main Results
Semantic encoding models predict brain activity across sensory and high-level cognitive areas
Speech, Object, and Story content predict brain activity with higher accuracy than Summary and TimePlace content.
Large models achieve higher prediction performance for high-level background Story content.
The latent representations of different semantic content correspond to spatially distinct brain regions
To determine the extent to which each type of semantic content uniquely accounts for brain activity, we evaluated the unique variance explained by each semantic feature using variance partitioning analysis [la Tour et al., 2022], after constructing a brain encoding model that incorporates all the semantic features (a schematic sketch of this analysis follows the list below).
Speech → Auditory cortex
Object → Visual cortex
Story → Higher visual cortex, Precuneus, Frontal cortex
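The sketch below illustrates the variance partitioning logic on toy inputs: the unique variance of one feature set is the explained variance of the full model minus that of a model fit without it. The plain ridge regression, fixed regularization, and R² scoring here are simplifications, not the cited method [la Tour et al., 2022] itself.

```python
# Schematic variance partitioning: unique variance of feature set A equals
# R^2(all feature sets) minus R^2(all feature sets except A), per voxel.
# Plain ridge with a fixed alpha is a simplifying assumption.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def fit_and_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit ridge on the given features and return per-voxel R^2 on test data."""
    model = Ridge(alpha=alpha).fit(X_train, Y_train)
    return r2_score(Y_test, model.predict(X_test), multioutput="raw_values")

def unique_variance(feature_sets, Y_train, Y_test, target):
    """Unique variance of `target` over the other feature sets.

    `feature_sets` maps a name (e.g. "Speech") to a (train, test) pair of
    feature matrices with matching numbers of time points.
    """
    full_train = np.hstack([tr for tr, _ in feature_sets.values()])
    full_test = np.hstack([te for _, te in feature_sets.values()])
    reduced = {k: v for k, v in feature_sets.items() if k != target}
    red_train = np.hstack([tr for tr, _ in reduced.values()])
    red_test = np.hstack([te for _, te in reduced.values()])

    r2_full = fit_and_score(full_train, Y_train, full_test, Y_test)
    r2_reduced = fit_and_score(red_train, Y_train, red_test, Y_test)
    return r2_full - r2_reduced   # per-voxel unique variance of `target`
```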
Multimodal features predict brain activity more accurately, and more uniquely, than unimodal features
Given the multimodal nature of the stimuli in this study, we quantitatively compared the prediction performance of the visual, auditory, and semantic modalities, using not only unimodal but also multimodal models, and analyzed which modalities best explain brain activity and which brain regions correspond to each modality.
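As a toy illustration of this comparison, the sketch below labels each voxel by the modality whose encoding model predicts it best; the random accuracy maps and the simple argmax rule stand in for the actual per-voxel accuracies and statistical testing.

```python
# Toy per-voxel modality comparison: given prediction-accuracy maps from
# encoding models fit on each modality's features, label every voxel by its
# best-predicting modality. Random values stand in for real accuracies.
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 500                                   # toy size
accuracies = {                                   # per-voxel prediction accuracy
    "Audio (AST)": rng.random(n_voxels),
    "Vision (CLIP)": rng.random(n_voxels),
    "Semantic (Vicuna-1.5)": rng.random(n_voxels),
    "Vision-semantic (LLaVA-1.5)": rng.random(n_voxels),
}

names = list(accuracies)
stacked = np.stack([accuracies[n] for n in names])   # (n_modalities, n_voxels)
best_modality = np.array(names)[stacked.argmax(axis=0)]

for name in names:
    print(name, int((best_modality == name).sum()), "voxels")
```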