This piece, by Onno Berkan, was published on 10/01/25. The original paper, by d'Ascoli et al., was posted to the CS section of arXiv on 06/29/25.
This Meta (Facebook) study aimed to predict brain activity from sensory input. They had participants watch TV in an fMRI scanner and learned to map the scans onto what they were watching. The resulting model then predicts neural activity from the TV show alone, across multiple modalities – it predicts responses to the audio, the video, and the actual script (the story) simultaneously. Their model won the 2025 Algonauts competition.
Predicting brain activity in response to an input is one of the ultimate goals of neuroscience: if you can predict what the brain will do, you are in a much better position to understand how it does it. Prediction also opens the door to brain simulation, allowing researchers to test theories, and it can inspire designs for better, more efficient AI systems.
Most fMRI brain-activity predictors so far have been quite limited: they either use a single modality (e.g., just visuals, or just audio) or model a single participant. This is partly due to the difficulty of acquiring enough data, and partly because capturing the brain's non-linear responses is hard – doing so is one of this paper's key strengths.
fMRI data is challenging to work with. When a brain region becomes active, blood flow is redirected towards it, and this change in blood flow is what fMRI actually measures. So fMRI detects activity only indirectly: it can pinpoint quite precisely where the activity occurred, but it struggles to determine exactly when it happened. Blood flow is not diverted immediately, so there is a delay of several seconds between a pattern of brain activity and the moment fMRI can detect it. This is a problem if you want to show someone a TV show and line up the corresponding brain activity, because the time delay makes the two hard to relate.
To add to the challenge, this time delay is not constant – the time it takes fMRI to pick up activity in one region may differ from another, and it also varies from person to person. The researchers tackle this problem with transformers, which learn to adapt to the latencies of each modality and each participant, rather than relying on a single static formula for all of them.
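To make the delay concrete, here's a tiny toy sketch (not from the paper – the canonical double-gamma response shape and the 1.49 s sampling rate are my own illustrative assumptions). A brief burst of "neural activity" at time zero only shows up in the simulated fMRI signal several seconds later:

```python
# Toy illustration of the hemodynamic delay (illustrative only, not the paper's model).
# We convolve a brief burst of neural activity with a canonical double-gamma
# hemodynamic response function (HRF); the measured BOLD signal peaks seconds later.
import numpy as np
from scipy.stats import gamma

TR = 1.49                      # sampling interval in seconds (illustrative value)
t = np.arange(0, 30, TR)

# Canonical double-gamma HRF: positive peak around 5 s, small undershoot later
hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 12)
hrf /= hrf.sum()

# Toy neural activity: a single burst at t = 0
neural = np.zeros(len(t))
neural[0] = 1.0

bold = np.convolve(neural, hrf)[: len(t)]
print("Neural burst at t = 0.0 s")
print(f"Simulated BOLD peak at t = {t[bold.argmax()]:.1f} s")  # several seconds later
```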
The researchers used a dataset in which six people watched the first six seasons of Friends, as well as four full movies, all while in an fMRI scanner. This gave them an incredible amount of data to work with (over 80 hours of brain scans!).
Transformers appear again here, this time to embed the multi-modal data into a shared space. Embeddings for the audio, the video, and the text (the script) were generated independently using pre-trained transformers and then combined into a single shared embedding space. Don't worry if you got lost – all this means is that the researchers found a way to represent what the participants were watching in a form an AI can process, with each small chunk capturing everything the viewer perceives at that moment. They split the 80+ hours of data into half-second chunks and fed those into the model.
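If you prefer to see the idea in code, here's a minimal sketch – emphatically not the authors' implementation. The embedding sizes, the simple sum-based fusion, and the random tensors standing in for the pre-trained encoder outputs are all illustrative assumptions; the point is just "one embedding per modality per half-second chunk, projected into one shared space":

```python
# Minimal sketch of per-modality embeddings fused into a shared space
# (illustrative assumptions throughout; not the paper's architecture).
import torch
import torch.nn as nn

D_VIDEO, D_AUDIO, D_TEXT = 1024, 768, 2048   # illustrative embedding sizes
D_SHARED = 512                               # assumed shared-space dimensionality

class SharedSpaceFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_video = nn.Linear(D_VIDEO, D_SHARED)
        self.proj_audio = nn.Linear(D_AUDIO, D_SHARED)
        self.proj_text = nn.Linear(D_TEXT, D_SHARED)

    def forward(self, video_emb, audio_emb, text_emb):
        # Each input has shape (n_chunks, D_modality): one row per ~0.5 s chunk.
        return (
            self.proj_video(video_emb)
            + self.proj_audio(audio_emb)
            + self.proj_text(text_emb)
        )  # (n_chunks, D_SHARED), ready for a downstream brain-encoding model

# Random tensors stand in for the outputs of the pre-trained encoders:
n_chunks = 100   # e.g., 50 seconds of an episode in 0.5 s chunks
fusion = SharedSpaceFusion()
shared = fusion(
    torch.randn(n_chunks, D_VIDEO),
    torch.randn(n_chunks, D_AUDIO),
    torch.randn(n_chunks, D_TEXT),
)
print(shared.shape)   # torch.Size([100, 512])
```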
Thus, the training began. I won't get into the specifics, but I recommend reading the paper if you're interested. It's not very long.
Using this system, the researchers built TRIBE (TRi-Modal Brain Encoder), which was able to predict neural responses to those half-second chunks with remarkable accuracy. They entered this year's Algonauts competition with it and won first place, which is a big deal (perhaps not unexpected from the Facebook team, but still…).
When experimenting with the model, the researchers observed that it performed better when fed multi-modal data. Feeding just visuals or just audio led to worse performance than when both were combined. This illustrates the inadequacy of relying solely on one modality to explain brain activity.
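For the curious, here's a hedged sketch of how a comparison like that can be scored. Encoding models are commonly evaluated with the Pearson correlation between predicted and measured fMRI time courses, averaged over brain regions; the numbers below are random stand-ins, not the paper's results:

```python
# Hedged sketch of scoring a modality ablation (not the paper's exact pipeline):
# encoding accuracy as the mean Pearson correlation between predicted and
# measured fMRI time courses across brain parcels.
import numpy as np

def encoding_score(predicted, measured):
    """Mean Pearson correlation across parcels.

    predicted, measured: arrays of shape (n_timepoints, n_parcels).
    """
    p = (predicted - predicted.mean(0)) / predicted.std(0)
    m = (measured - measured.mean(0)) / measured.std(0)
    return (p * m).mean(0).mean()

# Toy comparison: pretend we have measured data plus predictions from the full
# tri-modal model and from an audio-only ablation (random stand-ins here).
rng = np.random.default_rng(0)
measured = rng.standard_normal((600, 1000))          # 600 timepoints, 1000 parcels
full_pred = measured + 0.8 * rng.standard_normal(measured.shape)
audio_only_pred = measured + 1.5 * rng.standard_normal(measured.shape)

print(f"tri-modal:  r = {encoding_score(full_pred, measured):.2f}")
print(f"audio only: r = {encoding_score(audio_only_pred, measured):.2f}")
```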
This work represents a significant step toward comprehensive whole-brain encoding. We now know that using more modalities lets us model brain activity more effectively, but there's still a long way to go. The project is also notable for its use of data from multiple subjects and for TRIBE's ability to generalize across them. We're one step closer to simulating a brain… Exciting stuff!
Want to submit a piece? Or trying to write a piece and struggling? Check out the guides here!
Thank you for reading. Reminder: Byte Sized is open to everyone! Feel free to submit your piece. Please read the guides first though.
Please send all submissions to berkan@usc.edu as a Word doc, with the subject line “Byte Sized Submission”. Thank you!