Neural Representations of Dynamic Visual Stimuli

Sample Animated Videos

Encoding Models Voxel-wise Prediction Performance on Inflated Cortex

Here we show the voxel-wise fMRI prediction performance of the remaining visual encoding models, in alphabetical order. Performance is quantified as the Pearson correlation (r) between measured and predicted responses. The HCP-MMP parcellation map on an inflated cortex is included for reference.
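As an illustration of the metric, the following is a minimal NumPy sketch of a per-voxel Pearson correlation. The function name `voxelwise_pearson`, the `(n_samples, n_voxels)` array layout, and the choice to report r = 0 for zero-variance voxels are assumptions made for this example, not details taken from the paper's pipeline.

```python
import numpy as np


def voxelwise_pearson(measured, predicted):
    """Return Pearson r for each voxel.

    Both inputs are assumed to have shape (n_samples, n_voxels), with
    one row per stimulus or time point and one column per voxel.
    """
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if measured.shape != predicted.shape:
        raise ValueError("measured and predicted must have the same shape")

    # Center each voxel's responses across samples.
    m = measured - measured.mean(axis=0)
    p = predicted - predicted.mean(axis=0)

    # Covariance term and product of standard deviations (unnormalized).
    num = (m * p).sum(axis=0)
    den = np.sqrt((m ** 2).sum(axis=0) * (p ** 2).sum(axis=0))

    # Assumption: voxels with zero variance are assigned r = 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)
```

Calling `voxelwise_pearson(measured, predicted)` returns one r value per voxel, which is the kind of quantity that can then be projected onto a cortical surface.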

Note that all of the following encoding-accuracy maps on the inflated cortex use the same scale for ease of comparison.

AF (averaged frames) models use model features averaged over frames spanning the video.
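The averaging step can be sketched as follows. This is only an illustration: the helper name `average_frame_features` and the `(n_frames, n_features)` layout are hypothetical, and the source does not state how frames are sampled or which model layer the features come from.

```python
import numpy as np


def average_frame_features(frame_features):
    """Average per-frame features into one vector for the whole video.

    frame_features is assumed to have shape (n_frames, n_features),
    one row per sampled frame.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("expected a non-empty (n_frames, n_features) array")

    # Mean over the frame axis gives a single feature vector per video.
    return frame_features.mean(axis=0)
```

The resulting vector has the same dimensionality as a single frame's features, so the same encoding model can be fit on it without modification.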

CLIP

CLIP AF (averaged frames)

CLIP ConvNeXt (Best Performing Image Encoder)

CLIP ConvNeXt AF (averaged frames)

DINOv1

DINOv2

Hiera Base Plus

Hiera Huge

R3M

R3M AF (averaged frames)

ResNet-50

VC-1 (Best Performing Embodied AI Model)

VC-1 AF (averaged frames)

VideoMAE

VideoMAE Large (Best Performing Model)

VIP

VIP AF (averaged frames)

XCLIP

HCP-MMP Parcellation Map

Source: Rolls et al., "The human language effective connectome," NeuroImage, 2022.