Reanimating Images using Neural Representations of Dynamic Stimuli
CVPR 2025, Oral
Sample Animated Videos
For each example, we show the ground truth video, the reanimated video produced using the flow predicted from the ground truth initial frame, and the reanimated video produced using the flow predicted from the initial frame generated by MindVideo. The MindVideo initial frame is the diffusion image predicted by MindVideo [1] from fMRI data. As such, this frame often differs substantially in content from the ground truth, although the objects in each are in similar locations. Nevertheless, the motion predicted by our model remains consistent with the ground truth motion; for example, in the first video the jellyfish retracts backwards at the end of the clip. The same backwards motion can be seen in the reanimated videos with both the ground truth initial frame and the MindVideo-generated initial frame (which happens to be a boat). This establishes that the motion visualized using DragNUWA [2] is not derived solely from the diffusion model but rather incorporates our predicted motion.
Ground Truth
Ground Truth + Ours
MindVideo + Ours
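To make the pipeline concrete, here is a minimal, hypothetical sketch of the reanimation step: motion is decoded from brain activity and then used, together with an initial frame (ground truth or MindVideo-generated), to condition the video generator. The function names, shapes, and placeholder bodies are illustrative assumptions, not the actual interfaces of our flow decoder or of DragNUWA.

```python
import numpy as np

def decode_flow(fmri_response, n_frames, height, width):
    """Placeholder for our flow decoder: maps voxel responses to a dense
    (T-1, H, W, 2) motion field. The real model is learned, not random."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames - 1, height, width, 2)).astype(np.float32)

def dragnuwa_animate(image, flow):
    """Placeholder for DragNUWA: generates a video from an initial frame
    conditioned on the supplied motion. Here it merely repeats the frame."""
    frames = [image] + [image.copy() for _ in range(flow.shape[0])]
    return np.stack(frames)

# The same decoded motion can drive either initial frame, which is how the
# "Ground Truth + Ours" and "MindVideo + Ours" videos are produced.
initial_frame = np.zeros((256, 256, 3), dtype=np.float32)  # GT or MindVideo frame
fmri = np.zeros(10_000, dtype=np.float32)                  # voxel responses for the clip
flow = decode_flow(fmri, n_frames=16, height=256, width=256)
video = dragnuwa_animate(initial_frame, flow)              # (16, 256, 256, 3)
```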
Encoding Models' Voxel-wise Prediction Performance on the Inflated Cortex
Here we show the voxel-wise fMRI prediction performance, quantified as the Pearson correlation (r) between measured and predicted responses, for the remaining visual encoding models in alphabetical order. We include the HCP-MMP Parcellation Map on an inflated cortex for reference.
Note that all following plots of encoding accuracy on inflated cortical maps are on the same scale for ease of comparison.
AF (averaged frames) models use model features averaged over the frames spanning each video.
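As a concrete illustration, the sketch below shows how voxel-wise encoding accuracy of this kind is typically computed: model features (averaged over frames for the AF variants) are regressed onto voxel responses, and accuracy is the Pearson r between measured and predicted responses on held-out clips. The array shapes, the ridge regression choice, and the random toy data are assumptions for illustration, not a verbatim description of our pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

def frame_average(x_frames):
    """AF variant: average model features over the frames of each video."""
    return x_frames.mean(axis=1)

def voxelwise_pearson(y_true, y_pred):
    """Pearson r between measured and predicted responses, computed per voxel."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0)) + 1e-8
    return num / den

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
X_frames = rng.standard_normal((200, 16, 512))   # clips x frames x features
Y = rng.standard_normal((200, 1000))             # clips x voxels

X = frame_average(X_frames)                       # AF model features
model = Ridge(alpha=1.0).fit(X[:150], Y[:150])    # linear encoding model on training clips
r = voxelwise_pearson(Y[150:], model.predict(X[150:]))  # held-out voxel-wise accuracy
```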
CLIP
CLIP AF (averaged frames)
CLIP ConvNeXt (Best Performing Image Encoder)
CLIP ConvNeXt AF (averaged frames)
DINOv1
DINOv2
Hiera Base Plus
Hiera Huge
R3M
R3M AF (averaged frames)
ResNet-50
VC-1 (Best Performing Embodied AI Model)
VC-1 AF (averaged frames)
VideoMAE
VideoMAE Large (Best Performing Model)
VIP
VIP AF (averaged frames)
XCLIP
HCP-MMP Parcellation Map
Source: Rolls et al., "The human language effective connectome," NeuroImage, 2022.