Reanimating Images using Neural Representations of Dynamic Stimuli

CVPR 2025, Oral

Sample Animated Videos

For each example, we show the ground truth video, the reanimated video using the flow predicted from the ground truth initial frame, and the reanimated video using the flow predicted from the initial frame generated by MindVideo. The MindVideo initial frame is the diffusion image predicted by MindVideo [1] from fMRI data. As such, this frame often differs substantially in content from the ground truth, although the objects in each are in similar locations. Nevertheless, the motion predicted by our model remains consistent with the ground truth motion: in the first video, for example, the jellyfish retracts backwards at the end of the clip. The same backwards motion can be seen in the reanimated videos with both the ground truth initial frame and the MindVideo-generated initial frame (which happens to be a boat). This establishes that the motion visualized using DragNUWA [2] is not derived solely from the diffusion model but rather incorporates our predicted motion.

Ground Truth

Ground Truth + Ours

MindVideo + Ours

Encoding Models Voxel-wise Prediction Performance on Inflated Cortex

Here we show the voxel-wise fMRI prediction performance, quantified as the Pearson correlation (r) between measured and predicted responses, for the remaining visual encoding models, listed in alphabetical order. The HCP-MMP Parcellation Map on an inflated cortex is included for reference.
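For concreteness, the prediction-accuracy metric can be computed per voxel as the Pearson correlation between the measured and model-predicted response time series. A minimal sketch (the function name and array shapes are illustrative, not taken from the paper's codebase):

```python
import numpy as np

def voxelwise_pearson(measured, predicted):
    """Pearson r between measured and predicted responses, per voxel.

    measured, predicted: arrays of shape (n_timepoints, n_voxels).
    Returns an array of shape (n_voxels,) with one r value per voxel.
    """
    m = measured - measured.mean(axis=0)
    p = predicted - predicted.mean(axis=0)
    num = (m * p).sum(axis=0)
    denom = np.sqrt((m ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    return num / denom

# toy check: a prediction identical to the measurement gives r = 1 everywhere
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
r = voxelwise_pearson(x, x)
```

These per-voxel r values are what the inflated-cortex maps below visualize, all on a shared color scale.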

Note that all following plots of encoding accuracy on inflated cortical maps are on the same scale for ease of comparison. 

AF (averaged frames) models consist of model features averaged over frames spanning the video.
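In other words, an AF variant turns a per-frame image encoder into a video-level encoder by mean-pooling its features over time. A minimal sketch of this pooling step (the function name is illustrative; the actual frame-sampling scheme is not specified here):

```python
import numpy as np

def averaged_frame_features(frame_features):
    """Collapse per-frame encoder features into one video-level feature
    vector by averaging over the temporal (frame) axis.

    frame_features: array of shape (n_frames, feature_dim), e.g. one
    image-encoder embedding per frame sampled across the video.
    """
    return frame_features.mean(axis=0)

# toy example: 3 frames with constant features 0, 1, 2 average to 1
feats = np.stack([np.full(4, t, dtype=float) for t in range(3)])
af = averaged_frame_features(feats)
```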

CLIP

CLIP AF (averaged frames)

CLIP ConvNeXt (Best Performing Image Encoder)

CLIP ConvNeXt AF (averaged frames)

DINOv1

DINOv2

Hiera Base Plus

Hiera Huge

R3M

R3M AF (averaged frames)

ResNet-50

VC-1 (Best Performing Embodied AI Model)

VC-1 AF (averaged frames)

VideoMAE

VideoMAE Large (Best Performing Model)

VIP

VIP AF (averaged frames)

XCLIP

HCP-MMP Parcellation Map

Source: Rolls et al., "The human language effective connectome," NeuroImage, 2022.