VIP is an inference-time dataset designed to investigate the multi-hop, multi-frame video reasoning abilities of LLMs through a video chain-of-thought over keyframes. The dataset contains keyframes drawn from a wide range of realistic video domains, each paired with two textual descriptions: an unstructured dense caption and a structured FAMOuS description.
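To make the per-keyframe annotations concrete, here is a minimal sketch of how one record could be represented in Python. The field and class names are assumptions for exposition, not the released dataset schema; the FAMOuS categories follow the paper's structured description format.

```python
from dataclasses import dataclass

@dataclass
class FamousDescription:
    """Structured FAMOuS description of a keyframe (names assumed for illustration)."""
    focus: str     # main subject of the frame
    action: str    # what the subject is doing
    mood: str      # overall tone or atmosphere
    objects: str   # salient objects in the scene
    setting: str   # location / environment

@dataclass
class Keyframe:
    """One VIP keyframe with its two textual descriptions (hypothetical layout)."""
    frame_id: int
    dense_caption: str            # unstructured dense caption
    famous: FamousDescription     # structured FAMOuS description
```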
Distribution of VIP's real-world video domains
1. Video Infilling: Given the descriptions of n previous and n subsequent keyframes, predict the descriptions of the p masked keyframes in between
2. Video Prediction: Given the descriptions of n previous keyframes, predict the descriptions of the p next keyframes (a prompt-construction sketch for both tasks follows this list)
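The sketch below shows one way the two tasks could be posed to an LLM when each keyframe is represented by its textual description. The prompt wording and function names are illustrative assumptions, not the exact prompts used in the paper.

```python
# Hypothetical prompt construction for the VIP infilling and prediction tasks.

def infilling_prompt(prev_descs: list[str], next_descs: list[str], p: int) -> str:
    """Video Infilling: predict p masked keyframes between the given context frames."""
    prev_block = "\n".join(f"Previous frame {i + 1}: {d}" for i, d in enumerate(prev_descs))
    next_block = "\n".join(f"Next frame {i + 1}: {d}" for i, d in enumerate(next_descs))
    return (
        f"{prev_block}\n{next_block}\n"
        f"Describe the {p} missing keyframes that occur between the previous and next frames."
    )

def prediction_prompt(prev_descs: list[str], p: int) -> str:
    """Video Prediction: predict the p keyframes that follow the given context frames."""
    prev_block = "\n".join(f"Frame {i + 1}: {d}" for i, d in enumerate(prev_descs))
    return (
        f"{prev_block}\n"
        f"Describe the next {p} keyframes that most plausibly follow."
    )

# Example: Infilling-1 (one previous and one next context frame), three masked frames.
print(infilling_prompt(["A chef chops onions."], ["The onions sizzle in a pan."], p=3))
```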
Data Collection Pipeline
GPT-4's output for VIP's prediction task
Scores on the Infilling and Prediction tasks with respect to the number of context frames (given by the number in the task name; i.e., Infilling-1 uses 1 context frame). For all tasks, the model outputs three keyframes, and scores are averaged over them. Best results are underlined.
Scores for the Prediction-3 task, including results for each category of the structured FAMOuS descriptions.
Vaishnavi Himakunthala*, Andy Ouyang*, Daniel Rose*, Ryan He*, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
*joint first-author
Please contact vaishnavi@ucsb.edu, andyouyang@ucsb.edu, danielrose@ucsb.edu, or ryanhe@ucsb.edu for any questions!
If you find our work helpful, please cite it as follows:
@inproceedings{himakunthala2023lets,
    title={Let's Think Frame by Frame with {VIP}: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought},
    author={Vaishnavi Himakunthala and Andy Ouyang and Daniel Philip Rose and Ryan He and Alex Mei and Yujie Lu and Chinmay Sonar and Michael Saxon and William Yang Wang},
    booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
    year={2023},
    url={https://openreview.net/forum?id=y6Ej5BZkrR}
}