Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
VIP: The Video Infilling and Prediction Dataset
An inference-time dataset designed to probe the multi-hop, multi-frame video reasoning abilities of LLMs through a video chain-of-thought composed of video keyframes. VIP consists of keyframes drawn from a wide range of realistic video domains, with two textual descriptions accompanying each keyframe.
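To illustrate the two description styles, a single keyframe entry might look like the hypothetical record below. The field names and layout are ours for illustration, not necessarily the released schema; the structured description uses the FAMOuS categories (Focus, Action, Mood, Objects, Setting):

```python
# Hypothetical shape of one VIP keyframe entry (illustrative field names,
# not the released schema). Each keyframe carries two textual descriptions:
# an unstructured dense caption and a structured FAMOuS description.
keyframe = {
    "frame_id": 3,
    "unstructured_description": "A chef slices an onion on a wooden cutting board.",
    "famous_description": {
        "focus": "a chef",
        "action": "slicing an onion",
        "mood": "calm and focused",
        "objects": ["knife", "onion", "cutting board"],
        "setting": "a home kitchen",
    },
}
```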
VIP vs. Other Datasets
VIP introduces:
A new way of representing video keyframes
Wide variety of videos from realistic video-domains
Distribution of VIP's real-world video domains
Two new tasks for video reasoning (a minimal code sketch of both setups follows this list)
1. Video Infilling: Given the descriptions of the n keyframes before and the n keyframes after a masked span, predict the descriptions of the p masked keyframes
2. Video Prediction: Given the descriptions of the n previous keyframes, predict the descriptions of the p next keyframes
An automated data creation process that reduces the cost and redundancy of handling videos
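To make the two task formats concrete, here is a minimal Python sketch of how an infilling and a prediction example can be carved out of an ordered list of keyframe descriptions. This is not the released evaluation code; the function names and the returned dict layout are hypothetical:

```python
# Illustrative only: how the two VIP tasks slice an ordered list of
# keyframe descriptions into context and targets. Function names and the
# dict layout are hypothetical, not the official data-loading code.

def make_infilling_example(descriptions, n, p, start):
    """Video Infilling: given the n descriptions before and the n after a
    masked span of length p, predict the p masked descriptions.
    Assumes n <= start and start + p + n <= len(descriptions)."""
    before = descriptions[start - n:start]
    masked = descriptions[start:start + p]              # ground-truth targets
    after = descriptions[start + p:start + p + n]
    return {"context": before + ["<MASKED>"] * p + after, "targets": masked}

def make_prediction_example(descriptions, n, p):
    """Video Prediction: given the first n descriptions, predict the next p."""
    return {"context": descriptions[:n], "targets": descriptions[n:n + p]}
```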
Data Collection Pipeline
GPT and Vicuna models show significant room for improvement on VIP Tasks
GPT-4's output for VIP's prediction task
Scores on the Infilling and Prediction tasks with respect to the number of context frames (given by the number in the task name; i.e., Infilling-1 uses 1 context frame). For all tasks, the output consists of three keyframes, and the reported scores are averaged across them. Best results are underlined.
Scores for the Prediction-3 task, including results for each category of the structured FAMOuS description.
Contributors
Vaishnavi Himakunthala*, Andy Ouyang*, Daniel Rose*, Ryan He*, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
*joint first-author
Contact
Please contact vaishnavi@ucsb.edu, andyouyang@ucsb.edu, danielrose@ucsb.edu, or ryanhe@ucsb.edu with any questions!
Citation
If you find our work helpful, please cite it as follows:
@inproceedings{himakunthala2023lets,
  title     = {Let's Think Frame by Frame with {VIP}: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought},
  author    = {Vaishnavi Himakunthala and Andy Ouyang and Daniel Philip Rose and Ryan He and Alex Mei and Yujie Lu and Chinmay Sonar and Michael Saxon and William Yang Wang},
  booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
  year      = {2023},
  url       = {https://openreview.net/forum?id=y6Ej5BZkrR}
}