VIP is an inference-time dataset designed to investigate the multi-hop, multi-frame video reasoning abilities of LLMs through a video chain-of-thought over keyframes. The dataset contains keyframes drawn from a wide range of realistic video domains, each paired with two textual descriptions: an unstructured dense caption and a structured FAMOuS description.
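To make the per-keyframe annotations concrete, here is a minimal sketch of how one record could be represented in Python. The field and class names are assumptions for exposition, not the released dataset schema; the FAMOuS categories follow the paper's structured description format.

```python
from dataclasses import dataclass

@dataclass
class FamousDescription:
    """Structured FAMOuS description of a keyframe (names assumed for illustration)."""
    focus: str     # main subject of the frame
    action: str    # what the subject is doing
    mood: str      # overall tone or atmosphere
    objects: str   # salient objects in the scene
    setting: str   # location / environment

@dataclass
class Keyframe:
    """One VIP keyframe with its two textual descriptions (hypothetical layout)."""
    frame_id: int
    dense_caption: str            # unstructured dense caption
    famous: FamousDescription     # structured FAMOuS description
```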
Distribution of VIP's real-world video domains
1. Video Infilling: Given the descriptions of n previous and n subsequent keyframes, predict the descriptions of the p masked keyframes in between
2. Video Prediction: Given the descriptions of n previous keyframes, predict the descriptions of the p next keyframes (a prompt-construction sketch for both tasks follows this list)
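The sketch below shows one way the two tasks could be posed to an LLM when each keyframe is represented by its textual description. The prompt wording and function names are illustrative assumptions, not the exact prompts used in the paper.

```python
# Hypothetical prompt construction for the VIP infilling and prediction tasks.

def infilling_prompt(prev_descs: list[str], next_descs: list[str], p: int) -> str:
    """Video Infilling: predict p masked keyframes between the given context frames."""
    prev_block = "\n".join(f"Previous frame {i + 1}: {d}" for i, d in enumerate(prev_descs))
    next_block = "\n".join(f"Next frame {i + 1}: {d}" for i, d in enumerate(next_descs))
    return (
        f"{prev_block}\n{next_block}\n"
        f"Describe the {p} missing keyframes that occur between the previous and next frames."
    )

def prediction_prompt(prev_descs: list[str], p: int) -> str:
    """Video Prediction: predict the p keyframes that follow the given context frames."""
    prev_block = "\n".join(f"Frame {i + 1}: {d}" for i, d in enumerate(prev_descs))
    return (
        f"{prev_block}\n"
        f"Describe the next {p} keyframes that most plausibly follow."
    )

# Example: Infilling-1 (one previous and one next context frame), three masked frames.
print(infilling_prompt(["A chef chops onions."], ["The onions sizzle in a pan."], p=3))
```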
Data Collection Pipeline
GPT-4's output for VIP's prediction task
Scores on the Infilling and Prediction tasks with respect to the number of context frames (given by the number in the task name; i.e., Infilling-1 uses 1 context frame). For all tasks, the model outputs three keyframes, and scores are averaged over them. Best results are underlined.
Scores for the Prediction-3 task, including results for each category of the structured FAMOuS descriptions.
Vaishnavi Himakunthala*, Andy Ouyang*, Daniel Rose*, Ryan He*, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
*joint first-author
Please contact vaishnavi@ucsb.edu, andyouyang@ucsb.edu, danielrose@ucsb.edu, or ryanhe@ucsb.edu for any questions!
If you find our work helpful, please cite it as follows:
@inproceedings{himakunthala2023lets,
    title={Let's Think Frame by Frame with {VIP}: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought},
    author={Vaishnavi Himakunthala and Andy Ouyang and Daniel Philip Rose and Ryan He and Alex Mei and Yujie Lu and Chinmay Sonar and Michael Saxon and William Yang Wang},
    booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
    year={2023},
    url={https://openreview.net/forum?id=y6Ej5BZkrR}
}