Track 1: Long-Term Video Question Answering


MovieChat Datasets & Annotation Overview

MovieChat-1K is a new benchmark for long-video understanding. It contains 1K high-quality video clips sourced from YouTube videos, movies, and TV series, together with 14K manual annotations.

MovieChat-1K 

{
  "info": {"num_frame": 11520, "fps": 24, "url": "", "h": 432, "video_path": "10.mp4", "w": 1024, "class": "movie"},
  "global": [{"question": " ", "answer": " "}, ...],
  "breakpoint": [{"question": " ", "answer": " ", "time": }, ...]
}

Note that num_frame gives the actual number of frames in the test video.
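The annotation layout above can be read with a few lines of Python. `load_annotation` below is an illustrative helper, not part of an official toolkit:

```python
import json

def load_annotation(path):
    # Read one MovieChat-1K annotation file (format shown above) and
    # return its three parts: video metadata ("info"), global QA pairs,
    # and breakpoint QA pairs. Each breakpoint QA carries a "time"
    # field locating the question at a specific point in the clip.
    with open(path) as f:
        ann = json.load(f)
    return ann["info"], ann["global"], ann["breakpoint"]
```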

Evaluation Protocol

{
  "You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
  "Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
  "------"
  "##INSTRUCTIONS: "
  "- Focus on the meaningful match between the predicted answer and the correct answer.\n"
  "- Consider synonyms or paraphrases as valid matches.\n"
  "- Evaluate the correctness of the prediction compared to the answer."
  "Please evaluate the following video-based question-answer pair:\n\n"
  f"Question: {question}\n"
  f"Correct Answer: {answer}\n"
  f"Predicted Answer: {pred}\n\n"
  "Provide your evaluation only as a yes/no and a score, where the score is an integer between 0 and 5, with 5 indicating the highest meaningful match. "
  "Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where the value of 'pred' is the string 'yes' or 'no' and the value of 'score' is an INTEGER, not a STRING."
  "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
  "For example, your response should look like this: {'pred': 'yes', 'score': 4}."
}
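Since the judge is instructed to reply with only a Python dictionary string, its responses can be parsed and aggregated mechanically. The sketch below is illustrative; the helper names and the aggregation are our own, not the official scoring code:

```python
import ast

def parse_eval_response(text):
    # The judge returns only a Python dict string such as
    # "{'pred': 'yes', 'score': 4}". ast.literal_eval parses that
    # safely without executing arbitrary code.
    result = ast.literal_eval(text.strip())
    return result["pred"].strip().lower() == "yes", float(result["score"])

def aggregate(responses):
    # Accuracy is the fraction of "yes" verdicts; the score is the
    # mean of the judge's 0-5 ratings.
    parsed = [parse_eval_response(r) for r in responses]
    accuracy = sum(ok for ok, _ in parsed) / len(parsed)
    avg_score = sum(score for _, score in parsed) / len(parsed)
    return accuracy, avg_score
```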

Baselines

Table: Quantitative evaluation for long video question answering on the MovieChat-1K test set, averaged over GPT-3.5, Claude, and human blind ratings. The best result is highlighted in bold, and the second best is underlined.

Baselines (Track 1)

Table: Quantitative evaluation for long video question answering on the MovieChat-1K test set with Gemini-Pro. We use Gemini-Pro for LLM-assisted evaluation in this challenge.

Registration

You need to register on CodaLab for the track you want to participate in, as follows; we will verify your information within 1 day.

Submission Format

To submit your results to the leaderboard, you must construct a submission zip file containing one JSON file covering all test data.

The JSON file is a dictionary keyed by video file name; each value contains the caption and the answered global and breakpoint question lists. For example,

{
  "test-1.mp4": {
    "caption": "...",
    "global": [{"question": " ", "answer": "Your Answer"}, ...],
    "breakpoint": [{"question": " ", "answer": "Your Answer", "time": }, ...]
  },
  "test-2.mp4": {
    "caption": "...",
    "global": [{"question": " ", "answer": "Your Answer"}, ...],
    "breakpoint": [{"question": " ", "answer": "Your Answer", "time": }, ...]
  },
  ...
}
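Packing the results dictionary into the required zip is straightforward. In this sketch, the inner file name "submission.json" is an assumption; check the CodaLab page for the exact name its scoring program expects:

```python
import json
import zipfile

def write_submission(results, zip_path, json_name="submission.json"):
    # Pack the results dictionary (keyed by video file name, matching
    # the example above) into a zip containing a single JSON file.
    # NOTE: json_name is an assumed default, not confirmed by the
    # challenge description.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(json_name, json.dumps(results, indent=2))
```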

If you have a question about the submission format, or are still having problems with your submission, please create a topic in the competition forum (rather than contacting the organizers directly by e-mail) and we will answer it as soon as possible.

Submission Policy

There is only one phase for the challenge.

How to submit your file?


Open the "Submit/View Results" link under "Participate" and follow the steps shown on that page.


Report Format


Report Submission Portal


For report submission, please send an email to loveu.cvpr@gmail.com

For more details, please refer to our Challenge White Paper.


Timeline


Communication & QA

Organizers


Enxin Song 

Zhejiang University

Wenhao Chai 

University of Washington 

Tian Ye 

The Hong Kong University of Science and Technology (Guangzhou) 

Gaoang Wang 

Zhejiang University

Jenq-Neng Hwang 

University of Washington