Following Video-ChatGPT, we use LLM-assisted evaluation for the long video question-answering task in both the global mode and the breakpoint mode. Given the question, the correct answer, and the answer predicted by the model, the LLM assistant returns a True/False judgment together with a relative score on a scale of 0 to 5. The evaluation runs with the assistant's default hyper-parameter settings, and we report both accuracy and the relative score. Because the GPT-3.5 and Claude APIs incur costs, we employ Gemini-Pro in this challenge. The complete prompt is displayed below; it takes about 250 tokens per question.
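As a sketch of how the reported metrics follow from the judge's outputs (the function and variable names here are illustrative, not from the official evaluation code): each judged question yields a dictionary such as `{'pred': 'yes', 'score': 4}`, and accuracy is the fraction of "yes" judgments while the relative score is averaged over all questions.

```python
# Illustrative aggregation of LLM-judge outputs into the reported metrics.
# Each judgement is a dict like {'pred': 'yes', 'score': 4}; these names
# mirror the prompt's requested format, not the official challenge code.

def aggregate(judgements):
    """Return (accuracy, average relative score) over a list of judge dicts."""
    if not judgements:
        return 0.0, 0.0
    n = len(judgements)
    correct = sum(1 for j in judgements if j["pred"].lower() == "yes")
    total_score = sum(j["score"] for j in judgements)
    return correct / n, total_score / n

if __name__ == "__main__":
    sample = [
        {"pred": "yes", "score": 5},
        {"pred": "no", "score": 1},
        {"pred": "yes", "score": 4},
    ]
    acc, avg = aggregate(sample)
    print(acc, avg)  # accuracy ~0.67, average score ~3.33
```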
Given the question ({question}), the correct answer ({answer}), and the predicted answer ({pred}), we insert them into the template:
{
"You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
"Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
"------"
"##INSTRUCTIONS: "
"- Focus on the meaningful match between the predicted answer and the correct answer.\n"
"- Consider synonyms or paraphrases as valid matches.\n"
"- Evaluate the correctness of the prediction compared to the answer."
"Please evaluate the following video-based question-answer pair:\n\n"
f"Question: {question}\n"
f"Correct Answer: {answer}\n"
f"Predicted Answer: {pred}\n\n"
"Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
"Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
"DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
"For example, your response should look like this: {'pred': 'yes', 'score': 4.8}."
}
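To make the protocol concrete, the snippet below is a minimal sketch (not the official challenge code) of filling the template and parsing the judge's reply; the system-instruction text is abbreviated, the actual Gemini-Pro API call is omitted, and the helper names are our own.

```python
import ast

# Hypothetical helpers illustrating the evaluation flow; the full
# instruction text from the prompt above is abbreviated here.

def build_prompt(question, answer, pred):
    """Insert a QA triple into (an abbreviated form of) the judge prompt."""
    return (
        "You are an intelligent chatbot designed for evaluating the "
        "correctness of generative outputs for question-answer pairs.\n"
        "Please evaluate the following video-based question-answer pair:\n\n"
        f"Question: {question}\n"
        f"Correct Answer: {answer}\n"
        f"Predicted Answer: {pred}\n"
    )

def parse_judgement(response_text):
    """Parse a reply like "{'pred': 'yes', 'score': 4}" into a dict."""
    result = ast.literal_eval(response_text.strip())
    if result["pred"] not in ("yes", "no") or not isinstance(result["score"], int):
        raise ValueError(f"malformed judge response: {response_text!r}")
    return result

if __name__ == "__main__":
    prompt = build_prompt("What is the man doing?", "Cooking.", "He is cooking.")
    # A well-formed judge reply, as requested by the prompt:
    print(parse_judgement("{'pred': 'yes', 'score': 4}"))
```

In practice the returned prompt string would be sent to Gemini-Pro and the raw reply passed to `parse_judgement` before aggregation.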
We evaluate performance on the MovieChat-1K test set in this competition; the video list can be found here.
We provide starter baseline code for Track 1 on GitHub.