📊 VDC Overview
VDC is a new benchmark designed to evaluate video captioning models’ ability to generate long, fine-grained, and structured descriptions. It consists of over 1K open-domain video clips, annotated with rich multi-part captions that reflect diverse visual cues and high-level understanding.
Each video is annotated with five structured caption components:
1. Camera Caption: Describes camera motion, shot types, angles, and transitions.
2. Short Caption: One-sentence summary of the video.
3. Background Caption: Describes background elements such as setting, weather, and objects.
4. Main Object Caption: Captures appearance and interactions of key subjects.
5. Detailed Caption: A long-form, narrative-style description combining all aspects.
Annotations are generated using GPT-4o with hierarchical prompting to ensure consistency, informativeness, and accuracy.
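For illustration, a single annotation record could be organized along the following lines. The field names and example values are our own illustrative assumptions, not the exact release schema; please consult the official VDC release for the real format.
'''
# Illustrative sketch of one VDC annotation record (Python dict).
# Field names and values are assumptions for illustration only;
# see the official VDC release for the exact schema.
example_annotation = {
    "video_id": "example_0001",  # hypothetical identifier
    "camera_caption": "The camera opens on a slow left pan in a medium shot, then cuts to a close-up.",
    "short_caption": "A chef plates a dessert in a busy restaurant kitchen.",
    "background_caption": "A brightly lit commercial kitchen with stainless-steel counters and staff in the background.",
    "main_object_caption": "A chef in a white uniform arranges berries and drizzles sauce onto a plate.",
    "detailed_caption": "The video begins with a medium shot of a chef at a stainless-steel counter ...",
}
'''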
🧪 Evaluation Protocol
We evaluate predicted captions using VDCscore, a novel large language model (LLM)-based metric introduced in AuroraCap (ICLR 2025). Unlike conventional captioning metrics (e.g., BLEU, CIDEr), VDCscore offers a factual and interpretable evaluation by decomposing long-form captions into QA pairs.
We use Llama-3.1-8B as the LLM evaluation assistant. The VDCscore evaluation follows a three-step QA-based pipeline:
1. Caption Generation
Your captioning model generates a detailed caption for each video; this caption is the input to the subsequent stages.
2. Answer Extraction from Predicted Caption
For each video, we define a fixed set of 20 question-answer pairs derived from the ground-truth caption. Llama-3.1-8B is then prompted to answer each question using only the model's predicted caption.
👉 Prompt used in this stage:
'''
You are an intelligent chatbot designed for providing accurate answers to questions related to the content based on a detailed description of a video or image.
——
##INSTRUCTIONS:
- Read the detailed description carefully.
- Answer the question only based on the detailed description.
- The answer should be a short sentence or phrase.
Please provide accurate answers to questions related to the content based on a detailed description of a video or image:
detailed description: {Your MODEL TEXT}
question: {GT QUESTION}
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide short but accurate answer.
'''
3. LLM-based Scoring of QA Triplets
Each video yields 20 QA triplets in the form of <Question, Ground-Truth Answer, Predicted Answer>. For each triplet, Llama-3.1-8B is used to assign:
• A correctness judgement ('yes' or 'no')
• A quality score from 0 to 5, with 5 indicating the strongest match
The final VDCscore for a video is computed by averaging the scores over all 20 triplets; an end-to-end sketch of steps 2 and 3 is given after the prompt below.
👉 Prompt used in this stage:
'''
You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here’s how you can accomplish the task:
——
##INSTRUCTIONS:
- Focus on the meaningful match between the predicted answer and the correct answer.
- Consider synonyms or paraphrases as valid matches.
- Evaluate the correctness of the prediction compared to the answer.
Please evaluate the following video-based question-answer pair:
Question: {GT QUESTION}
Correct Answer: {GT ANSWER}
Predicted Answer: {Your EXTRACTED ANSWER}
Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match.
Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string.
For example, your response should look like this: {'pred': 'yes', 'score': 4.8}.
'''
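Putting steps 2 and 3 together, the sketch below shows the overall QA extraction, judging, and aggregation flow. This is a minimal sketch, not the official code: query_llm is a hypothetical stand-in for however you call Llama-3.1-8B (e.g., a local transformers pipeline or an inference server), and the prompt bodies are abbreviated; substitute the exact prompts shown above. For the authoritative implementation, see the AuroraCap repository.
'''
# Minimal sketch of the VDCscore QA pipeline (steps 2 and 3), not the official code.
# query_llm is a hypothetical helper standing in for your Llama-3.1-8B call;
# the abbreviated prompts below should be replaced with the exact prompts above.
import ast

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to Llama-3.1-8B and return its text output."""
    raise NotImplementedError("Plug in your own Llama-3.1-8B inference call here.")

def extract_answer(predicted_caption: str, question: str) -> str:
    # Step 2: answer a ground-truth question using only the predicted caption.
    prompt = (
        "Please provide accurate answers to questions related to the content "
        "based on a detailed description of a video or image:\n"
        f"detailed description: {predicted_caption}\n"
        f"question: {question}\n"
        "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION."
    )
    return query_llm(prompt).strip()

def judge_triplet(question: str, gt_answer: str, pred_answer: str) -> dict:
    # Step 3: judge one <Question, Ground-Truth Answer, Predicted Answer> triplet.
    prompt = (
        "Please evaluate the following video-based question-answer pair:\n"
        f"Question: {question}\n"
        f"Correct Answer: {gt_answer}\n"
        f"Predicted Answer: {pred_answer}\n"
        "Provide your evaluation only as a yes/no and an integer score between 0 and 5. "
        "Only provide the Python dictionary string."
    )
    # Expected output looks like: {'pred': 'yes', 'score': 4}
    return ast.literal_eval(query_llm(prompt).strip())

def vdcscore_for_video(predicted_caption: str, qa_pairs: list) -> dict:
    # Aggregate over the 20 ground-truth QA pairs for one video.
    judgements = []
    for question, gt_answer in qa_pairs:
        pred_answer = extract_answer(predicted_caption, question)
        judgements.append(judge_triplet(question, gt_answer, pred_answer))
    accuracy = sum(j["pred"].lower() == "yes" for j in judgements) / len(judgements)
    avg_score = sum(j["score"] for j in judgements) / len(judgements)
    return {"accuracy": accuracy, "score": avg_score}
'''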
🛠️ How to Run VDCscore Evaluation
We have integrated the full VDCscore pipeline into two widely used evaluation platforms, VLMEvalKit and lmms-eval. Both platforms support loading your captioning model and the LLM evaluation assistant onto GPUs simultaneously, which enables seamless end-to-end evaluation with VDCscore.
⚠️ Note on Resources
If your local computational resources do not support loading both models concurrently, we recommend running caption generation first, saving the outputs, and then conducting post-evaluation using VDCscore.
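As a sketch of that two-stage route, you might dump one JSON object per video and score the file afterwards. The file name and field names below are illustrative assumptions; match whatever format the evaluation scripts you use actually expect.
'''
# Illustrative two-stage setup: generate captions first, evaluate later.
# File name and field names are assumptions; adapt them to the format
# expected by the AuroraCap / lmms-eval / VLMEvalKit scripts you use.
import json

def save_predictions(captions: dict, path: str = "vdc_predictions.jsonl") -> None:
    """Write one JSON object per line: {"video_id": ..., "pred_caption": ...}."""
    with open(path, "w", encoding="utf-8") as f:
        for video_id, caption in captions.items():
            f.write(json.dumps({"video_id": video_id, "pred_caption": caption}) + "\n")

# Stage 1: run your captioning model over the test videos, then call save_predictions(...).
# Stage 2: free the GPU, load Llama-3.1-8B, and run the VDCscore scoring
#          over vdc_predictions.jsonl using the official evaluation scripts.
'''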
For detailed instructions, including code and prompt usage, please refer to the official AuroraCap GitHub repository.
📊 Public Leaderboard and Baselines
We provide a public leaderboard featuring over 20 baseline models, covering a diverse range of Large Multimodal Models. These results offer participants a reference point to understand the task difficulty and how different models perform across multiple captioning dimensions (e.g., Detailed, Camera, Background, Object).
🧭 Challenge-specific Leaderboard:
For submissions to this challenge, we maintain a dedicated leaderboard:
⚠️ Note:
Final winners will not be ranked against the public baselines; instead, submitted models are evaluated independently. Strong performance, especially in VDCscore, is of course encouraged.
✅ Submission Checklist (Required for Every Submission)
For every submission to the challenge leaderboard, participants are required to also complete a short Google Form to ensure result traceability and evaluation transparency.
Please prepare the following for each submission:
1. ✅ The evaluation platform used (e.g., VLMEvalKit, lmms-eval, or other)
2. ✅ The final submission score
3. ✅ The complete evaluation log. If your evaluation was performed in multiple stages (e.g., caption generation for each video → QA extraction from the captions → VDCscore scoring results (LLM outputs)), the log for each of these stages must be submitted explicitly.
📄 You must submit this information via the following form each time you upload a result:
⚠️ Submissions without a completed form may be disqualified from the leaderboard or final evaluation.
🧪 Code Submission for the Winner
• The 1st place team will be required to submit code and instructions for final verification.
• Organizers will rerun the model to verify consistency.
• If the submission is consistent and meets all requirements, the team will receive the award.
• If not, the award will be passed to the next qualifying team.
💡 Need clarification?
Please refer to the Track 1 FAQ for details about evaluation rules, scoring, and submission requirements.
📘 Report Requirement & Contribution to Research
While only the top 3 teams are required to submit a technical report and release their code for final verification and reproducibility, we strongly encourage all participants to submit a report.
These reports will help us collectively analyze trends, challenges, and progress in the task of video detailed captioning. With participants’ permission, we may highlight exemplary insights or findings in our workshop summary or future publications.
🧠 Report Format: Use CVPR style (double column, 3-6 pages) or NeurIPS style (single column, 6-10 pages), inclusive of any references. Please state clearly what data, supervision, and pre-trained models you used so that we can make sure your results are comparable to others.
If you are not in the top 3 but still wish to submit a report, please follow the same format instructions and send your report to loveu.cvpr2025.track1 at gmail.com.
🏆 Award Structure
Thanks to the generous support from Lambda, Inc., the top-3 teams in each track will receive the following Lambda Cloud Credit Awards:
🥇 1st Place: $5,000 in Lambda Cloud Credits
🥈 2nd Place: $3,000 in Lambda Cloud Credits
🥉 3rd Place: $1,000 in Lambda Cloud Credits
After the workshop, winners will be contacted to claim their prizes. Lambda will coordinate directly with the team representatives via email.
Final rankings will be determined solely based on VDCscore performance on the test set.
However, to be eligible for official recognition and certificates, teams must submit a technical report by the deadline.
If a top-ranked team fails to submit their report, their position will be skipped and the award passed to the next eligible team.
The top 3 valid teams will each receive an official certificate of recognition from the organizers.
✅ Please make sure to follow the report format and submission instructions to ensure eligibility.
📅 Important Dates
April 15, 2025 (11:59 PM Pacific Time): evaluation server opens for the test set, with leaderboard available.
June 6, 2025 (11:59 PM Pacific Time): evaluation server closes.
June 10, 2025 (11:59 PM Pacific Time): report submission due.