Track 1B: Multi-Discipline Lecture Understanding Challenge

This track evaluates models’ ability to understand and reason over lecture videos in mathematics, physics, and chemistry.
The task consists of video detailed captioning(Review Notes) and question answering (Take Quiz), both required for final evaluation.
The benchmark is based on the Video-MMLU.
The final score is the average of the captioning and QA scores.
The top winner will be recognized at the workshop and awarded official certificates.
If you encounter computational resource limitations during the evaluation, we can assist with testing.

Video-MMLU Datasets & Annotation Overview

Video-MMLU is a benchmark designed for evaluating lecture video understanding in mathematics, physics, and chemistry. It consists of over 1,000 high-quality educational videos sourced from YouTube, annotated with structured review notes and question-answer pairs for multi-level comprehension testing. The dataset contains 1,065 lecture videos from 10 different channels. Videos are selected to emphasize dense reasoning, formula demonstrations, and animated problem-solving. All videos are under 4 minutes and include transcribed English subtitles.

Detailed Caption (Review Notes)

Each video is paired with a multi-paragraph, structured caption that simulates the type of lecture notes a student might take. The captions describe both visual and verbal content, covering formulas, diagrams, and temporal transitions throughout the lecture. These captions are evaluated using VDCscore, a QA-based metric that assesses factual consistency against ground-truth question-answer pairs. For each caption, we generate 15 surface-level open-ended question-answer pairs.

Question-Answer Pairs (Take Quiz)

In addition to the caption, each video includes 15 deeper-level open-ended question-answer pairs that test the model’s ability to comprehend and reason over lecture content. These questions are derived from the ground-truth caption using Claude-3.5-sonnet and are specifically crafted to span both surface-level perception and deeper conceptual understanding. The questions are natural in form and require reasoning across visual sequences and lecture narration. Each answer is constrained to a maximum of 15 words to ensure fairness and reliability in scoring. We require participants to set the LLM’s max_tokens parameter to 64 during answer generation. This enforces concise responses and prevents verbose or unconstrained outputs from disproportionately affecting evaluation scores.

Evaluation Protocol

Each model is required to produce:

A structured review note (caption) describing the content of the video;
A set of short-form answers to 15 open-ended reasoning questions.

For evaluating caption quality, we follow AuroraCap and adopt VDCscore as the metric. VDCscore transforms each caption into a set of short question-answer pairs and evaluates the model’s output via factual correctness. Specifically, for each reference caption, we pre-generate 15 open-ended QA pairs using Claude-3.5-sonnet. These questions cover key concepts and visual details in the lecture. Then, for each model-generated caption, we use a separate LLM to extract corresponding answers and compute alignment scores between the predicted and reference answers.

For evaluating the QA task, we directly compare model-generated answers to the ground-truth answers using either exact match accuracy or LLM-based rubric scoring, depending on the evaluation setting.

To ensure consistency and open access, we conduct evaluations using Qwen2.5-72B as the default LLM evaluator, with a fixed temperature of 0. Evaluation is fully integrated into both LMMS-Eval and VLMEvalKit.

🧮 Final Scoring Protocol

For each participating team, we compute:

1. Caption Score

We evaluate the generated review notes using VDCscore, separately on three discipline including Mathematics, Physics and Chemistry. The Caption Score is the average of the three discipline-specific scores.

Note: Since the captioning task primarily targets surface-level visual features, such as layouts, symbols, and formulas, we place greater emphasis on OCR precision. To reflect this, we modify the original VDCscore prompt used to evaluate the <question, reference answer, predicted answer> triplets as follows:

'''

You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs.

Your task is to compare the predicted answer with the correct answer and determine if they match

meaningfully. The evaluation criteria differ based on the type of question:

——

##INSTRUCTIONS:

1. For OCR-related questions:

- Perform a strict letter-by-letter comparison.

- Any difference in characters (including case, punctuation, or letter substitution) must result in ’no’.

- Minor spelling errors or missing characters should not be accepted.

2. For non-OCR-related questions:

- Focus on the meaningful match between the predicted answer and the correct answer.

- Synonyms or paraphrases can be considered valid matches.

- Minor spelling differences or alternative expressions should not be penalized.

User Please evaluate the following video-based question-answer pair:

Question: {GT QUESTION}

Correct Answer: {GT ANSWER}

Predicted Answer: {Your EXTRACTED ANSWER}

Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match.

Please generate the response in the form of a Python dictionary string with keys ’pred’ and ’score’, where value of ’pred’ is a string of ’yes’ or ’no’ and value of ’score’ is in INTEGER, not STRING.

DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string.

For example, your response should look like this: {’pred’: ’yes’, ’score’: 4.8}.

'''

2. QA Score

We evaluate the 15 predicted answers per video using Exact Match Accuracy (or optional GPT-based scoring), also averaged across the three domains.The QA Score is the average across mathematics, physics, and chemistry.

Note: Since the QA task focuses on deeper-level reasoning, such as logical deduction and multi-step problem-solving, we adopt a distinct VDCscore prompt tailored to reasoning accuracy and conceptual alignment when evaluating (question, reference answer, predicted answer) triplets.

'''

You are an intelligent chatbot designed for evaluating the correctness of generative outputs for reasoning-based question-answer pairs.

Your task is to compare the predicted answer with the correct answer based on the following rules:

——

##INSTRUCTIONS:

1. Evaluate Reasoning Tasks Strictly:

- The predicted answer must capture all critical concepts and details mentioned in the correct answer.

- If the correct answer mentions specific concepts or examples (e.g., ’odd numbers accumulate to form perfect squares’), the predicted answer must include these concepts or examples.

- Even if the phrasing differs, the key meaning and concepts must be preserved. However, omitting or altering key concepts or examples is not acceptable.

- Example 1: If the correct answer is ’The construction method shows how odd numbers accumulate

to form perfect squares,’ the predicted answer must include ’odd numbers’ and ’perfect squares.’ - Example 2: If the correct answer is ’To eliminate HBr and form an alkene,’ the predicted answer must address the elimination of HBr as well.

- Minor differences in phrasing are acceptable as long as the key information is retained.

- Critical Detail: If any essential element (e.g., key terms, concepts, or examples) is missing from the predicted answer, the answer is considered incorrect.

- Do not introduce new, unrelated information in the predicted answer.

——

##INSTRUCTIONS:

- Focus on the meaningful match between the predicted answer and the correct answer.

- Consider synonyms or paraphrases as valid matches.

- Evaluate the correctness of the prediction compared to the answer.

Please evaluate the following video-based question-answer pair:

Question: {GT QUESTION}

Correct Answer: {GT ANSWER}

Predicted Answer: {Your EXTRACTED ANSWER}

Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match.

DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python

dictionary string.

For example, your response should look like this: {’pred’: ’yes’, ’score’: 4.8}.

'''

3. Final Score

To determine the final ranking, we average the two task scores:

Final Score = (Caption Score + QA Score) / 2

📌 Only submissions that complete both captioning and QA will appear in the final Track 1B leaderboard.

🛠️ How to Run Evaluation

We have integrated the full evaluation pipeline into VLMEvalKit and lmms-eval (available on Video-MMLU Github repo). Both platforms support loading your evaluated captioning model and the LLM evaluation assistant onto GPU simultaneously. This setup enables seamless end-to-end evaluation.

⚠️ Note on Resources

If your local computational resources do not support loading both models concurrently, we recommend running output generation first, saving the outputs, and then conducting post-evaluation pipeline.

For detailed instructions, including code and prompt usage, please refer to the official Video-MMLU GitHub repository.

📊 Public Leaderboard and Baselines

We provide a public leaderboard featuring over 90 baseline models, including proprietary models, open-source models and vision-blind baselines. These results offer participants a reference point to understand the task difficulty and how different models perform across multi-discipline lecture perception and understanding.

🧭 Challenge-specific Leaderboard:

For submissions to this challenge, we maintain a dedicated leaderboard:

👉 Challenge Leaderboard

⚠️ Note:

Final winners will not be ranked against the public baselines.

Instead, submitted models are evaluated independently. Of course, strong performance is encouraged.

✅ Submission Checklist (Required for Every Submission)

For every submission to the challenge leaderboard, participants are required to also complete a short Google Form to ensure result traceability and evaluation transparency.

Please prepare the following for each submission:

1. ✅ The evaluation platform used (e.g., VLMEvalKit, lmms-eval, or other)

2. ✅ The final submission score

3. ✅ The complete evaluation log. If your evaluation was performed in multiple stages (e.g., Captions generated for each video → QA extraction from captions → VDCscore scoring results (LLM outputs)), then each of the three components above must be submitted explicitly.

📄 You must submit this information via the following form each time you upload a result:

👉 Submit via the Google Form

⚠️ Submissions without a completed form may be disqualified from the leaderboard or final evaluation.

🧪 Code Submission for the Winner

• The 1st place team will be required to submit code and instructions for final verification.

• Organizers will rerun the model to verify consistency.

• If the submission is consistent and meets all requirements, the team will receive the award.

• If not, the award will be passed to the next qualifying team.

💡 Need clarification?

Please refer to the Track 1 FAQ for details about evaluation rules, scoring, and submission requirements.

📘 Report Requirement & Contribution to Research

While only the top 3 teams are required to submit a technical report and release their code for final verification and reproducibility, we strongly encourage all participants to submit a report.

These reports will help us collectively analyze trends, challenges, and progress in the task of video detailed captioning. With participants’ permission, we may highlight exemplary insights or findings in our workshop summary or future publications.

🧠 Report Format: Use CVPR style (double column) in the form of 3-6 pages or NeurIPS style (single column) in the form of 6-10 pages inclusive of any references. Please explain clearly what data, supervision, pre-trained models you have used so that we can make sure your results are comparable to others.

If you are not in the top 3 but still wish to submit your report, please follow the same instructions listed below and send it to loveu.cvpr2025.track1 at gmail.com.

🏆 Award Structure

Thanks to the generous support from Lambda, Inc., the top-3 teams in each track will receive the following Lambda Cloud Credit Awards:

🥇 1st Place: $5,000 in Lambda Cloud Credits
🥈 2nd Place: $3,000 in Lambda Cloud Credits
🥉 3rd Place: $1,000 in Lambda Cloud Credits

After the workshop, winners will be contacted to claim their prizes. Lambda will coordinate directly with the team representatives via email.

📜 Award Policy

Final rankings will be determined solely based on VDCscore performance on the test set.
However, to be eligible for official recognition and certificates, teams must submit a technical report by the deadline.
If a top-ranked team fails to submit their report, their position will be skipped and the award passed to the next eligible team.
The top 3 valid teams will each receive an official certificate of recognition from the organizers.

✅ Please make sure to follow the report format and submission instructions to ensure eligibility.

Timeline

April 15, 2025 (11:59PM Pacific Time): evaluation server open for the test set, with leaderboard available.
Jun 06, 2025 (11:59PM Pacific Time): evaluation server close.
Jun 10, 2025 (11:59PM Pacific Time): report submission due.