Each model is required to produce two outputs for every lecture video: a set of review notes (the captioning task) and answers to the associated question set (the QA task).
For evaluating caption quality, we follow AuroraCap and adopt VDCscore as the metric. VDCscore transforms each reference caption into a set of short question-answer pairs and evaluates the model’s output for factual correctness. Specifically, for each reference caption, we pre-generate 15 open-ended QA pairs using Claude-3.5-sonnet; these questions cover the key concepts and visual details in the lecture. Then, for each model-generated caption, we use a separate LLM to extract answers to these questions from the caption and compute alignment scores between the predicted and reference answers.
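To make the aggregation concrete, the per-caption computation can be sketched as below; the function names are illustrative placeholders for the two LLM calls, not the official implementation.
'''
# Minimal sketch of per-caption VDCscore aggregation (Python). All names are
# illustrative placeholders: extract_answer stands in for the answer-extraction
# LLM and judge_qa_pair for the LLM judge described above.
from typing import Callable, Dict, List, Tuple

def vdcscore_for_caption(
    qa_pairs: List[Tuple[str, str]],                 # 15 pre-generated (question, reference answer) pairs
    predicted_caption: str,                          # model-generated caption under evaluation
    extract_answer: Callable[[str, str], str],       # (question, caption) -> predicted answer
    judge_qa_pair: Callable[[str, str, str], Dict],  # returns {'pred': 'yes'/'no', 'score': int}
) -> Dict[str, float]:
    """Accuracy = fraction of 'yes' verdicts; score = mean judge score over the pairs."""
    verdicts = []
    for question, reference_answer in qa_pairs:
        predicted_answer = extract_answer(question, predicted_caption)
        verdicts.append(judge_qa_pair(question, reference_answer, predicted_answer))
    accuracy = sum(v["pred"] == "yes" for v in verdicts) / len(verdicts)
    mean_score = sum(v["score"] for v in verdicts) / len(verdicts)
    return {"accuracy": accuracy, "score": mean_score}
'''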
For evaluating the QA task, we directly compare model-generated answers to the ground-truth answers using either exact match accuracy or LLM-based rubric scoring, depending on the evaluation setting.
To ensure consistency and open access, we conduct evaluations using Qwen2.5-72B as the default LLM evaluator, with a fixed temperature of 0. Evaluation is fully integrated into both LMMS-Eval and VLMEvalKit.
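For illustration only (the reference implementations live in LMMS-Eval and VLMEvalKit), a judge call at temperature 0 could look like the sketch below, assuming Qwen2.5-72B is served behind an OpenAI-compatible endpoint; the base URL and model id are placeholders.
'''
# Hedged sketch of calling the LLM evaluator at temperature 0 (Python).
# Assumption: the judge is served behind an OpenAI-compatible endpoint;
# base_url and model id below are placeholders, not official settings.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask_judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model="Qwen2.5-72B-Instruct",  # placeholder model id on the serving endpoint
        temperature=0,                 # fixed temperature for reproducible judging
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
'''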
For each participating team, we compute:
1. Caption Score
We evaluate the generated review notes using VDCscore, computed separately for three disciplines: Mathematics, Physics, and Chemistry. The Caption Score is the average of the three discipline-specific scores.
Note: Since the captioning task primarily targets surface-level visual features, such as layouts, symbols, and formulas, we place greater emphasis on OCR precision. To reflect this, we modify the original VDCscore prompt used to evaluate the <question, reference answer, predicted answer> triplets as follows:
'''
You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs.
Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. The evaluation criteria differ based on the type of question:
——
##INSTRUCTIONS:
1. For OCR-related questions:
- Perform a strict letter-by-letter comparison.
- Any difference in characters (including case, punctuation, or letter substitution) must result in ’no’.
- Minor spelling errors or missing characters should not be accepted.
2. For non-OCR-related questions:
- Focus on the meaningful match between the predicted answer and the correct answer.
- Synonyms or paraphrases can be considered valid matches.
- Minor spelling differences or alternative expressions should not be penalized.
Please evaluate the following video-based question-answer pair:
Question: {GT QUESTION}
Correct Answer: {GT ANSWER}
Predicted Answer: {Your EXTRACTED ANSWER}
Provide your evaluation only as a yes/no and a score, where the score is an integer between 0 and 5, with 5 indicating the highest meaningful match.
Please generate the response in the form of a Python dictionary string with keys ’pred’ and ’score’, where the value of ’pred’ is a string of ’yes’ or ’no’ and the value of ’score’ is an INTEGER, not a STRING.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string.
For example, your response should look like this: {’pred’: ’yes’, ’score’: 4}.
'''
2. QA Score
We evaluate the 15 predicted answers per video using Exact Match Accuracy (or, optionally, GPT-based scoring). The QA Score is the average across Mathematics, Physics, and Chemistry.
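For the exact-match setting, the comparison can be sketched as below, assuming simple lowercasing and punctuation stripping; the actual normalization rules may differ.
'''
# Minimal exact-match sketch (Python); the normalization here is an assumption.
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)
'''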
Note: Since the QA task focuses on deeper-level reasoning, such as logical deduction and multi-step problem-solving, we adopt a distinct VDCscore prompt tailored to reasoning accuracy and conceptual alignment when evaluating (question, reference answer, predicted answer) triplets.
'''
You are an intelligent chatbot designed for evaluating the correctness of generative outputs for reasoning-based question-answer pairs.
Your task is to compare the predicted answer with the correct answer based on the following rules:
——
##INSTRUCTIONS:
1. Evaluate Reasoning Tasks Strictly:
- The predicted answer must capture all critical concepts and details mentioned in the correct answer.
- If the correct answer mentions specific concepts or examples (e.g., ’odd numbers accumulate to form perfect squares’), the predicted answer must include these concepts or examples.
- Even if the phrasing differs, the key meaning and concepts must be preserved. However, omitting or altering key concepts or examples is not acceptable.
- Example 1: If the correct answer is ’The construction method shows how odd numbers accumulate to form perfect squares,’ the predicted answer must include ’odd numbers’ and ’perfect squares.’
- Example 2: If the correct answer is ’To eliminate HBr and form an alkene,’ the predicted answer must address the elimination of HBr as well.
- Minor differences in phrasing are acceptable as long as the key information is retained.
- Critical Detail: If any essential element (e.g., key terms, concepts, or examples) is missing from the predicted answer, the answer is considered incorrect.
- Do not introduce new, unrelated information in the predicted answer.
——
##INSTRUCTIONS:
- Focus on the meaningful match between the predicted answer and the correct answer.
- Consider synonyms or paraphrases as valid matches.
- Evaluate the correctness of the prediction compared to the answer.
Please evaluate the following video-based question-answer pair:
Question: {GT QUESTION}
Correct Answer: {GT ANSWER}
Predicted Answer: {Your EXTRACTED ANSWER}
Provide your evaluation only as a yes/no and a score, where the score is an integer between 0 and 5, with 5 indicating the highest meaningful match.
Please generate the response in the form of a Python dictionary string with keys ’pred’ and ’score’, where the value of ’pred’ is a string of ’yes’ or ’no’ and the value of ’score’ is an INTEGER, not a STRING.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string.
For example, your response should look like this: {’pred’: ’yes’, ’score’: 4}.
'''
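Both prompts instruct the judge to reply with a bare Python dictionary string; a hedged parsing sketch is shown below (the evaluation toolkits may handle malformed replies more defensively).
'''
# Parse the judge's dictionary-string reply, e.g. "{'pred': 'yes', 'score': 4}".
# ast.literal_eval is used instead of eval() so model output is never executed.
import ast

def parse_judge_reply(reply: str) -> dict:
    start, end = reply.find("{"), reply.rfind("}") + 1  # tolerate stray text around the dict
    result = ast.literal_eval(reply[start:end])
    return {"pred": str(result["pred"]).lower(), "score": int(result["score"])}
'''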
3. Final Score
To determine the final ranking, we average the two task scores:
Final Score = (Caption Score + QA Score) / 2
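Concretely, with the per-discipline scores in hand, the ranking score reduces to the sketch below (dictionary keys are illustrative).
'''
# Sketch of the final ranking computation (Python); dictionary keys are illustrative.
DISCIPLINES = ["mathematics", "physics", "chemistry"]

def final_score(caption_scores: dict, qa_scores: dict) -> float:
    caption = sum(caption_scores[d] for d in DISCIPLINES) / len(DISCIPLINES)
    qa = sum(qa_scores[d] for d in DISCIPLINES) / len(DISCIPLINES)
    return (caption + qa) / 2
'''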
📌 Only submissions that complete both captioning and QA will appear in the final Track 1B leaderboard.