We use Llama-3.1-8B as the LLM evaluation assistant. The VDCscore evaluation follows a three-step QA-based pipeline:
1. Caption Generation
The model under evaluation generates a detailed caption for each video. This caption serves as the input to the subsequent stages.
2. Answer Extraction from Predicted Caption
For each video, a fixed set of 20 questions is derived from the ground-truth caption. Llama-3.1-8B is then prompted to answer each question using only the model’s predicted caption.
👉 Prompt used in this stage:
'''
You are an intelligent chatbot designed for providing accurate answers to questions related to the content based on a detailed description of a video or image.
——
##INSTRUCTIONS:
- Read the detailed description carefully.
- Answer the question only based on the detailed description.
- The answer should be a short sentence or phrase.
Please provide accurate answers to questions related to the content based on a detailed description of a video or image:
detailed description: {Your MODEL TEXT}
question: {GT QUESTION}
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide short but accurate answer.
'''
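The extraction step above can be sketched as follows. This is a minimal illustration, not the authors' released code: the prompt template mirrors the one shown, and `query_llm` is a hypothetical callable standing in for however the Llama-3.1-8B endpoint is invoked.

```python
# Step 2 sketch: fill the extraction prompt for one (caption, question) pair
# and collect one answer per ground-truth question.

EXTRACTION_PROMPT = (
    "You are an intelligent chatbot designed for providing accurate answers to "
    "questions related to the content based on a detailed description of a video or image.\n"
    "##INSTRUCTIONS:\n"
    "- Read the detailed description carefully.\n"
    "- Answer the question only based on the detailed description.\n"
    "- The answer should be a short sentence or phrase.\n"
    "detailed description: {caption}\n"
    "question: {question}\n"
    "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. "
    "Only provide short but accurate answer."
)

def build_extraction_prompt(caption: str, question: str) -> str:
    """Fill the Step-2 template with the predicted caption and one GT question."""
    return EXTRACTION_PROMPT.format(caption=caption, question=question)

def extract_answers(caption, questions, query_llm):
    # query_llm: hypothetical str -> str wrapper around Llama-3.1-8B (assumed)
    return [query_llm(build_extraction_prompt(caption, q)) for q in questions]
```

Running this for all 20 questions yields the predicted answers used in Step 3.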
3. LLM-based Scoring of QA Triplets
Each video yields 20 QA triplets in the form of <Question, Ground-Truth Answer, Predicted Answer>. For each triplet, Llama-3.1-8B is used to assign:
• A correctness judgement ('yes' or 'no')
• A quality score (degree of meaningful match) ranging from 0 to 5
The final VDCscore for a video is computed by averaging the scores over all 20 triplets.
👉 Prompt used in this stage:
'''
You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here’s how you can accomplish the task:
——
##INSTRUCTIONS:
- Focus on the meaningful match between the predicted answer and the correct answer.
- Consider synonyms or paraphrases as valid matches.
- Evaluate the correctness of the prediction compared to the answer.
Please evaluate the following video-based question-answer pair:
Question: {GT QUESTION}
Correct Answer: {GT ANSWER}
Predicted Answer: {Your EXTRACTED ANSWER}
Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match.
Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string.
For example, your response should look like this: {'pred': 'yes', 'score': 4}.
'''
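Once the judge returns one dictionary string per triplet, the per-video score is a simple aggregation. The sketch below, under the assumption that responses follow the requested format exactly, parses each judgement and averages over the 20 triplets (reporting both the fraction of 'yes' verdicts and the mean quality score).

```python
import ast

def parse_judgement(response: str):
    """Parse the judge's dictionary string, e.g. "{'pred': 'yes', 'score': 4}"."""
    d = ast.literal_eval(response.strip())
    return d["pred"].lower() == "yes", float(d["score"])

def vdcscore(judgements):
    """Aggregate one video's 20 judgements into (accuracy, mean quality score)."""
    parsed = [parse_judgement(r) for r in judgements]
    accuracy = sum(ok for ok, _ in parsed) / len(parsed)
    mean_score = sum(s for _, s in parsed) / len(parsed)
    return accuracy, mean_score
```

In practice a try/except around `ast.literal_eval` is advisable, since LLM judges occasionally emit malformed output despite the format instruction.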