Following Video-ChatGPT, we use LLM-assisted evaluation for the long video question-answering task in both the global mode and the breakpoint mode. Given the question, the correct answer, and the answer predicted by the model, the LLM assistant returns a True/False judgment together with a relative score on a scale of 0 to 5. The evaluation runs with the assistant's default hyper-parameter settings, and we report both accuracy and the relative score. Because the GPT-3.5 and Claude APIs incur costs, we employ Gemini-Pro in this challenge. The complete prompt is displayed below; it takes about 250 tokens per question.
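As a sketch of how the reported metrics follow from the judge's outputs (the function and variable names here are illustrative, not from the official evaluation code): each judged question yields a dictionary such as `{'pred': 'yes', 'score': 4}`, and accuracy is the fraction of "yes" judgments while the relative score is averaged over all questions.

```python
# Illustrative aggregation of LLM-judge outputs into the reported metrics.
# Each judgement is a dict like {'pred': 'yes', 'score': 4}; these names
# mirror the prompt's requested format, not the official challenge code.

def aggregate(judgements):
    """Return (accuracy, average relative score) over a list of judge dicts."""
    if not judgements:
        return 0.0, 0.0
    n = len(judgements)
    correct = sum(1 for j in judgements if j["pred"].lower() == "yes")
    total_score = sum(j["score"] for j in judgements)
    return correct / n, total_score / n

if __name__ == "__main__":
    sample = [
        {"pred": "yes", "score": 5},
        {"pred": "no", "score": 1},
        {"pred": "yes", "score": 4},
    ]
    acc, avg = aggregate(sample)
    print(acc, avg)  # accuracy ~0.67, average score ~3.33
```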
Given the question ({question}), the correct answer ({answer}), and the predicted answer ({pred}), we insert them into the template:
{
"You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
"Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
"------"
"##INSTRUCTIONS: "
"- Focus on the meaningful match between the predicted answer and the correct answer.\n"
"- Consider synonyms or paraphrases as valid matches.\n"
"- Evaluate the correctness of the prediction compared to the answer."
"Please evaluate the following video-based question-answer pair:\n\n"
f"Question: {question}\n"
f"Correct Answer: {answer}\n"
f"Predicted Answer: {pred}\n\n"
"Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
"Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
"DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
"For example, your response should look like this: {'pred': 'yes', 'score': 4.8}."
}
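To make the protocol concrete, the snippet below is a minimal sketch (not the official challenge code) of filling the template and parsing the judge's reply; the system-instruction text is abbreviated, the actual Gemini-Pro API call is omitted, and the helper names are our own.

```python
import ast

# Hypothetical helpers illustrating the evaluation flow; the full
# instruction text from the prompt above is abbreviated here.

def build_prompt(question, answer, pred):
    """Insert a QA triple into (an abbreviated form of) the judge prompt."""
    return (
        "You are an intelligent chatbot designed for evaluating the "
        "correctness of generative outputs for question-answer pairs.\n"
        "Please evaluate the following video-based question-answer pair:\n\n"
        f"Question: {question}\n"
        f"Correct Answer: {answer}\n"
        f"Predicted Answer: {pred}\n"
    )

def parse_judgement(response_text):
    """Parse a reply like "{'pred': 'yes', 'score': 4}" into a dict."""
    result = ast.literal_eval(response_text.strip())
    if result["pred"] not in ("yes", "no") or not isinstance(result["score"], int):
        raise ValueError(f"malformed judge response: {response_text!r}")
    return result

if __name__ == "__main__":
    prompt = build_prompt("What is the man doing?", "Cooking.", "He is cooking.")
    # A well-formed judge reply, as requested by the prompt:
    print(parse_judgement("{'pred': 'yes', 'score': 4}"))
```

In practice the returned prompt string would be sent to Gemini-Pro and the raw reply passed to `parse_judgement` before aggregation.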
We evaluate performance on the MovieChat-1K test set in this competition; the video list can be found here.
We provide starter baseline code for Track 1 on GitHub.