30.03.2025: Evaluation begins
19.06.2025: Final evaluation policy updated
Evaluation is hosted on Codabench and is based on two test sets: a public test set and a private test set. Participants must submit their generated captions in the specified format to be ranked on the leaderboard.
In this challenge, we will use the four most commonly employed metrics for evaluating video captioning: BLEU@4, METEOR, CIDEr, and ROUGE-L.
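For local sanity checks, these metrics can be computed with the open-source pycocoevalcap package. The sketch below is illustrative only: the captions are made-up placeholders, and the organizers' exact scoring pipeline (including tokenization) may differ.

```python
# Hedged sketch: local scoring with the pycocoevalcap package
# (pip install pycocoevalcap; METEOR additionally needs a Java runtime).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map a video ID to a list of caption strings.
# These captions are placeholders, not challenge data.
references = {"airplane-1": ["a small white airplane descends towards a runway"]}
candidates = {"airplane-1": ["a white airplane lands near a mountain runway"]}

scorers = [
    (Bleu(4), "BLEU@4"),  # compute_score returns a list of BLEU@1..BLEU@4
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE-L"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(references, candidates)
    if isinstance(score, list):  # keep only BLEU@4 from the BLEU list
        score = score[-1]
    print(f"{name}: {score:.4f}")
```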
The final ranking will be based on the average scores from the public and private test sets across the four evaluation metrics.
For each metric, teams will be ranked from 1st to 3rd. Points are awarded as follows:
🥇 1st place: 3 points
🥈 2nd place: 2 points
🥉 3rd place: 1 point
The total score across all four metrics determines the final ranking. The top 3 teams with the highest total points will be selected as winners.
If two teams have the same total points, the tie will be broken in the following order:
1. CIDEr – best reflects how well the caption matches human intent
2. METEOR – considers synonyms and meaning
3. BLEU@4 – checks word overlap (less important for intent)
If teams remain tied, the organizing team may conduct a manual review to determine the final ranking.
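To make the procedure concrete, here is a small illustrative sketch in Python; the team names and scores are invented, and only the ranking logic follows the rules above.

```python
# Illustrative sketch of the point-based ranking described above.
# Team names and scores are made up; only the procedure follows the rules.
METRICS = ["BLEU@4", "METEOR", "CIDEr", "ROUGE-L"]
TIEBREAK_ORDER = ["CIDEr", "METEOR", "BLEU@4"]  # tie-break order given above

scores = {
    "team_a": {"BLEU@4": 0.31, "METEOR": 0.27, "CIDEr": 0.55, "ROUGE-L": 0.58},
    "team_b": {"BLEU@4": 0.33, "METEOR": 0.25, "CIDEr": 0.51, "ROUGE-L": 0.60},
    "team_c": {"BLEU@4": 0.29, "METEOR": 0.28, "CIDEr": 0.57, "ROUGE-L": 0.56},
}

points = {team: 0 for team in scores}
for metric in METRICS:
    ranked = sorted(scores, key=lambda t: scores[t][metric], reverse=True)
    for place, team in enumerate(ranked[:3]):  # top 3 teams per metric
        points[team] += 3 - place              # 3 / 2 / 1 points

# Final order: total points, then CIDEr, METEOR, BLEU@4 as tie-breakers.
final = sorted(
    scores,
    key=lambda t: (points[t], *(scores[t][m] for m in TIEBREAK_ORDER)),
    reverse=True,
)
print(final, points)
```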
Participants must submit a .zip file containing two JSON files, result_public.json and result_private.json, to the benchmark platform. Refer to the sample files below for the public and private video IDs.
- sample_result_public.json: Contains the video IDs for the Public Test Set.
- sample_result_private.json: Contains the video IDs for the Private Test Set.
Each file must follow the JSON format shown below:
{
    "version": "Submission File Example VERSION 1.0",
    "captions": {
        "airplane-1": "a small white airplane descends towards a runway amid the mountainous terrain",
        "airplane-2": "large brown airplane takes off and ascends into cloudy sky while emitting trails"
    }
}
The submission should be packaged as follows:
result.zip
├── result_public.json
└── result_private.json
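The following is a minimal sketch of one way to produce this package with Python's standard library; the caption entries are placeholders, and a real submission must cover every video ID listed in the sample files.

```python
# Sketch: write the two JSON files in the required format and zip them.
# The captions below are placeholders, not real predictions.
import json
import zipfile

def write_result(path, captions):
    payload = {
        "version": "Submission File Example VERSION 1.0",
        "captions": captions,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=4, ensure_ascii=False)

write_result("result_public.json",
             {"airplane-1": "a small white airplane descends towards a runway "
                            "amid the mountainous terrain"})
write_result("result_private.json",
             {"airplane-2": "large brown airplane takes off and ascends into "
                            "cloudy sky while emitting trails"})

# Package both files at the top level of result.zip, as shown above.
with zipfile.ZipFile("result.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("result_public.json")
    zf.write("result_private.json")
```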