The DVU development & testing datasets are now available (please submit the data agreement to access the testing dataset).

Dataset & Queries

The Deep Video Understanding (DVU) dataset is split into a development set and a test set. The development set consists of 19 movies from the 2020-2022 versions of this challenge: 14 have Creative Commons licenses and were used in the 2020-2021 versions, and 5 are licensed from the KinoLorberEdu platform and were part of the test set for the 2022 version. The test set will comprise 5 movies licensed from the KinoLorberEdu platform. One of these movies is reused from the 2022 version of this challenge, while the other 4 have not been used in this challenge before. The challenge will continue working with the same ontology used in the 2021 iteration, which made use of the MovieGraphs vocabulary (Vicol, Paul, et al. "Moviegraphs: Towards understanding human-centric situations from videos." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018) to include character attributes, interactions, scene locations, and situations.

If you make use of the DVU dataset or otherwise participate in the challenge please cite this paper using the following bibtex:

@inproceedings{curtis2020hlvu,

title={HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do},

author={Curtis, Keith and Awad, George and Rajput, Shahzad and Soboroff, Ian},

booktitle={Proceedings of the 2020 International Conference on Multimedia Retrieval},

pages={355--361},

year={2020}

}

Movie Dataset

The full Deep Video Understanding training set of 19 movies, with a total duration of ~25 hours, is available from this link. To access the full Kinolorber-licensed movies, each team (participant) must first sign the data agreement form located HERE and return it to the organizers. After receiving the signed form, the organizers will send access information for the Kinolorber-licensed movies.

The training set has been annotated by human assessors, and the final ground truth has been provided to participating researchers for training and development of their systems. The ground truth covers both the overall movie level (ontology of relations, entities, actions & events; Knowledge Graph; and names and images of all main characters) and the individual scene level (ontology of locations, people/entities, interactions and their order between people, sentiments, and a text summary). More information about the movies' genres and durations is provided:

Testing Dataset

To request the testing dataset, each team (participant) must first sign the data agreement form located HERE and return it to the organizers. After receiving the signed form, the organizers will send access information for the 2023 testing dataset (original movies, segmented scenes, master scene reference table files). The ontology/vocabulary of classes used in the movie-level (relationships) and scene-level (interactions, sentiments, etc.) annotations and testing dataset can be found here (also as a pdf here).

Resources by PAST participating teams

Automatically generated transcripts by the University of Zurich are available from HERE. Please cite the team's 2020 system paper: https://dl.acm.org/doi/10.1145/3394171.3416292

Speech and person/face bounding box annotations for a subset of the HLVU dataset are available from the TokyoTech team HERE. Please cite the team's 2020 system paper: https://dl.acm.org/doi/abs/10.1145/3395035.3425639

Scene annotations and resources (for 2020 DVU edition) by Nanjing University are available from HERE together with a README file. Please cite the team's 2020 system paper:

https://dl.acm.org/doi/10.1145/3394171.3416303

Resources by Nanjing University (updating 2020 data resources and adding new resources for 2021 testing dataset) are available from HERE. Please cite the team's paper if you made use of their tools and/or outputs: https://dl.acm.org/doi/10.1145/3474085.3479214

Subtitles & tracking/recognition results for the 2022 testing dataset are available from HERE. Please cite the team's paper if you use their outputs in your research: https://dl.acm.org/doi/abs/10.1145/3503161.3551600

Movie-Level

Query Types

Fill in the graph space: Fill in spaces in the Knowledge Graph (KG). Given the listed relationships, events or actions for certain nodes, where some nodes are replaced by variables X, Y, etc., solve for X, Y etc. Example of The Simpsons: X Married To Marge. X Friend Of Lenny. Y Volunteers at Church. Y Neighbor Of X. Solution for X and Y in that case would be: X = Homer, Y = Ned Flanders.
Question Answering: This query type represents questions on the resulting KG, including actions and events, of the movies in the described dataset. For example, we may ask 'How many children does Person A have?', in which case participating researchers should count the 'Parent Of' relationships Person A has in the Knowledge Graph. These are multiple choice questions.
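
The two movie-level query types can be illustrated on a toy graph. The following is a minimal sketch, assuming the Knowledge Graph is stored as (subject, relation, object) triples; the function names `solve` and `count_relation` are hypothetical, and every triple other than the four Simpsons facts quoted above is an illustrative addition, not part of the DVU data.

```python
from itertools import permutations

# Toy knowledge graph as (subject, relation, object) triples.
# Only the four facts from the Simpsons example come from the page;
# the Carl and "Parent Of" triples are illustrative additions.
KG = {
    ("Homer", "Married To", "Marge"),
    ("Homer", "Friend Of", "Lenny"),
    ("Ned Flanders", "Volunteers at", "Church"),
    ("Ned Flanders", "Neighbor Of", "Homer"),
    ("Carl", "Friend Of", "Lenny"),
    ("Homer", "Parent Of", "Bart"),
    ("Homer", "Parent Of", "Lisa"),
    ("Homer", "Parent Of", "Maggie"),
}

QUERY = [
    ("X", "Married To", "Marge"),
    ("X", "Friend Of", "Lenny"),
    ("Y", "Volunteers at", "Church"),
    ("Y", "Neighbor Of", "X"),
]


def solve(kg, query, variables):
    """Return every assignment of distinct entities to the variables
    for which all query triples are present in the graph."""
    entities = sorted({s for s, _, _ in kg} | {o for _, _, o in kg})
    solutions = []
    for combo in permutations(entities, len(variables)):
        binding = dict(zip(variables, combo))
        # Substitute variables with candidate entities; constants stay as-is.
        grounded = {(binding.get(s, s), r, binding.get(o, o)) for s, r, o in query}
        if grounded <= kg:
            solutions.append(binding)
    return solutions


def count_relation(kg, subject, relation):
    """Count outgoing edges of one relation type, e.g. 'Parent Of'."""
    return sum(1 for s, r, _ in kg if s == subject and r == relation)


print(solve(KG, QUERY, ["X", "Y"]))              # [{'X': 'Homer', 'Y': 'Ned Flanders'}]
print(count_relation(KG, "Homer", "Parent Of"))  # 3
```

The brute-force search is only practical for small graphs; a real system would need to rank candidates per variable rather than return exact matches, since submissions are scored as ranked lists.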

Movie-Level

Metrics

Fill in the graph space: Results will be treated as a ranked list of result items for each unknown variable. The Reciprocal Rank score will be calculated per unknown variable, and the Mean Reciprocal Rank (MRR) per query.
Question Answering: Scores for this query will be calculated as the number of correct answers / the number of total questions.
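
The two scoring rules above can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: the function names are hypothetical, and scoring a missing correct item as 0 is an assumption about how the Reciprocal Rank handles absent answers.

```python
def reciprocal_rank(ranked, correct):
    """1/rank of the correct item (ranks start at 1); 0.0 if it is absent.
    Treating an absent item as 0.0 is an assumption."""
    for rank, item in enumerate(ranked, start=1):
        if item == correct:
            return 1.0 / rank
    return 0.0


def query_mrr(ranked_by_var, truth_by_var):
    """Mean of the reciprocal ranks over all unknown variables of one query."""
    scores = [
        reciprocal_rank(ranked_by_var.get(var, []), correct)
        for var, correct in truth_by_var.items()
    ]
    return sum(scores) / len(scores)


def accuracy(answers, truth):
    """Number of correct answers / number of total questions."""
    correct = sum(1 for q, a in truth.items() if answers.get(q) == a)
    return correct / len(truth)


# Illustrative submission: X is ranked first, Y is ranked second.
submission = {"X": ["Homer", "Carl"], "Y": ["Moe", "Ned Flanders"]}
truth = {"X": "Homer", "Y": "Ned Flanders"}
print(query_mrr(submission, truth))  # (1/1 + 1/2) / 2 = 0.75
```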

Scene-Level

Query Types

Find the Unique Scene: Given a full, inclusive list of interactions, unique to a specific scene in the movie, teams should find which scene this is.
Fill in the graph space: Find the person in a specific scene with the following attributes and interactions with others. Participating teams will be given a scene number, a list of person attributes, and a list of interactions to and from other people. Teams should find the only person in that scene with those attributes and interactions.
Find next or previous interaction: Given a specific scene and a specific interaction between person X and person Y, participants will be asked to return either the previous interaction or the next interaction, in either direction, between person X and Person Y. This can be specifically the next or previous interaction within the same scene, or over the entire movie. These will be multiple choice questions selected from a list of possible interactions, only one of which will be correct.
Find the 1-to-1 relationship between scenes and natural language descriptions: Given a set of scenes, and a set of natural language descriptions of movie scenes, match the correct natural language description for each scene.
Classify scene sentiment from a given scene: Given a specific movie scene and a set of possible sentiments, classify the correct sentiment label for each given scene.
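
Two of the interaction queries above can be sketched on toy data. This is a minimal illustration, assuming each scene is annotated as an ordered list of (source person, interaction, target person) tuples; the scene contents, person names, interaction labels, and function names are all hypothetical and do not come from the DVU annotations.

```python
# Hypothetical annotations: scene number -> ordered interactions.
SCENES = {
    1: [("Anna", "greets", "Ben"), ("Ben", "asks", "Anna")],
    2: [("Anna", "greets", "Ben"), ("Ben", "threatens", "Anna"),
        ("Anna", "leaves", "Ben")],
    3: [("Cara", "hugs", "Ben")],
}


def find_unique_scene(scenes, interactions):
    """Return the scenes whose set of interactions equals the query list.
    Using exact set equality for a 'full, inclusive list' is an assumption."""
    wanted = set(interactions)
    return [num for num, acts in scenes.items() if set(acts) == wanted]


def neighbour_interaction(scenes, scene, anchor, person_x, person_y, direction="next"):
    """Return the next or previous interaction between two people, in
    either direction, within one scene; None if there is none."""
    pair = {person_x, person_y}
    between = [a for a in scenes[scene] if {a[0], a[2]} == pair]
    i = between.index(anchor)
    j = i + 1 if direction == "next" else i - 1
    return between[j] if 0 <= j < len(between) else None


print(find_unique_scene(SCENES, [("Ben", "asks", "Anna"), ("Anna", "greets", "Ben")]))
# [1]
print(neighbour_interaction(SCENES, 2, ("Ben", "threatens", "Anna"), "Anna", "Ben"))
# ('Anna', 'leaves', 'Ben')
```

Extending `neighbour_interaction` to the whole movie would amount to concatenating the per-scene lists in scene order before searching.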

Scene-Level

Metrics

Find the Unique Scene: Results will be treated as a ranked list of result items for each unknown variable. The Reciprocal Rank score will be calculated per unknown variable, and the Mean Reciprocal Rank (MRR) per query.
Fill in the graph space: Results will be treated as a ranked list of result items for each unknown variable. The Reciprocal Rank score will be calculated per unknown variable, and the Mean Reciprocal Rank (MRR) per query.
Find next or previous interaction: Scores for this query will be calculated as the number of correct answers / the number of total questions.
Find the 1-to-1 relationship between scenes and natural language descriptions: Scores for this query will be calculated as the number of correct answers / the number of total questions.
Classify scene sentiment from a given scene: Scores for this query will be calculated as the number of correct answers / the number of total questions.

Testing queries: these will be made available in the coming months (about 1.5 months before submission).

An addition to the 2023 challenge is a new robustness sub-task that systems can choose to submit solutions to. To measure the robustness of the multi-modal systems participating in the challenge this year, we will provide a secondary version of the testing dataset with various types of perturbations and corruptions observed in real-world multi-modal data. This will allow teams and organizers to measure how much system performance is affected by the introduced noise.
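
The page does not define how the effect of noise is quantified. One simple possibility, shown here purely as an assumption, is to compare a system's score on the original test set with its score on the perturbed version. The function name and the example scores are hypothetical.

```python
def robustness_drop(clean_score, perturbed_score):
    """Return (absolute drop, relative drop) between the score on the
    original test set and the score on the perturbed test set.
    This measure is an assumption, not the challenge's official metric."""
    absolute = clean_score - perturbed_score
    relative = absolute / clean_score if clean_score else 0.0
    return absolute, relative


# Made-up scores for illustration only.
absolute, relative = robustness_drop(0.8, 0.6)
print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.0%}")
```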

Submission format and specifications are below (Please check regularly for the latest updates)

Run Submission types and conditions

Teams can submit to the movie-level queries only, the scene-level queries only, or both.
For movie-level queries, teams can choose to submit results to either or both of the following groups:
- Group 1 - Question 1 (fill in the graph space)
- Group 2 - Question 2 (question answering)
For scene-level queries, teams can choose to submit results to either or both of the following groups:
- Group 1 - Questions 1, 2, & 3 (all related to interactions between characters)
- Group 2 - Questions 4 & 5 (matching of scenes and their descriptions in text, and sentiment classification)

Please see sample XML response files for a movie-level run and scene-level run. Please make sure your runs validate against the DTD files for both movie and scene query results. Two DTD files, for movie-level and scene-level results, are available: Movie-level DTD, Scene-level DTD.

Please make sure your movie-level and scene-level XML files include this first line to link to the correct DTD file:

Movie-level:

<!DOCTYPE DeepVideoUnderstandingResults SYSTEM "https://www-nlpir.nist.gov/projects/tv2023/dtds/DeepVideoUnderstandingResults.dtd">

Scene-level:

<!DOCTYPE DeepVideoUnderstandingSceneResults SYSTEM "https://www-nlpir.nist.gov/projects/tv2023/dtds/DeepVideoUnderstandingSceneResults.dtd">
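
Before submitting, the DOCTYPE line of a run file can be checked with a few lines of standard-library Python. This is a minimal sketch: the function name `has_expected_doctype` is hypothetical, and tolerating a leading `<?xml ...?>` declaration is an assumption. It only checks the declaration itself; it does not validate the file against the DTD, which requires a validating XML parser.

```python
import re

# Root element name and DTD URL per run type, copied from the lines above.
EXPECTED = {
    "movie": (
        "DeepVideoUnderstandingResults",
        "https://www-nlpir.nist.gov/projects/tv2023/dtds/DeepVideoUnderstandingResults.dtd",
    ),
    "scene": (
        "DeepVideoUnderstandingSceneResults",
        "https://www-nlpir.nist.gov/projects/tv2023/dtds/DeepVideoUnderstandingSceneResults.dtd",
    ),
}

DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s*>')


def has_expected_doctype(xml_text, level):
    """True if the first meaningful line is the DOCTYPE expected for
    `level` ('movie' or 'scene'). Blank lines and an optional XML
    declaration are skipped (an assumption)."""
    lines = [
        ln.strip()
        for ln in xml_text.splitlines()
        if ln.strip() and not ln.strip().startswith("<?xml")
    ]
    if not lines:
        return False
    match = DOCTYPE_RE.fullmatch(lines[0])
    return bool(match) and (match.group(1), match.group(2)) == EXPECTED[level]
```

A typical use would be to read a run file as text and call `has_expected_doctype(text, "movie")` or `has_expected_doctype(text, "scene")` before running full DTD validation.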

Testing queries are now available (XML query files for movie-level and scene-level queries, and image snapshots of main entities). Please download them from HERE. A readme file is provided. If you plan to join the challenge and have not yet done so, please submit the data agreement form to be able to download the testing dataset.

Sample Queries and Responses (Movie-level)

Fill in the graph space:

Sample Query:

</DeepVideoUnderstandingTopicQuery>

</DeepVideoUnderstandingTopicQuery>

Sample Response: