Call for Participation
Deep video understanding is a difficult task that requires systems to develop a deep analysis and understanding of the relationships between different entities in video, to use known information to reason about other, more hidden information, and to populate a knowledge graph (KG) representation with all acquired information. To work on this task, a system should take into consideration all available modalities (speech, image/video, and in some cases text). The aim of this challenge series is to push the limits of multimodal extraction, fusion, and analysis techniques to address the problem of analyzing long-duration videos holistically and extracting useful knowledge for solving different types of queries. The target knowledge includes both visual and non-visual elements. As video and multimedia data become increasingly popular across different domains and contexts, the research approaches and techniques this Grand Challenge aims to foster will only grow in relevance in the coming years.
Challenge Overview:
Interested participants are invited to apply their approaches and methods to an extended novel Deep Video Understanding (DVU) dataset made available by the challenge organizers. The dataset is split into a development set of 19 movies from the 2020-2022 editions of this challenge (14 with Creative Commons licenses, and 5 licensed from the KinoLorberEdu platform that were part of the test set for the 2022 edition) and a test set of 5 movies licensed from the KinoLorberEdu platform, one of which is reused from the 2022 edition while the other 4 have not been used in this challenge before. The development data includes: the original whole videos, segmented scene shots, image examples of main characters and locations, movie-level KG representations of the relationships between main characters and between characters and key locations, scene-level KG representations of each scene in a movie (location type, characters, interactions between them, order of interactions, scene sentiment, and a short textual summary), and a global shared ontology of locations, relationships (family, social, work), interactions, and sentiments.
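For illustration, a scene-level KG entry could be represented roughly as follows. This is a hypothetical sketch in Python: the field names and values are illustrative assumptions, not the dataset's actual schema, so please consult the dataset documentation and the shared ontology for the real format.

```python
# Hypothetical sketch of a scene-level annotation; field names and values
# are illustrative only -- see the DVU dataset documentation for the
# actual schema and the shared ontology vocabularies.
scene_annotation = {
    "scene_id": "movie_01_scene_007",
    "location_type": "kitchen",        # drawn from the shared location ontology
    "characters": ["Anna", "Ben"],
    "interactions": [                  # ordered as they occur in the scene
        {"subject": "Anna", "predicate": "asks", "object": "Ben"},
        {"subject": "Ben", "predicate": "answers", "object": "Anna"},
    ],
    "sentiment": "positive",           # drawn from the shared sentiment ontology
    "summary": "Anna asks Ben about the missing letter.",
}
```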
The organizers will support evaluation and scoring for two main categories of query types distributed with the dataset: queries at the overall movie level and queries at the individual scene level. Participants may submit results for the movie-level queries, the scene-level queries, or both. Within each category, queries are grouped to allow more flexible submission options (please refer to the dataset webpage for more details):
Example Question types at Overall Movie Level:
Multiple-choice question answering on part of the knowledge graph for selected movies.
Fill in the Graph Space - given a partial movie-level graph, systems will be asked to fill in the missing elements.
Example Question types at Individual Scene Level:
Find the next or previous interaction, given two people, a specific scene, and the interaction between them (see the sketch after this list).
Find a unique scene given a set of interactions and a scene list.
Fill in the Graph Space - given a partial graph for a scene, systems will be asked to fill in the missing elements.
Match selected scenes to a set of scene descriptions written in natural language.
Scene sentiment classification.
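As a concrete illustration of the scene-level query types above, the sketch below shows how a system might answer the "find next or previous interaction" query over an ordered interaction list. This is a hypothetical sketch under an assumed data layout (mirroring the illustrative scene annotation shown earlier), not the challenge's actual query or response format.

```python
from typing import Optional

# Hypothetical sketch: locate a given interaction between two people in a
# scene's ordered interaction list, then return the interaction immediately
# after (or before) it. Data layout is assumed, not the official DTD.
def find_adjacent_interaction(interactions: list[dict],
                              person_a: str, person_b: str,
                              predicate: str,
                              direction: str = "next") -> Optional[dict]:
    for i, it in enumerate(interactions):
        people = {it["subject"], it["object"]}
        if people == {person_a, person_b} and it["predicate"] == predicate:
            j = i + 1 if direction == "next" else i - 1
            return interactions[j] if 0 <= j < len(interactions) else None
    return None  # the anchoring interaction was not found
```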
New for the 2023 challenge is an optional robustness sub-task that systems can choose to submit solutions to. To measure the robustness of the multi-modal systems participating this year, we will provide a secondary version of the testing dataset in which various types of perturbations and corruptions observed in real-world multi-modal data have been introduced. This will allow teams and organizers to measure how much system performance is affected by the introduced noise.
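The exact perturbations are defined by the organizers; purely as an illustration of the kind of corruption involved, the following sketch adds Gaussian pixel noise to a single video frame. The function name and sigma value are assumptions for this example, not part of the challenge specification.

```python
import numpy as np
from PIL import Image

# Illustrative example only: Gaussian noise is one common corruption used
# when testing the robustness of vision models to degraded input frames.
def add_gaussian_noise(frame: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Return a copy of the frame with per-pixel Gaussian noise added."""
    arr = np.asarray(frame).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```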
Submission
Run submission XML files should be emailed directly to George Awad (george.awad@nist.gov). Please indicate ACMMM Grand Challenge in the subject line. Please refer to the Supported Datasets page for the sample XML queries, response files, and required DTD to follow when submitting your results (check regularly for the latest updates on the submission format).
Grand Challenge papers (Top-3 solutions) will go through a single-blind review process (author names and affiliations should be included). Papers should be limited to 4 pages in length, plus up to 2 extra pages for references only. Please check the main ACMMM2023 conference website for further details.
Papers should be submitted directly via the main conference submission site: https://cmt3.research.microsoft.com/ACMMMGrandChallenge2023/Submission/Index
Each submitted Grand Challenge paper should be formatted as a 4-page short paper and will be included in the main conference proceedings.
Important Dates
DVU development data release: Available now for download from HERE
Testing dataset release: Available now after submitting the data agreement form located HERE
Testing queries release: Available from this URL. Please check the readme file provided
Paper submission deadline: July 14, 2023 - submission link is HERE
Submissions of solutions to organizers: July 14, 2023
Results released back to participants: July 24, 2023
Notification to authors: July 24, 2023 (Top-3 teams should prepare their camera ready papers)
Camera-ready submission: July 31, 2023
Grand Challenge at ACM Multimedia: TBD