Holistic video understanding has long been a topic of interest in computer vision. While the field has made tremendous progress in recent years, current state-of-the-art methods are limited to video sequences whose total length is measured in tens of seconds. Considering that narrative videos typically run for minutes to hours, these methods are not yet capable of covering a broad range of use cases. While machines alone cannot yet deal with such long-form content effectively, a collaborative team of humans and machines, supported by suitable software, can.
The aim of the 1st International Workshop on Interactive Video Search and Exploration (IViSE) is to explore ways of overcoming the current limitations of fully automated methods by focusing on human-machine teaming approaches for long-form video understanding, a topic that has not yet been extensively explored by the computer vision community. It will provide a venue to compare fully automated end-to-end approaches for video understanding with approaches in which humans and machines collaborate. The workshop will be centered around a challenge on text-based video retrieval and visual question answering in a large collection of long videos (3 to 30 minutes per video, 1000h combined runtime). The challenge will be split into two tracks: the fully-automated track, where queries are made available to participants beforehand and have to be solved automatically without direct human intervention, and the interactive track, where queries are made available to participants during the workshop and have to be solved interactively by a human-machine team under a strict time limit of five minutes. This challenge format builds upon established challenges such as TRECVID and DVU on the fully-automated side, and the Video Browser Showdown and the Lifelog Search Challenge on the interactive side. Authors can choose to participate in one or both of the tracks.
The challenge will make use of the first shard of the Vimeo Creative Commons Collection (V3C), consisting of 7475 Creative Commons-licensed videos obtained from Vimeo with a combined duration of 1000h. The dataset has already been used for several years by TRECVID and the Video Browser Showdown. In order to download the dataset (which is provided by NIST), please complete this data agreement form and send a scan to angela.ellis@nist.gov with CC to gawad@nist.gov and ks@itec.aau.at. You will be provided with a link for downloading the data.
In addition to the dataset, the results from TRECVID's Ad-hoc Video Retrieval Task and the Archive from previous instances of the Video Browser Showdown can serve as a basis for system development.
There are two types of tasks in the challenge: Known-Item Search (KIS) and Question Answering (QA).
For KIS tasks, a textual description of a segment within a video is given; based on this description, the video has to be retrieved and the relevant segment identified. Such a description might look like the following: "We see a girl in a dark dress pushing the door of a convenience store, after it closes, she runs away. There are two bikes and four trash cans in front of the shop windows. The store's brand colors are green, white and blue." The target segment can be anywhere from 2 to 20 seconds in length and is uniquely identified within the dataset by the provided textual description. An answer to this type of task consists of a video id and a time interval relative to that video. For an answer to be considered correct, it needs to identify the correct video, and the specified interval needs to lie completely inside the target segment (i.e., intervals that are shorter than the target but completely enclosed by it count as correct, while intervals that only partially overlap the target count as incorrect).
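To illustrate the correctness criterion, the following sketch checks a submitted interval against a target segment. The function name, video ids, and timestamps are illustrative and not part of the official evaluation.

```python
def kis_answer_correct(answer_video, answer_start, answer_end,
                       target_video, target_start, target_end):
    """Check a KIS answer: the video must match and the submitted
    interval must lie completely inside the target segment."""
    if answer_video != target_video:
        return False
    # Partial overlaps are incorrect; full containment is required.
    return target_start <= answer_start and answer_end <= target_end


# Example: target segment is [12000 ms, 27000 ms] of video 5432.
print(kis_answer_correct(5432, 13500, 20000, 5432, 12000, 27000))  # True (enclosed)
print(kis_answer_correct(5432, 10000, 20000, 5432, 12000, 27000))  # False (partial overlap)
```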
For QA tasks, a textual description of a video is given together with a question. The description contains sufficient information to uniquely identify the video within the dataset; the question is to be answered based on that video. An example of such a task might look like the following: "A hiker wearing a blue short-sleeved shirt, a purple backpack and sunglasses walks up a steep hill while holding a camera pointing to himself and talking to it. When the hiker reaches the summit, there are two other hikers that carry their backpacks. What are the colors of these backpacks?" The answer to such a task is to be given in textual form, e.g.: "red and green". The correctness of answers is assessed by a human judge.
For the fully-automated track, the queries are available here. For the interactive track, the queries are presented one-by-one during the workshop and have to be solved within five minutes per task.
Participants can choose to partake in one or both tracks of the challenge and focus on one or both types of tasks. Teams are, however, encouraged to take part in both tracks. Answers to the fully-automated track are to be submitted together with the workshop paper describing their generation. Tasks in the interactive track have to be solved live during the workshop on-site. The paper has to be prepared in accordance with the submission instructions of CVPR.
Answers to the fully-automated track have to be submitted as CSV files. Participating teams can submit their top 10 results per task and up to five different sets of results (runs) generated in different ways. The final submission consists of two CSV files per run, with filenames identifying the team name, task type, and run number, e.g., 'exampleteam_kis_3.csv' containing the results of the 3rd run of 'exampleteam' for the KIS tasks. The columns for KIS tasks are task, index, video, start, end, describing the task number, result rank (1 to at most 10), video number, and start and end time in milliseconds, respectively. The columns for the QA tasks are task, index, video, answer, describing the task number, result rank (1 to at most 10), video number, and the textual answer, respectively. For each run, teams should indicate whether they used the query text exactly as provided or manually assisted their system by modifying the query input. Manual interventions on the output of a system are not allowed in this track.
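As a minimal sketch of producing a KIS run file in the format described above: the team name, task numbers, video numbers, and timestamps below are made up, and whether a header row is expected should be checked against the official submission instructions.

```python
import csv

# Illustrative rows for a KIS run: task, index (rank 1-10), video, start (ms), end (ms).
rows = [
    (1, 1, 5432, 12000, 15000),
    (1, 2, 1234, 98000, 103000),
    (2, 1, 7777, 45000, 52000),
]

# Filename pattern: <team>_<tasktype>_<run>.csv, e.g. the 1st KIS run of 'exampleteam'.
with open("exampleteam_kis_1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["task", "index", "video", "start", "end"])  # header shown for clarity
    writer.writerows(rows)
```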
The interactive track uses the Distributed Retrieval Evaluation Server (DRES) to receive and evaluate answers to tasks. DRES provides various facilities for the interactive evaluation of retrieval and similar tasks. It receives submissions via a RESTful API, and examples of how to interact with it are available for several programming languages. As in the Video Browser Showdown, participating teams need to be able to submit their answers to interactive tasks during the workshop. User accounts for interactive participants, together with a test instance of DRES for validating their implementation, will be made available before the workshop.
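As a rough illustration of what an interactive submission client might look like, the sketch below posts an answer over HTTP. The base URL, endpoint paths, payload fields, and credentials are placeholders only; the actual routes and request bodies must be taken from the DRES documentation and the provided client examples.

```python
import requests

# Placeholder instance URL; use the test instance provided by the organizers.
DRES_BASE = "https://example-dres-instance.org"

session = requests.Session()

# Placeholder login call; the actual DRES authentication endpoint and payload
# are defined in the DRES API documentation.
session.post(f"{DRES_BASE}/api/login",
             json={"username": "exampleteam", "password": "secret"})

# Placeholder KIS submission: a video id and a time interval in milliseconds.
# Field names and the endpoint path are illustrative only.
answer = {"video": "05432", "start": 12000, "end": 15000}
response = session.post(f"{DRES_BASE}/api/submit", json=answer)
print(response.status_code, response.text)
```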
For both task types in the fully-automated track, the score for each task is computed as 1 - (r - 1) / n, where r is the rank of the correct answer and n is the maximum number of results per task. We fix n = 10 as outlined above. Tasks with no correct submission are awarded a score of 0. The total score is the sum over all task scores.
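A minimal sketch of this scoring rule (not the official evaluation code):

```python
def automated_task_score(rank, n=10):
    """Score of a single task in the fully-automated track.

    rank: 1-based rank of the correct answer within the run,
          or None if no submitted result is correct.
    """
    if rank is None:
        return 0.0
    return 1.0 - (rank - 1) / n


# A correct answer at rank 1 scores 1.0, at rank 10 it scores 0.1.
print(automated_task_score(1), automated_task_score(10), automated_task_score(None))
```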
In the interactive track, the scoring also incorporates the time required until the first correct submission, analogously to the Video Browser Showdown. The score is computed as (1 - t / (2d)) - (r - 1) / n, where t is the time elapsed between the start of the task and the first correct submission, and d is the total duration of the task (set to 5 minutes). For ease of readability during the interactive event, scores will be displayed multiplied by 1000 and rounded to three significant digits.
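The interactive scoring rule as stated above can be sketched as follows; the official implementation is part of DRES, and the function below is only illustrative.

```python
def interactive_task_score(rank, t_seconds, d_seconds=300, n=10):
    """Score of a single task in the interactive track.

    rank:      1-based rank of the first correct submission.
    t_seconds: time from task start until the first correct submission.
    d_seconds: total task duration (5 minutes = 300 seconds).
    """
    score = (1.0 - t_seconds / (2 * d_seconds)) - (rank - 1) / n
    # For display during the event, scores are multiplied by 1000.
    return 1000 * score


# A correct answer at rank 1 after 60 seconds scores 1000 * (1 - 60/600) = 900.
print(interactive_task_score(1, 60))
```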
3rd of February 2025: Queries for the Fully-automated track released
24th of March 2025: Results for Fully-automated track and Workshop Paper submission deadline (Submit via OpenReview)
3rd of April 2025: Paper reviews and scores of Fully-automated track released
14th of April 2025: Camera Ready deadline for Workshop Papers
12th of June 2025: Workshop and Challenge
The workshop and interactive evaluation will take place on the morning of the 12th of June 2025.
08:00 to 08:10 Organizers' Welcome
08:10 to 08:20 CadenceRAG: Context-Aware and Dependency-Enhanced Retrieval Augmented Generation for Holistic Video Understanding
08:20 to 08:30 Toward Automation in Text-based Video Retrieval with LLM Assistance
08:30 to 08:40 An LLM Framework for Long-form Video Retrieval and Audio-Visual Question Answering Using Qwen2/2.5
08:40 to 08:50 VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos
08:50 to 09:00 Can Relevance Feedback, Conversational Search and Foundation Models Work Together for Interactive Video Search and Exploration?
09:00 to 09:10 A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
09:10 to 09:20 A Unified Framework for Multi-Granularity Models and Temporal Reranking
09:20 to 09:30 AI-based Video Content Understanding for Automatic and Interactive Multimedia Retrieval
09:30 to 09:45 Break (Interactive teams start preparation)
09:45 to 10:15 Keynote by Hazel Doughty: "Towards Detailed Video Understanding"
10:15 to 12:15 Interactive Evaluation
12:15 to 12:30 Results & Closing
IViSE 2025 is organized by
George Awad, NIST, USA
Werner Bailer, Joanneum Research, Austria
Cathal Gurrin, Dublin City University, Ireland
Björn Þór Jónsson, Reykjavík University, Iceland
Jakub Lokoč, Charles University, Czech Republic
Luca Rossetto, Dublin City University, Ireland
Stevan Rudinac, University of Amsterdam, The Netherlands
Klaus Schoeffmann, Klagenfurt University, Austria
In case of questions about the challenge, feel free to contact luca.rossetto@dcu.ie