Guest Track: Cross-Modal Video Retrieval with Reading Comprehension

Audience: This will be published as a sub-page on the CVPR LOVEU 2023 workshop website.
Codalab for the competition: https://codalab.lisn.upsaclay.fr/competitions/13317
Official Baseline: StarVR

For more details, please refer to our Challenge White Paper. For any questions about codalab, please contact weijiawu@zju.edu.cn, and cc loveu.cvpr@gmail.com.

Introduction

The Large Cross-Modal Video Retrieval Dataset with Reading Comprehension (TextVR) benchmark is an OCR related large scale video retrieval dataset for training, evaluating, and analyzing systems for understanding both video and text on the video(OCR). TextVR consists of:

TextVR provides 42.2k sentence queries, which include semantic information from text/OCR tokens (e.g., ‘CARBON ’, ‘020773491 ’) and visual context (e.g., ‘Tourist’, ‘restaurant’) simultaneously.
10.5k videos are collected from YouTube and existing video retrieval datasets (e.g., ActivityNet [1]), and each video on the dataset includes at least 10 text instances.
Videos on TextVR present high resolution 1080p: 72%, 720p: 19%, others: 9%), which enables more accurate text/OCR tokens from the text reading model.

[1] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the ieee conference on computer vision and pattern recognition, 2015, pp. 961–9

Demo

As shown in the Figure, given a text/OCR-related sentence query, the participants are expected to retrieve and return the related videos.

Data Download

10k original training video (100G)
Resized videos and annotation files : caption json, detected OCR results(TransDETR, Google OCR, kuaishou OCR), and resized videos.

Evaluation Protocol

Following previous video retrieval benchmarks[1][2], we adopt the average recall at K(R@K), median rank(MdR), and mean rank(MnR) over all queries as the metric. We consider a prediction correct if the predicted video matches the ground-truth video. Generally, the higher R@K and lower MdR, MnR show better performance.

[1] Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.

[2] Luo, Huaishao, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.

Submission Format

Please submit a npy file: textvr.npy

textvr.npy is a ndarray S of shape 2727 x 2727 (2727 is the size of the test set), where S(x, y) denotes the similarity score between the x-th caption and the y-th video. Here is an example of textvr.npy, its format should look like this:

S(1,1),S(1,2),...,S(1,2727)

S(2,1),S(2,2),...,S(2,2727)

...........................

S(2726,1),....,S(2726,2727)

S(2727,1),....,S[2727,2727)

Baselines

We are introducing the baseline StarVR for a quick start:

The baseline model of TextVR is built upon the foundation of the frozen-in-time video retrieval model. The entire pipeline consists of four primary components: a space-time transformer encoder for visual feature extraction, a space-time and scene text encoder for text/OCR token feature extraction, a fusion encoder for combining features, and a caption encoder for processing query sentences.

Report Format

Use CVPR style (double column) in the form of 3-6 pages or NeurIPS style (single column) in the form of 6-10 pages inclusive of any references. Please explain clearly what data, supervision, pre-trained models you have used so that we can make sure your results are comparable to others.
Please include your github link in the report. The top 2 winners are required to release their codebases and final models so that other people can reproduce in the future. Please contact us if you have any questions.

Report Submission Portal

For report submission, please send an email to loveu.cvpr@gmail.com.

Format of email subject: “YourName-Submission-LOVEU23-Guest Track”;
Attach your technical report and other relevant materials in the email;
Include your Codalab account (registered email) and username for our challenge in the email. Include meta info like team members, institution, etc.

For more details, please refer to our Challenge White Paper.

Timeline

May 1, 2023: The competition data and baseline code becomes available
June 9, 2023 (11:59PM Pacific Time): Deadline for submitting your result
Jun 13, 2023 (11:59PM Pacific Time): report submission due.
June 18, 2023: LOVEU 2023 Workshop. Presentations by winner and runner-up.