Guest Track: Cross-Modal Video Retrieval with Reading Comprehension

For more details, please refer to our Challenge White Paper. For any questions about CodaLab, please contact weijiawu@zju.edu.cn and cc loveu.cvpr@gmail.com.

Introduction

The Large Cross-Modal Video Retrieval Dataset with Reading Comprehension (TextVR) benchmark is an OCR-related, large-scale video retrieval dataset for training, evaluating, and analyzing systems that understand both the video and the text appearing in it (OCR). TextVR consists of 42.2k sentence queries for 10.5k videos across 8 scenario domains: Street View (indoor), Street View (outdoor), Game, Sports, Driving, Activity, TV Show, and Cooking.




Demo

As shown in the figure, given a text/OCR-related sentence query, participants are expected to retrieve and return the matching videos.




Data Download 


Evaluation Protocol 

Following previous video retrieval benchmarks [1, 2], we adopt average recall at K (R@K), median rank (MdR), and mean rank (MnR) over all queries as the metrics. A prediction is considered correct if the predicted video matches the ground-truth video. Higher R@K and lower MdR and MnR indicate better performance.
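For reference, here is a minimal Python sketch (not the official evaluation code) of how these metrics can be computed from a caption-by-video similarity matrix, assuming the i-th caption's ground-truth is the i-th video:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, MdR, and MnR from a caption-by-video similarity matrix.

    Assumes the ground-truth video for the i-th caption is the i-th video,
    i.e., the diagonal of `sim` holds the correct pairs.
    """
    order = np.argsort(-sim, axis=1)            # videos sorted by descending score
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth index per caption
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of the correct video

    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Example with random scores on a 2727 x 2727 matrix:
sim = np.random.rand(2727, 2727)
print(retrieval_metrics(sim))
```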


[1] Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "MSR-VTT: A large video description dataset for bridging video and language." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288-5296. 2016.

[2] Luo, Huaishao, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.

Submission Format

Please submit an .npy file: textvr.npy

textvr.npy stores an ndarray S of shape 2727 x 2727 (2727 is the size of the test set), where S(x, y) denotes the similarity score between the x-th caption and the y-th video. Its contents should look like this:

S(1,1), S(1,2), ..., S(1,2727)
S(2,1), S(2,2), ..., S(2,2727)
...
S(2726,1), ..., S(2726,2727)
S(2727,1), ..., S(2727,2727)
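A minimal sketch of producing such a file with NumPy; the random scores here are placeholders for your model's actual caption-video similarities:

```python
import numpy as np

N = 2727  # size of the test set

# Placeholder: replace with your model's caption-by-video similarity scores.
S = np.random.rand(N, N).astype(np.float32)

assert S.shape == (N, N)
np.save("textvr.npy", S)  # writes textvr.npy for submission
```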


Baselines


We provide the baseline StarVR for a quick start.

The TextVR baseline is built upon the Frozen-in-Time video retrieval model. The entire pipeline consists of four primary components: a space-time transformer encoder for visual feature extraction, a scene-text encoder for text/OCR token feature extraction, a fusion encoder for combining the two feature streams, and a caption encoder for processing query sentences.
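The sketch below illustrates how these four components could fit together; the module choices, dimensions, and pooling are illustrative assumptions, not the official StarVR implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(dim=256, layers=2):
    # Generic transformer encoder used as a stand-in for each component.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class BaselineSketch(nn.Module):
    """Illustrative four-component pipeline (placeholder, not StarVR itself)."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_encoder = make_encoder(dim)    # space-time features from frame patches
        self.ocr_encoder = make_encoder(dim)      # features from scene-text (OCR) tokens
        self.fusion_encoder = make_encoder(dim)   # fuses the visual and OCR streams
        self.caption_encoder = make_encoder(dim)  # encodes the query sentence

    def forward(self, video_tokens, ocr_tokens, caption_tokens):
        v = self.video_encoder(video_tokens)                                # (B, Tv, D)
        t = self.ocr_encoder(ocr_tokens)                                    # (B, To, D)
        fused = self.fusion_encoder(torch.cat([v, t], dim=1)).mean(dim=1)   # (B, D)
        query = self.caption_encoder(caption_tokens).mean(dim=1)            # (B, D)
        # Caption-by-video cosine-similarity matrix, as required for submission.
        return F.normalize(query, dim=-1) @ F.normalize(fused, dim=-1).T
```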

Report Format


Report Submission Portal


For report submission, please send an email to loveu.cvpr@gmail.com.

For more details, please refer to our Challenge White Paper.


Timeline