Movie Annotation and Retrieval

Introduction

Natural language-based video and image search has been a long-standing research topic in the information retrieval, multimedia, and computer vision communities. Several existing online platforms (e.g., YouTube) rely on massive human curation efforts, manually assigned tags, click counts, and surrounding text to match largely unstructured search phrases and retrieve a ranked list of relevant videos from a stored library. However, as the amount of unlabeled video content grows with the advent of inexpensive mobile recording devices (e.g., smartphones), the focus is rapidly shifting to automated understanding, tagging, and search. In this challenge, we would like to explore a variety of joint language-visual learning models for the video annotation and retrieval task.

Challenge description

The majority of the LSMDC captions describe human activities. The main goal of this challenge is to evaluate how well different visual-language models annotate and search videos based on natural-language sentences covering a variety of human activities. The challenge has two main tracks, described below; participants may enter either one or both:

  • Multiple-Choice Test: Given a video query and 5 captions, find the correct caption for the video among the 5 possible choices. We have labeled each caption in the dataset with one or more activity-phrase labels. The correct sentence is the ground-truth caption, and the four other sentences are distractors picked at random from the remaining captions, under the condition that they carry a different activity-phrase label than the correct answer. For the multiple-choice test, we compute accuracy over the whole public-test multiple-choice data, comprising 10,053 questions, provided on the download page. The accuracy is the percentage of the 10,053 questions answered correctly.
  • Movie Retrieval: We compute Recall@1, Recall@5, Recall@10, and Median Rank for video retrieval (given a caption, rank the videos) on 1,000 video/sentence pairs drawn from the original captions across a variety of human activity categories in the LSMDC2016 public-test data provided on the download page. Recall@k is the percentage of ground-truth videos that appear among the top k retrieved videos, and Median Rank (MedR) is the median rank of the ground-truth videos. A minimal sketch of both evaluation metrics is given after this list.
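
For reference, the sketch below shows one way the two tracks' metrics could be computed from model similarity scores. It is a minimal illustration only, assuming the model produces a caption-video similarity matrix with ground-truth pairs on the diagonal; the function and variable names are ours and are not part of any official evaluation toolkit.

    import numpy as np

    def multiple_choice_accuracy(scores, correct_idx):
        # scores:      (num_questions, 5) similarity of each query video to its 5 candidate captions
        # correct_idx: (num_questions,)   index (0-4) of the ground-truth caption for each question
        predictions = scores.argmax(axis=1)
        return 100.0 * (predictions == correct_idx).mean()

    def retrieval_metrics(sim, ks=(1, 5, 10)):
        # sim: (num_captions, num_videos) similarity matrix for caption-to-video retrieval,
        #      with the ground-truth video for caption i assumed to be video i (the diagonal).
        order = np.argsort(-sim, axis=1)             # videos sorted from best to worst per caption
        gt = np.arange(sim.shape[0])[:, None]
        ranks = np.argmax(order == gt, axis=1) + 1   # 1-based rank of the ground-truth video
        recalls = {k: 100.0 * (ranks <= k).mean() for k in ks}
        med_rank = float(np.median(ranks))
        return recalls, med_rank

Under this convention, Recall@k values are percentages and MedR is an absolute rank; higher Recall@k and lower MedR are better.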

Data

Our movie dataset has only one description per video. In addition, for a subset of the training data and the whole public-test data, we provide new simplified descriptions based on paraphrases (i.e., summaries of the main aspect of what is described in the original long description), which could potentially be used as additional training data.

  • Original Audio Description (AD) Sentences: the original data provided for movie description.
  • Paraphrased AD Sentences: paraphrases of the long captions (more than ~15 words) in our LSMDC2016 training data. Most "long" descriptions are very detailed and complex; each paraphrase summarizes the main aspect of the original long description in 3-10 words. Submission instructions will be announced later.

Download

Data can be downloaded here.

Submission server

  • For movie multiple-choice test track you can submit here.
  • For movie retrieval track you can submit here.

If you have any questions about the movie retrieval and movie multiple-choice challenges, please contact torabi.atousa@gmail.com.

Citations

If you intend to publish results that use the data and resources provided by this challenge, please include the following references:

Movie annotation and retrieval paper:

@article{lsmdc2016MovieAnnotationRetrieval,
  author  = {Torabi, Atousa and Tandon, Niket and Sigal, Leonid},
  title   = {Learning Language-Visual Embedding for Movie Understanding with Natural-Language},
  journal = {arXiv preprint},
  year    = {2016},
  url     = {http://arxiv.org/pdf/1609.08124v1.pdf}
}

Movie dataset paper:

@article{lsmdc,
  author  = {Rohrbach, Anna and Torabi, Atousa and Rohrbach, Marcus and Tandon, Niket and Pal, Chris and Larochelle, Hugo and Courville, Aaron and Schiele, Bernt},
  title   = {Movie Description},
  journal = {International Journal of Computer Vision},
  year    = {2017},
  url     = {http://link.springer.com/article/10.1007/s11263-016-0987-1?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst}
}