Linguistics Meets Image and Video Retrieval

Seoul, Korea, at ICCV'19

October 28th, 2019

For researchers, practitioners, and other interested audiences from both academia and industry to share their opinions and experience in the emerging space of visual content retrieval with natural language interfaces.


Image and video retrieval has been one of the most widely studied areas in computer vision for decades. With the tremendous growth of searchable visual media in recent years, the need for effective retrieval systems has intensified, and such systems now serve many application domains, such as e-commerce, surveillance, and Internet search. Over the past few years, the advent of deep learning has propelled research on visual content retrieval, and the field has been evolving at a fast pace. Alongside progress on core topics in image retrieval such as efficient search, ranking algorithms, and recommender systems, there has been a burgeoning trend toward exploiting natural language understanding in the context of visual media retrieval.

Initial work at the intersection of visual content retrieval and natural language understanding has explored topics such as interactive search using natural language feedback, image and video retrieval based on natural language queries, and task-oriented visual dialog agents for image retrieval. These recent works are opening up new paths forward, centering on open issues such as a) how can comprehension and communication of language enhance visual search? and b) how can information retrieval (IR) tools, algorithms, and infrastructure assist multimodal knowledge acquisition, interaction, and interpretability? Such problems range from the purely technical to the cultural and personal. For example, images and language may help to guide and influence users' preferences and sentiments in online retrieval systems, and retrieval systems can leverage both structured domain knowledge and unstructured multimodal context from past interactions and behavioral analysis of the user.

The goal of the workshop is to bring together emerging research in the areas of information retrieval, computer vision, and natural language understanding to discuss open challenges and opportunities and to study the synergies in this interdisciplinary area. Linking these three areas is especially relevant today, as most real-world applications in different domains (retail, travel, healthcare, education, etc.) require interaction through multiple modalities: searching or retrieving multimodal information and interacting through multimodal responses. To provide rich opportunities to share opinions and experience in such an emerging space, we will accept paper submissions on established and novel ideas, host invited talks, and run a competition on image retrieval based on natural language interaction.

Fashion IQ Challenge

We introduce a novel dataset and challenge for interactive image retrieval based on natural language feedback. The dataset consists of pairs of fashion images (a search target and a reference image) and human-written sentences describing the visual differences between the target image and the reference image. It comprises four subsets, each featuring a specific fashion category (shoes, men’s shirts, women’s tops, and dresses). The challenge built on this dataset involves retrieving images conditioned on a reference image and the associated natural language annotation. Please visit our challenge page for more information.
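
As a rough illustration of the task setup only, the sketch below ranks candidate images given a reference image and a feedback sentence. It assumes that images and sentences have already been mapped to embedding vectors by some encoder, and it combines them with a simple element-wise sum. Both the toy vectors and the composition rule are hypothetical choices made for this example; the challenge does not prescribe any particular method, and all function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compose_query(reference_vec, feedback_vec):
    """Hypothetical composition: element-wise sum of image and text embeddings."""
    return [r + f for r, f in zip(reference_vec, feedback_vec)]


def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by decreasing similarity to the query."""
    return sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )


# Made-up 3-dimensional embeddings standing in for real encoder outputs.
candidates = {
    "dress_a": [1.0, 0.0, 0.0],
    "dress_b": [0.7, 0.7, 0.0],
    "dress_c": [0.0, 0.0, 1.0],
}
reference = candidates["dress_a"]   # the reference image
feedback = [0.0, 1.0, 0.0]          # stands in for an encoded feedback sentence

query = compose_query(reference, feedback)
ranked = rank_candidates(query, candidates)
print(ranked)
```

In this toy example the feedback vector shifts the query away from the reference image and toward `dress_b`, which then ranks first. A real system would learn both the encoders and the composition function from the annotated image pairs.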


The Codalab server is back online. We encourage participants to resubmit their results.

Important Dates

  • June 1st: training and validation data release
  • June 15th: evaluation server opens
  • Sep 30th: evaluation server closes
  • Oct 5th: deadline for submitting technical reports
  • Oct 28th: workshop date and winners' announcement

Paper Submission

We solicit paper submissions on novel approaches to visual media retrieval that leverage contextual language understanding, deep semantic image understanding with external knowledge, and multimodal user interactions. We also encourage analytical work that provides insights into negative results, as well as papers reproducing previously published work in this area. In addition, we look forward to submissions that publish new resources, knowledge bases, and datasets that facilitate conversational search and retrieval in a domain-specific setting or bring out the challenges of an open-domain setting.

We call for extended abstract submissions of at most two pages. We encourage submission of previously published work and of work in progress on topics relevant to the workshop. Accepted abstracts will be linked from the workshop webpage and will not appear in the ICCV workshop proceedings. Accepted papers will be presented at the poster session, and top papers will receive awards. Manuscripts should follow the final version of the ICCV 2019 paper template and should be submitted through the CMT system.

Paper submission Link:

Important Dates

  • Paper submission deadline: Sep 15th (11:59PM PST)
  • Notification to authors: Oct 1st

Note to Authors

  • Authors who want to submit work accepted at this workshop to a different journal or conference should check that venue's double-submission rules thoroughly.
  • Authors who need to apply for a visa to attend the conference and would prefer a faster paper review process should send us an email with this request.

Topics of Interest

We solicit submissions of papers focused on:

New algorithmic approaches: Novel algorithms to facilitate multimodal search by improving ranking & retrieval performance, through a combination of neural & symbolic reasoning, external knowledge, user modeling etc.

Resources and Datasets: Publish new datasets, knowledge repositories, annotation & evaluation tools to facilitate multimodal search & retrieval

Evaluation methods: New metrics or methods for more relevant, insightful, unbiased evaluation of multimodal retrieval systems, meaningful explainability or interpretability methods

Domain specific applications: Experience facing domain specific challenges or problems in real world settings, interesting insights or solutions, and non-traditional cross-domain multimodal search and retrieval (other than vision+text)

Suggested topics for submission include (but are not restricted to):

  • Dialog and Question-Answering Systems for cross-modal/multimodal search and retrieval
  • Neural-Symbolic reasoning for strategizing searching and ranking
  • Image and video retrieval using external knowledge graphs
  • Language acquisition and grounding by learning to search/rank/retrieve

  • Mining and modeling user preference and personalizing interactive visual search
  • Domain knowledge mining and semantics understanding for multimedia search and ranking
  • Information extraction and content analysis from rich content sources, including images, videos, news, advertisement, digital libraries, etc.

  • Novel user interfaces which integrate classical (clicks, keywords) and new (attention, natural language, spoken dialogs) interfaces for more effective retrieval performance
  • Novel applications involving multimodal retrieval in specific domains (such as education, health care) or modalities (such as sensors, IoT, wearable devices, time series)

Invited Speakers

Kate Saenko

Associate Professor

Boston University

Vicente Ordonez

Assistant Professor

University of Virginia

Jeffrey Siskind

Associate Professor

Purdue University



Amrita Saha

IBM Research AI

Hui Wu

IBM Research AI

Adriana I. Kovashka

University of Pittsburgh

Andrei Barbu


Xiaoxiao Guo

IBM Research AI

Samarth Bharadwaj

IBM Research AI

Yupeng Gao

IBM Research AI

Steering Committee

Rogerio Feris

IBM Research AI

Soumen Chakrabarti

Indian Institute of Technology Bombay


  • For general questions regarding the workshop, please contact Amrita Saha and Hui Wu.