Interactive image retrieval systems have been studied intensively in computer vision, with application domains ranging from Internet search and e-commerce to surveillance. Existing systems have demonstrated the merit of incorporating fixed-form user feedback, such as indications of image relevance or visual attribute constraints. However, capturing a user's search intent and bridging the gap between visual representations and high-level semantic concepts remain open research questions.

Recently, there has been a surge of interest in combining visual and semantic information for interactive image search [2,3,4]. In particular, several recent works have demonstrated the potential of incorporating natural language input into image search [2,3]. To push research in this area forward, and to establish a fair measure of progress, we present the Fashion IQ dataset [1] and challenge.

The challenge is an image retrieval task in which the input query is specified as a candidate image together with two natural language expressions that describe the visual differences of the search target. For example, we may present an image of a white dress and ask the system to find images in the same style and color, but modified in small ways, such as adding a belt at the waist. This retrieval setup with natural language user input offers a natural and effective interface for image search, and serves as an important step towards developing full-blown conversational image retrieval systems [5].


  • Feb 22nd 2020: We are announcing the Fashion IQ challenge at CVPR 2020. Please join us!
  • October 14th 2019: The results for the Fashion IQ competition are online.
  • August 2nd 2019: We offer a cash prize of $2.5k for the challenge winner!
  • July 20th 2019: Baseline code is available.
  • July 15th 2019: The Codalab server is back online. We encourage participants to resubmit their results.

Final Results

Leaderboard (top-5)

Link to the CodaLab page:

Technical reports

[1] 🥇 Multimodal Ensemble of Diverse Models for Image Retrieval Using Natural Language Feedback. Changsheng Zhao*, Vasili Ramanishka*, Tong Yu, Yilin Shen, Siyang Yuan, Hongxia Jin. [pdf]

[2] 🥈 CurlingNet: Compositional Learning between Images and Text, Youngjae Yu, Seunghwan Lee, Yuncheol Choi, Gunhee Kim. [pdf]

[3] 🥉 Designovel's System Description for the Fashion-IQ Challenge 2019. Jianri Li, Jae-whan Lee, Woo-sang Song, Ki-young Shin, Byung-hyun Go. arXiv, 2019. [pdf]

[4] Transforming image representations via attribute operators. Jack Culpepper*, Eric Dodds*, Simao Herdade*. [pdf]

[5] ResEFNet: Technical Report for the Fashion-IQ Interactive Image Retrieval Challenge. Zheyuan Liu, Cristian Rodriguez-Opazo, Stephen Gould. [pdf]

Submission and Phases

Please submit your results via our Codalab submission site. Please register your team through the website and read the instructions carefully.

  • Dev-Challenge: Unlimited number of submissions.
  • Test-Challenge: Limited to 30 submissions in total.



We offer a total of $4k USD in prizes to the challenge winners, sponsored by IBM Research.


Please view the dataset here.

  • Training queries and targets: both the ground-truth target images and the reference images are identified by unique image IDs. Each query consists of one reference image and two human-written captions describing the visual differences between the target and the reference image. The data is provided in JSON format. The dataset consists of three subsets, each covering a specific fashion category (women's tops, women's dresses, men's shirts).
  • Validation queries and targets: same format as the training data; associated ground truth target IDs are available to participants.
  • Testing queries: testing queries have the same format as the training and validation queries, but the associated ground-truth target image ID for each query is not available to participants.

Please refer to the paper for further details on the dataset (statistics, annotation procedure). A snapshot of the data format is shown below.
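Since the original snapshot is not reproduced here, the following is an illustrative sketch of what a query record looks like based on the description above. The field names (`candidate`, `target`, `captions`) and the image IDs are assumptions for illustration, not the official Fashion IQ schema:

```python
import json

# Illustrative sketch only: field names and IDs below are hypothetical,
# chosen to match the textual description of the dataset.
record = {
    "candidate": "img_0001",   # reference image ID
    "target": "img_0002",      # ground-truth target image ID
    "captions": [              # two human-written relative captions
        "is darker and has longer sleeves",
        "has a belt at the waist",
    ],
}

# A category's query file would then be a JSON list of such records.
serialized = json.dumps([record], indent=2)
parsed = json.loads(serialized)
print(parsed[0]["captions"][1])
```

For the test split, the `target` field would be withheld; otherwise validation and test queries follow the same shape.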


Retrieval systems are evaluated by recall metrics on the test splits of the dataset. For each of the three fashion categories (dresses, tops, shirts), Recall@10 and Recall@50 are computed over all test queries. The overall performance is the average of Recall@10 and Recall@50 across the three fashion categories.
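The scoring protocol above can be sketched as follows. The ranked lists, image IDs, and cut-offs here are toy stand-ins (k=2 and k=3 in place of 10 and 50, so the tiny lists register hits and misses); only the averaging structure mirrors the challenge metric:

```python
# Minimal sketch of the evaluation protocol: per-query hit/miss at a
# cut-off k, averaged per category, then averaged over categories and k.
def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the ground-truth target is among the top-k results."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def mean_recall(queries, k):
    """Average Recall@k over a list of (ranked_ids, target_id) queries."""
    return sum(recall_at_k(r, t, k) for r, t in queries) / len(queries)

# Toy per-category results: each query is a ranked list plus its target.
categories = {
    "dresses": [(["a", "b", "c"], "a"), (["d", "e", "f"], "f")],
    "tops":    [(["g", "h", "i"], "h")],
    "shirts":  [(["j", "k", "l"], "m")],   # target never retrieved
}

# Overall score: average the two recall levels across all categories.
scores = [mean_recall(q, k) for q in categories.values() for k in (2, 3)]
overall = sum(scores) / len(scores)
print(round(overall, 4))
```

With the real data, `k` would be 10 and 50, and each category's queries would be its full test split.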


[1] The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback. Xiaoxiao Guo*, Hui Wu*, Yupeng Gao, Steve Rennie and Rogerio Feris. arXiv, 2019. [Pdf] [Project Page]

[2] Dialog-based interactive image retrieval. Xiaoxiao Guo*, Hui Wu*, Yu Cheng, Steve Rennie, Gerald Tesauro and Rogerio Feris. NeurIPS 2018. [Pdf] [Project Page]

[3] Composing Text and Image for Image Retrieval - An Empirical Odyssey. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays. CVPR 2019.

[4] Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. Bo Zhao et al. CVPR 2017.

[5] Towards Building Large Scale Multimodal Domain-Aware Conversation Systems. Amrita Saha, Mitesh M. Khapra, and Karthik Sankaranarayanan. AAAI 2018.


For general inquiries regarding the workshop, please contact Amrita Saha. For FAQs regarding the Fashion IQ challenge, please visit the Codalab Forum.