Fashion IQ Challenge
Interactive image retrieval systems have been intensively studied in computer vision, with wide application domains such as Internet search, e-commerce, and surveillance. Existing systems have demonstrated the merit of incorporating fixed-form user feedback, such as indicating image relevance or visual attribute constraints. However, capturing the user's search intent and bridging the gap between visual representations and high-level semantic concepts remains an open research question.
Recently, there has been a surge of interest in combining visual and semantic information for interactive image search [2,3,4]. In particular, several recent works have demonstrated the potential of incorporating natural language input in image search [2,3]. To push forward research in this area, and to establish a fair measure of progress, we present the Fashion IQ dataset and challenge.
The challenge is an image retrieval task, where the input query is specified in the form of a candidate image and two natural language expressions that describe the visual differences of the search target. For example, we may present an image of a white dress and ask the system to find images in the same style and color, but modified in small ways, such as adding a belt at the waist. This retrieval setup, driven by natural language user input, offers a natural and effective interface for image retrieval, serving as an important step towards developing full-blown conversational image retrieval systems.
We accept challenge submissions via Codalab.
Submission and Phases
Please submit your results via our Codalab submission site. Please register your team through the website and read the instructions carefully.
- Dev-Challenge: Unlimited number of submissions.
- Test-Challenge: Limited to 30 submissions in total.
We are offering awards to top contenders on the leaderboard!
Please view the dataset here.
- Training queries and targets: both the ground truth target images and the reference images are identified by their unique image ID. Each query consists of one reference image and two human-written captions describing the visual differences between the target and the reference image. Data is available in JSON format. The dataset consists of three subsets, each featuring a specific fashion category (women's tops, women's dresses, men's shirts).
- Validation queries and targets: same format as the training data; ground truth target images are available to participants.
- Testing queries: testing queries have the same format as the training and validation queries; ground truth target images are not available to participants.
Please refer to the project page for more details about the dataset (statistics, annotation procedure, baselines). A snapshot of the data format is shown below.
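As an illustration only, the query files can be read as a JSON list in which each entry pairs a reference (candidate) image ID with a target image ID and the two relative captions. The field names and image IDs below are assumptions for the sketch, not taken verbatim from the released files; consult the project page for the authoritative format.

```python
import json

# Hypothetical snapshot of one training entry; field names ("candidate",
# "target", "captions") and the IDs are illustrative assumptions.
sample = """
[
  {
    "candidate": "B00BPD6NWC",
    "target": "B008BHCZ4S",
    "captions": [
      "is darker and has no belt",
      "is solid black with long sleeves"
    ]
  }
]
"""

queries = json.loads(sample)
for q in queries:
    # Each query: reference image -> target image, joined relative captions.
    print(q["candidate"], "->", q["target"], "|", "; ".join(q["captions"]))
```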
Retrieval systems are evaluated by recall metrics on the test splits of the dataset. For each of the three fashion categories (dresses, tops, shirts), Recall@10 and Recall@50 are computed over all test queries. The overall performance is the average of Recall@10 and Recall@50 over the three fashion categories.
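The evaluation protocol above can be sketched as follows. This is a minimal reference implementation of Recall@K and the category-averaged overall score; the rankings and image IDs are toy data, not real challenge submissions.

```python
# Recall@K: fraction of queries whose ground-truth target appears in the
# top-k ranked results. The overall score averages Recall@10 and Recall@50
# across all fashion categories.

def recall_at_k(rankings, targets, k):
    """rankings: list of ranked image-ID lists, one per query.
    targets: the ground-truth target image ID for each query."""
    hits = sum(1 for ranked, tgt in zip(rankings, targets) if tgt in ranked[:k])
    return hits / len(targets)

def overall_score(per_category):
    """per_category maps category name -> (rankings, targets)."""
    scores = []
    for rankings, targets in per_category.values():
        scores.append(recall_at_k(rankings, targets, 10))
        scores.append(recall_at_k(rankings, targets, 50))
    return sum(scores) / len(scores)

# Toy example: two queries in one category, ranked image IDs as strings.
rankings = [["B001", "B002", "B003"], ["B009", "B004", "B007"]]
targets = ["B002", "B042"]  # the second target is never retrieved
print(recall_at_k(rankings, targets, 10))  # 0.5
```

In practice each category's rankings would come from scoring the full image pool against the query and sorting by similarity; only the top-50 IDs per query are needed to compute both metrics.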
The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback. Xiaoxiao Guo*, Hui Wu*, Yupeng Gao, Steve Rennie, Rogerio Feris. arXiv, 2019. [Pdf] [Project Page]
Composing Text and Image for Image Retrieval - An Empirical Odyssey. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays. CVPR 2019. [pdf]
Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. Bo Zhao et al. CVPR 2017.
Towards Building Large Scale Multimodal Domain-Aware Conversation Systems. Amrita Saha, Mitesh M. Khapra, Karthik Sankaranarayanan. AAAI 2018.