Fashion IQ 2020

Introduction

Interactive image retrieval systems have been intensively studied in computer vision, with wide application domains such as Internet search, e-commerce, and surveillance. Existing systems have demonstrated the merit of incorporating fixed-form user feedback, such as indicating image relevance or visual attribute constraints. However, capturing the user's search intent and bridging the gap between visual representations and high-level semantic concepts remain open research questions. This challenge has propelled recent research interest in combining visual and semantic information for interactive image search [2,3,4]. In particular, several recent works have demonstrated the potential of incorporating natural language input in image search [2,3,4]. Despite the rising interest in this area, we observe a lack of natural, realistic datasets in this domain. We therefore present the Fashion IQ dataset [1] and build on our 2019 challenge to push research in this area further and to set up a fair progress measure.

The format of the challenge is similar to the previous edition. Specifically, the challenge is an image retrieval task in which the query is specified as a candidate image plus two natural language expressions that describe the visual differences of the search target. For example, we may present an image of a white dress and ask the system to find images that are in the same style and color but modified in small ways, such as adding a belt at the waist. This retrieval setup with natural language user input offers a natural and effective interface for image search, and serves as an important step towards developing full-blown conversational image retrieval systems [5].

Dataset

Please view the dataset here. Please refer to the paper for more details about the dataset (statistics, annotation procedure).

  • Training queries and targets: both the ground truth target images and the reference images are identified by their unique image IDs. Each query consists of one reference image and two human-written captions describing the visual differences between the target and the reference image. The data is available in JSON format; a sketch of the data format is shown below the list. The dataset consists of three subsets, each featuring a specific fashion category (women's tops, women's dresses, men's shirts).
  • Validation queries and targets: same format as the training data; associated ground truth target IDs are available to participants.
  • Testing queries: testing queries have the same format as the training and validation queries. The associated ground truth target image IDs are not available to participants.
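To make the query format concrete, the minimal Python sketch below loads one split file and reads out the fields of a single query. The file path and the field names ("candidate", "target", "captions") are illustrative assumptions based on the released JSON files; please consult the dataset repository for the authoritative schema.

    import json

    # Illustrative path; adjust to match the released files for your split/category.
    with open("captions/cap.dress.train.json") as f:
        queries = json.load(f)  # list of query records

    example = queries[0]
    reference_id = example["candidate"]  # image ID of the reference image (assumed field name)
    target_id = example["target"]        # image ID of the ground truth target (assumed field name)
    captions = example["captions"]       # two relative captions describing the differences

    print(reference_id, target_id, captions)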

Evaluation and Rules

The retrieval systems are evaluated by recall metrics on the test splits of the dataset. For each of the three fashion categories (dresses, tops, shirts), Recall@10 and Recall@50 are computed over all test queries. The overall performance is the average of Recall@10 and Recall@50 across the three fashion categories.
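For local validation, the sketch below shows one way to compute these metrics from ranked retrieval lists. It is not the official evaluation script, and the function and variable names are illustrative.

    import numpy as np

    def recall_at_k(ranked_ids, target_id, k):
        # 1.0 if the ground truth target appears among the top-k retrieved IDs, else 0.0
        return float(target_id in ranked_ids[:k])

    def challenge_score(results_per_category):
        # results_per_category: dict mapping category name ("dress", "toptee", "shirt")
        # to a list of (ranked_ids, target_id) pairs, one per test query.
        per_category = []
        for category, results in results_per_category.items():
            r10 = np.mean([recall_at_k(r, t, 10) for r, t in results])
            r50 = np.mean([recall_at_k(r, t, 50) for r, t in results])
            per_category.append((r10 + r50) / 2.0)
        # Overall score: average of (Recall@10 + Recall@50)/2 over the three categories.
        return float(np.mean(per_category))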

Unlike last year, side information (attribute tags, Amazon product metadata, etc.) cannot be used during the testing phase. Side information is considered privileged information, available only during training. Therefore, only images and relative captions are allowed at test time.

Submission

The Codalab page is now open. Please register your team through the website and carefully read the instructions.

  • Dev-Challenge: Unlimited number of submissions.
  • Test-Challenge: Limited to 30 submissions in total.

Dates

  • Feb 14th: evaluation server opens for validation set
  • April 1st: evaluation server opens for testing set
  • May 27th: evaluation server closes
  • June 3rd: deadline for submitting technical reports
  • June 14th -19th: winners' announcement

Final Results

leaderboard

Technical report

[1] Cycled Compositional Learning between Images and Text, Jongseok Kim*, Youngjae Yu*, Seunghwan Lee, Gunhee Kim [pdf] [code]

[2] Fashion-IQ 2020 Challenge 2nd Place Team’s Solution, Minchul Shin, Yoonjae Cho, Seongwuk Hong [pdf] [code]

[3] RUC-AIM3: Improved TIRG Model for Fashion-IQ Challenge 2020, Yida Zhao, Shizhe Chen, Zhihao Zhang, Qin Jin. [pdf] [code]

References

[1] Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback. Xiaoxiao Guo*, Hui Wu*, Yupeng Gao, Steve Rennie and Rogerio Feris. arXiv, 2019.

[2] Dialog-based interactive image retrieval. Xiaoxiao Guo*, Hui Wu*, Yu Cheng, Steve Rennie, Gerald Tesauro and Rogerio Feris. NeurIPS 2018.

[3] Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries. Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez. NeurIPS 2019.

[4] Composing Text and Image for Image Retrieval - An Empirical Odyssey. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays. CVPR 2019.

[5] Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. Bo Zhao et al. CVPR 2017.

[6] Towards Building Large Scale Multimodal Domain-Aware Conversation Systems. Amrita Saha, Mitesh M. Khapra, Karthik Sankaranarayanan. AAAI 2018.

Contact

For FAQs regarding the Fashion IQ challenge, please contact Yupeng Gao or Xiaoxiao Guo.