Challenge 2: Dishcovery: Mission II
Background
The goal of this challenge is to develop a vision-language model that accurately understands food images and matches them to the correct textual descriptions. This is a demanding test of fine-grained visual recognition and multimodal alignment in one of the most diverse visual domains. Top solutions will capture the subtle cues that distinguish similar dishes, ingredients, and preparation styles, enabling applications such as dietary tracking and restaurant automation.
To support participants, we provide a training set of 400,000 food image-caption pairs collected with the Precision at Scale framework. The dataset includes synthetic and noisy real captions, and some images may also be synthetic. Each sample comprises a caption and a link to download the corresponding image. We strongly encourage participants to download the images as early as possible and to report any issues promptly so they can be addressed. Since some images originate from LAION, a small portion may be removed during the challenge.
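Since each sample links to a remotely hosted image, it is worth scripting the download so that it can be resumed and so that broken links (e.g. removed LAION images) are caught early. Below is a minimal, hedged sketch using only the standard library; the `(sample_id, url)` row layout and the `.jpg` extension are illustrative assumptions, not the dataset's actual schema.

```python
import os
import urllib.request

def download_images(rows, out_dir):
    """Download each (sample_id, url) pair to out_dir/<sample_id>.jpg.

    Returns the ids that failed, so broken links can be reported to
    the organizers early. Field names and the .jpg extension are
    illustrative; adapt them to the dataset's actual schema.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for sample_id, url in rows:
        dest = os.path.join(out_dir, f"{sample_id}.jpg")
        if os.path.exists(dest):
            continue  # already fetched: allows resuming a partial run
        try:
            urllib.request.urlretrieve(url, dest)
        except OSError:  # URLError subclasses OSError
            failed.append(sample_id)
    return failed
```

In practice you would also add a timeout, retries, and parallelism, but the resume-and-report pattern is the part that matters given that some images may disappear mid-challenge.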
Participants may use external data sources and pre-trained models as long as they comply with the license terms of this challenge and of any third-party data they use. Pay-per-use models, which are typically proprietary, are strictly forbidden. The focus is on achieving the most accurate and detailed alignment between food images and text, while leaving participants free to explore modeling and augmentation strategies.
Description
The challenge is divided into two phases.
Phase One, which is common to all participants, involves two public test sets.
For the first test set, participants are asked to link every image to N captions, where N can range from 0 to the total number of captions in the test set. This linkage should be provided in a sparse matrix index format.
For the second test set, the format is exactly the same; however, each image has only a single correct caption.
The submission file must consist of a single CSV file with the prediction for the Test 1 and Test 2 tasks concatenated (Test 1 followed by Test 2).
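As a concrete illustration of the submission layout, here is a minimal sketch that writes (image, caption) index pairs for both test sets into one concatenated CSV. The column names `image_index` and `caption_index` are assumptions for illustration only; the authoritative header and index conventions are defined on the Kaggle page.

```python
import csv
import io

def build_submission(test1_pairs, test2_pairs, out_stream):
    """Write Test 1 and Test 2 predictions, concatenated, to one CSV.

    Each pair is (image_index, caption_index). Column names here are
    illustrative; check the Kaggle page for the required header.
    """
    writer = csv.writer(out_stream)
    writer.writerow(["image_index", "caption_index"])
    for img, cap in test1_pairs:  # Test 1: 0..N captions per image
        writer.writerow([img, cap])
    for img, cap in test2_pairs:  # Test 2: exactly one caption per image
        writer.writerow([img, cap])

# Example: image 0 matches captions 3 and 7 in Test 1;
# Test 2 is one caption per image.
buf = io.StringIO()
build_submission([(0, 3), (0, 7), (1, 2)], [(0, 5), (1, 1)], buf)
print(buf.getvalue())
```

The sparse index format means that only the matched pairs appear as rows, so an image with zero matching captions in Test 1 simply contributes no rows.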
The best four participants will move to Phase Two, where they will be asked to provide information about the data used to train their models.
In this phase, participants are required to submit their model, all relevant scripts, and a .txt file listing the data sources used to train the model. The organizers will check for any rule violations and will reproduce the Phase One results of the submitted model.
Note that results that are not reproducible will lead to disqualification.
Additionally, we will evaluate all the models on our own private test set to decide the winner and runner-up.
The selected winner and runner-up will be asked to provide all the data used to train their models.
For detailed challenge instructions and regulations, please refer to the Kaggle page:
https://www.kaggle.com/competitions/dishcovery-mission-ii-cvpr-2026
Dishcovery Mission II Challenge Organizers
Dr. Petia Radeva
Universitat de Barcelona
Dr. Bhalaji Nagarajan
Barcelona Supercomputing Center
Mr. Imanol G. Estepa
Universitat de Barcelona
Mr. Jesús M Rodríguez-de-Vera
Universitat de Barcelona