The DISC2021 (aka DISC21) dataset is provided as zip files on AWS S3.
The larger subsets are split into several zip files of around 8 GiB each, which makes them easier to download in batches.
Development query images (50k images): dev_queries.zip
Final query images (50k images): final_queries.zip
Reference images (20 × 50k images):
references_0.zip ... references_19.zip
Training images (20 × 50k images):
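The numbered reference archives listed above (references_0.zip through references_19.zip) follow a simple pattern, so a download script can generate the names rather than hard-code them. The following is a minimal sketch; the helper name `reference_archive_names` is hypothetical, and the S3 location is deliberately left out because it is not given here.

```python
def reference_archive_names(num_shards=20):
    """Build the list references_0.zip ... references_{num_shards-1}.zip.

    The default of 20 shards matches the "20 x 50k images" stated above.
    """
    return [f"references_{i}.zip" for i in range(num_shards)]


names = reference_archive_names()
# Show the first name, the last name, and how many names were generated.
print(names[0], names[-1], len(names))
```

Each name would then be appended to whatever base location you download from.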
Metadata files are provided in CSV format. We expect the format to be self-explanatory.
Development queries ground truth: dev_ground_truth.csv
Final queries ground truth: final_ground_truth.csv
Attributions of all images (Flickr user that created each image): disc21_yfcc_attributions.csv, disc21_testset_yfcc_attributions.csv
Metadata for the image manipulation process: metadata_final_10k.csv
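Since the column layout of these CSV files is not spelled out here, a quick way to check it is to print the first few rows before writing any parsing code. The sketch below makes no assumption about column names or about whether the first row is a header; `peek_csv` is a hypothetical helper, and the in-memory sample stands in for a real file such as dev_ground_truth.csv.

```python
import csv
import io
import itertools


def peek_csv(source, num_rows=3):
    """Return the first row and up to num_rows following rows of a CSV stream."""
    reader = csv.reader(source)
    first_row = next(reader, None)  # may or may not be a header row
    following = list(itertools.islice(reader, num_rows))
    return first_row, following


# Placeholder data only; these are NOT the real column names of the dataset.
sample = io.StringIO("col_a,col_b\n1,2\n3,4\n")
first_row, following = peek_csv(sample)
print(first_row)
print(following)

# With a downloaded file, the same helper would be used like this:
# with open("dev_ground_truth.csv", newline="") as f:
#     print(peek_csv(f))
```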
See https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/page/381/