Sergio Escalera - University of Barcelona and Computer Vision Center & Julio C. S. Jacques Junior - Computer Vision Center
Popular Computer Vision Datasets and Benchmarks
Additional resources:
Search engine (datasets): https://paperswithcode.com/datasets?q=&v=lst&o=match
Chalearn (LAP) Looking at People repository: https://chalearnlap.cvc.uab.cat/
2022
Open Images dataset (V7)
Link to resource: https://storage.googleapis.com/openimages/web/factsfigures_v7.html#overview
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari. "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale," IJCV, 2020.
R. Benenson and V. Ferrari. "From colouring-in to pointillism: revisiting semantic segmentation supervision," arXiv, 2022.
Link: https://arxiv.org/abs/1811.00982, https://storage.googleapis.com/openimages/web_v7/2022_pointillism_arxiv.pdf
Date created: 2022
Comments: Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects. Open Images also offers visual relationship annotations, indicating pairs of objects in particular relations (e.g. "woman playing guitar", "beer on table"), object properties (e.g. "table is wooden"), and human actions (e.g. "woman is jumping"). In total it has 3.3M annotations from 1,466 distinct relationship triplets. In V5 they added segmentation masks for 2.8M object instances in 350 classes. In V6 they added 675k localized narratives: multimodal descriptions of images consisting of synchronized voice, text, and mouse traces over the objects being described. In V7 they added 66.4M point-level labels over 1.4M images, covering 5,827 classes. These labels provide sparse pixel-level localization and are suitable for zero/few-shot semantic segmentation training and evaluation. Finally, the dataset is annotated with 61.4M image-level labels spanning 20,638 classes. A minimal annotation-loading sketch follows this entry.
Purpose: image classification, object detection, visual relationship detection, instance segmentation, and multimodal image descriptions.
Quantitative numbers:
Number of examples: ~9M images
Size: 591 GB (see the GitHub repo)
Number of classes or labels: ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives; 16M bounding boxes for 600 object classes on 1.9M images; 3.3M annotations from 1,466 distinct relationship triplets; segmentation masks for 2.8M object instances in 350 classes; 675k localized narratives; 66.4M point-level labels over 1.4M images; 61.4M image-level labels spanning 20,638 classes.
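A minimal annotation-loading sketch with pandas. The CSV file names below are placeholders for the files offered on the Open Images download page, and the column names follow the documented box-annotation schema; treat both as assumptions to verify against the files you actually download.
    # Hedged sketch: reading Open Images box annotations with pandas.
    # File names are placeholders; columns (ImageID, LabelName, XMin/XMax/YMin/YMax
    # in normalized [0, 1] coordinates) follow the documented box-annotation schema.
    import pandas as pd

    boxes = pd.read_csv("train-annotations-bbox.csv")            # placeholder filename
    classes = pd.read_csv("class-descriptions-boxable.csv",      # placeholder filename
                          header=None, names=["LabelName", "DisplayName"])

    # Map machine-generated label IDs (MIDs such as /m/01g317) to readable names.
    boxes = boxes.merge(classes, on="LabelName", how="left")

    # Count boxes per class and inspect a few rows.
    print(boxes["DisplayName"].value_counts().head(10))
    print(boxes[["ImageID", "DisplayName", "XMin", "YMin", "XMax", "YMax"]].head())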
Ego4D dataset
Link to resource: https://ego4d-data.org/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Kristen Grauman et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video," CVPR 2022.
Link: https://openaccess.thecvf.com/content/CVPR2022/papers/Grauman_Ego4D_Around_the_World_in_3000_Hours_of_Egocentric_Video_CVPR_2022_paper.pdf
Date created: 2022
Comments: Ego4D is a massive-scale egocentric dataset of unprecedented diversity. It consists of 3,670 hours of video collected by 923 unique participants from 74 worldwide locations in 9 different countries. The project brought together 88 researchers in an international consortium to dramatically increase the scale of egocentric data publicly available by an order of magnitude, making it more than 20x greater than any other dataset in terms of hours of footage. Ego4D aims to catalyse the next era of research in first-person visual perception. The dataset is diverse in its geographic coverage, scenarios, participants, and captured modalities. Data was captured using seven different off-the-shelf head-mounted cameras: GoPro, Vuzix Blade, Pupil Labs, ZShades, ORDRO EP6, iVue Rincon 1080, and Weeview. In addition to video, portions of Ego4D offer other data modalities: 3D scans, audio, gaze, stereo, multiple synchronized wearable cameras, and textual narrations.
Purpose: first-person visual perception; episodic memory, hands & object interaction, audio-visual diarization, social interactions, forecasting tasks.
Quantitative numbers:
Volume:
Full Primary Dataset ~7.1 TB
Entire Dataset 30+ TB
Number of examples: 3,670 hours of video
Number of classes or labels: Millions of annotations supporting multiple complex tasks, ranging from temporal, spatial, and semantic labels, to dense textual narrations of activities, natural language queries, and speech transcriptions.
2021
UDIVA v0.5 dataset
Link to resource: https://chalearnlap.cvc.uab.cat/dataset/41/description/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Cristina Palmero, Javier Selva, Sorina Smeureanu, Julio C. S. Jacques Junior, Albert Clapés, Alexa Moseguí, Zejian Zhang, David Gallardo, Georgina Guilera, David Leiva, Sergio Escalera. "Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset", WACVW, 2021.
Link: https://openaccess.thecvf.com/content/WACV2021W/HBU/papers/Palmero_Context-Aware_Personality_Inference_in_Dyadic_Scenarios_Introducing_the_UDIVA_Dataset_WACVW_2021_paper.pdf
Date created: 2021
Comments: The UDIVA dataset aims to move beyond automatic individual behavior detection and focus on the development of automatic approaches to study and understand the mechanisms of influence, perception and adaptation to verbal and nonverbal social signals in dyadic interactions, taking into account individual and dyad characteristics as well as other contextual factors. The UDIVA v0.5 dataset is a preliminary version of the UDIVA dataset, including a subset of the participants, sessions, synchronized views, and annotations of the complete UDIVA dataset.
Purpose: human behavior in dyadic interactions
Quantitative numbers:
Number of examples: 145 dyadic interaction sessions, each divided into 4 tasks: Talk, Lego, Ghost, and Animals. The sessions are performed by 134 participants (aged 17 to 75, 55.2% male), each of whom can take part in up to 5 sessions with different partners. Spanish is the majority language (73.1%), followed by Catalan (17.25%) and English (9.65%).
Number of classes or labels: personality labels (self-reported and perceived), meta-data, transcripts, pseudo-labels: face/body/hand landmarks, 3D eye gaze vectors
EPIC-KITCHENS-100 dataset
Link to resource: https://epic-kitchens.github.io/2022
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino and Ma, Jian and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan and Perrett, Toby and Price, Will and Wray, Michael. "Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100," International Journal of Computer Vision (IJCV), 2022.
Link: https://link.springer.com/article/10.1007/s11263-021-01531-2, https://arxiv.org/pdf/2006.13256v4.pdf
Date created: 2021
Comments: A large-scale dataset in first-person (egocentric) vision: multi-faceted, audio-visual, non-scripted recordings in native environments, i.e. the wearers' homes, capturing all daily kitchen activities over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface. Characteristics: 45 kitchens across 4 cities; head-mounted camera; 100 hours of Full HD recording; 20M frames; multi-language narrations; 90K action segments; 20K unique narrations; 97 verb classes; 300 noun classes; 5 challenges. A hedged annotation-loading sketch follows this entry.
Purpose: Action Recognition, Action Detection, Action Anticipation, Domain Adaptation for Action Recognition, Multi-Instance Retrieval
Quantitative numbers:
Volume:
Extended Sequences (+RGB Frames, Flow Frames, Gyroscope + accelerometer data): 740GB (zipped).
Original Sequences (+RGB and Flow Frames): 1.1TB (zipped).
Automatic annotations (masks, hands and objects): 10 GB.
Number of examples: 100 hours of recording - Full HD, 20M frames
Number of classes or labels: Multi-language narrations, 90K action segments, 20K unique narrations, 97 verb classes, 300 noun classes, automatic annotations (masks, hands and objects).
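The action annotations are distributed as CSV files alongside the videos. A hedged sketch follows, assuming a hypothetical file name ('EPIC_100_train.csv') and column names ('video_id', 'narration', 'verb_class', 'noun_class') modeled on the released schema; check both against the files you actually download.
    # Hedged sketch: browsing EPIC-KITCHENS-100 action-segment annotations.
    # File and column names are assumptions about the released CSV schema.
    import pandas as pd

    train = pd.read_csv("EPIC_100_train.csv")        # placeholder filename

    # One row per action segment: which video, what was narrated, and its classes.
    print(len(train), "action segments")
    print(train[["video_id", "narration", "verb_class", "noun_class"]].head())

    # Rough vocabulary statistics observed in the training split.
    print(train["verb_class"].nunique(), "verb classes seen in train")
    print(train["noun_class"].nunique(), "noun classes seen in train")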
2020
Cityscapes 3D dataset
Link to resource: https://www.cityscapes-dataset.com/cityscapes-3d-benchmark-online/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Gählert, Nils; Jourdan, Nicolas; Cordts, Marius; Franke, Uwe; Denzler, Joachim. "Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection," CVPR Workshop on Scalability in Autonomous Driving, 2020.
Date created: 2020
Comments: Cityscapes 3D is an extension of the original Cityscapes with 3D bounding box annotations for all types of vehicles as well as a benchmark for the 3D detection task.
Purpose: 3D object detection
Quantitative numbers:
Number of examples: 5,000 images with fine annotations and 20,000 images with coarse annotations.
Number of classes or labels: 30 classes grouped into 8 categories
CelebAMask-HQ
Link to resource: https://github.com/switchablenorms/CelebAMask-HQ
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Lee, Cheng-Han and Liu, Ziwei and Wu, Lingyun and Luo, Ping. "MaskGAN: Towards Diverse and Interactive Facial Image Manipulation," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Link: https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_MaskGAN_Towards_Diverse_and_Interactive_Facial_Image_Manipulation_CVPR_2020_paper.pdf
Date created: 2020
Comments: CelebAMask-HQ is a large-scale face image dataset of 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image has a segmentation mask of facial attributes corresponding to CelebA. The masks were manually annotated at a resolution of 512 x 512 with 19 classes covering all facial components and accessories, such as skin, nose, eyes, eyebrows, ears, mouth, lips, hair, hat, eyeglasses, earrings, necklace, neck, and cloth.
Purpose: face parsing, face recognition, and GANs for face generation and editing.
Quantitative numbers:
Number of examples: 30,000 high-resolution face images
Number of classes or labels: one 512 x 512 segmentation mask per image and 19 classes covering all facial components and accessories (skin, nose, eyes, eyebrows, ears, mouth, lips, hair, hat, eyeglasses, earrings, necklace, neck, and cloth).
CelebA-Spoof
Link to resource: https://github.com/ZhangYuanhan-AI/CelebA-Spoof
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Zhang, Yuanhan and Yin, Zhenfei and Li, Yidong and Yin, Guojun and Yan, Junjie and Shao, Jing and Liu, Ziwei. "CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations," European Conference on Computer Vision (ECCV), 2020.
Link: https://link.springer.com/chapter/10.1007/978-3-030-58610-2_5 | https://arxiv.org/abs/2007.12342
Date created: 2020
Comments: CelebA-Spoof is a large-scale face anti-spoofing dataset of 625,537 images from 10,177 subjects, annotated with 43 rich attributes covering face, illumination, environment, and spoof types. Live images were selected from the CelebA dataset; spoof images were collected and annotated by the dataset creators. Among the 43 attributes, 40 belong to live images, covering facial components and accessories such as skin, nose, eyes, eyebrows, lips, hair, hat, and eyeglasses; 3 belong to spoof images, covering spoof type, environment, and illumination conditions.
Purpose: face anti-spoofing.
Quantitative numbers:
Number of examples: 625,537 images from 10,177 subjects
Number of classes or labels: 43 rich attributes covering face, illumination, environment, and spoof types
CelebA-Dialog
Link to resource: https://github.com/ziqihuangg/CelebA-Dialog
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy and Ziwei Liu. "Talk-to-Edit: Fine-Grained Facial Editing via Dialog," IEEE International Conference on Computer Vision (ICCV), 2021.
Link: https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_Talk-To-Edit_Fine-Grained_Facial_Editing_via_Dialog_ICCV_2021_paper.pdf
Date created: 2020
Comments: CelebA-Dialog is a large-scale visual-language face dataset with the following features: facial images are annotated with rich fine-grained labels, which classify each attribute into multiple degrees according to its semantic meaning, and each image is accompanied by textual captions describing its attributes and a sample user editing request.
Purpose: fine-grained facial attribute recognition, fine-grained facial manipulation, text-based facial generation and manipulation, face image captioning, natural language based facial recognition and manipulation, and broader multi-modality learning tasks.
Quantitative numbers:
Number of examples: 10,177 identities; 202,599 face images
Number of classes or labels: 5 fine-grained attribute annotations per image (Bangs, Eyeglasses, Beard, Smiling, and Age); textual captions and a user editing request per image.
Kinetics-400/600/700
Link to resource: https://www.deepmind.com/open-source/kinetics
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman. "A Short Note on the Kinetics-700-2020 Human Action Dataset," arXiv, 2020.
Date created: 2020 (Kinetics-700)
Comments: A collection of large-scale, high-quality datasets of URLs for up to 650,000 video clips covering 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips. Each clip is human annotated with a single action class and lasts around 10 seconds. A minimal loading sketch follows this entry.
Purpose: action recognition
Quantitative numbers:
Number of examples: URLs for up to 650,000 video clips
Number of classes or labels: 400/600/700 human action classes, depending on the dataset version
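A minimal loading sketch, assuming the clips have already been downloaded locally and a recent torchvision release (which provides torchvision.datasets.Kinetics) is installed; the parameter values below are illustrative.
    # Minimal sketch: reading locally downloaded Kinetics clips via torchvision.
    from torchvision.datasets import Kinetics

    dataset = Kinetics(
        root="data/kinetics",      # directory containing the downloaded clips
        frames_per_clip=16,        # clip length sampled from each ~10 s video
        num_classes="400",         # "400", "600", or "700" depending on the version
        split="val",
    )
    video, audio, label = dataset[0]   # video: (T, H, W, C) uint8 tensor
    print(video.shape, label)
Note that the class has to index the video files on first use, which can take a while for a dataset of this size.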
2019
YouTube-8M Segments Dataset
Link to resource: https://research.google.com/youtube8m/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan. "YouTube-8M: A Large-Scale Video Classification Benchmark", arXiv, 2016.
Link: https://arxiv.org/abs/1609.08675; https://research.google.com/youtube8m/workshop2018/index.html; https://research.google.com/youtube8m/workshop2017/index.html
Date created: 2019
Comments: The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, the data creators aim to temporally localize the entities in the videos, i.e., find out when the entities occur. They collected human-verified labels on about 237K segments over 1,000 classes from the validation set of the YouTube-8M dataset. Each video again comes with time-localized frame-level features so that classifier predictions can be made at segment-level granularity. Researchers are encouraged to leverage the large amount of noisy video-level labels in the training set to train models for temporal localization.
Purpose: temporal action localization; video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video
Quantitative numbers:
Number of examples: 237K video segments
Number of classes or labels: 1000 classes
2018
SoccerNet dataset
Link to resource: https://www.soccer-net.org/home
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: S. Giancola, M. Amine, T. Dghaily and B. Ghanem, "SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018
Link:
Date created: 2018
Comments: SoccerNet is a large-scale dataset for soccer video understanding. It has evolved over the years to include various tasks such as action spotting, camera calibration, player re-identification and tracking. It is composed of 550 complete broadcast soccer games and 12 single camera games taken from the major European leagues.
Purpose: soccer video understanding, action spotting, camera calibration, player re-identification and tracking.
Quantitative numbers:
Number of examples: 550 complete broadcast soccer games and 12 single camera games
Number of classes or labels: 17 action classes; ~300k annotations temporally anchored within SoccerNet's 764 hours of video; 158,493 camera change timestamps; 32,932 replay shots associated with their corresponding actions.
2017
YouTube-8M Dataset
Link to resource: https://research.google.com/youtube8m/download.html
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan. "YouTube-8M: A Large-Scale Video Classification Benchmark", arXiv, 2016.
Link: https://arxiv.org/abs/1609.08675; https://research.google.com/youtube8m/workshop2018/index.html; https://research.google.com/youtube8m/workshop2017/index.html
Date created: 2017
Comments: YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to train a strong baseline model on this dataset in less than a day on a single GPU. At the same time, the dataset's scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion. A hedged feature-parsing sketch follows this entry.
Purpose: video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video
Quantitative numbers:
Number of examples: 6.1M video IDs; 350,000 hours of video; 2.6 billion audio-visual features
Number of classes or labels: 3,862 classes
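A hedged sketch for parsing the video-level TFRecords with TensorFlow. The feature names ('id', 'labels', 'mean_rgb', 'mean_audio') and dimensionalities follow the commonly documented video-level schema and should be treated as assumptions; the file name is a placeholder.
    # Hedged sketch: parsing YouTube-8M video-level TFRecords with TensorFlow.
    import tensorflow as tf

    feature_spec = {
        "id": tf.io.FixedLenFeature([], tf.string),
        "labels": tf.io.VarLenFeature(tf.int64),
        "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),   # assumed 1024-dim visual feature
        "mean_audio": tf.io.FixedLenFeature([128], tf.float32),  # assumed 128-dim audio feature
    }

    def parse(record):
        example = tf.io.parse_single_example(record, feature_spec)
        example["labels"] = tf.sparse.to_dense(example["labels"])
        return example

    ds = tf.data.TFRecordDataset(["train0000.tfrecord"])   # placeholder filename
    for ex in ds.map(parse).take(1):
        print(ex["id"].numpy(), ex["labels"].numpy(), ex["mean_rgb"].shape)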
Something-Something (v2) dataset
Link to resource: https://developer.qualcomm.com/software/ai-datasets/something-something
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic. "The "something something" video database for learning and evaluating visual common sense," ICCV 2017.
Link: https://openaccess.thecvf.com/content_ICCV_2017/papers/Goyal_The_Something_Something_ICCV_2017_paper.pdf
Date created: 2017
Comments: The Something-Something dataset (version 2) is a collection of 220,847 labeled video clips of humans performing pre-defined, basic actions with everyday objects. It is designed to train machine learning models in fine-grained understanding of human hand gestures, like putting something into something, turning something upside down, and covering something with something. A hedged label-file sketch follows this entry.
Purpose: fine-grained understanding of human hand gestures
Quantitative numbers:
Volume = The video data is provided as one large TGZ archive, split into parts of 1 GB maximum. The total download size is 19.4 GB. The archive contains webm-files, using the VP9 codec, with a height of 240px.
Number of examples: 220,847 videos
"Intrinsic dimension": does not apply.
Number of classes or labels: 174 action classes (label templates); object annotations in addition to the video label, where applicable. For example, for a label like "Putting [something] onto [something]," there is also an annotated version, such as "Putting a cup onto a table." In total, there are 318,572 annotations involving 30,408 unique objects.
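A hedged sketch, assuming the JSON label files commonly distributed with the dataset; the file names, the record fields ('id', 'template', 'placeholders'), and the bracket-stripping step are assumptions to verify against the downloaded files.
    # Hedged sketch: reading the Something-Something V2 JSON label files.
    import json

    with open("something-something-v2-labels.json") as f:
        label_to_idx = json.load(f)     # assumed mapping: label template -> class index

    with open("something-something-v2-train.json") as f:
        train = json.load(f)            # assumed list of per-video records

    rec = train[0]
    print(rec["id"], rec["template"], rec["placeholders"])   # video id, class template, filled-in objects

    # Assumption: templates carry "[something]" brackets that the label list omits.
    template = rec["template"].replace("[", "").replace("]", "")
    print(label_to_idx.get(template))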
Places dataset
Link to resource: http://places2.csail.mit.edu/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Zhou, Bolei and Lapedriza, Agata and Khosla, Aditya and Oliva, Aude and Torralba, Antonio. "Places: A 10 million Image Database for Scene Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
Date created: 2017
Comments: The Places dataset is designed following principles of human visual cognition. The goal was to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference. The semantic categories of Places are defined by their function: the labels represent the entry level of an environment. To illustrate, the dataset has different categories of bedrooms, streets, etc., as one does not act the same way, or make the same predictions of what can happen next, in a home bedroom, a hotel bedroom, or a nursery. In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. A minimal loading sketch follows this entry.
Purpose: scene recognition; scene-centric benchmarks
Quantitative numbers:
Number of examples: more than 10 million images comprising 400+ unique scene categories.
Number of classes or labels: 400+ scene categories in the full database; the Places365 benchmark subsets use 365 categories.
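A minimal loading sketch using torchvision's Places365 wrapper (available in recent torchvision releases); the split names and the small (256x256) option follow that API, and the root path is an example.
    # Minimal sketch: loading the Places365 validation split via torchvision.
    from torchvision import datasets, transforms

    places = datasets.Places365(
        root="data/places365",
        split="val",              # 'train-standard', 'train-challenge', or 'val'
        small=True,               # use the 256x256 resized images
        download=True,
        transform=transforms.ToTensor(),
    )
    image, class_idx = places[0]
    print(image.shape, places.classes[class_idx])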
EMOTIC dataset
Link to resource: http://sunai.uoc.edu/emotic/index.html
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: R. Kosti, J.M. Álvarez, A. Recasens and A. Lapedriza, "Emotion Recognition in Context", Computer Vision and Pattern Recognition (CVPR), 2017
Date created: 2017
Comments: The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with their apparent emotions. The images are annotated with an extended list of 26 emotion categories combined with the three common continuous dimensions Valence, Arousal and Dominance.
Purpose: emotion recognition
Quantitative numbers:
Number of examples: 18,316 images having 23,788 annotated people.
Number of classes or labels: 26 emotion categories combined with the three common continuous dimensions Valence, Arousal and Dominance
2016
Cityscapes dataset
Link to resource: https://www.cityscapes-dataset.com/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Cordts, Marius; Omran, Mohamed; Ramos, Sebastian; Rehfeld, Timo; Enzweiler, Markus; Benenson, Rodrigo; Franke, Uwe; Roth, Stefan and Schiele, Bernt. "The Cityscapes Dataset for Semantic Urban Scene Understanding," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Link: https://openaccess.thecvf.com/content_cvpr_2016/papers/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.pdf
Date created: 2016
Comments: Cityscapes is a large-scale dataset focused on semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, annotated with 30 classes grouped into 8 categories. It includes high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames. A minimal loading sketch follows this entry.
Purpose: Semantic understanding of urban street scenes
Quantitative numbers:
Number of examples: 5,000 images with fine annotations and 20,000 images with coarse annotations.
Number of classes or labels: 30 classes grouped into 8 categories
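A minimal loading sketch, assuming the leftImg8bit and gtFine packages have been downloaded manually (registration required) and unpacked under the root directory; the wrapper itself is torchvision.datasets.Cityscapes.
    # Minimal sketch: loading Cityscapes fine annotations via torchvision.
    from torchvision import datasets

    cityscapes = datasets.Cityscapes(
        root="data/cityscapes",
        split="train",            # 'train', 'val', or 'test'
        mode="fine",              # 'fine' (5,000 images) or 'coarse' (20,000 images)
        target_type="semantic",   # per-pixel class-id mask
    )
    image, mask = cityscapes[0]   # PIL images: RGB frame and label-id mask
    print(image.size, mask.size)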
2015
CelebA (CelebFaces Attributes Dataset)
Link to resource: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou. "Deep Learning Face Attributes in the Wild," IEEE International Conference on Computer Vision (ICCV), 2015.
Link: https://openaccess.thecvf.com/content_iccv_2015/papers/Liu_Deep_Learning_Face_ICCV_2015_paper.pdf
Date created: 2015
Comments: CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, 5 landmark locations per image, and 40 binary attribute annotations per image. A minimal loading sketch follows this entry.
Purpose: face attribute recognition, face recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.
Quantitative numbers:
Number of examples: 10,177 identities; 202,599 face images
Number of classes or labels: 5 landmark locations and 40 binary attribute annotations per image.
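A minimal loading sketch using torchvision's CelebA wrapper; target_type="attr" returns the 40 binary attribute labels listed above, and the root path is an example.
    # Minimal sketch: loading CelebA attribute labels via torchvision.
    from torchvision import datasets, transforms

    celeba = datasets.CelebA(
        root="data/celeba",
        split="train",
        target_type="attr",       # also: 'identity', 'bbox', 'landmarks'
        transform=transforms.ToTensor(),
        download=True,            # download may fail if the hosting quota is exceeded
    )
    image, attrs = celeba[0]      # attrs: tensor of 40 binary attributes
    print(image.shape, attrs.shape, celeba.attr_names[:5])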
IMDB-Wiki Dataset
Link to resource: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Rasmus Rothe and Radu Timofte and Luc Van Gool. "DEX: Deep EXpectation of apparent age from a single image," IEEE International Conference on Computer Vision Workshops (ICCVW), 2015.
Link: https://openaccess.thecvf.com/content_iccv_2015_workshops/w11/papers/Rothe_DEX_Deep_EXpectation_ICCV_2015_paper.pdf
Date created: 2015
Comments: The data creators took the list of the 100,000 most popular actors as listed on the IMDb website and (automatically) crawled from their profiles the date of birth, name, gender, and all images related to that person. Additionally, they crawled all profile images from Wikipedia pages of people with the same meta information. Assuming that images with a single face are likely to show the actor and that the timestamp and date of birth are correct, they were able to assign to each such image the biological (real) age. Of course, they cannot vouch for the accuracy of the assigned age information: besides wrong timestamps, many images are stills from movies, and movies can have extended production times. A hedged metadata-parsing sketch follows this entry.
Purpose: real age prediction and age perception
Quantitative numbers:
Volume = ~280 GB
Number of examples: 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia, thus 523,051 in total.
Number of classes or labels: gender and age labels
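A hedged parsing sketch for the MATLAB metadata files shipped with IMDB-WIKI (e.g. wiki.mat). The field names used here ('dob', 'photo_taken', 'full_path') are assumptions based on the commonly documented layout; the apparent real age is recovered as the year the photo was taken minus the birth year, with 'dob' stored as a MATLAB serial date number.
    # Hedged sketch: recovering age labels from the IMDB-WIKI metadata .mat file.
    from datetime import datetime, timedelta
    from scipy.io import loadmat

    meta = loadmat("wiki.mat")["wiki"][0, 0]       # placeholder path; assumed struct layout
    dob = meta["dob"][0]                           # MATLAB serial date numbers
    photo_taken = meta["photo_taken"][0]           # year the photo was taken
    paths = [p[0] for p in meta["full_path"][0]]   # relative image paths

    def datenum_to_year(d):
        # Convert a MATLAB serial date number to a calendar year.
        return (datetime.fromordinal(int(d)) + timedelta(days=float(d) % 1)
                - timedelta(days=366)).year

    age_0 = int(photo_taken[0]) - datenum_to_year(dob[0])
    print(paths[0], age_0)
Note that some records have invalid birth dates or face scores, so a real pipeline should filter them before computing ages.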
2014
MPII Human Pose Dataset
Link to resource: http://human-pose.mpi-inf.mpg.de/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Mykhaylo Andriluka and Leonid Pishchulin and Peter Gehler and Schiele, Bernt. "2D Human Pose Estimation: New Benchmark and State of the Art Analysis," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
Link: http://human-pose.mpi-inf.mpg.de/contents/andriluka14cvpr.pdf
Date created: 2014
Comments: MPII Human Pose dataset is a benchmark for evaluation of articulated human pose estimation. The images were systematically collected using an established taxonomy of everyday human activities. Overall the dataset covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video and provided with preceding and following un-annotated frames. In addition, for the test set richer annotations were obtained, including body part occlusions and 3D torso and head orientations.
Purpose: articulated human pose estimation; action recognition
Quantitative numbers:
Number of examples: around 25K images containing over 40K people with annotated body joints.
Number of classes or labels: annotated body joints per person and an activity label per image (410 activities); for the test set, body part occlusions and 3D torso and head orientations.
MS Coco
Link to resource: https://cocodataset.org/#home
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Lin, TY. et al. (2014). Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693.
Link: https://link.springer.com/content/pdf/10.1007/978-3-319-10602-1_48.pdf
Date created: 2014
Comments: The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, keypoint detection, and captioning dataset. The dataset consists of 328K images. Splits: the first version of MS COCO was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015 an additional test set of 81K images was released, including all the previous test images and 40K new images. Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K; the new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated set of 123K images. The dataset contains photos of 91 easily recognizable object types, with a total of 2.5 million labeled instances in 328K images. A minimal pycocotools sketch follows this entry.
Purpose: object detection, instance segmentation, keypoint detection, and image captioning
Quantitative numbers:
Volume =
Images
2014 Train images [83K/13GB]
2014 Val images [41K/6GB]
2014 Test images [41K/6GB]
2015 Test images [81K/12GB]
2017 Train images [118K/18GB]
2017 Val images [5K/1GB]
2017 Test images [41K/6GB]
2017 Unlabeled images [123K/19GB]
Annotations
2014 Train/Val annotations [241MB]
2014 Testing Image info [1MB]
2015 Testing Image info [2MB]
2017 Train/Val annotations [241MB]
2017 Stuff Train/Val annotations [1.1GB]
2017 Panoptic Train/Val annotations [821MB]
2017 Testing Image info [1MB]
2017 Unlabeled Image info [4MB]
Number of examples: The dataset consists of 328K images.
Number of classes or labels: bounding boxes and per-instance segmentation masks with 80 object categories; per-pixel segmentation masks with 91 stuff categories; 11 super-categories.
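A minimal sketch using the official pycocotools API (pip install pycocotools) to browse instance annotations; the annotation file path is an example.
    # Minimal sketch: browsing COCO 2017 instance annotations with pycocotools.
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_val2017.json")

    cat_ids = coco.getCatIds(catNms=["person"])       # category ids for 'person'
    img_ids = coco.getImgIds(catIds=cat_ids)          # images containing people
    img = coco.loadImgs(img_ids[0])[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"], catIds=cat_ids, iscrowd=None))

    print(img["file_name"], len(anns), "person instances")
    print(anns[0]["bbox"])                            # [x, y, width, height]
    mask = coco.annToMask(anns[0])                    # binary segmentation mask
    print(mask.shape)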
... 2009
ImageNet
Link to resource: https://image-net.org/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009.
Link: https://image-net.org/static_files/papers/imagenet_cvpr09.pdf
Date created: 2009
Comments: One of the most popular datasets for computer vision projects, ImageNet provides an accessible image database. It contains 14,197,122 annotated images organized according to the WordNet hierarchy. Since 2010 the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images; a set of test images is also released, with the manual annotations withheld. A minimal loading sketch follows this entry.
Purpose: Image classification and object detection
Quantitative numbers:
Number of examples: 1,281,167 training images, 50,000 validation images, and 100,000 test images (ILSVRC subset).
Number of classes or labels: 1,000 object classes (ILSVRC subset)
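A minimal loading sketch, assuming the ILSVRC2012 archives have been obtained manually (registration required) and placed under the root directory; torchvision's ImageNet wrapper then exposes the 1,000-class ILSVRC subset.
    # Minimal sketch: loading the ILSVRC2012 validation split via torchvision.
    from torchvision import datasets, transforms

    imagenet_val = datasets.ImageNet(
        root="data/imagenet",     # must contain the manually downloaded archives
        split="val",
        transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ]),
    )
    image, class_idx = imagenet_val[0]
    print(image.shape, imagenet_val.classes[class_idx])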
The CIFAR-10 dataset
Link to resource: https://www.cs.toronto.edu/~kriz/cifar.html
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Krizhevsky, Alex and Hinton, Geoffrey. "Learning multiple layers of features from tiny images," Technical report, University of Toronto, 2009.
Link: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
Date created: 2009
Comments: CIFAR-10 is a popular computer-vision dataset collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is used for object recognition and consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. It is divided into five training batches and one test batch, each with 10,000 images, giving 50,000 training images and 10,000 test images. A minimal loading sketch follows this entry.
Purpose: Object recognition
Quantitative numbers:
Volume = 163 MB (python version), compressed with .tar.gz.
Number of examples: 60,000 32×32 color images
Number of classes or labels: 10 classes, with 6,000 images per class
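A minimal loading sketch; torchvision's CIFAR10 wrapper downloads and parses the 163 MB Python-version archive automatically, and the root path is an example.
    # Minimal sketch: loading CIFAR-10 train/test splits via torchvision.
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_set = datasets.CIFAR10(root="data/cifar10", train=True, download=True,
                                 transform=transforms.ToTensor())
    test_set = datasets.CIFAR10(root="data/cifar10", train=False, download=True,
                                transform=transforms.ToTensor())

    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    images, labels = next(iter(loader))
    print(images.shape)            # torch.Size([128, 3, 32, 32])
    print(train_set.classes)       # the 10 class names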
...1998
MNIST dataset
Link to resource: http://yann.lecun.com/exdb/mnist/
Reference and link to paper describing the dataset/benchmark and baseline results:
Ref: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
Date created: 1998
Comments: The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. A minimal parsing sketch follows this entry.
Purpose: pattern recognition
Quantitative numbers:
Volume (gzip-compressed downloads):
training set images (9912422 bytes)
training set labels (28881 bytes)
test set images (1648877 bytes)
test set labels (4542 bytes)
Number of examples: 60,000 examples, and a test set of 10,000 examples
Number of classes or labels: 10 classes; the label values are 0 to 9.
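A minimal parsing sketch for the raw IDX files (the byte counts under Volume are the gzip-compressed downloads). Each file starts with a big-endian header holding a magic number, the item count and, for images, the row and column sizes, followed by unsigned-byte data.
    # Minimal sketch: parsing the gzipped MNIST IDX files directly.
    import gzip
    import struct
    import numpy as np

    def load_idx_images(path):
        with gzip.open(path, "rb") as f:
            magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
            assert magic == 2051, "not an IDX image file"
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

    def load_idx_labels(path):
        with gzip.open(path, "rb") as f:
            magic, n = struct.unpack(">II", f.read(8))
            assert magic == 2049, "not an IDX label file"
            return np.frombuffer(f.read(), dtype=np.uint8)

    images = load_idx_images("train-images-idx3-ubyte.gz")
    labels = load_idx_labels("train-labels-idx1-ubyte.gz")
    print(images.shape, labels.shape)   # (60000, 28, 28) (60000,)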