In progress, please contribute. Add your name if you contribute!
Contributor: Chen-Lin Zhang
Multi-modal benchmarks tend to cluster by the modalities they cover. The major benchmarks in each category are listed below.
Vision-only datasets
RGB + Depth (2D + 3D):
RGB + Gaze:
GTEA Series (GTEA/EGTEA/GTEA Gaze/EGTEA Gaze+): a series of first-person vision datasets that provide RGB video together with gaze information. The series is a foundational resource in first-person (egocentric) vision research.
Vision+Language datasets
COCO: COCO (Common Objects in Context) is a large-scale dataset of images of common objects, annotated with object labels and multiple human-written captions per image (a minimal loading sketch follows this list).
ActivityNet: ActivityNet covers a wide range of complex human activities of interest in daily living; the ActivityNet Captions extension adds sentence descriptions for the videos.
NUS-WIDE: NUS-WIDE contains 269,648 images from Flickr along with their associated tags, covering 5,018 unique tags.
Flickr-30K: The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.
Visual Genome: Visual Genome is a knowledge base that connects structured image annotations (objects, attributes, relationships, and region descriptions) to language.
LAION-5B: LAION-5B is one of the largest publicly available paired image-text datasets, with roughly 5.85 billion image-text pairs; it is widely used for training text-to-image generation models and unified vision-language models.
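As a concrete example of how these paired image-text datasets are consumed, here is a minimal sketch that iterates over COCO image-caption pairs using the pycocotools API. The annotation file path is an assumption and depends on your local COCO layout.

```python
# Minimal sketch: iterating COCO image-caption pairs with pycocotools.
# The annotation path below is an assumption; adjust to your local COCO download.
from pycocotools.coco import COCO

ann_file = "annotations/captions_val2017.json"  # assumed local path
coco = COCO(ann_file)

# Each image id maps to several human-written captions.
for img_id in coco.getImgIds()[:3]:
    img_info = coco.loadImgs(img_id)[0]
    ann_ids = coco.getAnnIds(imgIds=img_id)
    captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]
    print(img_info["file_name"], captions)
```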
Vision+Audio datasets
AVA: AVA is a project that provides audiovisual annotations of video to improve understanding of human activity; the project also includes an additional speaker-focused dataset (active speaker detection).
VGG-Sound: VGG-Sound is an audio-visual correspondence dataset consisting of short audio clips extracted from videos uploaded to YouTube (see the CSV-reading sketch after this list).
ACAV100M: ACAV100M is an automatically curated dataset of 100 million 10-second clips with high audio-visual correspondence.
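For the audio-visual datasets above, a minimal sketch of reading the VGG-Sound clip list with pandas is shown below. The file path and column names are assumptions (the released CSV ships without a header row), so verify them against your copy.

```python
# Minimal sketch: reading the VGG-Sound clip list with pandas.
# Path and column names are assumptions; the released CSV has no header row.
import pandas as pd

csv_path = "vggsound.csv"  # assumed local path
cols = ["youtube_id", "start_seconds", "label", "split"]
df = pd.read_csv(csv_path, names=cols)

# Each row points at a short clip starting at `start_seconds` inside the
# YouTube video `youtube_id`, annotated with one sound `label`.
print(df["label"].nunique(), "sound classes")
print(df.head())
```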
Vision+Audio+Language datasets
Ego4D: Ego4D is a massive-scale egocentric dataset and benchmark suite collected across 74 worldwide locations in 9 countries, with over 3,670 hours of daily-life activity video.
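A short sketch of summarizing the Ego4D collection from its metadata file is given below. The path and field names ("videos", "video_uid", "duration_sec") are assumptions based on the ego4d.json metadata distributed with the dataset; verify them against your local download.

```python
# Minimal sketch: totaling video hours from the Ego4D metadata file.
# Path and field names are assumptions; check them against your download.
import json

with open("ego4d.json") as f:  # assumed local path to the metadata file
    meta = json.load(f)

videos = meta["videos"]
total_sec = sum((v.get("duration_sec") or 0) for v in videos)
print(f"{len(videos)} videos, {total_sec / 3600:.0f} hours total")
```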