Last updated: 01/02/2024
10 Great Places to Find Free Datasets for Your Next Project, https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/
Google Dataset Search: https://datasetsearch.research.google.com/
70 Amazing Free Data Sources You Should Know, https://www.kdnuggets.com/2017/12/big-data-free-sources.html
Registry of Open Data on AWS, https://registry.opendata.aws/
More than 4,000 datasets are available via AWS Data Exchange.
UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php
10 Popular Datasets For Sentiment Analysis, https://analyticsindiamag.com/10-popular-datasets-for-sentiment-analysis/
US Census Bureau, https://www.census.gov/
Kaggle, https://www.kaggle.com/datasets
Awesome Public Datasets, https://github.com/awesomedata/awesome-public-datasets
Machine learning datasets, https://www.datasetlist.com/
ICPSR Sharing data to advance science, https://www.icpsr.umich.edu/web/pages/
The Social Science Data Archive at UCLA, https://dataverse.harvard.edu/dataverse/ssda_ucla
UN Data, https://data.un.org/
Baltimore Neighborhood Indicators Alliance, https://bniajfi.org/
World Bank Data, https://data.worldbank.org/
Data Is Plural: Search its archive via a Google Sheet or web app.
The home of the U.S. Government's open data, https://www.data.gov/. It lists more than 197,000 datasets from across the Federal Government, covering health, public safety, scientific research, and other topics.
Roper Center for Public Opinion Research at Cornell, https://ropercenter.cornell.edu/
Data and metadata for OECD countries and selected non-member economies, https://stats.oecd.org/
Links to various Poverty & Social Justice datasets, https://elon.libguides.com/c.php?g=553597&p=5095797#s-lg-box-8785116
Social Justice & Big Data Repository at Grand Valley State University, https://www.gvsu.edu/bigdata/social-justice-big-data-repository-29.htm
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), described as the largest dataset for multimodal sentiment analysis and emotion recognition to date, http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/
DataKind - Harnessing the power of data science in the service of humanity, https://www.datakind.org/
Statistics Without Borders, https://swb.wildapricot.org/
OpenNeuro, a free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data, https://openneuro.org/
OpenFDA, launched by the U.S. Food and Drug Administration, allows developers to access public FDA data through open APIs, provides raw data downloads, and offers documentation and examples. https://open.fda.gov/
VAERS - Vaccine Adverse Event Reporting System, https://vaers.hhs.gov/index.html
Centers for Disease Control and Prevention, National Center for Health Statistics
Diabetes, https://www.cdc.gov/diabetes/index.html
Heart Disease, https://www.cdc.gov/nchs/fastats/heart-disease.htm
Centers for Disease Control and Prevention datasets, https://www.cdc.gov/datastatistics/index.html
National Health Interview Survey, https://www.cdc.gov/nchs/nhis/index.htm
Behavioral Risk Factor Surveillance System (BRFSS), https://www.cdc.gov/brfss/index.html
National Health and Nutrition Examination Survey (NHANES), https://www.cdc.gov/nchs/nhanes/index.htm
Medical Expenditure Panel Survey (MEPS), https://www.meps.ahrq.gov/mepsweb/
Center for Aging and Population Health, https://www.caph.pitt.edu/research/epidemiologic-research/
Health Information National Trends Survey (HINTS), https://hints.cancer.gov/
COVID-19 data from Johns Hopkins University (JHU), https://github.com/CSSEGISandData
COVID-19 real-time information, reporting on cases, testing, and exposure sites in Australia, https://crisper.net.au/
An integrated database of CRISPR-Cas9 screening experiments for human cell lines, https://www.kobic.re.kr/icsdb/
Bioinformatics Databases, https://subjectguides.lib.neu.edu/c.php?g=948457&p=6839134
PhysioNet is a repository of freely available medical research data, managed by the MIT Laboratory for Computational Physiology.
For example, MIMIC-III Clinical Database, https://physionet.org/content/mimiciii/1.4/
Maryland Medicaid DataPort (Not open to the public), https://hilltopinstitute.org/data/dataport/
Asclepius-Synthetic-Clinical-Notes, https://huggingface.co/datasets/starmpcc/Asclepius-Synthetic-Clinical-Notes
Augmented-clinical-notes, https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes
National Cancer Institute (NCI) Data Catalog, https://datascience.cancer.gov/resources/nci-data-catalog
Surveillance, Epidemiology, and End Results (SEER) Program, https://seer.cancer.gov/data-software/
Data.World (42 cancer datasets available), https://data.world/datasets/cancer
Cancer Data and Statistics, https://www.cdc.gov/cancer/dcpc/data/index.htm
Cancer Genomics Cloud, https://www.cancergenomicscloud.org/
The mini-MIAS database of mammograms, http://peipa.essex.ac.uk/info/mias.html
Cancer Imaging Archive, https://www.cancerimagingarchive.net/
OpenNeuro for MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data, https://openneuro.org/
Radiology AI Lab, https://rail.jhu.edu/
STructured Analysis of the Retina (STARE), https://cecas.clemson.edu/~ahoover/stare/
Objaverse, a massive open dataset of text-paired 3D objects, https://huggingface.co/datasets/allenai/objaverse , https://arxiv.org/abs/2212.08051
Open Images Dataset V7 and Extensions, https://storage.googleapis.com/openimages/web/index.html
Open Images is a dataset of ~9M images that have been annotated with image-level labels and object bounding boxes, https://www.tensorflow.org/datasets/catalog/open_images_v4
Dataset from fundus images for the study of diabetic retinopathy (Downloadable data set of images), https://www.sciencedirect.com/science/article/pii/S2352340921003528
Best 13 Free Financial Datasets for Machine Learning, https://www.iguazio.com/blog/best-13-free-financial-datasets-for-machine-learning/
50 free Machine Learning datasets: finance and economics, https://blog.cambridgespark.com/50-free-machine-learning-datasets-part-two-financial-and-economic-datasets-6620274ee593
Data sets available on data.world, https://data.world/datasets/finance
Yahoo Finance on Crypto: https://finance.yahoo.com/crypto/
25 Best Retail, Sales, and Ecommerce Datasets for Machine Learning, https://imerit.net/blog/25-best-retail-sales-and-ecommerce-datasets-for-machine-learning-all-pbm/
The UCI Machine Learning Repository is a database of datasets for machine learning research. The repository includes a number of retail and sales datasets, including datasets on online retail and sales forecasting.
Kaggle is a platform for machine learning and data science competitions. Kaggle hosts a number of retail and sales datasets, including datasets on eCommerce and customer behavior.
266 food datasets available on data.world, https://data.world/datasets/food
Machine Learning Food Datasets Collection, https://hackernoon.com/machine-learning-food-datasets-collection-db21e38ea225
3 food-related datasets & ideas for analyzing them, https://medium.com/visual-analytics-field-notes/3-food-related-datasets-ideas-for-analyzing-them-29496dc441df
Sentiment140, http://help.sentiment140.com/for-students/
Online test data generator, https://www.onlinedatagenerator.com/
Synthetic data, https://mostly.ai/synthetic-data-platform/
Center for Alzheimer's and Related Dementias (CARD), also a part of NIH, https://card.nih.gov/data-resources/access-data
There are many data sources and tools available at CARD. Large population/cohort scale data:
Biobank datasets, https://www.ukbiobank.ac.uk/enable-your-research/research-analysis-platform
GP2, https://gp2.org/
NIA epidemiological cohorts, https://www.nia.nih.gov/health/alzheimers
Mexican biobank, http://www.mxbiobank.org/
Alzheimer’s Disease Data Initiative (ADDI), https://www.alzheimersdata.org/
AMP-PD/GP2 - https://amp-pd.org/federated-cohorts/gp2 (minimal paperwork required)
AMP-AD - https://adknowledgeportal.synapse.org/Explore/Programs/DetailsPage?Program=AMP-AD (minimal paperwork required)
Fox Insight - https://foxden.michaeljfox.org/insight/explore/insight.jsp (minimal paperwork required)
Deep molecular data
iNDI, https://card.nih.gov/research-programs/ipsc-neurodegenerative-disease-initiative
FOUNDIN, https://www.foundinpd.org/#Foundinpd
CRISPRBrain, https://crisprbrain.org/
Accelerating Medicines Partnerships, https://www.nih.gov/research-training/accelerating-medicines-partnership-amp
The following can be accessed through the LONI Image and Data Archive (https://ida.loni.usc.edu/):
PPMI
ADNI
A4 study
Public expression data, Gene Expression Omnibus (GEO), https://www.ncbi.nlm.nih.gov/geo/