
List of publicly available datasets

This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in awesome-awesomeness and sindresorhus's awesome list.

Agriculture

Biology

Climate/Weather

Complex Networks

Computer Networks

Contextual Data

Data Challenges

Earth Science

Economics

Education

Energy

Finance

GIS

Government

Healthcare

Image Processing

Machine Learning

Museums

Natural Language

Neuroscience

Physics

Psychology/Cognition

Public Domains

Search Engines

Social Networks

Social Sciences

Software

Sports

Time Series

Transportation

Complementary Collections


Here are many of the links mentioned so far:

Cross-disciplinary data repositories, data collections and data search engines:

  1. https://www.kaggle.com/datasets
  2. http://www.assetmacro.com
  3. http://usgovxml.com
  4. http://aws.amazon.com/datasets
  5. http://databib.org
  6. http://datacite.org
  7. http://figshare.com
  8. http://linkeddata.org
  9. http://reddit.com/r/datasets
  10. http://thewebminer.com/
  11. http://thedatahub.org alias http://ckan.net
  12. http://quandl.com
  13. Social Network Analysis Interactive Dataset Library (Social Network Datasets)
  14. Datasets for Data Mining
  15. http://enigma.io
  16. http://www.ufindthem.com/
  17. http://NetworkRepository.com - The First Interactive Network Data Repository
  18. http://MLvis.com
  19. Open Data Inception - A Comprehensive List of 2500+ Open Data Portals in the World
  20. http://data.opendatasoft.com OpenDataSoft catalog

Single datasets and data repositories

  1. http://archive.ics.uci.edu/ml/
  2. http://crawdad.org/
  3. http://data.austintexas.gov
  4. http://data.cityofchicago.org
  5. http://data.govloop.com
  6. http://data.gov.uk/
  7. data.gov.in
  8. http://data.medicare.gov
  9. http://data.seattle.gov
  10. http://data.sfgov.org
  11. http://data.sunlightlabs.com
  12. https://datamarket.azure.com/
  13. http://developer.yahoo.com/geo/g...
  14. http://econ.worldbank.org/datasets
  15. http://en.wikipedia.org/wiki/Wik...
  16. http://factfinder.census.gov/ser...
  17. http://ftp.ncbi.nih.gov/
  18. http://gettingpastgo.socrata.com
  19. http://googleresearch.blogspot.c...
  20. http://books.google.com/ngrams/
  21. http://medihal.archives-ouvertes.fr
  22. http://public.resource.org/
  23. http://rechercheisidore.fr
  24. http://snap.stanford.edu/data/in...
  25. http://timetric.com/public-data/
  26. https://wist.echo.nasa.gov/~wist...
  27. http://www2.jpl.nasa.gov/srtm
  28. http://www.archives.gov/research...
  29. http://www.bls.gov/
  30. http://www.crunchbase.com/
  31. http://www.dartmouthatlas.org/
  32. http://www.data.gov/
  33. http://www.datakc.org
  34. http://dbpedia.org
  35. http://www.delicious.com/jbaldwi...
  36. http://www.faa.gov/data_research/
  37. http://www.factual.com/
  38. http://research.stlouisfed.org/f...
  39. http://www.freebase.com/
  40. http://www.google.com/publicdata...
  41. http://www.guardian.co.uk/news/d...
  42. http://www.infochimps.com
  43. http://www.kaggle.com/
  44. http://build.kiva.org/
  45. http://www.nationalarchives.gov....
  46. http://www.nyc.gov/html/datamine...
  47. http://www.ordnancesurvey.co.uk/...
  48. http://www.philwhln.com/how-to-g...
  49. http://www.imdb.com/interfaces
  50. http://imat-relpred.yandex.ru/en...
  51. http://www.dados.gov.pt/pt/catal...
  52. http://knoema.com
  53. http://daten.berlin.de/
  54. http://www.qunb.com
  55. http://databib.org/
  56. http://datacite.org/
  57. http://data.reegle.info/
  58. http://data.wien.gv.at/
  59. http://data.gov.bc.ca
  60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
  61. http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys: (A collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
  62. http://www.dati.gov.it
  63. http://dati.trentino.it
  64. http://www.databagg.com/
  65. http://networkrepository.com - Network/ML data repository w/ visual interactive analytics
  66. Home (United Nations Environment Programme GRID-Geneva; many GIS datasets)


Some others:

More than 1 TB

  • The 1000 Genomes project makes 260 TB of human genome data available [13]
  • The Internet Archive is making an 80 TB web crawl available for research [17]
  • The TREC conference made the ClueWeb09 [3] dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
  • CNetS at Indiana University makes a 2.5 TB click dataset available [19]
  • ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You'll have to register (with a paper form, not an online one), but it's free. It's about 2.1 TB compressed.
  • The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.

More than 1 GB

  • The Reference Energy Disaggregation Data Set [12] has data on home energy use; it's about 500 GB compressed.
  • The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
  • The ImageNet dataset [18] is pretty big.
  • The MOBIO dataset [14] is about 135 GB of video and audio data
  • The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
  • Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
  • Yandex has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download. It's about 5.6 GB compressed.
  • Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
  • The Open American National Corpus [8] is about 4.8 GB uncompressed.
  • Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
  • The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
  • The wiki-links data made available by Google is about 1.75 GB total [20].

[1] http://imat-relpred.yandex.ru/en...

[2] http://www.icwsm.org/2011/data.php

[3] http://lemurproject.org/clueweb0...

[4] http://wiki.freebase.com/wiki/Da...

[5] http://download.freebase.com/dat...

[6] http://www.kaggle.com/c/wikichal...

[7] http://webscope.sandbox.yahoo.co...

[8] http://americannationalcorpus.or...

[9] http://kddcup.yahoo.com/datasets...

[10] http://horatio.cs.nyu.edu/mit/ti...

[11] https://proteomecommons.org/data...

[12] http://redd.csail.mit.edu/

[13] http://www.1000genomes.org/ftpse...

[14] https://www.idiap.ch/dataset/mobio

[15] http://www-nlp.stanford.edu/pubs...

[16] http://stat-computing.org/dataex...

[17] http://blog.archive.org/2012/10/...

[18] http://www.image-net.org/index

[19] http://cnets.indiana.edu/groups/...

[20] wiki-links - Wikipedia Links Data - Google Project Hosting

[21] The ClueWeb12 Dataset

[22] ClueWeb12 Related Data:


Since we get asked this question by our Machine Learning oriented users very frequently, my company (BigML) has compiled a list with over 250 sources here:
List of Public Data Sources Fit for Machine Learning

You may also want to check out the related blog post for some more context:
Data, Data, Data: Thousands of Public Data Sources


Here is a useful link.
Finding Data on the Internet

Finding Data on the Internet
By RevoJoe on October 6, 2011

The following list of data sources was last modified on 8/19/13. Most of the data sets listed below are free; however, some are not.

If an (R) appears after a source, this means that the data are already in R format or that there exist R commands for importing the data directly from R. (See examples :: intro for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what's out there.

Economics

Finance

Government

Health Care

Machine Learning

Public Domain Collections

Science

Social Sciences

Time Series

Universities
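
For sources that only offer csv files, the import step mentioned above can be quite short. The following is a minimal sketch in Python (not the R commands the original post refers to), using only the standard library. The sample text, the column names (date, series, value), and the helper name load_rows are all hypothetical stand-ins for whatever a downloaded file actually contains.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a downloaded csv file.
SAMPLE_CSV = """date,series,value
2013-01-01,gdp,1.5
2013-04-01,gdp,2.0
2013-07-01,gdp,
"""


def load_rows(text_stream):
    """Parse csv text into a list of dicts.

    The 'value' column (an assumed column name) is converted to float;
    blank or missing cells become None.
    """
    rows = []
    for record in csv.DictReader(text_stream):
        raw = (record.get("value") or "").strip()
        record["value"] = float(raw) if raw else None
        rows.append(record)
    return rows


# Parse the in-memory sample; a real file would be opened with
# open(path, newline="") and passed in the same way.
rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["value"])
```

Reading through csv.DictReader keeps each row keyed by its header name, which tends to be easier to work with than positional indexes when different sources order their columns differently.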


Here is a list of open datasets

Socrata hosts open data websites for a number of governments, government agencies, and non-profits including:

There are also over 100K datasets available on our public data portal, http://opendata.socrata.com

