Machine Learning / Data Mining / Natural Language Processing

I maintain this list of popular machine learning datasets. I do not host these datasets; I only link to their sources.
For datasets that I used in my papers, please refer to the "publications" page.

Text Corpora

  • TechTC (Technion repository of text categorization datasets)

    Pre-processed versions (mostly as text files or MATLAB files)
        If, like me, you are mostly concerned with the machine learning part and do not want to bother with the processing, here are some of the pre-processed datasets in matrix format:
    • 20-Newsgroup
    • Reuters-21578
    • WebKB
    • Cade 12
    • Word Counts from Encyclopedia Articles
    • PNAS titles
    • NIPS conference papers (vol 1-12)
    • TDT2 (top 30 categories)
    • 20-Newsgroup
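
As a rough illustration of what "matrix format" can mean for a text file, here is a minimal Python sketch. It assumes a hypothetical layout in which each line holds a 1-based `doc_id word_id count` triplet; the files behind the links above may well use a different layout, so check each source's own description first. The names `load_triplets` and `to_dense` are made up for this example.

```python
from collections import defaultdict


def load_triplets(lines):
    """Parse 'doc_id word_id count' lines into {doc_id: {word_id: count}}.

    The triplet layout is an assumption for illustration only.
    """
    matrix = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        doc_id, word_id, count = line.split()
        matrix[int(doc_id)][int(word_id)] = int(count)
    return dict(matrix)


def to_dense(matrix, n_docs, n_words):
    """Expand the nested dict into a dense document-by-word list of lists."""
    dense = [[0] * n_words for _ in range(n_docs)]
    for doc_id, row in matrix.items():
        for word_id, count in row.items():
            # Ids are assumed to be 1-based, so shift to 0-based indices.
            dense[doc_id - 1][word_id - 1] = count
    return dense


# Tiny in-memory sample standing in for a real file's lines.
sample = ["1 1 2", "1 3 1", "2 2 4"]
sparse = load_triplets(sample)
dense = to_dense(sparse, n_docs=2, n_words=3)
print(dense)  # [[2, 0, 1], [0, 4, 0]]
```

For a real file, passing an open file object (for example `load_triplets(open(path))`, where `path` is wherever you saved the download) should work the same way, since the function only iterates over lines.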

Gene Expression Analysis Datasets
  • MLL (mixed lineage leukemia) dataset
  • Yeast gene regulation prediction dataset (from KDD Cup 2002)
Other Datasets
  • Faces
    • UMass Labeled Faces in the Wild
    • Frey Faces
    • Olivetti Faces
    • UMIST Faces