Machine Learning / Data Mining / Natural Language Processing
I maintain this list of popular machine learning datasets. I do not host these datasets; I only provide links to their sources.
For datasets that I used in my papers, please refer to the "publications" page.
Pre-processed versions (mostly as text files or MATLAB files)
If, like me, you are mostly concerned with the machine learning part and do not want to bother with the preprocessing, here are some pre-processed datasets in matrix format:
Gene Expression Analysis Datasets