
20 Weird & Wonderful Datasets for Machine Learning

They say great data is 95% of the problem in machine learning. We saw firsthand at Udacity that this is the case, given the amazing reception from the machine learning community when we open sourced over 250GB of driving data. But finding interesting data is really hard, and that difficulty actively holds the industry back. In trying to learn more about this problem, I searched far and wide and cataloged just a sliver of the datasets I found.

In the hope that others might find this catalog useful, here are 20 weird and wonderful datasets you could (perhaps) use in machine learning.

| Name | Purpose | File Size |
|---|---|---|
| 20 Newsgroups | The text from 20000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc. | 61.6MB |
| Amazon Reviews | Over 142 million product reviews for sentiment analysis, recommender systems, and more. | 20GB |
| Football Strategy | Thousands of scenarios to make the best coaching decisions. | 876KB |
| Horses for Courses | Horse-racing data for predicting race results. | 19MB |
| Human Activity Recognition with Smartphones | Sensor data for recognizing the human activity - walking, sitting, etc. | 25MB |
| Labeled Faces in the Wild | 13,000 named faces for facial recognition. Multiple training and test sets. | 173MB |
| National Survey on Drug Use and Health | Predict drug use based on health survey questions. | 2GB |
| NORB 3D Object Recognition | Binocular images of 50 toy figurines for 3D object recognition from image. | Multiple files, over 5GB total |
| One Million Songs | Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification. | 1.8GB |
| SMS Spam Collection | A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. | 204KB |
| Hate Speech Identification | A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. | 2.66MB |
| Hidden Beauty of Flickr Pictures | 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. | 138KB, use Flickr API to get images |
| Yahoo Instant Messenger Friends Connectivity Graph | Connections between Yahoo users who communicate with each other using Yahoo messenger, can be used to identify key social contacts/influencers. Add dataset to cart to access. | 28MB |
| Record of Heart Sound | Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc. | 47.7MB |
| Prostate Cancer | Tumor and nontumor samples, used to recognize prostate cancer. | 4.8MB |
| Wine Quality | Chemical properties of red and white wines (separately) and quality, for classification. | 3 files, 343KB total |
| Mushroom Identification | For hypothetically classifying mushrooms as edible or poisonous based on its characteristics. | 3 files, 480KB |
| UFO Reports | 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at | |
| Militarized Interstate Disputes | Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. | Multiple datasets, e.g., 962KB, 179KB |
| NBA & MLB Stats | Current and past season stats for teams and players for fantasy sports predictions. | Multiple datasets, e.g., 2016 MLB batters = 65KB |

Caveat: I haven’t validated that all of these datasets are actually useful for machine learning (in terms of size or accuracy). Use your own judgement when playing with them (and check licenses)!

My favorite? The 80,000+ UFO reports dataset:

| Date/time | City | State | Country | Shape | Duration (seconds) | Duration (as reported) | Comments | Date posted | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit | | | |
| 10/10/1949 21:00 | lackland afb | tx | | light | 7200 | 1-2 hrs | 1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime. | 12/16/2005 | 29.38421 | -98.581082 |
| 10/10/1955 17:00 | chester (uk/england) | | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester, England | 1/21/2008 | 53.2 | -2.916667 |
| 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I took a different route home | 1/17/2004 | 28.9783333 | -96.6458333 |
| 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise, I was at 50,000' in a "clean" aircraft (no ordinan | 1/22/2004 | 21.4180556 | -157.8036111 |
| 10/10/1961 19:00 | bristol | tn | us | sphere | 300 | 5 minutes | My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he's still livi | 4/27/2007 | 36.5950000 | -82.1888889 |
| 10/10/1965 21:00 | penarth (uk/wales) | | gb | circle | 180 | about 3 mins | penarth uk circle 3mins stayed 30ft above me for 3 mins slowly moved of and then with the blink of the eye the speed was unreal | 2/14/2006 | 51.434722 | -3.18 |
| 10/10/1965 23:45 | norwalk | ct | us | disk | 1200 | 20 minutes | A bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines. | 10/2/1999 | 41.1175000 | -73.4083333 |
| 10/10/1966 20:00 | pell city | al | us | disk | 180 | 3 minutes | Strobe Lighted disk shape object observed close, at low speeds, and low altitude in Oct 1966 in Pell City Alabama | 3/19/2009 | 33.5861111 | -86.2861111 |
| 10/10/1966 21:00 | live oak | fl | us | disk | 120 | several minutes | Saucer zaps energy from powerline as my pregnant mother receives mental signals not to pass info | 5/11/2005 | 30.2947222 | -82.9841667 |
| 10/10/1968 13:00 | hawthorne | ca | us | circle | 300 | 5 min. | ROUND , ORANGE , WITH WHAT I WOULD SAY WAS POLISHED METAL OF SOME KIND AROUND THE EDGES . | 10/31/2003 | 33.9163889 | -118.3516667 |
| 10/10/1968 19:00 | brevard | nc | us | fireball | 180 | 3 minutes | silent red /orange mass of energy floated by three of us in western North Carolina in the 60s | 6/12/2008 | 35.2333333 | -82.7344444 |
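
If you want to play with rows like the ones above, here is a minimal Python sketch of how they might be parsed. It rests on a few assumptions that are mine, not the dataset's: that the raw file is comma-separated, that the eleven fields appear in the order shown in the sample, and that commas and apostrophes inside comments are encoded as `&#44` and `&#39` (as they appear in the sample). The column names, `SAMPLE`, `clean_text`, and `parse_rows` are all hypothetical names chosen for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical column names, inferred from the sample rows (not an official schema).
COLUMNS = [
    "datetime", "city", "state", "country", "shape",
    "duration_seconds", "duration_text", "comments",
    "date_posted", "latitude", "longitude",
]

# Two rows from the sample, written as comma-separated text.
# Assumption: the raw file is a plain CSV with fields in this order.
SAMPLE = (
    "10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"
    "1949 Lackland AFB&#44 TX. Lights racing across the sky & making "
    "90 degree turns on a dime.,12/16/2005,29.38421,-98.581082\n"
    "10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"
    "Green/Orange circular disc over Chester&#44 England,"
    "1/21/2008,53.2,-2.916667\n"
)

# Only the two encoded characters seen in the sample are handled here.
ENTITIES = {"&#44": ",", "&#39": "'"}


def clean_text(text):
    """Replace the encoded comma and apostrophe with the real characters."""
    for entity, char in ENTITIES.items():
        text = text.replace(entity, char)
    return text


def parse_rows(raw):
    """Parse comma-separated report rows into dictionaries with typed fields."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = dict(zip(COLUMNS, fields))
        row["datetime"] = datetime.strptime(row["datetime"], "%m/%d/%Y %H:%M")
        row["duration_seconds"] = float(row["duration_seconds"])
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        row["comments"] = clean_text(row["comments"])
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row["datetime"].year, row["city"], row["shape"], row["comments"])
```

The explicit replacement table is a deliberate choice: the encoded characters have no trailing semicolon, so a generic entity decoder may read a fragment such as `50&#44000` as one large character reference rather than as "50,000". Real rows may also contain values this sketch does not handle, so treat it as a starting point.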

I’ve also been fascinated by the Militarized Interstate Disputes dataset, which covers nearly 200 years of international threats and conflicts. It includes the action taken, level of hostility, fatalities, and outcomes.