I400/590 : Large-Scale Social Phenomena - Data Mining Demo
For your mid-term hack-a-thons, you will be expected to quickly acquire, analyze, and draw conclusions from some real-world datasets. The goal of this tutorial is to provide you with some tools that will hopefully enable you to spend less time debugging and more time generating and testing interesting ideas.
Here, I chose to focus on Python. It is a beautiful language that is quickly developing an ecosystem of powerful and free scientific computing and data mining tools (e.g. the Homogenization of scientific computing, or why Python is steadily eating other languages' lunch). For this reason, as well as my own familiarity with it, I encourage (though certainly do not require) you to use it for your mid-term hack-a-thons. From my own experience, getting comfortable with these tools will pay off by making many future data analysis projects (including perhaps your final projects) easier and more enjoyable.
Hopefully you already have Python installed. If you are new to it, search around for good introductory tutorials -- I'd say it has a forgiving learning curve, comparatively speaking.
IPython is a kind of add-on for Python that brings several improvements; most important for us are its interactive, graphical notebooks, which provide a great way to quickly develop and share code. A gallery of interesting notebooks is provided here, including:
- An introduction to IPython notebooks
- Replications of Nate Silver's results in predicting the 2012 presidential elections, along with other assignments from Harvard's CS109 Data Science course.
- Open-content for self-directed learning in Data Science - great notebooks illustrating some basic concepts in machine learning (Linear Regression, Logistic Regression, Random Forests, K-Means Clustering)
- A set of notebooks for Psychological model fitting and testing
In this demo, we will cover the basic functionality of several useful toolboxes, including:
- Numpy - a library for manipulating matrices and performing linear algebra operations (matrix products, computing covariance matrices, etc.). All the other libraries build on the data structures provided by numpy. Here's a basic tutorial.
- Scipy - a library that builds on top of numpy and provides more advanced linear algebra routines, statistical tools, and sparse matrices (useful in many applications, such as text mining)
- Matplotlib - plotting library, similar in style to that used by MATLAB. Here's the official gallery (click on a plot to see the code to generate it) and here's an IPython notebook with many good examples.
- Scikit-learn (a.k.a. sklearn) - machine learning library, making it easy to do anything from PCA to training classifiers. Here's a very quick quick start, and here's some tutorials showing how to do common machine learning tasks.
- Pandas - a library for managing data sets. It makes difficult or tedious tasks (loading / processing / cleaning data) a breeze. There is a 10-minute intro video here, a tutorial here, and another series of tutorials by Hernan Rojas.
- Networkx - a library for working with networks. This includes computing graph-theoretic measures, as well as laying out and plotting graphs. A brief tutorial from the docs is here.
- NLTK - the Natural Language Toolkit, which can be very useful for text mining and language processing tasks. We won't demonstrate it during the in-class demo, but you can see its utility in tutorials such as this and this, as well as the documentation on the NLTK site.
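To give a feel for the numpy/scipy side of the list above, here is a minimal sketch of basic array and linear algebra operations (the specific arrays and the t-test are just illustrative choices):

```python
import numpy as np
from scipy import stats

# A 2-D array (matrix) and a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])

product = A.dot(b)   # matrix-vector product -> array([3., 7.])
cov = np.cov(A)      # covariance matrix (rows treated as variables)

# scipy builds statistical routines on top of numpy arrays,
# e.g. a two-sample t-test on random data
t_stat, p_value = stats.ttest_ind(np.random.randn(100), np.random.randn(100))
```

Everything else in the stack (pandas DataFrames, sklearn estimators, networkx adjacency structures) ultimately stores its data in numpy arrays like these.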
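A matplotlib plot in its most basic form looks like the sketch below; the `Agg` backend and the output filename are just choices that let it run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, not a window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")  # write the figure to disk
```

In an IPython notebook you would instead run `%matplotlib inline` and the figure would appear directly below the cell.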
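The "anything from PCA to training classifiers" claim about scikit-learn can be sketched in a few lines. This assumes a recent sklearn version (the `model_selection` module and `return_X_y` argument) and uses the built-in iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# PCA: project the 4-D data down to 2 dimensions
X2 = PCA(n_components=2).fit_transform(X)

# Train a classifier and evaluate it on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Nearly every sklearn estimator follows this same `fit` / `predict` / `score` pattern, which is what makes the library easy to pick up.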
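The kind of loading / processing task pandas makes easy can be sketched with a toy DataFrame (the city names and temperatures below are made up for illustration):

```python
import pandas as pd

# One row per observation; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "city": ["Bloomington", "Chicago", "Bloomington", "Chicago"],
    "temp": [20.5, 18.0, 22.0, 19.5],
})

# Split-apply-combine: average temperature per city
mean_by_city = df.groupby("city")["temp"].mean()

# Boolean indexing: keep only the warmer observations
hot = df[df["temp"] > 19.0]
```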
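For networkx, computing graph-theoretic measures on a small hand-built graph looks like this (the four-node graph is just an example):

```python
import networkx as nx

# Build a small undirected graph: a triangle a-b-c with a pendant node d
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

degrees = dict(G.degree())        # degree of each node
clustering = nx.clustering(G)     # local clustering coefficient per node
shortest = nx.shortest_path(G, "a", "d")
```

Plotting works through matplotlib (e.g. `nx.draw(G)`), using one of networkx's layout algorithms to position the nodes.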
You can find more useful links at Python for data analysis: the landscape of tutorials.
Even simple analysis can extract interesting results from good data, but nothing can make up for bad data. There are a few places to find potential data sets, including publicly-available data sources (Tableau software provides some data sets, and there is a directory of APIs and data sources at ProgrammableWeb).
Another common strategy is to scrape data from the web. There is an automatic tool to do so built by Kimono (which I haven't used but looks impressive). Python has several tools for this (many discussed here). We will focus on the combination of mechanize (which simulates a browser to download HTML) and BeautifulSoup (which parses the downloaded HTML). There are some good tutorials showing how to use mechanize and BeautifulSoup here and here.
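The division of labor between the two libraries can be sketched as follows. To keep the example self-contained (no network access), the HTML is a static string here; in a real scraper it would come from mechanize, as shown in the comments:

```python
from bs4 import BeautifulSoup

# In a real scraper, the HTML would be fetched with mechanize, e.g.:
#   import mechanize
#   br = mechanize.Browser()
#   html = br.open("http://example.com").read()
# Here we parse a hand-written string instead.
html = """
<html><body>
  <h1>Headlines</h1>
  <ul>
    <li><a href="/story1">First story</a></li>
    <li><a href="/story2">Second story</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Extract (link text, URL) pairs from every anchor tag
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
```

BeautifulSoup's `find_all` and attribute access cover most scraping needs; mechanize becomes important when a site requires logging in or submitting forms before the data is visible.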
There are two ways to install the aforementioned tools. The first is to use a Python distribution that already comes with all of these included, such as Anaconda.
The second option is to use pip, Python's package installer. Once you have Python installed, run the following on the command line:
pip install -U ipython numpy scipy matplotlib scikit-learn pandas networkx nltk
pip install -U mechanize beautifulsoup4
You may want to run this as administrator to install the packages system-wide. If this option doesn't work for you, I recommend trying Anaconda.