Day 7
Today
Decision trees
Peer sharing of data exploration from Kaggle
For Next Time
Complete the ipython notebook on TF-IDF. The notebook is located in the repo under preclass/Exploring Tf-IDF.ipynb.
Decision Trees
Building Intuition with 20 Questions
Let's start out by playing a game. With the folks sitting around you, play either dictator / sitcom or 20 questions. Was the system able to correctly guess what (or who) you were thinking of? Any interesting observations regarding the questions that the system asked you?
A Familiar Example
Conveniently, the Wikipedia page for Decision Tree Learning has an example that you should be quite familiar with.
This is a decision tree. Trees are always drawn upside down. By this I mean that the trunk of the tree (where you should start) is drawn at the top, whereas the leaves (where you will reach a decision regarding survival) are at the bottom. To classify a new datapoint, you would start at the top. At each node you will either make a final decision (if the node is labeled died or survived), or you will descend to the left or the right depending on the outcome of the Boolean test at the current node (in the diagram above, left is for True and right is for False). Each leaf node has two numbers: a proportion and a percentage. The proportion is the fraction of people at the leaf node that survived (from the training set). The percentage is the percent of passengers in the training set that would end up at that leaf node.
Here is a more complex version of a decision tree fit to the Titanic data. Unfortunately, it is pretty hard to read; click on the image to see it at full size.
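The classification procedure described above (start at the top, follow the outcome of each Boolean test, stop at a leaf) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the `classify` function, the feature names (`sex`, `age`, `sibsp`), and the thresholds are all assumptions made for this sketch, not values read off either diagram.

```python
# A hypothetical hand-built tree. Internal nodes hold a Boolean test and
# two children; leaf nodes hold only a label. Feature names and thresholds
# are illustrative assumptions, not taken from the diagrams.
tree = {
    "test": lambda p: p["sex"] == "male",
    "true": {
        "test": lambda p: p["age"] > 9.5,
        "true": {"label": "died"},
        "false": {
            "test": lambda p: p["sibsp"] > 2.5,
            "true": {"label": "died"},
            "false": {"label": "survived"},
        },
    },
    "false": {"label": "survived"},
}


def classify(node, passenger):
    """Walk from the root to a leaf and return that leaf's label."""
    while "label" not in node:
        # Evaluate the Boolean test and descend into the matching child.
        branch = "true" if node["test"](passenger) else "false"
        node = node[branch]
    return node["label"]


# Example datapoint: a 30-year-old woman travelling alone.
prediction = classify(tree, {"sex": "female", "age": 30, "sibsp": 0})
```

Note that classifying a datapoint only ever touches one path from the root to a leaf, so the cost of a prediction grows with the depth of the tree rather than with the number of nodes.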
Decision Tree Construction
Now that we know how to apply an already constructed decision tree to new data, we want to understand how to construct a decision tree in the first place. Here is a really nice visual introduction to this topic. Take a few minutes to read through it.
Rage Against the Machine!!!
Next, I challenge you to build a better decision tree than scikit-learn. To run the demo application, you need to make sure you have OpenCV installed. Unfortunately, the version of OpenCV you will get if you do conda install opencv is not built with support for the GUI features of OpenCV. Instead, use this command:
conda install -c https://conda.anaconda.org/menpo opencv
To compete, pull the latest changes from the DataScience16 repo. You will find the relevant script in inclass/day07. Execute the script by typing:
$ python learn_interactive.py
I'll give a brief demo of how the program works.
Optimal Splits
Next, you will explore one method of creating an optimal split. There is an ipython notebook that walks you through the basic idea and gives you a function stub. If you don't finish this in class, that's fine. Finishing it up on your own time is completely optional. The notebook is called inclass/day07/Best Split.ipynb.
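To give a feel for what "optimal split" can mean, here is a minimal sketch that scores every candidate threshold on a single numeric feature by weighted Gini impurity and keeps the lowest-scoring one. Gini impurity is one common criterion; the notebook may use a different measure and a different function signature, so treat the names `gini` and `best_split` as hypothetical and not as the solution to the function stub.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split of the form
    `value <= threshold`. Returns (None, impurity_of_all_labels) when no
    split improves on leaving the data unsplit."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each side's impurity by the share of points it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data (made up for illustration): small values are class 0, large are class 1.
threshold, impurity = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

On this toy data the two classes are perfectly separable, so the search settles on the midpoint between 3 and 10 with an impurity of zero. This sketch rebuilds the left and right lists for every candidate, which is fine for small examples but slow on large datasets.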
Peer Sharing of Data Exploration from Kaggle
We will divide up based on which dataset you are studying. If you are doing SF crime you will go to AC126. If you are doing movie sentiment, you will stay in AC326.
Once you get to the correct room, you will be sharing the visualizations and explorations of your chosen dataset. You should use the ipython notebook as a way to walk the audience through what you have found. Since there are quite a few people studying each dataset, you should use the projector to show your work.
Here are some guiding questions to consider when going through these explorations.
What are the potential implications of the exploration / visualization for solving the predictive task? For instance, does the exploration suggest a particular machine learning technique that would be successful? Does the exploration suggest a particular set of features that are either useful or not useful for the predictive task?
What additional explorations might enrich the story?
If in the audience, did you perform an analysis that adds to the story of the team at the front of the room?
Here are some procedural suggestions for this activity.
Pick a team to volunteer to kick off the session. This team should give a brief overview of the dataset to remind everyone what it contains. The team should then go through some interesting points in their ipython notebook.
Maximize discussion and minimize talking at the audience. If you have something to add to enrich the story being shown on the projector, speak up! Ask interesting questions.
If you are not currently presenting, but you have a visualization or exploration that you feel must be shared right away, go ahead and jump up, plug your laptop in, and show your work off.
Once the first team is done, the next team should go. Try to avoid showing content that has already been shown by other teams. If you do, just say something like "just like some of the other teams we visualized X versus Y". Don't dwell on material that has already been presented.
Take notes on interesting things and who you should follow up with to learn more.