Day 7

Today

For Next Time

Decision Trees

Building Intuition with 20 Questions

Let's start out by playing a game.  With the folks sitting around you, play either dictator / sitcom or 20 questions.  Was the system able to correctly guess what (or who) you were thinking of?  Did you notice anything interesting about the questions the system asked you?

A Familiar Example

Conveniently, the Wikipedia page for Decision Tree Learning has an example that you should be quite familiar with.

This is a decision tree.  Trees are always drawn upside down.  By this I mean that the trunk of the tree (where you should start) is drawn at the top, whereas the leaves (where you will reach a decision regarding survival) are at the bottom.  To classify a new data point, you start at the top.  At each node you either make a final decision (if the node is labeled died or survived), or you descend to the left or the right depending on the outcome of the Boolean test at the current node (in the diagram above, left is for True and right is for False).  Each leaf node has two numbers: a proportion and a percentage.  The proportion is the fraction of people at that leaf node who survived (in the training set).  The percentage is the percent of passengers in the training set that would end up at that leaf node.
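The traversal described above can be written out directly as nested conditionals.  Here is a minimal sketch using the splits and thresholds from the Wikipedia Titanic tree (the feature names `sex`, `age`, and `sibsp` follow that example):

```python
# Classify one passenger by walking the Wikipedia Titanic tree:
# each internal node is a Boolean test, and each leaf returns a decision.
def classify(passenger):
    if passenger["sex"] == "male":
        if passenger["age"] > 9.5:
            return "died"
        elif passenger["sibsp"] > 2.5:  # number of siblings/spouses aboard
            return "died"
        else:
            return "survived"
    else:
        return "survived"

print(classify({"sex": "female", "age": 30, "sibsp": 0}))  # survived
print(classify({"sex": "male", "age": 40, "sibsp": 0}))    # died
```

A fitted decision tree is nothing more than a learned version of this kind of nested if/else structure.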

Here is a more complex decision tree fit to the Titanic data (unfortunately, it is pretty hard to read; click on the image to see it at full size).

Decision Tree Construction

Now that we know how to apply an already constructed decision tree to new data, we want to understand how to construct a decision tree in the first place.  Here is a really nice visual introduction to this topic.  Take a few minutes to read through it.
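Before constructing a tree by hand, it may help to see how little code it takes to have scikit-learn construct one for you.  This is just a sketch on made-up toy data (the features and labels below are invented for illustration, not from the Titanic dataset):

```python
# A quick sketch of letting scikit-learn construct a decision tree.
from sklearn.tree import DecisionTreeClassifier

# Toy data: each row is [age, is_male]; labels are 1 = survived, 0 = died.
X = [[22, 1], [38, 0], [26, 0], [35, 0], [35, 1], [54, 1]]
y = [0, 1, 1, 1, 0, 0]

# Limit the depth so the learned tree stays easy to inspect.
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)

# On this toy data sex separates the classes perfectly, so the tree
# should split on it: female (is_male=0) -> survived, male -> died.
print(clf.predict([[30, 0], [30, 1]]))  # [1 0]
```

The interesting question, which the visual introduction addresses, is how `fit` decides which feature and threshold to split on at each node.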

Rage Against the Machine!!!

Next, I challenge you to build a better decision tree than scikit-learn.  To run the demo application, you need to make sure you have OpenCV installed.  Unfortunately, the version of OpenCV you will get from conda install opencv is not built with support for OpenCV's GUI features.  Instead, use this command:

conda install -c https://conda.anaconda.org/menpo opencv

To compete, pull the latest changes from the DataScience16 repo.  You will find the relevant script in inclass/day07.  Execute the script by typing:

$ python learn_interactive.py

I'll give a brief demo of how the program works.

Optimal Splits

Next, you will explore one method of finding an optimal split.  There is an IPython notebook that walks you through the basic idea and gives you a function stub.  If you don't finish this in class, that's fine; finishing it up on your own time is completely optional.  The notebook is inclass/day07/Best Split.ipynb.
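To give a sense of what an optimal-split function involves, here is one common approach: score each candidate threshold by the size-weighted Gini impurity of the two groups it creates, and keep the threshold with the lowest score.  This is a sketch of the general idea, not the notebook's stub; the function names here are my own.

```python
# Sketch of finding the best split on a single numeric feature
# using weighted Gini impurity (one common splitting criterion).
def gini(labels):
    """Gini impurity of a list of class labels (0.0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, score) minimizing the size-weighted Gini
    impurity of the two groups: values <= threshold vs. values > threshold."""
    n = len(values)
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Perfectly separable toy data: splitting at 2 yields two pure groups.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.0)
```

A full tree builder would apply this search over every feature, pick the best (feature, threshold) pair, and recurse on each side of the split.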

Peer Sharing of Data Exploration from Kaggle

We will divide up based on which dataset you are studying.  If you are doing SF crime you will go to AC126.  If you are doing movie sentiment, you will stay in AC326.

Once you get to the correct room, you will share the visualizations and explorations of your chosen dataset.  Use your IPython notebook to walk the audience through what you have found.  Since there are quite a few people studying each dataset, you should use the projector to show your work.

Here are some guiding questions to consider as you go through these explorations.

Here are some procedural suggestions for this activity.