Byte 4: Machine Learning

Description: This assignment is designed to help you get comfortable applying Machine Learning (ML) and statistics-related skills to classify data.

Overview

In this assignment you will practice a machine learning methodology for selecting and optimizing an algorithm while guarding against overfitting and against biasing your algorithm toward your training data. This improves the ability of your model to predict unseen data. We will be using the American Community Survey, from which we have held out a subset of the data that we will use to test your algorithms.

For this assignment you will use the 2015 American Community Survey data set, which contains detailed demographics for 3 million people living in the US. Your goal is to predict the sex of a person based on other demographic data. The learning goals for this assignment include:

Using libraries that can support statistics and machine learning rather than 'doing it yourself'
Using a statistical test to check for significant differences in things that might predict sex
Using a visualization to sanity check the results you are finding
Developing a feature set to be used for classification
Following the best practices in developing a ML model
Using different classifiers and comparing their performance
Using cross validation to test your classifier
Understanding your results and comparing them to a baseline classifier
Iterating on your features to improve the results
Showing that your results are significantly better than baseline
Developing a report documenting your findings
Discussing the implications of the results and how ML can help us understand the data better
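
As a rough illustration of the statistical-test goal in the list above, the sketch below checks whether a numeric feature and a categorical feature differ between the two sexes. It is only an example under stated assumptions: the data is synthetic rather than the ACS files, the names `sex`, `income`, and `married` and the 0/1 encoding are made up for the example, and the starter notebook may use different tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey (NOT the real ACS data).
n = 2000
sex = rng.integers(0, 2, size=n)  # hypothetical 0/1 encoding of the label
income = rng.normal(loc=40000 + 5000 * sex, scale=15000, size=n)
married = rng.random(n) < (0.5 + 0.1 * sex)

# Numeric feature: Welch's t-test (does not assume equal variances).
t_stat, p_income = stats.ttest_ind(
    income[sex == 0], income[sex == 1], equal_var=False
)

# Categorical feature: chi-square test of independence on a 2x2 count table.
table = np.array(
    [[np.sum((sex == s) & (married == m)) for m in (False, True)] for s in (0, 1)]
)
chi2, p_married, dof, expected = stats.chi2_contingency(table)

print(f"income:  t = {t_stat:.2f}, p = {p_income:.3g}")
print(f"married: chi2 = {chi2:.2f}, dof = {dof}, p = {p_married:.3g}")
```

A small p-value suggests the feature is distributed differently for the two groups and may be worth including, but it says nothing by itself about how much the feature will help a classifier.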

The sample code includes a limited exploration of the effects of income, age, marital status, and occupation on the sex of a person. You will extend this with your own ideas for new features. Beginning programmers should focus on working with features to improve their algorithms.

We demonstrate how to compare algorithms to find out which one is most promising. We encourage expert programmers to try new algorithms. Beginners should instead use these methods to compare the performance of possible feature sets. All of this should take place on the development set. Because our development set is comparatively small, we will use cross-validation to evaluate our algorithms. However, because we also want to optimize the algorithms we are comparing (to ensure we are selecting the best configuration), we will use what we call inner-outer 10-fold cross-validation: an inner cross-validation loop tunes each algorithm's configuration, and an outer loop estimates how well the tuned algorithm performs.
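
One common way to set up inner-outer cross-validation in scikit-learn is to wrap a `GridSearchCV` (the inner loop) inside `cross_val_score` (the outer loop). The sketch below shows that pattern on synthetic data; the parameter grid, the random seeds, and the choice of a decision tree are assumptions for illustration and may differ from what the starter notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the development set (NOT the real ACS data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Inner loop: picks the best configuration using only the outer training folds.
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
# Outer loop: estimates performance of "tuning + fitting" on held-out folds.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

param_grid = {"max_depth": [2, 4, 8]}  # hypothetical grid, for illustration only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=inner_cv
)

# Each outer fold re-runs the whole inner search, so the outer test fold is
# never used to choose parameters.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The point of the nesting is that the score reported by the outer loop is not inflated by the parameter search, since that search never sees the outer test fold.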

We provide starter code for the following 3 algorithms:

Zero R, which always picks the majority class. This is our baseline.
Naive Bayes, which is a fast algorithm based on Bayes' theorem, but with a naive assumption about the independence of features given the class (http://scikit-learn.org/stable/modules/naive_bayes.html)
Decision Tree, which is a non-parametric supervised learning method used for classification that predicts the value of a target variable by learning simple decision rules inferred from the data features (copied from http://scikit-learn.org/stable/modules/tree.html).
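
To make the comparison between these three algorithms concrete, here is a minimal sketch using scikit-learn equivalents. It assumes `DummyClassifier(strategy="most_frequent")` as a stand-in for Zero R and `GaussianNB` as the Naive Bayes variant; the starter code may use different classes or settings, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly imbalanced stand-in data (NOT the real ACS data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, weights=[0.6, 0.4],
    random_state=0,
)

classifiers = {
    "Zero R": DummyClassifier(strategy="most_frequent"),  # majority-class baseline
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# 10-fold cross-validated accuracy for each classifier.
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=10).mean()
    for name, clf in classifiers.items()
}

# Zero R should score roughly the share of the majority class.
majority_share = np.bincount(y).max() / len(y)

for name, acc in mean_accuracy.items():
    print(f"{name:14s} accuracy = {acc:.3f}")
print(f"majority class share = {majority_share:.3f}")
```

Any classifier worth keeping should clearly beat the Zero R line; if it does not, the features are probably not carrying useful signal yet.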

Preparatory Work

It is important that you follow the instructions exactly to minimize setup time! Setup for this assignment will have two paths:

Beginner - for those who have little Python programming experience.
Intermediate/Expert - for those who have good Python programming skills (prior experience with ML is not required).

In both paths you will do this assignment on your local machine, and when done you will create an App Engine application that will describe your process and findings. This assignment has prerequisites, so follow the installation instructions below carefully.

Getting the Starter Code (EVERYONE MUST USE THE STARTER CODE)

We have provided an IPython notebook (attached below), which you should download and place in a directory called '[starter-code]/notebooks/'. You will also need to download the file data.zip from Canvas (found at this url), and unzip it into '[starter-code]' (which will create the directory '[starter-code]/data/'). Take note of the [starter-code] directory.

GUI/Non-Programmer Preparation:

1) Install Jupyter (a Python Notebook) (ONLY if you do not have DataLabs or Jupyter already)

Follow the steps here ONLY if you do not have Google DataLabs or Jupyter installed. For the sake of simplicity and to cover most operating systems, you will install Jupyter notebooks using Anaconda. Go to the Anaconda download page and install the Anaconda 3.6 GUI, which will include Python 3, Jupyter, and all the correct libraries you need for this project.

2) Double check that you have all the libraries you need.

You can double check that you have all the correct libraries by comparing the libraries listed in Anaconda to those listed below (you can even type each into the search area to make this easier).

You should check for:

numpy - matrix and array operations
scipy - scientific computing including statistical methods
scikit-learn - ML library
pandas - data manipulation and exploration
seaborn - data visualization
graphviz - data visualization

If anything is missing, change the menu that says 'Installed' to 'Not Installed' and check the appropriate package. Then press 'Apply' at the bottom right of the window.

Next, you need to install one additional package, pydotplus, which requires the following steps:

First, you need to add a 'channel' (a data repository where libraries can be downloaded) called 'conda-forge'. To do this, click on 'Channels', then click on 'Add' (top right of the window shown below), then enter the text 'conda-forge'.

At this point, you have to hit return (while the cursor is in the text box that says conda-forge), and the interface will change to let you click 'Update channels'.

Finally, you need to search for pydot, select pydotplus's checkbox, and then select the 'Apply' button at the bottom right of the screen. This will bring up a dialog. Hit 'ok' and you can install pydotplus.

Running the Starter Code in Jupyter

In Anaconda, click 'Launch' in the Jupyter Notebook window.

This will result in a window opening in your browser, showing your file system. Navigate to the directory called notebooks in the interactive-machine-learning byte (which you can download from GitHub as described earlier), and click on the file called interactive-ml.ipynb.

This will open a new tab that should look something like this:

Programmer Preparation:

1) Anaconda route:

We recommend Anaconda for installing Jupyter. Go to the Anaconda download page and install the Anaconda 3.6 GUI or Command Line (your choice), which will include Python 3, Jupyter, and all the correct libraries you need for this project. (You can double check this and install further packages using conda install [package name].)

If you instead want to install everything from scratch, you will need to install Xcode and Python 3 on your machine (if they are not yet installed). I use brew for this. You can then use pip3 install [package name] instead of Anaconda. You may also end up on this path if you installed Jupyter at some time in the past using pip3. If you take this approach, you can find more information at the Jupyter Download page, and you will also need to install a number of Python libraries for machine learning and statistical analysis. Here is the complete list of libraries, all of which can easily be installed with pip in this order (you must ensure that each is installed in your environment):

numpy - matrix and array operations
scipy - scientific computing including statistical methods
scikit-learn - ML library
pandas - data manipulation and exploration
seaborn - data visualization
graphviz - data visualization

Note: you have to make sure that the installation method matches the method you used to install Jupyter or DataLabs. This means that if for some reason you decided to use pip to install Jupyter, use pip3 install [package name] instead of Anaconda. This might happen if you installed Jupyter at some time in the past using pip. Similarly, if you downloaded the Anaconda GUI or command line, make sure any packages you install are installed with conda install [package name].
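
If you would rather verify the installation from Python than by eye, a small check like the one below can help. It is a sketch, not part of the starter code: the helper name `missing_packages` is made up, the import names (for example `sklearn` for scikit-learn) are the usual ones, and GraphViz is checked by looking for its `dot` executable on your PATH, which is an assumption about how it was installed.

```python
import importlib.util
import shutil

# Package name (as you would install it) -> module name (as you would import it).
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "pydotplus": "pydotplus",  # optional, only needed to draw the decision tree
}


def missing_packages(required=REQUIRED):
    """Return the package names whose module cannot be found in this environment."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
print("Missing Python packages:", ", ".join(missing) if missing else "none")

# GraphViz itself is a separate program; 'dot' is its command-line renderer.
print("GraphViz 'dot' found:", shutil.which("dot") is not None)
```

Run this from inside Jupyter (or with the same Python that Jupyter uses), since a package installed into a different environment will not show up.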

Do not move on to next steps until you have Jupyter installed.

The starter code also uses GraphViz to visualize a DecisionTree classifier. Note that you must have GraphViz installed for that portion of the code to work! However, if you are unable to install GraphViz or the required library called pydotplus, it is fine. You will be able to skip that one step in the Notebook.
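
For orientation, the sketch below shows roughly what such a visualization step involves and one way to fall back to a plain-text view when GraphViz or pydotplus is unavailable. This is an illustrative assumption, not the notebook's actual code: the data is synthetic, the feature names are placeholders, and the text fallback via `export_text` requires a reasonably recent scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Tiny synthetic problem (NOT the real ACS data) with placeholder feature names.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Producing the DOT description needs only scikit-learn.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=feature_names, filled=True
)

try:
    # Rendering needs both the pydotplus library and the GraphViz program.
    import pydotplus

    png_bytes = pydotplus.graph_from_dot_data(dot_source).create_png()
    rendered = True
except Exception:
    # Either piece is missing or failed: show the tree as indented text instead.
    rendered = False
    print(export_text(tree, feature_names=feature_names))

print("Rendered with GraphViz:", rendered)
```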

Running the Starter Code in Jupyter

Start a console on your local machine and change directory to where you placed the starter code.

cd [starter-code]

To run a Jupyter notebook you need to go to a directory that contains your Python notebooks and run Jupyter from there. Use your local command line to change to the notebooks directory in your assignment directory.

cd [starter-code]/notebooks

Then start Jupyter by executing the command below in the notebooks directory:

jupyter notebook

If everything went well, this will open up a page in your browser that looks like this:

Click on interactive-ml.ipynb and this will bring up the notebook for you:

If you have DataLabs (Google's Jupyter version on steroids) installed, then you probably already know how to run it (you used it in a previous assignment). As a reminder, see the instructions here.

Byte Instructions (both paths)

You will now use the Python notebook to complete the ML portion of your assignment. The notebook contains detailed instructions on how to complete the assignment. If you simply execute the sections of the notebook (IN ORDER!) you will see a fully working example of the ML methodology we use to explore the data and train and test a classifier.

You are expected to execute the code in the notebook and answer the questions as you go. You must answer the questions in the Notebook and include those answers later in your assignment hand-in. To receive full credit on this assignment, you are expected to modify the notebook to reflect additional exploratory analysis.

If you are a beginning programmer, we recommend that you focus on feature engineering (visualizing features using the functions already provided, and studying the codebook, domain, and data sufficiently to develop intuition about which features will help).

If you are an expert programmer, feature engineering is still an option, but you may also wish to explore computational methods such as automated feature selection, and changes to the classifier parameters (including adding another classifier of your choice).
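
As one possible starting point for the computational route, the sketch below combines automated feature selection with a search over classifier parameters in a single scikit-learn pipeline. The specific choices (`SelectKBest` with `f_classif`, a decision tree, and the grid values) are assumptions for illustration, and the data is synthetic; treat it as a pattern to adapt, not as the expected solution.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the real ACS data): 12 features, 4 informative.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)

# Putting selection inside the pipeline means it is re-fit on each training
# fold, so the held-out fold never influences which features are kept.
pipeline = Pipeline(
    [
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", DecisionTreeClassifier(random_state=0)),
    ]
)

# Hypothetical grid: number of features to keep and maximum tree depth.
param_grid = {"select__k": [2, 4, 8], "clf__max_depth": [2, 4, 8]}

search = GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Note that `best_score_` comes from the same folds used to pick the parameters, so it is optimistic; for a fair estimate the whole search should itself be evaluated inside an outer cross-validation loop.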

To get the full 100% on this assignment, you have to beat the best current classifier in the development stage (more about this below).

We will evaluate your final classifiers and present the best performing ones in class.

Hand In Expectations

The tutorial above gets you to the point where you can test for differences between males and females, and generate a classifier that performs better than the baseline. You will be asked to hand in a report, in the form of your Jupyter notebook, that answers the following questions (note that you should include charts to support your argument).