Collecting, Analyzing And Interacting With Data

Byte 5 Results

Post date: Apr 7, 2015 8:37:19 PM

Hi all,

Here is a brief summary of the Byte 5 results. Overall, really nice job to all of you!

First off, not everyone was specific about which subset of the data they were looking at, if any (i.e. only cats, only dogs, or cats + dogs), or about how missing values were dealt with (eliminate data? impute? etc.). Those choices alone led to quite different results, both in accuracy and in the structure of your decision trees. Something to consider going forward. I present the results below as if these differences didn't exist, since I mostly don't know much about them. This also means you may have trouble replicating each other's results.
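
To make the missing-value point concrete, here is a minimal sketch of the two choices mentioned above: eliminating rows versus imputing the most common value. The records and field names are made up for illustration and are not the actual Byte 5 data, and imputing the mode is only one of several reasonable strategies.

```python
from collections import Counter

# Hypothetical toy records; field names are illustrative, not the real Byte 5 columns.
records = [
    {"Species": "Cat", "Size": "Small"},
    {"Species": "Dog", "Size": None},
    {"Species": "Dog", "Size": "Large"},
    {"Species": "Cat", "Size": "Small"},
]

def drop_missing(rows, field):
    """Option 1: eliminate every row where `field` is missing."""
    return [dict(r) for r in rows if r[field] is not None]

def impute_mode(rows, field):
    """Option 2: fill missing values of `field` with its most common observed value."""
    counts = Counter(r[field] for r in rows if r[field] is not None)
    mode = counts.most_common(1)[0][0]
    return [dict(r, **{field: mode}) if r[field] is None else dict(r) for r in rows]

dropped = drop_missing(records, "Size")
imputed = impute_mode(records, "Size")
print(len(dropped), len(imputed))
```

The two versions of the data have different sizes and different class balances, which is exactly why it helps to state which one you used when reporting accuracy.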

Interesting things people tried with the features:

Categorizing breed into pure, mix, combo & unknown (& many other variants on simplifying breed across various assignments)
Changed IntakeMonth to the average ambient temperature in Louisville, KY for each given month to account for periodicity and to test the possibility that weather plays a role in adoption.
Modified ‘Size’ into 3 categories only (Small, Medium, Large). (& other variations on this idea)
Modified ‘Color’ into 10 categories only (Black, Blue, Pattern, Mix, etc.)
Tried 'Season' as a new feature, including 4 categories (spring, summer, autumn, winter). It turns out the accuracy rises when season is included.
Modified 'Age' to 3 categories (Young, Old, Unknown).
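
A rough sketch of what a few of these feature transformations could look like in code. Everything here is an assumption made for illustration: the function names, the rule for telling 'mix' from 'combo' (the word "Mix" versus a "/" in the breed string), the month-to-season mapping (Northern Hemisphere meteorological seasons), and the age cutoff are not taken from anyone's actual submission.

```python
def season_from_month(month):
    """Map a month number (1-12) to a season. Assumed mapping, not from the assignment."""
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "autumn"
    if month in (12, 1, 2):
        return "winter"
    raise ValueError("month must be between 1 and 12")

def breed_category(breed):
    """Collapse a free-text breed into pure / mix / combo / unknown (assumed rules)."""
    text = (breed or "").strip()
    if not text or text.lower() == "unknown":
        return "unknown"
    if "/" in text:                      # e.g. "Beagle/Pug"
        return "combo"
    if "mix" in text.lower().split():    # e.g. "Labrador Retriever Mix"
        return "mix"
    return "pure"

def age_bucket(age_years, old_at_years=3.0):
    """Bucket age into Young / Old / Unknown. The cutoff is a hypothetical parameter."""
    if age_years is None:
        return "Unknown"
    return "Old" if age_years >= old_at_years else "Young"

print(season_from_month(4), breed_category("Beagle/Pug"), age_bucket(None))
```

Whatever rules you choose, writing them down as small functions like these makes it much easier for someone else to replicate your features.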

Interesting analyses people did:

Showing a confusion matrix to explore which outcomes were leading to the biggest accuracy problems (e.g. 'other' in one case)
Many people tried removing features. While this can sometimes have value, I prefer to see specific reasons for removing those features. Many of you did this to the exclusion of other options, such as pruning the decision tree or creating new, more informative features.
Several people tried an SVM classifier and found that it beat out the others
'Dummy' classifier achieved accuracies of around 50%
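
For reference, here is a minimal sketch of two of the analyses above: counting a confusion matrix, and comparing against a simple baseline. The labels and predictions are invented for illustration, and the baseline shown (always predict the most common training label) is only one common kind of 'dummy' strategy; I am not assuming it is the one any of you used.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs; off-diagonal pairs are errors."""
    return Counter(zip(y_true, y_pred))

def majority_baseline(y_train, n):
    """Baseline that always predicts the most common label seen in training."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical outcomes and predictions, not real Byte 5 results.
y_true = ["adopted", "adopted", "other", "returned", "other", "adopted"]
y_pred = ["adopted", "adopted", "adopted", "returned", "adopted", "adopted"]

cm = confusion_counts(y_true, y_pred)
baseline_pred = majority_baseline(y_true, len(y_true))

for (true_label, pred_label), count in sorted(cm.items()):
    print(true_label, "->", pred_label, count)
print("model accuracy:", accuracy(y_true, y_pred))
print("baseline accuracy:", accuracy(y_true, baseline_pred))
```

In this toy example every 'other' case is predicted as 'adopted', which is the kind of pattern a confusion matrix makes obvious and a single accuracy number hides.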

A sampling of best machine learning scores you achieved:
