I want to share the Byte 5 results on the machine learning side, because you each took such different approaches to improving accuracy. Only one person questioned the high impact of SpayNeuter -- to me this suggests that perhaps they are only spaying dogs they think are adoptable, so it may be a confound. A combination of all of these approaches could lead to significant new improvements!
The highest accuracies reported were Harsh's 68.1% for decision trees (after using entropy instead of the Gini index, tweaking Age, and removing Sex) and Nikola's 62.6%, 1st place for Naïve Bayes (after modifying the Breed and Color features to have fewer categories).
One student reported 74% after adding Intake Month, but did not provide enough detail in their answer for me to include it as a trustworthy result, especially since the reference source code included Intake Month and did not do that well. Another student achieved 68.4% (after removing Breed), but again did not provide enough detail.
Some of the things you did include:
- Imputing missing values [no information on how this affected accuracy over just removing them; both options are sketched after this list]
- Improving on features:
- One student noted features that were 'purged' (ignored) by the decision tree and improved them: 'Breed' was converted into a binary feature (mixed breed or not), and a bunch of rare colors were grouped into 'Other' (see the sketch after this list). This led to a small but significant improvement in decision trees, from 63% to 66%, for that student.
- Trying different subsets of features
- Simply removing 'Breed' [but unclear why this made a difference, since it was already being ignored]
- Picking a combination that had high accuracy during optimization ('AnimalType', 'EstimatedAge', 'Sex', 'SpayNeuter', 'Size', 'IntakeType') - this led to a very small improvement in accuracy over the full feature set (a small drop in accuracy for Naïve Bayes; an increase for decision trees from 64.27% to 64.35%; no report on whether this is significant)
- One student reported: "I have modified the features and used 'IntakeMonth', 'Age', 'Sex', 'SpayNeuter', and 'IntakeType'. I mainly wanted to see [the effect of] Age, Sex, and IntakeType, but I added IntakeMonth and SpayNeuter to avoid any form of bias." This led to an accuracy of 61.7%. There is no information about the original accuracy, but this person wanted to look at the impact of Sex (which rose to the top after Breed, Size, Color, and AnimalType were removed). It is not clear this was a good thing.
- Trying different classifiers
- This mostly involved trying classifiers we hadn't discussed much in class, so I did not count these towards the 'highest score'. For example, one person achieved 68.8% using a random forest after removing the Sex feature (see the sketch after this list).
- Using 'entropy' instead of the Gini index for building the tree (combined with deleting missing/incomplete Age information and removing Sex) led to 68.1% vs. an original value of 67.7%; I don't know whether this difference is significant. The criterion switch is also shown in the sketch after this list.
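For those curious about comparing the two missing-value strategies, here is a minimal sketch using scikit-learn's SimpleImputer on a toy stand-in for the shelter data; the column names and toy values are my own illustration, not anyone's actual Byte 5 code.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the shelter data; 'EstimatedAge' has one missing value.
shelter = pd.DataFrame({
    'EstimatedAge': [1.0, 3.0, None, 7.0],
    'Outcome': ['Adopted', 'Adopted', 'Euthanized', 'Transfer'],
})

# Option 1: simply remove rows with a missing age.
removed = shelter.dropna(subset=['EstimatedAge'])

# Option 2: impute the missing age instead (median here; mean or
# most_frequent are other choices).
imputer = SimpleImputer(strategy='median')
shelter['EstimatedAge'] = imputer.fit_transform(shelter[['EstimatedAge']]).ravel()

print(len(removed), 'rows after removal;', len(shelter), 'rows after imputation')
```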
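And a sketch of the Breed/Color simplification described above, again on toy data (the 'Mix' substring test and the frequency threshold are my guesses at a reasonable implementation, not the student's exact code):

```python
import pandas as pd

# Toy stand-in for the shelter data.
shelter = pd.DataFrame({
    'Breed': ['Labrador Retriever Mix', 'Siamese', 'Pit Bull Mix', 'Beagle'],
    'Color': ['Black', 'Seal Point', 'Black', 'Tricolor'],
})

# Collapse 'Breed' into a binary feature: mixed breed or not.
shelter['MixedBreed'] = shelter['Breed'].str.contains('Mix', case=False, na=False)

# Group rare colors into 'Other' (a threshold of 2 suits the toy data;
# something larger makes sense on the real dataset).
counts = shelter['Color'].value_counts()
rare = counts[counts < 2].index
shelter['Color'] = shelter['Color'].replace(list(rare), 'Other')

print(shelter)
```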
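Finally, here is what switching the split criterion (and trying a random forest) looks like in scikit-learn; the synthetic data is just a placeholder for your encoded feature matrix, and the hyperparameters are illustrative rather than anyone's reported settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder for the encoded shelter features and outcome labels.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# scikit-learn's default criterion is 'gini'; 'entropy' splits on
# information gain instead.
for criterion in ('gini', 'entropy'):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    print(criterion, cross_val_score(tree, X, y, cv=10).mean())

# A random forest, as one student tried in place of a single tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print('forest', cross_val_score(forest, X, y, cv=10).mean())
```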
Useful Observations
- Harsh collected some very nice information about how decision trees are built using either entropy or the Gini index (apparently the default in some of your trees was Gini, not entropy).
- Gini index - a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were labelled randomly according to the distribution of classes in the subset (see the formulas at the end of this post)
- Information gain - based on information theory. Entropy is the expected value of the information contained in a message, and reflects the uncertainty of a random variable; in the simplest case you estimate the class probabilities by frequency, e.g. probability = number of positive samples / total number of samples.
- Yanan noticed that 'Other' is often misclassified as 'Euthanized' (the confusion-matrix sketch at the end of this post shows how to surface this kind of pattern). I would add that it might be worth examining what went into 'Other' to see whether these outcomes are in some deep sense similar (or whether we should get rid of this class or split it up differently).
- It is also worth noting that SpayNeuter may be a problem feature (it could be that it is only defined for animals who are adopted, for example). I don't think it's that cut and dried, but it could introduce some bias into the problem. A quick cross-tabulation of SpayNeuter against the outcome (sketched at the end of this post) would show any such skew.
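For reference, here are the standard formulas behind Harsh's two criteria, for a node $S$ whose examples fall into classes $1, \dots, K$ with proportions $p_k$ (these are the textbook definitions, not anything Byte 5-specific):

$$\mathrm{Gini}(S) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

$$H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$\mathrm{IG}(S, A) = H(S) - \sum_{j} \frac{|S_j|}{|S|} H(S_j)$$

where a split on attribute $A$ partitions $S$ into subsets $S_j$, and each $p_k$ is estimated by frequency (number of samples of class $k$ / total number of samples in $S$).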
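If you want to check for the kind of pattern Yanan found, a confusion matrix makes it visible at a glance; this sketch uses made-up labels in place of your classifier's test-set predictions:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true/predicted outcomes; substitute your own test labels
# and classifier predictions here.
y_true = ['Adopted', 'Other', 'Euthanized', 'Other', 'Adopted', 'Other']
y_pred = ['Adopted', 'Euthanized', 'Euthanized', 'Euthanized', 'Adopted', 'Other']

labels = ['Adopted', 'Euthanized', 'Other']
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are the true class, columns the predicted class; a large
# ('Other', 'Euthanized') cell is exactly Yanan's observation.
print(pd.DataFrame(cm, index=labels, columns=labels))
```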
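And one quick way to probe the SpayNeuter worry is to cross-tabulate it against the outcome; if the feature is recorded mainly for adopted animals, the skew shows up immediately (toy data again, standing in for the real columns):

```python
import pandas as pd

# Toy stand-in for the real SpayNeuter and Outcome columns.
shelter = pd.DataFrame({
    'SpayNeuter': ['Yes', 'Yes', 'Unknown', 'Unknown', 'Yes', 'No'],
    'Outcome': ['Adopted', 'Adopted', 'Euthanized', 'Transfer', 'Adopted', 'Euthanized'],
})

# Row-normalized cross-tabulation: each row shows the outcome mix for
# one SpayNeuter value.
print(pd.crosstab(shelter['SpayNeuter'], shelter['Outcome'], normalize='index'))
```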