Animals at Shelter: Outcome Prediction
Challenges: text variables, multi-class classification
Tools: Python, Pandas, Bokeh, Seaborn, TensorFlow, Tableau
Every year, approximately 6.5 million companion animals end up in US shelters: around 3.3 million dogs and 3.2 million cats, of which almost half are eventually adopted. Although this number seems high, shelter intake has shown an evident decline since 2011 (The American Society for the Prevention of Cruelty to Animals). Many animals are surrendered by their owners, while others are strays found by people or come from breeders. This project uses animals' information (breed, color, sex, age, etc.) to predict their outcome (Adoption, Died, Euthanasia, Return to owner, or Transfer).
In this project, I first explore the distribution of different traits to better understand the characteristics of animals at the Austin Animal Center. Second, I compare how the outcomes differ between cats and dogs. Third, I analyze whether the age distribution differs across outcomes. Lastly, I build a model to predict the most plausible outcome given an animal's traits.
The data was collected by the Austin Animal Center from October 1, 2013 to March 2016, with a total of 26,730 records on cats and dogs. I accessed the data from Kaggle. The variables include AnimalID, Name, DateTime, OutcomeType, OutcomeSubtype, AnimalType, SexuponOutcome, AgeuponOutcome, Breed, and Color. Other than AgeuponOutcome, all variables are categorical/text data. There are quite a few missing values in Name (7,691), and around half of the values are missing in OutcomeSubtype (13,612); neither variable is considered in the later analysis. The 18 missing values in AgeuponOutcome were filled with mean imputation, and the 1 missing value in SexuponOutcome was manually assigned to the 'Unknown' category.
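The two imputation steps above are one-liners in Pandas. The sketch below uses a hypothetical mini-frame with the same column names; only the fill logic reflects the text.

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mirroring the two affected columns of the data set.
df = pd.DataFrame({
    "AgeuponOutcome": [2.0, np.nan, 4.0],
    "SexuponOutcome": ["Neutered Male", np.nan, "Spayed Female"],
})

# Mean imputation for the missing numeric ages.
df["AgeuponOutcome"] = df["AgeuponOutcome"].fillna(df["AgeuponOutcome"].mean())

# The missing sex value goes to an explicit 'Unknown' category.
df["SexuponOutcome"] = df["SexuponOutcome"].fillna("Unknown")
```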
Question 0: How should the data be processed?
Method: I adopted different strategies for different variables during data processing. To visualize age with a distribution plot, I first needed to convert the text data to numeric data. Upon inspection, I found that age is described in different units (years, months, days), so I recoded animals younger than 30 days as 0 years, and animals older than 30 days but younger than 1 year as 0.5 years, to distinguish them from both 0 and 1 year. After recoding everything to the same unit, I converted age to numeric data. For later modeling, I recoded categorical variables (OutcomeType, sex) as numeric values (i.e., 0, 1, 2, ...). For Breed, I created two categories, common and uncommon, where a common breed is one that appears more than 400 times in the data set. For Color, I also defined two categories: mix or pure. Note that this is implemented very loosely in code, which, as we will see later, results in poor predictors.
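The age recoding rule above can be written as a small helper. This is a sketch of the rule as described, not the author's exact code; the input format (e.g., "2 years", "3 weeks", "10 days") is assumed from the data set's AgeuponOutcome column.

```python
def age_to_years(age_str):
    """Convert ages like '2 years', '6 months', '10 days' to numeric years.

    Rule from the text: under 30 days -> 0, between 30 days and 1 year
    -> 0.5, otherwise the whole number of years.
    """
    value, unit = age_str.split()
    value = int(value)
    if unit.startswith("day"):
        days = value
    elif unit.startswith("week"):
        days = value * 7
    elif unit.startswith("month"):
        days = value * 30
    else:  # 'year' / 'years'
        days = value * 365
    if days < 30:
        return 0.0
    if days < 365:
        return 0.5
    return float(days // 365)
```

Applying this over the column (e.g., `df["AgeuponOutcome"].map(age_to_years)`) yields the numeric age used in the distribution plots.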
Question 1: What do the animals look like at the Austin Animal Center?
Method: This is primarily done through data visualization. I used a table to show the distribution of SexuponOutcome (referred to as sex below), a seaborn bar plot for AnimalType, and violin/distribution plots of AgeuponOutcome (referred to as age below) for dogs and cats. Visualizing breed and color was challenging, as both are text variables with 1,380 and 679 unique values respectively (Fig 1). I used a word cloud to visualize the common colors, and bar plots for the common breeds (those appearing more than 400 times) for cats and dogs respectively.
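The counting step behind the common-breed bar plots can be sketched in pure Pandas (the actual plots used seaborn; the column names follow the data set, while the toy frame is hypothetical):

```python
import pandas as pd

def common_breeds(df, animal_type, threshold=400):
    """Return breeds appearing more than `threshold` times for one species,
    sorted by frequency; these become the bars of the breed bar plot."""
    counts = df.loc[df["AnimalType"] == animal_type, "Breed"].value_counts()
    return counts[counts > threshold]

# Tiny illustrative frame (real data has 1,380 unique breeds).
toy = pd.DataFrame({
    "AnimalType": ["Cat"] * 5,
    "Breed": ["Domestic Shorthair Mix"] * 3 + ["Siamese Mix"] * 2,
})
```

With the real data and the default `threshold=400`, this reproduces the 3 common cat breeds and 4 common dog breeds noted in the footnote.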
Question 2: How does the outcome differ for cats and dogs?
Method: This is done with data visualization in Tableau, which excels at visualizing multiple categories and creating visually appealing plots. I also used a bubble plot to visualize how the overall outcome types change across years, from 2013 to 2016. I was especially interested in how age affects the different outcomes, particularly adoption. Since the age distribution is not normal, I first used the non-parametric Kruskal-Wallis test to check whether the age distribution differs across outcome types, and then used the same test to check whether age differs between adopted and not-adopted animals for cats and dogs. To visualize the test result, I plotted the age distributions for adopted vs. not-adopted dogs/cats in the same bar graph using Bokeh.
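The Kruskal-Wallis test is available in SciPy. The sketch below shows the test across outcome groups on synthetic data; the column names (`OutcomeType`, and a numeric age column here called `AgeNumeric`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def age_differs(df, group_col="OutcomeType", age_col="AgeNumeric"):
    """Kruskal-Wallis H-test on numeric age across the groups in group_col.

    A small p-value suggests the age distributions differ across outcomes.
    """
    groups = [g[age_col].dropna().to_numpy() for _, g in df.groupby(group_col)]
    return kruskal(*groups)

# Synthetic example: two outcomes with clearly different age distributions.
toy = pd.DataFrame({
    "OutcomeType": ["Adoption"] * 20 + ["Return_to_owner"] * 20,
    "AgeNumeric": list(np.linspace(0, 1, 20)) + list(np.linspace(4, 8, 20)),
})
stat, p = age_differs(toy)
```

The same function applied to the adopted vs. not-adopted split gives the pairwise comparisons reported in footnote 5.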
Question 3: What's the predictive model for animal outcome?
Method: To see how different animal traits predict the outcome type, I used a 3-layer fully connected neural network to predict the outcome given the traits. This process involved many iterations of parameter tuning and predictor selection to achieve higher accuracy, which I describe further in the Model section.
4*: Here a common breed is one that appears more than 400 times in the data for a given animal type. There are 3 common breeds among cats, namely Domestic Short/Medium/Long Hair Mix, and 4 common breeds among dogs, namely Pit Bull Mix, Chihuahua Shorthair, Labrador Retriever Mix, and German Shepherd Mix.
5*: P-value for different outcomes: 1.05e-12; p-value for cats adopted vs. not adopted: 2.76e-43; p-value for dogs adopted vs. not adopted: 0.0
Creative Time Series Visualization: How do the outcome types change by year?
Note that our data does not cover all 12 months of 2013 (only 2 months) or 2016 (only 3 months), so the bubble sizes for those two years should not be compared with the others.
I employed a fully connected neural network to predict the outcome type (Adoption, Died, Euthanasia, Return to owner, and Transfer) given the traits (AnimalType, age, sex, Breed, Color) of an individual animal. I used the TensorFlow DNN classifier with 80% of the data for training and 20% for testing. My first attempt was to feed in all 5 predictors to predict the 5 outcomes. Surprisingly, no matter how I tuned the parameters (number of hidden nodes, number of hidden layers, training steps, etc.), the training error did not decrease and converge.

Thinking about the reason for such poor performance, I realized that two variables, Breed and Color, may introduce extra random noise that prevents the model from fitting. These variables are not poor predictors by nature, but because of how we defined them. Recall that the implementation was "loose". The binary Color variable only checks whether "/" appears in the color description: if it does, the animal is labeled mix, otherwise pure. For example, "Brown/White" is categorized as mix and "Black" as pure, but "Cream Tabby" is also categorized as pure, which is problematic and adds random noise to the predictors. The binary Breed variable labels as common those breeds appearing more than 400 times in the data set; this cut-off is somewhat arbitrary, and binning 1,380 breeds into only 2 categories loses a lot of valuable information. So in the next approach I removed the two variables and used only AnimalType, age, and sex in the neural network. The training error then began to decrease and converge. After intensive parameter tuning, the predictive performance on test data was at best around 0.47 accuracy, which is far better than the 0.20 of random guessing, but still not very satisfactory.
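The "loose" color rule is only a couple of lines; writing it out makes the failure mode explicit. This is a sketch of the rule as described in the text, not the author's exact code.

```python
def is_mix_color(color):
    """Loose rule from the text: a color is 'mix' iff the description
    contains '/'. Patterned coats without a slash, like 'Cream Tabby',
    slip through as 'pure', adding noise to the predictor."""
    return "/" in color
```

Here `is_mix_color("Brown/White")` is `True` and `is_mix_color("Black")` is `False` as intended, but `is_mix_color("Cream Tabby")` is also `False`, which is exactly the noise source discussed above.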
My next hypothesis was that the current predictors may not be strong enough to differentiate some of the outcome types. For example, we might need more indicative predictors, such as health condition, to differentiate Died vs. Euthanasia. Another observation is that the data is unbalanced: some outcome types have far fewer observations than others (Died: 197). This led to my third neural network model, designed to predict only Adopt vs. Not Adopt, with 3 hidden layers of 10, 200, and 10 nodes and a learning rate of 0.0001. It turned out that my hypothesis was correct: the prediction accuracy jumped from 0.47 in the previous model to 0.73 (Fig 7).
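The binary adoption model can be sketched with the architecture stated in the text (hidden layers of 10, 200, and 10 nodes, learning rate 0.0001). The report used TensorFlow's DNN classifier; scikit-learn's `MLPClassifier` is used here as a lighter-weight stand-in, and the toy features and their encoding are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_adoption_model():
    """Adopt vs. Not Adopt network with the hidden-layer sizes and
    learning rate described in the text (stand-in for the TF DNN)."""
    return MLPClassifier(hidden_layer_sizes=(10, 200, 10),
                         learning_rate_init=0.0001,
                         max_iter=500,
                         random_state=0)

# Toy features: AnimalType, age, sex encoded as numbers (encoding assumed).
X = np.array([[0, 0.5, 1], [1, 3.0, 2], [0, 0.0, 0], [1, 7.0, 1]])
y = np.array([1, 0, 1, 0])  # 1 = adopted, 0 = not adopted

model = build_adoption_model().fit(X, y)
```

On the real data, an 80/20 train/test split around this model corresponds to the 0.73 test accuracy reported above.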
We can tell many stories and propose interesting hypotheses from the above analysis. From the exploratory data analysis, we see that the cats are quite homogeneous in breed and age, and the majority of cats are less than 1 year old. We also see that very few cats are returned to their owner and that the adoption rate is significantly higher for kittens. This might result from cats being slower to trust their owners than dogs, so people are more likely to adopt a kitten in order to build a reliable relationship with the cat from early on. This in turn motivates the shelter to obtain more kittens from breeders, which ultimately results in more homogeneity among the cats. Dogs in general bond with their owners and develop a sense of belonging faster, which may be part of the reason the second most frequent outcome for dogs is return to owner: their owners are less likely to abandon them. The iterative process of building predictive models showed how exploratory data analysis can provide insightful understanding, especially when we try to improve a black-box model such as a neural network. The most challenging part of the analysis was creating predictors from text data without losing much information, as this directly influences the quality of the predictors and the model's performance. For future work, one could also investigate how the animals' traits change over time, and build a different, higher-level model predicting the total number of animals for each outcome.