Post date: Feb 24, 2014 5:06:03 PM
What sort of things seem to influence intake and outcomes for animals in a shelter? Here are some conclusions suggested by the visualizations that students produced for Byte 4.
Different breeds may be adopted more or less at different times of year
(Yanan Jian)
Cats may be returned to their owners less often than dogs
(Runyun Zhang)
Adoptions may be much more common among animals under one year of age
(Abdel Bourai)
These visualizations still leave us with many questions. For some of them it is unclear which animals were included in the analysis, or how unknown values were handled. Additionally, they do not provide any statistical analysis of the conclusions I suggest. Finally, data cleaning is not clearly described on these web pages and in some cases could be of great value (for example, combining similar breeds rather than allowing them to be split out could change the results of the first visualization). Nonetheless, these examples help to illustrate the value of visualization for helping us begin to ask (and answer) questions about our data.
I also want to share one of the nicest and most complete answers about how to prepare the data. It is a great summary of how to go about not only identifying problems with data but also cleaning them. I made a few small edits to the presentation, but the words are all those of one of your fellow students (Harsh), a CIT Masters student (INI).
1. Surveying the data
I first surveyed what type of data was missing. Was the data missing because of any relation to some other fields? My conclusion was that here the data are missing completely at random (MCAR) [1]. When we say that data are missing completely at random, we mean that the probability that an observation (Xi) is missing is unrelated to the value of Xi or to the value of any other variables. By looking into the nature of the missing data, we can choose the way in which to clean it.
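One rough way to probe the "related to some other field?" question is to compare how often a field is missing within each group of another field. The sketch below is only an illustration. The records, the field names (`species`, `age`, `outcome`) and the helper `missing_rate_by_group` are all made up for the example and are not taken from the shelter data or from the student's answer. Similar missing rates across groups are consistent with MCAR but do not prove it, and this kind of check cannot tell us whether missingness depends on the unobserved value itself.

```python
from collections import defaultdict

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"species": "Dog", "age": "Under 1 year", "outcome": "Adopted"},
    {"species": "Dog", "age": None, "outcome": "Returned to owner"},
    {"species": "Cat", "age": "1-5 years", "outcome": "Adopted"},
    {"species": "Cat", "age": None, "outcome": "Transferred"},
    {"species": "Dog", "age": "1-5 years", "outcome": "Adopted"},
    {"species": "Cat", "age": "Under 1 year", "outcome": "Adopted"},
]


def missing_rate_by_group(rows, target, group_by):
    """Return {group value: fraction of rows in that group where `target` is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row[group_by]
        total[key] += 1
        if row[target] is None:
            missing[key] += 1
    return {key: missing[key] / total[key] for key in total}


# Compare how often "age" is missing for each species.
rates = missing_rate_by_group(records, target="age", group_by="species")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of age values missing")
```

If one group showed a much higher missing rate than the others, that would be a hint that the data are not MCAR and that simply deleting rows could bias the result.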
2. Available options:
There are other techniques, but these are what I looked at and considered primarily given the short amount of time.
Conclusion: The best approach was Listwise Deletion (deleting entries), since the data seem to be Missing Completely at Random and the deleted records form only a small fraction of the total data set. I also removed the Unspecified column from Age, since I felt this data was not helpful to the user given the question: what is the relationship between age and outcome?
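Listwise deletion, as described in the conclusion above, can be sketched in a few lines. This is a minimal illustration and not the student's actual code. The records, the field names and the helper `listwise_delete` are hypothetical, and treating the string "Unspecified" as a missing marker is an assumption that mirrors the decision to drop the Unspecified ages.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"age": "Under 1 year", "outcome": "Adopted"},
    {"age": None, "outcome": "Adopted"},
    {"age": "Unspecified", "outcome": "Transferred"},
    {"age": "1-5 years", "outcome": ""},
    {"age": "1-5 years", "outcome": "Returned to owner"},
]


def listwise_delete(rows, required_fields, missing_markers=(None, "", "Unspecified")):
    """Keep only rows in which every required field has a usable value.

    Returns the kept rows and the fraction of rows that were dropped.
    """
    kept = [
        row for row in rows
        if all(row.get(field) not in missing_markers for field in required_fields)
    ]
    dropped_fraction = (len(rows) - len(kept)) / len(rows) if rows else 0.0
    return kept, dropped_fraction


kept, dropped_fraction = listwise_delete(records, required_fields=["age", "outcome"])
print(f"kept {len(kept)} of {len(records)} rows; dropped {dropped_fraction:.0%}")
```

The toy data drop most rows on purpose so that each missing marker is exercised. Reporting the dropped fraction is a cheap way to confirm that, on the real data set, the deleted records really are only a small share of the total.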
References:
[1] This answer contains elements referenced from the website: "Treatment of Missing Data": http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
[2] http://en.wikipedia.org/wiki/Imputation_(statistics)