---
title: "Austin Animal Center Data Analysis"
author: "Daniel Draney"
date: "2024-09-24"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Background on this project and the company
I'm approaching this project as part of my Google Data Analyis course. I chose to look at pet adoption data because my wife and I have two pets that we "rescued" and they hold a special place in our hearts. So, what better way to learn more about data analysis than by looking at data about something meaningful.
The publicly available data from the Austin Animal Center shows that Austin, Texas is the largest "No Kill" city in the country. The data provided gives details of animal intakes and the outcomes as they leave the Animal Center. This data is something shared through the official city of Austin's open data portal. I found references to the data through Kaggle as I was researching possible data analysis projects that I could work through to broaden my data analysis skills. I intentionally chose not to review others' work that is on Kaggle so as not to skew my own work, but it's very likely that analysis I do in exploriing this data has already been performed by others that are doing similar exploratory data analysis work. Additionally, the city of Austin has some [basic charts](https://data.austintexas.gov/browse?Ownership_Department-name=Animal+Services&limitTo=charts) of the data that was helpful to look at some of the basics. These initial looks at the data helped me to define the questions I wanted to ask of the data.
I initially wanted to look at data closer to home, and I still plan to do so, so I will use the skills learned here looking at Austin's data to look closer at data for animal shelters in my own state of Idaho, separately in another analysis project. I'm hopeful that this project will allow me to dig deeper into the R programming language, RStudio, and showcase some of the additional analysis and programming skills I've learned, along with skills specific to using plotting within R to create meaningful charts and graphs.
## Ask
* I'd like to answer three questions with this project:
1. How does the length of stay in the animal center differ depending on animal type?
2. How does the outcome differ depending on animal type? Does one animal type survive more than another?
3. How does the calendar or cycle of adoptions impact the animal center? Is there a time of year that's busier?
* **The Business Task:** Through analysis of these guiding questions, I'd like to identify ways that the animal center can advertise, change processes, or have increased awareness to be able to serve the pets better.
## Prepare
* Because I'm the dedicated resource for this project, and fully in control of the plan for the analysis, the timeline should be only a few days.
* As mentioned above, the data for this work is available publicly and can be found [here](https://data.austintexas.gov/browse?q=animal) and is also referenced on Kaggle [here](https://www.kaggle.com/datasets/jackdaoud/animal-shelter-analytics/data).
* I'll begin by downloading the available data for intakes and outcomes, which includes data back to 2013 through August 2024 when I downloaded the data.
* The initial analysis includes:
1. How is the data organized?
2. Sort, filter, and categorize the data to understand what's included
3. Review the credibility of the data
4. Confirm the completeness of the data to understand capability to answer the questions being asked
* **Data Sources Used:** All data sources used for this were obtained from the City of Austin's open data portal as linked above.
```{r import the data to prepare for processing, include=FALSE}
install.packages("tidyverse")
library(tidyverse) #helps wrangle data
# Use the conflicted package to manage conflicts
library(conflicted)
# Set dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
conflicts_prefer(dplyr::filter)
#=====================
# STEP 1: COLLECT DATA
#=====================
# # Upload Divvy datasets (csv files) here
ac_intakes <- read_csv("Austin_Animal_Center_Intakes.csv")
ac_outcomes <- read_csv("Austin_Animal_Center_Outcomes.csv")
# At this stage the data can be reviewed a number of ways, such as with the View() function. Since it's a large data set with over 100k rows in each table, it can be quickly viewed in the RStudio environment view when this detail is being reviewed via the console in RStudio.
```
## Process
* Now that I've downloaded the data I can do my initial preparatory analysis. Because the data was provided in a .csv file format, there's some basic analysis that can be done within Excel quickly. For example, I can see that there are more records in outcomes than there are in intakes, so I will need to review consistency of the data to make sure I understand the differences. I can also see for example the earliest dates of the records and that not all records have the name of the pet.
* My entire cleaning, filtering, transforming, and bias checking of the data will be done within R so that I can maintain a record of the work, document any cleaning of the data, as well as to make the process repeatable.
* **Data Cleaning and Manipulation:**
1. Join the data tables together, and normalize the column names, so that the intake details and outcome information can be compared seamlessly.
2. Remove records that are irrelevant to analysis including animals with multiple intakes.
3. Calculate the date parts for the difference between intake time and outcome time to understand length of stay.
4. Calculate the date parts for the intake date and outcome date to understand the cycles through the year to know if there are peaks or valleys.
```{r process and combine the data to prepare for cleansing, include=FALSE}
#====================================================
# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE TABLE
#====================================================
# Join the two tables together to merge the information into a single data frame
# Merge data frames together
ac_in_plus_out <- merge(x = ac_intakes, y = ac_outcomes, by = "Animal ID")
# Only use the fields needed, remove those in the list
ac_in_plus_out <- ac_in_plus_out %>%
select(-c(Name.y, "Animal Type.y",Color.y,Breed.y,MonthYear.x,MonthYear.y,))
# Rename the fields
ac_in_plus_out <- ac_in_plus_out %>%
rename(AnimalID = "Animal ID"
,AnimalName = Name.x
,AnimalType = "Animal Type.x"
,AnimalColor = Color.x
,AnimalBreed = Breed.x
,IntakeDate = DateTime.x
,OutcomeDate = DateTime.y
)
```
```{r clean and manipulate the data to prepare for analysis, include=FALSE}
#======================================================
# STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
#======================================================
# Initial joining of data will show that that some animals have multiple intakes and outcomes. Using Excel we can see that over 10,000 animals have had more than one intake and over 200 animals have had more than 5 intakes over the 10 years in the data. For the purposes of what we are analyzing in this data, we don't need to look at animals that have multiple visits, but this would be important to call out to stakeholders to help them help pet owners prevent pets from return visits.
# Add columns that list the date, month, day, and year of each intake
# This will allow us to aggregate intake data for each month, day, or year
# https://www.statmethods.net/input/dates.html more on date formats in R found at that link
ac_in_plus_out$date <- as.Date(ac_in_plus_out$IntakeDate,'%m/%d/%Y') #The default format is yyyy-mm-dd, so need to tell it the format
ac_in_plus_out$month <- format(as.Date(ac_in_plus_out$date), "%m")
ac_in_plus_out$day <- format(as.Date(ac_in_plus_out$date), "%d")
ac_in_plus_out$year <- format(as.Date(ac_in_plus_out$date), "%Y")
ac_in_plus_out$day_of_week <- format(as.Date(ac_in_plus_out$date), "%A")
# Format the Intake Date and Outcome Date to unambiguous formats
ac_in_plus_out$IntakeDate <- as.Date(ac_in_plus_out$IntakeDate,'%m/%d/%Y')
ac_in_plus_out$OutcomeDate <- as.Date(ac_in_plus_out$OutcomeDate,'%m/%d/%Y')
# Add a "stay_length" calculation to intakes (in seconds)
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html
ac_in_plus_out$stay_length <- difftime(ac_in_plus_out$OutcomeDate,ac_in_plus_out$IntakeDate)
# Calculate the Stay in number of days
ac_in_plus_out$stay_days <- ((ac_in_plus_out$stay_length/60)/60)/24 # divide by 60 to get minutes, 60 again for hours, divide by 24 for days
# Convert "stay_length" from Factor to numeric so we can run calculations on the data
# is.factor(ac_in_plus_out$stay_days) # commenting out so the view doesn't break the Knit
ac_in_plus_out$stay_days <- as.numeric(as.character(ac_in_plus_out$stay_days))
# is.numeric(ac_in_plus_out$stay_days)# commenting out so the view doesn't break the Knit
# Group by Animal ID to find the records that have more than one intake so that we can exclude them
intake_aggregate_tbl <- ac_in_plus_out %>% group_by(AnimalID) %>%
summarise(total_count=n(),.groups = 'drop') %>%
as.data.frame()
# Remove the records from the aggregate table that have more than 1 intake
filtered_aggregate <- intake_aggregate_tbl %>%
filter(total_count == 1)
# Merge the working list with the Aggregate list to only return the records with exactly one intake
intake_list <- merge(x = filtered_aggregate, y = ac_in_plus_out, by = "AnimalID")
# Add a field to check if the stay length is greater than one year
intake_list <- intake_list %>%
mutate(stay_length_years = stay_days / 365) #divide by 365 to understand the number of years for the stay
```
## Analyze
* We've asked questions of the data, and also prepared and processed the data, now we need to analyze the data to be able to make recommendations.
* **Summary of Analysis:**
* Dogs stay about a week less than cats do, seemingly the favorite to adopt in the Austin area.
* Birds have a higher rate of euthanasia than other pets, contributing to the perceived average shorter stay during initial analysis.
* Animals other than pets are euthanized more often than anything else.
* Austin may be indicated as a "no kill" shelter, but there are animals that are euthanized. However, some pets stayed several years without being euthanized.
* As one might expect, in spring, there's a spike in cat intakes at the animal shelter.
``` {r analyze the data, include=FALSE}
#=====================================
# STEP 4: CONDUCT DESCRIPTIVE ANALYSIS
#=====================================
# Descriptive analysis on stay length in days
mean(intake_list$stay_days) #straight average (total stay length / stays)
median(intake_list$stay_days) #midpoint number in the ascending array of stay lengths
max(intake_list$stay_days) #longest stay
min(intake_list$stay_days) #shortest stay
# You can condense the four lines above to one line using summary() on the specific attribute
summary(intake_list$stay_days)
# Compare different length of stay for different animal types
aggregate(intake_list$stay_days ~ intake_list$AnimalType, FUN = mean)
aggregate(intake_list$stay_days ~ intake_list$AnimalType, FUN = median)
aggregate(intake_list$stay_days ~ intake_list$AnimalType, FUN = max)
aggregate(intake_list$stay_days ~ intake_list$AnimalType, FUN = min)
# Create table to compare different outcomes for different animal types
intakes_by_type_outcome <- intake_list %>%
group_by(AnimalType, `Outcome Type`) %>%
summarise(count = n())
# Create table to compare intakes by month
intakes_by_type_month <- intake_list %>%
group_by(AnimalType, month) %>%
summarise(count = n())
# Create table to compare intakes by month and year
intakes_by_type_month_year <- intake_list %>%
group_by(AnimalType, year, month) %>%
summarise(count = n())
# Reassign to the desired values (month names)
intakes_by_type_month <- intakes_by_type_month %>%
mutate(month = recode(month,"01"="Jan","02"="Feb","03"="Mar","04"="Apr","05"="May","06"="Jun","07"="Jul","08"="Aug","09"="Sep","10"="Oct","11"="Nov","12"="Dec"))
intakes_by_type_month_year <- intakes_by_type_month_year %>%
mutate(month = recode(month,"01"="Jan","02"="Feb","03"="Mar","04"="Apr","05"="May","06"="Jun","07"="Jul","08"="Aug","09"="Sep","10"="Oct","11"="Nov","12"="Dec"))
# Notice that the months of the year are out of order. Let's fix that.
intakes_by_type_month$month <- ordered(intakes_by_type_month$month, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
intakes_by_type_month_year$month <- ordered(intakes_by_type_month_year$month, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
```
## Share
* **Let's take a look at a few of the key findings**
### How long does each type of animal stay with us on average?
* A visual of the average length of stay for animals by type
* Birds are at the Austin Animal Center less time than cats or dogs
* Dogs are at the Austin Animal center less time than cats
* But is the shorter stay for birds because of something else?
```{r plot of stay length by animal type, echo=FALSE, message = FALSE}
# Let's create a visualization for average duration, by animal type
intake_list %>%
group_by(AnimalType) %>%
summarise(number_of_rides = n()
,average_stay = mean(stay_days)) %>%
arrange(AnimalType) %>%
ggplot(aes(x = AnimalType, y = average_stay, fill = AnimalType, label=round(average_stay,2))) +
geom_col(position = "dodge") +
# avoid overlap of text and bar to make text visible as bar and text have the same color
geom_text(nudge_y = 1) +
labs(title = "Average Length of Stay") +
xlab("Animal Type") +
ylab("# of Days")
```
### How successul are we at our "No Kill" goal?
* A visual of the number of each type of outcome, per animal type
* Animals other than Livestock, Birds, Cats, or Dogs are Euthanized far more than any other outcome
```{r plot of outcomes by animal type, echo=FALSE, message = FALSE, fig.width=20, fig.height=15}
# Let's create a visualization for outcomes, by animal type
intakes_by_type_outcome %>%
ggplot(aes(x = `Outcome Type`, y = count, fill = `Outcome Type`,label="")) +
geom_col(position = "dodge") +
# avoid overlap of text and bar to make text visible as bar and text have the same color
geom_text(nudge_y = 1) +
labs(title = "Outcomes by Animal Type") +
xlab("Outcome Type") +
ylab("# of Outcomes") +
theme(text = element_text(size=30), axis.text.x = element_text(angle=45, vjust=1, hjust=1)) +
facet_wrap(~AnimalType, scales="free")
```
### What is the intake volume per month for all years?
* A visual of the number of intakes per animal type, per month
* Cats see a spike in the Spring, presumably because that's when kittens are born
```{r plot of the the number of intakes per month by animal type, echo=FALSE, message = FALSE, fig.width=20, fig.height=15}
# Let's create a visualization of the number of intakes per month of the year
intakes_by_type_month %>%
ggplot(aes(x = month, y = count, fill = month, label="")) +
geom_col(position = "dodge") +
# avoid overlap of text and bar to make text visible as bar and text have the same color
geom_text(nudge_y = 1) +
labs(title = "Intakes per Month by Animal Type") +
xlab("Month") +
ylab("# of Intakes") +
theme(text = element_text(size=30), axis.text.x = element_text(angle=45, vjust=1, hjust=1)) +
facet_wrap(~AnimalType, scales="free")
```
### How do intakes of pets trend year over year?
* A visual of the number of intakes per pet type, per month, year over year
* We can see the impact of COVID during 2020 and we can see a spike April thru June for cats.
``` {r plot of the number of intakes per month, year over year, for pets only, echo=FALSE, message = FALSE, fig.width=20, fig.height=15}
# Let's create a visualization of the number of intakes per month, year over year after filtering out incomplete years, for pets only
intakes_filtered_month_year <- intakes_by_type_month_year |> filter(year > 2014 & year < 2021 & AnimalType %in% c("Bird", "Cat", "Dog") )
intakes_filtered_month_year %>%
ggplot(aes(x = month, y = count, group = year, color = year)) +
geom_line() +
labs(title = "Pet Intakes by Type Year over Year") +
xlab("Month") +
ylab("# of Intakes") +
theme(text = element_text(size=30), axis.text.x = element_text(angle=45, vjust=1, hjust=1)) +
facet_wrap(~AnimalType, scales="free")
```
## Act
* **Recommendations**
1. On average, cats take longer to be adopted than dogs. Is there a possibility of public advertising about cats to increase adoption rates?
2. Birds are euthanized more than other pets. Are there opportunities to find organizations to partner with to take in birds?
3. There are many "other" animals that are euthanized, such as bats for example. Is there a more appropriate handling for these animals?
4. In Spring there's a spike in intakes for cats, presumably because of kittens. How can the animal center partner with other agencies to increase the rate spay/neuter of pets?
5. Year over year, there's a spike in the Spring for cats. Do we need to increase staff during these times to account for the increased pets?
## Reflect
* **With the project complete, we're now at the phase of reflection.**
* What have I learned? What can I do better next time?
* This case study was a bit more "Self-Guided" than the last study I worked on, so it allowed me to do research on different programming in R to learn more and showcase some more things like using Facets.
* This case study also didn't have clearly defined steps like the prior study did, so I needed to look at data more closely to reconcile what I was seeing and what needed to be answered.
* I also realized in this case study that the template I used/created in the prior case study was helpful to kickstart the process for this case study.