Austin Animal Center Outcomes Data Product

Yuning Zhang, Bhajleen Kaur, Carla Gonzalez, Lucas Mego

Tip: The person who did the part was labeled under each section heading in gray Italian font. If more than one person was labeled for a section, the person who did the subpart was labeled under subheadings in that section in gray Italian font.

Introduction

Bhajleen Kaur

The AAC shelter director and her staff would like us to analyze the outcome data to help them answer the following questions:

What is the average number of days in AAC for an animal?

By animal type
By animal type and breed
By year by animal type

How many animals come into AAC every year, and how many go out?
What are the 3 busiest months (by number of intakes) and what are the 3 slowest months?
Which animal type has the most adoptions?

Therefore, we created a data pipeline that shared insights about the animals that had been taken into the Austin Animal Center (AAC) over the years to effectively answer these questions and any further questions related to the dataset we got. First, we needed to understand the dataset we had been given, then we loaded the data into Superset, created the dimensional model, created and ran the dbt model files, designed our data products, and created the visualizations.

Data Ingestion

Yuning Zhang (Me), Bhajleen Kaur, Carla Gonzalez

Source Data

animals.csv:

https://docs.google.com/spreadsheets/d/1R9uSnLUgUEFvrWEJlgvL9TL21K_6QTTjTi_AikEGPkU/edit?usp=sharing

outcome_types:

https://docs.google.com/spreadsheets/d/169rKFzmosVkpqrp9o1YNrIQgX2GgxaCk315gGjykHBg/edit?usp=sharing

outcome_subtypes:

https://docs.google.com/spreadsheets/d/1ftYqeWkElH2aVgen4SQIScrII81W70_SzQjzZ-BIn3U/edit?usp=sharing

animal_outcomes:

https://docs.google.com/spreadsheets/d/1Y_hHN1-8k_b2TlSmrqKeq6Edxxym8AvZB8jOhHnPdwY/edit?usp=sharing

Documentation for source data

Concrete Example

Conceptual ER Diagram

Physical ER Diagram

Yuning Zhang (Me)

Concrete Example

Conceptual ER Diagram

Physical ER Diagram

SQL to create staging tables

Bhajleen Kaur, Carla Gonzalez

First we dropped the tables we created, to be able to make any changes in the future if necessary, making the tables idempotent. Then, created our tables:

```sql

-- First, have all the DROP TABLE IF EXISTS commands

DROP TABLE IF EXISTS staging.animals;

DROP TABLE IF EXISTS staging.outcome_subtypes;

DROP TABLE IF EXISTS staging.animal_outcomes;

DROP TABLE IF EXISTS staging.outcome_types;

-- Then have all of the CREATE TABLE IF NOT EXISTS commands

CREATE TABLE IF NOT EXISTS staging.animals (

animal_id TEXT PRIMARY KEY,

animal_name TEXT,

breed TEXT,

animal_coloring TEXT,

animal_type TEXT);

CREATE TABLE IF NOT EXISTS staging.outcome_subtypes (

id INTEGER PRIMARY KEY,

outcome_type_id INTEGER,

outcome_subtype TEXT);

CREATE TABLE IF NOT EXISTS staging.animal_outcomes (

animal_id TEXT,

outcome_type_id INTEGER,

outcome_subtype_id INTEGER,

intake_time TIMESTAMP,

outcome_time TIMESTAMP,

days_in_shelter INTEGER );

CREATE TABLE IF NOT EXISTS staging.outcome_types (

id INTEGER PRIMARY KEY,

outcome_type TEXT);

After having our create table statements ready, we went to the terminal and changed the directory to scripts, to finally run bash create_staging_tables.sh

Load script with \copy commands

Bhajleen Kaur, Carla Gonzalez

Our tables were now ready to hold data. We wrote these copy commands to load the data from each csv into its respective table:

```sql

TRUNCATE TABLE staging.outcome_types;

\copy staging.outcome_types FROM '../../data/outcome_types.csv' WITH HEADER CSV;

TRUNCATE TABLE staging.animal_outcomes;

\copy staging.animal_outcomes FROM '../../data/animal_outcomes.csv' WITH HEADER CSV;

TRUNCATE TABLE staging.animals;

\copy staging.animals FROM '../../data/animals.csv' WITH HEADER CSV;

TRUNCATE TABLE staging.outcome_subtypes;

\copy staging.outcome_subtypes FROM '../../data/outcome_subtypes.csv' WITH HEADER CSV;

Again, in our scripts directory we ran bash load_staging_data.sh

Sample tables (screenshots)

Bhajleen Kaur, Carla Gonzalez

Animals

Outcome_types

Outcome_subtypes

Animal Outcomes

Data Cleaning

Yuning Zhang (Me)

Observation

When we were observing the data in the csv file, we found:

There were negative values in the column of 'days_in_shelter', which was impossible.

An animal had multiple records. However, the 'intake_time' for these records were the same (i.e. 'animal_id' and 'intake_time' appeared the same), which was also impossible because a single animal could not be intaken simultaneously.

Therefore, we wanted to drop the records which had multiple records with the same 'intake_time' or had negative 'days_in_shelter'. Here are what we did:

SQL to create cleaned tables

-- clean_animal_outcomes_new1.sql

-- We used 'WHERE a.days_in_shelter >= 0' to filter the records with non-negative 'days_in_shelter'. 'GROUP BY a.animal_id, a.intake_time' made the records that both the 'animal_id' and 'intake_time' were identical in the same group, we used 'HAVING COUNT(*) = 1' to filter the group which contained only 1 record. In that case, the combination of 'a.animal_id' and 'a.intake_time' was unique for the records we selected.

{{

config(

materialized="table"

)

}}

SELECT a.animal_id, a.intake_time

FROM staging.animal_outcomes AS a

WHERE a.days_in_shelter >= 0

GROUP BY a.animal_id, a.intake_time

HAVING COUNT(*) = 1

-- clean_animal_outcomes_new2.sql

-- Since we used 'Group By' above, and only the columns in 'Group By' can be selected. However, we needed more info. than only 'animal_id' and 'intake_time', so we needed to join the table to get them.

{{

config(

materialized="table"

)

}}

SELECT c.animal_id, c.intake_time, a.outcome_type_id, a.outcome_subtype_id, a.outcome_time, a.days_in_shelter

FROM staging.animal_outcomes AS a

JOIN {{ ref('clean_animal_outcomes_new1') }} AS c ON (a.animal_id = c.animal_id and a.intake_time = c.intake_time)

Load the cleaned tables into Superset using run dbt

In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select clean_animal_outcomes_new1.sql, then did $ dbt run --select clean_animal_outcomes_new2.sql

Sample data after cleaning (screenshot)

Data Transformation (to Star Schema)

Yuning Zhang (Me)

Identify facts, grain, and dimensions

We extract primary keys from each table(i.e. animal_id, outcome_type_id, outcome_subtype_id, and intake_time) as columns in the fact table since the primary keys are indivisible and unique, and they are the keys to retrieve a specific record. The atomic fact in fact table is days_in_shelter since it is numeric and additive, and the grain is a single stay by an animal in the shelter, which means a true visit that generates a new row of data for animal_id, outcome_type_id, outcome_subtype_id, and intake_time.

We have 4 dimensions, which are 'Animals_Dimension', 'Animal_Outcomes_Dimension', 'Outcome_Types_Dimension', 'Outcome_Subtypes_Dimension'. Although there are many duplicated attributes in the fact table and the 'Animal_Outcomes_Dimension' table, we still keep them instead of combining the attribute 'outcome_time' into the fact table and delete the 'Animal_Outcomes_Dimension' table because in the real scenario, there would probably be a lot more attributes associated with an animal outcome and those would have to go into the dimension table.

Transformation diagram

Documentation for star schema

Examples of Concrete data (for the star schema)

Conceptual ER (for the star schema)

Physical ER (for the star schema)

SQL to create star schema tables

-- animals_dimension.sql

{{ config(

materialized="table"

) }}

SELECT animal_id,

animal_name,

breed AS animal_breed,

animal_coloring,

animal_type

FROM staging.animals

-- outcome_types_dimension.sql

{{ config(

materialized="table"

) }}

SELECT animal_id,

animal_name,

breed AS animal_breed,

animal_coloring,

animal_type

FROM staging.animals

-- outcome_subtypes_dimension.sql

{{ config(

materialized="table"

) }}

SELECT id as outcome_subtype_id,

outcome_subtype

FROM staging.outcome_subtypes

-- animal_outcomes_dimension_new.sql

{{ config(

materialized="table"

) }}

SELECT animal_id,

intake_time,

outcome_time

FROM analytics.clean_animal_outcomes_new2

-- animal_outcomes_facts_new.sql

{{ config(

materialized="table"

) }}

SELECT a.animal_id,

os.id AS outcome_subtypes_id,

t.id AS outcome_types_id,

o.intake_time,

o.days_in_shelter

FROM analytics.clean_animal_outcomes_new2 AS o

JOIN staging.outcome_subtypes AS os ON (o.outcome_subtype_id = os.id)

JOIN staging.outcome_types AS t ON (o.outcome_type_id = t.id)

JOIN staging.animals AS a ON (o.animal_id = a.animal_id)

Load the cleaned tables into Superset using run dbt

In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select animals_dimension.sql. Similarly, ran the rest four.

Sample tables (screenshots)

animal_dimensions

outcome_types_dimension

outcome_subtypes_dimension

animal_outcomes_dimension_new

animal_outcomes_facts_new

Data products: Transformation & Visualization

Yuning Zhang (Me), Bhajleen Kaur, Carla Gonzalez, Lucas Mego

Rationale for data products produced

Bhajleen Kaur

We created data products to store and organize the data effectively through determining the relationships between tables. Then we could retrieve the specific record or answer the questions from clients quickly and accurately. Through writing queries to join fact tables and different dimension tables by connecting key attributes, we could address all project requirements as we could answer all questions related to animals that are taken into AAC.

Transformation & Loading into superset by run dbt & Visualization

see tips below

(for loading into superset by run dbt, we just showed this step in the example 1(1), the following answers followed the exact same method, you just needed to change the file name after the 'select')

Tips:

Yuning Zhang (Me) did transformation in 1(1) and 2(2); Lucas Mego did transformation in 1(3); Bhajleen Kaur and Carla Gonzalez did transformation in 2; Carla Gonzalez did transformation in 3; Bhajleen Kaur did did transformation in 4.

Lucas Mego did explanations of visualization in 1,2,3,4.

Yuning Zhang (Me) corrected all content in this section after data cleaning since we did not do that at first.

What is the average number of days in AAC for an animal?

(1) By animal type

Transformation

{{

config(

materialized="table"

)

}}

SELECT a.animal_type,

f.days_in_shelter

FROM {{ ref('animal_outcomes_facts_new') }} AS f

JOIN {{ ref('animals_dimension') }} AS a ON (f.animal_id = a.animal_id)

WHERE f.days_in_shelter > 0

Loading into superset by run dbt

In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select avg_number_days_byAnimalType_new.sql.

Visualization

The visualization for the average number of days in the Austin Animal Center for an animal by animal type answers the project requirements by illustrating the average number of days in the center by animal center in a pie graph. This pie graph displays the relative size of the average number of days when comparing one animal type to another. The metric AVG(days_in_shelter) and the dimension animal_type build the visualization to address the question. As we can see, livestock animals tend to have the longest stays in the shelter. It is important to rememeber that the sample size for livestock animals is significantly smaller than the other categories, so this number may not be representative of all livestock animals and the average number of days spent during an animal shelter stay.

(2) Average Number of Days in AAC by Type and Breed

Transformation

{{

config(

materialized="table"

)

}}

SELECT a.animal_type,

a.animal_breed,

f.days_in_shelter

FROM {{ ref('animal_outcomes_facts_new') }} AS f

JOIN {{ ref('animals_dimension') }} AS a ON (f.animal_id = a.animal_id)

WHERE f.days_in_shelter > 0

Visualization

The visualization for the average number of days in the Austin Animal Center for an animal by animal type and breed answers the project requirements by using a bar graph to display the average number of days in the animal center by type and breed. The metric (AVG days_in_shelter) and the dimensions animal_breed and animal_type build the visualization to address the question. Here, Chihuahua Longhair/Silky Terrier Dogs tend to spend the least amount of time in the animal shelter.

(3) Average Days By Year And Animal Type

Transformation

{{

config(

materialized="table"

)

}}

SELECT b.animal_type,

EXTRACT(YEAR from f.intake_time),

f.days_in_shelter

FROM {{ ref('animal_outcomes_facts_new') }} AS f

JOIN {{ ref('animal_outcomes_dimension_new') }} AS a ON (f.animal_id = a.animal_id)

JOIN {{ ref('animals_dimension') }} AS b ON (f.animal_id = b.animal_id)

WHERE f.days_in_shelter > 0

Visualization

The visualization for the average number of days in the Austin Animal Center for an animal by year and by animal type answers the project requirements by using a bar graph to display the average number of days in the animal center by year and by animal type together. The metric (AVG days_in_shelter) and the dimensions date_part and animal_type build the visualization to address the question.

2. Number of Intakes and Outcomes Per Year

Transformation

-- queries2intermidiate_new.sql

{{

config(

materialized="table"

)

}}

SELECT a.animal_id,

EXTRACT(YEAR from a.intake_time) as year_intake,

EXTRACT(YEAR from a.outcome_time) as year_outcome

FROM {{ ref('animal_outcomes_dimension_new') }} as a

-- queries2intermidiate_new.sql

{{

config(

materialized="table"

)

}}

SELECT year_intake,

COUNT(animal_id) as count_intake

FROM {{ ref('queries2intermidiate_new') }}

GROUP BY year_intake

-- queries2part2_new.sql

{{

config(

materialized="table"

)

}}

SELECT year_outcome,

COUNT(animal_id) as count_outcome

FROM {{ ref('queries2intermidiate_new') }}

GROUP BY year_outcome

-- finalqueries2_new.sql

{{

config(

materialized="table"

)

}}

select *

FROM {{ ref('queries2part1_new') }} AS a

JOIN {{ ref('queries2part2_new') }} AS b ON (a.year_intake = b.year_outcome)

Visualization

Utilizing a bar graph the visualization for the number of animals that come in and out of AAC every year answers the project requirements by displaying the number of animals that come in and out of AAC every year. For the most part, there are similar numbers of intakes and outcomes per year.

3. Number of Intakes Per Month

Transformation

{{

config(

materialized="table"

)

}}

SELECT TO_CHAR(animal_outcomes_facts_new.intake_time, 'Month') as month,

COUNT(animal_outcomes_facts_new.intake_time) AS num_animals_in

FROM {{ ref ('animal_outcomes_facts_new') }}

GROUP BY month

ORDER BY COUNT(animal_outcomes_facts_new.intake_time) DESC

Visualization

Using a bar graph for the sum of the number of intakes each month, the visualization and answers the project requirements as the client can see the three months with the highest intake for the busiest months. The busiest months are May, June, and October and the slowest months are January, December, and February.

4. Adoptions by Animal Type

Transformation

{{

config(

materialized="table"

)

}}

SELECT COUNT(outcome_type) as num_adoptions,

d.animal_type

FROM {{ ref ('animal_outcomes_facts_new') }} AS a

JOIN {{ ref ('animals_dimension') }} AS d ON a.animal_id = d.animal_id

JOIN {{ ref ('outcome_types_dimension') }} AS o ON a.outcome_types_id = o.outcome_type_id

WHERE outcome_type = 'Adoption'

GROUP BY d.animal_type

ORDER BY COUNT(outcome_type)DESC

Visualization

The visualization for the adoptions by animal type answers the project requirements by using a pie chart to show the number of each animal type that was adopted and the relative magnitude of the number of adoptions compared to others. Here, we can see that out of all adoptions, almost 2/3 of all adoptions are for dogs. This may suggest that dogs are the most likely animals to get adopted.

Document visualization decisions

Yuning Zhang(Me)

There are many options for visualizations like pie charts, line graphs, and bar charts. We used a pie chart for 'Average Number of Days in AAC by Animal Type' and 'Adoptions by Animal Type' since both have only two variables (e.g. for 'Average Number of Days in AAC by Animal Type', two variables are average number of days in AAC and animal type) and a pie chart is very effective at measuring the relationship between two variables. The radius of the circle can be made proportional to the quantity of animal type, so it's easy to compare the average number of days in AAC and adoptions among different animal types. We chose a bar chart to measure the Average Number of Days in AAC by Type and Breed, Average Days By Year And Animal Type, Number of Intakes and Outcomes Per Year and Number of Intakes Per Month since we want to know the several highest results, and in the bar chart we can range them in a descending order so it's easy to grab the data the client wants.

We choose different color if there is a clear comparison like intakes verses outcomes per year in 'Number of Intakes and Outcomes Per Year'. For pie chart, we use bright color for small proportional data to make it obvious.

Reflection and challenges addressed

Carla Gonzalez, Yuning Zhang (Me)

A couple of challenges we faced were that we had to backtrack our project when we found inconsistent data. We found rows where the same animal had the same intake time, which should be impossible. In addition, we found rows that showed the outcome_time to be before the intake_time, resulting in an animal staying negative days in the shelter. In order to solve these problems, we cleaned our data with extra queries, created a clean staging table, and reran our code to create the dimensional tables and star schema with the clean data. While this problem took some time to figure out, it helped us have a better understanding of creating the data pipeline and also helped us appreciate how quickly we can make adjustments the data pipeline when needed.

We learned about composite primary keys in this project. We discovered them from the data and validated our hypothesis by understanding the meaning of the ACC animal intake and outcome processes. We also learned to use 'reference' at the building models stage so that dbt knew what was the order to implement different queries.

Page updated

Report abuse