Austin Animal Center Outcomes Data Product
Yuning Zhang, Bhajleen Kaur, Carla Gonzalez, Lucas Mego
Yuning Zhang, Bhajleen Kaur, Carla Gonzalez, Lucas Mego
Tip: The person who did the part was labeled under each section heading in gray Italian font. If more than one person was labeled for a section, the person who did the subpart was labeled under subheadings in that section in gray Italian font.
Bhajleen Kaur
The AAC shelter director and her staff would like us to analyze the outcome data to help them answer the following questions:
What is the average number of days in AAC for an animal?
By animal type
By animal type and breed
By year by animal type
How many animals come into AAC every year, and how many go out?
What are the 3 busiest months (by number of intakes) and what are the 3 slowest months?
Which animal type has the most adoptions?
Therefore, we created a data pipeline that shared insights about the animals that had been taken into the Austin Animal Center (AAC) over the years to effectively answer these questions and any further questions related to the dataset we got. First, we needed to understand the dataset we had been given, then we loaded the data into Superset, created the dimensional model, created and ran the dbt model files, designed our data products, and created the visualizations.
Yuning Zhang (Me), Bhajleen Kaur, Carla Gonzalez
animals.csv:
https://docs.google.com/spreadsheets/d/1R9uSnLUgUEFvrWEJlgvL9TL21K_6QTTjTi_AikEGPkU/edit?usp=sharing
outcome_types:
https://docs.google.com/spreadsheets/d/169rKFzmosVkpqrp9o1YNrIQgX2GgxaCk315gGjykHBg/edit?usp=sharing
outcome_subtypes:
https://docs.google.com/spreadsheets/d/1ftYqeWkElH2aVgen4SQIScrII81W70_SzQjzZ-BIn3U/edit?usp=sharing
animal_outcomes:
https://docs.google.com/spreadsheets/d/1Y_hHN1-8k_b2TlSmrqKeq6Edxxym8AvZB8jOhHnPdwY/edit?usp=sharing
Concrete Example
Conceptual ER Diagram
Physical ER Diagram
Yuning Zhang (Me)
Concrete Example
Conceptual ER Diagram
Physical ER Diagram
Bhajleen Kaur, Carla Gonzalez
First we dropped the tables we created, to be able to make any changes in the future if necessary, making the tables idempotent. Then, created our tables:
```sql
-- First, have all the DROP TABLE IF EXISTS commands
DROP TABLE IF EXISTS staging.animals;
DROP TABLE IF EXISTS staging.outcome_subtypes;
DROP TABLE IF EXISTS staging.animal_outcomes;
DROP TABLE IF EXISTS staging.outcome_types;
-- Then have all of the CREATE TABLE IF NOT EXISTS commands
CREATE TABLE IF NOT EXISTS staging.animals (
animal_id TEXT PRIMARY KEY,
animal_name TEXT,
breed TEXT,
animal_coloring TEXT,
animal_type TEXT);
CREATE TABLE IF NOT EXISTS staging.outcome_subtypes (
id INTEGER PRIMARY KEY,
outcome_type_id INTEGER,
outcome_subtype TEXT);
CREATE TABLE IF NOT EXISTS staging.animal_outcomes (
animal_id TEXT,
outcome_type_id INTEGER,
outcome_subtype_id INTEGER,
intake_time TIMESTAMP,
outcome_time TIMESTAMP,
days_in_shelter INTEGER );
CREATE TABLE IF NOT EXISTS staging.outcome_types (
id INTEGER PRIMARY KEY,
outcome_type TEXT);
After having our create table statements ready, we went to the terminal and changed the directory to scripts, to finally run bash create_staging_tables.sh
Bhajleen Kaur, Carla Gonzalez
Our tables were now ready to hold data. We wrote these copy commands to load the data from each csv into its respective table:
```sql
TRUNCATE TABLE staging.outcome_types;
\copy staging.outcome_types FROM '../../data/outcome_types.csv' WITH HEADER CSV;
TRUNCATE TABLE staging.animal_outcomes;
\copy staging.animal_outcomes FROM '../../data/animal_outcomes.csv' WITH HEADER CSV;
TRUNCATE TABLE staging.animals;
\copy staging.animals FROM '../../data/animals.csv' WITH HEADER CSV;
TRUNCATE TABLE staging.outcome_subtypes;
\copy staging.outcome_subtypes FROM '../../data/outcome_subtypes.csv' WITH HEADER CSV;
Again, in our scripts directory we ran bash load_staging_data.sh
Bhajleen Kaur, Carla Gonzalez
Animals
Outcome_types
Outcome_subtypes
Animal Outcomes
Yuning Zhang (Me)
When we were observing the data in the csv file, we found:
There were negative values in the column of 'days_in_shelter', which was impossible.
An animal had multiple records. However, the 'intake_time' for these records were the same (i.e. 'animal_id' and 'intake_time' appeared the same), which was also impossible because a single animal could not be intaken simultaneously.
Therefore, we wanted to drop the records which had multiple records with the same 'intake_time' or had negative 'days_in_shelter'. Here are what we did:
-- clean_animal_outcomes_new1.sql
-- We used 'WHERE a.days_in_shelter >= 0' to filter the records with non-negative 'days_in_shelter'. 'GROUP BY a.animal_id, a.intake_time' made the records that both the 'animal_id' and 'intake_time' were identical in the same group, we used 'HAVING COUNT(*) = 1' to filter the group which contained only 1 record. In that case, the combination of 'a.animal_id' and 'a.intake_time' was unique for the records we selected.
{{
config(
materialized="table"
)
}}
SELECT a.animal_id, a.intake_time
FROM staging.animal_outcomes AS a
WHERE a.days_in_shelter >= 0
GROUP BY a.animal_id, a.intake_time
HAVING COUNT(*) = 1
-- clean_animal_outcomes_new2.sql
-- Since we used 'Group By' above, and only the columns in 'Group By' can be selected. However, we needed more info. than only 'animal_id' and 'intake_time', so we needed to join the table to get them.
{{
config(
materialized="table"
)
}}
SELECT c.animal_id, c.intake_time, a.outcome_type_id, a.outcome_subtype_id, a.outcome_time, a.days_in_shelter
FROM staging.animal_outcomes AS a
JOIN {{ ref('clean_animal_outcomes_new1') }} AS c ON (a.animal_id = c.animal_id and a.intake_time = c.intake_time)
In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select clean_animal_outcomes_new1.sql, then did $ dbt run --select clean_animal_outcomes_new2.sql
Sample data after cleaning (screenshot)
Yuning Zhang (Me)
We extract primary keys from each table(i.e. animal_id, outcome_type_id, outcome_subtype_id, and intake_time) as columns in the fact table since the primary keys are indivisible and unique, and they are the keys to retrieve a specific record. The atomic fact in fact table is days_in_shelter since it is numeric and additive, and the grain is a single stay by an animal in the shelter, which means a true visit that generates a new row of data for animal_id, outcome_type_id, outcome_subtype_id, and intake_time.
We have 4 dimensions, which are 'Animals_Dimension', 'Animal_Outcomes_Dimension', 'Outcome_Types_Dimension', 'Outcome_Subtypes_Dimension'. Although there are many duplicated attributes in the fact table and the 'Animal_Outcomes_Dimension' table, we still keep them instead of combining the attribute 'outcome_time' into the fact table and delete the 'Animal_Outcomes_Dimension' table because in the real scenario, there would probably be a lot more attributes associated with an animal outcome and those would have to go into the dimension table.
Examples of Concrete data (for the star schema)
Conceptual ER (for the star schema)
Physical ER (for the star schema)
-- animals_dimension.sql
{{ config(
materialized="table"
) }}
SELECT animal_id,
animal_name,
breed AS animal_breed,
animal_coloring,
animal_type
FROM staging.animals
-- outcome_types_dimension.sql
{{ config(
materialized="table"
) }}
SELECT animal_id,
animal_name,
breed AS animal_breed,
animal_coloring,
animal_type
FROM staging.animals
-- outcome_subtypes_dimension.sql
{{ config(
materialized="table"
) }}
SELECT id as outcome_subtype_id,
outcome_subtype
FROM staging.outcome_subtypes
-- animal_outcomes_dimension_new.sql
{{ config(
materialized="table"
) }}
SELECT animal_id,
intake_time,
outcome_time
FROM analytics.clean_animal_outcomes_new2
-- animal_outcomes_facts_new.sql
{{ config(
materialized="table"
) }}
SELECT a.animal_id,
os.id AS outcome_subtypes_id,
t.id AS outcome_types_id,
o.intake_time,
o.days_in_shelter
FROM analytics.clean_animal_outcomes_new2 AS o
JOIN staging.outcome_subtypes AS os ON (o.outcome_subtype_id = os.id)
JOIN staging.outcome_types AS t ON (o.outcome_type_id = t.id)
JOIN staging.animals AS a ON (o.animal_id = a.animal_id)
In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select animals_dimension.sql. Similarly, ran the rest four.
animal_dimensions
outcome_types_dimension
outcome_subtypes_dimension
animal_outcomes_dimension_new
animal_outcomes_facts_new
Yuning Zhang (Me), Bhajleen Kaur, Carla Gonzalez, Lucas Mego
Bhajleen Kaur
We created data products to store and organize the data effectively through determining the relationships between tables. Then we could retrieve the specific record or answer the questions from clients quickly and accurately. Through writing queries to join fact tables and different dimension tables by connecting key attributes, we could address all project requirements as we could answer all questions related to animals that are taken into AAC.
see tips below
(for loading into superset by run dbt, we just showed this step in the example 1(1), the following answers followed the exact same method, you just needed to change the file name after the 'select')
Tips:
Yuning Zhang (Me) did transformation in 1(1) and 2(2); Lucas Mego did transformation in 1(3); Bhajleen Kaur and Carla Gonzalez did transformation in 2; Carla Gonzalez did transformation in 3; Bhajleen Kaur did did transformation in 4.
Lucas Mego did explanations of visualization in 1,2,3,4.
Yuning Zhang (Me) corrected all content in this section after data cleaning since we did not do that at first.
What is the average number of days in AAC for an animal?
(1) By animal type
Transformation
{{
config(
materialized="table"
)
}}
SELECT a.animal_type,
f.days_in_shelter
FROM {{ ref('animal_outcomes_facts_new') }} AS f
JOIN {{ ref('animals_dimension') }} AS a ON (f.animal_id = a.animal_id)
WHERE f.days_in_shelter > 0
Loading into superset by run dbt
In terminal, we first used $ cd aac/src/dbt/aac/ to achieve the folder where 'models' folder was included, then did $ dbt run --select avg_number_days_byAnimalType_new.sql.
Visualization
The visualization for the average number of days in the Austin Animal Center for an animal by animal type answers the project requirements by illustrating the average number of days in the center by animal center in a pie graph. This pie graph displays the relative size of the average number of days when comparing one animal type to another. The metric AVG(days_in_shelter) and the dimension animal_type build the visualization to address the question. As we can see, livestock animals tend to have the longest stays in the shelter. It is important to rememeber that the sample size for livestock animals is significantly smaller than the other categories, so this number may not be representative of all livestock animals and the average number of days spent during an animal shelter stay.
(2) Average Number of Days in AAC by Type and Breed
Transformation
{{
config(
materialized="table"
)
}}
SELECT a.animal_type,
a.animal_breed,
f.days_in_shelter
FROM {{ ref('animal_outcomes_facts_new') }} AS f
JOIN {{ ref('animals_dimension') }} AS a ON (f.animal_id = a.animal_id)
WHERE f.days_in_shelter > 0
Visualization
The visualization for the average number of days in the Austin Animal Center for an animal by animal type and breed answers the project requirements by using a bar graph to display the average number of days in the animal center by type and breed. The metric (AVG days_in_shelter) and the dimensions animal_breed and animal_type build the visualization to address the question. Here, Chihuahua Longhair/Silky Terrier Dogs tend to spend the least amount of time in the animal shelter.
(3) Average Days By Year And Animal Type
Transformation
{{
config(
materialized="table"
)
}}
SELECT b.animal_type,
EXTRACT(YEAR from f.intake_time),
f.days_in_shelter
FROM {{ ref('animal_outcomes_facts_new') }} AS f
JOIN {{ ref('animal_outcomes_dimension_new') }} AS a ON (f.animal_id = a.animal_id)
JOIN {{ ref('animals_dimension') }} AS b ON (f.animal_id = b.animal_id)
WHERE f.days_in_shelter > 0
Visualization
The visualization for the average number of days in the Austin Animal Center for an animal by year and by animal type answers the project requirements by using a bar graph to display the average number of days in the animal center by year and by animal type together. The metric (AVG days_in_shelter) and the dimensions date_part and animal_type build the visualization to address the question.
2. Number of Intakes and Outcomes Per Year
Transformation
-- queries2intermidiate_new.sql
{{
config(
materialized="table"
)
}}
SELECT a.animal_id,
EXTRACT(YEAR from a.intake_time) as year_intake,
EXTRACT(YEAR from a.outcome_time) as year_outcome
FROM {{ ref('animal_outcomes_dimension_new') }} as a
-- queries2intermidiate_new.sql
{{
config(
materialized="table"
)
}}
SELECT year_intake,
COUNT(animal_id) as count_intake
FROM {{ ref('queries2intermidiate_new') }}
GROUP BY year_intake
-- queries2part2_new.sql
{{
config(
materialized="table"
)
}}
SELECT year_outcome,
COUNT(animal_id) as count_outcome
FROM {{ ref('queries2intermidiate_new') }}
GROUP BY year_outcome
-- finalqueries2_new.sql
{{
config(
materialized="table"
)
}}
select *
FROM {{ ref('queries2part1_new') }} AS a
JOIN {{ ref('queries2part2_new') }} AS b ON (a.year_intake = b.year_outcome)
Visualization
Utilizing a bar graph the visualization for the number of animals that come in and out of AAC every year answers the project requirements by displaying the number of animals that come in and out of AAC every year. For the most part, there are similar numbers of intakes and outcomes per year.
3. Number of Intakes Per Month
Transformation
{{
config(
materialized="table"
)
}}
SELECT TO_CHAR(animal_outcomes_facts_new.intake_time, 'Month') as month,
COUNT(animal_outcomes_facts_new.intake_time) AS num_animals_in
FROM {{ ref ('animal_outcomes_facts_new') }}
GROUP BY month
ORDER BY COUNT(animal_outcomes_facts_new.intake_time) DESC
Visualization
Using a bar graph for the sum of the number of intakes each month, the visualization and answers the project requirements as the client can see the three months with the highest intake for the busiest months. The busiest months are May, June, and October and the slowest months are January, December, and February.
4. Adoptions by Animal Type
Transformation
{{
config(
materialized="table"
)
}}
SELECT COUNT(outcome_type) as num_adoptions,
d.animal_type
FROM {{ ref ('animal_outcomes_facts_new') }} AS a
JOIN {{ ref ('animals_dimension') }} AS d ON a.animal_id = d.animal_id
JOIN {{ ref ('outcome_types_dimension') }} AS o ON a.outcome_types_id = o.outcome_type_id
WHERE outcome_type = 'Adoption'
GROUP BY d.animal_type
ORDER BY COUNT(outcome_type)DESC
Visualization
The visualization for the adoptions by animal type answers the project requirements by using a pie chart to show the number of each animal type that was adopted and the relative magnitude of the number of adoptions compared to others. Here, we can see that out of all adoptions, almost 2/3 of all adoptions are for dogs. This may suggest that dogs are the most likely animals to get adopted.
Yuning Zhang(Me)
There are many options for visualizations like pie charts, line graphs, and bar charts. We used a pie chart for 'Average Number of Days in AAC by Animal Type' and 'Adoptions by Animal Type' since both have only two variables (e.g. for 'Average Number of Days in AAC by Animal Type', two variables are average number of days in AAC and animal type) and a pie chart is very effective at measuring the relationship between two variables. The radius of the circle can be made proportional to the quantity of animal type, so it's easy to compare the average number of days in AAC and adoptions among different animal types. We chose a bar chart to measure the Average Number of Days in AAC by Type and Breed, Average Days By Year And Animal Type, Number of Intakes and Outcomes Per Year and Number of Intakes Per Month since we want to know the several highest results, and in the bar chart we can range them in a descending order so it's easy to grab the data the client wants.
We choose different color if there is a clear comparison like intakes verses outcomes per year in 'Number of Intakes and Outcomes Per Year'. For pie chart, we use bright color for small proportional data to make it obvious.
Carla Gonzalez, Yuning Zhang (Me)
A couple of challenges we faced were that we had to backtrack our project when we found inconsistent data. We found rows where the same animal had the same intake time, which should be impossible. In addition, we found rows that showed the outcome_time to be before the intake_time, resulting in an animal staying negative days in the shelter. In order to solve these problems, we cleaned our data with extra queries, created a clean staging table, and reran our code to create the dimensional tables and star schema with the clean data. While this problem took some time to figure out, it helped us have a better understanding of creating the data pipeline and also helped us appreciate how quickly we can make adjustments the data pipeline when needed.
We learned about composite primary keys in this project. We discovered them from the data and validated our hypothesis by understanding the meaning of the ACC animal intake and outcome processes. We also learned to use 'reference' at the building models stage so that dbt knew what was the order to implement different queries.