Introduction
Welcome to the Cyclistic bike-share analysis case study! In this case study, I work for a fictional company, Cyclistic, along with some key team members. In order to answer the business questions, we follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Scenario
I work as a junior data analyst on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Ask
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Prepare
Where is your data located?
- The data has been made available by Motivate International Inc. under this license.
How is the data organized?
- The data is organized into quarterly datasets. I will be using the Divvy 2019 Q1 and Divvy 2020 Q1 datasets for this project.
Are there issues with bias or credibility in this data?
- No issues with bias or credibility. Both datasets contain similar data, but some columns are named slightly differently.
How are you addressing licensing, privacy, security, and accessibility?
- The data is made available under this license.
How did you verify the data’s integrity?
- By reviewing the source and license linked above.
How does it help you answer your question?
- Starting by performing the following calculations: create a column called ride_length and calculate the length of each ride by subtracting the started_at column from the ended_at column (for example, =D2-C2), formatting it as HH:MM:SS using Format > Cells > Time > 37:30:55. Then create a column called day_of_week and calculate the day of the week that each ride started using the WEEKDAY function (for example, =WEEKDAY(C2,1)) in each file. Format as General or as a number with no decimals, noting that 1 = Sunday and 7 = Saturday.
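The same two columns can be sketched in R for comparison. This is a quick illustration with hypothetical timestamps (the real columns come from the Divvy CSVs), not part of the spreadsheet workflow:

```r
library(lubridate)  # for wday()

# Two hypothetical rides standing in for the started_at / ended_at columns
started_at <- as.POSIXct(c("2019-01-06 09:00:00", "2019-01-07 17:30:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2019-01-06 09:12:30", "2019-01-07 18:45:00"), tz = "UTC")

# ride_length: same idea as =D2-C2 in Sheets
ride_length <- ended_at - started_at

# day_of_week: same convention as =WEEKDAY(C2,1), i.e. 1 = Sunday ... 7 = Saturday
day_of_week <- wday(started_at)

ride_length  # time differences of 12.5 and 75 minutes
day_of_week  # 1 (Sunday) and 2 (Monday)
```
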
Are there any problems with the data?
Not at the moment for Sheets/Excel. Later on, when working with BigQuery and R, the tables will have to be merged into one.
Process
What tools are you choosing and why?
- Google Sheets for a quick preview, Tableau for visualizations, BigQuery to merge the data, and R. Due to the large amount of data, I will be using mainly R for this project.
Have you ensured your data’s integrity?
- The data has been made available by Motivate International Inc. under this license.
What steps have you taken to ensure that your data is clean?
- First in Google Sheets, by creating two new columns, ride_length and day_of_week, in both files and checking the results.
How can you verify that your data is clean and ready to analyze?
- Now both the Divvy 2019 Q1 and Divvy 2020 Q1 datasets have the same columns added: ride_length and day_of_week (which are needed for the analysis ahead).
Have you documented your cleaning process so you can review and share those results?
Yes, the entire process is documented in this write-up.
Analyze
How should you organize your data to perform analysis on it?
Still in Google Sheets, by running a few calculations in one file to get a better sense of the data layout: the mean of ride_length, the max of ride_length, and the mode of day_of_week.
Has your data been properly formatted?
I also created a pivot table to quickly calculate and visualize the data. Options:
- Calculate the average ride_length for members and casual riders: Rows = member_casual; Values = Average of ride_length
- Calculate the average ride_length for users by day_of_week: Columns = day_of_week; Rows = member_casual; Values = Average of ride_length
- Calculate the number of rides for users by day_of_week by adding Count of trip_id to Values
What surprises did you discover in the data?
In both datasets, the number of rides by members is significantly larger. In Divvy_Trips_2020_Q1 there were 378,407 rides by members vs 48,480 by casual riders.
In Divvy_Trips_2019_Q1, there were 341,906 rides by members vs 23,163 by casual riders.
What trends or relationships did you find in the data?
Another interesting finding was the average ride length for members vs casual riders: members average about 12 minutes per ride, versus over an hour for casual riders.
How will these insights help answer your business questions?
With these insights we can make business decisions / suggestions, such as promotions and incentives that could motivate non members to become members.
From what we can see, this also helps us answer our very first question: How do annual members and casual riders use Cyclistic bikes differently?
As the data suggests, members are more dedicated and their rides are steadier throughout the week, while casual riders take longer rides but on far fewer days.
Now that we spent some time working with the individual spreadsheets, we can merge them together. I will now show how it can be done in SQL / BigQuery:
-- Rename columns to make them consistent with Divvy_Trips_2020
SELECT
  * EXCEPT(trip_id, bikeid, start_time, end_time, from_station_name, from_station_id, to_station_name, to_station_id, usertype),
  trip_id AS ride_id,
  bikeid AS rideable_type,
  start_time AS started_at,
  end_time AS ended_at,
  from_station_name AS start_station_name,
  from_station_id AS start_station_id,
  to_station_name AS end_station_name,
  to_station_id AS end_station_id,
  usertype AS member_casual
FROM `playground-cacb5.bikeshare_v2.Divvy_Trips_2019` -- SAVE result AS V2
-- Convert ride_id and rideable_type to character so that they can stack correctly with Divvy_trips_2020
SELECT
tripduration,
gender,
birthyear,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length,
day_of_week,
CAST(ride_id AS STRING) AS ride_id,
CAST(rideable_type AS STRING) AS rideable_type
FROM `playground-cacb5.bikeshare_v2.Divvy_Trips_2019_V2` --SAVE result AS V3
-- Remove one column at a time (start_lat, start_lng, end_lat, end_lng, birthyear, gender, tripduration); the lat/lng columns must also be removed from Divvy_Trips_2020
ALTER TABLE
`playground-cacb5.bikeshare_v2.Divvy_Trips_2019_V3`
DROP COLUMN IF EXISTS tripduration -- starting with tripduration
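The remaining columns can be dropped the same way, one statement at a time. A sketch, assuming the same dataset and column names (the lat/lng columns live in the 2020 table, so they are dropped there):

```sql
-- Drop the remaining 2019-only columns
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2019_V3` DROP COLUMN IF EXISTS gender;
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2019_V3` DROP COLUMN IF EXISTS birthyear;

-- And the 2020-only columns from the other table
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2020` DROP COLUMN IF EXISTS start_lat;
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2020` DROP COLUMN IF EXISTS start_lng;
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2020` DROP COLUMN IF EXISTS end_lat;
ALTER TABLE `playground-cacb5.bikeshare_v2.Divvy_Trips_2020` DROP COLUMN IF EXISTS end_lng;
```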
Now both datasets have matching columns and data types, and it is time to combine them into one:
-- Stack the individual data frames into one big data frame - SAVE AS Divvy_2019_and_2020
SELECT
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_id,
rideable_type,
ride_length,
day_of_week
FROM
`playground-cacb5.bikeshare_v2.Divvy_Trips_2020`
UNION ALL
SELECT
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_id,
rideable_type,
ride_length,
day_of_week
FROM `playground-cacb5.bikeshare_v2.Divvy_Trips_2019_V3`
Now that both tables are combined, we have a total of 791,956 rows: 365,069 from Divvy_Trips_2019_V3 plus 426,887 from Divvy_Trips_2020.
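The row count can be verified directly in BigQuery, a quick sketch against the same dataset:

```sql
-- Verify the stacked table has all rows from both sources
SELECT COUNT(*) AS total_rows
FROM `playground-cacb5.bikeshare_v2.Divvy_2019_and_2020`
-- Expected: 791,956 (365,069 + 426,887)
```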
We can now explore the data, looking at distinct values as well as summary statistics such as the maximum, minimum, and mean.
UPDATE `playground-cacb5.bikeshare_v2.Divvy_2019_and_2020`
SET
member_casual = 'member'
WHERE
member_casual = 'Subscriber'
-- In the "member_casual" column, replace "Subscriber" with "member" and "Customer" with "casual"
-- Note: UPDATE statements are not available in the BigQuery free tier (sandbox), but this can also be done in R, as we will do next.
Now I will show how all of this can be done in R. Follow these steps:
Open your preferred version of R (you may download it or use it in the cloud). Click this link to follow along and select “Use template”, then copy and paste the text from the template into an R script. Here is a shortcut to both the Divvy 2019 Q1 and Divvy 2020 Q1 datasets, which I will be using for this project.
install.packages("tidyverse")
library(tidyverse) #helps wrangle data
# Use the conflicted package to manage conflicts
library(conflicted)
# Set dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
#=====================
# STEP 1: COLLECT DATA
#=====================
# Upload Divvy datasets (csv files) here
q1_2019 <- read_csv("C:/Users/fiuza/OneDrive/Desktop/Divvy_Trips_2019_Q1.csv")
q1_2020 <- read_csv("C:/Users/fiuza/OneDrive/Desktop/Divvy_Trips_2020_Q1.csv")
#====================================================
# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE
#====================================================
# Compare column names of each of the files
# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file
colnames(q1_2019)
colnames(q1_2020)
# Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)
(q1_2019 <- rename(q1_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype
))
# Inspect the dataframes and look for incongruencies
str(q1_2019)
str(q1_2020)
# Convert ride_id and rideable_type to character so that they can stack correctly
q1_2019 <- mutate(q1_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
# Stack individual quarter's data frames into one big data frame
all_trips <- bind_rows(q1_2019, q1_2020)
# Remove lat, long, birthyear, and gender fields as this data was dropped beginning in 2020
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "tripduration"))
#======================================================
# STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
#======================================================
# Inspect the new table that has been created
colnames(all_trips) #List of column names
nrow(all_trips) #How many rows are in data frame?
dim(all_trips) #Dimensions of the data frame?
head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)
str(all_trips) #See list of columns and data types (numeric, character, etc)
summary(all_trips) #Statistical summary of data. Mainly for numerics
# There are a few problems we will need to fix:
# (1) In the "member_casual" column, there are two names for members ("member" and "Subscriber") and two names for casual riders ("Customer" and "casual"). We will need to consolidate that from four to two labels.
# (2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.
# (3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the "tripduration" column. We will add "ride_length" to the entire dataframe for consistency.
# (4) There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.
# In the "member_casual" column, replace "Subscriber" with "member" and "Customer" with "casual"
# Before 2020, Divvy used different labels for these two types of riders ... we will want to make our data frame consistent with their current nomenclature
# N.B.: "Level" is a special property of a column that is retained even if a subset does not contain any values from a specific level
# Begin by seeing how many observations fall under each usertype
table(all_trips$member_casual)
# Reassign to the desired values (we will go with the current 2020 labels)
all_trips <- all_trips %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
# Check to make sure the proper number of observations were reassigned
table(all_trips$member_casual)
# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level
# https://www.statmethods.net/input/dates.html more on date formats in R found at that link
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
# Add a "ride_length" calculation to all_trips (in seconds)
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
# Inspect the structure of the columns
str(all_trips)
# Convert "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
# Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
# We will create a new version of the dataframe (v2) since data is being removed
# https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
3,767 rows of bad data were removed.
#=====================================
# STEP 4: CONDUCT DESCRIPTIVE ANALYSIS
#=====================================
# Descriptive analysis on ride_length (all figures in seconds)
mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length) #longest ride
min(all_trips_v2$ride_length) #shortest ride
# You can condense the four lines above to one line using summary() on the specific attribute
summary(all_trips_v2$ride_length)
# Compare members and casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
# See the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# Notice that the days of the week are out of order. Let's fix that.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# Now, let's run the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# analyze ridership data by type and weekday
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) # sorts
# Let's visualize the number of rides by rider type
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
# Let's create a visualization for average duration
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
#=================================================
# STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS
#=================================================
# Create csv files that we will visualize in Excel, Tableau, or presentation software
# N.B.: Adjust the file location for your operating system (on Windows, most likely "C:\Users\YOUR_USERNAME\Desktop\...") to export the data. You can read more here: https://datatofish.com/export-dataframe-to-csv-in-r/
counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
write.csv(counts, file = 'avg_ride_length.csv')
countRides <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n())
write.csv(countRides, file = 'C:/Users/fiuza/OneDrive/Desktop/number_ofRides.csv')
Tableau Interactive Viz
Share
Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
Annual members ride for less time per trip, but their number of rides is significantly larger. The busiest day for casual riders is Sunday; for members it is Tuesday.
What story does your data tell?
After combining both datasets (quarter one of 2019 and quarter one of 2020), we still see a much larger number of member rides vs casual rides: about 90% of all rides are taken by members.
The data also shows that casual riders spend more time per ride: they ride on fewer days, but when they do, their rides are much longer.
Tuesday is the busiest day for members, with 127,974 rides and an average of 769 seconds per ride (about 13 minutes), versus
Sunday for casual riders, whose busiest day averages 5,061 seconds per ride (about 1.4 hours).
How do your findings relate to your original question?
The numbers confirm that casual riders ride far more on weekends and that they account for roughly 10% of all rides.
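As a quick sanity check on that share, here is the arithmetic using the ride counts reported in the Analyze section (note these are ride counts, not distinct customers):

```r
# Ride counts from the Analyze section (2019 Q1 + 2020 Q1)
member_rides <- 341906 + 378407
casual_rides <- 23163 + 48480

total_rides  <- member_rides + casual_rides
casual_share <- casual_rides / total_rides

total_rides                    # 791956, matching the combined table
round(casual_share * 100, 1)   # about 9 percent of all rides are casual
```
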
Who is your audience? What is the best way to communicate with them?
Cyclistic executives. They will be approving marketing strategies based on my findings to convert more casual riders to members.
Can data visualization help you share your findings?
Yes, definitely. Here are some visualizations from R, as shared above, and some from Tableau, which make a lot more sense than just staring at numbers.
Is your presentation accessible to your audience?
Yes. I will be sharing my findings with them, along with all the material I used for this research.
Act
What is your final conclusion based on your analysis?
The busiest day of the week by number of rides is Sunday for casual riders, with 18,652 rides. The day casual riders spend the longest riding is Thursday, with an average of 2.3 hours per ride (8,451 seconds), versus 16 minutes for members on Saturdays. This is the group of customers we want to convert to members. I would recommend to the executives deciding which strategy to approve that we come up with some type of incentive to motivate casual riders to become members. As mentioned before, casual riders account for about 10% of all rides; they ride on fewer days, but when they do, they ride for much longer. We should focus on helping this group build a steadier riding habit so they don't eventually burn out and quit riding/exercising.
How could your team and business apply your insights?
By making it financially worthwhile for casual riders to convert, we can encourage a steadier riding habit so they don't eventually burn out and quit riding/exercising.
What next steps would you or your stakeholders take based on your findings?
I would calculate how much profit the conversions could generate versus what casual riders regularly bring in, then implement an advertising campaign with the marketing team to convert these casual riders into members.
Is there additional data you could use to expand on your findings?
I would provide all the data I used for the project and be open to suggestions in case new calculations were needed.