Case Study 1 - Cyclistic Bike-share
The data has been made available by Motivate International Inc. for case study purpose at below link
Introduction
Following are my detailed analysis and findings for Google Data Analytics Certification course for Case Study 1 - Cyclistic Bike-share.
In this case study, I used Cyclistic Bike-share data to analyze how annual members and casual riders use Cyclistic bikes differently, determine trends, and make 3 top recommendations to stakeholders based on those findings. R programming and R Studio are used for this analysis as total of 4 million+ row of data is not possible with excel spreadsheet. 8 charts from analysis and findings as well as R script are included at the end of this analysis.
To answer the key business questions, I follow
· the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
· Case Study Roadmap tables — including guiding questions and key tasks — to stay on the right path.
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Characters and teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
About the company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Ask
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
You will produce a report with the following deliverables:
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top three recommendations based on your analysis
My Analysis as follows:
Phase 1: Ask
Guiding questions
● What is the problem you are trying to solve?
To identify and provide insight on differences between members and casual to marketing team for marketing campaigns to convert casual riders into annual members in order to increase profitability.
● How can your insights drive business decisions?
The insights will help marketing team design right marketing strategy to convert casual riders into annual members.
Deliverable
A clear statement of the business task
To find key differences between casuals and members for marketing team to design right marketing strategy to convert casual riders into annual members.
Phase 2: Prepare
Guiding questions
● Where is your data located?
The data is public data made available by Motivate International Inc.
● How is the data organized?
The data are organized into csv files by month, quarter and whole year from 2013 till current. For this analysis, data for last 12 months from Aug 2020 till Jul 2021 are used.
● Are there issues with bias or credibility in this data? Does your data ROCCC?
This data are primary and structured data collected by Motivate International Inc and make available to public. It should ROCCC (Reliable, Original, Comprehensive, Current, Cited).
● How are you addressing licensing, privacy, security, and accessibility?
This data has been made available by Motivate International Inc. under license for case study purposes and there is no personal identifiable information or sensitive information in the dataset.
● How did you verify the data’s integrity?
All 12 month csv files are checked to have same data structure except csv file for 4 months with 2 columns (started_station_id, ended_station_id) with different vector type.
● How does it help you answer your question?
Having the 12 months data put together, the trends and riders’ bike usage up to a year should provide some key insights.
● Are there any problems with the data?
4 months (Aug 2020 to Dec 2020) data with 2 columns (started_station_id, ended_station_id) with different vector type need to change to same vector type.
Deliverable
A description of all data sources used
12 months data is downloaded, credibility and integrity verified, unzipped and data structure identified and ready to process.
Phase 3: Process
Guiding questions
● What tools are you choosing and why?
RStudio is used as the amount of data is more than 4 million rows which make excel not possible.
● Have you ensured your data’s integrity?
Yes, all 12 monthly csv files have consistent data structure and binds successfully for analysis.
● What steps have you taken to ensure that your data is clean?
First, null values and NA are dropped. Secondly, zero or negative value for duration of ride length were removed before analysis.
● How can you verify that your data is clean and ready to analyze?
Drop NA function used and ride length checked with minimum function to verify no zero or negative value in ride length.
● Have you documented your cleaning process so you can review and share those results?
Yes, cleaning and manipulation of data has been documented in R script as below.
Deliverable
Documentation of any cleaning or manipulation of data
Cleaning and manipulation of data has been documented in R script as below.
Phase 4: Analyze
Guiding questions
● How should you organize your data to perform analysis on it?
12 monthly data files are organized into a single csv files for analysis.
● Has your data been properly formatted?
Yes, all columns are properly formatted before organized into single csv.
● What surprises did you discover in the data?
Surprise to find out casuals have long average ride duration compare to members on every day of week.
● What trends or relationships did you find in the data?
1. There are more members than casuals.
2. Total bike usage pick up from May and peak at Jul and gradually decrease to lowest usage in Feb.
3. There are more classic bike used by casuals during Saturdays and Sundays.
4. Members have shorter average ride duration.
5. More casuals ride on Fridays, Saturdays and Sundays than Monday to Thursday.
● How will these insights help answer your business questions?
These insights help to bring a better understanding of bike used patterns for members and casuals.
Deliverable
A summary of your analysis
Members in general taking shorter rides than casuals which is most time for commuting to work. The average ride time remains similar throughout the week whereas casuals have average longer ride time during Fridays, Saturdays and Sundays.
Phase 5: Share
Guiding questions
● Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
Yes. Based on findings, there are differences between members and casual riders.
● What story does your data tell?
The story the data tells is that casual riders use bike on average more than 2 times longer duration than members on each day. Casual riders also ride more often and longer duration on Saturdays and Sundays. Apr to Oct seems to be popular period for casual riders.
● How do your findings relate to your original question?
The findings provide a view of different profile for casual riders and members, relating “How do annual members and casual riders use Cyclistic bikes differently?”
● Who is your audience? What is the best way to communicate with them?
The audience is my manager, Lily Moreno and my cyclistic marketing analytics team. The best way to communicate is through data visualization and slide presentation of the findings.
● Can data visualization help you share your findings?
Yes, data visualization is best way to share my findings.
● Is your presentation accessible to your audience?
Yes, distinguishing colors and labellings are used to make plots accessible.
Deliverable
Supporting visualizations and key findings
8 charts for visualizations and key findings are presented as follow:
Phase 6: Act
Guiding questions
● What is your final conclusion based on your analysis?
Members and casual riders have different bike usage profiles. Members ride frequency and duration are consistent throughout the 7 days a week likely for work. Casual riders have more ride and longer duration during Saturdays and Sundays which likely for leisure, exercise. Apr to Oct seems to be popular period for casual riders.
● How could your team and business apply your insights?
The team can focus on marketing campaign to convert casual into members by promoting riding to work as a green way to reduce carbon emission and save the earth and a form of exercise for healthy living.
● What next steps would you or your stakeholders take based on your findings?
Next steps can be conducting a survey targeting casual members
1) to understand their bike usage, e.g leisure, exercise
2) share knowledge on carbon emission reduction and its benefits to the earth
3) find out what are the factors that will entice them to convert to members
● Is there additional data you could use to expand on your findings?
1) zip code or area of casual rider address to determine if they stay near cyclistic service area
2) climate/ temperature which might be affecting the number of rides from Nov to Mar
Deliverable
Your top three recommendations based on your analysis
1. Conduct a survey targeting casual members with objectives as mentioned above.
2. Build a marketing campaign targeting casual riders that riding as a green way to get to work, health and other benefits they can get from this exercise.
3. Targeting the campaign and ads on social media during the popular months from Apr to Oct where there are high number of casual riders.
R script
install.packages("tidyverse") # Install tidyverse package
library("tidyverse") # Load tidyverse package
aug_2020 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202008-divvy-tripdata/202008-divvy-tripdata.csv") # Load all csv file
sep_2020 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202009-divvy-tripdata/202009-divvy-tripdata.csv")
oct_2020 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202010-divvy-tripdata/202010-divvy-tripdata.csv")
nov_2020 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202011-divvy-tripdata/202011-divvy-tripdata.csv")
dec_2020 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202012-divvy-tripdata/202012-divvy-tripdata.csv")
jan_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202101-divvy-tripdata/202101-divvy-tripdata.csv")
feb_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202102-divvy-tripdata/202102-divvy-tripdata.csv")
mar_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202103-divvy-tripdata/202103-divvy-tripdata.csv")
apr_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202104-divvy-tripdata/202104-divvy-tripdata.csv")
may_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202105-divvy-tripdata/202105-divvy-tripdata.csv")
jun_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202106-divvy-tripdata/202106-divvy-tripdata.csv")
jul_2021 <- read_csv("../input/cyclistic-bikeshare-user-analysis/202107-divvy-tripdata/202107-divvy-tripdata.csv")
spec(aug_2020) # Check column specification
spec(sep_2020)
spec(oct_2020)
spec(nov_2020)
spec(dec_2020)
spec(jan_2021)
spec(feb_2021)
spec(mar_2021)
spec(apr_2021)
spec(may_2021)
spec(jun_2021)
spec(jul_2021)
aug_2020 <- mutate(aug_2020,start_station_id=as.character(start_station_id),end_station_id=as.character(end_station_id)) # mutate start and end station id to character
sep_2020 <- mutate(sep_2020,start_station_id=as.character(start_station_id),end_station_id=as.character(end_station_id)) # mutate start and end station id to character
oct_2020 <- mutate(oct_2020,start_station_id=as.character(start_station_id),end_station_id=as.character(end_station_id)) # mutate start and end station id to character
nov_2020 <- mutate(nov_2020,start_station_id=as.character(start_station_id),end_station_id=as.character(end_station_id)) # mutate start and end station id to character
data_all <- rbind(aug_2020,sep_2020,oct_2020,nov_2020,dec_2020,jan_2021,feb_2021,mar_2021,apr_2021,may_2021,jun_2021,jul_2021) # Combine 12 months data
data_all_v1 <- data_all %>% drop_na() %>% mutate(date=as.Date(started_at)) # remove rows with blank value
data_all_v1$ride_length <- difftime(data_all_v1$ended_at,data_all_v1$started_at) # ride length
data_all_v1$ride_length <- as.numeric(data_all_v1$ride_length)
data_all_v2<- data_all_v1[!(data_all_v1$ride_length<=0),] # remove ride length with '0' and negative value
mindate <- min(data_all_v2$date)
maxdate <- max(data_all_v2$date)
options(scipen = 999) # turn off scientific e notation in R to get values in numeric form - options(scipen = 0) to turn on
aggregate(data_all_v2$ride_length ~ data_all_v2$member_casual, FUN = min) # check no '0' and -ve values in min ride length
aggregate(data_all_v2$ride_length ~ data_all_v2$member_casual, FUN = mean)
mean_ride_year <- data_all_v2 %>% #1 comparing Member and Casual users average ride length yearly
group_by(member_casual) %>%
summarise(average_duration=mean(ride_length), .groups="drop")
ggplot(mean_ride_year) +
geom_col(mapping=aes(x=member_casual,y=average_duration,fill=member_casual)) +
labs(title="Member vs Casual users average ride duration during a year", x="User type", y="Average ride duration in seconds",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
data_all_v2$date <- as.Date(data_all_v2$started_at)
data_all_v2$day_of_week <- format(as.Date(data_all_v2$started_at), "%A")
data_all_v2$day_of_week <- ordered(data_all_v2$day_of_week, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
aggregate(data_all_v2$ride_length ~ data_all_v2$member_casual + data_all_v2$day_of_week, FUN = mean)
mean_ride_day <- data_all_v2 %>% #2 Member versus Casual users average ride duration during each day of week for a year
group_by(member_casual, day_of_week) %>%
summarise(average_duration=mean(ride_length), .groups="drop")
ggplot(mean_ride_day) +
geom_col(mapping=aes(x=day_of_week,y=average_duration,fill=member_casual),position = "dodge") +
labs(title="Member vs Casual users average ride duration during each day of week for a year", x="User type", y="Average ride duration in seconds",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
bike_type <- data_all_v2 %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike" | rideable_type=="docked_bike")
bike_type %>%
group_by(member_casual,rideable_type) %>%
summarise(totals=n(), .groups="drop") %>% #3 comparing Member and Casual users bike type yearly
ggplot() +
geom_col(mapping=aes(x=member_casual,y=totals,fill=rideable_type),position = "dodge") +
labs(title="Member vs Casual users total bike type used in a year", x="User type", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
data_all_v2$month <- format(as.Date(data_all_v2$started_at), "%m")
bike_type_monthly <- data_all_v2 %>%
group_by(member_casual,rideable_type, month) %>%
summarise(totals=n(), .groups="drop") #4 comparing Member and Casual users bike type monthly for a year
ggplot(data=bike_type_monthly) +
geom_col(mapping=aes(x=month,y=totals,fill=rideable_type),position = "dodge") + facet_wrap(~member_casual) +
labs(title="Member vs Casual users total bike type used in each month for a year", x="month", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
bike_type %>%
group_by(member_casual,rideable_type, day_of_week) %>%
summarise(totals=n(), .groups="drop") %>% #5 comparing Member and Casual users bike type weekday for a year
ggplot() +
geom_col(mapping=aes(x=day_of_week,y=totals,fill=rideable_type),position = "dodge") + facet_wrap(~member_casual) + theme(axis.text.x = element_text(angle = 45)) +
labs(title="Member vs Casual users total bike type used in each day for a year", x="day of week", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
ride_yearly <- data_all_v2 %>%
group_by(member_casual) %>%
summarise(number_of_rides=n(), .groups="drop") #6 comparing Member and Casual users no of ride for a year
ggplot(data=ride_yearly) +
geom_col(mapping=aes(x=member_casual,y=number_of_rides,fill=member_casual)) +
labs(title="Member vs Casual users total number of rides in a year", x="User type", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
ride_monthly <- data_all_v2 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides=n(), .groups="drop") #7 comparing Member and Casual users no of ride monthly for a year
ggplot(data=ride_monthly) +
geom_col(mapping=aes(x=month,y=number_of_rides,fill=member_casual),position = "dodge") +
labs(title="Member vs Casual users total number of rides in each month for a year", x="month", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")
ride_daily <- data_all_v2 %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides=n(), .groups="drop") #8 comparing Member and Casual users no of ride daily for a year
ggplot(data=ride_daily) +
geom_col(mapping=aes(x=day_of_week,y=number_of_rides,fill=member_casual),position = "dodge") +
labs(title="Member vs Casual users total number of rides in each day for a year", x="day of week", y="Number of rides",
subtitle=paste0("Data from: ", mindate, " to ", maxdate), caption = "Data by Motivate International Inc")