The Google Data Analytics Professional Certificate on Coursera is a great course that I am happy to say is a rewarding program. The final part in getting my certificate for this course is to do an optional capstone study on a scenario. This case study pulls everything I have learned into one project that shows all my skills.
In this case study I am a Junior Data Analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. My responsibility in this project is to find out and understand how casual riders and annual members use Cyclistic bikes differently. From these insights my team will design a new marketing strategy to convert casual riders into annual members. But first we need Cyclistic executives to approve my recommendations. For this we will need to have my recommendation backed up with compelling data insights and professional data visualizations.
Cyclistic is a bike-share company that was launched in 2016. Since the launch the program has grown to a fleet of 5,824 bicycles that are Geo tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time. In the past Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. This was possible because of the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Kelly Moreno, director of marketing and my manager believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, she believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
The business task or questions that will guide future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
For this study I will be using Cyclistic historical data, which is found and can be downloaded at this link Cyclistic Historical Data. The data is made available by Motivate International Inc. under this license. The data that I will be working with is from April, 2019 to March, 2020 for our analysis.
For preparing for analysis I started by downloading the zip files into a folder and storing them appropriately. Then create a subfolder for the .CSV file so that I have a copy of the original data. First thing I did was to glance at the data using Microsoft Excel. After looking over the data in all four files the best choice for analysis is RStuido in order to process over 5 million records.
I will be using RStudio to process the data to clean.
we first need to install and load the packages required, which in this case are Tidyverse, Lubridate, and ggplot.
#Install Packages
```{r}
install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot")
```
#install Library
```{r}
library(tidyverse)
library(lubridate)
library(ggplot2)
```
Once the packages and library are loaded then .CSV files are needed to be collected. The files then were renamed and shortened for easy reading.
#Upload Divvy datasets(csv files)
```{r}
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
```
After all the files have been collected we will need to get ready to combine it into one single data file. First we have to make sure all the labels are consistent with q1_2020 names. After this we will combine the four files into a single file (all_trips) Using this codes we can make this possible.
#Rename columns to make them consisent with q1_2020
```{r}
(q4_2019 <- rename(q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
```
```{r}
(q3_2019 <- rename(q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
```
```{r}
(q2_2019 <- rename(q2_2019
,ride_id = "01 - Rental Details Rental ID"
,rideable_type = "01 - Rental Details Bike ID"
,started_at = "01 - Rental Details Local Start Time"
,ended_at = "01 - Rental Details Local End Time"
,start_station_name = "03 - Rental Start Station Name"
,start_station_id = "03 - Rental Start Station ID"
,end_station_name = "02 - Rental End Station Name"
,end_station_id = "02 - Rental End Station ID"
,member_casual = "User Type"))
```
#Convert ride_id and rideable_type to character so that they can stack correctly
```{r}
q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
```
#Stack individual quarter's data frames into one big data frme
```{r}
all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)
Once all the files have been combined we can start the analysis. First removing the lat, long, birthyear, and gender fields. This is because they were dropped from the in 2020 set data. Then we start with clean up process and fix few things.
1. In the "member_casual" column, there are two names for members ("member" and "Subscriber") and two names for casual riders ("Customer" and "casual"). We will need to consolidate that from four to two labels.
2. We will want to add some additional columns of data such as day, month, year that can provide additional information.
3. We will want to add a calculated field for length of ride since the 2020Q1 data did not have the "tripduration" column. We will add "ride_length" to the entire dataframe.
4. There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.
5. In the "member_casual" column, replace "Subscriber" with "member" and "Customer" with "casual"
#Reassign to the desired values
```{r}
all_trips <- all_trips %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
```
# Add columns that list the date, month, day, and year of each ride
```{r}
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
#Add a "ride_length" calculation to all_trips(in seconds)
```{r}
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
```
#Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
# We will create a new version of the dataframe (v2) since data is being removed
```{r}
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
Now with data file is clean and bad data is removed we can analyze further. We can analyze ridership by weekday. We can visualize the number or rides by rider type and figure out what that data tells us. We can share the files by using the visualization tools in RStudio that can chart graph to show what data means.
# analyze ridership data by type and weekday
```{r}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) # sorts
```
#visualize the number of rides by rider type
```{r}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
ggsave('hotel_booking_chart.png', width=7,
height=7)
```
What does data tell us about Number of Ride for Week
Member users ride more on weekdays.
Member users use the service most on the middle of the week.
Casual user ride more on weekends.
Casual users use service less on middle of the week especially on Wednesdays.
# visualization for average duration
```{r}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
ggsave('average_durationh.png', width=7,
height=7)
```
What does data tell us about Time Spent for Week
Casual users ride for longer length of time then member rider
Member users have consistent average length of time through out the week
What does data tell us about the monthly Uses
Both casual and member users have increase of uses in summer months
Compare to Member, casual rider have gradual increase in summer months
Based on the analysis my conclusion is that casual users are riding more in summer months in the city then members users are. Weather played a key role for the casual users. As the warmer weather goes, casual users utilize the service more. January to March had a decrease in casual riders, this could be due to late winter start where snow could have played a role. Casual users spend a lot more time on the ride then the member users. Casual users are riding more on weekends than the member users.
From this information my recommendation is to create a marketing plan for casual users for the summer months. Targeting the casual users early in the year to be ready for the summer months. Offer discounts for colder months to attract casual users to increase the uses in these months. Create a rewards program to attract the casual users and offer to both users to join the membership benefits and rewards to retain the users.