ASK
BUSINESS TASK: Design marketing strategies aimed at converting casual riders into annual members by finding out how annual members and casual riders use Cyclistic bikes differently.
KEY STAKEHOLDERS:
- Lily Moreno: Manager, Director of Marketing
- The Cyclistic Executive Team
Guiding Questions:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
PREPARE
Data Accessibility, Privacy, & Security: This is an open dataset that can be found here. It is first-party, public data that excludes the rider's identifiable information such as name, address, age, gender, and financial information.
Data Credibility:
I will be using 12 datasets from the Cyclistic's historial trip database containing single month information from September 2021 - August 2022. The categorical data will help me analyze how the casual riders and annual members use the service differently. I will find patterns to target the geolocations of digital media campaigns and convert casual riders to annual members.
Data Reliability (ROCCC):
Reliable - this is first-party data straight from Cyclistic's historical data repository.
Original - this dataset was gathered directly from the company, Cyclistic.
Comprehensive - includes all ride length information and type use along with geographical information. Does not include personal identification information.
Current - dataset combines the past 12 months, with August 2022 as the most recent.
Cited - link available to dataset here.
Due to the large size of the data, I will be using R for my analysis.
1. LOAD LIBRARIES
We will be loading the following libraries using the library() function.
library(tidyverse) #to connect data and create visualizations
library(lubridate) #for date and time functions
library(ggplot2) #for data visualizations
library(hms) #for date and time functions
library(readr) #to import .csv files
library(dplyr) #for data manipulation
library(janitor) #to clean dirty data
library(tidyr) #to wrangle and tidy data
2. IMPORT DATA
I will be using the read_csv() function from the readr package to assign each file as mmm_yyyy using the <- operator.
sep_2021 <- read_csv("../input/bikesharedataset//202109-divvy-tripdata/202109-divvy-tripdata.csv")
oct_2021 <- read_csv("../input/bikesharedataset//202110-divvy-tripdata/202110-divvy-tripdata.csv")
nov_2021 <- read_csv("../input/bikesharedataset//202111-divvy-tripdata/202111-divvy-tripdata.csv")
dec_2021 <- read_csv("../input/bikesharedataset//202112-divvy-tripdata/202112-divvy-tripdata.csv")
jan_2022 <- read_csv("../input/bikesharedataset//202201-divvy-tripdata/202201-divvy-tripdata.csv")
feb_2022 <- read_csv("../input/bikesharedataset//202202-divvy-tripdata/202202-divvy-tripdata.csv")
mar_2022 <- read_csv("../input/bikesharedataset//202203-divvy-tripdata/202203-divvy-tripdata.csv")
apr_2022 <- read_csv("../input/bikesharedataset//202204-divvy-tripdata/202204-divvy-tripdata.csv")
may_2022 <- read_csv("../input/bikesharedataset//202205-divvy-tripdata/202205-divvy-tripdata.csv")
jun_2022 <- read_csv("../input/bikesharedataset//202206-divvy-tripdata/202206-divvy-tripdata.csv")
jul_2022 <- read_csv("../input/bikesharedataset//202207-divvy-tripdata/202207-divvy-tripdata.csv")
aug_2022 <- read_csv("../input/bikesharedataset//202208-divvy-tripdata/202208-divvy-tripdata.csv")
3. INSPECT COLUMN NAMES
I will use colnames() to inspect consistency in our dataframes.
colnames(sep_2021)
colnames(oct_2021)
colnames(nov_2021)
colnames(dec_2021)
colnames(jan_2022)
colnames(feb_2022)
colnames(mar_2022)
colnames(apr_2022)
colnames(may_2022)
colnames(jun_2022)
colnames(jul_2022)
colnames(aug_2022)
4. INSPECT DATA STRUCTURE
I will use the str() function to check the consistency in structure and datatypes among the datasets.
colnames(sep_2021)
colnames(oct_2021)
colnames(nov_2021)
colnames(dec_2021)
colnames(jan_2022)
colnames(feb_2022)
colnames(mar_2022)
colnames(apr_2022)
colnames(may_2022)
colnames(jun_2022)
colnames(jul_2022)
colnames(aug_2022)
5. COMBINING DATASETS
I will combine the 12 datasets to create 1 new dataframe called all_trips using rbind().
all_trips <- rbind(sep_2021, oct_2021, nov_2021, dec_2021, jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022)
str(all_trips)
6. REMOVING DUPLICATES
Since I must keep the raw data, I will create a new dataframe, all_trips2, for manipulation starting with !duplicated() to remove any duplicate ride_id observations.
all_trips2 <- all_trips[!duplicated(all_trips$ride_id), ]
7. ADDRESS MISSING VALUES
I noticed multiple NA values in the started_at and ended_at fields, and there could be more in any data field! Since we need a complete, clean dataset, we will remove any missing data using na.omit().
all_trips2 <- na.omit(all_trips2)
PROCESS
8. CREATE COLUMNS
I will create three attributes using the mutate() and the $ operator to further analyze my data:
ride_length: I will use the as.numeric function to output difftime() between ended_at and stared_at as a numerical value instead of a character. The difference in time will be in seconds so I will divide it by 60 to transform the ride_length to minutes.
week_day: I will use the weekdays() function to extract the day of week the ride started_at.
month_year: I will use the months() function to extract the month in the year the ride started_at.
all_trips2 <- all_trips2$ride_length <- as.double(difftime(all_trips2$ended_at
,all_trips2$started_at))
all_trips2$ride_length <- all_trips2$ride_length/60
all_trips2 <- all_trips2 %>%
mutate(week_day = weekdays(started_at)
,month_year = months(started_at))
9. ORDER THE VARIABLES
Right now, the week_day and month_year variables are characters. We need to give them an order so they plot in chronological order instead of alphabetical order. I will use factor() and create a vector for the levels of days and months.
# order month_year variables
all_trips2 <- all_trips2 %>%
mutate(month_year = factor(month_year, levels = c("September"
,"October"
,"November"
,"December"
,"January"
,"February"
,"March"
,"April"
,"May"
,"June"
,"July"
,"August")))
# order week_day variables
all_trips2 <- all_trips2 %>%
mutate(week_day = factor(week_day, levels = c("Monday"
,"Tuesday"
,"Wednesday"
,"Thursday"
,"Friday"
,"Saturday"
,"Sunday")))
10. REMOVE NEGATIVE RIDES LENGTHS
I will use the count() function to find the number of rides with a negative value and filter() to remove those observations from our dataset.
count(all_trips2, ride_length < 0)
ride_length < 0 n
<lgl> <int>
FALSE 4560047
TRUE 99
We find that there are 99 trips with a ride duration of less than 0 minutes. Let's remove those trips:
all_trips2 <- all_trips2 %>%
filter(ride_length>0)
ANALYZE & SHARE
1. ANALYZING TOTAL RIDES BY CUSTOMER
First, let's count the total number of trips for annual members and casual riders, and see how they compare.
I will use group_by() to separate casual_riders and annual_members in the summaries. To find the percentage of each group, pct_total, I will find the length() of the ride_id vectors, and divide by the number of rows of the object using nrow(). Multiply the quotient by 100 to turn it into a percentage.
We will use ggplot() to visualize the number of rides in a bar graph using geom_bar(). With aes() I can modify the aesthetics in the graph such as the variables and use fill to differentiate the two rider types. I added the scale_y_continuous() to scale the y-axis in a readable format using commas since the automatic output was in scientific notation. I used labs() to create accurate labels for the graph's attributes and title. I will include show.legend=FALSE to hide the legend since we have the rider types labeled on the x-axis already.
all_trips2 %>%
group_by(member_casual) %>%
summarise(count=length(ride_id)
,pct_total=(length(ride_id)/nrow(all_trips2))*100)
# plot number of annual members and casual riders
ggplot(all_trips2) +
geom_bar(mapping=aes(member_casual, fill=member_casual), show.legend=FALSE) +
labs(title = "Casual Riders v. Annual Members", x="Rider Type"
, y="Count", fill="Rider Type") +
scale_y_continuous(labels = scales::comma)
Here we discover that 59% of users over the last 12 months are members and 41% are casual riders. Not a bad ratio!
Casual Riders = 1,877,864 (41% total rides)
Annual Members = 2,681,991 (59% total rides)
TOTAL RIDES = 4,559,855
RIDES BY MONTH
Let's look at the num_of_rides per month using month_year in our arrange() function. I will use position_dodge so that the bars do not overlap and set the width of each bar to 0.6. Since the x-variable names are so long, they overlap on the graph. I will angle the value names using theme().
all_trips2 %>%
group_by(member_casual, month_year) %>%
summarise(num_rides = n(),
.groups = "drop") %>%
arrange(month_year, member_casual) %>%
ggplot(aes(x = month_year, y = num_rides
,fill = member_casual)) +
geom_col(width = 0.6,
position=position_dodge(width=0.6)) +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x=element_text(angle=45)) +
labs(title = "Rides Per Month",
x="Month (Sep 2021 - Aug 2022)"
,y="Number of Rides", fill="Rider Type")
The data shows that both casual and annual members ride more in the months of June through September. This could be due to conditions such as nice weather in the summer and harsh weather conditions in the winter. Overall, the annual members took more rides each month, which makes sense!
RIDES BY DAY OF WEEK
Let's compare rider type to the day of the week by arranging our visual by week_day.
all_trips2 %>%
group_by(member_casual, week_day) %>%
summarise(num_rides = n(), .groups="drop") %>%
arrange(week_day) %>%
ggplot(aes(x = week_day, y=num_rides
,fill=member_casual)) +
geom_col(width=0.7, position = position_dodge(width=0.8)) +
theme(axis.text.x = element_text(angle = 35)) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Rides v. Day of Week", x="Week Day"
,y="Number of Rides",fill="Rider Type")
RIDES BY DAY OF AVERAGE MONTH
Let's compare the num_rides by day in the average month. For this, I will extract the day of the month using mday() and combine geom_line() and geom_point to connect the values in a line graph. Since x and y have high ranges, I will use seq() to add breaks every 10,000 rides to make it easier to read. I will use label_number_si() to automatically apply the necessary SI prefix. Since there are a maximum of 31 days in a month, I will scale_x_continuous() from 1 to 31.
all_trips2 %>%
mutate(day=mday(started_at)) %>%
group_by(member_casual, day) %>%
summarise(num_rides = n(), .groups="drop") %>%
ggplot(aes(day, num_rides, color=member_casual)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = scales::label_number_si()
,breaks = seq(0,100000,10000)) +
scale_x_continuous(breaks = seq(1,31,2)) +
labs(title="Number of Rides by Day of Month"
,x="Day of Month",y="Number of Rides"
,color="Rider Type")
RIDES EACH MONTH
Let's compare the num_rides each day over all 12 months using facet_wrap(~). This will create a graph for each separate month_year starting in September 2021 and ending August 2022. We will group_by() all member_casual, day and month_year To separate all attributes of our visual.
all_trips2 %>%
mutate(day=mday(started_at)) %>%
group_by(member_casual, day, month_year) %>%
summarise(num_rides = n(), .groups="drop") %>%
ggplot(aes(day, num_rides, color=member_casual)) +
geom_line() +
scale_y_continuous(labels = scales::label_number_si()
,breaks = seq(0,20000,5000)) +
scale_x_continuous(breaks = seq(1, 31, 6)) +
labs(title="Number of Rides Per Day (Sep 2021 - Aug 2022)"
,x="Day of Month",y="Number of Rides"
,color="Rider Type") +
facet_wrap(~month_year)
2. ANALYZING AVERAGE RIDE LENGTH
Let's start by analyzing the average ride_length for each rider type using the mean() function. I will find the standard deviation using sd() to analyze the consistency in rider patterns.
We will extract data per day using mday() and see how the average ride_length varies from the beginning to end of the average month.
all_trips2 %>%
mutate(day = mday(started_at)) %>%
group_by(member_casual, day) %>%
summarise(mean = mean(ride_length),
.groups = "drop") %>%
ggplot(aes(day, mean, color = member_casual)) +
geom_line() +
geom_point() +
scale_y_continuous(breaks = seq(0, 28, 2)) +
scale_x_continuous(breaks = seq(1, 31, 2)) +
labs(title = "Avg Ride Length by Day", x = "Day of Month"
,y = "Average (min)", color="Rider Type")
all_trips2 %>%
group_by(member_casual) %>%
summarise(mean = mean(ride_length), sd = sd(ride_length),
.groups="drop")
Annual Members
Avg Ride Length: 12.5 minutes
Casual Riders
Avg Ride Length: 25.8 minutes
AVERAGE LENGTH BY HOUR OF DAY
Let's break it down and see the average ride_length by the hour of a day. 0 will represent the beginning of the day, and it will go up to hour 24, the end of the day. I will use mean() for each ride_length, and hour() to extract the hour in started_at.
all_trips2 %>%
mutate(hour = hour(started_at)) %>%
group_by(member_casual, hour) %>%
summarise(mean = mean(ride_length),
.groups = "drop") %>%
ggplot(aes(hour, mean, color = member_casual)) +
geom_line() +
geom_point() +
scale_y_continuous(breaks = seq(0, 30, 2)) +
scale_x_continuous(breaks = seq(0, 23, 2)) +
labs(title = "Avg Ride Length by Hour"
,x = "Hour of Day", y = "Average Length (min)"
,color="Rider Type")
3. ANALYZING BIKE TYPES
We will now analyze how each rider type compares to the type of bike used over the past year. I will use group_by to filter the summary for rideable_type, along with summarize() and n() to count the number of individual rides per rider category. I will then use ggplot() and geom_col() to create a visual representation of the rideable_type usage. Since the range of rides goes from 0 ~ 3,000,000, I will use breaks of 500,000 on the y-axis for a clearer visual.
all_trips2 %>%
group_by(rideable_type) %>%
summarise(num_rides = n())
all_trips2 %>%
group_by(rideable_type) %>%
summarise(num_rides = n()) %>%
ggplot(aes(rideable_type, num_rides
,fill = rideable_type)) +
geom_col(position = "dodge", show.legend = FALSE) +
scale_y_continuous(labels = scales::label_number_si(),
breaks = seq(0, 3000000, 500000)) +
labs(title = "Number of Bike Types Used"
,x = "Bike Type", y = "Number of Rides")
Classic Bike = 2,890,611 (63%)
Docked Bikes = 205,766 (5%)
Electric Bikes = 1,463,428 (32%)
ANALYZING BIKE TYPES BY CUSTOMER
Let's see if there is a difference in rideable_type favor between casual riders and annual members. I will do this by using group() for both member_casual and rideable_type. We will dodge the position so no bars overlap. Since the num_rides has such a high value we will use label_numer_si() again to give the appropriate SI prefix and use breaks to divide the tick marks into ever 500,000 rides.
all_trips2 %>%
group_by(member_casual, rideable_type) %>%
summarise(num_rides = n(), .groups="drop") %>%
ggplot(aes(member_casual, num_rides, fill=rideable_type)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::label_number_si()
,breaks = seq(0, 2000000, 500000)) +
labs(title = "Bike Type Usage by Customer"
,x = "Member Type", y = "Numer of Rides"
,fill = "Bike Type")
This graph tells us that the classic_bike was used most often among both member and casual riders. There were no docked_bike used among the annual members this past year.
4. ANALYZING MOST USED STATIONS
We will now decipher the most used bike stations to target our audience geographically. First, we will find out how many Cyclistic stations we are dealing with. I will create a the table stations and transform our data using gather() to collect a key set of columns, start_station_name and end_station_name to create column station_name. I will use distinct() to sum up similar station attributes together, and count the total_station with n().
To format the table, I will use arrange() to desc() the order of total num_rides.
all_trips2 %>%
gather(key, station_name, start_station_name, end_station_name) %>%
distinct(station_name) %>%
summarise(total_stations = n())
all_trips2 %>%
gather(key, station_name, start_station_name, end_station_name) %>%
group_by(station_name) %>%
summarise(num_rides = n()) %>%
arrange(desc(num_rides)) %>%
slice(1:10)
ANALYZING MOST USED STATIONS BY CUSTOMER
Let's dig deeper and see which stations were visited most by rider type. To do this, I will use gather() to create a column, station_name that combines start_station_name and end_station_name. I will then create a vector, station, using unite() to combine member_casual and station_name. I will use arrange() to set the list in descending order by num_rides. I will only get the top 10 stations by using slice(1:10). I will use scale_y_discrete() this time since our y values, and the station_name are discrete variables, not continuous. I don't want the casual and member titles tagged with the station_name, so I will use sub() to remove them. I will use str_trunc() to truncate the strings and position them, and then limits=reverse argument to reverse the order on the categorical axis.
all_trips2 %>%
gather(key, station_name, start_station_name, end_station_name) %>%
group_by(member_casual, station_name) %>%
summarise(num_rides = n(), .groups="drop_last") %>%
arrange(desc(num_rides)) %>%
slice(1:10)
KEY FINDINGS
1. How do annual members and casual riders use Cyclistic bikes differently?
Although annual members account for more rides over the past year, the casual riders average over double their average ride length!
Annual members have a higher ride count during Monday-Friday, which could be due to using the bike-share service to get to and from work.
Casual riders use the ride share service more on Saturday and Sunday.
Higher ride volumes are in June, July, August, and September.
The classic bike type is the most used among all customers. The electric bike was also popular among casual riders.
2. Why would casual riders buy Cyclistic annual memberships?
Casual riders would benefit from becoming annual members to save money on their long bike rides and get additional benefits for riding so often on the weekend.
They could also start using the bike share service to go to and from work.
3. How can Cyclistic use digital media to influence casual riders to become members?
Cyclistic can use digital media to geographically target the most popular stations casual riders utilize and run a campaign promoting benefits for ride lengths and weekends.
The social media ad campaign should be ran between July and September and target a radius around the top visited stations by casual riders.
ACT
In order to convert casual riders to annual members, I have come up with the following 3 solutions:
Create a social media campaign targeting the locations of the most visited stations during any of the peak months: June, July, August, and September.
Launch a mobile app that allows users to create a profile with identifiable information. They can reserve bikes, estimate trip costs, and pay for their rides. Offer a "New Member" special for downloading the app.
Create and promote an Annual Membership Perks Program targeted at casual rider patterns, such as cheaper ride prices on the weekends. The app can include many perks too for members only.
BUSINESS PROPOSAL PRESENTATION