Cyclistic Bike-Share project

Introduction: This project is part of the Google Data Analytics Professional Certification Program. I will analyze the riding patterns between causal riders and annual members using the 6 phases of the data analysis process: Ask, Prepare, Process, Analyze, Share, and act, to give insight on 3 strategies to convert casual riders to members. Overall I found that, on average, casual riders have double the ride duration, and ride more often on the weekends in the months June-September. The classic bike type is most popular with the busiest station at Streeter Dr. & Grand Ave.

scenario: You are a junior data analyst working in the marketing analyst team at cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently.

The complete analysis can be found in my Kaggle Notebook HERE.

ASK

BUSINESS TASK: Design marketing strategies aimed at converting casual riders into annual members by finding out how annual members and casual riders use Cyclistic bikes differently.

KEY STAKEHOLDERS:
- Lily Moreno: Manager, Director of Marketing
- The Cyclistic Executive Team

Guiding Questions:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

PREPARE

Data Accessibility, Privacy, & Security: This is an open dataset that can be found here. It is first-party, public data that excludes the rider's identifiable information such as name, address, age, gender, and financial information.

Data Credibility:
I will be using 12 datasets from the Cyclistic's historial trip database containing single month information from September 2021 - August 2022. The categorical data will help me analyze how the casual riders and annual members use the service differently. I will find patterns to target the geolocations of digital media campaigns and convert casual riders to annual members.

Data Reliability (ROCCC):

Reliable - this is first-party data straight from Cyclistic's historical data repository.

Original - this dataset was gathered directly from the company, Cyclistic.

Comprehensive - includes all ride length information and type use along with geographical information. Does not include personal identification information.

Current - dataset combines the past 12 months, with August 2022 as the most recent.

Cited - link available to dataset here.

Due to the large size of the data, I will be using R for my analysis.

CLEANING THE DATA

1. LOAD LIBRARIES
We will be loading the following libraries using the library() function.

library(tidyverse) #to connect data and create visualizations

library(lubridate) #for date and time functions

library(ggplot2) #for data visualizations

library(hms) #for date and time functions

library(readr) #to import .csv files

library(dplyr) #for data manipulation

library(janitor) #to clean dirty data

library(tidyr) #to wrangle and tidy data

2. IMPORT DATA
I will be using the read_csv() function from the readr package to assign each file as mmm_yyyy using the <- operator.

sep_2021 <- read_csv("../input/bikesharedataset//202109-divvy-tripdata/202109-divvy-tripdata.csv")

oct_2021 <- read_csv("../input/bikesharedataset//202110-divvy-tripdata/202110-divvy-tripdata.csv")

nov_2021 <- read_csv("../input/bikesharedataset//202111-divvy-tripdata/202111-divvy-tripdata.csv")

dec_2021 <- read_csv("../input/bikesharedataset//202112-divvy-tripdata/202112-divvy-tripdata.csv")

jan_2022 <- read_csv("../input/bikesharedataset//202201-divvy-tripdata/202201-divvy-tripdata.csv")

feb_2022 <- read_csv("../input/bikesharedataset//202202-divvy-tripdata/202202-divvy-tripdata.csv")

mar_2022 <- read_csv("../input/bikesharedataset//202203-divvy-tripdata/202203-divvy-tripdata.csv")

apr_2022 <- read_csv("../input/bikesharedataset//202204-divvy-tripdata/202204-divvy-tripdata.csv")

may_2022 <- read_csv("../input/bikesharedataset//202205-divvy-tripdata/202205-divvy-tripdata.csv")

jun_2022 <- read_csv("../input/bikesharedataset//202206-divvy-tripdata/202206-divvy-tripdata.csv")

jul_2022 <- read_csv("../input/bikesharedataset//202207-divvy-tripdata/202207-divvy-tripdata.csv")

aug_2022 <- read_csv("../input/bikesharedataset//202208-divvy-tripdata/202208-divvy-tripdata.csv")

3. INSPECT COLUMN NAMES
I will use colnames() to inspect consistency in our dataframes.

colnames(sep_2021)

colnames(oct_2021)

colnames(nov_2021)

colnames(dec_2021)

colnames(jan_2022)

colnames(feb_2022)

colnames(mar_2022)

colnames(apr_2022)

colnames(may_2022)

colnames(jun_2022)

colnames(jul_2022)

colnames(aug_2022)

4. INSPECT DATA STRUCTURE

I will use the str() function to check the consistency in structure and datatypes among the datasets.

colnames(sep_2021)

colnames(oct_2021)

colnames(nov_2021)

colnames(dec_2021)

colnames(jan_2022)

colnames(feb_2022)

colnames(mar_2022)

colnames(apr_2022)

colnames(may_2022)

colnames(jun_2022)

colnames(jul_2022)

colnames(aug_2022)

5. COMBINING DATASETS

I will combine the 12 datasets to create 1 new dataframe called all_trips using rbind().

all_trips <- rbind(sep_2021, oct_2021, nov_2021, dec_2021, jan_2022, feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, aug_2022)

str(all_trips)

6. REMOVING DUPLICATES

Since I must keep the raw data, I will create a new dataframe, all_trips2, for manipulation starting with !duplicated() to remove any duplicate ride_id observations.

all_trips2 <- all_trips[!duplicated(all_trips$ride_id), ]

7. ADDRESS MISSING VALUES

I noticed multiple NA values in the started_at and ended_at fields, and there could be more in any data field! Since we need a complete, clean dataset, we will remove any missing data using na.omit().

all_trips2 <- na.omit(all_trips2)

PROCESS

8. CREATE COLUMNS

I will create three attributes using the mutate() and the $ operator to further analyze my data:

ride_length: I will use the as.numeric function to output difftime() between ended_at and stared_at as a numerical value instead of a character. The difference in time will be in seconds so I will divide it by 60 to transform the ride_length to minutes.
week_day: I will use the weekdays() function to extract the day of week the ride started_at.
month_year: I will use the months() function to extract the month in the year the ride started_at.

all_trips2 <- all_trips2$ride_length <- as.double(difftime(all_trips2$ended_at

,all_trips2$started_at))

all_trips2$ride_length <- all_trips2$ride_length/60

all_trips2 <- all_trips2 %>%

mutate(week_day = weekdays(started_at)

,month_year = months(started_at))

9. ORDER THE VARIABLES

Right now, the week_day and month_year variables are characters. We need to give them an order so they plot in chronological order instead of alphabetical order. I will use factor() and create a vector for the levels of days and months.

# order month_year variables

all_trips2 <- all_trips2 %>%

mutate(month_year = factor(month_year, levels = c("September"

,"October"

,"November"

,"December"

,"January"

,"February"

,"March"

,"April"

,"May"

,"June"

,"July"

,"August")))

# order week_day variables

all_trips2 <- all_trips2 %>%

mutate(week_day = factor(week_day, levels = c("Monday"

,"Tuesday"

,"Wednesday"

,"Thursday"

,"Friday"

,"Saturday"

,"Sunday")))

10. REMOVE NEGATIVE RIDES LENGTHS

I will use the count() function to find the number of rides with a negative value and filter() to remove those observations from our dataset.

count(all_trips2, ride_length < 0)

ride_length < 0 n

<lgl> <int>

FALSE 4560047

TRUE 99

We find that there are 99 trips with a ride duration of less than 0 minutes. Let's remove those trips:

all_trips2 <- all_trips2 %>%

filter(ride_length>0)

ANALYZE & SHARE

1. ANALYZING TOTAL RIDES BY CUSTOMER

First, let's count the total number of trips for annual members and casual riders, and see how they compare.

I will use group_by() to separate casual_riders and annual_members in the summaries. To find the percentage of each group, pct_total, I will find the length() of the ride_id vectors, and divide by the number of rows of the object using nrow(). Multiply the quotient by 100 to turn it into a percentage.

We will use ggplot() to visualize the number of rides in a bar graph using geom_bar(). With aes() I can modify the aesthetics in the graph such as the variables and use fill to differentiate the two rider types. I added the scale_y_continuous() to scale the y-axis in a readable format using commas since the automatic output was in scientific notation. I used labs() to create accurate labels for the graph's attributes and title. I will include show.legend=FALSE to hide the legend since we have the rider types labeled on the x-axis already.

all_trips2 %>%

group_by(member_casual) %>%

summarise(count=length(ride_id)

,pct_total=(length(ride_id)/nrow(all_trips2))*100)

# plot number of annual members and casual riders

ggplot(all_trips2) +

geom_bar(mapping=aes(member_casual, fill=member_casual), show.legend=FALSE) +

labs(title = "Casual Riders v. Annual Members", x="Rider Type"

, y="Count", fill="Rider Type") +

scale_y_continuous(labels = scales::comma)

Here we discover that 59% of users over the last 12 months are members and 41% are casual riders. Not a bad ratio!

TOTAL RIDES (SEPT 2021 - AUG 2022)

Casual Riders = 1,877,864 (41% total rides)

Annual Members = 2,681,991 (59% total rides)

TOTAL RIDES = 4,559,855

RIDES BY MONTH

Let's look at the num_of_rides per month using month_year in our arrange() function. I will use position_dodge so that the bars do not overlap and set the width of each bar to 0.6. Since the x-variable names are so long, they overlap on the graph. I will angle the value names using theme().

all_trips2 %>%

group_by(member_casual, month_year) %>%

summarise(num_rides = n(),

.groups = "drop") %>%

arrange(month_year, member_casual) %>%

ggplot(aes(x = month_year, y = num_rides

,fill = member_casual)) +

geom_col(width = 0.6,

position=position_dodge(width=0.6)) +

scale_y_continuous(labels = scales::comma) +

theme(axis.text.x=element_text(angle=45)) +

labs(title = "Rides Per Month",

x="Month (Sep 2021 - Aug 2022)"

,y="Number of Rides", fill="Rider Type")

The data shows that both casual and annual members ride more in the months of June through September. This could be due to conditions such as nice weather in the summer and harsh weather conditions in the winter. Overall, the annual members took more rides each month, which makes sense!

RIDES BY DAY OF WEEK

Let's compare rider type to the day of the week by arranging our visual by week_day.

all_trips2 %>%

group_by(member_casual, week_day) %>%

summarise(num_rides = n(), .groups="drop") %>%

arrange(week_day) %>%

ggplot(aes(x = week_day, y=num_rides

,fill=member_casual)) +

geom_col(width=0.7, position = position_dodge(width=0.8)) +

theme(axis.text.x = element_text(angle = 35)) +

scale_y_continuous(labels = scales::comma) +

labs(title = "Rides v. Day of Week", x="Week Day"

,y="Number of Rides",fill="Rider Type")

This visual tells us that annual members use the ride share service more often during the week, Monday through Friday, while casual riders use the service more than members on Saturdays and Sundays. This could be due to annual members using the ride share service for a normal Monday-Friday work week.

RIDES BY DAY OF AVERAGE MONTH

Let's compare the num_rides by day in the average month. For this, I will extract the day of the month using mday() and combine geom_line() and geom_point to connect the values in a line graph. Since x and y have high ranges, I will use seq() to add breaks every 10,000 rides to make it easier to read. I will use label_number_si() to automatically apply the necessary SI prefix. Since there are a maximum of 31 days in a month, I will scale_x_continuous() from 1 to 31.

all_trips2 %>%

mutate(day=mday(started_at)) %>%

group_by(member_casual, day) %>%

summarise(num_rides = n(), .groups="drop") %>%

ggplot(aes(day, num_rides, color=member_casual)) +

geom_line() +

geom_point() +

scale_y_continuous(labels = scales::label_number_si()

,breaks = seq(0,100000,10000)) +

scale_x_continuous(breaks = seq(1,31,2)) +

labs(title="Number of Rides by Day of Month"

,x="Day of Month",y="Number of Rides"

,color="Rider Type")

The data reinforces that annual members take more rides per day than casual riders. Both casual riders and annual members have a significant decrease and have the lowest num_rides() per day on the 31st of each month. This data could be skewed due to some months not having 31 days, automatically inputting "0 rides" for that day.

RIDES EACH MONTH

Let's compare the num_rides each day over all 12 months using facet_wrap(~). This will create a graph for each separate month_year starting in September 2021 and ending August 2022. We will group_by() all member_casual, day and month_year To separate all attributes of our visual.

all_trips2 %>%

mutate(day=mday(started_at)) %>%

group_by(member_casual, day, month_year) %>%

summarise(num_rides = n(), .groups="drop") %>%

ggplot(aes(day, num_rides, color=member_casual)) +

geom_line() +

scale_y_continuous(labels = scales::label_number_si()

,breaks = seq(0,20000,5000)) +

scale_x_continuous(breaks = seq(1, 31, 6)) +

labs(title="Number of Rides Per Day (Sep 2021 - Aug 2022)"

,x="Day of Month",y="Number of Rides"

,color="Rider Type") +

facet_wrap(~month_year)

This data shows that although annual members, on average, take more rides daily than casual riders, there are times where casual riders will utilize the ride share service more often than members! It is also reinforced that peak months are June, July, August, and September.

2. ANALYZING AVERAGE RIDE LENGTH

Let's start by analyzing the average ride_length for each rider type using the mean() function. I will find the standard deviation using sd() to analyze the consistency in rider patterns.

We will extract data per day using mday() and see how the average ride_length varies from the beginning to end of the average month.

all_trips2 %>%

mutate(day = mday(started_at)) %>%

group_by(member_casual, day) %>%

summarise(mean = mean(ride_length),

.groups = "drop") %>%

ggplot(aes(day, mean, color = member_casual)) +

geom_line() +

geom_point() +

scale_y_continuous(breaks = seq(0, 28, 2)) +

scale_x_continuous(breaks = seq(1, 31, 2)) +

labs(title = "Avg Ride Length by Day", x = "Day of Month"

,y = "Average (min)", color="Rider Type")

all_trips2 %>%

group_by(member_casual) %>%

summarise(mean = mean(ride_length), sd = sd(ride_length),

.groups="drop")

Annual Members
Avg Ride Length: 12.5 minutes

Casual Riders
Avg Ride Length: 25.8 minutes

AVERAGE LENGTH PER DAY OF MONTH

This data shows the average ride_length() for casual riders is over double that of the annual riders. The high sd() for casual riders, about 115 minutes, shows no consistency in ride_length(). However, the sd for annual members averages about 18 minutes, showing more consistency in their ride patterns.

AVERAGE LENGTH BY HOUR OF DAY

Let's break it down and see the average ride_length by the hour of a day. 0 will represent the beginning of the day, and it will go up to hour 24, the end of the day. I will use mean() for each ride_length, and hour() to extract the hour in started_at.

all_trips2 %>%

mutate(hour = hour(started_at)) %>%

group_by(member_casual, hour) %>%

summarise(mean = mean(ride_length),

.groups = "drop") %>%

ggplot(aes(hour, mean, color = member_casual)) +

geom_line() +

geom_point() +

scale_y_continuous(breaks = seq(0, 30, 2)) +

scale_x_continuous(breaks = seq(0, 23, 2)) +

labs(title = "Avg Ride Length by Hour"

,x = "Hour of Day", y = "Average Length (min)"

,color="Rider Type")

This data shows that on average, casual riders hold a higher ride_length() over the annual members all day, peaking from about 10:00AM to 2:00PM. Both parties have a lower ride_length() between 5:00AM to 7:00AM.

3. ANALYZING BIKE TYPES

We will now analyze how each rider type compares to the type of bike used over the past year. I will use group_by to filter the summary for rideable_type, along with summarize() and n() to count the number of individual rides per rider category. I will then use ggplot() and geom_col() to create a visual representation of the rideable_type usage. Since the range of rides goes from 0 ~ 3,000,000, I will use breaks of 500,000 on the y-axis for a clearer visual.

all_trips2 %>%

group_by(rideable_type) %>%

summarise(num_rides = n())

all_trips2 %>%

group_by(rideable_type) %>%

summarise(num_rides = n()) %>%

ggplot(aes(rideable_type, num_rides

,fill = rideable_type)) +

geom_col(position = "dodge", show.legend = FALSE) +

scale_y_continuous(labels = scales::label_number_si(),

breaks = seq(0, 3000000, 500000)) +

labs(title = "Number of Bike Types Used"

,x = "Bike Type", y = "Number of Rides")

TOTAL BIKE TYPE USAGE

Classic Bike = 2,890,611 (63%)

Docked Bikes = 205,766 (5%)

Electric Bikes = 1,463,428 (32%)

ANALYZING BIKE TYPES BY CUSTOMER

Let's see if there is a difference in rideable_type favor between casual riders and annual members. I will do this by using group() for both member_casual and rideable_type. We will dodge the position so no bars overlap. Since the num_rides has such a high value we will use label_numer_si() again to give the appropriate SI prefix and use breaks to divide the tick marks into ever 500,000 rides.

all_trips2 %>%

group_by(member_casual, rideable_type) %>%

summarise(num_rides = n(), .groups="drop") %>%

ggplot(aes(member_casual, num_rides, fill=rideable_type)) +

geom_col(position = "dodge") +

scale_y_continuous(labels = scales::label_number_si()

,breaks = seq(0, 2000000, 500000)) +

labs(title = "Bike Type Usage by Customer"

,x = "Member Type", y = "Numer of Rides"

,fill = "Bike Type")

BIKE TYPE USAGE BY CUSTOMER

This graph tells us that the classic_bike was used most often among both member and casual riders. There were no docked_bike used among the annual members this past year.

4. ANALYZING MOST USED STATIONS

We will now decipher the most used bike stations to target our audience geographically. First, we will find out how many Cyclistic stations we are dealing with. I will create a the table stations and transform our data using gather() to collect a key set of columns, start_station_name and end_station_name to create column station_name. I will use distinct() to sum up similar station attributes together, and count the total_station with n().

To format the table, I will use arrange() to desc() the order of total num_rides.

all_trips2 %>%

gather(key, station_name, start_station_name, end_station_name) %>%

distinct(station_name) %>%

summarise(total_stations = n())

all_trips2 %>%

gather(key, station_name, start_station_name, end_station_name) %>%

group_by(station_name) %>%

summarise(num_rides = n()) %>%

arrange(desc(num_rides)) %>%

slice(1:10)

TOP 10 STATIONS OVERALL

Streeter Dr & Grand Ave - 150,835 starting rides
DuSable Lake Shore Dr & North Blvd - 84.907 starting rides
DuSable Lake Shore Dr & Monroe St - 80,230 starting rides
Michigan Ave & Oak St - 79,394 starting rides
Wells St & Concord Ln - 76,640 starting rides
Millennium Park - 71,900 starting rides
Clark St & Elm St - 70,552 starting rides
Theater on the Lake - 66,668 starting rides
Kingsbury St & Kinzie St - 66,182 starting rides
Wells St & Elm St - 64,124 starting rides

ANALYZING MOST USED STATIONS BY CUSTOMER

Let's dig deeper and see which stations were visited most by rider type. To do this, I will use gather() to create a column, station_name that combines start_station_name and end_station_name. I will then create a vector, station, using unite() to combine member_casual and station_name. I will use arrange() to set the list in descending order by num_rides. I will only get the top 10 stations by using slice(1:10). I will use scale_y_discrete() this time since our y values, and the station_name are discrete variables, not continuous. I don't want the casual and member titles tagged with the station_name, so I will use sub() to remove them. I will use str_trunc() to truncate the strings and position them, and then limits=reverse argument to reverse the order on the categorical axis.

all_trips2 %>%

gather(key, station_name, start_station_name, end_station_name) %>%

group_by(member_casual, station_name) %>%

summarise(num_rides = n(), .groups="drop_last") %>%

arrange(desc(num_rides)) %>%

slice(1:10)

KEY FINDINGS

Guided Questions

1. How do annual members and casual riders use Cyclistic bikes differently?

Although annual members account for more rides over the past year, the casual riders average over double their average ride length!
Annual members have a higher ride count during Monday-Friday, which could be due to using the bike-share service to get to and from work.
Casual riders use the ride share service more on Saturday and Sunday.
Higher ride volumes are in June, July, August, and September.
The classic bike type is the most used among all customers. The electric bike was also popular among casual riders.

2. Why would casual riders buy Cyclistic annual memberships?

Casual riders would benefit from becoming annual members to save money on their long bike rides and get additional benefits for riding so often on the weekend.
They could also start using the bike share service to go to and from work.

3. How can Cyclistic use digital media to influence casual riders to become members?

Cyclistic can use digital media to geographically target the most popular stations casual riders utilize and run a campaign promoting benefits for ride lengths and weekends.
The social media ad campaign should be ran between July and September and target a radius around the top visited stations by casual riders.

ACT

In order to convert casual riders to annual members, I have come up with the following 3 solutions:

Create a social media campaign targeting the locations of the most visited stations during any of the peak months: June, July, August, and September.
Launch a mobile app that allows users to create a profile with identifiable information. They can reserve bikes, estimate trip costs, and pay for their rides. Offer a "New Member" special for downloading the app.
Create and promote an Annual Membership Perks Program targeted at casual rider patterns, such as cheaper ride prices on the weekends. The app can include many perks too for members only.

BUSINESS PROPOSAL PRESENTATION

Page updated

Google Sites

Report abuse