Data Analysis Case Study

A scenario-driven case study, demonstrating usage of Excel and SQL

Scenario

Cyclistic is a fictional bike-share company in Chicago with over 5,800 bicycles and 600 docking stations across the city. The company offers multiple pricing plans: Single-ride passes, full-day passes, and annual memberships. Customers who purchase either single-ride or full-day passes shall be referred to as casual riders. Customers who purchase annual memberships shall be referred to as annual members. According to the Cyclistic finance analysis team, annual members are far more profitable than casual riders. The goal of this analysis is to design marketing strategies aimed towards converting casual riders into annual members. To do so, data must be analyzed regarding how casual riders and annual members differ in terms of bicycle usage.

About the Dataset

This case study will utilize a six-month period of Cyclistic historical bike trip data (April 2021 through September 2021) to identify trends. This period was chosen because a majority of bicycle usage occurs from mid-spring through early fall. From April-September 2021, over 3 million individual rides occurred. The first link below is the source of the dataset, the second link is the license under which the data has been made available by Motivate International, Inc.

Index of bucket "divvy-tripdata"

Data License Agreement | Divvy Bikes

Each row in the dataset represents an individual bike trip. The columns of data contain various attributes of each trip, including the following:

Start and end dates/times in UTC
Latitude and longitude coordinates of the starting and ending points of each trip (four columns, two for starting lat/lon, two for ending lat/lon)
Whether the rider is a casual rider or annual member (stored in "member_casual" column with either "casual" or "member" text string)

Data Limitations and Data Utilized

Before the data could be analyzed, some data limitations had to be considered. Firstly, only the starting and ending coordinates of each trip were available. Although the distance between start and end coordinates can be measured, this would not take into account riders that made round trips (starting and ending at the same docking station), or riders that biked a certain distance and then backtracked a partial distance during their trip. Therefore, the starting/ending trip coordinates were not used for this case study, as the distance covered by each bike trip is unknown. Secondly, start and end times were stored in text strings in the following format: "MM/DD/YY HH:MM UTC". To perform calculations on trip length, the data had to be converted into a date/time format, rather than a string format. Also, the "UTC" (referring to Coordinated Universal Time) had to be removed from each start/end time, as that text string would prevent time calculations from being made. Finally, several data rows contained one or more null values (missing values), which, if left unattended, would make performing calculations extremely difficult to impossible. Therefore, data cleaning had to be performed on the start/end times and null values before analysis could begin. Documentation of these cleaning steps will be addressed in the next section. The data this study will focus on will be the following:

Trip length (in minutes)
Day of the week that each trip occurred (Sunday, Monday, etc.)
Whether a trip was made by a casual rider or annual member

Data Processing: Cleaning and Transformation Methodologies

The decision was made to remove any rows of data containing null values from the dataset. This resulted in the removal of a few thousand data entries. However, with the overall sample size being over 3 million entries, (as well as the inability to track down the missing values) the loss of data was relatively minimal. SQL was utilized to identify any null values and delete those rows. There were six tables of data, one for each month. For each table, the following query (Delete_Null_Values) was executed to remove the null values:

DELETE FROM `casestudy-347018.TripData.2021_04_Trip_Data` /* change the table name to whichever table you want to delete entries from */

WHERE

ride_id IS NULL

OR rideable_type IS NULL

OR started_at IS NULL

OR ended_at IS NULL

OR start_station_name IS NULL

OR start_station_id IS NULL

OR end_station_name IS NULL

OR end_station_id IS NULL

OR start_lat IS NULL

OR start_lng IS NULL

OR end_lat IS NULL

OR end_lng IS NULL

OR member_casual IS NULL;

The following SQL query (Select_Null_Values) was run to verify that the Delete_Null_Values query successfully removed all rows containing null values. If Select_Null_Values returns no data, the deletion procedure was successful:

SELECT *

FROM `casestudy-347018.TripData.2021_04_Trip_Data` /* change the table name to whichever table you want to query */

WHERE

ride_id IS NULL

OR rideable_type IS NULL

OR started_at IS NULL

OR ended_at IS NULL

OR start_station_name IS NULL

OR start_station_id IS NULL

OR end_station_name IS NULL

OR end_station_id IS NULL

OR start_lat IS NULL

OR start_lng IS NULL

OR end_lat IS NULL

OR end_lng IS NULL

OR member_casual IS NULL;

Despite the removal of rows with null values, the tables were still extremely large. The following SQL query selects a smaller number of columns, which allowed for easier export into Excel files for additional data cleaning and analysis:

SELECT ride_id, rideable_type, started_at, ended_at, member_casual

FROM `casestudy-347018.TripData.2021_04_Trip_Data` /* change year and month to whichever file you are selecting */

ORDER BY started_at;

Upon opening each data table in Excel, the date/time data in the started_at and ended_at columns needed to be cleaned, with the primary task being to remove the "UTC" from each of the strings. In addition, the ride length in minutes was calculated for each trip. Due to the size of the dataset, combining all six months into a single spreadsheet was not possible, as it would have exceeded the maximum number of rows allowed in an Excel spreadsheet. Therefore, each month of data was separated into six different spreadsheets. For each of those spreadsheets, the following steps were performed in Excel for data cleaning and transformation:

1. Ensure Data types for the started_at, ended_at columns are set to General

2. Remove "UTC" from end of the time strings so Excel can do calculations:

A. created new columns, started_at_date, started_at_time

B. used the "split cells" tool in Excel to split the date, time, and "UTC" into three separate columns (may need to temporarily delete the column header when doing this step):

select Data tab then Text to Columns
select Delimited > Next then select Delimeters
choose Column data format
make sure there is no data to the right of the column to start out, otherwise it will get overwritten

C. delete the column with the "UTC" text

D. change started_at_date data type to Date "*3/14/2012" and started_at_time to the data type "13:30:55" Time

3. Convert start dates and times to plain text for concatenation (this was done because initial concatenation attempt yielded incorrect numbers), then convert back to a date/time format after concatenation.

A. for start date: create "start_date_using_TEXT" column, used formula =TEXT(cell,"MM/DD/YY"), make data type "General"

B: for start time: create "start_time_using_TEXT" column, used formula =TEXT(cell,"HH:MM:SS"), make data type "General"

C: create "concat_start_date-time" column containing combined info from steps A and B using CONCAT function: =CONCAT(column1," ",column2)

D. convert data type in concat_start_date-time to a Date: under More Number Formats, select Date, then use the Type "3/14/12 13:30"

4. Repeat Steps 2 and 3 for end dates and times: ended_at_date, ended_at_time columns.

5. Calculate ride time length

A. Go back to original data, name column F "ride_length".

B. Input this formula in ride_length: =(concat_end_date-time column) - (concat_start_date-time column)

C. Format ride_length as data type Time: under More Number Formats, select Time, then use the Type "37:30:55"

6. Convert ride time length to minutes (to make it easier to perform calculations)

A. Create column called ride_length_mins

B: Use this formula to convert ride_length from the HH:MM:SS format to a number of minutes: =((HOUR(F2)*60)+MINUTE(F2)+(SECOND(F2)/60))

C: Set the data type for ride_length_mins column to Number

7. Apply the ROUND function to ride_length_mins column to round numbers to nearest integer, this is to ensure no values slip through the cracks when calculating trip length distribution later on.

A. Create new column for rounded_mins

B. Use the formula =ROUND(D2,0) and apply it to each row in rounded_mins

C. Copy rounded_mins, paste back into ride_length_mins (use Paste Special and select Values to get just the values)

8. Add a column called day_of_week to the right of ride_length_mins. In the day_of_week column, apply the WEEKDAY formula:

=WEEKDAY(started_at_column, 1). This gives an integer output for the day of the week each bike trip started on, with 1 = Sunday, 2 = Monday, ..., 7 = Saturday.

Analysis

Using Excel, the following parameters were calculated for both casual rider trips and annual member trips:

Average trip length in minutes
Number of trips distributed by trip length in the following categories: 0-14 mins, 15-29 mins, 30-59 mins, 60-119 mins, 120+ mins
Percentage of trips in each of the time categories above
Total trips occurring on each day of the week
Percentage of trips occurring on each day of the week

Results are shown in the following tables:

NOTE: Overall avg. is a "weighted" avg. for all six months using the SUMIF Excel function for casual and member columns. Example formula for casual: =SUMIF('2021_04_Data'!C2:C298208,"=casual",'2021_04_Data'!D2:D298208)+SUMIF('2021_05_Data'!C2:C450995,"=casual",'2021_05_Data'!D2:D450995)+SUMIF('2021_06_Data'!C2:C608778,"=casual",'2021_06_Data'!D2:D608778)+SUMIF('2021_07_Data'!C2:C692332,"=casual",'2021_07_Data'!D2:D692332)+SUMIF('2021_08_Data'!C2:C674408,"=casual",'2021_08_Data'!D2:D674404)+SUMIF('2021_09_Data'!C2:C621150,"=casual",'2021_09_Data'!D2:D621119))/(SUM(B3:B8))

Data Visualizations

Looking at trip totals by day of the week, there is little variation with annual member trips with a difference of approximately 45,000 trips (approximately 2.7%) between the highest and lowest daily totals. With casual trips, however, the daily totals are far greater on weekends compared to week days. The pie charts below illustrate this trend further. Looking at casual rider trips, 42.2% of all trips occurred on weekend days. However, looking at the distribution for annual member trips, there is very little day-to-day variation in the percentages.

A large percentage of casual rider trips were less than 30 minutes: Approximately 72.5% of trips, amounting to 1,193,170 trips. Note that approximately 10.4% of trips lasted 60 minutes or longer, amounting to 171,040 trips. These values were obtained using multiple COUNTIFS formulas in Excel. Example formula below for calculating number of casual rider trips lasting less than 15 minutes:

=COUNTIFS('2021_04_Data'!C2:C298208,"=casual",'2021_04_Data'!D2:D298208,"<15")+COUNTIFS('2021_05_Data'!C2:C450995,"=casual",'2021_05_Data'!D2:D450995,"<15")+COUNTIFS('2021_06_Data'!C2:C608778,"=casual",'2021_06_Data'!D2:D608778,"<15")+COUNTIFS('2021_07_Data'!C2:C692322,"=casual",'2021_07_Data'!D2:D692322,"<15")+COUNTIFS('2021_08_Data'!C2:C674404,"=casual",'2021_08_Data'!D2:D674404,"<15")+COUNTIFS('2021_09_Data'!C2:C621119,"=casual",'2021_09_Data'!D2:D621119,"<15")

Looking now at annual member trips, a vast majority of trips were less than 30 minutes: Approximately 91.6% of trips (1,558,476 trips). This trend is similar to that of the casual rider trips. However, there is one clear difference in the casual and member distributions: Only 0.8% of the annual member trips were 60 minutes or longer (0.6% were 60-119 minutes, 0.2% were 120+ minutes; these two distributions were combined into one pie slice due to their miniscule size). Recall that 10.4% of the casual rider trips lasted 60 minutes or longer. In terms of raw numbers, there were 12,816 annual member trips lasting 60+ minutes, compared to 171,040 casual rider trips lasting 60+ minutes.

Recommended Actions for Marketing Strategy

Based on the analysis of the bike-share data, the following recommendations for marketing strategy are being made:

Focus annual membership advertisements on weekends. Over 40% of all casual rider trips occur on Saturdays and Sundays. Ads ran on these days will receive the most viewership from casual members. Placing advertisement banners at docking stations would be an effective delivery method, as riders would see the ad when starting and ending their trips.
Increase the cost of single-ride passes and full-day passes on weekends. Doing so should convince more casual riders to sign up for annual memberships so that individual riders can save money in the long-term (while Cyclistic gains revenue from more annual memberships). Furthermore, demand for bicycles is clearly greater on weekends, so this cost increase would be justified.
Charge a small fee for single-rides lasting 60 minutes or more. This should convince more casual riders to either purchase a full-day pass or upgrade to annual membership. Although this action may not directly lead to the purchase of more annual memberships, it should still increase revenue.

Additional Recommendations for Potential Further Analysis

The following non-marketing-strategy recommendations are being made, which would potentially allow for the discovery of additional data trends to assist with making actionable data-driven decisions:

Install digital odometers on all bicycles that relay data to a database. One major limitation to this analysis was the fact that the only location data stored was starting location and ending location. As stated earlier, this prevented trip length in distance from being utilized in this analysis, as distance travelled would be recorded as zero (and therefore be unknown) if a rider makes a round trip (ie. starting and ending location are the same). A digital odometer could provide valuable insight into further understanding the difference between bicycle usage by casual riders versus annual members.
Analyze data by time of day. This study could be expanded further by using start and end times to determine if there are differences in trips by time of day (eg. morning, afternoon, evening). This could be accomplished through additional usage of SQL, and possibly the R programming language. This shall be performed at a future time in efforts to expand the case study.

Page updated

Google Sites

Report abuse