In 2016, Cyclistic launched a successful bike-share offering. Since then, the programme has grown to a fleet of more than 5,600 bicycles and 600 docking stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Cyclistic set itself apart from their competitors by offering reclining bikes, hand tricycles and cargo bike, make bike-sharing more inclusive to people with disabilities and riders who cannot use a standard two-wheeled bike
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the Director of Marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, the approach will be to convert casual riders into members.
As a junior data analyst working in the marketing analyst team at Cyclistic, what are the key questions that I want to answer.
How do members and casual riders differ in Cyclistic bikes usage based on the past 24 months worth of data.
Are there seasonal differences between members and casual riders?
Are there discernable patterns in usage of bikes on a day to day basis?
Where are the locations with the highest density of Casual users?
I will use the past 2 years of Cyclistic's trip data (from Oct 2019 to Sept 2021) to analyze and identify trends. The historicial trip data can be downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html in csv format.
2019 columns via google sheets
Spotting anomalies in the data
There are noticeable anomaly in the trip duration column which should be a numeric format rather than text / string based data type
2020 and 2021 columns via google sheets
Columns in 2020 and 2021 data are named differently and have additional info as compared to 2019's
Creating the necessary tables for 2019, 2020 and 2021 data sets with the corresponding column names and data types in PostGreSQL, prior to importing the entire data
We shall repeat the entire process for all CSV files downloaded covering Q4 2019 to Q3 2021. This will be followed by the importing of the data via the PgAdmin interface of the CSV data into the respective tables
Checking the tables and data sets using query language in SQL
Combining all tables from 2020 - 2021 into a single view. This view will be the primary source of data we will be querying from
Let's try breaking down the ride_start date and time into year, month (spelled), day of week (spelled), ride duration. These will be additional columns to aid us in our analysis later
We will start by cleaning the 2019 data set
We will rename the cols name in 2019 data to match those of 2020 and 2021, as well as remove columns that won't be used. As there are several weird inputs in the tripduration column, we will remove it and create our own calculations
Data categories under the member_casual column of 2020 onwards are of 'member' or 'casual' labels. However 2019 and the years before are indicated as 'Subscriber' or 'Customer'. We will need to make our data consistent with their current nomenclature.
We will now create a view for 2019 data with the inclusion of the date, month, day of week and trip duration queries and rearrange the columns sequence.
Only Start date and time col are retained while end date and time col will not be part of the 2019 view although the end date time will be used as part of the time difference calculation
We will create a new view for 2020 & 2021 data with the inclusion of the date, month, day of week and trip duration queries.
At the same time we will remove bicycles that are docked @ the HQ for repairs hence any trips starting or ending @ HQ QR will be removed, and sequence of coloumns will match that of 2019's.
Combining the 2019, 2020 and 2021 data set into 1 query using a UNION ALL function. We have to ensure that the number of cols and data types must be similar in both tables hence geospatial location data will be removed from the 2020_2021 view
Of the more than 8.5 million rides occurring over the past 24 months, Casual riders make up approximately 40% of the total, which is quite an significant number. This group of users presents a large potential amount of riders to be converted into Members.
As 2019 and 2021 data are not collated over 12 months, it will be meaningless to compare total rides per year. However we can see that by Sept 2021 the total number of rides already surpasses the 12 months of 2020, which shows a very healthy uptake of Cyclistic Bike rides
However if we group the number of riders by year and membership category, we can see that comparing 12 months of 2020 data with 9 months of 2021 data, Members' usage of Cyclistic bikes is almost on par with 2020 (2134621 Vs 2143042). For Casuals we see a significant jump of more than 600K in usage based on the first 9 months data alone (1952065 Vs 1340768). Cyclistic can be confident that by end of 2021, usage for both member categories will far surpass that of 2020.
Ridership peaks during the tail end of Summer and the start of Autumn, which is most probably due to the cooler weather. Ridership drops during the Winter Season in Chicago possibly due to the colder weather and safety conditions like visibility and slippery road surfaces
May to September are months where Casual users form a large portion of ridership. It is advisable to conduct marketing related activities targeted at this category of users within this 5 month period.
Member rides outnumber Casuals on weekdays while being overtaken on weekends. Do Members used the bikes primarily for work, while Casuals use it mainly for leisure?
Based on the start ride timings, Members usage show a spike in bike rides at times of the day (7 - 8am & 4 - 6pm) that seem to add weight to the notion that the primary usage of bikes are for work. We see a huge increase in bike usage from 4 to 6pm, peaking at 5pm for both Members and Casual users. Users might tend to be more relax, taking their time to go home as compared to rushing to start work on time which explains the spike in evening usage.
Although Casual users tend to ride longer durations than Members, we see that both categories have night owls riders with the longest rides per category at the witching hour.
The maps and graphs serve as a guide for the 15 Start and End Bike Stations with the highest density of Casual riders. The marketing team can consider to deploy visible marketing strategies and collaterals at the locations to maximize visibility of the Cyclistic brand.