case study on a ride-share company called cyclistic
In this case study, we will be reviewing a dataset containing Cyclistic trip data for 12 months. Cyclistic is a fictional company name and has been used to protect the anonymity of the source. In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. For this case study, the datasets are appropriate and will enable us to answer the business questions. The data has been made available by Motivate International Inc. under this license.
The company’s future success depends on maximizing the number of annual memberships. Therefore, the goal is to understand how casual riders and annual members use Cyclistic bikes differently.
To understand why would casual riders buy Cyclistic annual memberships
To understand how can Cyclistic use digital media to influence casual riders to become members
The dataset in the form of 12 different zip files was downloaded and extracted. Initially, the CSV files were opened just to review what kind of data was being dealt with. The data of all 12 months summed up to be 5595075 rows and 13 columns.
Tools like Microsoft Excel, MySQL, Python, and Tableau have been used to process and analyze the data in the project.
Bad data have been categorized as four simple parameters:
Empty cells - Having empty cells could lead to wrong results.
Data in the wrong format - Cells with different/wrong formats could completely change our course of analysis and make it very difficult.
Wrong data - When the data entered is simply wrong.
Duplicates - Having duplicate data will lead to wrong results.
Opening the file for the month of January, bad data in the form of empty cells, wrong data format, duplicate data have been identified. The date was in a mixed format having both [dd/mm/yy HH:MM: SS] and [mm/dd/yy HH:MM: SS]. This was observed in data of other months as well. To make sure of having the right format throughout the columns, the columns, "started_at" and "ended_at" were changed to the format of [YYYY/MM/DD HH:MM: SS] as this format is accepted in SQL. Using macros, this process along with removing duplicate rows in the file of January was recorded and applied for all 12 months which saved a lot of time.
Before
After
The files were then imported into the Mysql server with localhost. Using Tableplus as the IDE, all the data was imported and the cleaning process began again. rows with NULL values were checked and eliminated. It has been observed that NULL/ empty values were prominent in columns "start_station_name" and "end_station_name" which were taken as leverage and rows were eliminated where these conditions were true. The reason to filter by only these two columns is that by using only these 2 columns, SQL need not filter through each and every value in a column. The query is done much faster. The data will be checked again for NULL values in the next stage.
The trip duration for each ride has been calculated and created as a new column called "trip_duration" which shows the trip duration for each ride in minutes. The day of the week for each ride has also been queried and added to a new column called "day_of_week". Using CTE and Union, these queries were done at once for every 3 months and exported as CSV files as 4 quarters of the year. Using the CTE saved time as the code to extract trip duration and day of the week was written once every quarter instead of writing it for each and every month.
Unioning all files into one single file was attempted but could not be done as the server kept disconnecting and the OS kept crashing. This is due to insufficient hardware and the heavy load on the limited memory led to system crashes. Hence, the files were extracted as quarters.
The query done for this complete process can be accessed here
The four CSV files were then imported to Tableau for further analysis. The four CSV files were unionized in Tableau to aggregate the data for all 12 months and perform our analysis on it.
The first sheet created was to represent the monthly number of rides taken by members vs casual riders.
The above visualization represents how the number of rides taken by casual riders differs from members. As we can see, throughout the year, members have a higher number of rides. It is between the duration of June to September that the number of casual riders is increasing. This data is based on Chicago and they have summer from June to September. It is safe to assume that since it is summer, more tourists use bikes to move. Summer is also the time when educational institutions give summer breaks and students could use them for leisure activities.
The above visualization shows the traffic at different stations filtered to show only for casual riders. As the number of rides increases the size of the circle increase and the color brightens. We can observe the highest traffic is at the station located near the dock which is a popular place to spend time for fun.
The above visualization represents the traffic at each start station filtered to show the popular stations for members only. As we can see unlike traffic at the start station for casual members the traffic within the city is spread out to different places within the city.
This is the same in the case of end-station traffic as well. To view the end station traffic, click on the link below.
The above visualization represents the number of rides aggregated to a day of a week throughout the year. We can observe that the maximum number of rides are during the weekends( Saturday and Sunday). We can also observe that number of rides taken by members not going through such a significant change as the number of rides taken by casual riders. During the weekdays, members are contributing to more rides than casual riders. During the weekends, the number of casual riders is twice the number of rides taken on any weekday.
The above visualization represents the comparison of the weekly average trip duration in minutes between members and casual riders. It can be clearly observed that casual riders dominate members on every single day.
The above visualization represents the hourly usage of bikes for all days in a week. We can observe that there is a sudden rise in the number of rides at 5 AM for the curve representing the riders who are members. The curve suddenly rises again from 4 PM and peaks at 5 PM. In the case of casual riders, we see a steady rise in the curve. The curve indicating the number of rides by members is greater than the curve representing the number of rides taken by casual riders. The curve changes significantly when we check the comparison of rides for specific days. The number of casual riders increases as the weekend approaches and becomes greater than the number of riders with memberships. The comparison can be viewed as a Tableau story from the link below.
The above visualization represents the average trip duration by casual riders and riders with memberships. We can observe that at every point, the curve representing the casual riders is overlapping the curve representing the riders with memberships significantly. The comparison has been done by checking for every single day of the week and the results are the same as above. The average trip duration taken by casual riders is always more than that of riders with memberships
The above visualization is a dashboard to represent the usage of different bikes by casual and membership riders. We can observe that the classic bike is the most preferred one by both members and casual riders. The second preferred bike is the electric bike. The dashboard also shows the average trip duration on each bike by members and casual riders.
We can get some insights from the visualizations above. First, we must acknowledge the power of visualization as it made it very easy for us to understand what seemed to be a very huge and confusing data from which we cannot get anything. Now, let us point out the insights we get from these visualizations.
The number of rides by members is greater than casual riders.
The average trip duration by members is less than casual riders.
June to September is the peak time for rides.
The rides taken by casual riders surpass the rides taken by members during summer.
The traffic of ride stations increases as the station gets closer to the harbor. Traffic is highest near the harbor and other popular places compared to regular places.
When it comes to riders with membership, the traffic is spread throughout the city. This could indicate that majority of riders with members are employees commuting to work and home.
The number of rides is highest during the weekends and casual riders have a huge contribution to it. The number of rides taken by members does not change too much. It decreases during the weekends and stays almost steady during weekdays.
Although the number of rides taken by casual riders is lower than the rides taken by members, the average duration of casual rides is significantly high.
The peak hours are 8 AM and 5 PM during weekdays which is hugely contributed by members. During weekdays, the rides taken by members are greater than casual riders. The peak hours during weekends are from 1 PM to 3 PM which is hugely contributed by casual riders. Here the casual rides are greater than rides by members.
The classic bike is the most preferred bike among both members and casual riders.
Casual riders being the target customers here, the company could offer lower hourly charges for members. Since casual riders are contributing more for the average trip duration, they would want to get the membership for the lower charges. The company can also run campaigns on social media campaigns with special offers on weekends for riders with memberships. This could also help in customer acquisition.
Casual rides are high during weekends and the traffic for casual riders is high near popular places like Polk bros park, Dusable Harbor, Millennium park, lakeshore. These are places usually visited in groups where families or a group of friends visit to have a fun time. Group memberships could be provided for a small discount which will increase the chances of converting the casual riders to members as the idea of being a group is well received by people.
Social media is a very powerful tool to promote these offers and discounts and attracts more customers.