Google Data Analytics Capstone Project - Sofía Velasco Gómez
The following case study is the Capstone project of my Google Data Analytics Professional Certificate. In this case, I have studied and analysed the situation of a fictional company called Cyclistic, where I am a junior analyst in the Marketing Analyst Team.
I have been given a scenario, data, and a business task, so my work as a junior analyst is to analyse this data and answer the questions our key stakeholders have, following the data analysis process: ask, prepare, process, analyse, share, and act.
I have used the R programming language, Tableau Public for data visualization, and Prezi for the presentation of the case.
Cyclistic’s finance analysts have concluded that Annual members are much more profitable than Casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing believes that the company’s future success depends on maximizing the number of annual memberships.
Therefore, my team wants to understand how Casual riders and Annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert Casual riders into Annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.
What is the problem I am trying to solve? How can our insights drive business decisions?
The company wants to increase the number of Annual Members because they are more profitable.
Our insights will help the Marketing Team design marketing campaigns to convert Casual riders into Annual Members, which will increase the profitability of the company in the long run.
Business Task
We need to gain a better understanding of how Annual Members and Casual riders differ and what would make Casual riders become Annual Members, so that we, as the Marketing Team, can develop and design marketing strategies to make this conversion possible.
Key stakeholders
The Cyclistic Executive Team, the Marketing Analytics Team, and Lily Moreno (Director of Marketing and my manager).
Guiding questions
· Where is your data located?
· How is the data organized?
· Are there issues with bias or credibility in this data? Does your data ROCCC?
· How are you addressing licensing, privacy, security, and accessibility?
· How did you verify the data’s integrity?
· How does it help you answer your question?
· Are there any problems with the data?
Deliverable
· A description of all data sources used
The data has been made available by Motivate International Inc. under this license agreement.
I downloaded the data and stored it on my computer, with a copy kept on an external device.
All the data is properly structured in tables, which makes it Comprehensive. Although data has been collected since 2013, I will use the data from the last three years because, as I will explain in depth later, it is the most Current. Finally, the source of the information is Cited, so the data is credible.
The data as a whole is not biased and there are no credibility issues: Motivate International Inc., through the fictional company Cyclistic, has provided all the data they could collect from 2013 to 2022, and since they are the source of the information themselves, the data is Reliable and Original.
The data is provided under Divvy's Data License Agreement, which allows us to use it under certain conditions. Privacy and data security are ensured, since no personal information (such as names or credit/debit card details) is included, so pass purchases cannot be connected to individuals to determine whether Casual riders live in the Cyclistic service area or whether they have purchased multiple single passes. In conclusion, everything above verifies the data's integrity.
Regarding the way the data is organised, it is quite poorly organised, and I found the following problems:
- From 2013 to 2019, Trips data is recorded sometimes by quarter, sometimes by month and other times by year. What makes it worse is that the data collected is not always the same: the column names differ, and the data types (character, integer, etc.) of the same columns vary from one file to another. This makes it very difficult to join the information into a single data frame because, as you will see I have tried, doing so requires many transformations.
- Regarding the Stations data, there are separate files for each year except 2018 and 2019 (there is no Station information for those two years). Moreover, the same issue mentioned above occurs again: each file covers a year, a quarter or a half-year, inconsistently.
- All these files contain information about the stations (station id, name, latitude and longitude), but when you try to join them into one master file you realise that, for the same station id, the name may have changed over the years; or, even when the station id and name remain the same, the coordinates change.
- From 2013 to 2019, station information such as the latitude and longitude of the start and end stations is not included in the Trips data files. This is a pity, as that information would be very helpful when analysing the data. When I tried to create a master station list with all the information so that I could add it to the Trips data set, I could not find an R function that would do what VLOOKUP does in Excel. (This does not mean that such a function does not exist; I simply did not have the knowledge to find it at the time. See the sketch after this list.)
- From 2020 to 2022, Trips data is recorded monthly. But again, the same issues appear, such as data types differing from one month to another. The Stations data, on the other hand, is included in the Trips data set, which is good because there is no need to look anything up.
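For reference, this kind of lookup can be done in R with dplyr::left_join, which matches rows on a key column much as VLOOKUP does in Excel. A minimal sketch, where the data frames trips and stations and their column names are assumptions for illustration:

    library(dplyr)

    # Hypothetical master station list: one row per station_id,
    # with its name and coordinates
    stations <- tibble(
      station_id   = c(1, 2),
      station_name = c("A", "B"),
      lat = c(41.88, 41.90),
      lng = c(-87.62, -87.64)
    )

    # Hypothetical trips table that references stations by id
    trips <- tibble(ride_id = c("r1", "r2"), start_station_id = c(1, 2))

    # VLOOKUP equivalent: attach station details to each trip by id
    trips_with_coords <- trips %>%
      left_join(stations, by = c("start_station_id" = "station_id"))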
Guiding questions
· What tools are you choosing and why?
· Have you ensured your data’s integrity?
· What steps have you taken to ensure that your data is clean?
· How can you verify that your data is clean and ready to analyse?
· Have you documented your cleaning process so you can review and share those results?
Deliverable
· Documentation of any cleaning or manipulation of data
First of all, before proceeding with the cleaning of the data, I spent time getting to know the database, its contents and its limitations. At first, I wanted to use all the data provided from 2013 to 2022 so that I could study the evolution and any trends over the years. However, I faced several challenges while merging all that data (the process is documented on the next page, called "Continuation - Process of merging all the data available"), and I finally decided to focus only on the data collected from January 2020 to August 2022.
Having reached this conclusion, I start with the preparation of the data:
1st. Download and load the packages we will need during our analysis. If other packages are needed, we will install and load them later.
2nd. Set the working directory to the folder where all the data sets are saved, so we have easy access to them.
3rd. Load and merge the data sets to be used and analysed.
- I load the data sets from September 2021 to August 2022 first, because they are the most recent information we have and, also, there are several incompatibilities between these data sets and those from Q1 2020 (first quarter of 2020) to August 2021. Once I load them into R, I merge them into one data frame called "merged_df".
- Next, I load the data sets from Q1 2020 to August 2021.
- Then I attempt to join merged_df with the newly loaded data sets, using the Q1 2020 data set to check whether this can be done.
Here, I realise that the columns start_station_id and end_station_id are numeric, whereas in merged_df they are character, so I convert them to character and verify afterwards that the change has actually taken effect.
- Now that I can see how these functions work, I proceed to merge the data sets from April 2020 to November 2020: they have the same issue as Q1 2020, so it is easier to join them first and change the column types afterwards.
- Once all the data frames are in the same shape, with the same number of columns and the same column types, I join the data sets from 2020 to 2022 into one data frame, merged_20_21_22, as sketched below.
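A minimal sketch of steps 1 to 3, assuming the tidyverse packages are used; the folder path is a placeholder, and the step-by-step merging described above is condensed into a single pass:

    # 1st: install (once) and load the packages for the analysis
    install.packages("tidyverse")
    library(tidyverse)

    # 2nd: set the working directory where the CSV files are saved
    setwd("~/cyclistic_data")  # placeholder path

    # 3rd: read every trips CSV, coerce the station id columns to
    # character so all files are compatible, then stack them together
    merged_20_21_22 <- list.files(pattern = "\\.csv$") %>%
      map(read_csv) %>%
      map(~ mutate(.x,
                   start_station_id = as.character(start_station_id),
                   end_station_id   = as.character(end_station_id))) %>%
      bind_rows()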
The data from 2020 to 2022 is structured and presented in a different way to data from 2013 to 2019 where there are different number of columns, the names of the columns change from one year to the other and even from a quarter to the following one and, in addition to that, in some cases the column type differ as well. Consequently, this is the reason why I treat this data differently to merged_20_21_22 and if you want to follow that process where I try to merge it all, you can check it on the following page called "Continuation - Process of merging all the data available".
The conclusion of that long process is that I do not have the knowledge to be able to join it all, so I finally determine that the best thing to do is to continue the analysis with merged_20_21_22.
4th. Cleaning and manipulating merged_20_21_22
- The first step in the cleaning process is to install and load the packages below, whose main purpose is cleaning data sets.
- After this, I start by checking the basics of the data frame: column names, summaries of the data, and so on.
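A minimal sketch of those checks, assuming the cleaning packages referred to are janitor and skimr (skimr being the source of the n_missing, complete_rate and n_unique summaries discussed next):

    # Packages oriented towards cleaning and inspecting data
    library(janitor)
    library(skimr)

    colnames(merged_20_21_22)   # column names
    glimpse(merged_20_21_22)    # column types and sample values
    skim(merged_20_21_22)       # per-column summary: n_missing,
                                # complete_rate, n_unique, mean, etc.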
In the summary above, the complete_rate for the variables start_station_name, start_station_id, end_station_name and end_station_id is below 1, which means these variables are not populated for every observation. For each of them, less than 12% of the information is missing.
On the other hand, n_unique counts the number of unique values, excluding NAs. In our case there are 13,024,689 unique values of ride_id, but the total number of observations is 13,024,898. Since n_missing for this variable is 0, and since each ride_id should be unique to a single ride and never repeated, my conclusion is that there are 209 duplicated observations. These duplicates must be checked, and removed if they cannot be amended.
The summaries above do not clear up my suspicions about duplicated values, so it is time to investigate them.
- In order to understand these values a little better, I use the functions below and create a new data frame which I thought would be helpful but, as you can see below, is not.
- According to the result of the anyDuplicated() function, there are no duplicates, which is odd; let’s use unique() to see whether any observations are removed from the data frame. If they are, it will mean that there were duplicates.
As the result above shows, the number of observations is the same for both data frames and no rows have been removed, meaning no duplicates have been deleted.
- Let’s try the distinct() function:
Again, according to distinct() there are no duplicates either. So perhaps there are no duplicated observations where every variable is the same; rather, the duplicates are only in the ride_id variable. I decide to double-check this theory.
After this, my suspicions are confirmed, so I decide to create a data frame with the duplicates in order to analyse them in depth.
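A minimal sketch of that double check; duplicated_rides is a name chosen for illustration:

    # Rows whose ride_id appears more than once: duplicates in the key
    # column only, since distinct() found no fully identical rows
    duplicated_rides <- merged_20_21_22 %>%
      group_by(ride_id) %>%
      filter(n() > 1) %>%
      ungroup() %>%
      arrange(ride_id)

    # Export them for inspection in Excel
    write_csv(duplicated_rides, "duplicated_rides.csv")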
- Once I export it to Excel, I find a total of 419 observations where the ride_id is the same but the rest of the variables differ.
Analysing this data, I realise that for every pair of duplicated ride_ids, in at least one of them started_at is later than ended_at. This is a problem because the time difference (ride_length from now on) would be negative, as shown in the picture below in cell F3.
Since a duration cannot be negative, all the duplicates where this happens must be removed from the database: they are an example of data with substantial errors that could lead our analysis astray if not deleted.
- Having observed this, I continue the analysis by finding out how many of the 13 million observations have a negative ride length (less than 0). To do that, I first have to calculate the time difference for every observation.
- After that, I run some code to learn how many of these observations are less than or equal to 0 (see the sketch below).
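A minimal sketch of both steps, assuming started_at and ended_at are parsed as date-times:

    # Ride duration in minutes for every observation
    merged_20_21_22 <- merged_20_21_22 %>%
      mutate(ride_length = as.numeric(
        difftime(ended_at, started_at, units = "mins")))

    # How many observations are negative, zero, or either?
    sum(merged_20_21_22$ride_length < 0)
    sum(merged_20_21_22$ride_length == 0)
    sum(merged_20_21_22$ride_length <= 0)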
The results show that 12,101 observations out of 13 million are negative or zero: in 10,858 of those 12,101 observations the ride_length is negative, and in the other 1,243 it is 0.
My conclusion is that these observations should be removed from our database, because a negative ride_length does not make sense and a ride_length of 0 means the customer did not really use our services. Also, as explained before, these 10,858 negative observations include the 211 duplicated observations with negative ride_length discovered earlier.
- How can we remove these observations? Easily: by keeping only those whose ride_length is greater than 0.
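A minimal sketch; merged_positive is a name chosen for illustration:

    # Keep only rides with a strictly positive duration
    merged_positive <- merged_20_21_22 %>%
      filter(ride_length > 0)

    nrow(merged_positive)  # 13,012,797 in the analysis above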
Once this is done, our data set has 13,012,797 observations, which matches the calculations made (13,024,898 - 12,101).
In addition, I can prove and check that no duplicates exist in the new data set, ensuring the database has been correctly cleaned.
According to these results, the observations with duplicated ride_ids have been removed.
- Next, I take a look at the characteristics of the cleaned data frame to see whether there is any significant change from the original.
Now, n_unique for ride_id coincides with the total number of observations.
The cleaning has also changed n_missing for start_station_name, start_station_id, end_station_name and end_station_id, which seems logical: some of the removed duplicates and non-positive ride_length observations probably had NA values in these variables. Proportionally, however, it makes sense that complete_rate and n_unique have barely changed. The same applies to n_missing for end_lat and end_lng.
As for the statistical values calculated for these variables, they have not changed either: only 62 observations containing NA values were removed, and 62 out of a total of 13 million is a tiny percentage that barely affects these statistics.
- As for our new variable ride_length, since it did not exist in the original database we cannot compare results directly, but we can analyse the values that would appear if the values less than or equal to 0 had not been cleaned.
As a result, the minimum value for ride_length before the cleaning is negative.
- Looking at the mean and median, both have been affected by the cleaning, the median less so than the mean.
The difference between the two means is larger than the difference between the two medians, which is expected, since the median is more robust to extreme values than the mean. Also, both the mean and the median are higher once the values less than or equal to 0 have been removed.
On the other hand, the maximum is quite suspicious: 156,450.4 minutes, which is 2,607.5 hours, or 108.6 days. Realistically, it is very unlikely that a customer spent 108 days riding a bike. In conclusion, there are odd maximum values as well as odd minimum values.
- In order to find out which values can be considered normal, I use a box plot to identify the outliers.
On the original database, the quartiles are approximately as follows:
Q1 is 6.75 minutes, Q2 is 12.15 minutes and Q3 is 22.27 minutes, where Q1 is the lower quartile, Q3 the upper quartile, and Q2 the median.
The results after calculating the whiskers (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) are as follows:
The lower whisker is -15.78 minutes and the upper whisker 45.53 minutes, so any ride_length below -15.78 or above 45.53 would be considered an outlier.
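A minimal sketch of that calculation:

    # Quartiles and whiskers (Tukey fences) for ride_length
    q   <- quantile(merged_positive$ride_length, probs = c(0.25, 0.5, 0.75))
    iqr <- q[[3]] - q[[1]]

    lower_whisker <- q[[1]] - 1.5 * iqr  # reported above as -15.78 min
    upper_whisker <- q[[3]] + 1.5 * iqr  # reported above as 45.53 min

    boxplot(merged_positive$ride_length)  # visual check of the outliers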
- These results led me to create a data frame removing those outliers as well as the values less than or equal to 0.
When I check this data frame with ride_length sorted in ascending order, I realise that for every ride_length of 0.0167 minutes (one second), the start_station_name and end_station_name are the same. Given this and how short the ride_length is, I suspect these trips were registered wrongly, because one second is not enough time to go for a ride and leave the bike in the same place.
My first thought is to remove from the database all trips where start_station_name = end_station_name AND ride_length is greater than 0, but this would not make sense: I would be deleting observations where people genuinely started a trip at a station and, after a longer or shorter ride, returned the bike to the same place. So, in order to keep the relevant data, which must make sense at the same time, I need to refine the conditions.
- On the other hand, after some googling I found a study by the company Sleep Cycle (link to the study here), which reported on 4 July 2022 that the average sleep time of United States citizens is 7 h 10 min.
So, since 7 h 10 min is 430 minutes, deducting it from the 1,440 minutes in a day (24 hours) leaves 1,010 minutes (16 h 50 min) during which a citizen could hypothetically be using our bikes, at most, per day.
Although it is unlikely, there are observations in our database where the ride_length exceeds 24 hours of use, so I take these 1,010 minutes as the maximum limit for an observation to be considered valid. Our box plot treats values under 45.53 minutes as normal, but that does not mean there are no customers who use our services for longer, so I think 1,010 minutes is a better limit: it does not discriminate against trips that last more than 45.54 minutes (I rounded up).
As for the minimum limit, I decided that a valid ride must be longer than 1 minute, since someone may use a bike just to get to a very nearby place.
- Now, having settled on those limits, I create a cleaner data frame where these conditions are established:
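A minimal sketch, applying the limits above; cyclistic_clean is a name chosen for illustration:

    # Keep rides longer than 1 minute and no longer than 1,010 minutes
    cyclistic_clean <- merged_positive %>%
      filter(ride_length > 1, ride_length <= 1010)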
This removes the observations with duplicated ride_ids mentioned earlier, as well as those with illogical ride_length values.
- Once this is done, there is one more thing to look at: the NA values. To find out how many there are, I use the code below:
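A minimal sketch of one way to count them:

    # NA values per column, and in the whole data frame
    colSums(is.na(cyclistic_clean))
    sum(is.na(cyclistic_clean))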
As explained previously, I opt to remove the NA values because I could not find a function that would do what VLOOKUP does in order to fill in those cells.
- Let's remove the NA values with the function below, and with that our database is cleaned:
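A minimal sketch, using tidyr's drop_na():

    # Drop every row containing at least one NA; the result is the
    # cleaned database used in the analysis
    cyclistic_clean <- drop_na(cyclistic_clean)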
Guiding questions
· How should you organize your data to perform analysis on it?
· Has your data been properly formatted?
· What surprises did you discover in the data?
· What trends or relationships did you find in the data?
· How will these insights help answer your business questions?
Deliverable
· A summary of your analysis
In order to analyse trends across the months and years, I decide to create three different columns to specify the year, the month and the day of the week.
I will also add the column ride_length in HH:mm:ss format.
Now, to have the columns in the order I want, I run the following code:
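A minimal sketch of those three steps, assuming lubridate for the date parts and the hms package for the HH:mm:ss column; the column order in select() is an assumption:

    library(lubridate)

    cyclistic_clean <- cyclistic_clean %>%
      mutate(
        year     = year(started_at),
        month    = month(started_at, label = TRUE),
        week_day = wday(started_at, label = TRUE),  # locale labels, e.g. "sá."
        ride_length_hms = hms::hms(seconds = round(ride_length * 60))
      ) %>%
      # Reorder the columns as desired
      select(ride_id, started_at, ended_at, year, month, week_day,
             ride_length, ride_length_hms, everything())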
After this, and in order to check that everything makes sense, I run some calculations: the mean ride_length, the maximum ride_length and the mode of week_day.
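A minimal sketch of those checks; base R has no built-in mode function, so the most frequent weekday is taken from a frequency table:

    mean(cyclistic_clean$ride_length)  # average ride length, in minutes
    max(cyclistic_clean$ride_length)   # longest ride

    # Mode of week_day: the most frequent value in its frequency table
    names(which.max(table(cyclistic_clean$week_day)))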
According to these results, the mean is 20 min 25 s, which coincides with the mean calculated before. The maximum ride_length is 16 h 49 min 53 s, which makes sense as it falls within the range we established earlier for a valid observation. Finally, the mode of week_day is Saturday (sábado in Spanish, shown as "sá." in the output).
Next, I will proceed to study the differences between Members and Casual riders.
GENERAL TREND - Casual riders VS Members
- The number of Members who have used our service from 2020 to 2022 is higher than the number of Casual riders.
- On the other hand, Casual riders use our service for longer than Members: their average riding time is more than double that of Members.
TREND FROM 2020 TO 2022 - Casual riders VS Members
- The number of riders grows from 2020 to 2021 for both Casuals and Members, but a drop in the number of riders of both types can be seen in 2022.
- The time spent by Casual riders drops dramatically, while for Members it remains stable.
It must be taken into account that there is only data up to August 2022. So, in order to check whether the above trend holds, I compare only January to August of each year.
- There is an increase in the number of Member riders, while for Casual riders there is growth from 2020 to 2021 followed by a slight decrease in 2022.
- As in the previous trend, the time spent by Casual riders drops dramatically, while for Members it remains stable.
GENERAL TREND – Days of the Week
- The number of rides increases gradually from Monday (lunes in Spanish, "lu." in the charts) to Saturday ("sá."), when it reaches its maximum. It then decreases on Sunday (domingo, "do."), although the number of rides on that day is still higher than on the rest of the week.
- The average time spent riding at weekends is over 20 minutes. On Mondays ("lu.") and Fridays (viernes, "vi.") the average is very close to 20 minutes, while on Tuesdays (martes, "ma."), Wednesdays (miércoles, "mi.") and Thursdays (jueves, "ju.") it is a little over 17 minutes.
Casual Riders VS Members – Day of the week
- The number of Casual riders is higher than that of Members at weekends ("sá." and "do."), while the number of Members is higher on working days ("lu." to "vi.").
- Conclusion: Casual riders use the service for leisure activities, spending more time riding (over 25 minutes), while Members use it more frequently on working days to commute between home and work/school, spending less than 15 minutes per ride, except at weekends, when they spend a little more than 15 minutes.
MONTHLY TREND FROM 2020 to 2022 – Days of the week
- There is a seasonal trend repeated across the three years, in which the number of rides on each day of the week peaks during the summer season (May to September).
- Looking at working days ("lu." to "vi."), the number of rides increases from 2020 to 2022.
- The time our customers spend on a ride generally increases from April to June (sometimes into July).
- Regarding the average ride_length, in 2020 the difference between seasons is more noticeable on every day of the week; this seasonal difference becomes less pronounced in 2021 and 2022.
- Conclusion: the best months for our company tend to be in spring and summer, when weather conditions are good, temperatures are pleasant, days are longer, and the rainy season is over.
YEARLY TREND - Casual riders VS Members - Days of the week
YEAR 2020
YEAR 2021
YEAR 2022
- The general trend across these three years is that Members use our service more times than Casual riders during the week, while at weekends Casual riders increase their use above Members, although in 2020 Casual riders were on a par with Members on those two days.
- As for ride_length, the time spent by Casual riders is always higher, on average, than that of Members, and this pattern is almost identical across the three years analysed.
Guiding questions
· Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
· What story does your data tell?
· How do your findings relate to your original question?
· Who is your audience? What is the best way to communicate with them?
· Can data visualization help you share your findings?
· Is your presentation accessible to your audience?
Deliverable
· Supporting visualizations and key findings
This has already been done in the previous step, but you can find a proper presentation at this link.
Guiding questions:
· What is your final conclusion based on your analysis?
· How could your team and business apply your insights?
· What next steps would you or your stakeholders take based on your findings?
· Is there additional data you could use to expand on your findings?
Deliverable:
· Your top three recommendations based on your analysis
Recommendations:
1) Create a Monthly membership. This way, Casual riders who may see an Annual membership as too long a commitment can buy this new pass, and we will start building loyalty.
2) Create promotions and offers, such as:
- For Casual riders -> for every 40 minutes they have ridden with us, we will offer promotions/discounts on Monthly passes, encouraging them to enjoy its benefits.
- For Monthly members -> for every 120 minutes they have ridden with us, we will offer promotions on Annual memberships, so that these "halfway" loyal Monthly members may become Annual Members.
- For Annual members -> for every 30 minutes they have ridden with us at weekends, we will award points which, at the end of the year, they can exchange for discounts on the following year's Annual membership. This gives them an incentive to stay with us.
3) Launch promotions during festivities and at other specific times of the year ("Christmas promotion", "Black Friday promotion", etc.), offering discounts and encouraging Casual riders to buy Monthly or Annual memberships.
Thank you for your attention and your interest in my approach to this case.