Machine learning (ML) case studies help people from both technical and nontechnical backgrounds apply ML techniques to real-world problems. Our task is to define the brand persona for a new cycle sharing scheme. In short, we have to present a strong marketing plan for reaching potential customers of company XYZ. The company planned to invest in infrastructure so that it would be in place when customers started using the service.
The company's strategy was to keep customer retention high so that the project remained feasible. This was all in vain, as the results were contrary to what was expected.
There were three possible strategies for addressing the problem:
1. Getting a sponsor on board (not an option for solving the issue).
2. Increasing the service charges (not an option either, because this is a publicly sponsored initiative).
3. Increasing the customer pool (the only option left).
The cycle sharing scheme gives the people of the city a convenient, cheap, and green way to commute. The focus so far has been on increasing the number of bikes and parking stations in order to improve convenience and accessibility for customers, yet customer retention remained an issue. We have to find a way around this problem, and we have decided on a marketing channel that offers broad reach at low cost. Understanding the brand persona is essential, because it helps you reach a targeted audience that is more likely to convert. It also helps in reaching out to sponsors who target a similar persona.
This two-fold approach can make our bottom line positive. Which attribute correlates best with trip duration and number of trips? Which generation adopts cycling services? We need to find out. The company has data available for 2014 and 2015.
The company gathers the following information for each individual trip:
1. trip_id
2. starttime
3. stoptime
4. bikeid
5. tripduration
6. from_station_name
7. to_station_name
8. from_station_id
9. to_station_id
10. usertype
11. gender
12. birthyear
There are more than 236,000 trips in 2014 and 2015, which is enough for a general assessment to inform the marketing strategy.
There are two kinds of users of the bicycle sharing service: short-term pass holders and members. According to Fig-1, members take more trips than short-term pass holders.
Fig-1
The analysis (Fig-2) revealed a gender gap as well: males dominate the trips taken as part of the program. Next, we wanted to know more about the target customers at whom the company's marketing message will be aimed.
Fig-2
The majority of the people (Fig-3) who subscribed to this program belong to Generation Y, also known as millennials: anyone born between 1981 and 1996 (ages 23 to 38 in 2019). Because birth year is recorded only for members (see Fig-6), the millennials identified here are members rather than short-term pass holders.
Fig-3
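To make the generation split concrete, here is a minimal sketch of how a birth year could be mapped to a generation label. Only the millennial range (1981 to 1996) comes from the text above. The other boundaries, the `generation` helper, and the sample birth years are assumptions for illustration, not the code behind Fig-3.

```python
from collections import Counter

# Inclusive birth-year ranges. Only the millennial range is taken from the
# article; the remaining boundaries are assumed for illustration.
GENERATIONS = [
    ("Silent Generation", 1928, 1945),
    ("Baby Boomers", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Millennials", 1981, 1996),
    ("Generation Z", 1997, 2012),
]


def generation(birthyear):
    """Return the generation label for a birth year, or 'Unknown'."""
    if birthyear is None:
        return "Unknown"
    for name, first, last in GENERATIONS:
        if first <= birthyear <= last:
            return name
    return "Unknown"


# Hypothetical values standing in for the birthyear column.
sample_birthyears = [1947, 1964, 1985, 1990, 1994, 1979, None]
counts = Counter(generation(year) for year in sample_birthyears)
print(counts.most_common())
```

Rows without a birth year fall into "Unknown" instead of being dropped, so the gap in the data stays visible in the counts.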
Fig-4 shows that more than 70,000 members are millennials, so we want to make sure that the brand engages millennials as part of the marketing plan.
Fig-4
The bar graph (Fig-5) shows the distribution of birth years by gender. Gender is split into three categories: Male, Female, and Other. This means that for each birth year we have the trip count for all three gender types. The majority of trips were taken by males. However, subscribers born in 1947 were all female, and trips by those born in 1964 and 1994 were dominated by females as well.
Fig-5
The distribution in Fig-6 shows only one user type rather than two, which means that birth year information was present for only one user type. The issue is that all birth year values for short-term pass holders are missing (null).
Fig-6
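A gap like this can be confirmed by measuring the share of missing birth years per user type. The sketch below uses a few hypothetical rows. The `missing_share_by_group` helper and the usertype labels are assumptions, not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows; the usertype labels are assumed values.
trips = [
    {"usertype": "Member", "birthyear": 1985},
    {"usertype": "Member", "birthyear": 1990},
    {"usertype": "Short-Term Pass Holder", "birthyear": None},
    {"usertype": "Short-Term Pass Holder", "birthyear": None},
]


def missing_share_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows in that group whose value is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


shares = missing_share_by_group(trips, "usertype", "birthyear")
print(shares)
```

A share of 1.0 for one group and 0.0 for the other would match what Fig-6 suggests: the field is missing for one user type as a whole, not scattered at random.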
There seems to be a definite pattern in trip duration over time (Fig-7). The pattern is cyclic in the sense that it repeats, though not over fixed-length periods, and it can be explored further for more detail and for predictions.
Fig-7
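A pattern like the one in Fig-7 typically comes from aggregating trips by day. As a rough sketch, assuming `starttime` is a string in month/day/year hour:minute form (the format, the sample rows, and the `daily_mean_duration` helper are all assumptions), the daily mean trip duration could be computed like this:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Assumed timestamp layout; adjust it to match the real starttime column.
TIME_FORMAT = "%m/%d/%Y %H:%M"

# Hypothetical rows standing in for the trip data.
trips = [
    {"starttime": "10/13/2014 10:31", "tripduration": 900.0},
    {"starttime": "10/13/2014 17:05", "tripduration": 600.0},
    {"starttime": "10/14/2014 08:12", "tripduration": 600.0},
]


def daily_mean_duration(rows):
    """Group trips by calendar day and return {date: mean trip duration}."""
    by_day = defaultdict(list)
    for row in rows:
        day = datetime.strptime(row["starttime"], TIME_FORMAT).date()
        by_day[day].append(row["tripduration"])
    return {day: mean(values) for day, values in sorted(by_day.items())}


daily = daily_mean_duration(trips)
for day, value in daily.items():
    print(day, value)
```

Plotting these daily values over the two years is one way to make any repeating pattern visible before fitting a time series model.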
We need to understand trip durations in detail. For that, we calculated the mean and median of trip duration. We also wanted to determine the station from which most trips originated, in order to run promotional campaigns for existing customers.
tripduration.mean= 1196.1159457770057
tripduration.median= 639.4555
from_station_name.mode= Pier 69 / Alaskan Way & Clay St
The output shows that most trips originated from the Pier 69 / Alaskan Way & Clay St station, which makes it the ideal location for promotional campaigns targeted at existing customers. The output also shows that the mean is greater than the median. This happens when some extreme values lie above the median, or when the majority of values lie above it. The distribution in Fig-8 has only one peak (mode). It is not symmetric, and most of its values lie to the right of the mode. The extreme values on the right are few, but their magnitude pulls the mean toward them, which is why the mean is greater than the median. The data is therefore not normally distributed.
Fig-8
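The effect of a few extreme values on the mean is easy to reproduce on a small scale. The sketch below uses hypothetical durations and station labels, not the real data, and computes the same three statistics with Python's `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical trip durations: mostly short rides plus two very long ones.
durations = [300, 420, 480, 600, 600, 660, 720, 900, 5400, 7200]
# Hypothetical start stations for the same ten trips.
stations = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

avg = mean(durations)
mid = median(durations)
busiest = mode(stations)

# The two long rides pull the mean well above the median.
print("mean:", avg, "median:", mid, "busiest station:", busiest)

# Without the two long rides the mean falls back close to the median.
typical = durations[:-2]
print("mean:", mean(typical), "median:", median(typical))
```

With all ten values the mean is 1728 against a median of 630. Dropping the two long rides gives a mean of 585 and a median of 600, which mirrors the gap between the reported mean and median of trip duration.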
The box plot (Fig-9) shows a large number of outliers in trip duration. The mean is strongly affected by the presence of these outliers.
Fig-9
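The article does not say which rule was used to handle the outliers. One common option, shown here purely as an assumption, is the 1.5 × IQR rule that box plots use for their whiskers, with flagged values replaced by the median. The `replace_outliers_with_median` helper and the sample durations are hypothetical.

```python
from statistics import mean, median, quantiles


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.

    This is an assumed treatment; dropping or capping the outliers are
    equally valid alternatives.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mid = median(values)
    return [v if low <= v <= high else mid for v in values]


# Hypothetical trip durations with two extreme rides at the end.
durations = [300, 420, 480, 600, 600, 660, 720, 900, 5400, 7200]
cleaned = replace_outliers_with_median(durations)

print("before:", mean(durations), median(durations))
print("after: ", mean(cleaned), median(cleaned))
```

On this sample only the two longest rides fall outside the fences, and replacing them brings the mean down sharply while the rest of the values are left untouched.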
After handling the outliers, we checked the mean, median, and standard deviation of trip duration again to see how the measures of center look in the transformed distribution.
tripduration.mean= 717.218601458982
tripduration.std= 435.46202597881165
tripduration.median= 639.4555
After the transformation (Fig-10), the mean is much closer to the median than before (about 717 versus 639). The mean is still greater than the median, so the distribution remains positively skewed, though far less so.
Fig-10
We then looked at trip duration by birth year (Fig-11). Most trips were taken by people born between 1950 and 1997.
Fig-11
Correlation refers to the strength and direction of the relationship between two quantitative features. A correlation value of 1 means perfect correlation in the positive direction, -1 means perfect correlation in the negative direction, and 0 means no linear correlation between the features.
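For readers who want to see where these values come from, here is a minimal sketch of the Pearson correlation coefficient, the usual choice for quantitative features. The article does not state which coefficient was used, so treat Pearson as an assumption. The `pearson` function and the sample lists are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # y rises in step with x
print(pearson(x, [10, 8, 6, 4, 2]))  # y falls in step with x
print(pearson(x, [2, 1, 0, 1, 2]))   # symmetric V shape, no linear trend
```

The third case is a reminder that a coefficient near 0 rules out only a linear relationship: the V-shaped data clearly depends on x, yet its correlation is essentially zero.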
Fig-12
According to Fig-12, none of the features in the data are correlated.
The central limit theorem plays a pivotal role in sampling and analysis: as the sample size increases, the distribution of sample means approaches a normal distribution. To check this, we randomly shuffled the daily number of tickets and drew samples of increasing size. As shown in Fig-13, the larger the samples, the closer the distribution of daily tickets came to normal, which is consistent with the central limit theorem.
Fig-13
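The same behaviour can be reproduced on synthetic data. The sketch below is not the procedure behind Fig-13. It assumes a right-skewed population of hypothetical daily ticket counts, draws repeated random samples with replacement, and tracks how the spread of the sample means shrinks as the sample size grows. The `sample_means` helper, the seed, and the population are all made up for illustration.

```python
import random
from statistics import fmean, stdev


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples (with replacement) and return their means."""
    return [
        fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population standing in for daily ticket counts.
population = [rng.expovariate(1 / 100) for _ in range(5000)]

spread = {}
for size in (5, 30, 200):
    means = sample_means(population, size, 500, rng)
    spread[size] = stdev(means)
    print(f"n={size:>3}  mean of means={fmean(means):7.1f}  spread={spread[size]:6.1f}")
```

The spread of the sample means should fall roughly in proportion to the square root of the sample size, and a histogram of `means` looks increasingly bell-shaped even though the population itself is skewed.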
Conclusion:
• The data helped us understand the audience better and gain valuable insights.
• We collected data from 2014 to 2015, with demographic information available only for members and not for short-term pass holders. In the future, we need to collect demographic information for short-term pass holders as well.
• Trip duration follows a definite seasonal pattern that repeats over time. Time series analysis can be used to study usage patterns and to time promotional campaigns.
• As for the promotions, the best station at which to kick off the campaign would be Pier 69/Alaskan Way & Clay St.
• Outliers were a tiny portion of the dataset. We removed them because they distorted summary statistics such as the mean.
• The central limit theorem was used to check the distribution of daily tickets across different stations.