This project analyzes how different customers use the service. There are two main user groups: members and casual users. We begin with some basic statistics (mean, median, standard deviation (SD), and the quartiles) to get an initial feel for the data.
The core of the project (in the next section below) is a presentation of all the process steps: the entire description of the work, tools, observations, and conclusions.
It runs from asking what the business need is and clarifying the right task to work on, through gathering and cleansing the data, to analyzing it.
Later on, graphs/plots will further aid understanding of how each group of users behaves differently in the bike rental process.
At the end, the entire code is attached, with notes, for all the sections: from downloading and cleansing data to merging and analyzing the data frames (DF).
To begin with, here are some basic stats resulting from the analysis below:
Standard deviation (SD) of daily trip counts: members 41,242; casual 100,272 (a lot more widespread).
Summary stats, members:
Min. 1st Qu. Median Mean 3rd Qu. Max.
415,866 457,865 470,277 480,501 520,274 521,086
Summary stats, casual:
Min. 1st Qu. Median Mean 3rd Qu. Max.
272,826 289,692 315,046 359,074 410,224 525,812
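These figures are the SD and five-number summaries of the daily trip counts per user type (the counts themselves appear in section 5 below). A minimal sketch of how such statistics can be derived, using those counts directly:

```r
# daily trip counts per user type (Sun..Sat), taken from section 5 of this report
member_daily <- c(415866, 470277, 521086, 520100, 520447, 464470, 451259)
casual_daily <- c(473975, 298604, 272826, 280780, 315046, 346472, 525812)

sd(member_daily)       # ~41,242: members' daily counts are tightly clustered
sd(casual_daily)       # ~100,272: casual counts are far more widespread
summary(member_daily)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(casual_daily)
```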
The entire project description of work, tools, observations, and conclusions is below.
The six stages/steps of data analysis are followed in the sections of the project:
1. Ask – define/clarify the right question to work on.
2. Prepare the needed data – collect the data and organize it in a useful form.
3. Process – cleanse the data.
4. Analyze – calculate the requested mean, median, and mode for trips.
5. Share, and
6. Act – the executive team must approve the marketing plan.
We need to deliver a report with a clear statement of the business task:
· What is the problem you are trying to solve?
· How can your insights (based on data) drive business decisions?
There are 3 questions
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
From the stakeholder: my first question and the task at hand, per the director of marketing, is what this project is based on:
How do annual members and casual riders use Cyclistic bikes differently? This is the main point of this DA project.
A description of all data sources used. Data was provided in downloadable ZIP files containing comma-separated values (CSV). Per instructions, only the past 12 months are used. R (in RStudio) has been used in the process of reviewing, cleansing, and analyzing. Nearly 6,000,000 rows and over 15 columns of combined data for the last 12 months is too large for a spreadsheet to be used as a tool.
· Location of data – downloaded on local HD
· Data credibility and integrity – data is collected by the business itself.
· How data is organized – one folder per month (total of 12 folders) downloaded
· In the process, the entire data set was concatenated (merged) into one file, a data frame (DF) – data_12_months
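The concatenation step can be sketched as follows; the folder name is a placeholder for the local download directory (the full paths appear in the code appendix):

```r
library(tidyverse)

# read every monthly CSV in the folder and stack the tables into one DF;
# this is valid because all 12 files share the same column names and order
data_12_months <- list.files(path = "12months", pattern = "\\.csv$",
                             full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
```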
Note:
variable, var. and column - may be used interchangeably below;
record and row - may be used interchangeably below
Documentation of any cleaning or manipulation of data
· All data cleansing and manipulation is conducted using R 4.2.0 in RStudio, on Windows 10
· Additional packages for R libraries installed and loaded
library(tidyverse)
library(dplyr)
library(waldo)
library(lubridate)
library(janitor)
library(vtree)
library(CGPfunctions)
· Verifying the number of columns (number of variables) per data-file/table
· Checking that all 12 CSV files are organized in the same way. For example:
1st variable is ride_id,
2nd var. is rideable_type,
3rd var. is started_at,
4th var is ended_at, and so on
· Concatenate all 12 files in one large DF - 5,901,463 - total rows/records
· Creating new variables:
new var. – duration of each trip (rental) in H M S format
new var. – duration of each trip (rental) in seconds
new var. – the duration in seconds, as a number (NOT date or period format) – ride_length_time_as_number (for easy calculations)
new var. – day of the week, with Sunday as the first day
· Checking for duplicates – no duplicates removed, as none were found
· Checking for NA values – no NA found
· Checking for NaN values – no NaN found
· Checking for NULL values – no NULL found.
· Counting negative-duration trips –
negative trips – 636 trips
· Counting trips of 30 s and below, and of 1 min and below –
trips ≤ 30 s – 68,700 (including negative-time trips)
trips ≤ 60 s – 105,613 (including negative-time and ≤ 30 s trips)
· Checking for "empty cells" in each record and in the new variables – none found
· Checking for NA, NaN, and NULL values in the 3 newly created variables – none found
· Finding outliers, and deciding how to process them (remove, fix if generated by error, …). No outliers were found in the data.
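The checks above can be sketched in a few lines, assuming the merged DF data_12_months with the variables named in the metadata section:

```r
library(dplyr)

# duplicates: if distinct() keeps every row, there were none
nrow(data_12_months) == nrow(distinct(data_12_months))

# missing values in a key column
sum(is.na(data_12_months$started_at))

# negative and very short trip durations
sum(data_12_months$ride_length_time_only_sec <= 0)   # negative or zero
sum(data_12_months$ride_length_time_only_sec <= 30)  # 30 s and under
sum(data_12_months$ride_length_time_only_sec <= 60)  # 1 min and under
```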
The analysis below will use the DF data_12_months.
1. Trips with negative or sub-10-second durations will be dropped.
For the purposes of these calculations, we assume that an SME was consulted on the removal of trips shorter than 10 seconds, including those with negative durations.
2. Mean and median calculation for the last 12 months
2.1. Mean of all trip times: 1193 seconds = 19.9 min
2.2. Mean after excluding trips of 30 sec and below: 1206.998 secs = 20 min
2.3. The means in 2.1 and 2.2 are both near 20 minutes
2.4. Median trip length: 660 sec = 11 min
3. Max and min value for trips
3.1. Maximum ride length is 2,497,750 secs = 41,629 min = 694 h = 28.9 days
3.2. 340 trips lasted from over 1 week up to 29 days
3.3. 4,966 trips lasted between 1 day (24 h) and the max (almost 29 days)
3.4. The max trip length of 28.9 days was not treated as an outlier, given 3.2 and 3.3
3.5. The minimum ride length is 10 sec (expected, as all trips under 10 sec were removed)
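The day conversion in 3.1 is plain unit arithmetic:

```r
max_time_sec  <- 2497750             # maximum ride length, in seconds
max_time_min  <- max_time_sec / 60   # ~41,629 min
max_time_h    <- max_time_min / 60   # ~694 h
max_time_days <- max_time_h / 24     # ~28.9 days
max_time_days
```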
4. Compare casual and member trips for the last 12 months
4.1. Counting values of members and casual
4.2. member count 3,363,505 – 57.2 %
4.3. casual count 2,513,515 – 42.8 %
4.4. Comparing the sum of members and casual to the total records returned TRUE
MEAN, MEDIAN, and MAX for members and casual (time of renting bikes)
4.5. mean time, casual – 1758.78 secs
4.6. mean time, member – 779.55 secs
4.7. median time, casual – 868 secs
4.8. median time, member – 544 secs
4.9. max time, casual – 2,497,750 secs
4.10. max time, member – 93,594 secs
4.11. Mean for casual and member per day of the week:
day_of_week casual (secs) member (secs)
Sun 2045 882
Mon 1790 757
Tue 1533 732
Wed 1505 733
Thu 1577 748
Fri 1650 760
Sat 1916 873
Observation:
Member users/renters, on average, have trips two times shorter or more, compared to casual users/renters.
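The per-weekday means above come from a grouped aggregate; here is a minimal sketch on a toy data frame (standing in for data_12_months, for illustration only):

```r
# toy data with the same column names used in the real DF
trips <- data.frame(
  member_casual             = c("casual", "member", "casual", "member"),
  day_of_week               = c("Sun", "Sun", "Mon", "Mon"),
  ride_length_time_only_sec = c(2100, 900, 1700, 760)
)

# mean trip duration per user type and weekday
aggregate(ride_length_time_only_sec ~ member_casual + day_of_week,
          data = trips, FUN = mean)
```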
5. Counting the number of members and casual-users for each day of the week
day_of_week casual member
Sun 473975 415866
Mon 298604 470277
Tue 272826 521086
Wed 280780 520100
Thu 315046 520447
Fri 346472 464470
Sat 525812 451259
Observation:
Members use bikes (rent them) more evenly across the days of the week, whereas casual users peak (rent more) on Saturday and Sunday.
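The weekday-by-user-type counts were produced with janitor's tabyl; a self-contained sketch on toy data (the real call runs on data_12_months in the code appendix):

```r
library(janitor)

trips <- data.frame(
  day_of_week   = c("Sun", "Sun", "Sun", "Sat", "Sat", "Mon"),
  member_casual = c("casual", "casual", "member", "casual", "member", "member")
)

# counts of rides per weekday, split by user type
tabyl(trips, day_of_week, member_casual)
```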
6. Counting the number of users for each day of the week, in percent
day_of_week casual member
Sun 18.9% 12.4%
Mon 11.9% 14.0%
Tue 10.9% 15.5%
Wed 11.2% 15.5%
Thu 12.5% 15.5%
Fri 13.8% 13.8%
Sat 20.9% 13.4%
Observation:
Members use bikes (rent them) more consistently across the week, with marginally lower use on Sunday. Casual use is extremely high on Saturday and Sunday compared to weekdays.
7. Percentage of users per day, split between member and casual
– each day of the week is 100%
day_of_week casual member
Sun 53.3% 46.7%
Mon 38.8% 61.2%
Tue 34.4% 65.6%
Wed 35.1% 64.9%
Thu 37.7% 62.3%
Fri 42.7% 57.3%
Sat 53.8% 46.2%
Observation:
The percentages just confirm an earlier remark:
members use bikes (rent them) more Mon–Thu, while casual users prevail on Sat and Sun.
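Both percentage views (section 6, column-wise; section 7, row-wise) come from the same tabyl, differing only in the adorn_percentages direction. A sketch on toy data:

```r
library(dplyr)
library(janitor)

trips <- data.frame(
  day_of_week   = c("Sun", "Sun", "Sun", "Mon", "Mon", "Mon"),
  member_casual = c("casual", "casual", "member", "casual", "member", "member")
)

# section 6 style: each user-type column sums to 100%
tabyl(trips, day_of_week, member_casual) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1)

# section 7 style: each weekday row sums to 100%
tabyl(trips, day_of_week, member_casual) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1)
```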
viz. 8, 9, 10, 11, and 12
8. Comparing the percentage of trips per day of the week for the last 12 months
8.1. Sun Mon Tue Wed Thu Fri Sat
8.2. 15.1 13.1 13.5 13.6 14.2 13.8 16.6
9. Bike rental each weekday. Compare member to casual
10. Plotting casual and members use per weekday
11. Plot - each_week-day-to-member_casual
12. Rplot03-member_casual-on_top_perWeek
13. Observations, for the stakeholder, the director of marketing:
Member users/renters, on average, have trips two times shorter or more, compared to casual users/renters
(section above, 4.11 – mean for casual and member per day of the week)
Members use bikes (rent them) more evenly across the days of the week, whereas casual users peak on Saturday and Sunday. This is demonstrated in the SD stats at the top of this page.
(section above, 6. – counting the number of users for each day of the week, in percent)
Members use bikes (rent them) more consistently throughout the week, with lower use on Sunday. Casual use is extremely high on Saturday and Sunday compared to weekdays.
(section above, 6. – counting the number of users for each day of the week, in percent)
The percentages just confirm an earlier remark: members use bikes more Mon–Thu, while casual users prevail on Sat and Sun.
(section above, 7. – percentage of users per day, split between member and casual;
each day of the week is 100%)
14. Top three recommendations based on analysis and observations
Promotion packages, for example targeting the weekends plus weekdays as a combined package for casual users.
This may encourage more casual users to use bikes on weekdays too.
Making weekends more attractive for new casual users (adding benefits, extra services, discounts after X amount of use, and so on) could bring in fresh casual users that may be possible to convert to members.
It may not be so disadvantageous to have one group using more bikes on weekends while the other picks up on weekdays. Say the number of members converted from casual users increases by X; this number X may create a bike shortage on weekdays. Of course, further research is needed.
• Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Additional information
Challenges discovered in the process
Some files are too large to be opened in MS Excel. RStudio did most of the "heavy lifting"; MS Excel was mostly used to review tables, when possible, at the beginning.
There were discrepancies and inconsistency in:
· file names containing the tables
· structure of the data provided – some data was in months, other data in quarters, and some in two-quarter batches
· column names (variable names) in the data provided
· arrangement of the columns/variables. For example, in one table the 3rd column holds a start-date variable, while in another table the 3rd column contains a station_id variable
· Total record counts after combining all tables (all years, quarters, months) are very inconsistent. However, only the last 12 months are used in this task.
Metadata for bike trips and docking stations
Column Names
“ride_id"
"rideable_type"
"started_at"
"ended_at"
"start_station_name"
"start_station_id"
"end_station_name"
"end_station_id"
"start_lat"
"start_lng"
"end_lat"
"end_lng"
"member_casual"
"ride_length_t"
"day_of_week"
"ride_length_time_only_sec"
"ride_length_time_as_number"
This plot is similar to the one above (viz. 8); nevertheless, it is represented in numbers of users instead of percent.
For example: on Sundays, over the last 12 months, casual users rented 473,975 times, while members used the service 415,866 times.
The graph below compares, in percent, how casual users measure up to members on each day of the week.
There are a total of 5,877,020 rentals in the last 12 months. The "n=" at the bottom displays how many users there are on each day of the week.
We can observe that members use bike rentals more on workdays, whereas casual users/renters rent more on weekends.
#individual .csv files test and review
# counting number of columns,
# checking if columns are organized in the same order
# to be able to concat data/tables in one large file later
library(tidyverse) # loading package
library(dplyr) # loading package
# setting working directory with past 12 months of data in
setwd('C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare/12months')
getwd() # confirm working directory settings
# creating individual DF from all 12 CSV files
# double check all variable names and organization in individual tables
# 2021 August
bike_trips_2021_08 <- read.csv('202108-divvy-tripdata.csv')
glimpse(bike_trips_2021_08)
# 2021 September
bike_trips_2021_09 <- read.csv('202109-divvy-tripdata.csv')
glimpse(bike_trips_2021_09)
# 2021 October
bike_trips_2021_10 <- read.csv('202110-divvy-tripdata.csv')
glimpse(bike_trips_2021_10)
# 2021 November
bike_trips_2021_11 <- read.csv('202111-divvy-tripdata.csv')
glimpse(bike_trips_2021_11)
# 2021 December
bike_trips_2021_12 <- read.csv('202112-divvy-tripdata.csv')
glimpse(bike_trips_2021_12)
# test on merging tables, on 5 tables only
test_data_tables05 <- list.files(path = 'C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare/12months') %>%
lapply(read_csv) %>%
bind_rows
glimpse(test_data_tables05)
# 2022 January
bike_trips_2022_01 <- read.csv('202201-divvy-tripdata.csv')
glimpse(bike_trips_2022_01)
# 2022 February
bike_trips_2022_02 <- read.csv('202202-divvy-tripdata.csv')
glimpse(bike_trips_2022_02)
# 2022 March
bike_trips_2022_03 <- read.csv('202203-divvy-tripdata.csv')
glimpse(bike_trips_2022_03)
# 2022 April
bike_trips_2022_04 <- read.csv('202204-divvy-tripdata.csv')
glimpse(bike_trips_2022_04)
# 2022 May
bike_trips_2022_05 <- read.csv('202205-divvy-tripdata.csv')
glimpse(bike_trips_2022_05)
# 2022 June
bike_trips_2022_06 <- read.csv('202206-divvy-tripdata.csv')
glimpse(bike_trips_2022_06)
# 2022 July
bike_trips_2022_07 <- read.csv('202207-divvy-tripdata.csv')
glimpse(bike_trips_2022_07)
# extract variable names (col names) of 1st DF
colnames(bike_trips_2021_08)
#all tables should have variables/columns in the order below
# use this code example to convert variable names
library(dplyr)
#colnames(DF_here_that_needs_new_var_names) <- c("ride_id", "rideable_type", "started_at",
#                       "ended_at", "start_station_name", "start_station_id",
#                       "end_station_name", "end_station_id",
#                       "start_lat", "start_lng", "end_lat", "end_lng",
#                       "member_casual")
##################################################
########## compare column names #############
##################################################
install.packages("waldo")
library(waldo)
#compare the variable(column) names against other tables
compare(colnames(bike_trips_2021_08), colnames(bike_trips_2021_09))
compare(colnames(bike_trips_2021_09), colnames(bike_trips_2021_10))
compare(colnames(bike_trips_2021_09), colnames(bike_trips_2021_11))
compare(colnames(bike_trips_2021_09), colnames(bike_trips_2021_12))
compare(colnames(bike_trips_2022_01), colnames(bike_trips_2021_12))
compare(colnames(bike_trips_2022_02), colnames(bike_trips_2021_12))
compare(colnames(bike_trips_2022_02), colnames(bike_trips_2022_03))
compare(colnames(bike_trips_2022_04), colnames(bike_trips_2022_03))
compare(colnames(bike_trips_2022_05), colnames(bike_trips_2022_03))
compare(colnames(bike_trips_2022_06), colnames(bike_trips_2022_03))
compare(colnames(bike_trips_2022_07), colnames(bike_trips_2022_03))
compare(colnames(bike_trips_2021_09), colnames(bike_trips_2022_07))
# there is no difference in col/var names => all 12 files have the same
# name and order of arrangement
# merging all tables including all past 12 months
data_12_months <- list.files(path = 'C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare/12months') %>%
lapply(read_csv) %>%
bind_rows
# rm(bike_trips_2022_07) # removing individual months from environment, clearing RAM
#the above is the last of 12 months used in DF
# preview last 12 months of data,in one DF
glimpse(data_12_months)
str(data_12_months)
View(data_12_months)
# last 12 months combined data has
# Rows: 5,901,463
# Variable/Columns: 13
# checking for potential duplicate rows in R
nrow(distinct(data_12_months)) # row count after dropping duplicates
nrow(data_12_months) # the same row count as before removing duplicates
#- 5,901,463 => no duplicate rows/records have been removed
# setting and checking working directory
setwd('C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare/12months')
getwd()
library(lubridate) # install package
# creating a new variable/column "ride_length_t":
# the duration of time the bike was used, in H M S format
data_12_months$ride_length_t <- as.period((data_12_months$ended_at - data_12_months$started_at), format("%H:%M:%S"))
# creating a new variable/column "ride_length_time_only_sec"
# - time of trip variable in seconds to calculate easy mean of trip time
data_12_months$ride_length_time_only_sec <- (data_12_months$ended_at - data_12_months$started_at)
# creating a new variable/column, from duration in seconds, as number (NOT date or period format)
data_12_months$ride_length_time_as_number <- as.numeric(data_12_months$ride_length_time_only_sec)
# creating a new variable/column "day_of_week" and preview the new variable
data_12_months$day_of_week <- wday(data_12_months$started_at, label=TRUE)
#preview variable and DF
colnames(data_12_months)
glimpse(data_12_months)
View(data_12_months)
# there are total of 16 col/variable including 3 new variables created
# counting trips with negative duration, trips <= 30 s, and trips <= 1 min
print(sum(data_12_months$ride_length_time_only_sec <= 0))
print(sum(data_12_months$ride_length_time_only_sec <= 30))
print(sum(data_12_months$ride_length_time_only_sec <= 60))
# results are:
#trips with negative time - 636;
#trips <= 30 s - 68,770;
#trips <= 1 min - 105,613
##################################################
########## analysis below #############
##################################################
# there are 5,901,463 - total rows/records
# checking if there are any "empty cells", looking for
# NA , NaN , NULL values in 3 variables
sum(is.na(data_12_months$started_at))
data_12_months %>% count(is.na(started_at)) # no NA found
data_12_months %>% count(is.nan(started_at)) # no NaN found
data_12_months %>% count(is.null(started_at))# no NULL found
sum(is.na(data_12_months$ended_at))
data_12_months %>% count(is.na(ended_at)) # no NA found
data_12_months %>% count(is.nan(ended_at)) # no NaN found
data_12_months %>% count(is.null(ended_at)) # no NULL found
sum(is.na(data_12_months$ride_length_t))
data_12_months %>% count(is.na(ride_length_t)) # no NA found
data_12_months %>% count(is.nan(ride_length_t)) # no NaN found
data_12_months %>% count(is.null(ride_length_t)) # no NULL found
# ---------- calculating the mean of bike-use time ---------
# mean calc did not work, because ride_length_t is in H M S (period) format
#mean(data_12_months$ride_length_t)
#mean(as.period(data_12_months$ride_length_t))
# the mean of time shows as 29.35581 - wrong
# test run:
# removing rows with extreme negative durations (below -4000 sec);
# should remove only 8 records
data_12_months_exlude_30sec_test <- data_12_months[!(data_12_months$ride_length_time_only_sec < -4000), ]
glimpse(data_12_months_exlude_30sec_test)
# test run
#removing trips with <= 30 sec
data_12_months_exlude_30sec <- data_12_months[!(data_12_months$ride_length_time_only_sec < 30), ]
glimpse(data_12_months_exlude_30sec)
View(data_12_months_exlude_30sec)
#compare row numbers - entire DF vs trips of 30 sec and below excluded
nrow(data_12_months) # 5,901,463
nrow(data_12_months_exlude_30sec) # 5,834,280 - this is 67,183 less records
#calculating the mean of bike-use time
mean(data_12_months$ride_length_time_only_sec) #the mean in seconds 1193.376 secs = 19.8896 min.
mean(data_12_months_exlude_30sec$ride_length_time_only_sec) #the mean in seconds 1206.998 secs = 20.116633 min.
# there is a very small difference in the mean before and after removing the sub-30 s trips
###############################################################################
########## analysis below will be with DF data_12_months #############
####### negative and below 10sec. time/length will be dropped ##########
###############################################################################
# removing trips with negative or under-10-second durations - approved by stakeholder
data_12_months <- data_12_months[!(data_12_months$ride_length_time_only_sec < 10), ]
#Calculate the max ride_length
max(data_12_months$ride_length_time_as_number) # maximum ride length is 2497750 secs = 41629 min
mean(data_12_months$ride_length_time_only_sec) # 1198 secs = 19.97 min ~ 20 min.
mean(data_12_months$ride_length_time_as_number) # 1198 secs = 19.97 min ~ 20 min.
# both mean values above are the same => no discrepancies between value-as-number and duration-in-seconds
#median trip length
median(data_12_months$ride_length_time_only_sec) #median trip length 660 sec = 11 min
# max and min trip duration
max_time_sec <- max(data_12_months$ride_length_time_as_number, na.rm = TRUE)
print(max_time_sec) # # maximum ride length is 2497750 sec = 41629 min = 694 h = 28.9 days
max_time_days <- ((((max_time_sec)/60)/60)/24)
max_time_days #= 28.9 days is the maximum time of bike used
min_time_sec <- min(data_12_months$ride_length_time_only_sec)
min_time_sec # minimum ride length is 10 sec (expected as all trips less than 10 sec were removed)
#
########## checking for outliers #############
#
# checking the top 10 longest trips,
# as the max value above returned almost 29 days
head(sort(data_12_months$ride_length_time_as_number, decreasing = TRUE), n = 10)
# several of the top trips are in the range of 20 days or more
# counting trips over 24 hours (86400 sec)
# counting trips over 168 hours = 7 days (604800 sec)
data_12_m_over_24_h <- data_12_months[(data_12_months$ride_length_time_only_sec >= 86400), ]
data_12_m_over_1week <- data_12_months[data_12_months$ride_length_time_only_sec >= 604800, ]
View(data_12_m_over_24_h)
nrow(data_12_months)
nrow(data_12_m_over_24_h) # there are 4966 trips lasted between 1 day(24h) to max (29 days)
nrow(data_12_m_over_1week) # there are 340 trips lasted between 1 week to max (29 days)
########### #############
###########  the max trip length of 29 days was not an outlier  #############
########### #############
aggregate(data_12_months$ride_length_time_only_sec ~ data_12_months$member_casual, FUN = mean )
# mean for casual - 1758.8 sec
# mean for members - 779.6 sec
aggregate(data_12_months$ride_length_time_only_sec ~ data_12_months$member_casual, FUN = median )
# median for casual - 868 secs
# median for members - 544 secs
aggregate(data_12_months$ride_length_time_only_sec ~ data_12_months$member_casual, FUN = max )
# max for casual - 2497750 secs
# max for member - 93594 secs
aggregate(data_12_months$ride_length_time_only_sec ~ data_12_months$member_casual, FUN = min )
# min for both is 10 sec
# mean of casual and member per week day
aggregate(data_12_months$ride_length_time_only_sec ~
data_12_months$member_casual +
data_12_months$day_of_week, FUN = mean)
# casual Sun 2045.0737 secs
# member Sun 882.2334 secs
# casual Mon 1789.5393 secs
# member Mon 757.4119 secs
# casual Tue 1532.6981 secs
# member Tue 731.9624 secs
# casual Wed 1505.4292 secs
# member Wed 733.8939 secs
# casual Thu 1577.0902 secs
# member Thu 747.5276 secs
# casual Fri 1649.9203 secs
# member Fri 759.9214 secs
# casual Sat 1916.4409 secs
# member Sat 872.7123 secs
# grouping by members/casual and week days
xtabs(~ day_of_week + member_casual, data = data_12_months)
install.packages("janitor")
install.packages("vtree")
install.packages("CGPfunctions")
install.packages("haven")
library(janitor)
library(vtree)
library(CGPfunctions)
library(haven)
head(data_12_months)
skimr::skim(data_12_months)
# day of the week breakdown for members/casual, using janitors - tabyl
tabyl(data_12_months, day_of_week, member_casual)
# getting basic statistical info for members and casual users
# stats members
data_12m_stats <- tabyl(data_12_months, day_of_week, member_casual)
sd(data_12m_stats$member, na.rm = TRUE)
summary(data_12m_stats$member, na.rm = TRUE)
#stats casual
sd(data_12m_stats$casual, na.rm = TRUE)
summary(data_12m_stats$casual, na.rm = TRUE)
# day of the week and members/casual in percentage
tabyl(data_12_months, day_of_week, member_casual) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 1)
# percent of member/casual each week day - a week day is 100%
tabyl(data_12_months, day_of_week, member_casual) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1)
# visualize plot members/casual for each day of the week
CGPfunctions::PlotXTabs(data_12_months, day_of_week, member_casual)
# visualize plot for each week-day to members/casual
CGPfunctions::PlotXTabs(data_12_months, member_casual, day_of_week)
# plot bars on top of each one (member/casual users)
CGPfunctions::PlotXTabs2(data_12_months, day_of_week, member_casual)
# if needed, remove the info (stats) on top of the graph by adding results.subtitle = FALSE
#CGPfunctions::PlotXTabs2(data_12_months, day_of_week, member_casual, results.subtitle = FALSE)
#storing the current work - new concat data file
# storing as R binary file
setwd('C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare')
getwd() # confirm working directory settings
save(data_12_months, file = "bike_data_last_12_months.RData")
# storing as text ".CSV" file
write.csv(data_12_months, file = "bike_data_last_12_months.csv")
setwd('C:/Users/PD truck/Documents/DA DS BA/Gogle DA Certificate - Coursera/capstone proj 1 - bikeShare/12months')
getwd() # confirm working directory settings
colnames(data_12_months)