The analysis methodology is as follows:
Data Quality: Data is gathered from Cyclistic's own database.
library(dplyr)
library(readr)
#importing files
jan <- read_csv("E:/Downloads/DwnlData/202201-divvy-tripdata.csv")
feb <- read_csv("E:/Downloads/DwnlData/202202-divvy-tripdata.csv")
# ...and so on for the remaining months of 2022 (mar through dec)
# merging the monthly data frames into one
all_ride_data <- rbind(jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)
# cleaning: drop every row that has a missing value in any column
all_ride_data <- na.omit(all_ride_data)
# checking data columns
colnames(all_ride_data)
The R code above imports, merges, and cleans the monthly CSV files of Divvy bike trip data for 2022. The steps are:
Importing Files: read_csv() from the readr package imports one CSV file per month of 2022, each holding that month's Divvy trip records.
Merging Data & Cleaning: rbind() stacks the monthly datasets (jan, feb, ..., dec) row-wise into a single dataset, all_ride_data, covering the full year. na.omit() then removes every row that has a missing value in any column, so only complete records remain.
Checking Data Columns: colnames() lists the column names of all_ride_data so they can be compared against the expected columns before analysis.
Together, these steps consolidate the monthly files into one dataset, drop incomplete rows, and confirm the column names, leaving the data consistent and ready for exploration.
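As a rough illustration of the import and cleaning steps, the sketch below reads every monthly file in a folder in one pass instead of one read_csv() call per month, checks the column names, and counts missing values before dropping rows. The folder, the two tiny demo files, and the names data_dir, expected_cols, and all_rides_demo are assumptions made up for this example; they are not the project's real data.

```r
library(readr)
library(dplyr)

# Hypothetical demo folder with two tiny files standing in for the monthly downloads
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("a1", "a2"),
                     start_station_name = c("Clark St", NA),
                     member_casual = c("member", "casual")),
          file.path(data_dir, "202201-divvy-tripdata.csv"))
write_csv(data.frame(ride_id = c("b1", "b2", "b3"),
                     start_station_name = c("State St", "Clark St", NA),
                     member_casual = c("casual", "member", "member")),
          file.path(data_dir, "202202-divvy-tripdata.csv"))

# Read every file whose name matches the monthly pattern, then stack the rows
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read_csv, show_col_types = FALSE)
all_rides_demo <- bind_rows(monthly)

# Compare the actual column names against the ones the analysis expects
expected_cols <- c("ride_id", "start_station_name", "member_casual")
missing_cols <- setdiff(expected_cols, colnames(all_rides_demo))

# Count missing values per column, then drop incomplete rows
na_counts <- colSums(is.na(all_rides_demo))
rows_before <- nrow(all_rides_demo)
all_rides_demo <- na.omit(all_rides_demo)
rows_dropped <- rows_before - nrow(all_rides_demo)
```

Counting the missing values per column first shows which columns account for the rows that na.omit() removes.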
**All codes mentioned below left image**
The R code provided performs several data processing steps to prepare the dataset for analysis:
Calculating Ride Length: The ride_length variable is the difference between the ended_at and started_at columns, that is, the duration of each bike ride in minutes. It is computed with difftime() using units = "mins".
Rounding Ride Length: The ride_length variable is rounded to two decimal places with round() to keep the values readable.
Extracting Day of Week: The day_of_week variable is derived from the started_at column with wday(), using label = TRUE and abbr = FALSE to return the full day name. The result is stored as a factor variable with levels ordered from Sunday to Saturday.
Extracting Weekday Number: The weekday variable is extracted with wday() using label = FALSE, representing the day of the week as a number from 1 (Sunday) to 7 (Saturday).
Ordering Data by Day of Week: The dataset is sorted in ascending order of the day_of_week variable using the arrange() function.
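The processing steps listed above can be tried on a few hand-made timestamps. This is only a sketch: rides_demo and its three rows are invented for illustration, and it assumes lubridate's default week start (Sunday = 1).

```r
library(lubridate)

# Three made-up rides; 2022-01-02 was a Sunday
rides_demo <- data.frame(
  started_at = ymd_hms(c("2022-01-02 08:00:00", "2022-01-03 17:30:00", "2022-01-08 12:15:30")),
  ended_at   = ymd_hms(c("2022-01-02 08:12:30", "2022-01-03 18:00:00", "2022-01-08 12:45:30"))
)

# Ride length in minutes, stored as a plain number and rounded to two decimals
rides_demo$ride_length <- round(
  as.numeric(difftime(rides_demo$ended_at, rides_demo$started_at, units = "mins")),
  digits = 2
)

# Weekday as a number (1 = Sunday ... 7 = Saturday) and as a full-name factor
rides_demo$weekday <- wday(rides_demo$started_at, label = FALSE)
rides_demo$day_of_week <- wday(rides_demo$started_at, label = TRUE, abbr = FALSE)

# Sort rows by the weekday factor
rides_demo <- rides_demo[order(rides_demo$day_of_week), ]
```

The day names returned by wday() follow the session's locale, so on a non-English system the labels will not be "Sunday", "Monday", and so on.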
Following the data processing steps, aggregate functions are applied to compare ride lengths between member types (casual users and annual members). These aggregate functions compute the mean, median, maximum, and minimum ride lengths for each member type, providing insights into differences in riding behaviors between casual users and annual members.
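For comparison, the four statistics can also be collected into a single table per rider type. The sketch below uses dplyr's group_by() and summarise() on a small invented data frame (rides_demo and its values are assumptions, not project data); it is an alternative presentation, not the code used in the project.

```r
library(dplyr)

# Made-up ride lengths in minutes for the two rider types
rides_demo <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(10, 25, 40, 5, 8, 14)
)

# One row per rider type with all four statistics side by side
ride_summary <- rides_demo %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(ride_length),
    median_length = median(ride_length),
    max_length    = max(ride_length),
    min_length    = min(ride_length),
    .groups = "drop"
  )
```

Having the mean and median in adjacent columns makes it easier to spot skew, for example when a few very long rides pull the mean well above the median.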
**All codes mentioned below right image**
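# install once if needed; these calls can be skipped on later runs
install.packages('lubridate')
library(lubridate)
install.packages('hms')
# processing for analysis
# ride length in minutes, rounded to two decimals; as.numeric() stores plain numbers rather than a difftime
all_ride_data$ride_length <- round(as.numeric(difftime(all_ride_data$ended_at, all_ride_data$started_at, units = "mins")), digits = 2)
# weekday as a number: 1 (Sunday) to 7 (Saturday)
all_ride_data$weekday <- wday(all_ride_data$started_at, label = FALSE)
# weekday as a full name
all_ride_data$day_of_week <- wday(all_ride_data$started_at, label = TRUE, abbr = FALSE)
# arrange in order: fix the level order, then sort the rows
all_ride_data$day_of_week <- factor(all_ride_data$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
all_ride_data <- arrange(all_ride_data, day_of_week)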
# compare members and casual users
aggregate(ride_length ~ member_casual, data = all_ride_data, FUN = mean)
aggregate(ride_length ~ member_casual, data = all_ride_data, FUN = median)
aggregate(ride_length ~ member_casual, data = all_ride_data, FUN = max)
aggregate(ride_length ~ member_casual, data = all_ride_data, FUN = min)
#descriptive
str(all_ride_data)
glimpse(all_ride_data)
summary(all_ride_data)
head(all_ride_data)
# visualization
library(ggplot2)
# number of rides by rider type and day of week
all_ride_data %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
# average ride duration by rider type and day of week
all_ride_data %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
The provided R code conducts descriptive analysis and visualization of the Divvy bike trip dataset. Here's a brief description of each section:
Descriptive Analysis:
str(all_ride_data): Displays the structure of the all_ride_data dataset, providing information about its variables and data types.
glimpse(all_ride_data): Offers a compact, transposed overview of the all_ride_data dataset, listing each column with its data type and its first few values.
summary(all_ride_data): Presents a summary of every column in the all_ride_data dataset. Numeric columns show the minimum, quartiles, median, mean, and maximum, while character columns show only their length and type.
head(all_ride_data): Shows the first few rows of the all_ride_data dataset, offering a glimpse into its contents.
Visualization:
Number of Rides by Rider Type: Utilizes the ggplot2 package to create a bar plot depicting the number of rides by rider type (casual vs. member) across different days of the week. Each bar represents the count of rides, with colors distinguishing between rider types.
Average Duration of Rides: Constructs another bar plot using ggplot2 to illustrate the average duration of rides by rider type across different days of the week. The height of each bar represents the mean ride duration, with colors indicating rider types.
These visualizations offer insights into ride distribution patterns and average ride durations for different rider types over the course of a week. They provide a clear and intuitive way to understand how ride behavior varies between casual riders and members of the bike-sharing service.
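A minimal sketch of the same weekday comparison on invented data is shown below, with axis and legend labels added through labs(). The data frame rides_demo, its values, and the label text are assumptions chosen for the example.

```r
library(dplyr)
library(ggplot2)

# Made-up rides covering two days and both rider types
rides_demo <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  day_of_week   = factor(c("Sunday", "Sunday", "Monday", "Sunday", "Monday", "Monday"),
                         levels = c("Sunday", "Monday")),
  ride_length   = c(30, 20, 15, 10, 12, 8)
)

# Ride count and mean duration per rider type and day
weekly_summary <- rides_demo %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length),
            .groups = "drop")

# Side-by-side bars of mean duration, one color per rider type
duration_plot <- ggplot(weekly_summary,
                        aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day of week", y = "Average ride length (minutes)", fill = "Rider type")
```

Printing duration_plot draws the chart; swapping average_duration for number_of_rides in aes() gives the ride-count version.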
Lack of Transaction Data: The absence of detailed customer transaction data due to privacy regulations posed a challenge in understanding the correlation between pricing strategies and ride frequency for both member and casual riders. This limitation restricted the depth of insights into consumer behavior and preferences.
Data Inconsistencies: Inconsistencies in station name addresses and discrepancies in latitude and longitude data complicated accurate calculations and visualizations. This inconsistency hindered the exploration of rider behavior based on trip locations, making it challenging to draw meaningful conclusions about geographic patterns in ride usage.
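One way such coordinate discrepancies might be surfaced is to count how many distinct coordinate pairs are recorded for each station name. The sketch below is an illustration only: the column names start_station_name, start_lat, and start_lng are assumed here rather than taken from the code in this write-up, the rows are invented, and rounding to three decimals is an arbitrary tolerance.

```r
library(dplyr)

# Made-up station records; "Clark St" appears with two different coordinate pairs
stations_demo <- data.frame(
  start_station_name = c("Clark St", "Clark St", "Clark St", "State St", "State St"),
  start_lat = c(41.9000, 41.9000, 41.9031, 41.8800, 41.8800),
  start_lng = c(-87.6300, -87.6300, -87.6312, -87.6200, -87.6200)
)

# Flag station names that map to more than one rounded coordinate pair
coord_check <- stations_demo %>%
  group_by(start_station_name) %>%
  summarise(distinct_coords = n_distinct(round(start_lat, 3), round(start_lng, 3)),
            rides = n(),
            .groups = "drop") %>%
  filter(distinct_coords > 1)
```

Stations flagged this way could then be reviewed by hand before any location-based analysis.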
For the detailed documentation and interactive dashboard related to this project, please see the Summary doc and Power BI report.
In conclusion, while the analysis provided valuable insights into rider behavior and preferences, there are areas for further enhancement. Cyclistic can improve its data analysis by incorporating additional dimensions such as nominal data (e.g., customer gender) and text data from reviews and feedback. This enriched dataset would offer a more comprehensive understanding of customer behavior, enabling targeted marketing strategies that consider demographic specifics and qualitative insights. By integrating numerical metrics with nuanced qualitative data, Cyclistic can refine its marketing approach to enhance user engagement and satisfaction, ultimately driving business growth and success in the competitive bike-sharing market.