Data Cleaning and Transformation – Step by Step
1.Data Download and Preparation
Downloaded Divvy’s 2023 CSV files.
Extracted the compressed CSV files.
Uploaded the CSV files to Google Sheets.
2.Column Elimination
Removed irrelevant columns to streamline the dataset:
start_station_id
end_station_id
start_station_name
end_station_name
start_lat
start_lng
end_lat
end_lng
Note: The original files were retained with all columns for reference.
3.New Columns Added
ride_length: Calculated as ended_at - started_at (in HH:MM:SS format)
day_of_week: Derived using the formula =WEEKDAY(started_at), where:
1 = Sunday
2 = Monday
3 = Tuesday
4 = Wednesday
5 = Thursday
6 = Friday
7 = Saturday
4.In Google Sheets
Verify No Empty Cells: Ensured there are no empty cells using the Filter method
Calculate Metrics:
Average Ride Length: =AVERAGE(range)
Maximum Ride Length: =MAX(range)
Most Frequent Weekday per Month: =MODE(range)
Create Pivot Tables:
a. Average Ride Length for Members and Casual Users
Rows: member_casual
Values: Average of ride_length
Purpose: To identify usage patterns between different user types
b. Average Ride Length by Day of Week
Columns: day_of_week
Rows: member_casual
Values: Average of ride_length
Purpose: To compare ride durations across different days for each user type
c. Number of Trips by Day of Week
Columns: day_of_week
Values: Count of trip_id
Purpose: To analyse trip frequency across the week
5.In BigQuery Using SQL
Determine the Most Frequent Day of the Week per Month
Table: A summary table identifying the most frequent day of the week for each month
Columns: month, day_name, occurrences
Purpose: To analyse user behaviour by identifying peak activity days
Analyse Bike Type Usage by User Type
Table: Distribution of bike types (classic, electric) by user type (member, casual) each month
Columns: month_number, month, rideable_type, customer_type, total_trips
Purpose: To understand bike preferences and inform inventory adjustments or promotions
Calculate Average Ride Length per Month and User Type
Table: Average ride length by customer type (member, casual) and month
Columns: month, customer_type, avg_ride_length
Purpose: To compare ride behaviour and identify seasonal patterns in usage
Total Trips Per Weekday Across Months
Table: Summarizes the total number of trips for each day of the week across months
Columns: month, day_name, total_trips
Purpose: To identify trends in usage and peak days for all users combined