The Case Study Roadmap recommends Excel for this stage, and that is the tool I initially considered too, until I ran into the following error and was forced to use a different tool.
The data comes as .csv (comma-separated values) files, which are essentially just simple text files with no data formatting of any kind. However, when you open such a file in Excel, it automatically applies its routine data formatting, and this can cause issues.
Each ride_id is an automatically generated string: a random combination of numbers and letters with no inherent meaning.
Consider the following two ride_ids from the June 2022 data set:
56302840944300E1
0562795493516E03
To the naked eye, these look like two unique and separate ride_ids; however, due to default Excel behaviour, these strings are unfortunately interpreted as numbers[1], which Excel then converts into scientific notation[2], and thus we end up with two rows bearing the same ride_id of 5.63E+14.
(The same happens in Google Sheets.)
This means that when searching for duplicate ride_ids, we will get false positives; and if not accounted for, we could end up accidentally removing valid data.
Therefore, the better tool is one that processes the raw .csv files directly from the source without the automated formatting Excel is wont to apply, and without having to resort to workarounds and hacks.
The two other tools we have learnt in the Google Data Analytics certification are SQL and R. I chose SQL for the following reasons:
The dataset is heavy (each file is hundreds of MBs, with hundreds of thousands of rows, one per ride) and my computer was struggling to process it; by uploading the files to Google BigQuery[3] I could run multiple queries without having to worry about processing time.
I don’t expect to perform any complex mathematical analyses of the kind that are R’s forte; the data is relatively simple in nature, it’s just large.
Since the data is most likely a set of monthly extracts from the company database, it is already formatted in a manner usable for SQL queries.
The data comes as 12 csv files whose names followed the same pattern, except for one month where the word public was added for no clear reason, only to be gone again the next month.
In any case, all 12 files were renamed to fit the following naming format: YYYY_MM_divvy_tripdata.csv. A standardised naming format makes the files easy to recognise when handling data, and brings a further benefit, which we shall see below, that saves us a step.
All 12 files (2022_05 to 2023_04) were uploaded to Google Cloud Storage[4], which allows easy access when creating tables in the BigQuery dataset, which I decided to name divvy in honour of the original data source.
I created a staging table called “div” and uploaded all 12 files into it. One nifty benefit of using a Google Cloud Storage bucket is that instead of being forced to import all 12 files as individual tables and then combine them with UNION queries in SQL… you can just use a wildcard operator!
When creating a table, as long as you have set up your data bucket appropriately and renamed the files in a usable pattern, BigQuery can pick up all 12 files in one go and create a single table with the combined data of all 12 files.
So I asked it to make a table out of “*_divvy_tripdata.csv” from my data bucket, and it created one simple table (divvy.div) that I can now use for data cleaning and transformation.
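For reference, the same wildcard import can also be expressed as a SQL LOAD DATA statement (a minimal sketch, assuming a hypothetical bucket name my_divvy_bucket; I used the BigQuery console UI instead):
SQL code
LOAD DATA INTO
divvy.div
FROM FILES (
format = 'CSV',
-- the wildcard picks up all 12 renamed monthly files at once;
-- my_divvy_bucket is a placeholder for your own bucket name
uris = ['gs://my_divvy_bucket/*_divvy_tripdata.csv'],
skip_leading_rows = 1 );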
The table has the following schema when imported:
ride_id STRING
rideable_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_name STRING
start_station_id STRING
end_station_name STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
member_casual STRING
We will be adding a few more columns for our data analysis, but we will get to that later on.
Let’s run some code to find out how many rows we have.
SQL code
SELECT
COUNT(*)
FROM
divvy.div ;
Result
5,859,061
5.8 million rows! That’s a lot of data, but there will be a lot of errors as well, so it will be interesting to see how much the data shrinks after cleaning.
And check how many days’ worth of data we have:
SQL code
SELECT
COUNT(DISTINCT DATE(started_at)) AS start_days,
COUNT(DISTINCT DATE(ended_at)) AS end_days
FROM
divvy.div ;
Result
start_days end_days
365 368
We have 365 distinct start dates (so a full year’s worth of data), but 368 distinct end dates, more than a year’s worth? Clearly some rides end days after the window closes, meaning someone is keeping their bike for more than a day, which we will deal with later.
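To see how far past the window those rides run, one could list the end dates beyond the last day of our range (a minimal sketch, assuming 2023-04-30 is the final day of the 12-month window):
SQL code
SELECT
DATE(ended_at) AS end_date,
COUNT(*) AS rides
FROM
divvy.div
WHERE
DATE(ended_at) > '2023-04-30'
GROUP BY
end_date
ORDER BY
end_date ;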
Now we can start performing some data cleaning steps.
First, we shall check whether ride_id is indeed unique, or whether there are duplicates.
SQL code
SELECT
ride_id,
COUNT(ride_id) AS ride_count
FROM
divvy.div
GROUP BY
ride_id
HAVING
COUNT(ride_id) > 1 ;
Result
There is no data to display.
It seems that importing the data directly from CSV into the database was the right path: this way we have no duplicate ride_ids.
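As a spot check, the two June ride_ids from the Excel example earlier should now come back as two distinct rows (a quick sketch):
SQL code
SELECT
ride_id
FROM
divvy.div
WHERE
ride_id IN ('56302840944300E1', '0562795493516E03') ;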
Next we need to assess the state of data in various attribute columns.
Check for NULLS:
SQL code
SELECT
SUM(CASE WHEN ride_id IS NULL THEN 1 ELSE 0 END) AS ride_id_null,
SUM(CASE WHEN rideable_type IS NULL THEN 1 ELSE 0 END) AS rideable_type_null,
SUM(CASE WHEN started_at IS NULL THEN 1 ELSE 0 END) AS started_at_null,
SUM(CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END) AS ended_at_null,
SUM(CASE WHEN start_station_name IS NULL THEN 1 ELSE 0 END) AS start_station_name_null,
SUM(CASE WHEN start_station_id IS NULL THEN 1 ELSE 0 END) AS start_station_id_null,
SUM(CASE WHEN end_station_name IS NULL THEN 1 ELSE 0 END) AS end_station_name_null,
SUM(CASE WHEN end_station_id IS NULL THEN 1 ELSE 0 END) AS end_station_id_null,
SUM(CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END) AS start_lat_null,
SUM(CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END) AS end_lat_null,
SUM(CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END) AS start_lng_null,
SUM(CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END) AS end_lng_null,
SUM(CASE WHEN member_casual IS NULL THEN 1 ELSE 0 END) AS member_casual_null
FROM
divvy.div ;
Result
ride_id_null 0
rideable_type_null 0
started_at_null 0
ended_at_null 0
start_station_name_null 832,009
start_station_id_null 832,141
end_station_name_null 889,661
end_station_id_null 889,802
start_lat_null 0
end_lat_null 5,973
start_lng_null 0
end_lng_null 5,973
member_casual_null 0
Around 0.8 million rides have start/end station recording issues, which renders those rows void and requires their removal. The nearly six thousand rows with no ending geocode are a subset of that group.
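One way to verify that claim is to count the rows where both the end geocode and the end station are missing; if the geocode gaps are indeed a subset, this should return the same 5,973 (a minimal sketch):
SQL code
SELECT
COUNT(*) AS end_geo_and_station_null
FROM
divvy.div
WHERE
end_lat IS NULL
AND end_station_name IS NULL ;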
Now we need to check for rides that were company-run. These include test rides:
SQL code
SELECT
SUM(CASE WHEN LOWER(start_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS start_station_id_test,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS end_station_id_test
FROM
divvy.div;
Result
start_station_id_test 1,649
end_station_id_test 209
As well as rides that were the company-run leg of the valet service[5], or trips for repair:
SQL code
SELECT
SUM(CASE WHEN LOWER(start_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS start_station_id_divvy,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS end_station_id_divvy
FROM
divvy.div;
Result
start_station_id_divvy 24
end_station_id_divvy 25
There will be some overlap with the test data, since test rides also use the divvy label in the station id. Regardless, all these rows need to be removed too: they are not actual customer-ridden rides and thus not valid data points.
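To quantify that overlap, one could count the rows whose station id carries both labels (a minimal sketch):
SQL code
SELECT
COUNT(*) AS test_and_divvy
FROM
divvy.div
WHERE
LOWER(start_station_id) LIKE '%test%'
AND LOWER(start_station_id) LIKE '%divvy%' ;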
For our data analysis (and further data cleaning) we need to create the following calculated column: ride_length, the difference in seconds between the started_at and ended_at timestamps.
SQL code
ALTER TABLE
divvy.div ADD COLUMN ride_length INT64;
UPDATE
divvy.div
SET
ride_length = TIMESTAMP_DIFF(ended_at, started_at, SECOND)
WHERE
1=1;
Result
ride_length INTEGER
The schema now has a new field.
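Before hunting for specific errors, a quick look at the extremes of the new column gives a feel for the cleaning ahead (a minimal sketch; output not reproduced here):
SQL code
SELECT
MIN(ride_length) AS shortest_ride,
MAX(ride_length) AS longest_ride
FROM
divvy.div ;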
Now we can check for rides that end before they start, indicating a recording error:
SQL code
SELECT
COUNT(ride_length) AS time_error
FROM
divvy.div
WHERE
ride_length < 0;
Result
103
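To eyeball a few of these impossible rides before deciding their fate, something like this works (a minimal sketch):
SQL code
SELECT
ride_id,
started_at,
ended_at,
ride_length
FROM
divvy.div
WHERE
ride_length < 0
LIMIT 5 ;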
Next, rides shorter than 60 seconds, which are considered docking errors and are not counted:
SQL code
SELECT
COUNT(ride_length) AS short_rides
FROM
divvy.div
WHERE
ride_length < 60;
Result
140,515
However, a quick check reveals that this number already includes the 103 negative rides from above (anything below 0 is also below 60), which will come in handy later on.
And rides that are longer than 24 hours and thus considered lost/stolen:
SQL code
SELECT
COUNT(ride_length) AS long_rides
FROM
divvy.div
WHERE
ride_length > (60*60*24);
Result
5,345
Station names and IDs have some issues:
SQL code
SELECT
COUNT(DISTINCT(start_station_name)) AS start_name,
COUNT(DISTINCT(start_station_id)) AS start_id,
COUNT(DISTINCT(end_station_name)) AS end_name,
COUNT(DISTINCT(end_station_id)) AS end_id
FROM
divvy.div;
Result
start_name 1,722
start_id 1,319
end_name 1,741
end_id 1,324
Clearly there isn’t a 1:1 correspondence between station names and station IDs. We will rely on station names, since they have fewer errors; that means we will not be taking the id columns into the clean data set for our analysis phase.
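The mismatch can be seen directly by listing station names that map to more than one id (a minimal sketch):
SQL code
SELECT
start_station_name,
COUNT(DISTINCT start_station_id) AS id_count
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
GROUP BY
start_station_name
HAVING
COUNT(DISTINCT start_station_id) > 1 ;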
We also need to check the latitudes and longitudes recorded for each station. Are they reliable? Let’s see if a station has more than one set of geo coordinates:
SQL code
SELECT
COUNT(*)
FROM (
SELECT
start_station_name,
COUNT(DISTINCT start_lat) AS count_lat,
COUNT(DISTINCT start_lng) AS count_lng
FROM
divvy.div
GROUP BY
start_station_name
HAVING
COUNT(DISTINCT start_lat) > 1
OR COUNT(DISTINCT start_lng) > 1);
Result
823
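To gauge just how widely the recorded coordinates scatter, one can rank stations by the spread of their geocodes (a minimal sketch):
SQL code
SELECT
start_station_name,
MAX(start_lat) - MIN(start_lat) AS lat_spread,
MAX(start_lng) - MIN(start_lng) AS lng_spread
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
GROUP BY
start_station_name
ORDER BY
lat_spread DESC
LIMIT 10 ;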
The same is true for end stations. Clearly the geocodes are all over the place (GPS accuracy varies with how many satellites the device can reach at any given moment), so as recorded they are not usable for any geo-visualisation. But first let’s get some clean data, to avoid fixing geocodes for rows that will end up being removed anyway.
With our boundaries defined, we now need to create a new table that holds only the clean data. We will remove all rows with NULLs, as well as the test and divvy rows.
We will select only rides whose length is above 60 seconds and below 24 hours; that way we eliminate not only rides that are too short or too long, but also the recording errors (negative ride lengths fall outside these bounds and are automatically eliminated).
We will add some data-analysis columns, like started_day, which will allow us to do weekly analysis, and a separate, clean start date without the time component so we can do our daily analysis more easily (we don’t need the time of day much, since ride_length has already been calculated).
We will add a flag for whether the ride was one-way or a round trip, by matching start and end stations.
We will move the important columns to the front, but we will not drop any of the other columns outright; they will stay tacked on at the end in case we need them for data analysis.
SQL code
CREATE OR REPLACE TABLE
divvy.divclean AS
SELECT
TRIM(ride_id) AS ride_id,
CAST(started_at AS DATE) AS started_date,
FORMAT_DATE('%A', DATE(started_at)) AS started_day,
TRIM(start_station_name) AS start_station_name,
TRIM(end_station_name) AS end_station_name,
TRIM(member_casual) AS member_casual,
ride_length,
CASE WHEN LOWER(start_station_name) = LOWER(end_station_name) THEN 1 ELSE 0 END AS round_trip,
TRIM(rideable_type) AS bike_type,
* EXCEPT (ride_id,
rideable_type,
start_station_name,
end_station_name,
member_casual,
ride_length)
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
AND end_station_name IS NOT NULL
AND end_lat IS NOT NULL
AND end_lng IS NOT NULL
AND LOWER(start_station_id) NOT LIKE '%test%'
AND LOWER(start_station_id) NOT LIKE '%divvy%'
AND LOWER(end_station_id) NOT LIKE '%test%'
AND LOWER(end_station_id) NOT LIKE '%divvy%'
AND ride_length BETWEEN 60
AND 60*60*24;
Result
ride_id STRING
started_date DATE
started_day STRING
start_station_name STRING
end_station_name STRING
member_casual STRING
ride_length INTEGER
round_trip INTEGER
bike_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_id STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
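At this point it is worth re-running the earlier row count to see how much data survived the cleaning (a minimal sketch; output not reproduced here):
SQL code
SELECT
COUNT(*)
FROM
divvy.divclean ;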
With the data cleaned, let’s create a clean set of stations with an appropriate geocode for each. First we will collect the lists of all the start and end stations and combine them, so that we have one big list of all stations:
SQL code
CREATE OR REPLACE TABLE
divvy.divstation AS
SELECT
start_station_name AS station_name,
start_lat AS station_lat,
start_lng AS station_lng
FROM
divvy.divclean
UNION ALL
SELECT
end_station_name AS station_name,
end_lat AS station_lat,
end_lng AS station_lng
FROM
divvy.divclean ;
Result
station_name STRING
station_lat FLOAT
station_lng FLOAT
Next, we group by station name and take the approximate median of the recorded latitudes and longitudes for each station:
SQL code
CREATE OR REPLACE TABLE
divvy.group_divstation AS
SELECT
station_name,
APPROX_QUANTILES(station_lat, 2)[OFFSET(1)] AS median_lat,
APPROX_QUANTILES(station_lng, 2)[OFFSET(1)] AS median_lng
FROM
divvy.divstation
GROUP BY
station_name;
Result
station_name STRING
median_lat FLOAT
median_lng FLOAT
We choose the median over a simple average so that very distant outliers cannot drag a station’s location away (APPROX_QUANTILES(x, 2) returns the approximate [minimum, median, maximum], so OFFSET(1) picks out the median). We now have a separate table of cleaned stations and geocodes, which we will merge with the rest of the data to create the final table we can then use for analysis.
SQL code
CREATE OR REPLACE TABLE
divvy.analysis AS
SELECT
clean.* EXCEPT (start_lat,
start_lng,
end_lat,
end_lng),
start.median_lat AS start_lat,
start.median_lng AS start_lng,
ended.median_lat AS end_lat,
ended.median_lng AS end_lng
FROM
divvy.divclean clean
LEFT JOIN
divvy.group_divstation start
ON
clean.start_station_name = start.station_name
LEFT JOIN
divvy.group_divstation ended
ON
clean.end_station_name = ended.station_name;
Result
ride_id STRING
started_date DATE
started_day STRING
start_station_name STRING
end_station_name STRING
member_casual STRING
ride_length INTEGER
round_trip INTEGER
bike_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_id STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
Before we start analysing the data, we need to run integrity checks to ensure no errors have slipped into our clean data. Our key problems were all the NULLs and the other unwanted rows, so let’s see whether they are gone.
SQL code
SELECT
SUM(CASE WHEN ride_id IS NULL THEN 1 ELSE 0 END) AS ride_id_null,
SUM(CASE WHEN started_date IS NULL THEN 1 ELSE 0 END) AS started_date_null,
SUM(CASE WHEN started_day IS NULL THEN 1 ELSE 0 END) AS started_day_null,
SUM(CASE WHEN start_station_name IS NULL THEN 1 ELSE 0 END) AS start_station_name_null,
SUM(CASE WHEN end_station_name IS NULL THEN 1 ELSE 0 END) AS end_station_name_null,
SUM(CASE WHEN member_casual IS NULL THEN 1 ELSE 0 END) AS member_casual_null,
SUM(CASE WHEN ride_length IS NULL THEN 1 ELSE 0 END) AS ride_length_null,
SUM(CASE WHEN round_trip IS NULL THEN 1 ELSE 0 END) AS round_trip_null,
SUM(CASE WHEN bike_type IS NULL THEN 1 ELSE 0 END) AS bike_type_null,
SUM(CASE WHEN started_at IS NULL THEN 1 ELSE 0 END) AS started_at_null,
SUM(CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END) AS ended_at_null,
SUM(CASE WHEN start_station_id IS NULL THEN 1 ELSE 0 END) AS start_station_id_null,
SUM(CASE WHEN end_station_id IS NULL THEN 1 ELSE 0 END) AS end_station_id_null,
SUM(CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END) AS start_lat_null,
SUM(CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END) AS start_lng_null,
SUM(CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END) AS end_lat_null,
SUM(CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END) AS end_lng_null,
SUM(CASE WHEN LOWER(start_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS start_station_id_test,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS end_station_id_divvy,
SUM(CASE WHEN LOWER(start_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS start_station_id_divvy
FROM
divvy.analysis;
Result
ride_id_null 0
started_date_null 0
started_day_null 0
start_station_name_null 0
end_station_name_null 0
member_casual_null 0
ride_length_null 0
round_trip_null 0
bike_type_null 0
started_at_null 0
ended_at_null 0
start_station_id_null 0
end_station_id_null 0
start_lat_null 0
start_lng_null 0
end_lat_null 0
end_lng_null 0
start_station_id_test 0
end_station_id_divvy 0
start_station_id_divvy 0
Now we can move to the next stage.