The Case Study Roadmap recommends Excel for this stage, and that is the tool I initially considered too, until I ran into the following error and was forced to use a different tool.
The data comes as .csv (comma-separated values) files, which are essentially just simple text files with no data formatting of any kind. However, when you open such a file in Excel, it automatically applies its routine data formatting, and this can cause issues.
Each ride_id is an automatically generated string: a random combination of numbers and letters with no inherent meaning.
Consider the following two ride_ids from the June 2022 data set:
56302840944300E1
0562795493516E03
To the naked eye, these look like two unique and separate ride_ids; however, due to default Excel behaviour, these strings are unfortunately interpreted as numbers[1], which Excel then converts into scientific notation[2], and thus we end up with two rows bearing the same ride_id of 5.63E+14.
(The same happens in Google Sheets.)
This means that when searching for duplicate ride_ids, we will get false positives; and if not accounted for, we could end up accidentally removing valid data.
Therefore, the better tool is one that processes the raw .csv files directly from the source without the automated formatting Excel is wont to apply, and without having to resort to workarounds and hacks.
The two other tools we have learnt in the Google Data Analytics certification are SQL and R. I chose SQL for the following reasons:
The dataset is heavy (each file is hundreds of MBs, with hundreds of thousands of rows, one per ride) and my computer was struggling to process it; by uploading the files to Google BigQuery[3] I could run multiple queries without having to worry about processing time.
I don’t expect to perform any complex mathematical analyses of the kind that are R’s forte; the data is relatively simple in nature, it’s just large.
Since the data is most likely a set of monthly extracts from the company database, it is already formatted in a manner usable for SQL queries.
The data comes as 12 csv files whose names followed the same pattern, except for one month where the word public was added for no clear reason, only to be gone again the next month.
In any case, all 12 files were renamed to fit the following naming format: YYYY_MM_divvy_tripdata.csv. A standardised naming format makes the files easy to recognise when handling data, and brings a further benefit, which we shall see below, that saves us a step.
All 12 files (2022_05 to 2023_04) were uploaded to Google Cloud Storage[4], which allows easy access when creating tables in the BigQuery dataset, which I decided to name divvy in honour of the original data source.
I created a staging table called “div” and uploaded all 12 files into it. One nifty benefit of using a Google Cloud Storage bucket is that instead of being forced to import all 12 files as individual tables and then combine them with UNION queries in SQL… you can just use a wildcard operator!
When creating a table, as long as you have set up your data bucket appropriately and renamed the files in a usable pattern, BigQuery can pick up all 12 files in one go and create a single table with the combined data of all 12 files.
So I asked it to make a table out of “*_divvy_tripdata.csv” from my data bucket, and it created one simple table (divvy.div) that I can now use for data cleaning and transformation.
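For reference, the same wildcard import can also be expressed as a SQL LOAD DATA statement (a minimal sketch, assuming a hypothetical bucket name my_divvy_bucket; I used the BigQuery console UI instead):
SQL code
LOAD DATA INTO
divvy.div
FROM FILES (
format = 'CSV',
-- the wildcard picks up all 12 renamed monthly files at once;
-- my_divvy_bucket is a placeholder for your own bucket name
uris = ['gs://my_divvy_bucket/*_divvy_tripdata.csv'],
skip_leading_rows = 1 );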
The table has the following schema when imported:
ride_id STRING
rideable_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_name STRING
start_station_id STRING
end_station_name STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
member_casual STRING
We will be adding a few more columns for our data analysis, but we will get to that later on.
Let’s run some code to find out how many rows we have.
SQL code
SELECT
COUNT(*)
FROM
divvy.div ;
Result
5,859,061
5.8 million rows! That’s a lot of data, but there will be a lot of errors as well, so it will be interesting to see how much the data shrinks after cleaning.
And check how many days’ worth of data we have:
SQL code
SELECT
COUNT(DISTINCT DATE(started_at)) AS start_days,
COUNT(DISTINCT DATE(ended_at)) AS end_days
FROM
divvy.div ;
Result
start_days end_days
365 368
We have 365 distinct start dates (so a full year’s worth of data), but 368 distinct end dates, more than a year’s worth? Clearly some rides end days after the window closes, meaning someone is keeping their bike for more than a day, which we will deal with later.
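To see how far past the window those rides run, one could list the end dates beyond the last day of our range (a minimal sketch, assuming 2023-04-30 is the final day of the 12-month window):
SQL code
SELECT
DATE(ended_at) AS end_date,
COUNT(*) AS rides
FROM
divvy.div
WHERE
DATE(ended_at) > '2023-04-30'
GROUP BY
end_date
ORDER BY
end_date ;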
Now we can start performing some data cleaning steps.
First, we shall check whether ride_id is indeed unique, or whether there are duplicates.
SQL code
SELECT
ride_id,
COUNT(ride_id) AS ride_count
FROM
divvy.div
GROUP BY
ride_id
HAVING
COUNT(ride_id) > 1 ;
Result
There is no data to display.
It seems that importing the data directly from CSV into the database was the right path: this way we have no duplicate ride_ids.
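As a spot check, the two June ride_ids from the Excel example earlier should now come back as two distinct rows (a quick sketch):
SQL code
SELECT
ride_id
FROM
divvy.div
WHERE
ride_id IN ('56302840944300E1', '0562795493516E03') ;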
Next we need to assess the state of data in various attribute columns.
Check for NULLS:
SQL code
SELECT
SUM(CASE WHEN ride_id IS NULL THEN 1 ELSE 0 END) AS ride_id_null,
SUM(CASE WHEN rideable_type IS NULL THEN 1 ELSE 0 END) AS rideable_type_null,
SUM(CASE WHEN started_at IS NULL THEN 1 ELSE 0 END) AS started_at_null,
SUM(CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END) AS ended_at_null,
SUM(CASE WHEN start_station_name IS NULL THEN 1 ELSE 0 END) AS start_station_name_null,
SUM(CASE WHEN start_station_id IS NULL THEN 1 ELSE 0 END) AS start_station_id_null,
SUM(CASE WHEN end_station_name IS NULL THEN 1 ELSE 0 END) AS end_station_name_null,
SUM(CASE WHEN end_station_id IS NULL THEN 1 ELSE 0 END) AS end_station_id_null,
SUM(CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END) AS start_lat_null,
SUM(CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END) AS end_lat_null,
SUM(CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END) AS start_lng_null,
SUM(CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END) AS end_lng_null,
SUM(CASE WHEN member_casual IS NULL THEN 1 ELSE 0 END) AS member_casual_null
FROM
divvy.div ;
Result
ride_id_null 0
rideable_type_null 0
started_at_null 0
ended_at_null 0
start_station_name_null 832,009
start_station_id_null 832,141
end_station_name_null 889,661
end_station_id_null 889,802
start_lat_null 0
end_lat_null 5,973
start_lng_null 0
end_lng_null 5,973
member_casual_null 0
Around 0.8 million rides have start/end station recording issues, which renders those rows void and requires their removal. The nearly six thousand rows with no ending geocode are a subset of that group.
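One way to verify that claim is to count the rows where both the end geocode and the end station are missing; if the geocode gaps are indeed a subset, this should return the same 5,973 (a minimal sketch):
SQL code
SELECT
COUNT(*) AS end_geo_and_station_null
FROM
divvy.div
WHERE
end_lat IS NULL
AND end_station_name IS NULL ;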
Now we need to check for rides that were company-run. These include test rides:
SQL code
SELECT
SUM(CASE WHEN LOWER(start_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS start_station_id_test,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS end_station_id_test
FROM
divvy.div;
Result
start_station_id_test 1,649
end_station_id_test 209
As well as rides that were the company-run leg of the valet service[5], or trips for repair:
SQL code
SELECT
SUM(CASE WHEN LOWER(start_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS start_station_id_divvy,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS end_station_id_divvy
FROM
divvy.div;
Result
start_station_id_divvy 24
end_station_id_divvy 25
There will be some overlap with the test data, since test rides also use the divvy label in the station id. Regardless, all these rows need to be removed too: they are not actual customer-ridden rides and thus not valid data points.
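To quantify that overlap, one could count the rows whose station id carries both labels (a minimal sketch):
SQL code
SELECT
COUNT(*) AS test_and_divvy
FROM
divvy.div
WHERE
LOWER(start_station_id) LIKE '%test%'
AND LOWER(start_station_id) LIKE '%divvy%' ;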
For our data analysis (and further data cleaning) we need to create the following calculated column: ride_length, the difference in seconds between the started_at and ended_at timestamps.
SQL code
ALTER TABLE
divvy.div ADD COLUMN ride_length INT64;
UPDATE
divvy.div
SET
ride_length = TIMESTAMP_DIFF(ended_at, started_at, SECOND)
WHERE
1=1;
Result
ride_length INTEGER
The schema now has a new field.
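Before hunting for specific errors, a quick look at the extremes of the new column gives a feel for the cleaning ahead (a minimal sketch; output not reproduced here):
SQL code
SELECT
MIN(ride_length) AS shortest_ride,
MAX(ride_length) AS longest_ride
FROM
divvy.div ;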
Now we can check for rides that end before they start, indicating a recording error:
SQL code
SELECT
COUNT(ride_length) AS time_error
FROM
divvy.div
WHERE
ride_length < 0;
Result
103
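To eyeball a few of these impossible rides before deciding their fate, something like this works (a minimal sketch):
SQL code
SELECT
ride_id,
started_at,
ended_at,
ride_length
FROM
divvy.div
WHERE
ride_length < 0
LIMIT 5 ;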
Next, rides shorter than 60 seconds, which are considered docking errors and are not counted:
SQL code
SELECT
COUNT(ride_length) AS short_rides
FROM
divvy.div
WHERE
ride_length < 60;
Result
140,515
However, a quick check reveals that this number already includes the 103 negative rides from above (anything below 0 is also below 60), which will come in handy later on.
And rides that are longer than 24 hours and thus considered lost/stolen:
SQL code
SELECT
COUNT(ride_length) AS long_rides
FROM
divvy.div
WHERE
ride_length > (60*60*24);
Result
5,345
Station names and IDs have some issues:
SQL code
SELECT
COUNT(DISTINCT(start_station_name)) AS start_name,
COUNT(DISTINCT(start_station_id)) AS start_id,
COUNT(DISTINCT(end_station_name)) AS end_name,
COUNT(DISTINCT(end_station_id)) AS end_id
FROM
divvy.div;
Result
start_name 1,722
start_id 1,319
end_name 1,741
end_id 1,324
Clearly there isn’t a 1:1 correspondence between station names and station IDs. We will rely on station names, since they have fewer errors; that means we will not be taking the id columns into the clean data set for our analysis phase.
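The mismatch can be seen directly by listing station names that map to more than one id (a minimal sketch):
SQL code
SELECT
start_station_name,
COUNT(DISTINCT start_station_id) AS id_count
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
GROUP BY
start_station_name
HAVING
COUNT(DISTINCT start_station_id) > 1 ;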
We also need to check the latitudes and longitudes recorded for each station. Are they reliable? Let’s see if a station has more than one set of geo coordinates:
SQL code
SELECT
COUNT(*)
FROM (
SELECT
start_station_name,
COUNT(DISTINCT start_lat) AS count_lat,
COUNT(DISTINCT start_lng) AS count_lng
FROM
divvy.div
GROUP BY
start_station_name
HAVING
COUNT(DISTINCT start_lat) > 1
OR COUNT(DISTINCT start_lng) > 1);
Result
823
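To gauge just how widely the recorded coordinates scatter, one can rank stations by the spread of their geocodes (a minimal sketch):
SQL code
SELECT
start_station_name,
MAX(start_lat) - MIN(start_lat) AS lat_spread,
MAX(start_lng) - MIN(start_lng) AS lng_spread
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
GROUP BY
start_station_name
ORDER BY
lat_spread DESC
LIMIT 10 ;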
The same is true for end stations. Clearly the geocodes are all over the place (GPS accuracy varies with how many satellites the device can reach at any given moment), so as recorded they are not usable for any geo-visualisation. But first let’s get some clean data, to avoid fixing geocodes for rows that will end up being removed anyway.
With our boundaries defined, we now need to create a new table that holds only the clean data. We will remove all rows with NULLs, as well as the test and divvy rows.
We will select only rides whose length is above 60 seconds and below 24 hours; that way we eliminate not only rides that are too short or too long, but also the recording errors (negative ride lengths fall outside these bounds and are automatically eliminated).
We will add some data-analysis columns, like started_day, which will allow us to do weekly analysis, and a separate, clean start date without the time component so we can do our daily analysis more easily (we don’t need the time of day much, since ride_length has already been calculated).
We will add a flag for whether the ride was one-way or a round trip, by matching start and end stations.
We will move the important columns to the front, but we will not drop any of the other columns outright; they will stay tacked on at the end in case we need them for data analysis.
SQL code
CREATE OR REPLACE TABLE
divvy.divclean AS
SELECT
TRIM(ride_id) AS ride_id,
CAST(started_at AS DATE) AS started_date,
FORMAT_DATE('%A', DATE(started_at)) AS started_day,
TRIM(start_station_name) AS start_station_name,
TRIM(end_station_name) AS end_station_name,
TRIM(member_casual) AS member_casual,
ride_length,
CASE WHEN LOWER(start_station_name) = LOWER(end_station_name) THEN 1 ELSE 0 END AS round_trip,
TRIM(rideable_type) AS bike_type,
* EXCEPT (ride_id,
rideable_type,
start_station_name,
end_station_name,
member_casual,
ride_length)
FROM
divvy.div
WHERE
start_station_name IS NOT NULL
AND end_station_name IS NOT NULL
AND end_lat IS NOT NULL
AND end_lng IS NOT NULL
AND LOWER(start_station_id) NOT LIKE '%test%'
AND LOWER(start_station_id) NOT LIKE '%divvy%'
AND LOWER(end_station_id) NOT LIKE '%test%'
AND LOWER(end_station_id) NOT LIKE '%divvy%'
AND ride_length BETWEEN 60
AND 60*60*24;
Result
ride_id STRING
started_date DATE
started_day STRING
start_station_name STRING
end_station_name STRING
member_casual STRING
ride_length INTEGER
round_trip INTEGER
bike_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_id STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
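At this point it is worth re-running the earlier row count to see how much data survived the cleaning (a minimal sketch; output not reproduced here):
SQL code
SELECT
COUNT(*)
FROM
divvy.divclean ;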
With the data cleaned, let’s create a clean set of stations with an appropriate geocode for each. First we will collect the lists of all the start and end stations and combine them, so that we have one big list of all stations:
SQL code
CREATE OR REPLACE TABLE
divvy.divstation AS
SELECT
start_station_name AS station_name,
start_lat AS station_lat,
start_lng AS station_lng
FROM
divvy.divclean
UNION ALL
SELECT
end_station_name AS station_name,
end_lat AS station_lat,
end_lng AS station_lng
FROM
divvy.divclean ;
Result
station_name STRING
station_lat FLOAT
station_lng FLOAT
Next, we group by station name and take the approximate median of the recorded latitudes and longitudes for each station:
SQL code
CREATE OR REPLACE TABLE
divvy.group_divstation AS
SELECT
station_name,
APPROX_QUANTILES(station_lat, 2)[OFFSET(1)] AS median_lat,
APPROX_QUANTILES(station_lng, 2)[OFFSET(1)] AS median_lng
FROM
divvy.divstation
GROUP BY
station_name;
Result
station_name STRING
median_lat FLOAT
median_lng FLOAT
We choose the median over a simple average so that very distant outliers cannot drag a station’s location away (APPROX_QUANTILES(x, 2) returns the approximate [minimum, median, maximum], so OFFSET(1) picks out the median). We now have a separate table of cleaned stations and geocodes, which we will merge with the rest of the data to create the final table we can then use for analysis.
SQL code
CREATE OR REPLACE TABLE
divvy.analysis AS
SELECT
clean.* EXCEPT (start_lat,
start_lng,
end_lat,
end_lng),
start.median_lat AS start_lat,
start.median_lng AS start_lng,
ended.median_lat AS end_lat,
ended.median_lng AS end_lng
FROM
divvy.divclean clean
LEFT JOIN
divvy.group_divstation start
ON
clean.start_station_name = start.station_name
LEFT JOIN
divvy.group_divstation ended
ON
clean.end_station_name = ended.station_name;
Result
ride_id STRING
started_date DATE
started_day STRING
start_station_name STRING
end_station_name STRING
member_casual STRING
ride_length INTEGER
round_trip INTEGER
bike_type STRING
started_at TIMESTAMP
ended_at TIMESTAMP
start_station_id STRING
end_station_id STRING
start_lat FLOAT
start_lng FLOAT
end_lat FLOAT
end_lng FLOAT
Before we start analysing the data, we need to run integrity checks to ensure no errors have slipped into our clean data. Our key problems were all the NULLs and the other unwanted rows, so let’s see whether they are gone.
SQL code
SELECT
SUM(CASE WHEN ride_id IS NULL THEN 1 ELSE 0 END) AS ride_id_null,
SUM(CASE WHEN started_date IS NULL THEN 1 ELSE 0 END) AS started_date_null,
SUM(CASE WHEN started_day IS NULL THEN 1 ELSE 0 END) AS started_day_null,
SUM(CASE WHEN start_station_name IS NULL THEN 1 ELSE 0 END) AS start_station_name_null,
SUM(CASE WHEN end_station_name IS NULL THEN 1 ELSE 0 END) AS end_station_name_null,
SUM(CASE WHEN member_casual IS NULL THEN 1 ELSE 0 END) AS member_casual_null,
SUM(CASE WHEN ride_length IS NULL THEN 1 ELSE 0 END) AS ride_length_null,
SUM(CASE WHEN round_trip IS NULL THEN 1 ELSE 0 END) AS round_trip_null,
SUM(CASE WHEN bike_type IS NULL THEN 1 ELSE 0 END) AS bike_type_null,
SUM(CASE WHEN started_at IS NULL THEN 1 ELSE 0 END) AS started_at_null,
SUM(CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END) AS ended_at_null,
SUM(CASE WHEN start_station_id IS NULL THEN 1 ELSE 0 END) AS start_station_id_null,
SUM(CASE WHEN end_station_id IS NULL THEN 1 ELSE 0 END) AS end_station_id_null,
SUM(CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END) AS start_lat_null,
SUM(CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END) AS start_lng_null,
SUM(CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END) AS end_lat_null,
SUM(CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END) AS end_lng_null,
SUM(CASE WHEN LOWER(start_station_id) LIKE '%test%' THEN 1 ELSE 0 END) AS start_station_id_test,
SUM(CASE WHEN LOWER(end_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS end_station_id_divvy,
SUM(CASE WHEN LOWER(start_station_id) LIKE '%divvy%' THEN 1 ELSE 0 END) AS start_station_id_divvy
FROM
divvy.analysis;
Result
ride_id_null 0
started_date_null 0
started_day_null 0
start_station_name_null 0
end_station_name_null 0
member_casual_null 0
ride_length_null 0
round_trip_null 0
bike_type_null 0
started_at_null 0
ended_at_null 0
start_station_id_null 0
end_station_id_null 0
start_lat_null 0
start_lng_null 0
end_lat_null 0
end_lng_null 0
start_station_id_test 0
end_station_id_divvy 0
start_station_id_divvy 0
Now we can move to the next stage.