Our data is located at the following address [1] and, as stated earlier, is based on data generated by a real-life company, so it is shaped by the data processes they define, as described here [2] and here [3].
The dataset takes the form of monthly zipped CSV files containing the (uncleaned) record of every ride taken in that month by any bike in the system. Each ride has a unique ride_id, along with the start/stop times and locations of the ride.
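As an illustration, monthly archives like these can be loaded directly with pandas, which reads a zip archive containing a single CSV transparently. This is a minimal sketch only: the file-name pattern and the timestamp column names (started_at, ended_at) are assumptions, not taken verbatim from the source.

```python
import glob

import pandas as pd

# Hypothetical file-name pattern for the monthly archives.
files = sorted(glob.glob("data/*-tripdata.zip"))

# pandas reads a single-CSV zip archive directly; the timestamp
# column names here are assumed, not confirmed by the source.
monthly = [pd.read_csv(f, parse_dates=["started_at", "ended_at"]) for f in files]
rides = pd.concat(monthly, ignore_index=True)
print(rides.shape)
```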
The dataset is licensed under the agreement found at address [4], and none of the data analysis that follows will be in breach of the licence.
We can check the soundness of the data by applying the ROCCC test:
Reliable: The data is automatically system-generated and covers every individual ride (not just a small sample); while data cleaning is required, there is no reason to suspect any bias in the data generation.
Original: The data comes from a first-party source that generates it directly.
Comprehensive: It is as comprehensive as it can be under the licensing terms; it contains enough data for analysis, while not containing further identifying data that could have enabled analysis at the individual-user level.
Current: The dataset is uploaded monthly and the latest batch was uploaded less than a week ago, so the latest available data is being used.
Cited: The source of the data and its licensing are cited below.
Data integrity will be maintained by ensuring that the data is complete (it contains data for every day of the calendar month, and no attribute is missing values) and correct (data cleaning operations ensure each row is a unique instance); a sketch of the completeness checks follows.
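A minimal sketch of those completeness checks, assuming the rides DataFrame from above with a started_at timestamp column:

```python
import pandas as pd

def check_completeness(rides: pd.DataFrame, year: int, month: int) -> None:
    # Completeness: every day of the calendar month should appear.
    month_rides = rides[(rides["started_at"].dt.year == year)
                        & (rides["started_at"].dt.month == month)]
    observed_days = set(month_rides["started_at"].dt.day)
    days_in_month = pd.Timestamp(year=year, month=month, day=1).days_in_month
    missing_days = sorted(set(range(1, days_in_month + 1)) - observed_days)
    if missing_days:
        print(f"{year}-{month:02d} is missing days: {missing_days}")

    # Completeness: no attribute should be missing data.
    null_counts = month_rides.isna().sum()
    print(null_counts[null_counts > 0])
```

Run once per monthly batch, this flags both gaps in calendar coverage and attributes with missing values before any analysis begins.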
Since the data comes pre-labeled with “member or casual” for each ride, it provides a direct path to answering the question posed in the previous stage: we will not need to join this dataset with a separate dataset of members, so the analysis can proceed in a more straightforward manner.
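Because the label is already present on every ride, a first look at the split requires no join at all; a minimal sketch, assuming the label column is named member_casual:

```python
# Count rides per rider type straight off the pre-labeled column;
# "member_casual" is an assumed column name, not confirmed by the source.
print(rides["member_casual"].value_counts())
```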
We will, however, have to create some calculated columns from the existing data to gather more information (such as the length of each ride and the day of the week on which it happened), as sketched below.
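A sketch of those calculated columns, again using the assumed started_at/ended_at timestamp columns:

```python
# Ride length as a duration, plus a seconds column for easy filtering.
rides["ride_length"] = rides["ended_at"] - rides["started_at"]
rides["ride_length_s"] = rides["ride_length"].dt.total_seconds()

# Day of the week on which the ride started (e.g. "Monday").
rides["day_of_week"] = rides["started_at"].dt.day_name()
```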
The dataset is not without errors and anomalies, and requires the following checks for cleaning:
1- ride_id is supposed to be unique for each ride; any duplicates will have to be dealt with.
2- Ride lengths have to make sense. Rides cannot end before they start, be too short to be considered a ride, or be so long that the bike was most likely lost/stolen.
3- Rides have to be valid. Rides taken for test purposes by company staff have to be removed from the dataset. [2]
4- Any row containing incomplete (null) data has to be eliminated as a data collection error.
The company considers rides shorter than 60 seconds to be docking errors [2] and rides longer than 24 hours to be lost/stolen [5], so these set the lower and upper bounds for the ride lengths; a sketch combining all four checks follows.
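Putting the four checks and the 60-second/24-hour bounds together, the cleaning pass might look like the sketch below. It is a sketch only: the test-ride filter in step 3 uses a hypothetical station-name flag, since the actual marker the company uses is described in [2] and is not reproduced here.

```python
# 1- Keep only the first occurrence of each ride_id.
rides = rides.drop_duplicates(subset="ride_id", keep="first")

# 4- Drop rows with any missing (null) attribute.
rides = rides.dropna()

# 2- Enforce the ride-length bounds of 60 seconds to 24 hours,
#    which also removes rides that end before they start.
rides = rides[(rides["ride_length_s"] >= 60)
              & (rides["ride_length_s"] <= 24 * 60 * 60)]

# 3- Remove staff test rides; matching "TEST" in the start station
#    name is a hypothetical stand-in for the flag described in [2].
rides = rides[~rides["start_station_name"]
              .str.contains("TEST", case=False, na=False)]
```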