Data Cleaning
"No Data is clean, but most is useful."
The data obtained is in raw form and needs substantial cleaning before any machine learning model can be applied. This section therefore focuses on univariate analysis and on removing outliers and illegitimate values, which may have been caused by recording errors.
The data cleaning code can be found here.
Missing values account for around 3% of the data, and most of them occur in important columns such as passenger_count. As a result, the affected rows are dropped.
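As a rough illustration of this step, the sketch below measures the share of rows with missing values and then drops them using pandas. The toy values and the trip_distance column are assumptions for demonstration only; passenger_count is the only column named above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw trip records (values are made up).
trips = pd.DataFrame({
    "passenger_count": [1, np.nan, 2, 1],
    "trip_distance": [1.2, 3.4, np.nan, 0.8],  # assumed column name
})

# Share of rows that have at least one missing value.
missing_share = trips.isna().any(axis=1).mean()

# Drop every row that has a missing value in any column.
clean_trips = trips.dropna().reset_index(drop=True)
```

On the real dataset, missing_share would be expected to come out near the 3% mentioned above.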
A boxplot of trip time shows that there are outliers.
The NYC Taxi & Limousine Commission states that a trip may not last longer than 12 hours in any 24-hour period. To comply with this rule, trip duration is calculated and any outliers are removed from the dataset.
In our data, timestamps are in the format “YYYY-MM-DD HH:MM:SS”. They are converted to Unix time in order to compute trip duration (trip time) and speed. The pickup times in Unix time are also used later for binning.
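The conversion and the 12-hour rule described above could look roughly like the following. This is only a sketch: the column names (tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance) are assumed, the helper to_unix is hypothetical, and the timestamps are treated as UTC for simplicity.

```python
import pandas as pd

# Toy records; column names are assumptions, values are made up.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2019-01-01 00:10:00", "2019-01-01 08:00:00"],
    "tpep_dropoff_datetime": ["2019-01-01 00:40:00", "2019-01-01 21:00:00"],
    "trip_distance": [6.0, 30.0],  # miles
})

FMT = "%Y-%m-%d %H:%M:%S"
EPOCH = pd.Timestamp("1970-01-01")

def to_unix(col):
    """Parse timestamp strings into whole seconds since the Unix epoch (treated as UTC)."""
    parsed = pd.to_datetime(col, format=FMT)
    return (parsed - EPOCH) // pd.Timedelta(seconds=1)

trips["pickup_unix"] = to_unix(trips["tpep_pickup_datetime"])
trips["dropoff_unix"] = to_unix(trips["tpep_dropoff_datetime"])

# Trip duration in minutes and average speed in miles per hour.
trips["trip_minutes"] = (trips["dropoff_unix"] - trips["pickup_unix"]) / 60
trips["speed_mph"] = trips["trip_distance"] / (trips["trip_minutes"] / 60)

# Keep only trips with a positive duration of at most 12 hours.
MAX_MINUTES = 12 * 60
valid = (trips["trip_minutes"] > 0) & (trips["trip_minutes"] <= MAX_MINUTES)
trips = trips[valid].reset_index(drop=True)
```

In this toy example the second trip lasts 13 hours, so only the first one survives the filter.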
Image: Boxplot which shows outliers.
From the above observations, it is evident that there are outliers in the data. Both the box plot and the percentile calculation confirm this.
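A percentile check of this kind can be sketched with NumPy. The durations below are invented, and the 1.5 × IQR whisker rule is an assumption: it is the usual boxplot convention, not necessarily the exact cut-off used in this project.

```python
import numpy as np

# Invented trip durations in minutes, with one obviously bad value.
trip_minutes = np.array([5, 8, 10, 12, 15, 18, 20, 25, 30, 5000], dtype=float)

# Print a few percentiles to see where the distribution jumps.
for q in (0, 25, 50, 75, 90, 99, 100):
    print(f"{q:>3}th percentile: {np.percentile(trip_minutes, q):.1f}")

# Boxplot-style upper whisker: Q3 + 1.5 * IQR.
q1, q3 = np.percentile(trip_minutes, [25, 75])
upper_whisker = q3 + 1.5 * (q3 - q1)
outliers = trip_minutes[trip_minutes > upper_whisker]
```

A large gap between the upper percentiles and the maximum is the same signal the boxplot shows visually.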
Image: Boxplot after removing the outliers
Upon calculation, the average speed in New York City is 12.45 miles per hour, so a cab driver covers about 2 miles every 10 minutes on average.
From the above observation, one can conclude that there are no outliers.
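The figure of roughly 2 miles per 10 minutes follows directly from the average speed; the variable names below are purely illustrative.

```python
avg_speed_mph = 12.45  # average speed reported above

# Distance covered in 10 minutes at that speed (10 minutes = 10/60 of an hour).
miles_per_10_min = avg_speed_mph * 10 / 60
print(round(miles_per_10_min, 3))
```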
As per the NYC Taxi & Limousine Commission, payment type is a numeric code signifying how the passenger paid for the trip. It takes one of 6 values:
1= Credit card
2= Cash
3= No charge
4= Dispute
5= Unknown
6= Voided trip
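For readability during analysis, the codes listed above can be mapped to their labels. A small sketch, assuming the column is called payment_type and using made-up rows:

```python
import pandas as pd

# Code-to-label mapping taken from the list above.
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# Made-up rows; "payment_type" is an assumed column name.
trips = pd.DataFrame({"payment_type": [1, 2, 2, 4, 1]})

# Attach a human-readable label and count trips per payment method.
trips["payment_label"] = trips["payment_type"].map(PAYMENT_TYPES)
counts = trips["payment_label"].value_counts()
```

Any code outside 1 to 6 would map to NaN here, which makes invalid values easy to spot.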
The boxplot above indicates outliers in the total amounts.
Upon closer inspection of the distribution, one can confirm the presence of outliers.
These outliers make up around 0.4% of the dataset. As a result, they are dropped.
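One way such a filter might be written is shown below. The bounds low and high, the function name drop_amount_outliers, and the total_amount column name are all hypothetical placeholders; the actual thresholds used for this dataset are not stated here.

```python
import pandas as pd

def drop_amount_outliers(df, low=0.0, high=1000.0):
    """Keep rows with low < total_amount <= high (bounds are illustrative only)."""
    mask = (df["total_amount"] > low) & (df["total_amount"] <= high)
    return df[mask].reset_index(drop=True)

# Made-up fares, including a negative amount and an implausibly large one.
trips = pd.DataFrame({"total_amount": [12.5, 8.3, -4.0, 25.0, 17.8, 3950611.6]})

cleaned = drop_amount_outliers(trips)
n_dropped = len(trips) - len(cleaned)
```

Comparing n_dropped with the original row count gives the share of data lost to the filter.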
Upon closer inspection, all the zone values lie within the valid range of 1 to 265.
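A range check like this takes only a few lines in pandas. The column names PULocationID and DOLocationID are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up zone IDs; column names are assumed.
trips = pd.DataFrame({
    "PULocationID": [1, 132, 265],
    "DOLocationID": [48, 265, 7],
})

zone_cols = ["PULocationID", "DOLocationID"]

# True for rows where both zone IDs fall within 1..265 (inclusive).
in_range = trips[zone_cols].apply(lambda s: s.between(1, 265)).all(axis=1)
all_valid = bool(in_range.all())
```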
Check for outliers in pickup and drop off times.
It is evident from the data that there are a few outliers in the pickup and drop-off times.
As part of data cleaning, these incorrect records are removed: only trips that fall within the timeline under study are considered valid, and trips from any other period are dropped.
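Restricting the data to the study window could be done along these lines. The start and end dates are placeholders rather than the actual timeline of this project, and the column names are assumed.

```python
import pandas as pd

# Made-up trips; two of them carry timestamps far outside the study period.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2019-01-15 10:00:00", "2008-12-31 23:59:00",
        "2019-03-02 09:30:00", "2088-01-24 00:15:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2019-01-15 10:20:00", "2009-01-01 00:10:00",
        "2019-03-02 09:55:00", "2088-01-24 00:40:00",
    ]),
})

# Placeholder window: start is inclusive, end is exclusive.
start = pd.Timestamp("2019-01-01")
end = pd.Timestamp("2019-04-01")

pickup_ok = trips["tpep_pickup_datetime"].between(start, end, inclusive="left")
dropoff_ok = trips["tpep_dropoff_datetime"].between(start, end, inclusive="left")
trips_in_window = trips[pickup_ok & dropoff_ok].reset_index(drop=True)
```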
The final NYC Taxi cleaned dataset can be found here.
Above Fig: Shows the frequency count of missing values in the dataset.
Below Fig: Shows the frequency of missing values after data cleaning.
The frequency plot on the right depicts the missing values in the weather dataset.
Some of these columns have more than 90% of their data missing.
Moreover, these columns do not provide valuable information for the task of taxi prediction.
With this in mind, these columns are dropped.
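Dropping sparsely populated columns can be sketched as follows. The weather column names and values are invented; only the 90% threshold comes from the text above.

```python
import numpy as np
import pandas as pd

# Invented weather table with 20 rows.
weather = pd.DataFrame({
    "temperature": np.linspace(20, 39, 20),   # no missing values
    "gust": [np.nan] * 19 + [12.0],           # 95% missing
    "precip": [0.0] * 18 + [np.nan, np.nan],  # 10% missing
})

# Fraction of missing values per column.
missing_frac = weather.isna().mean()

# Drop the columns where more than 90% of the values are missing.
to_drop = missing_frac[missing_frac > 0.9].index
weather_clean = weather.drop(columns=to_drop)
```

Columns with only a small share of gaps, such as precip in this toy table, are kept and can be handled separately.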
The final Weather cleaned dataset can be found here.