Data Cleaning

"No Data is clean, but most is useful."

NYC Taxi Data

The data is obtained in raw form and needs substantial cleaning before any machine learning model can be applied. This section therefore focuses on univariate analysis and on removing outliers and illegitimate values, which may be caused by errors in the data.

The data cleaning code can be found here.

1. Removing Missing Values

Missing values account for around 3% of the data, and most of them occur in important columns such as passenger_count. As a result, the affected rows are dropped.
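A minimal sketch of this step, assuming the trips are loaded into a pandas DataFrame. The sample values and the trip_distance column are made up for illustration; only passenger_count is named in the text.

```python
import pandas as pd

# Tiny hypothetical sample; the real dataset has many more rows and columns.
trips = pd.DataFrame({
    "passenger_count": [1, None, 2, 3],
    "trip_distance": [1.2, 0.8, None, 5.0],
})

# Share of rows that have at least one missing value.
missing_share = trips.isna().any(axis=1).mean()

# Drop every row that has a missing value in any column.
cleaned = trips.dropna().reset_index(drop=True)

print(f"{missing_share:.0%} of rows had missing values; {len(cleaned)} rows kept")
```

Checking the share of affected rows before dropping them is a cheap way to confirm that the loss is small enough to accept.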

2. Trip Duration

The box plot of trip time shows that there are outliers.

The NYC Taxi & Limousine Commission states that a trip may not last longer than 12 hours in any 24-hour period. To comply with this rule, trip duration is calculated and any outliers are removed from the dataset.

The timestamps in the data have the format "YYYY-MM-DD HH:MM:SS". They are converted to Unix time in order to compute trip duration (trip time) and speed. The pickup times in Unix time are also used when binning the data.
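The conversion and the 12-hour rule could be sketched as follows with pandas. The column names (pickup_datetime, dropoff_datetime, trip_distance) and the sample rows are assumptions for illustration, and the timestamps are treated as timezone-naive.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the source's.
trips = pd.DataFrame({
    "pickup_datetime": ["2019-01-01 00:00:00", "2019-01-01 08:00:00"],
    "dropoff_datetime": ["2019-01-01 00:30:00", "2019-01-01 21:00:00"],
    "trip_distance": [5.0, 3.0],  # miles
})

EPOCH = pd.Timestamp("1970-01-01")
ONE_SECOND = pd.Timedelta(seconds=1)

def to_unix(column):
    """Convert 'YYYY-MM-DD HH:MM:SS' strings to whole seconds since the epoch."""
    parsed = pd.to_datetime(column, format="%Y-%m-%d %H:%M:%S")
    return (parsed - EPOCH) // ONE_SECOND

trips["pickup_unix"] = to_unix(trips["pickup_datetime"])
trips["dropoff_unix"] = to_unix(trips["dropoff_datetime"])

# Duration in minutes and average speed in miles per hour.
trips["duration_min"] = (trips["dropoff_unix"] - trips["pickup_unix"]) / 60
trips["speed_mph"] = trips["trip_distance"] / (trips["duration_min"] / 60)

# Keep only trips with a positive duration of at most 12 hours.
MAX_DURATION_MIN = 12 * 60
valid = trips[(trips["duration_min"] > 0) & (trips["duration_min"] <= MAX_DURATION_MIN)]
```

In this sample the second trip lasts 13 hours, so only the first trip survives the filter.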

3. Speed

Image: Boxplot which shows outliers.

From the above observations, it is evident that there are outliers in the data. Both the box plot and the percentile calculation confirm this.
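A percentile check of this kind might look like the sketch below. The speed values and the 60 mph cut-off are hypothetical; the text does not state the threshold that was actually used.

```python
import pandas as pd

# Hypothetical speeds in miles per hour, with one obviously bad value.
speeds = pd.Series(
    [8.0, 10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 15.0, 16.0, 950.0],
    name="speed_mph",
)

# Values at selected percentiles; a sharp jump near the top suggests outliers.
percentiles = speeds.quantile([0.5, 0.9, 0.99, 1.0])
print(percentiles)

# Assumed cut-off, chosen after inspecting the percentiles above.
MAX_SPEED_MPH = 60.0
filtered = speeds[(speeds > 0) & (speeds <= MAX_SPEED_MPH)]
```

Comparing the 99th percentile with the maximum is a quick way to see whether the extreme values are isolated errors or part of the real distribution.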

Image: Boxplot after removing the outliers

Upon calculation, the average speed in New York City is 12.45 miles per hour, so a cab driver can travel about 2 miles every 10 minutes on average.
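The 10-minute figure follows directly from the average speed, as this short check shows:

```python
avg_speed_mph = 12.45  # average speed reported in the text

# Distance covered in 10 minutes at that average speed.
miles_per_10_min = avg_speed_mph * 10 / 60

print(f"{miles_per_10_min:.3f} miles per 10 minutes")
```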

4. Payment Types

From the above observation, one can conclude that there are no outliers.

As per the NYC Taxi & Limousine Commission, payment type is a numeric code signifying how the passenger paid for the trip. It takes one of 6 values:

1= Credit card
2= Cash
3= No charge
4= Dispute
5= Unknown
6= Voided trip
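One way to check this column is to map the codes to their labels and look for values outside the list. This is a sketch assuming a pandas DataFrame with a payment_type column; the sample rows are made up.

```python
import pandas as pd

# Codes and labels as listed above.
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# Hypothetical sample; the column name is an assumption.
trips = pd.DataFrame({"payment_type": [1, 2, 2, 4, 1]})

# Rows whose code is not one of the six known values would be outliers.
invalid = trips[~trips["payment_type"].isin(list(PAYMENT_TYPES))]

# Human-readable labels and their frequencies.
trips["payment_label"] = trips["payment_type"].map(PAYMENT_TYPES)
counts = trips["payment_label"].value_counts()
```

An empty `invalid` frame corresponds to the conclusion that this column has no outliers.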

5. Total Pay Amount

The box plot above indicates outliers in the total amounts.

Upon closer inspection of the distribution, one can confirm the presence of outliers.

These outliers make up around 0.4% of the dataset. As a result, they are dropped.
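A sketch of how the share of outliers could be measured before dropping them. The bounds, the column name total_amount, and the sample values are all assumptions; the text does not state the cut-offs that were used.

```python
import pandas as pd

# Hypothetical sample with one negative and one implausibly large amount.
trips = pd.DataFrame(
    {"total_amount": [12.5, 8.0, -3.0, 25.0, 4000.0, 15.3, 9.8, 30.0]}
)

# Assumed bounds for a plausible total amount.
MIN_AMOUNT, MAX_AMOUNT = 0.0, 1000.0

is_valid = trips["total_amount"].gt(MIN_AMOUNT) & trips["total_amount"].le(MAX_AMOUNT)

# Share of rows that would be dropped, then the cleaned data.
dropped_share = 1 - is_valid.mean()
cleaned = trips[is_valid]
```

In this toy sample a quarter of the rows are dropped; on the real data the text reports roughly 0.4%.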

6. Pickup and Drop Locations

Upon closer inspection, all pickup and drop-off location values lie within the valid range of zones 1 to 265.
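The range check can be expressed in a few lines. The column names pickup_zone and dropoff_zone and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of zone identifiers.
trips = pd.DataFrame({
    "pickup_zone": [1, 132, 265, 48],
    "dropoff_zone": [236, 265, 7, 1],
})

MIN_ZONE, MAX_ZONE = 1, 265

# between() is inclusive on both ends by default.
in_range = (
    trips["pickup_zone"].between(MIN_ZONE, MAX_ZONE)
    & trips["dropoff_zone"].between(MIN_ZONE, MAX_ZONE)
)
out_of_range_count = int((~in_range).sum())
```

A count of zero means no rows need to be removed for this column pair.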

7. Pickup and Drop Time

Check for outliers in pickup and drop off times.

It is evident from the data that there are a few outliers in the pickup and drop-off times.

As part of data cleaning, these incorrect records are removed. Only data within the required time period is considered valid, and data from any other period is dropped.
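A sketch of such a time-window filter, assuming a study period of January 2019. The period, the column names, and the sample rows are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the first and last rows fall far outside the period.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2008-12-31 23:59:00", "2019-01-15 10:00:00",
        "2019-01-20 18:30:00", "2084-11-04 12:00:00",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2009-01-01 00:10:00", "2019-01-15 10:25:00",
        "2019-01-20 18:55:00", "2084-11-04 12:20:00",
    ]),
})

# Assumed study period, as a half-open interval [START, END).
START, END = pd.Timestamp("2019-01-01"), pd.Timestamp("2019-02-01")

def in_window(column):
    """True where the timestamp lies inside the study period."""
    return (column >= START) & (column < END)

cleaned = trips[in_window(trips["pickup_datetime"]) & in_window(trips["dropoff_datetime"])]
```

Note that this sketch also drops trips that start inside the period but end after it; whether that is desirable depends on the analysis.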

Final Dataset after Data Cleaning

The final NYC Taxi cleaned dataset can be found here.

Weather Data

Cleaning Missing Data & Selecting Required Columns

Above Fig: Shows the frequency count of missing values in the dataset.

Below Fig: Shows the frequency of missing values after data cleaning.

Missing Values

The frequency plot on the right depicts the missing values in the weather dataset.
Some of these columns have more than 90% of their data missing.
Moreover, these columns do not provide valuable information for taxi prediction.
With this in mind, these columns are dropped.
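Dropping columns by their share of missing values could be sketched as follows. The weather column names and values are made up; only the 90% threshold comes from the text.

```python
import pandas as pd

# Hypothetical weather sample; "sparse_sensor" is entirely missing.
weather = pd.DataFrame({
    "temperature": [30.0, 31.5, 29.0, 28.0, 33.0, 35.0, 34.0, 30.5, 29.5, 31.0],
    "precipitation": [0.0, 0.1, None, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0],
    "sparse_sensor": [None] * 10,
})

# Fraction of missing values per column.
missing_share = weather.isna().mean()

# Drop columns where more than 90% of the values are missing.
MAX_MISSING_SHARE = 0.9
to_drop = missing_share[missing_share > MAX_MISSING_SHARE].index
cleaned = weather.drop(columns=to_drop)
```

Columns that are mostly complete but irrelevant to the prediction task would still need to be removed by name in a separate step.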

Final Dataset after Data Cleaning

The final Weather cleaned dataset can be found here.
