Once the data is collected, the next major job is preprocessing it. Preprocessing involves several steps, including data cleaning and data transformation, and it is where most data professionals spend the majority of their time.
STATIC DATA
The files downloaded from the Bureau of Transportation Statistics were initially stored together as monthly CSV files.
Snapshot of Raw Data of Oct 2022.csv
The next step is to merge the files from both years for subsequent analysis and preprocessing. After concatenation, the dataframe contains 12,442,853 rows, as shown in the snapshot below. This dataset still includes unwanted records, null values, missing values, outliers, and so on, so the actual cleaning process can now begin.
Snapshot of Raw data of concatenated dataframe:
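A minimal sketch of the concatenation step, assuming the monthly files sit in a single folder (the path and file pattern are illustrative, not the exact ones used here):

    import glob
    import pandas as pd

    # Assumed folder layout for the monthly BTS CSV files (illustrative path)
    csv_files = sorted(glob.glob("data/bts/*.csv"))

    # Read each monthly file and stack them into a single dataframe
    monthly_frames = [pd.read_csv(path) for path in csv_files]
    flights = pd.concat(monthly_frames, ignore_index=True)

    print(flights.shape)  # about 12,442,853 rows for the two years combined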
Before cleaning, it is better to remove unwanted columns: they take up extra space, and dropping them leaves less data to clean. The columns 'CANCELLED', 'CANCELLATION_CODE' and 'DIVERTED' are removed, since the main focus is on delays. The data is then filtered to keep only the rows that have a weather delay, as that is the next point of focus. After dropping these columns and filtering, the number of rows decreases significantly.
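The column drop and the weather-delay filter could look roughly like this, continuing from the dataframe above; whether the filter keeps non-null or strictly positive weather delays is an assumption:

    # Drop cancellation/diversion columns, since the focus is on delays
    flights = flights.drop(columns=['CANCELLED', 'CANCELLATION_CODE', 'DIVERTED'])

    # Keep only the flights that actually report a weather delay
    flights = flights[flights['WEATHER_DELAY'].notna() & (flights['WEATHER_DELAY'] > 0)]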
After filtering, the columns were checked for null values, and only two columns contained any. The LATE_AIRCRAFT_DELAY column was removed, since about 89% of its values were null. In a second round of removing unwanted columns, 'CRS_DEP_TIME', 'DEP_TIME', 'CRS_ARR_TIME', 'ARR_TIME', 'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'TAXI_OUT', 'TAXI_IN', 'WHEELS_OFF' and 'WHEELS_ON' were also dropped, as the focus is not on these variables.
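A sketch of the null check and the second round of column removal, continuing from the same dataframe:

    # Percentage of null values per column
    null_share = flights.isna().mean() * 100
    print(null_share[null_share > 0])

    # LATE_AIRCRAFT_DELAY is roughly 89% null, so it is dropped together with the
    # schedule/airtime columns that fall outside the scope of the analysis
    flights = flights.drop(columns=[
        'LATE_AIRCRAFT_DELAY',
        'CRS_DEP_TIME', 'DEP_TIME', 'CRS_ARR_TIME', 'ARR_TIME',
        'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
        'TAXI_OUT', 'TAXI_IN', 'WHEELS_OFF', 'WHEELS_ON',
    ])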
Before correcting data types
Once there are no null values, the next focus is to correct the datatypes in the dataframe. The date was converted to a pandas datetime object, and three new columns (month, day and week) were derived from it and added to the dataframe.
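A short sketch of the datatype fix, assuming the BTS date column is named FL_DATE and that "week" means the ISO week number:

    # Convert the flight date to a pandas datetime
    flights['FL_DATE'] = pd.to_datetime(flights['FL_DATE'])

    # Derive month, day and week columns from the date
    flights['MONTH'] = flights['FL_DATE'].dt.month
    flights['DAY'] = flights['FL_DATE'].dt.day
    flights['WEEK'] = flights['FL_DATE'].dt.isocalendar().week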
After correcting data types and adding columns:
WEB SCRAPING
SNAPSHOT OF WEBSCRAPED DATA
The data scraped from the website was already clean and did not require any further cleaning. A function was written to join this data with the flight delay data, and the two dataframes were merged.
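The join helper might look roughly like the sketch below, assuming the scraped table (scraped_states here, an illustrative name) has the columns state_abbr, state, latitude and longitude, and that the flight data carries matching abbreviation columns for each side:

    def add_state_info(flights_df, states_df, side):
        # side is 'origin' or 'destination'; prefix the scraped columns so the
        # origin and destination joins do not collide
        renamed = states_df.rename(columns={
            'state_abbr': f'{side}_state_abbr',
            'state': f'{side}_state',
            'latitude': f'{side}_latitude',
            'longitude': f'{side}_longitude',
        })
        return flights_df.merge(renamed, on=f'{side}_state_abbr', how='left')

    flights = add_state_info(flights, scraped_states, 'origin')
    flights = add_state_info(flights, scraped_states, 'destination')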
After joining the state, latitude and longitude, some nulls reappear. On further analysis, two state abbreviations present in the flight dataframe are not present in the scraped data.
Further investigation showed that VI is the U.S. Virgin Islands and TT refers to the Trust Territories.
The other problem is that TT (Trust Territories) spans a range of latitudes and longitudes. The proportion of TT records was checked and found to be very small, so the records with state TT were removed.
The coordinates of VI were looked up, and the corresponding nulls were filled with those values.
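That cleanup could be sketched as below, using the column names from the join above and approximate Virgin Islands coordinates:

    # TT accounts for only a tiny share of the rows, so those records are dropped
    flights = flights[(flights['origin_state_abbr'] != 'TT') &
                      (flights['destination_state_abbr'] != 'TT')]

    # Fill the missing coordinates for the U.S. Virgin Islands (approximate values)
    vi_lat, vi_lon = 18.34, -64.90
    for side in ['origin', 'destination']:
        is_vi = flights[f'{side}_state_abbr'] == 'VI'
        flights.loc[is_vi, [f'{side}_latitude', f'{side}_longitude']] = [vi_lat, vi_lon]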
A final check then shows that no nulls remain.
Now that we have the origin state and destination state, the columns origin_state_abbr and destination_state_abbr can be removed to avoid redundancy.
Now, the goal is to bring in API data.
The API limit is 10,000 requests per day.
Many records repeat the same location and date, so the same weather should not be fetched again and again (e.g. on the 20th there may be multiple flights departing from Denver, but the weather for Denver is needed only once).
No flight time is given in the dataset, so the average temperature for the day at each location is calculated instead.
The API returns a reading every 2 hours, so the average of those readings must be computed for each request.
A function to calculate the average temperature was written, the work was distributed across multiple systems, and the resulting temperature data was merged back in at the end, as sketched below.
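A sketch of the deduplication and averaging, where fetch_weather() is a hypothetical wrapper around the actual API call and the key column names are assumptions carried over from the earlier sketches:

    def average_daily_temperature(lat, lon, date):
        # fetch_weather() is a hypothetical helper that returns the list of
        # 2-hourly temperature readings for the given location and date
        readings = fetch_weather(lat, lon, date)
        return sum(readings) / len(readings)

    # Call the API only once per unique (date, location) pair to stay within the
    # 10,000-requests-per-day limit, then merge the averages back onto the flights
    keys = ['FL_DATE', 'origin_latitude', 'origin_longitude']
    unique_locations = flights[keys].drop_duplicates()
    unique_locations['avg_temperature'] = unique_locations.apply(
        lambda row: average_daily_temperature(
            row['origin_latitude'], row['origin_longitude'], row['FL_DATE']),
        axis=1,
    )
    flights = flights.merge(unique_locations, on=keys, how='left')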
API
After combining the data sets, a check is conducted for null values, and given their absence, the subsequent phase of data transformation can be initiated.
The following data transformations were applied to the data; a short code sketch follows below.
The delays (weather_delay, nas_delay, security_delay, arrival_delay, departure_delay, carrier_delay), which were initially in minutes, are converted into hours.
The temperatures are binned based on US temperature ranges.
The columns are renamed to more readable, consistent names.
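A compact sketch of these transformations, shown with the rename first so the later steps can use the new names; the raw column names on the left of the mapping, the bin edges and the labels are all illustrative assumptions:

    # Rename columns to the friendlier names used in this write-up
    # (the raw names on the left are assumptions about the BTS export)
    flights = flights.rename(columns={
        'ARR_DELAY': 'arrival_delay',
        'DEP_DELAY': 'departure_delay',
        'WEATHER_DELAY': 'weather_delay',
        'NAS_DELAY': 'nas_delay',
        'SECURITY_DELAY': 'security_delay',
        'CARRIER_DELAY': 'carrier_delay',
    })

    # Convert the delay columns from minutes to hours
    delay_cols = ['weather_delay', 'nas_delay', 'security_delay',
                  'arrival_delay', 'departure_delay', 'carrier_delay']
    flights[delay_cols] = flights[delay_cols] / 60

    # Bin the average temperature into broad US temperature ranges
    # (edges in degrees Fahrenheit, labels illustrative)
    flights['temperature_range'] = pd.cut(
        flights['avg_temperature'],
        bins=[-60, 32, 50, 70, 90, 130],
        labels=['freezing', 'cold', 'mild', 'warm', 'hot'],
    )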