Limitations of the data set
· Missing data
Missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
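As a small illustration, a missing-value count is usually the first check before choosing a treatment. The sketch below uses pandas rather than RapidMiner, and the flight numbers and column name TOT_PAX_CT stand in for the real data:

```python
import pandas as pd
import numpy as np

# Hypothetical flight records; TOT_PAX_CT is missing for one row.
df = pd.DataFrame({
    "FLT_NUM": ["MH001", "MH002", "MH003"],
    "TOT_PAX_CT": [150, np.nan, 98],
})

# Count missing values per column before deciding on a treatment.
missing_per_column = df.isna().sum()
print(missing_per_column["TOT_PAX_CT"])  # 1 missing value
```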
· Inconsistent data
The data in the dataset is not necessarily accurate. Survey respondents were given the opportunity to elaborate on why they thought their data might be wrong. They might have misunderstood a question and supplied incorrect data, which can lead to wrong predictions when the dataset is used.
· Noisy Data
Noisy data is meaningless data; the term is often used as a synonym for corrupt data. Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
· Incomplete data
Data can be incomplete for many different reasons: values may simply be missing, or the data may be censored or truncated. Incomplete data from missing values arises when a data set simply lacks some values. Data is considered censored when the number of values in a set is known but the values themselves are unknown. Data is said to be truncated when some values in a set are excluded altogether.
· Useless data attribute
Some of the attributes in the dataset are not needed for the analysis. These attributes can be removed from the dataset, because they have no effect on the conclusions or the subsequent results. For example, every flight in this selected dataset belongs to the airline MH, so the airline attribute carries no information.
Treatment
· Data cleaning
Data cleaning is a very important step in data mining. It is the process of altering data in a given storage resource to make sure that it is accurate and correct. Data cleaning includes filling in missing values, identifying outliers and smoothing out noisy data, correcting inconsistent data, and resolving redundancy caused by data integration. As a result, the data mining results become more accurate.
· Data reduction
A dataset may store many attributes and millions of records. Complex data mining may take a very long time to run on the complete data set. Data reduction is therefore used to obtain a reduced representation of the dataset that is much smaller in volume but still produces the same, or almost the same, analytic results. There are many data reduction strategies, such as data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and discretization with concept hierarchy generation. The strategy used for this dataset is dimensionality reduction.
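In its simplest form, the dimensionality reduction applied here amounts to dropping attributes that carry no information, such as constant or entirely empty columns. A minimal pandas sketch, with hypothetical column values:

```python
import pandas as pd

df = pd.DataFrame({
    "CARRIER": ["MH", "MH", "MH"],      # constant: every flight is MH
    "TOT_PAX_CT": [150, 0, 98],
    "DIV_SUFFIX": [None, None, None],   # entirely empty
})

# Drop columns that are constant or completely empty; they cannot
# affect any downstream analysis, so removing them shrinks the data
# without changing the analytic results.
useless = [c for c in df.columns
           if df[c].nunique(dropna=True) <= 1]
reduced = df.drop(columns=useless)
```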
Objective 1
RapidMiner is a data science platform that unites data preprocessing, machine learning, and predictive model deployment. It is very useful because it provides many data mining techniques and helps with data analysis.
The dataset is imported into RapidMiner as shown in the figure above. Several problems were found in the dataset: missing data, inconsistent data, noisy data, incomplete data, and useless data attributes.
To eliminate these problems, one of the functions of RapidMiner, which is Turbo Prep, is used. Turbo Prep is designed to make data preparation easier. It provides a user interface where your data is always visible front and center, where you can make changes step-by-step and instantly see the results, with a wide range of supporting functions to prepare your data for model-building or presentation. Turbo Prep's supporting functions are divided into five broad categories:
· Transform - these functions help to create useful subsets of data (Filter, Range, Sample, Remove) or to modify the data in individual columns (Replace).
· Cleanse - these functions help with missing values, duplicates, normalization and binning.
· Generate - these functions help you generate new data columns from existing data columns.
· Pivot - these functions simplify the task of creating summary tables (pivot tables) from your data.
· Merge - these functions help you to combine two or more data sets (Join).
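The Cleanse operations above (duplicates, normalization, binning) can be mirrored outside Turbo Prep as well. A small pandas sketch, using a hypothetical DEP_DELAY column of delay minutes:

```python
import pandas as pd

df = pd.DataFrame({"DEP_DELAY": [5, 5, 30, 120]})

# Remove duplicate rows (Cleanse: duplicates).
df = df.drop_duplicates().reset_index(drop=True)

# Min-max normalization to [0, 1] (Cleanse: normalization).
lo, hi = df["DEP_DELAY"].min(), df["DEP_DELAY"].max()
df["DELAY_NORM"] = (df["DEP_DELAY"] - lo) / (hi - lo)

# Equal-width binning into 3 buckets (Cleanse: binning).
df["DELAY_BIN"] = pd.cut(df["DEP_DELAY"], bins=3,
                         labels=["short", "medium", "long"])
```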
Two of Turbo Prep’s functions, Transform and Cleanse, are used to clean the dataset.
An example of missing data in the dataset is shown above: values of the OP_SUFFIX attribute appear as “?”.
The Remove function is used to drop all columns that are meaningless to our research, leaving only the data that is relevant. The Remove function in RapidMiner is shown above.
Some values of the TOT_PAX_CT attribute are missing and appear as “?”.
The missing values in TOT_PAX_CT are replaced with zero, because such a flight is considered a cargo plane and carries no passengers.
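This fill-with-zero treatment can be sketched in pandas (the flight numbers are hypothetical; TOT_PAX_CT is from the data set):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "FLT_NUM": ["MH001", "MH002"],
    "TOT_PAX_CT": [150, np.nan],  # missing count: assumed cargo flight
})

# A missing passenger count is treated as a cargo flight carrying
# zero passengers, so fill with 0 rather than dropping the row.
df["TOT_PAX_CT"] = df["TOT_PAX_CT"].fillna(0)
```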
Besides, we also generate a new column for distance, calculated for each flight from the departure airport to the arrival airport.
We also generate another column for arrival delay: whenever the arrival delay time is greater than zero, the flight is marked as delayed (“D”).
We use the Generate function in RapidMiner to create the new columns stated above.
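The two generated columns can be sketched in pandas. Note the assumptions: the column names, the airport codes, and the coordinate lookup table are all hypothetical, and the great-circle (haversine) formula is one plausible way to compute the distance, not necessarily the one used in the original work:

```python
import math
import pandas as pd

# Hypothetical airport coordinates; the real data would supply these.
COORDS = {"KUL": (2.7456, 101.7099), "PEN": (5.2971, 100.2770)}

def haversine_km(dep, arr):
    """Great-circle distance between two airports in kilometres."""
    lat1, lon1 = map(math.radians, COORDS[dep])
    lat2, lon2 = map(math.radians, COORDS[arr])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))  # Earth radius ~6371 km

df = pd.DataFrame({
    "DEP_ARP": ["KUL"], "ARR_ARP": ["PEN"], "ARR_DELAY_MIN": [12],
})
df["DISTANCE_KM"] = [haversine_km(d, a)
                     for d, a in zip(df["DEP_ARP"], df["ARR_ARP"])]
# Flag delayed arrivals: "D" if arrival delay is above zero, else "N".
df["DELAY_FLAG"] = df["ARR_DELAY_MIN"].apply(lambda m: "D" if m > 0 else "N")
```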
Furthermore, we convert the date format and separate it into different columns, discarding the parts that are useless. For example, the month and year remain unchanged throughout the dataset, so we delete those columns and keep only the date, period, and week columns.
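The date-splitting step looks roughly like the following pandas sketch (the FLT_DATE column name and the sample dates are hypothetical; only the varying components, day and week, are kept):

```python
import pandas as pd

df = pd.DataFrame({"FLT_DATE": ["2017-01-03", "2017-01-15"]})

# Parse the raw date and keep only the components that vary across
# the data set; month and year are constant here, so they are dropped.
parsed = pd.to_datetime(df["FLT_DATE"])
df["DAY"] = parsed.dt.day
df["WEEK"] = parsed.dt.isocalendar().week
df = df.drop(columns=["FLT_DATE"])
```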
We also remove all columns that contain no values at all, for example DIV_SUFFIX and the others shown in the figure above.
These are the results after the transformation. After cleansing with the treatments above, the data set is consistent with other similar data sets in the system. This is the result of using Turbo Prep’s functions to clean the data.
The statistics of the dataset after cleansing are shown above. The missing count is “0” for every attribute, which means the dataset is clean and ready for the data mining task.
Objective 2
The dataset is imported into RapidMiner as shown in the figure above. As before, several problems were found: missing data, inconsistent data, noisy data, incomplete data, and useless data attributes.
As in the previous task, we use Turbo Prep to process the dataset.
As can be seen from the diagram, the dataset contains many attributes that do not have any values.
The dataset also has many attributes with a large number of missing values. For example, as the diagram above shows, LASTEST_DIV_ARP_CD contains 6029 missing values, meaning 99% of its data is missing.
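Dropping such attributes can be expressed as a simple missing-ratio threshold. In this pandas sketch the column is 75% missing rather than 99% and the threshold is an illustrative 0.5; the principle is the same:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "FLT_NUM": ["MH001", "MH002", "MH003", "MH004"],
    "LASTEST_DIV_ARP_CD": [np.nan, np.nan, np.nan, "PEN"],  # 75% missing
})

# Drop any attribute whose share of missing values exceeds a threshold.
THRESHOLD = 0.5
ratios = df.isna().mean()  # fraction of missing values per column
df = df.drop(columns=ratios[ratios > THRESHOLD].index)
```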
We therefore use RapidMiner to remove all the useless attributes.
Besides, we also convert the date format and separate it into different columns. As in the first task, we discard the useless parts: the month and year remain unchanged throughout the dataset, so we delete those columns and keep only the date, period, and week columns.
Based on the diagram above, 588 values of TOT_PAX_CT are missing and appear as “?”.
The missing values in TOT_PAX_CT are then replaced with zero, because such a flight is considered a cargo plane and carries no passengers.
From the diagram, we can see that there are 5 columns of delay codes and 5 columns of delay times.
We transform these columns into 9 categories, where each resulting column represents one type of delay and holds the average time of that delay.
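This reshaping can be sketched as a wide-to-long conversion followed by a pivot. The column names, delay codes, and the two (code, minutes) pairs per flight are hypothetical stand-ins for the real five pairs and nine categories:

```python
import pandas as pd
import numpy as np

# One row per flight; each flight reports up to two (code, minutes)
# delay pairs in this toy version (the real data has five pairs).
df = pd.DataFrame({
    "FLT_NUM": ["MH001", "MH002"],
    "DELAY_CD_1": ["WX", "TECH"], "DELAY_MIN_1": [20, 45],
    "DELAY_CD_2": ["TECH", np.nan], "DELAY_MIN_2": [10, np.nan],
})

# Reshape the (code, minutes) pairs to long form, then pivot so each
# delay category becomes one column holding the mean delay time.
pairs = []
for i in (1, 2):
    part = df[["FLT_NUM", f"DELAY_CD_{i}", f"DELAY_MIN_{i}"]].copy()
    part.columns = ["FLT_NUM", "CODE", "MINUTES"]
    pairs.append(part)
long_form = pd.concat(pairs).dropna(subset=["CODE"])
avg_by_category = long_form.pivot_table(index="FLT_NUM", columns="CODE",
                                        values="MINUTES", aggfunc="mean")
```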
These are the results after the transformation for task 2. After cleansing with the treatments above, the data set is consistent with other similar data sets in the system. This is the result of using Turbo Prep’s functions to clean the data.
The statistics of the dataset after cleansing are shown above. The missing count is “0” for every attribute, which means the dataset is clean and ready for the data mining task.