Data preprocessing is one of the crucial steps in data mining, carried out before data visualization, model development, model evaluation and the data product. This step is critical because it is where inconsistent data and noise are removed. Good preprocessing improves the quality of the data mining results and the accuracy of the machine learning model. Several preprocessing techniques can be applied at this step so that the results produced by the machine learning model are as accurate as possible.
We did not do any basic cleaning in Excel, as the dataset had no inconsistent or missing data. We also did not create any new attributes by deriving or merging existing ones. We therefore preprocessed the dataset directly in RapidMiner. A few preprocessing steps were carried out before proceeding to machine learning.
The dataset given to us was highly imbalanced, so we needed to balance it to improve the accuracy of the predictive model. We used the SMOTE operator in RapidMiner to upsample the underrepresented distress level until the classes were balanced. The role was set to Leveldistress, as this is the attribute we were upsampling to balance.
This is Distresslevel after upsampling with the SMOTE operator. The low distress level remained at 607 rows, while the other distress level grew from 95 rows to 607 rows after resampling. The resampled dataset has 1214 rows, whereas the original dataset had 702 rows, so 512 new rows were added.
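To make the resampling step more concrete, the following is a minimal pure-Python sketch of the SMOTE idea: each synthetic row is placed on the line segment between a minority row and one of its nearest minority neighbours. It is not the RapidMiner operator, whose internals and defaults may differ. The function name `smote`, the neighbour count `k=5`, and the two-attribute toy data are assumptions made for illustration; only the row counts (95 and 607) come from our dataset.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical toy data: 95 minority rows with two numeric attributes.
toy_rng = random.Random(1)
minority = [[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(95)]
majority_count = 607

# Add just enough synthetic rows for the minority class to match the majority.
new_rows = smote(minority, majority_count - len(minority))
balanced_minority = minority + new_rows

print(len(new_rows), len(balanced_minority), majority_count + len(balanced_minority))
# prints: 512 607 1214
```

The printed counts match the figures reported above: 512 synthetic rows, 607 rows per class, and 1214 rows in total.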
We standardized all attributes because we wanted to test a standardized dataset for predictive data mining. For standardization we used the Z-transformation, which rescales the data to have a mean of zero and a variance of one. This is especially helpful because the attributes are measured on different scales, for instance age and salary.
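The Z-transformation can be sketched in a few lines of Python: each value has the attribute mean subtracted and is then divided by the attribute standard deviation. The sample values for age and salary below are made up for illustration, and the use of the population standard deviation is an assumption; RapidMiner may use a slightly different estimator.

```python
import statistics


def z_transform(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical values on very different scales.
ages = [22, 30, 41, 55, 37]
salaries = [2500.0, 4200.0, 6100.0, 9000.0, 5200.0]

z_ages = z_transform(ages)
z_salaries = z_transform(salaries)

# After the transformation both attributes share the same scale.
for name, column in (("age", z_ages), ("salary", z_salaries)):
    print(name, round(statistics.fmean(column), 6), round(statistics.pvariance(column), 6))
```

After the transformation, a one-unit difference means one standard deviation for every attribute, so salary no longer dominates age simply because its raw numbers are larger.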
For discretization, we discretized age into bins by range. The minimum was set to 22 and the maximum to 55, and the number of bins was set to seven. The age ranges were assigned to bins as follows.
20-25 = 1
26-30 = 2
31-35 = 3
36-40 = 4
41-45 = 5
46-50 = 6
51-55 = 7
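The bin table above can be expressed as a small lookup function. This is only a sketch that reproduces the listed ranges, not the RapidMiner discretization operator itself. The function name `age_bin` is hypothetical, and it assumes ages are whole numbers between 20 and 55.

```python
from bisect import bisect_left

# Upper edge of each bin, taken from the table above (bins 1 to 7).
UPPER_EDGES = [25, 30, 35, 40, 45, 50, 55]
LOWEST_AGE = 20


def age_bin(age):
    """Return the bin number (1-7) for an age, following the table above."""
    if age < LOWEST_AGE or age > UPPER_EDGES[-1]:
        raise ValueError(f"age {age} is outside the binned range 20-55")
    # bisect_left finds the first upper edge that is >= age.
    return bisect_left(UPPER_EDGES, age) + 1


print([age_bin(a) for a in (22, 25, 26, 40, 41, 55)])
# prints: [1, 1, 2, 4, 5, 7]
```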