Data preprocessing is a crucial step in data mining: it prepares a dataset that can improve the performance of the developed model. Several steps may be taken, such as data extraction, data selection, data smoothing and data transformation. In this section, you are expected to describe the data preprocessing steps suited to your chosen dataset. You should explain WHY you perform each step, and the state of the data before and after each step MUST be provided. An example is provided below.
Example:
Since the dataset contains data from 5 years (2010-2015) and comprises data from various states and localities, a data extraction technique must be performed. This can be done using spreadsheet software such as Microsoft Excel, a visual analytics tool such as Tableau, or by writing code in R or Python. The dataset is processed as follows:
Separate data annually.
Extract data from the selected state: This study focuses on records from Selangor, so that a comparison against the benchmark work explained in the Introduction section can be performed. Besides, Selangor is one of the states where dengue is most prevalent.
Clean the data: As can be seen from the data exploration stage, some of the weeks have null values. There are 2 options to clean this data: either smooth by mean, so that each missing value is replaced with the mean of the recent weeks' records, or keep the cells and assign them the value 0. For the purpose of the experiment, and to compare the effect of these data preprocessing techniques, both settings will be performed. The following video shows an example of extracting the data using a Pivot Table and replacing the missing values with 0.
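For those who prefer code to a spreadsheet, the three steps above can be sketched in Python with pandas. This is only an illustration under stated assumptions: the column names (year, week, state, cases), the sample values, and the 2-week window used for the mean are all hypothetical and must be adapted to the actual dataset.

```python
import pandas as pd

# Hypothetical weekly records; column names and values are assumptions,
# not the real dataset schema.
records = pd.DataFrame({
    "year":  [2010, 2010, 2010, 2010, 2010, 2011, 2011],
    "week":  [1, 2, 3, 4, 1, 1, 2],
    "state": ["Selangor", "Selangor", "Selangor", "Selangor",
              "Johor", "Selangor", "Selangor"],
    "cases": [10.0, 14.0, None, 12.0, 5.0, None, 8.0],
})

# Extract data from the selected state.
selangor = records[records["state"] == "Selangor"]

# Separate the data annually: one DataFrame per year.
by_year = {
    int(year): frame.reset_index(drop=True)
    for year, frame in selangor.groupby("year")
}

def fill_with_zero(frame):
    """Option 2: keep the cell but assign it the value 0."""
    return frame.assign(cases=frame["cases"].fillna(0))

def fill_with_recent_mean(frame, window=2):
    """Option 1: replace a missing value with the mean of the preceding weeks.

    `window` (assumed to be 2 here) is the number of preceding weeks averaged.
    A missing value with no earlier weeks stays missing.
    """
    recent_mean = frame["cases"].shift(1).rolling(window, min_periods=1).mean()
    return frame.assign(cases=frame["cases"].fillna(recent_mean))

# Pre-and-post view of one year, for both cleaning settings.
print(by_year[2010])
print(fill_with_zero(by_year[2010]))
print(fill_with_recent_mean(by_year[2010]))
```

Printing the frame before and after each function gives the pre-and-post evidence this section asks for. Note that a missing value in the first week of a year has no recent weeks to average, so the mean-based option leaves it empty and a separate decision is needed for such cases.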
The next step of the data preprocessing is to extract data for a specific local council. This can be done by summing the number of cases in that council for each week. Then, data integration is conducted: the weekly temperature and rainfall data are added as features/attributes, so that the relationship between these features and the number of cases can be modelled using any prediction or classification model.
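The council-level extraction and the data integration step can be sketched in the same way. Again, this is a minimal example under assumptions: the council names, the column names (week, council, cases, temperature_c, rainfall_mm) and all values are placeholders, and the weather data is assumed to have exactly one row per week.

```python
import pandas as pd

# Hypothetical case records: several rows per council per week.
cases = pd.DataFrame({
    "week":    [1, 1, 1, 2, 2],
    "council": ["Council A", "Council A", "Council B", "Council A", "Council B"],
    "cases":   [3, 4, 6, 5, 2],
})

# Hypothetical weekly weather data: one row per week.
weather = pd.DataFrame({
    "week":          [1, 2],
    "temperature_c": [27.5, 28.1],
    "rainfall_mm":   [55.0, 12.3],
})

# Extract one council and sum its cases for each week.
council_weekly = (
    cases[cases["council"] == "Council A"]
    .groupby("week", as_index=False)["cases"]
    .sum()
)

# Data integration: add temperature and rainfall as features, joined on week.
# validate="one_to_one" raises an error if a week appears more than once
# on either side, which would silently duplicate rows otherwise.
features = council_weekly.merge(
    weather, on="week", how="left", validate="one_to_one"
)

print(council_weekly)  # before integration
print(features)        # after integration
```

A left join is used so that every week with case data is kept even if its weather record is missing; such weeks would then show empty temperature and rainfall values and need the same kind of cleaning decision as the missing case counts.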