Data pre-processing is one of the most important steps for enhancing the quality of the data. Pre-processing is a data mining technique used to transform raw data into a useful and efficient format. The steps involved in data pre-processing include data selection, data integration, data transformation, normalization, and data reduction.
Remove the low-quality attributes "Time", "Component", "Event name", "Description", "Origin" and "IP address". We removed these attributes because they have too many possible values, and we do not need that data for our analysis.
Pivot the data on the attributes "Event context" and "StudentID".
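The pivot step can be sketched in pandas. This is a minimal illustration, not the actual pipeline: the column names and event values are assumed, and the real data has many more rows.

```python
import pandas as pd

# Hypothetical event-log rows, one row per logged event; the column
# names mirror the report but the values are made up.
log = pd.DataFrame({
    "StudentID": [1, 1, 2],
    "Event context": ["Quiz", "Forum", "Quiz"],
})

# Pivot so each student becomes one row and each event context one
# column, with cells counting how many times that event occurred.
wide = pd.crosstab(log["StudentID"], log["Event context"])
```

After the pivot, each student is a single example whose attribute values are event counts, which is the shape needed for modelling.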
For columns that contain missing values, we use the average, calculated as the sum of the values divided by the number of values, and fill the missing entries with it. The mean is useful for determining the overall trend of a data set and provides a rapid snapshot of the data.
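Mean imputation can be sketched as follows; the frame and column name are illustrative stand-ins, not the actual dataset schema.

```python
import pandas as pd

# Toy frame standing in for the student-event data, with one missing
# "Quiz" value.
df = pd.DataFrame({"StudentID": [1, 2, 3, 4],
                   "Quiz": [10.0, None, 6.0, 8.0]})

# Fill each missing value with the column mean: the sum of the
# observed values divided by their count, here (10 + 6 + 8) / 3 = 8.
df["Quiz"] = df["Quiz"].fillna(df["Quiz"].mean())
```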
Group the "Event context" data into eight groups: "Assignment", "Forum", "Activity", "File", "LectureNote", "Tutorial", "Questionnaire" and "Quiz".
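One way to sketch this grouping is keyword matching on the raw context string. The keyword table below is an assumption about the course's naming, and "Activity" is treated here as the catch-all group; the report does not state the exact mapping rules.

```python
# Hypothetical keyword-to-group table; real Event context strings may
# need different keywords.
GROUP_KEYWORDS = {
    "Assignment": "Assignment",
    "Forum": "Forum",
    "File": "File",
    "Lecture": "LectureNote",
    "Tutorial": "Tutorial",
    "Questionnaire": "Questionnaire",
    "Quiz": "Quiz",
}

def group_event(context: str) -> str:
    # Return the first group whose keyword appears in the raw string;
    # anything unmatched falls into the catch-all "Activity" group.
    for keyword, group in GROUP_KEYWORDS.items():
        if keyword.lower() in context.lower():
            return group
    return "Activity"
```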
Merge the marks with the data produced by the above steps using VLOOKUP.
Create a grade based on the marks. Next, we created a marks bin to indicate the size of the marks.
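Grade and bin creation can be sketched with `pd.cut`. The cut-off values and labels below are assumptions for illustration; the report does not state the exact boundaries used.

```python
import pandas as pd

marks = pd.Series([35.0, 55.0, 72.0, 88.0])

# Illustrative grade boundaries (the actual grading scheme may differ).
grade = pd.cut(marks, bins=[0, 40, 60, 80, 100],
               labels=["F", "C", "B", "A"])

# A coarser "Marksbin" indicating the size of the marks.
marks_bin = pd.cut(marks, bins=[0, 50, 75, 100],
                   labels=["Low", "Medium", "High"])
```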
Lastly, we standardize the eight attributes above using RapidMiner. We also discretize the data.
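The two operations can be sketched as follows. This is a stand-in for the RapidMiner operators, not their exact implementation: z-score standardization for the numeric attributes, and equal-frequency binning as one common form of discretization.

```python
import pandas as pd

# Toy event-count column; the real pipeline applies this to each of
# the grouped attributes.
counts = pd.Series([2.0, 4.0, 6.0, 8.0], name="Quiz")

# Z-score standardization: subtract the mean and divide by the
# standard deviation, giving the attribute mean 0 and unit variance.
z = (counts - counts.mean()) / counts.std()

# Equal-frequency discretization into two bins, one common choice;
# RapidMiner offers several discretization variants.
bins = pd.qcut(counts, q=2, labels=["low", "high"])
```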
Data after the pre-processing
Note that the data above is shown before standardization and normalization. We perform the normalization using Turbo Prep in RapidMiner. The pre-processing done above is general pre-processing such as data cleaning (deleting low-quality columns, filling missing values, and merging data).
To perform prediction on the grade using machine learning models, we identified the specific features/attributes to be used. These are "StudentID", "Marks", "Marksbin", "Grade", "Assignment", "Forum", "Activity", "File", "LectureNote", "Tutorial", "Questionnaire" and "Quiz".
DATA CLEANING
To remove irrelevant data and ensure that the data is consistent, balanced, and usable. Data is cleaned by identifying errors and correcting or deleting them, or by manually processing the data as needed to prevent the same errors from recurring.
DATA SELECTION AND EXTRACTION
Removing redundant data to increase training and inference speed.
Dealing with small dataset
Retrieve the studentEventDataset.
Select all attributes.
Sample (Bootstrapping) is used to create a sample of the dataset (sample size set to 100).
Multiply creates a copy of the data for export to an Excel file.
Write Excel is used to save the resampled file as studentEvent_resample.xls.
After the pre-processing steps, our dataset consists of 42 samples and a total of 12 attributes. Unfortunately, this is too small to proceed to modelling. To handle this problem, we decided to use Sample (Bootstrapping) in RapidMiner to resample our dataset and add rows to our data. We used this method because it samples with replacement.
In sampling with replacement, every example has an equal probability of being selected at each step. More importantly, a sample with replacement can be greater in size than the original dataset. The number of examples in the sample can be specified on an absolute or relative basis, depending on the setting of the sample parameter.
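Bootstrap resampling can be sketched in pandas as a stand-in for the RapidMiner operator; the frame below is a toy substitute for the 42-row dataset, and the fixed seed is only for reproducibility of the example.

```python
import pandas as pd

# Stand-in for the 42-row pre-processed dataset.
df = pd.DataFrame({"StudentID": list(range(42)), "Marks": [50.0] * 42})

# Sampling with replacement: every draw picks any row with equal
# probability, so the resample (100 rows) can exceed the 42 originals.
resampled = df.sample(n=100, replace=True, random_state=0)
```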
For this project, we decided to use a sample size of 100 for our datasets. Note that we have three datasets in total, corresponding to three variants of pre-processing: one with only general cleaning, one with cleaning plus discretization, and one with cleaning plus normalization. We resampled all three datasets before proceeding to the modelling part.
Example of our data after the resampling process