For this project, we are given two different datasets which are the logs dataset and marks dataset. These two datasets have one column that can be connected with each other which is a Student ID.
This dataset contains information about SSK4604 online course logs from 10/07/2020 until 11/08/2020. There are about 11518 rows and 8 columns.
This dataset contains information about about students marks for the same online course. There are 49 rows with 2 columns.
Make a pivot table based on the logs dataset. Student ID as a row, and event context as a column and count value.
Replace missing value with average value for each column.
Merge a column with a similar event name and create as a new column.
Create a new column as Marks and search student marks using function VLOOKUP. By doing this, we are merging the logs dataset and the marks dataset. VLOOKUP(A2,AN:AO,2,FALSE)
Create a new column as Grade and create student grade based on student marks using IF ELSE function.
IF(J2<40,"F", IF(J2<44,"D", IF(J2<47,"D+", IF(J2<50,"C-", IF(J2<55,"C", IF(J2<60, "C+", IF(J2<65,"B-", IF(J2<70,"B", IF(J2<75,"B+", IF(J2<80, "A-", "A"))))))))))
Create a new column as MarksBin based on student marks using the IF-ELSE function.
IF(J2<40,11, IF(J2<44,10, IF(J2<47,9, IF(J2<50,8, IF(J2<55,7, IF(J2<60, 6, IF(J2<65,5, IF(J2<70,4, IF(J2<75,3, IF(J2<80,2, 1))))))))))
Remove column File because it is similar to accessing the page.
After all the preprocessing steps, only 35 rows left in the dataset! Supposed our dataset have the same rows as in Marks dataset which is 49 rows. 11 rows of student data missing from the original dataset! So, we do an investigation by repeat the same step to figure out where and how the data is missing. Then we found out that, after we make a pivot table the data in the logs dataset only for 35 students. Its means that there is a student that has marks but doesn't have log information for the SSK4604 online course. Hence, our clean dataset only consists of 35 rows with 10 columns.
Since we are predicting student grades, therefore we deciding to bin the marks based on the UPM grading system. For the MarksBin column, we create it so we do not need to create it in Python.
Data preprocessing is crucial in any data mining process as they directly impact success rate of the project.