Introduction To Dataset

Dataset Information

For this project, we are given two different datasets which are the logs dataset and marks dataset. These two datasets have one column that can be connected with each other which is a Student ID.

Logs Dataset

This dataset contains information about SSK4604 online course logs from 10/07/2020 until 11/08/2020. There are about 11518 rows and 8 columns.

Marks Dataset

This dataset contains information about about students marks for the same online course. There are 49 rows with 2 columns.

Data Pre-processing (Excel)

Step 1: Create Pivot Table

Make a pivot table based on the logs dataset. Student ID as a row, and event context as a column and count value.

Step 2: Replace Missing Value

Replace missing value with average value for each column.

Step 3: Merge Column

Merge a column with a similar event name and create as a new column.

Step 4: Create New Column

Create a new column as Marks and search student marks using function VLOOKUP. By doing this, we are merging the logs dataset and the marks dataset. VLOOKUP(A2,AN:AO,2,FALSE)

Step 5: Create New Column (Grade)

Create a new column as Grade and create student grade based on student marks using IF ELSE function.

IF(J2<40,"F", IF(J2<44,"D", IF(J2<47,"D+", IF(J2<50,"C-", IF(J2<55,"C", IF(J2<60, "C+", IF(J2<65,"B-", IF(J2<70,"B", IF(J2<75,"B+", IF(J2<80, "A-", "A"))))))))))

Step 6: Create New Column (MarksBin)

Create a new column as MarksBin based on student marks using the IF-ELSE function.
IF(J2<40,11, IF(J2<44,10, IF(J2<47,9, IF(J2<50,8, IF(J2<55,7, IF(J2<60, 6, IF(J2<65,5, IF(J2<70,4, IF(J2<75,3, IF(J2<80,2, 1))))))))))

Step 7: Remove Unnecessary Column

Remove column File because it is similar to accessing the page.

Clean Dataset

After all the preprocessing steps, only 35 rows left in the dataset! Supposed our dataset have the same rows as in Marks dataset which is 49 rows. 11 rows of student data missing from the original dataset! So, we do an investigation by repeat the same step to figure out where and how the data is missing. Then we found out that, after we make a pivot table the data in the logs dataset only for 35 students. Its means that there is a student that has marks but doesn't have log information for the SSK4604 online course. Hence, our clean dataset only consists of 35 rows with 10 columns.

Since we are predicting student grades, therefore we deciding to bin the marks based on the UPM grading system. For the MarksBin column, we create it so we do not need to create it in Python.

Summary

Data preprocessing is crucial in any data mining process as they directly impact success rate of the project.

Next Topic: Data Standardization

Page updated

Report abuse