In statistics, standardization (sometimes called data normalization or feature scaling) refers to the process of rescaling the values of the variables in a dataset so they share a common scale. Often performed as a pre-processing step, particularly for cluster analysis, standardization is important when working with data where each variable has a different unit (e.g., inches, meters, tons, and kilograms), or where the scales of the variables differ greatly from one another (e.g., 0-1 vs. 0-1000). This matters especially in cluster analysis because groups are defined by the distances between points in mathematical space.
K-Means clustering is sensitive to these distances, so we standardize the data using the Turbo Prep view in RapidMiner.
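To make the idea concrete: standardization (the z-transformation) rescales each variable to mean 0 and standard deviation 1 via z = (x - mean) / standard deviation, so no single variable dominates the distance calculation. The following is a minimal Python sketch of the same operation using hypothetical example values (the column names gpa and event_points are invented for illustration, not taken from the real dataset):

import pandas as pd

# Hypothetical example: two variables on very different scales (0-1 vs. 0-1000).
df = pd.DataFrame({"gpa": [0.2, 0.5, 0.9], "event_points": [120, 540, 980]})

# z-transformation: each column ends up with mean 0 and standard deviation 1,
# so distances between rows are no longer dominated by event_points.
standardized = (df - df.mean()) / df.std()
print(standardized)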
In this step, we use the cleaned StudentEvent dataset, which contains 35 rows and 11 columns.
First, we need to import the StudentEvent.xlsx file into our local repository.
Select the Excel file and choose all columns.
Ignore this option and click Next.
Save the file in the local repository as StudentEvent.
Select all the columns.
Choose Normalization.
Choose Standardization.
Click Apply.
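For comparison, if this step were done in Python rather than in Turbo Prep, scikit-learn's StandardScaler applies the same z-transformation. A sketch, assuming all 11 columns of StudentEvent are numeric and the file sits in the working directory:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset (reading .xlsx files requires the openpyxl package).
df = pd.read_excel("StudentEvent.xlsx")

# Rescale every column to mean 0 and standard deviation 1.
# Note: scikit-learn uses the population standard deviation, so results
# may differ marginally from tools that use the sample standard deviation.
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(standardized.head())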
We export the dataset to Excel because the subsequent analysis will be done in Python (see the sketch after these steps).
Choose the Excel file type.
Select a location and save the file as StudentEvent.xlsx.
Select the Repository option to save the processed file in the local repository.
Save the file as StudentEvent in the local repository.
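Once exported, the standardized file can be loaded back in Python for the clustering analysis. A minimal sketch, assuming the exported StudentEvent.xlsx is in the working directory:

import pandas as pd

# Load the standardized dataset exported from RapidMiner.
df = pd.read_excel("StudentEvent.xlsx")

# Sanity check: after standardization, each column should have
# mean approximately 0 and standard deviation approximately 1.
print(df.describe().loc[["mean", "std"]])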
After standardization, we can see that all the values have changed from those in the cleaned dataset. We can now use this standardized dataset for the K-Means clustering analysis in Python.