Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, such as K-Means, which requires the user to specify the number of clusters k to be generated. In our project, we apply the Elbow Method and the Silhouette Method to find the optimum k-value for K-Means clustering.
In this step, we use the StudentEvent dataset. The values in this dataset were already standardized in RapidMiner.
import pandas as pd

path1 = "/content/drive/My Drive/Colab Notebooks/StudentEvent.xlsx"
dataf1 = pd.read_excel(path1)
Using the info() function, we can see dataset information such as the data type of each column, the number of rows, and the total number of columns in the dataset.
dataf1.info()
Using the head() function, we can inspect the first few rows of the dataset.
dataf1.head(3)
Since StudentID is an object (string) datatype, we need to convert it to a numeric value. To do that, we remove the first character of each StudentID and convert the remaining digits to numbers using the to_numeric() function.
dataf1['StudentID'] = dataf1['StudentID'].str[1:]
dataf1['StudentID'] = pd.to_numeric(dataf1['StudentID'])
dataf1.info()
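The conversion above can be checked on a small hypothetical sample (the ID values here are made up for illustration, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical StudentID values: a letter prefix followed by digits.
sample = pd.DataFrame({'StudentID': ['S1001', 'S1002', 'S1003']})

# Drop the first character, then convert the remaining digits to numbers.
sample['StudentID'] = pd.to_numeric(sample['StudentID'].str[1:])

print(sample['StudentID'].tolist())   # [1001, 1002, 1003]
print(sample['StudentID'].dtype)      # int64
```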
Next, we select the columns to be analyzed in this activity.
data_std = dataf1[['Assignment','Forum','Activity','LectureNote',
                   'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
scaled_data = data_std  # values were already standardized in RapidMiner
scaled_data
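Because the values were standardized in RapidMiner, no further scaling is applied here. If the raw values had been used instead, an equivalent z-score standardization could be done in Python with scikit-learn's StandardScaler; a minimal sketch on made-up numbers:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up raw values standing in for the unscaled features.
raw = pd.DataFrame({'Assignment': [2.0, 4.0, 6.0],
                    'Quiz': [10.0, 20.0, 30.0]})

# StandardScaler transforms each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)

print(scaled.mean().round(6).tolist())  # each column now averages 0
```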
This is how we initialize the x and y values: the first seven columns are the features, and MarksBin is the label.
x = scaled_data.iloc[:, 0:7].values        # feature values
y_true = scaled_data.iloc[:, [7]].values   # MarksBin label values
x_label = scaled_data.iloc[:, 0:7].columns
y_label = scaled_data.iloc[:, [7]].columns
The KElbowVisualizer implements the “elbow” method to select the optimal number of clusters by fitting the model with a range of values for k. If the line chart resembles an arm, then the “elbow” (the bend in the curve) is a good indication that the underlying model fits best at that point. In the visualizer, the “elbow” is annotated with a dashed line.
For our project, KElbowVisualizer fits the KMeans model over a range of k values on our dataset, which has 7 features. Once the model has been fit for every k in the range, the visualizer annotates the “elbow” with a dashed line, marking the optimal number of clusters.
This is how the x values are plotted to visualize the elbow. Using the yellowbrick library, the annotation line is drawn automatically at the optimum k-value. From the visualization, the suggested k-value is 6.
# Elbow Method for K-Means
# Import KMeans and the KElbowVisualizer
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

modelkm = KMeans()
# k is the range of numbers of clusters to try.
visualizer = KElbowVisualizer(modelkm, k=(2, 11), timings=True)
visualizer.fit(x)    # Fit data to the visualizer
visualizer.show()    # Finalize and render the figure
The Silhouette Coefficient is used when the ground truth about the dataset is unknown; it measures the density of the clusters computed by the model. The score is computed by averaging the silhouette coefficient of each sample, which is the difference between the mean intra-cluster distance and the mean nearest-cluster distance, normalized by the maximum of the two. This produces a score between -1 and 1, where 1 indicates highly dense, well-separated clusters and -1 indicates completely incorrect clustering.
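In other words, each sample gets a score s = (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. As a small sanity check on made-up points, scikit-learn's silhouette_score averages this value over all samples:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated made-up clusters in 2-D.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Average of (b - a) / max(a, b) over all samples.
score = silhouette_score(pts, labels)
print(round(score, 3))  # close to 1, since the clusters are tight and far apart
```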
The Silhouette Visualizer displays the silhouette coefficient for each sample on a per-cluster basis, visualizing which clusters are dense and which are not. This is particularly useful for determining cluster imbalance, or for selecting a value for k by comparing multiple visualizers.
This is how we visualized the silhouette scores using the same x values. From the visualization, the suggested k-value is the same as with the elbow method, which is 6.
# Silhouette Score for K-Means
# Import KMeans and the KElbowVisualizer (here used with the silhouette metric)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k is the range of numbers of clusters to try.
visualizer = KElbowVisualizer(model, k=(2, 11), metric='silhouette', timings=True)
visualizer.fit(x)    # Fit the data to the visualizer
visualizer.show()    # Finalize and render the figure
Hence, for our project, the k-value for K-Means clustering is 6.
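With k fixed at 6, the final clustering can be obtained by fitting KMeans directly. A sketch on synthetic stand-in data (in the project, x built earlier would be used in place of x_demo):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 7-feature x array used above.
x_demo, _ = make_blobs(n_samples=200, centers=6, n_features=7, random_state=0)

# Fit the final model with the chosen k and assign one label per record.
km = KMeans(n_clusters=6, random_state=0, n_init=10)
cluster_labels = km.fit_predict(x_demo)

print(np.unique(cluster_labels))  # the six cluster labels 0-5
```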