Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, such as K-Means, which requires the user to specify the number of clusters k to be generated. In our project, we apply the Elbow Method and the Silhouette Method to find the optimum k-value for K-Means clustering.
In this step, we use the StudentEvent dataset. The values in this dataset were already standardized in RapidMiner.
import pandas as pd

path1 = "/content/drive/My Drive/Colab Notebooks/StudentEvent.xlsx"
dataf1 = pd.read_excel(path1)
Using the info() function, we can see dataset information such as the data type of each column, the number of rows, and the total number of columns in the dataset.
dataf1.info()
Using the head() function, we can inspect the first few rows of the dataset.
dataf1.head(3)
Since StudentID is an object (string) datatype, we need to convert it to a numeric value. To do that, we remove the first character of each StudentID and convert the remaining digits to numbers using the to_numeric() function.
dataf1['StudentID'] = dataf1['StudentID'].str[1:]
dataf1['StudentID'] = pd.to_numeric(dataf1['StudentID'])
dataf1.info()
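The conversion above can be checked on a small hypothetical sample (the ID values here are made up for illustration, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical StudentID values: a letter prefix followed by digits.
sample = pd.DataFrame({'StudentID': ['S1001', 'S1002', 'S1003']})

# Drop the first character, then convert the remaining digits to numbers.
sample['StudentID'] = pd.to_numeric(sample['StudentID'].str[1:])

print(sample['StudentID'].tolist())   # [1001, 1002, 1003]
print(sample['StudentID'].dtype)      # int64
```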
Next, we select the columns to be analyzed in this activity.
data_std = dataf1[['Assignment','Forum','Activity','LectureNote',
                   'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
scaled_data = data_std  # values were already standardized in RapidMiner
scaled_data
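Because the values were standardized in RapidMiner, no further scaling is applied here. If the raw values had been used instead, an equivalent z-score standardization could be done in Python with scikit-learn's StandardScaler; a minimal sketch on made-up numbers:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up raw values standing in for the unscaled features.
raw = pd.DataFrame({'Assignment': [2.0, 4.0, 6.0],
                    'Quiz': [10.0, 20.0, 30.0]})

# StandardScaler transforms each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)

print(scaled.mean().round(6).tolist())  # each column now averages 0
```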
This is how we initialize the x and y values: the first seven columns are the features, and MarksBin is the label.
x = scaled_data.iloc[:, 0:7].values        # feature values
y_true = scaled_data.iloc[:, [7]].values   # MarksBin label values
x_label = scaled_data.iloc[:, 0:7].columns
y_label = scaled_data.iloc[:, [7]].columns
The KElbowVisualizer implements the “elbow” method to select the optimal number of clusters by fitting the model with a range of values for k. If the line chart resembles an arm, then the “elbow” (the bend in the curve) is a good indication that the underlying model fits best at that point. In the visualizer, the “elbow” is annotated with a dashed line.
For our project, KElbowVisualizer fits the KMeans model over a range of k values on our dataset, which has 7 features. Once the model has been fit for every k in the range, the visualizer annotates the “elbow” with a dashed line, marking the optimal number of clusters.
This is how the x values are plotted to visualize the elbow. Using the yellowbrick library, the annotation line is drawn automatically at the optimum k-value. From the visualization, the suggested k-value is 6.
# Elbow Method for K-Means
# Import KMeans and the KElbowVisualizer
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

modelkm = KMeans()
# k is the range of numbers of clusters to try.
visualizer = KElbowVisualizer(modelkm, k=(2, 11), timings=True)
visualizer.fit(x)    # Fit data to the visualizer
visualizer.show()    # Finalize and render the figure
The Silhouette Coefficient is used when the ground truth about the dataset is unknown; it measures the density of the clusters computed by the model. The score is computed by averaging the silhouette coefficient of each sample, which is the difference between the mean intra-cluster distance and the mean nearest-cluster distance, normalized by the maximum of the two. This produces a score between -1 and 1, where 1 indicates highly dense, well-separated clusters and -1 indicates completely incorrect clustering.
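In other words, each sample gets a score s = (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. As a small sanity check on made-up points, scikit-learn's silhouette_score averages this value over all samples:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated made-up clusters in 2-D.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Average of (b - a) / max(a, b) over all samples.
score = silhouette_score(pts, labels)
print(round(score, 3))  # close to 1, since the clusters are tight and far apart
```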
The Silhouette Visualizer displays the silhouette coefficient for each sample on a per-cluster basis, visualizing which clusters are dense and which are not. This is particularly useful for determining cluster imbalance, or for selecting a value for k by comparing multiple visualizers.
This is how we visualized the silhouette scores using the same x values. From the visualization, the suggested k-value is the same as with the elbow method, which is 6.
# Silhouette Score for K-Means
# Import KMeans and the KElbowVisualizer (here used with the silhouette metric)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k is the range of numbers of clusters to try.
visualizer = KElbowVisualizer(model, k=(2, 11), metric='silhouette', timings=True)
visualizer.fit(x)    # Fit the data to the visualizer
visualizer.show()    # Finalize and render the figure
Hence, for our project, the k-value for K-Means clustering is 6.
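With k fixed at 6, the final clustering can be obtained by fitting KMeans directly. A sketch on synthetic stand-in data (in the project, x built earlier would be used in place of x_demo):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 7-feature x array used above.
x_demo, _ = make_blobs(n_samples=200, centers=6, n_features=7, random_state=0)

# Fit the final model with the chosen k and assign one label per record.
km = KMeans(n_clusters=6, random_state=0, n_init=10)
cluster_labels = km.fit_predict(x_demo)

print(np.unique(cluster_labels))  # the six cluster labels 0-5
```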