For our project, we use k=6.
In this step, we use the StudentEvent dataset. The values in this dataset have already been standardized in RapidMiner.
import pandas as pd
path1 = "/content/drive/My Drive/Colab Notebooks/StudentEvent.xlsx"
dataf1 = pd.read_excel(path1)
dataf1.head(3)
Select data to be analyzed in this activity.
data_std = new_df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
scaled_data = data_std
scaled_data
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data_km = new_df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
data_km.info()
We initialize the x values (the seven activity features) and the y value (MarksBin) as follows.
xkm = data_km.iloc[:, [0, 1, 2, 3,4,5,6]].values
ykm = data_km.iloc[:, [7]].values
xkm_label = data_km.iloc[:, [0, 1, 2, 3,4,5,6]].columns
ykm_label = data_km.iloc[:, [7]].columns
The x values are plotted as follows to visualize the K-Means clustering. Only the first two feature columns (Assignment and Forum) are shown on the axes.
#Visualising the clusters
plt.scatter(xkm[y_kmeans == 0, 0], xkm[y_kmeans == 0, 1], s = 100, c = 'yellow', label = 'Cluster 0')
plt.scatter(xkm[y_kmeans == 1, 0], xkm[y_kmeans == 1, 1], s = 100, c = 'green', label = 'Cluster 1')
plt.scatter(xkm[y_kmeans == 2, 0], xkm[y_kmeans == 2, 1], s = 100, c = 'cyan', label = 'Cluster 2')
plt.scatter(xkm[y_kmeans == 3, 0], xkm[y_kmeans == 3, 1], s = 100, c = 'grey', label = 'Cluster 3')
plt.scatter(xkm[y_kmeans == 4, 0], xkm[y_kmeans == 4, 1], s = 100, c = 'black', label = 'Cluster 4')
plt.scatter(xkm[y_kmeans == 5, 0], xkm[y_kmeans == 5, 1], s = 100, c = 'blue', label = 'Cluster 5')
plt.legend()
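The plotting code above assumes that y_kmeans already holds one cluster label per row of xkm. A minimal sketch of how such labels could be produced and plotted in a loop is shown here. It uses synthetic data from make_blobs as a stand-in for the StudentEvent features, and the names X_demo, y_demo and kmeans_demo are illustrative only and not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for xkm (7 feature columns); this is NOT the StudentEvent data.
X_demo, _ = make_blobs(n_samples=300, n_features=7, centers=6, random_state=0)

# Fit K-Means with k=6 and get one cluster label per row.
kmeans_demo = KMeans(n_clusters=6, n_init=10, random_state=0)
y_demo = kmeans_demo.fit_predict(X_demo)

# Plot the first two feature columns, one colour per cluster.
colors = ['yellow', 'green', 'cyan', 'grey', 'black', 'blue']
for k, color in enumerate(colors):
    members = (y_demo == k)
    plt.scatter(X_demo[members, 0], X_demo[members, 1],
                s=100, c=color, label=f'Cluster {k}')
plt.legend()
```

Looping over the colours keeps the plot in step with the number of clusters, so changing k only requires changing the colour list.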
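import numpy as np
from scipy.stats import mode
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(xkm)
# Compute the clusters
# Note: the clusters are fit on ykm (MarksBin), not on the t-SNE projection digits_proj
kmeans = KMeans(n_clusters=6, random_state=0)
clusters = kmeans.fit_predict(ykm)
# Permute the labels: give each cluster the most common MarksBin value among its members
labels = np.zeros_like(clusters)
for i in range(6):  # one pass per cluster (k=6)
    mask = (clusters == i)
    labels[mask] = mode(ykm[mask].ravel())[0]
# Compute the accuracy
accuracy_score(ykm, labels)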
0.9428571428571428
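The label permutation step is needed because K-Means numbers its clusters arbitrarily, so cluster ids must be mapped to class values before an accuracy can be computed. A minimal sketch of that majority-vote mapping follows, on a hypothetical toy example. The arrays clusters_demo and true_demo are made up for illustration, and np.unique is used here in place of scipy's mode.

```python
import numpy as np

# Hypothetical toy data: cluster ids from a clustering step and the true class of each row.
clusters_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2])
true_demo = np.array([2, 2, 1, 0, 0, 0, 1, 1])

# Map each cluster id to the most common true class among its members.
labels_demo = np.zeros_like(clusters_demo)
for c in np.unique(clusters_demo):
    mask = (clusters_demo == c)
    values, counts = np.unique(true_demo[mask], return_counts=True)
    labels_demo[mask] = values[np.argmax(counts)]

# Fraction of rows whose mapped label matches the true class.
accuracy_demo = np.mean(labels_demo == true_demo)
print(labels_demo, accuracy_demo)
```

In this toy case, cluster 0 is mapped to class 2 because two of its three members have that class, so 7 of the 8 rows match after mapping.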
We visualize the silhouette scores using the same x values. The visualization suggests k = 6, which is the same value suggested by the elbow method.
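As a rough illustration of how the two criteria can be compared, the sketch below computes the inertia (used by the elbow method) and the mean silhouette score for a range of k values. It runs on synthetic make_blobs data rather than the StudentEvent features, and the names X_demo, inertias, silhouettes and best_k are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for xkm; this is NOT the StudentEvent data.
X_demo, _ = make_blobs(n_samples=300, n_features=7, centers=6,
                       cluster_std=0.5, random_state=0)

inertias = {}      # within-cluster sum of squares, per k (elbow method)
silhouettes = {}   # mean silhouette coefficient, per k
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X_demo, model.labels_)

# The k with the highest mean silhouette score is the silhouette suggestion.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

The elbow is read from where the drop in inertias levels off, while the silhouette suggestion is simply the k with the highest score, so the two can be checked against each other.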