For our project, we use k = 6.
In this step, we use the StudentEvent dataset, whose values were standardized in RapidMiner. It contains 35 rows and 11 columns.
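The standardization itself was done in RapidMiner, but the same z-score scaling could equivalently be reproduced in Python with scikit-learn's StandardScaler. The small DataFrame below is a hypothetical stand-in, not the actual StudentEvent values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two of the StudentEvent columns.
raw = pd.DataFrame({'Assignment': [2.0, 4.0, 6.0],
                    'Quiz': [10.0, 20.0, 30.0]})

# z-score scaling: each column ends up with mean 0 and unit variance,
# matching what RapidMiner's standardization produces.
scaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                      columns=raw.columns)
print(scaled.round(4))
```

Each column of `scaled` now has mean 0 and unit (population) standard deviation, which is the property the clustering step relies on.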
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
#agglomerative clustering
dag = new_df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
dag.info()
This is how we initialize the X and y values:
X_dag = dag.iloc[:, [0,1,2,3,4,5,6]].values
y_dag = dag.iloc[:, 7].values
The Agglomerative Clustering result is plotted as a dendrogram (Ward linkage). Python produces the graph, and a horizontal line is drawn across it so the number of clusters is easy to read off. From the graph, the drawn line passes through six vertical branches, so the number of clusters is 6.
Z = linkage(X_dag, method="ward")

# Plot a dendrogram
fig, ax = plt.subplots(figsize=(10, 10))
dendrogram(Z, orientation="top", labels=np.ravel(y_dag).tolist(),
           leaf_rotation=30, leaf_font_size=10, ax=ax)
plt.tight_layout()
plt.show()
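Reading the cluster count off the dendrogram by eye can also be made explicit in code: scipy's `fcluster` cuts the same Ward linkage tree into a requested number of flat clusters. The random data below is only a stand-in for `X_dag` (35 students, 7 features), since the real values are not reproduced here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 7))  # stand-in for X_dag

Z = linkage(X, method="ward")
# Cut the tree so that at most 6 flat clusters remain.
labels = fcluster(Z, t=6, criterion="maxclust")
print(sorted(set(labels)))  # at most 6 cluster ids
```

`fcluster` returns one cluster id per row, which plays the same role as the `y_hc` vector used in the visualization below.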
# 5. Visualizing the clusters. This code is similar to the k-means
# visualization code; we only replace the y_kmeans vector with y_hc
# for hierarchical clustering.
hc = AgglomerativeClustering(n_clusters=6, linkage='ward')
y_hc = hc.fit_predict(X_dag)
plt.scatter(X_dag[y_hc == 0, 0], X_dag[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_dag[y_hc == 1, 0], X_dag[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_dag[y_hc == 2, 0], X_dag[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_dag[y_hc == 3, 0], X_dag[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_dag[y_hc == 4, 0], X_dag[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.scatter(X_dag[y_hc == 5, 0], X_dag[y_hc == 5, 1], s=100, c='black', label='Cluster 6')
plt.title('Clusters of Online Learning Participation')
plt.legend()
plt.show()
Based on the reference below, it is up to us where to place the threshold. I drew the line so that it yields 6 clusters, so the cluster count for Agglomerative Clustering in Python is 6.
Reference: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
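Since the threshold choice is ours, scikit-learn can also make that choice explicit in code: with `n_clusters=None` and a `distance_threshold`, `AgglomerativeClustering` cuts the Ward tree at the given height and reports how many clusters result. The threshold value 5.0 and the random stand-in data below are illustrative assumptions, not values from this report:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.normal(size=(35, 7))  # stand-in for X_dag

# Cut the tree at a chosen height instead of fixing the cluster count;
# n_clusters_ then tells us how many clusters that cut produces.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                                linkage='ward')
labels = model.fit_predict(X)
print(model.n_clusters_, len(labels))
```

This mirrors the manual procedure above: moving the threshold up or down changes `n_clusters_`, just as moving the drawn line changes how many vertical branches it crosses.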