For our project, we use k = 6.
In this step, we use the StudentEvent dataset, whose values were standardized in RapidMiner. It contains 35 rows and 11 columns.
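The standardization itself was done in RapidMiner, but the same z-score scaling could equivalently be reproduced in Python with scikit-learn's StandardScaler. The small DataFrame below is a hypothetical stand-in, not the actual StudentEvent values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two of the StudentEvent columns.
raw = pd.DataFrame({'Assignment': [2.0, 4.0, 6.0],
                    'Quiz': [10.0, 20.0, 30.0]})

# z-score scaling: each column ends up with mean 0 and unit variance,
# matching what RapidMiner's standardization produces.
scaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                      columns=raw.columns)
print(scaled.round(4))
```

Each column of `scaled` now has mean 0 and unit (population) standard deviation, which is the property the clustering step relies on.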
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
#agglomerative clustering
dag = new_df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
dag.info()
This is how we initialize the X and y values:
X_dag = dag.iloc[:, [0,1,2,3,4,5,6]].values
y_dag = dag.iloc[:, 7].values
The Agglomerative Clustering result is plotted as a dendrogram (Ward linkage). Python produces the graph, and a horizontal line is drawn across it so the number of clusters is easy to read off. From the graph, the drawn line passes through six vertical branches, so the number of clusters is 6.
Z = linkage(X_dag, method="ward")

# Plot a dendrogram
fig, ax = plt.subplots(figsize=(10, 10))
dendrogram(Z, orientation="top", labels=np.ravel(y_dag).tolist(),
           leaf_rotation=30, leaf_font_size=10, ax=ax)
plt.tight_layout()
plt.show()
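Reading the cluster count off the dendrogram by eye can also be made explicit in code: scipy's `fcluster` cuts the same Ward linkage tree into a requested number of flat clusters. The random data below is only a stand-in for `X_dag` (35 students, 7 features), since the real values are not reproduced here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 7))  # stand-in for X_dag

Z = linkage(X, method="ward")
# Cut the tree so that at most 6 flat clusters remain.
labels = fcluster(Z, t=6, criterion="maxclust")
print(sorted(set(labels)))  # at most 6 cluster ids
```

`fcluster` returns one cluster id per row, which plays the same role as the `y_hc` vector used in the visualization below.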
# 5. Visualizing the clusters. This code is similar to the k-means
# visualization code; we only replace the y_kmeans vector with y_hc
# for hierarchical clustering.
hc = AgglomerativeClustering(n_clusters=6, linkage='ward')
y_hc = hc.fit_predict(X_dag)
plt.scatter(X_dag[y_hc == 0, 0], X_dag[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_dag[y_hc == 1, 0], X_dag[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_dag[y_hc == 2, 0], X_dag[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_dag[y_hc == 3, 0], X_dag[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_dag[y_hc == 4, 0], X_dag[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.scatter(X_dag[y_hc == 5, 0], X_dag[y_hc == 5, 1], s=100, c='black', label='Cluster 6')
plt.title('Clusters of Online Learning Participation')
plt.legend()
plt.show()
Based on the reference below, it is up to us where to place the threshold. I drew the line so that it yields 6 clusters, so the cluster count for Agglomerative Clustering in Python is 6.
Reference: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
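Since the threshold choice is ours, scikit-learn can also make that choice explicit in code: with `n_clusters=None` and a `distance_threshold`, `AgglomerativeClustering` cuts the Ward tree at the given height and reports how many clusters result. The threshold value 5.0 and the random stand-in data below are illustrative assumptions, not values from this report:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.normal(size=(35, 7))  # stand-in for X_dag

# Cut the tree at a chosen height instead of fixing the cluster count;
# n_clusters_ then tells us how many clusters that cut produces.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                                linkage='ward')
labels = model.fit_predict(X)
print(model.n_clusters_, len(labels))
```

This mirrors the manual procedure above: moving the threshold up or down changes `n_clusters_`, just as moving the drawn line changes how many vertical branches it crosses.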