K-Means clustering is one of the simplest and most widely used unsupervised machine learning algorithms. The K-Means algorithm identifies k centroids and then allocates every data point to the nearest centroid, keeping the clusters as compact as possible.
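To make the iteration concrete, here is a minimal from-scratch sketch of the K-Means (Lloyd's) loop. The function name and parameters are my own for illustration, and it assumes no cluster ever becomes empty; the actual analysis below uses scikit-learn, which handles such cases properly.

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    # illustration only: minimal Lloyd's iteration, assumes no cluster goes empty
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids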
Hierarchical clustering treats each data point as a singleton cluster and successively merges clusters until all points have been merged into a single remaining cluster. In single link clustering, if d_n is the distance of the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most d_n, then the clusters after step n are the connected components of G(n).
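This connected-components view can be checked on a small example. The toy data below is hypothetical and only serves to illustrate the equivalence:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
pts = rng.random((10, 2))              # hypothetical toy data

Z = linkage(pts, 'single')             # Z[n, 2] is d_n, the merge distance at step n
t = Z[4, 2]                            # threshold after the fifth merge

# single-link clusters at threshold t ...
labels_single = fcluster(Z, t=t, criterion='distance')

# ... match the connected components of G(n), which links points at distance <= t
adjacency = csr_matrix(squareform(pdist(pts)) <= t)
n_components, labels_graph = connected_components(adjacency)
print(len(set(labels_single)), n_components)   # same number of clusters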
Now we will look at how descriptive data mining for health is carried out. I will explain in detail the steps and methods used for both RapidMiner and Python. For the descriptive task, my group and I decided to do clustering using K-Means and Agglomerative Clustering (Single Link).
Before proceeding to the steps of descriptive data mining, I will show the features we used for the health target.
The features selected are Age, Socio economic, Salary, self-rate health, self care, and f2healthstat.
To find the optimum k value, we used three different methods: the Elbow method, the Silhouette method, and the Davies-Bouldin method.
The Elbow method selects the optimal number of clusters by fitting the model with a range of values for k and plotting a score against each k. The line will usually resemble an arm, and the "elbow", the point of inflection on the curve, marks the optimal k value.
The Silhouette Coefficient is used when the ground-truth labels are unknown; it measures the density of the clusters computed by the model. The score is calculated by averaging the silhouette coefficient of each sample, which is the difference between the mean nearest-cluster distance and the mean intra-cluster distance for that sample, normalized by the maximum of the two.
The Davies-Bouldin Index is defined as the ratio between within-cluster scatter and between-cluster separation, which basically means a ratio of within-cluster distance to between-cluster distance. The objective is to find the k value at which the clusters are the least dispersed internally and the farthest apart from each other, i.e., the value with the lowest index.
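Below is a minimal sketch of how all three scores could be computed over a range of k values. It assumes x is the feature matrix built in the Python code that follows, and random_state=42 is an arbitrary choice for reproducibility:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

ks = range(2, 11)
inertias, silhouettes, davies_bouldins = [], [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(x)
    inertias.append(km.inertia_)                                 # Elbow: look for the bend
    silhouettes.append(silhouette_score(x, km.labels_))          # higher is better
    davies_bouldins.append(davies_bouldin_score(x, km.labels_))  # lower is better

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, scores, title in zip(axes, [inertias, silhouettes, davies_bouldins],
                             ['Elbow (inertia)', 'Silhouette', 'Davies-Bouldin']):
    ax.plot(list(ks), scores, marker='o')
    ax.set_xlabel('k')
    ax.set_title(title)
plt.show()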
All three methods, Elbow, Silhouette, and Davies-Bouldin, showed 6 as the best k value, so we decided to use the optimum k value of 6.
import pandas as pd  # needed for pd.DataFrame below
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
import seaborn as sns
# select the feature columns used for the health clustering
x = data.iloc[:, [0, 1, 3, 4, 6, 8, 10]].values
#For this clustering, we chose 6 clusters
#K-Means
kmeans_6 = KMeans(n_clusters=6)
y_kmeans_6 = kmeans_6.fit_predict(x)
centroids = kmeans_6.cluster_centers_
# insert into predicted table
frame = pd.DataFrame(x)
frame['cluster'] = y_kmeans_6
# to calculate how many are in each cluster
frame['cluster'].value_counts()
Below is the result:
3    371
1    339
5    210
2    150
4    115
0     29
Name: cluster, dtype: int64
#visualizing clustering
#each column of x has an index code; for example,
#we wanted to see the relationship between AGE (index 0) and SELF_RATE_HEALTH (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4], c=y_kmeans_6, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH')
plt.show()
As we can see from the visualization of self-rate health against age, six clusters are visible and the biggest cluster is purple. The points fall along horizontal lines because self-rate health has fixed values, namely 1, 2, 3, and 4.
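Because self-rate health takes only four discrete values, many points overlap exactly. One hypothetical way to make the cluster density more visible is to add a small display-only jitter on that axis:

import numpy as np

# add small random noise on the discrete axis for display purposes only;
# the underlying data and cluster assignments are unchanged
rng = np.random.default_rng(0)
jitter = rng.uniform(-0.15, 0.15, size=len(x))
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4] + jitter, c=y_kmeans_6, cmap='rainbow', s=10)
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH (jittered)')
plt.show()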
# import hierarchical clustering libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
#create a dendrogram
Z = linkage(x, 'single')
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrograms (Single-linkage)")
dn = dendrogram(Z)
plt.show()
For the dendrogram, we used Z = linkage(x, 'single') because we decided to use single link, which is why the linkage method is 'single'. Based on the dendrogram alone, it is hard to identify the right k, so we used the k value suggested by the Elbow, Silhouette, and Davies-Bouldin methods, which is 6.
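As a side note, the same six single-link clusters could also be cut directly from the scipy linkage matrix Z; this is a hypothetical alternative to the scikit-learn estimator used below:

from scipy.cluster.hierarchy import fcluster

# cut the existing single-link tree Z into exactly 6 flat clusters
labels_scipy = fcluster(Z, t=6, criterion='maxclust')   # labels are 1-based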
# create clusters
agglo1 = AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='single')
# save clusters for chart
y_agglo1 = agglo1.fit_predict(x)
The number of clusters is 6. For the affinity, the metric used to compute the linkage, we used Euclidean, while the linkage criterion is set to single.
# insert into predicted table
frame = pd.DataFrame(x)
frame['cluster'] = y_agglo1
# to calculate how many are in each cluster
frame['cluster'].value_counts()
The result:
4    1185
2      14
5       7
1       4
3       2
0       2
Name: cluster, dtype: int64
#visualizing clustering
#each column of x has an index code; for example,
#we wanted to see the relationship between AGE (index 0) and SELF_RATE_HEALTH (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4], c=y_agglo1, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH')
plt.show()
Based on this visualization, only four clusters are visible and the biggest cluster is orange. Since we used agglomerative clustering (single link) for this one, the clusters differ from K-Means: the K-Means clusters were fairly balanced in size, whereas single link produced one dominant cluster of 1185 points and several tiny ones, which is typical of single link's chaining effect.
The first operator retrieves the data that will be used for the clustering.
By using this operator, we can choose only the attributes required for the health target. If we look at the column on the right (Selected Attributes), these are all of the attributes we chose for health.
We used this operator because the data for f2healthstat is nominal, so we need to convert the nominal data to numerical. Only after that can we continue with the clustering, since K-Means requires numerical data.
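For readers following along in Python, a rough equivalent of this conversion is one-hot (dummy) coding; the column values below are hypothetical:

import pandas as pd

# hypothetical illustration of nominal-to-numerical conversion via dummy coding
df = pd.DataFrame({'f2healthstat': ['good', 'fair', 'poor', 'good']})
encoded = pd.get_dummies(df, columns=['f2healthstat'])
print(encoded)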
To create clusters, we used the Clustering operator (K-Means). The number of clusters was set to 6, since that is the optimum k value suggested by the three methods stated previously: the Elbow method, the Silhouette score, and the Davies-Bouldin score. As for the measure type, we used Euclidean.
We used the Performance operator to evaluate the clusters with the Davies-Bouldin score. The criterion was set to average within centroid since we are doing K-Means, which is centroid-based. We also enabled the maximize option, because without it the result is reported as a negative value; since we wanted to read the Davies-Bouldin score directly, we made the result positive. Meanwhile, to visualize the clusters, we used the Cluster Model Visualizer operator.
First, we retrieved the dataset that will be used for this agglomerative clustering by using the Retrieve operator, as shown below.
Before continuing with the clustering, we selected only the required attributes. The attributes under the "Selected Attributes" column are the ones that will be used in this clustering.
To create the clusters, we used the Clustering (Agglomerative Clustering) operator. The mode was set to SingleLink since we decided to do single-link agglomerative clustering. The measure type was set to MixedMeasures because f2healthstat is nominal, not numerical. Lastly, for the measurement we used MixedEuclideanDistance.