K-Means clustering is one of the simplest and most widely used unsupervised machine learning algorithms. The K-Means algorithm identifies k centroids and then allocates every data point to the nearest centroid, keeping the clusters as compact as possible.
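To make the iteration concrete, here is a minimal from-scratch sketch of the K-Means (Lloyd's) loop. The function name and parameters are my own for illustration, and it assumes no cluster ever becomes empty; the actual analysis below uses scikit-learn, which handles such cases properly.

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    # illustration only: minimal Lloyd's iteration, assumes no cluster goes empty
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids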
Hierarchical clustering treats each data point as a singleton cluster and successively merges clusters until all points have been merged into a single remaining cluster. In single link clustering, if d_n is the distance of the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most d_n, then the clusters after step n are the connected components of G(n).
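This connected-components view can be checked on a small example. The toy data below is hypothetical and only serves to illustrate the equivalence:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
pts = rng.random((10, 2))              # hypothetical toy data

Z = linkage(pts, 'single')             # Z[n, 2] is d_n, the merge distance at step n
t = Z[4, 2]                            # threshold after the fifth merge

# single-link clusters at threshold t ...
labels_single = fcluster(Z, t=t, criterion='distance')

# ... match the connected components of G(n), which links points at distance <= t
adjacency = csr_matrix(squareform(pdist(pts)) <= t)
n_components, labels_graph = connected_components(adjacency)
print(len(set(labels_single)), n_components)   # same number of clusters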
Now we will look at how descriptive data mining for health is carried out. I will explain in detail the steps and methods used for both RapidMiner and Python. For the descriptive task, my group and I decided to do clustering using K-Means and Agglomerative Clustering (Single Link).
Before proceeding to the steps of descriptive data mining, I will show the features we used for the health target.
The features selected are Age, Socio economic, Salary, self-rate health, self care, and f2healthstat.
To find the optimum k value, we used three different methods: the Elbow method, the Silhouette method, and the Davies-Bouldin method.
The Elbow method selects the optimal number of clusters by fitting the model with a range of values for k and plotting a score against each k. The line will usually resemble an arm, and the "elbow", the point of inflection on the curve, marks the optimal k value.
The Silhouette Coefficient is used when the ground-truth labels are unknown; it measures the density of the clusters computed by the model. The score is calculated by averaging the silhouette coefficient of each sample, which is the difference between the mean nearest-cluster distance and the mean intra-cluster distance for that sample, normalized by the maximum of the two.
The Davies-Bouldin Index is defined as the ratio between within-cluster scatter and between-cluster separation, which basically means a ratio of within-cluster distance to between-cluster distance. The objective is to find the k value at which the clusters are the least dispersed internally and the farthest apart from each other, i.e., the value with the lowest index.
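Below is a minimal sketch of how all three scores could be computed over a range of k values. It assumes x is the feature matrix built in the Python code that follows, and random_state=42 is an arbitrary choice for reproducibility:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

ks = range(2, 11)
inertias, silhouettes, davies_bouldins = [], [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(x)
    inertias.append(km.inertia_)                                 # Elbow: look for the bend
    silhouettes.append(silhouette_score(x, km.labels_))          # higher is better
    davies_bouldins.append(davies_bouldin_score(x, km.labels_))  # lower is better

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, scores, title in zip(axes, [inertias, silhouettes, davies_bouldins],
                             ['Elbow (inertia)', 'Silhouette', 'Davies-Bouldin']):
    ax.plot(list(ks), scores, marker='o')
    ax.set_xlabel('k')
    ax.set_title(title)
plt.show()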
All three methods, Elbow, Silhouette, and Davies-Bouldin, showed 6 as the best k value, so we decided to use the optimum k value of 6.
import pandas as pd  # needed for pd.DataFrame below
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
import seaborn as sns
# select the feature columns used for the health clustering
x = data.iloc[:, [0, 1, 3, 4, 6, 8, 10]].values
#For this clustering, we chose 6 clusters
#K-Means
kmeans_6 = KMeans(n_clusters=6)
y_kmeans_6 = kmeans_6.fit_predict(x)
centroids = kmeans_6.cluster_centers_
# insert into predicted table
frame = pd.DataFrame(x)
frame['cluster'] = y_kmeans_6
# to calculate how many are in each cluster
frame['cluster'].value_counts()
Below is the result:
3    371
1    339
5    210
2    150
4    115
0     29
Name: cluster, dtype: int64
#visualizing clustering
#each column of x has an index code; for example,
#we wanted to see the relationship between AGE (index 0) and SELF_RATE_HEALTH (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4], c=y_kmeans_6, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH')
plt.show()
As we can see from the visualization of self-rate health against age, six clusters are visible and the biggest cluster is purple. The points fall along horizontal lines because self-rate health has fixed values, namely 1, 2, 3, and 4.
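Because self-rate health takes only four discrete values, many points overlap exactly. One hypothetical way to make the cluster density more visible is to add a small display-only jitter on that axis:

import numpy as np

# add small random noise on the discrete axis for display purposes only;
# the underlying data and cluster assignments are unchanged
rng = np.random.default_rng(0)
jitter = rng.uniform(-0.15, 0.15, size=len(x))
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4] + jitter, c=y_kmeans_6, cmap='rainbow', s=10)
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH (jittered)')
plt.show()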
# import hierarchical clustering libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
#create a dendrogram
Z = linkage(x, 'single')
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrograms (Single-linkage)")
dn = dendrogram(Z)
plt.show()
For the dendrogram, we used Z = linkage(x, 'single') because we decided to use single link, which is why the linkage method is 'single'. Based on the dendrogram alone, it is hard to identify the right k, so we used the k value suggested by the Elbow, Silhouette, and Davies-Bouldin methods, which is 6.
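As a side note, the same six single-link clusters could also be cut directly from the scipy linkage matrix Z; this is a hypothetical alternative to the scikit-learn estimator used below:

from scipy.cluster.hierarchy import fcluster

# cut the existing single-link tree Z into exactly 6 flat clusters
labels_scipy = fcluster(Z, t=6, criterion='maxclust')   # labels are 1-based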
# create clusters
agglo1 = AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='single')
# save clusters for chart
y_agglo1 = agglo1.fit_predict(x)
The number of clusters is 6. For the affinity, the metric used to compute the linkage, we used Euclidean, while the linkage criterion is set to single.
# insert into predicted table
frame = pd.DataFrame(x)
frame['cluster'] = y_agglo1
# to calculate how many are in each cluster
frame['cluster'].value_counts()
The result:
4    1185
2      14
5       7
1       4
3       2
0       2
Name: cluster, dtype: int64
#visualizing clustering
#each column of x has an index code; for example,
#we wanted to see the relationship between AGE (index 0) and SELF_RATE_HEALTH (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x[:, 0], x[:, 4], c=y_agglo1, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SELF_RATE_HEALTH')
plt.show()
Based on this visualization, only four clusters are visible and the biggest cluster is orange. Since we used agglomerative clustering (single link) for this one, the clusters differ from K-Means: the K-Means clusters were fairly balanced in size, whereas single link produced one dominant cluster of 1185 points and several tiny ones, which is typical of single link's chaining effect.
The first operator retrieves the data that will be used for the clustering.
By using this operator, we can choose only the attributes required for the health target. If we look at the column on the right (Selected Attributes), these are all of the attributes we chose for health.
We used this operator because the data for f2healthstat is nominal, so we need to convert the nominal data to numerical. Only after that can we continue with the clustering, since K-Means requires numerical data.
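For readers following along in Python, a rough equivalent of this conversion is one-hot (dummy) coding; the column values below are hypothetical:

import pandas as pd

# hypothetical illustration of nominal-to-numerical conversion via dummy coding
df = pd.DataFrame({'f2healthstat': ['good', 'fair', 'poor', 'good']})
encoded = pd.get_dummies(df, columns=['f2healthstat'])
print(encoded)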
To create clusters, we used the Clustering operator (K-Means). The number of clusters was set to 6, since that is the optimum k value suggested by the three methods stated previously: the Elbow method, the Silhouette score, and the Davies-Bouldin score. As for the measure type, we used Euclidean.
We used the Performance operator to evaluate the clusters with the Davies-Bouldin score. The criterion was set to average within centroid since we are doing K-Means, which is centroid-based. We also enabled the maximize option, because without it the result is reported as a negative value; since we wanted to read the Davies-Bouldin score directly, we made the result positive. Meanwhile, to visualize the clusters, we used the Cluster Model Visualizer operator.
First, we retrieved the dataset that will be used for this agglomerative clustering by using the Retrieve operator, as shown below.
Before continuing with the clustering, we selected only the required attributes. The attributes under the "Selected Attributes" column are the ones that will be used in this clustering.
To create the clusters, we used the Clustering (Agglomerative Clustering) operator. The mode was set to SingleLink since we decided to do single-link agglomerative clustering. The measure type was set to MixedMeasures because f2healthstat is nominal, not numerical. Lastly, for the measurement we used MixedEuclideanDistance.