K-Means clustering is one of the simplest and most widely used unsupervised machine learning algorithms. The K-Means algorithm identifies k centroids and then allocates every data point to the nearest centroid, keeping each cluster as compact as possible.
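As a toy illustration of those two steps (assigning points to the nearest centroid, then recomputing each centroid as the mean of its points), here is a minimal NumPy sketch with made-up data, not the actual project code:

import numpy as np

# Toy data: two obvious groups of points and two deliberately bad starting centroids
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)     # [0 0 0 1 1 1]
print(centroids)  # centroids settle near (1.03, 0.97) and (8.0, 8.0)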
Hierarchical clustering treats each data point as a singleton cluster and successively merges clusters until all points have been merged into a single remaining cluster. In single-link clustering, if d_n is the distance between the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most d_n, then the clusters after step n are the connected components of G(n).
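To make the single-link definition concrete, the toy SciPy sketch below (made-up data, not the project dataset) shows that cutting the single-link merges at a distance threshold recovers the connected components of G(n) at that threshold:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups of points separated by a large gap
points = np.array([[0.0], [0.5], [1.0], [10.0], [10.5]])

# Single link: the distance between two clusters is the minimum pairwise distance
Z = linkage(points, method='single')

# Cutting at distance 2.0 keeps only the merges with d_n <= 2.0,
# i.e. the connected components of G(n) at that threshold
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)  # e.g. [1 1 1 2 2]: {0, 0.5, 1} form one cluster, {10, 10.5} the other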
Now we will look at how descriptive data mining for the financial level is carried out. I will explain in detail the steps and methods used in both RapidMiner and Python. For descriptive data mining, my group and I decided to do clustering using K-Means and Agglomerative Clustering (Single Link).
Before proceeding to the steps of descriptive data mining, I will show the features we used for the financial target.
The features selected are Age, Marital status, Level of education, Socio-economic status, Salary, Self-rated health, Financial well-being and Leveldistress.
To find the optimum k value, we used three different methods: the Elbow method, the Silhouette method and the Davies-Bouldin method.
# Select the eight feature columns used for the financial target
x1 = data.iloc[:, [0, 1, 2, 3, 4, 6, 7, 11]].values
The Elbow method selects the optimal number of clusters by fitting the model over a range of values of k and plotting the error. The line will usually resemble an arm, and the elbow, the point of inflection on the curve, marks the optimal k value.
#Using the ELBOW METHOD to identify the best k value
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Error = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i).fit(x1)   # fit K-Means for each candidate k
    Error.append(kmeans.inertia_)           # inertia = within-cluster sum of squared distances
plt.plot(range(2, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
The Silhouette Coefficient is used when the ground-truth labels are unknown; it measures how dense and well separated the clusters computed by the model are. The score is calculated by averaging the silhouette coefficient over all samples, where each sample's coefficient is the difference between its mean nearest-cluster distance b and its mean intra-cluster distance a, normalized by the maximum of the two: s = (b - a) / max(a, b).
#Using the Silhouette score to identify the best k value
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

Silhouette = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    samplem = kmeans.fit_predict(x1)                   # cluster labels for each sample
    Silhouette.append(silhouette_score(x1, samplem))   # higher score = denser, better-separated clusters
plt.plot(range(2, 11), Silhouette)
plt.title('Silhouette score')
plt.xlabel('No of clusters')
plt.ylabel('Score')
plt.show()
The Davies-Bouldin index is defined as a ratio between the cluster scatter and the cluster separation, which basically means a ratio of within-cluster distance to between-cluster distance. The objective is to find the value of k at which the clusters are least dispersed internally and farthest apart from each other, i.e. the k with the lowest index.
#Using the Davies-Bouldin score to identify the best k value
from sklearn.metrics import davies_bouldin_score

DaviesBouldin = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    samplem = kmeans.fit_predict(x1)                           # cluster labels for each sample
    DaviesBouldin.append(davies_bouldin_score(x1, samplem))    # lower score = better clustering
plt.plot(range(2, 11), DaviesBouldin)
plt.title('Davies-Bouldin score')
plt.xlabel('No of clusters')
plt.ylabel('Score')
plt.show()
Based on the three different methods, we can conclude that the optimum k value is 6. The Elbow method, the Silhouette score and the Davies-Bouldin score all indicate that the best k value is 6.
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#For this clustering, we choose 6 clusters
#K-Means
kmeans6 = KMeans(n_clusters=6)
y_kmeans6 = kmeans6.fit_predict(x1)
centroids = kmeans6.cluster_centers_
# insert into predicted table
frame = pd.DataFrame(x1)
frame['cluster'] = y_kmeans6
# to calculate how many are in each cluster
frame['cluster'].value_counts()
Below is the result:
3 371
1 339
5 210
4 150
0 115
2 29
Name: cluster, dtype: int64
#visualizing the clustering
#each feature column has an index code: for example,
#to see the relationship between AGE (index 0) and SALARY (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x1[:, 0], x1[:, 4], c=y_kmeans6, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SALARY')
plt.show()
Based on this visualization, we can see there are six clusters altogether. The biggest cluster is the purple one, in which ages range from 22 to 55 and salaries range from 0 to 3000; this cluster has respondents from all age groups. The smallest cluster is the red one, in which ages range from 41 to 55 and salaries range from 11000 to 16000. We can say that this cluster is the smallest yet has the highest salaries because these respondents have been working for a long time and may all be professionals.
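To back this reading of the scatter plot with numbers, the K-Means centroids can also be inspected directly; a small sketch reusing the centroids and frame variables defined above (assuming column 0 is Age and column 4 is Salary, as in the plot):

# Profile each cluster by its centroid's Age and Salary values and its size
centroid_profile = pd.DataFrame(centroids[:, [0, 4]], columns=['AGE', 'SALARY'])
centroid_profile['size'] = frame['cluster'].value_counts().sort_index()
print(centroid_profile.sort_values('SALARY'))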
# import hierarchical clustering libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
#create a dendrogram
Z = linkage(x1, 'single')
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrograms (Single-linkage)")
dn = dendrogram(Z)
plt.show()
From the dendrogram, it is difficult to determine the k value since one branch is much bigger than the other. Thus, we decided to use k = 6, as suggested by the Elbow method, Silhouette score and Davies-Bouldin score, to visualize the clusters later on.
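As an optional aid (not part of the original run), the dendrogram can also be truncated so that only the last few merges are drawn, which makes the branch structure easier to read; a small sketch reusing the Z linkage matrix from above:

# Truncated dendrogram: show only the last 12 merged clusters
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrogram (Single-linkage, truncated)")
dendrogram(Z, truncate_mode='lastp', p=12)
plt.show()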
# create clusters
agglo2 = AgglomerativeClustering(n_clusters=6, affinity = 'euclidean', linkage = 'single')
# save clusters for chart
y_agglo2 = agglo2.fit_predict(x1)
The number of clusters is 6. For the affinity, the metric used to compute the linkage, we used Euclidean distance, while the linkage criterion is set to single.
# insert into predicted table
frame = pd.DataFrame(x1)
frame['cluster'] = y_agglo2
# to calculate how many are in each cluster
frame['cluster'].value_counts()
4 1185
2 14
5 7
1 4
3 2
0 2
Name: cluster, dtype: int64
#visualizing the clustering
#each feature column has an index code: for example,
#to see the relationship between AGE (index 0) and SALARY (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x1[:, 0], x1[:, 4], c=y_agglo2, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SALARY')
plt.show()
Based on this visualization, six clusters can be seen. The biggest cluster is the orange one, in which ages range from 22 to 55 and salaries range from zero to 10000. The smallest clusters are the purple and light green ones. The light green cluster has the highest salary, around 16000, with ages around 48 to 55.
Now we will look at clustering using RapidMiner.
The first operator is used to retrieve data that will be used for the clustering.
By using this operator, we can choose only the attributes required for the financial level. The column on the right (Selected Attributes) lists all of the attributes we chose for financial.
The reason we used this operator is that the data for Leveldistress is nominal, so we need it to convert the nominal data to numerical. Only after that can we continue with the clustering, since K-Means requires numerical data.
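For comparison, a rough Python sketch of one way to do the same nominal-to-numerical conversion (simple integer coding; the RapidMiner operator also supports dummy coding, and the category labels below are made up for illustration):

import pandas as pd

# Hypothetical nominal column; the real Leveldistress labels may differ
df = pd.DataFrame({'Leveldistress': ['low', 'high', 'moderate', 'low']})

# Integer-code the nominal values so clustering can treat them as numerical
df['Leveldistress'] = df['Leveldistress'].astype('category').cat.codes
print(df['Leveldistress'].tolist())  # [1, 0, 2, 1] -- codes follow alphabetical category order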
To create the clusters, we used the Clustering (k-Means) operator. The number of clusters was set to 6, since the optimum k value is 6 as calculated using the three methods stated previously: the Elbow method, the Silhouette score and the Davies-Bouldin score. As for the measure type, we are using Euclidean distance.
We used the Performance operator to evaluate the clusters' performance with the Davies-Bouldin score. The criterion was set to average within centroid distance since we are doing K-Means, and we chose to maximize the result; otherwise the result would be reported as a negative value, and since we are going to observe the Davies-Bouldin score we wanted it to be positive. Meanwhile, to visualize the clusters, we used the Cluster Model Visualizer operator.
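As a cross-check on the RapidMiner performance output, the same Davies-Bouldin score can be computed in Python for the k = 6 K-Means solution found earlier; a small sketch reusing the x1 and y_kmeans6 variables from the Python section (scikit-learn reports the score as a positive value, lower being better):

from sklearn.metrics import davies_bouldin_score

# Davies-Bouldin score for the final k = 6 K-Means clustering
db = davies_bouldin_score(x1, y_kmeans6)
print('Davies-Bouldin score for k = 6:', db)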
First, we retrieved the dataset that will be used for this agglomerative clustering by using the Retrieve operator, as shown below.
Before continuing with the clustering, we selected only the required attributes. The attributes under the "Selected Attributes" column are the attributes that will be used in this clustering.
To create the clusters, we used the Clustering (Agglomerative Clustering) operator. The mode was set to SingleLink since we decided to do single-link agglomerative clustering. The measure type was set to MixedMeasures because Leveldistress is nominal, not numerical. Lastly, for the measurement we used MixedEuclideanDistance.