K-Means clustering is one of the simplest and most widely used unsupervised machine learning algorithms. The K-Means algorithm identifies k centroids and then allocates every data point to the nearest centroid, keeping each cluster as compact as possible.
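As a toy illustration of those two steps (assigning points to the nearest centroid, then recomputing each centroid as the mean of its points), here is a minimal NumPy sketch with made-up data, not the actual project code:

import numpy as np

# Toy data: two obvious groups of points and two deliberately bad starting centroids
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)     # [0 0 0 1 1 1]
print(centroids)  # centroids settle near (1.03, 0.97) and (8.0, 8.0)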
Hierarchical clustering treats each data point as a singleton cluster and successively merges clusters until all points have been merged into a single remaining cluster. In single-link clustering, if d_n is the distance between the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most d_n, then the clusters after step n are the connected components of G(n).
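To make the single-link definition concrete, the toy SciPy sketch below (made-up data, not the project dataset) shows that cutting the single-link merges at a distance threshold recovers the connected components of G(n) at that threshold:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups of points separated by a large gap
points = np.array([[0.0], [0.5], [1.0], [10.0], [10.5]])

# Single link: the distance between two clusters is the minimum pairwise distance
Z = linkage(points, method='single')

# Cutting at distance 2.0 keeps only the merges with d_n <= 2.0,
# i.e. the connected components of G(n) at that threshold
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)  # e.g. [1 1 1 2 2]: {0, 0.5, 1} form one cluster, {10, 10.5} the other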
Now we will look at how descriptive data mining for the financial level is carried out. I will explain in detail the steps and methods used in both RapidMiner and Python. For descriptive data mining, my group and I decided to do clustering using K-Means and Agglomerative Clustering (Single Link).
Before proceeding to the steps of descriptive data mining, I will show the features we used for the financial target.
The features selected are Age, Marital status, Level of education, Socio-economic status, Salary, Self-rated health, Financial well-being and Leveldistress.
To find the optimum k value, we used three different methods: the Elbow method, the Silhouette method and the Davies-Bouldin method.
# Select the eight feature columns used for the financial target
x1 = data.iloc[:, [0, 1, 2, 3, 4, 6, 7, 11]].values
The Elbow method selects the optimal number of clusters by fitting the model over a range of values of k and plotting the error. The line will usually resemble an arm, and the elbow, the point of inflection on the curve, marks the optimal k value.
#Using the ELBOW METHOD to identify the best k value
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Error = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i).fit(x1)   # fit K-Means for each candidate k
    Error.append(kmeans.inertia_)           # inertia = within-cluster sum of squared distances
plt.plot(range(2, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
The Silhouette Coefficient is used when the ground-truth labels are unknown; it measures how dense and well separated the clusters computed by the model are. The score is calculated by averaging the silhouette coefficient over all samples, where each sample's coefficient is the difference between its mean nearest-cluster distance b and its mean intra-cluster distance a, normalized by the maximum of the two: s = (b - a) / max(a, b).
#Using the Silhouette score to identify the best k value
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

Silhouette = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    samplem = kmeans.fit_predict(x1)                   # cluster labels for each sample
    Silhouette.append(silhouette_score(x1, samplem))   # higher score = denser, better-separated clusters
plt.plot(range(2, 11), Silhouette)
plt.title('Silhouette score')
plt.xlabel('No of clusters')
plt.ylabel('Score')
plt.show()
The Davies-Bouldin index is defined as a ratio between the cluster scatter and the cluster separation, which basically means a ratio of within-cluster distance to between-cluster distance. The objective is to find the value of k at which the clusters are least dispersed internally and farthest apart from each other, i.e. the k with the lowest index.
#Using the Davies-Bouldin score to identify the best k value
from sklearn.metrics import davies_bouldin_score

DaviesBouldin = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    samplem = kmeans.fit_predict(x1)                           # cluster labels for each sample
    DaviesBouldin.append(davies_bouldin_score(x1, samplem))    # lower score = better clustering
plt.plot(range(2, 11), DaviesBouldin)
plt.title('Davies-Bouldin score')
plt.xlabel('No of clusters')
plt.ylabel('Score')
plt.show()
Based on the three different methods, we can conclude that the optimum k value is 6. The Elbow method, the Silhouette score and the Davies-Bouldin score all indicate that the best k value is 6.
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#For this clustering, we choose 6 clusters
#K-Means
kmeans6 = KMeans(n_clusters=6)
y_kmeans6 = kmeans6.fit_predict(x1)
centroids = kmeans6.cluster_centers_
# insert into predicted table
frame = pd.DataFrame(x1)
frame['cluster'] = y_kmeans6
# to calculate how many are in each cluster
frame['cluster'].value_counts()
Below is the result:
3 371
1 339
5 210
4 150
0 115
2 29
Name: cluster, dtype: int64
#visualizing the clustering
#each feature column has an index code: for example,
#to see the relationship between AGE (index 0) and SALARY (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x1[:, 0], x1[:, 4], c=y_kmeans6, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SALARY')
plt.show()
Based on this visualization, we can see there are six clusters altogether. The biggest cluster is the purple one, in which ages range from 22 to 55 and salaries range from 0 to 3000; this cluster has respondents from all age groups. The smallest cluster is the red one, in which ages range from 41 to 55 and salaries range from 11000 to 16000. We can say that this cluster is the smallest yet has the highest salaries because these respondents have been working for a long time and may all be professionals.
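To back this reading of the scatter plot with numbers, the K-Means centroids can also be inspected directly; a small sketch reusing the centroids and frame variables defined above (assuming column 0 is Age and column 4 is Salary, as in the plot):

# Profile each cluster by its centroid's Age and Salary values and its size
centroid_profile = pd.DataFrame(centroids[:, [0, 4]], columns=['AGE', 'SALARY'])
centroid_profile['size'] = frame['cluster'].value_counts().sort_index()
print(centroid_profile.sort_values('SALARY'))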
# import hierarchical clustering libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
#create a dendrogram
Z = linkage(x1, 'single')
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrograms (Single-linkage)")
dn = dendrogram(Z)
plt.show()
From the dendrogram, it is difficult to determine the k value since one branch is much bigger than the other. Thus, we decided to use k = 6, as suggested by the Elbow method, Silhouette score and Davies-Bouldin score, to visualize the clusters later on.
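As an optional aid (not part of the original run), the dendrogram can also be truncated so that only the last few merges are drawn, which makes the branch structure easier to read; a small sketch reusing the Z linkage matrix from above:

# Truncated dendrogram: show only the last 12 merged clusters
fig = plt.figure(figsize=(20, 10))
plt.title("Dendrogram (Single-linkage, truncated)")
dendrogram(Z, truncate_mode='lastp', p=12)
plt.show()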
# create clusters
agglo2 = AgglomerativeClustering(n_clusters=6, affinity = 'euclidean', linkage = 'single')
# save clusters for chart
y_agglo2 = agglo2.fit_predict(x1)
The number of clusters is 6. For the affinity, the metric used to compute the linkage, we used Euclidean distance, while the linkage criterion is set to single.
# insert into predicted table
frame = pd.DataFrame(x1)
frame['cluster'] = y_agglo2
# to calculate how many are in each cluster
frame['cluster'].value_counts()
4 1185
2 14
5 7
1 4
3 2
0 2
Name: cluster, dtype: int64
#visualizing the clustering
#each feature column has an index code: for example,
#to see the relationship between AGE (index 0) and SALARY (index 4)
plt.figure(figsize=(7, 5))
plt.scatter(x1[:, 0], x1[:, 4], c=y_agglo2, cmap='rainbow')
plt.xlabel('AGE')
plt.ylabel('SALARY')
plt.show()
Based on this visualization, six clusters can be seen. The biggest cluster is the orange one, in which ages range from 22 to 55 and salaries range from zero to 10000. The smallest clusters are the purple and light green ones. The light green cluster has the highest salary, around 16000, with ages around 48 to 55.
Now we will look at clustering using RapidMiner.
The first operator is used to retrieve data that will be used for the clustering.
By using this operator, we can choose only the attributes required for the financial level. The column on the right (Selected Attributes) lists all of the attributes we chose for financial.
The reason we used this operator is that the data for Leveldistress is nominal, so we need it to convert the nominal data to numerical. Only after that can we continue with the clustering, since K-Means requires numerical data.
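For comparison, a rough Python sketch of one way to do the same nominal-to-numerical conversion (simple integer coding; the RapidMiner operator also supports dummy coding, and the category labels below are made up for illustration):

import pandas as pd

# Hypothetical nominal column; the real Leveldistress labels may differ
df = pd.DataFrame({'Leveldistress': ['low', 'high', 'moderate', 'low']})

# Integer-code the nominal values so clustering can treat them as numerical
df['Leveldistress'] = df['Leveldistress'].astype('category').cat.codes
print(df['Leveldistress'].tolist())  # [1, 0, 2, 1] -- codes follow alphabetical category order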
To create the clusters, we used the Clustering (k-Means) operator. The number of clusters was set to 6, since the optimum k value is 6 as calculated using the three methods stated previously: the Elbow method, the Silhouette score and the Davies-Bouldin score. As for the measure type, we are using Euclidean distance.
We used the Performance operator to evaluate the clusters' performance with the Davies-Bouldin score. The criterion was set to average within centroid distance since we are doing K-Means, and we chose to maximize the result; otherwise the result would be reported as a negative value, and since we are going to observe the Davies-Bouldin score we wanted it to be positive. Meanwhile, to visualize the clusters, we used the Cluster Model Visualizer operator.
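As a cross-check on the RapidMiner performance output, the same Davies-Bouldin score can be computed in Python for the k = 6 K-Means solution found earlier; a small sketch reusing the x1 and y_kmeans6 variables from the Python section (scikit-learn reports the score as a positive value, lower being better):

from sklearn.metrics import davies_bouldin_score

# Davies-Bouldin score for the final k = 6 K-Means clustering
db = davies_bouldin_score(x1, y_kmeans6)
print('Davies-Bouldin score for k = 6:', db)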
First, we retrieved the dataset that will be used for this agglomerative clustering by using the Retrieve operator, as shown below.
Before continuing with the clustering, we selected only the required attributes. The attributes under the "Selected Attributes" column are the attributes that will be used in this clustering.
To create the clusters, we used the Clustering (Agglomerative Clustering) operator. The mode was set to SingleLink since we decided to do single-link agglomerative clustering. The measure type was set to MixedMeasures because Leveldistress is nominal, not numerical. Lastly, for the measurement we used MixedEuclideanDistance.