1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluating a Classification Method Using the ROC Curve
2.3. Comparing Logistic Regression, Neural Networks, and Ensembles
2.4. Fruits or Not: Split First, or Encode and Scale First?
The next code reads and preprocesses the OCDB data stored in an Excel file with multiple sheets, as done in Track 06, section "2.6. Gaussian Mixture on OCDB":
https://colab.research.google.com/drive/1XviDrKZ3RTBks8vEqqD7Fa6ED6XJ4wqF?usp=sharing
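For reference, a minimal sketch of the reading step, assuming a hypothetical file name ocdb.xlsx (the actual path is set in the linked notebook). Passing sheet_name=None to pandas.read_excel returns one DataFrame per sheet, which can then be stacked into a single table.
import pandas as pd
# Hypothetical file name; the actual path is defined in the linked notebook.
sheets = pd.read_excel('ocdb.xlsx', sheet_name=None)  # dict: sheet name -> DataFrame
df = pd.concat(sheets.values(), ignore_index=True)    # stack all sheets into one table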
The code should be modified to separate the data into three groups instead of four, as done in the next notebook:
https://colab.research.google.com/drive/1bVAtSKvnkYOFoMp3ihcm8PbrvB4AnTaN?usp=sharing
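In that notebook the change amounts to fitting the Gaussian Mixture with three components instead of four. A minimal sketch, assuming x is the one-dimensional NumPy array of HS6 weights used in the notebook:
from sklearn.mixture import GaussianMixture
# Three components instead of four; x is assumed to be a 1-D array of weights.
gmm = GaussianMixture(n_components=3, random_state=42)
target_class = gmm.fit_predict(x.reshape(-1, 1))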
Building on the previous notebook, the following code should be added to evaluate K-means for different numbers of cluster centers k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x.reshape(-1, 1))
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The Elbow method can be automated with the kneed package, which locates the point of maximum curvature in the WCSS curve, as the next commands show.
!pip install kneed
from kneed import KneeLocator
kl = KneeLocator(range(1, 11), wcss, curve="convex", direction="decreasing")
kl.elbow
3
Since the optimal number of clusters is K = 3, the next code fits K-means with three clusters to classify the data into three groups.
from sklearn.cluster import KMeans
# Fit K-means with K = 3 and assign a cluster label to each point.
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
yhat = kmeans.fit_predict(x.reshape(-1, 1))
yhat
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
The next commands plot the labeling produced by K-means.
n_clusters = 3
list_x = [[] for d in range(n_clusters)]
list_y = [[] for d in range(n_clusters)]
k = 0
for elem in yhat:
    # Swap labels 1 and 2 so the cluster indices increase from left to right.
    if elem == 1:
        ind = 2
    elif elem == 2:
        ind = 1
    else:
        ind = elem
    list_x[ind].append(x[k])
    list_y[ind].append(y[k])
    k = k + 1
list_colors = ['red', 'orange', 'green']
for cluster, color in zip(range(n_clusters), list_colors):
    plt.scatter(list_x[cluster], list_y[cluster], color=color, label=f'Cluster {cluster}')
plt.legend()
plt.show()
The next code compares the labels produced by the Gaussian Mixture model and by K-means.
# Renumber the GMM classes so that both methods use a consistent ordering.
ygmm = target_class.copy()
for i in range(len(target_class)):
    value = ygmm[i]
    if value == 2:
        elem = 0
    elif value == 0:
        elem = 2
    else:
        elem = value
    ygmm[i] = elem
# Renumber the K-means classes in the same way.
ykme = yhat.copy()
for i in range(len(yhat)):
    value = ykme[i]
    if value == 2:
        elem = 1
    elif value == 1:
        elem = 2
    else:
        elem = value
    ykme[i] = elem
print('Classification using GMM: \n', ygmm)
print('Classification using K-means: \n', ykme)
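Beyond visual inspection of the two printed arrays, the agreement between the methods can be quantified. A short sketch, assuming ygmm and ykme from the code above:
import numpy as np
from sklearn.metrics import confusion_matrix
# Fraction of points on which GMM and K-means agree after relabeling.
agreement = np.mean(np.asarray(ygmm) == np.asarray(ykme))
print(f'Agreement: {agreement:.2%}')
# Rows are GMM labels, columns are K-means labels.
print(confusion_matrix(ygmm, ykme))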
The Python code with all the steps is collected in this Google Colab notebook (click the link):
https://colab.research.google.com/drive/1J2IgGQbgCmvyxgDZHw3bkMNS79gikelb?usp=sharing