1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluating a Classification Method Using the ROC Curve
2.3. Comparing Logistic Regression, Neural Networks, and Ensembles
2.4. Fruits or Not: Split First, or Encode and Scale First?
The next code reads and preprocesses the OCDB data stored in an Excel file with multiple sheets, as done in Track 06, section "2.6. Gaussian Mixture on OCDB":
https://colab.research.google.com/drive/1XviDrKZ3RTBks8vEqqD7Fa6ED6XJ4wqF?usp=sharing
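For reference, a minimal sketch of the reading step, assuming a hypothetical file name ocdb.xlsx (the actual path is set in the linked notebook). Passing sheet_name=None to pandas.read_excel returns one DataFrame per sheet, which can then be stacked into a single table.
import pandas as pd
# Hypothetical file name; the actual path is defined in the linked notebook.
sheets = pd.read_excel('ocdb.xlsx', sheet_name=None)  # dict: sheet name -> DataFrame
df = pd.concat(sheets.values(), ignore_index=True)    # stack all sheets into one table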
The code should be modified to separate the data into three groups instead of four, as done in the next notebook:
https://colab.research.google.com/drive/1bVAtSKvnkYOFoMp3ihcm8PbrvB4AnTaN?usp=sharing
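In that notebook the change amounts to fitting the Gaussian Mixture with three components instead of four. A minimal sketch, assuming x is the one-dimensional NumPy array of HS6 weights used in the notebook:
from sklearn.mixture import GaussianMixture
# Three components instead of four; x is assumed to be a 1-D array of weights.
gmm = GaussianMixture(n_components=3, random_state=42)
target_class = gmm.fit_predict(x.reshape(-1, 1))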
Building on the previous notebook, the following code should be added to evaluate K-means for different numbers of cluster centers k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x.reshape(-1, 1))
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The Elbow method can be automated with the kneed package, which locates the point of maximum curvature in the WCSS curve, as the next commands show.
!pip install kneed
from kneed import KneeLocator
kl = KneeLocator(range(1, 11), wcss, curve="convex", direction="decreasing")
kl.elbow
3
Since the optimal number of clusters is K = 3, the next code fits K-means with three clusters to classify the data into three groups.
from sklearn.cluster import KMeans
# Fit K-means with K = 3 and assign a cluster label to each point.
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
yhat = kmeans.fit_predict(x.reshape(-1, 1))
yhat
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
The next commands plot the labeling produced by K-means.
n_clusters = 3
list_x = [[] for d in range(n_clusters)]
list_y = [[] for d in range(n_clusters)]
k = 0
for elem in yhat:
    # Swap labels 1 and 2 so the cluster indices increase from left to right.
    if elem == 1:
        ind = 2
    elif elem == 2:
        ind = 1
    else:
        ind = elem
    list_x[ind].append(x[k])
    list_y[ind].append(y[k])
    k = k + 1
list_colors = ['red', 'orange', 'green']
for cluster, color in zip(range(n_clusters), list_colors):
    plt.scatter(list_x[cluster], list_y[cluster], color=color, label=f'Cluster {cluster}')
plt.legend()
plt.show()
The next code compares the labels produced by the Gaussian Mixture model and by K-means.
# Renumber the GMM classes so that both methods use a consistent ordering.
ygmm = target_class.copy()
for i in range(len(target_class)):
    value = ygmm[i]
    if value == 2:
        elem = 0
    elif value == 0:
        elem = 2
    else:
        elem = value
    ygmm[i] = elem
# Renumber the K-means classes in the same way.
ykme = yhat.copy()
for i in range(len(yhat)):
    value = ykme[i]
    if value == 2:
        elem = 1
    elif value == 1:
        elem = 2
    else:
        elem = value
    ykme[i] = elem
print('Classification using GMM: \n', ygmm)
print('Classification using K-means: \n', ykme)
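Beyond visual inspection of the two printed arrays, the agreement between the methods can be quantified. A short sketch, assuming ygmm and ykme from the code above:
import numpy as np
from sklearn.metrics import confusion_matrix
# Fraction of points on which GMM and K-means agree after relabeling.
agreement = np.mean(np.asarray(ygmm) == np.asarray(ykme))
print(f'Agreement: {agreement:.2%}')
# Rows are GMM labels, columns are K-means labels.
print(confusion_matrix(ygmm, ykme))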
The Python code with all the steps is collected in this Google Colab notebook (click the link):
https://colab.research.google.com/drive/1J2IgGQbgCmvyxgDZHw3bkMNS79gikelb?usp=sharing