1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture vs. K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
Clustering, or grouping, consists of applying computational techniques to separate a set of data into different groups based on their similarities. Unlike classification and regression algorithms, clustering belongs to the unsupervised learning family, in which algorithms must learn the relationships between data points without any prior category labels.
The next figure illustrates this concept.
Grouping techniques are used for several purposes; the main ones are:
In data analysis problems, clustering helps extract patterns from the database under study. This is very common in customer segmentation, where grouping reveals a company's main user profiles and allows acquisition and retention strategies to be optimized.
In outlier identification, dividing the data into different groups makes it easier to spot observations that are not similar to any others or that provide no information gain. An example would be identifying very unusual values for the height and width of an animal in a dataset of dog breeds: if a single individual is not close to any grouping (dog breed), it can be flagged as anomalous.
In feature engineering, new unclassified data (without a label) can gain a label based on which group, established by the clustering method, it is closest to. For example, suppose we want to add new height and weight data for a person in a survey but have no sex information; in that case, we can cluster the full dataset and assign the new observation to the closest group (see the sketch after this list).
When building automatic filters, clustering techniques can be used to separate the data. An example is an unsupervised spam filter, in which messages are grouped into two clusters so that those with similarly suspicious content end up together, for example, messages containing words such as “won”, “congratulations” and “prize draw”.
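As a minimal sketch of the feature-engineering use above (assuming a small, purely illustrative height/weight dataset, not data from this track), a fitted K-means model can assign a new, unlabeled observation to its nearest cluster:
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical (height cm, weight kg) observations with no labels
X = np.array([[160, 55], [165, 60], [158, 52],
              [180, 85], [185, 90], [178, 82]])
# Group the data into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
# A new unlabeled observation gains the label of the closest cluster
new_person = np.array([[182, 88]])
print(kmeans.predict(new_person))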
There are several types of clustering algorithms; the main difference between them is how their complexity grows with the size of the dataset. Understanding the different types therefore helps in choosing the ones that give the best results for your data. They can be grouped into four families:
Centroid-based Clustering
Centroid-based clustering techniques start from a given number of groups, find the centroids (geometric centers) that represent the “middle” of each cluster and, from them, identify which cluster each point belongs to based on its distance to each centroid.
Once the centroids have been identified, classifying all points into clusters becomes very simple, since it only requires computing the distance from each point to the centroids; algorithms of this type are therefore usually efficient.
However, because of this formulation, this type of clustering does not usually handle outliers adequately, assigning them to the cluster of the nearest centroid. Representatives of this type of algorithm are K-Means and Mini Batch K-Means.
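A brief illustration of the centroid-based approach (a sketch on synthetic blobs, not this track's data): both estimators named above expose the fitted centroids through cluster_centers_.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# Both algorithms assign each point to its nearest centroid
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # geometric centers of the clusters
print(mbk.cluster_centers_)  # similar centers, but faster on large datasets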
Density-based Clustering
Density-based clustering algorithms aim to identify regions with a high concentration of points and connect them into clusters. This type of method can find clusters with arbitrary geometries and can also identify outliers. However, this approach does not usually work well when the clusters in the data have different densities.
The best-known representative of this type of clustering is DBSCAN.
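A minimal DBSCAN sketch (on synthetic moon-shaped data, chosen only to show non-spherical clusters and the outlier label):
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Two interleaving half-circles: a shape centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Points that do not belong to any dense region receive the label -1 (outliers)
print(set(db.labels_))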
Distribution-based Clustering
Algorithms of this type assume the data come from different distributions and define each fitted distribution as a different cluster. In other words, the algorithms try to fit different distributions to the data and relate each distribution to a cluster. Each point can then be classified probabilistically: given the distributions of each group, the probability of the point belonging to each distribution can be estimated.
It is also worth noting that these algorithms are not very useful when the type of data distribution is not known or when, for some reason, the clusters in the data follow different types of distributions (Gaussian, Poisson, Pearson, etc.).
An example of this type is the Gaussian Mixture algorithm, which tries to fit different Gaussian models to the data.
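To illustrate the probabilistic assignment just described (a sketch on synthetic blobs, not the track's data), GaussianMixture exposes per-component membership probabilities through predict_proba:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=2, random_state=42)
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
# Probability of each point belonging to each of the two Gaussian components
print(gmm.predict_proba(X[:5]))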
Hierarchical Clustering
Algorithms of this type aim to group similar data using tools such as the distance between points. What distinguishes them from the others is that they create nested clusters (some groupings inside others), ultimately generating a tree of clusters in which each data point belongs to both smaller and larger groups, thus forming a hierarchy. Given this hierarchy, we can control how similar the data must be to belong to the same group, which makes the algorithm “go up” or “go down” the cluster hierarchy. Conversely, we can specify how many clusters we want, which leads to the configuration of the hierarchy that yields that number.
However, these algorithms have two main disadvantages: they do not deal very well with outliers and are inefficient when dealing with a lot of data.
The main example of this type is BIRCH.
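A short sketch of controlling the hierarchy (using scikit-learn's AgglomerativeClustering as the hierarchical example here, an assumption, since the text names only BIRCH): we can either fix the number of clusters or fix a distance threshold that decides how similar points must be to be merged.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
# Option 1: ask for a fixed number of clusters (cut the tree at 3 groups)
fixed_k = AgglomerativeClustering(n_clusters=3).fit(X)
# Option 2: ask for a similarity level instead (cut the tree at a distance threshold)
threshold = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0).fit(X)
print(fixed_k.n_clusters_, threshold.n_clusters_)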
The next figure summarizes the concepts behind the four groups of clustering techniques.
The next figure, from [1], presents the results and the computational effort (the number in the lower right corner of each subfigure) of several clustering algorithms on different datasets.
First, load all the code available in the following Google Colab (click on the link):
https://colab.research.google.com/drive/1ANZ6eRdPCr7gCGmg4tkgLyFwG_pwLymJ?usp=sharing
Now we can start employing the K-means clustering algorithm. To identify the optimal number of clusters, we use the Elbow Method: when the slope of the tangent line to the WCSS curve becomes almost horizontal, that point indicates the optimal number of clusters [2].
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial import ConvexHull
import folium
# 1. Clustering the data with K-Means, one of the unsupervised clustering methods
# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X_train)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Now, to identify which number of clusters corresponds to the elbow, there is a helper that uses the WCSS values to find it: KneeLocator from the kneed library [3].
!pip install kneed
from kneed import KneeLocator
kl = KneeLocator(range(1, 11), wcss, curve="convex", direction="decreasing")
kl.elbow
2
Since, according to the Elbow Method, the optimal number of clusters is 2, we use this value in the K-means method to make predictions on the test dataset, as in the following code.
# 1.2 Training the K-Means model according to the elbow method or business-logic groups
kmeans = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
yhat = kmeans.fit_predict(X_test)
y_test, yhat
(array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]),
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int32))
This results in the following accuracy for the method.
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
Accuracy: 1.000
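Note that cluster indices are arbitrary: K-means could just as well have called the first group 1 and the second 0, in which case the raw accuracy would be 0 even for a perfect clustering. Here the labels happened to align. A hedged sketch of a label-permutation-safe accuracy (a common workaround, not part of the original Colab):
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
def clustering_accuracy(y_true, y_pred):
    # Find the cluster-to-class mapping that maximizes agreement
    cm = confusion_matrix(y_true, y_pred)
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()
print('Accuracy (label-permutation safe): %.3f' % clustering_accuracy(y_test, yhat))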
Finally, to visualize the clustered data and the centroids generated by K-means, the next code will be helpful [4].
import pandas as pd
import matplotlib.pyplot as plt
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
# Plot the data
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
#Getting the Centroids
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()
The Gaussian Mixture code described in Track 06 - Section 2.5 could be adapted and employed in the problem previously described.
from sklearn.mixture import GaussianMixture
import numpy as np
n_clusters = 2
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(X_test)
yhat = gmm.predict(X_test)
yhat
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0])
Let's check the corresponding accuracy of the predictions made by the method.
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
Accuracy: 1.000
We can also plot the classification made by the Gaussian Mixture and the corresponding centers found.
import pandas as pd
import matplotlib.pyplot as plt
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
# Plot the data
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
#Getting the centers
centers = gmm.means_
plt.scatter(centers[:,0] , centers[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()
Finally, we can print a comparison between the centroids and centers obtained from the K-means and Gaussian Mixture methods, respectively.
print('Centroids (K-means) = \n', centroids)
print('Centers (Gaussian Mixture) = \n', centers)
Centroids (K-means) = [[ -1.49176229 4.73736414] [-10.21314316 -4.34230638]]
Centers (Gaussian Mixture) = [[ -1.52811538 4.70909836] [-10.22772304 -4.37242357]]
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1teKf-587axtSMesQJaEGmtl3vvPfJML6?usp=sharing
About clustering algorithms and their differences with regression and classification
https://medium.com/@arunp77/machine-learning-an-introductory-tutorial-for-beginners-1957475e6c0
10 Clustering algorithms
https://machinelearningmastery.com/clustering-algorithms-with-python/
Clustering with Gaussian Mixture Models (comparison with K-means):
https://towardsdev.com/clustering-with-gaussian-mixture-models-c2c3ecdc6640
Using K-means and georeference (motivation):
https://medium.com/codex/clustering-geographic-data-on-an-interactive-map-in-python-60a5d13d6452
Discussion of criteria and metrics to find the number of centers - using AIC and BIC in GMM:
Various clustering methods:
Comparing K-means and other clustering methods:
K-means constrained: