GOAL: learn about clustering algorithms such as K-means, mean shift, DBSCAN, and agglomerative clustering, and implement them in scikit-learn.
Learning experience: In this report, we explored how to leverage clustering techniques to tackle unsupervised problems. Through hands-on experience, we realized that selecting the appropriate clustering method is crucial for effectively analyzing data. Different datasets possess distinct features and structures, necessitating the selection of the most suitable clustering algorithm based on the inherent nature of the data.
Initially, we attempted to apply traditional clustering algorithms to the Olivetti faces dataset and evaluated the clustering effects under various parameter settings. However, when subsequently attempting to apply these algorithms to the 20 Newsgroups text dataset, we encountered several challenges and limitations. This lesson underscored the importance of considering the data's characteristics when choosing the appropriate unsupervised learning method.
For high-dimensional, sparse text data, we need to explore more advanced and specialized unsupervised techniques, such as topic modeling and latent semantic analysis. These methods can better capture the underlying semantics and topic structures within text data, thereby providing more valuable insights and perspectives.
Working environment: this work's GitHub
OS: Windows 11 Home
CPU: Intel i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
19.0 Introduction
In much of this book we have looked at supervised machine learning—where we have access to both the features and the target. This is, unfortunately, not always the case. Frequently, we run into situations where we only know the features. For example, imagine we have records of sales from a grocery store and we want to break up sales by whether the shopper is a member of a discount club. This would be impossible using supervised learning because we don’t have a target to train and evaluate our models. However, there is another option: unsupervised learning. If the behavior of discount club members and nonmembers in the grocery store is actually disparate, then the average difference in behavior between two members will be smaller than the average difference in behavior between a member and nonmember shopper. Put another way, there will be two clusters of observations.
The goal of clustering algorithms is to identify those latent groupings of observations, which, if done well, allows us to predict the class of observations even without a target vector. There are many clustering algorithms, and they have a wide variety of approaches to identifying the clusters in data. In this chapter, we will cover a selection of clustering algorithms using scikit-learn and how to use them in practice.
For example, in the graph given below we can clearly see three circular clusters forming on the basis of distance.
It is not necessary that the clusters formed be circular in shape; the shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters. [2]
Typical cluster models include:
Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
Signed graph models: Every path in a signed graph has a sign from the product of the signs on the edges. Under the assumptions of balance theory, edges may change sign and result in a bifurcated graph. The weaker "clusterability axiom" (no cycle has exactly one negative edge) yields results with more than two clusters, or subgraphs with only positive edges.[6]
Neural models: the most well-known unsupervised neural network is the self-organizing map. These models can usually be characterized as similar to one or more of the above models, including subspace models when neural networks implement a form of Principal Component Analysis or Independent Component Analysis.
A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:
Hard clustering: each object belongs to a cluster or not
Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (for example, a likelihood of belonging to the cluster)
There are also finer distinctions possible, for example:
Strict partitioning clustering: each object belongs to exactly one cluster
Strict partitioning clustering with outliers: objects can also belong to no cluster; in which case they are considered outliers
Overlapping clustering (also: alternative clustering, multi-view clustering): objects may belong to more than one cluster; usually involving hard clusters
Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
Subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap [3]
In this chapter we introduce four algorithms: k-means, mean shift, DBSCAN, and agglomerative (hierarchical) clustering.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e., the variance). Formally, the objective is to find
arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖²,
where μ_i is the mean (also called the centroid) of the points in S_i, i.e.,
μ_i = (1 / |S_i|) Σ_{x ∈ S_i} x,
|S_i| is the size of S_i, and ‖·‖ is the usual L2 norm. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:
arg min_S Σ_{i=1}^{k} (1 / |S_i|) Σ_{x, y ∈ S_i} ‖x − y‖².
The equivalence can be deduced from the identity |S_i| Σ_{x ∈ S_i} ‖x − μ_i‖² = (1/2) Σ_{x, y ∈ S_i} ‖x − y‖². Since the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters (the between-cluster sum of squares, BCSS).[1] This deterministic relationship is also related to the law of total variance in probability theory.
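To make this identity concrete, the short NumPy check below (our own illustration, not part of the original text) verifies it numerically on an arbitrary random cluster.

import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 3))   # one cluster of 20 points in 3-D
mu = S.mean(axis=0)            # its centroid

# left-hand side: |S| times the sum of squared distances to the centroid
lhs = len(S) * np.sum(np.linalg.norm(S - mu, axis=1) ** 2)

# right-hand side: half the sum of squared distances over all ordered pairs (x, y)
diffs = S[:, None, :] - S[None, :, :]
rhs = 0.5 * np.sum(np.linalg.norm(diffs, axis=2) ** 2)

print(np.isclose(lhs, rhs))    # True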
We use code to show how the algorithm works. The code below outlines the standard k-means clustering algorithm. Initialization of the centroids, the distance metric between points and centroids, and the calculation of new centroids are design choices and will vary across implementations. In this example, the argmin step (finding the index of the minimum distance) is implemented with list.index and min.
import math
import random

def distance(point, centroid):
    # Euclidean distance between two equal-length sequences of coordinates
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))

def calculate_centroid(cluster):
    # The standard implementation uses the coordinate-wise mean of all
    # points in a cluster as the new centroid
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def k_means_cluster(k, points, max_iter=300):
    # Initialization: choose k centroids (here the Forgy method: k random points)
    centroids = [list(p) for p in random.sample(points, k)]
    clusters = [[] for _ in range(k)]
    converged = False
    iteration = 0
    while not converged and iteration < max_iter:
        # Clear previous clusters
        clusters = [[] for _ in range(k)]
        # Assign each point to the "closest" centroid
        for point in points:
            distances_to_each_centroid = [distance(point, centroid)
                                          for centroid in centroids]
            # argmin: index of the smallest distance
            cluster_assignment = distances_to_each_centroid.index(
                min(distances_to_each_centroid))
            clusters[cluster_assignment].append(point)
        # Calculate new centroids (keep the old centroid if a cluster is empty)
        new_centroids = [calculate_centroid(cluster) if cluster else centroids[i]
                         for i, cluster in enumerate(clusters)]
        converged = (new_centroids == centroids)
        centroids = new_centroids
        iteration += 1
    return clusters
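A small usage example for the function above (the toy points are arbitrary):

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
          (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
clusters = k_means_cluster(2, points)
for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {cluster}")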
scikit-learn provides this algorithm as sklearn.cluster.KMeans: [5]
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd'), K-Means clustering.
There are some important parameters:
1. n_clusters : int, default=8 : The number of clusters to form as well as the number of centroids to generate. For an example of how to choose an optimal value for n_clusters, refer to "Selecting the number of clusters with silhouette analysis on KMeans clustering" in the scikit-learn documentation.
2. n_init : 'auto' or int, default='auto' : Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see "Clustering sparse data with k-means"). When n_init='auto', the number of runs depends on the value of init: 10 if using init='random' or a callable; 1 if using init='k-means++' or an array-like. (New in version 1.2: added the 'auto' option for n_init.)
3. max_iter : int, default=300 : Maximum number of iterations of the k-means algorithm for a single run.
4. algorithm : {"lloyd", "elkan"}, default="lloyd" : K-means algorithm to use. The classical EM-style algorithm is "lloyd". The "elkan" variation can be more efficient on some datasets with well-defined clusters by using the triangle inequality; however, it is more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).
And some commonly used methods:
fit(X[, y, sample_weight]) : Compute k-means clustering.
get_params([deep]) : Get parameters for this estimator.
predict(X[, sample_weight]) : Predict the closest cluster each sample in X belongs to.
score(X[, y, sample_weight]) : Opposite of the value of X on the K-means objective.
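Below is a minimal example of these parameters and methods on synthetic data; the blob settings are our own arbitrary choices, not taken from the report's notebook.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated Gaussian blobs as toy data
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, init="k-means++", n_init="auto",
                max_iter=300, algorithm="lloyd", random_state=0)
kmeans.fit(X)                            # compute k-means clustering

print(kmeans.predict(X[:5]))             # closest cluster for the first 5 samples
print(kmeans.score(X))                   # opposite of the k-means objective (negative inertia)
print(kmeans.get_params()["n_clusters"]) # 3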
Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm.[1] Application domains include cluster analysis in computer vision and image processing.[2]
Mean shift is a procedure for locating the maxima (the modes) of a density function given discrete data sampled from that function.[1] It is an iterative method, and we start with an initial estimate x. Let a kernel function K(x_i − x) be given. This function determines the weight of nearby points for re-estimation of the mean. Typically a Gaussian kernel on the distance to the current estimate is used, K(x_i − x) = exp(−c‖x_i − x‖²). The weighted mean of the density in the window determined by K is
m(x) = Σ_{x_i ∈ N(x)} K(x_i − x) x_i / Σ_{x_i ∈ N(x)} K(x_i − x),
where N(x) is the neighborhood of x, the set of points for which K(x_i − x) ≠ 0.
The difference 𝑚(𝑥)−𝑥 is called mean shift in Fukunaga and Hostetler.[3] The mean-shift algorithm now sets 𝑥←𝑚(𝑥), and repeats the estimation until 𝑚(𝑥) converges.
Although the mean shift algorithm has been widely used in many applications, a rigid proof for the convergence of the algorithm using a general kernel in a high dimensional space is still not known.[5] Aliyari Ghassabeh showed the convergence of the mean shift algorithm in one dimension with a differentiable, convex, and strictly decreasing profile function.[6] However, the one-dimensional case has limited real world applications. Also, the convergence of the algorithm in higher dimensions with a finite number of the stationary (or isolated) points has been proved.[5][7] However, sufficient conditions for a general kernel function to have finite stationary (or isolated) points have not been provided.
Gaussian Mean-Shift is an Expectation–maximization algorithm.[8]
Let data be a finite set S embedded in the n-dimensional Euclidean space X. Let K be a flat kernel that is the characteristic function of the λ-ball in X,
K(x) = 1 if ‖x‖ ≤ λ, and 0 otherwise.
In each iteration of the algorithm, s ← m(s) is performed for all s ∈ S simultaneously. The first question, then, is how to estimate the density function given a sparse set of samples. One of the simplest approaches is to just smooth the data, e.g., by convolving it with a fixed kernel of width h,
f(x) = Σ_i K(x − x_i) = Σ_i k(‖x − x_i‖² / h²),
where x_i are the input samples and k(r) is the kernel function (or Parzen window). h is the only parameter in the algorithm and is called the bandwidth. This approach is known as kernel density estimation or the Parzen window technique. Once we have computed f(x) from the equation above, we can find its local maxima using gradient ascent or some other optimization technique. The problem with this "brute force" approach is that, for higher dimensions, it becomes computationally prohibitive to evaluate f(x) over the complete search space. Instead, mean shift uses a variant of what is known in the optimization literature as multiple restart gradient descent. Starting at some guess for a local maximum, y_k, which can be a random input data point x_1, mean shift computes the gradient of the density estimate f(x) at y_k and takes an uphill step in that direction.[9]
When we use mean shift for clustering, here is how the algorithm is applied.
Consider a set of points in two-dimensional space. Assume a circular window centered at 𝐶 and having radius 𝑟 as the kernel. Mean-shift is a hill climbing algorithm which involves shifting this kernel iteratively to a higher density region until convergence. Every shift is defined by a mean shift vector. The mean shift vector always points toward the direction of the maximum increase in the density. At every iteration the kernel is shifted to the centroid or the mean of the points within it. The method of calculating this mean depends on the choice of the kernel. In this case if a Gaussian kernel is chosen instead of a flat kernel, then every point will first be assigned a weight which will decay exponentially as the distance from the kernel's center increases. At convergence, there will be no direction at which a shift can accommodate more points inside the kernel.
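To make the update rule concrete, here is a small NumPy sketch (our own illustration) of the iteration x ← m(x) with a Gaussian kernel; the bandwidth and the toy data are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
# two blobs of points in 2-D, one near (0, 0) and one near (4, 4)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])

def mean_shift_mode(x, X, bandwidth=1.0, tol=1e-5, max_iter=100):
    # Follow the mean-shift update x <- m(x) until it converges to a mode
    for _ in range(max_iter):
        # Gaussian kernel weights K(x_i - x) = exp(-c * ||x_i - x||^2)
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        m = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted mean m(x)
        if np.linalg.norm(m - x) < tol:              # converged
            return m
        x = m
    return x

print(mean_shift_mode(X[0].copy(), X))    # converges to a mode near (0, 0)
print(mean_shift_mode(X[-1].copy(), X))   # converges to a mode near (4, 4)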
scikit-learn provides this algorithm as sklearn.cluster.MeanShift: [7]
class sklearn.cluster.MeanShift(*, bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=None, max_iter=300) Mean shift clustering using a flat kernel. Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. Seeding is performed using a binning technique for scalability.
There are some important parameters :
bandwidth : float, default=None : Bandwidth used in the flat kernel. If not given, the bandwidth is estimated using sklearn.cluster.estimate_bandwidth; see the documentation for that function for hints on scalability.
seeds : array-like of shape (n_samples, n_features), default=None : Seeds used to initialize kernels. If not set, the seeds are calculated by clustering.get_bin_seeds with bandwidth as the grid size and default values for the other parameters.
min_bin_freq : int, default=1 : To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds.
max_iter : int, default=300 : Maximum number of iterations per seed point before the clustering operation terminates for that seed point, if it has not converged yet.
And some commonly used methods:
fit(X[, y]) : Perform clustering.
fit_predict(X[, y]) : Perform clustering on X and return the cluster labels.
predict(X) : Predict the closest cluster each sample in X belongs to.
get_params([deep]) : Get parameters for this estimator.
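A hedged usage sketch combining MeanShift with sklearn.cluster.estimate_bandwidth; the toy data and the quantile value are our own arbitrary choices.

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# estimate a bandwidth from the data; quantile controls the neighborhood size
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=250, random_state=0)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

print("estimated bandwidth:", bandwidth)
print("number of clusters found:", len(ms.cluster_centers_))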
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.[1] It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away). DBSCAN is one of the most common, and most commonly cited, clustering algorithms.[2]
The DBSCAN algorithm can be abstracted into the following steps:[4]
Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
Find the connected components of core points on the neighbor graph, ignoring all non-core points.
Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.
A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
scikit-learn provides this algorithm as sklearn.cluster.DBSCAN: [9]
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None) Perform DBSCAN clustering from vector array or distance matrix. DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density. The worst case memory complexity of DBSCAN is O(n^2), which can occur when the eps param is large and min_samples is low.
There are some important parameters :
eps : float, default=0.5 : The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samples : int, default=5 : The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself. If min_samples is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
leaf_size : int, default=30 : Leaf size passed to BallTree or cKDTree. This can affect the speed of construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
And some commonly used methods:
fit(X[, y, sample_weight]) : Perform DBSCAN clustering from features, or distance matrix.
get_params([deep]) : Get parameters for this estimator.
set_params(**params) : Set the parameters of this estimator.
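A hedged sketch showing DBSCAN finding two non-convex clusters and labeling noise points as -1; the make_moons data and the eps/min_samples values are our own arbitrary choices.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-circles plus a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
labels = db.fit_predict(X)

print("cluster ids found (-1 marks noise):", np.unique(labels))
print("number of core samples:", len(db.core_sample_indices_))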
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:
Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering[1] are usually presented in a dendrogram.
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. On the other hand, except for the special case of single-linkage distance, none of the algorithms (except exhaustive search in 𝑂(2^𝑛)) can be guaranteed to find the optimum solution.
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate distance d, such as the Euclidean distance, between single observations of the data set, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The choice of metric as well as linkage can have a major impact on the result of the clustering, where the lower level metric determines which objects are most similar, whereas the linkage criterion influences the shape of the clusters. For example, complete-linkage tends to produce more spherical clusters than single-linkage.
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
A commonly used linkage criterion between two sets of observations A and B, given a distance d, is maximum or complete-linkage clustering: max{ d(a, b) : a ∈ A, b ∈ B }. Other criteria are listed below.
The basic operation of hierarchical clustering can be divided into two types: agglomerative hierarchical clustering and divisive hierarchical clustering.
Agglomerative Hierarchical Clustering
This is a bottom-up approach:
Initialization: Treat each data point as an individual cluster, resulting in n clusters.
Calculate Distance: Compute the distance or similarity between all clusters, typically using Euclidean distance.
Merge Clusters: Find the two clusters that are closest to each other and merge them into a new cluster.
Update Distance Matrix: Recalculate the distances between the new cluster and all other clusters.
Repeat Steps 3 and 4: Continue this process until all data points are merged into a single cluster, or the desired number of clusters is reached.
Divisive Hierarchical Clustering
This is a top-down approach:
Initialization: Treat all data points as a single cluster.
Split Cluster: Choose a cluster to split into two sub-clusters, often using a method like K-means.
Calculate Split Point: Determine the best split point based on the splitting method and distance calculation.
Repeat Steps 2 and 3: Continue splitting each cluster until every data point becomes a single cluster, or the desired number of clusters is reached.
Calculating distance or similarity is crucial in these steps. Common methods include:
Euclidean Distance
Manhattan Distance
Cosine Similarity
In agglomerative hierarchical clustering, the distance between clusters can be defined using several criteria:
Single Linkage: Distance between the closest points of two clusters.
Complete Linkage: Distance between the farthest points of two clusters.
Average Linkage: Average distance between all points in the two clusters.
Centroid Linkage: Distance between the centroids of the two clusters.
The final result can be represented using a dendrogram, which is a hierarchical tree-like diagram showing how data points are merged or split into different clusters.
The advantage of hierarchical clustering is its hierarchical structure, which makes the results easy to interpret. However, the disadvantage is its high computational complexity, especially when dealing with large datasets.
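As a sketch of how such a dendrogram can be produced in practice (our own illustration, assuming SciPy and Matplotlib are available; this is not code from the report's notebook):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# "ward" merges the pair of clusters with the minimum increase in variance;
# "single", "complete", and "average" correspond to the linkage criteria above
Z = linkage(X, method="ward")

plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode="level", p=4)   # show only the top 4 merge levels
plt.title("Ward linkage dendrogram of the Iris data")
plt.xlabel("merged observations / clusters")
plt.ylabel("linkage distance")
plt.show()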
scikit-learn provides this algorithm as sklearn.cluster.AgglomerativeClustering: [11]
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, metric='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False) Agglomerative Clustering. Recursively merges pair of clusters of sample data; uses linkage distance.
There are some important parameters :
n_clusters : int or None, default=2 : The number of clusters to find. It must be None if distance_threshold is not None.
metric : str or callable, default='euclidean' : Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix is needed as input for the fit method.
compute_full_tree : 'auto' or bool, default='auto' : Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is 'auto', which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples; otherwise, 'auto' is equivalent to False.
distance_threshold : float, default=None : The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
compute_distances : bool, default=False : Computes distances between clusters even if distance_threshold is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead.
And some commonly used methods:
fit(X[, y]) : Fit the hierarchical clustering from features, or distance matrix.
get_params([deep]) : Get parameters for this estimator.
set_params(**params) : Set the parameters of this estimator.
19.1 Clustering Using K-Means
This section will group observations into k groups.
In 1, we use Python's sklearn library to perform K-means clustering and to standardize the features of the Iris dataset.
First, the necessary libraries are loaded. The datasets module provides various standard datasets, StandardScaler is used for standardizing data, and KMeans is used for K-means clustering.
Next, the Iris dataset is loaded from the datasets module, and its feature data is extracted. The Iris dataset contains 150 samples, each with four features.
Then, the StandardScaler is used to standardize these features. Standardization transforms the features to have a mean of 0 and a standard deviation of 1, which makes comparisons between different features more reasonable.
After that, a K-means clustering object is created. Here, three cluster centers are specified (n_clusters=3), and a random seed (random_state=0) is used to ensure reproducibility of the results. The n_init parameter is set to "auto" to automatically select the number of times the initial cluster centers are chosen to increase the stability of the results.
Finally, the standardized feature data is used to train the K-means model, and the results are stored in the model variable.
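A minimal sketch of what cell In 1 likely contains, based on the description above (the exact notebook code may differ):

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create k-means object with 3 clusters
cluster = KMeans(n_clusters=3, random_state=0, n_init="auto")

# Train model
model = cluster.fit(features_std)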
K-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:
k cluster “center” points are created at random locations.
For each observation:
The distance between each observation and the k center points is calculated.
The observation is assigned to the cluster of the nearest center point.
The center points are moved to the means (i.e., centers) of their respective clusters.
Steps 2 and 3 are repeated until no observation changes in cluster membership.
At this point the algorithm is considered converged and stops.
It is important to note three things about k-means. First, k-means clustering assumes the clusters are convex shaped (e.g., a circle, a sphere). Second, all features are equally scaled. In our solution, we standardized the features to meet this assumption. Third, the groups are balanced (i.e., have roughly the same number of observations). If we suspect that we cannot meet these assumptions, we might try other clustering approaches.
In scikit-learn, k-means clustering is implemented in the KMeans class. The most important parameter is n_clusters, which sets the number of clusters k. In some situations, the nature of the data will determine the value for k (e.g., data on a school's students will have one cluster per grade), but often we don't know the number of clusters. In these cases, we will want to select k based on some criteria. For example, silhouette coefficients (see Recipe 11.9) measure the similarity within clusters compared with the similarity between clusters. Furthermore, because k-means clustering is computationally expensive, we might want to take advantage of all the cores on our computer; note, however, that recent versions of scikit-learn have removed the n_jobs parameter from KMeans and parallelize it across cores automatically.
In our solution, we cheated a little and used the iris flower data, which we know contains three classes. Therefore, we set k = 3. In 2 we can use labels_ to see the predicted classes of each observation.
In 3, if we compare this to the observations' true classes, we can see that, despite the difference in class labels (i.e., 0, 1, and 2), k-means did reasonably well.
However, as you might imagine, the performance of k-means drops considerably, even critically, if we select the wrong number of clusters.
Finally, as with other scikit-learn models, In 4 we can use the trained cluster to predict the value of new observations.
The observation is predicted to belong to the cluster whose center point is closest. In 5 we can even use cluster_centers_ to see those center points.
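Hedged sketches of the follow-up cells In 2 to In 5 described above, reusing the iris and model variables from the previous snippet; the new observation is an arbitrary example of our own.

# In 2: predicted cluster of each observation
print(model.labels_)

# In 3: compare with the true classes
print(iris.target)

# In 4: predict the cluster of a new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]
print(model.predict(new_observation))

# In 5: view the cluster centers
print(model.cluster_centers_)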
19.2 Speeding Up K-Means Clustering
This section will group observations into k groups, but k-means takes too long.
In 6, we use Python's sklearn library to perform MiniBatchKMeans clustering and to standardize the features of the Iris dataset.
First, the necessary libraries are loaded. The datasets module provides various standard datasets, StandardScaler is used for standardizing data, and MiniBatchKMeans is used for mini-batch K-means clustering.
Next, the Iris dataset is loaded from the datasets module, and its feature data is extracted. The Iris dataset contains 150 samples, each with four features.
Then, the StandardScaler is used to standardize these features. Standardization transforms the features to have a mean of 0 and a standard deviation of 1, which makes comparisons between different features more reasonable.
After that, a MiniBatchKMeans clustering object is created. Here, three cluster centers are specified (n_clusters=3), and a random seed (random_state=0) is used to ensure reproducibility of the results. The batch_size is set to 100, meaning that 100 samples are randomly selected for each mini-batch clustering iteration. The n_init parameter is set to "auto" to automatically select the number of times the initial cluster centers are chosen to increase the stability of the results.
Finally, the standardized feature data is used to train the MiniBatchKMeans model, and the results are stored in the model variable.
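A minimal sketch of what cell In 6 likely contains (the exact notebook code may differ):

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

# Load and standardize the Iris features
features = datasets.load_iris().data
features_std = StandardScaler().fit_transform(features)

# Create mini-batch k-means object
cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100,
                          n_init="auto")

# Train model
model = cluster.fit(features_std)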
Mini-batch k-means works similarly to the k-means algorithm discussed in Recipe 19.1. Without going into too much detail, the difference is that in mini-batch k-means the most computationally costly step is conducted on only a random sample of observations as opposed to all observations. This approach can significantly reduce the time required for the algorithm to find convergence (i.e., fit the data) with only a small cost in quality.
MiniBatchKMeans works similarly to KMeans, with one significant difference: the batch_size parameter. batch_size controls the number of randomly selected observations in each batch. The larger the size of the batch, the more computationally costly the training process.
19.3 Clustering Using Mean Shift
This section will group observations without assuming the number of clusters or their shape.
In 7, we use Python's sklearn library to perform MeanShift clustering and to standardize the features of the Iris dataset.
First, the necessary libraries are loaded. The datasets module provides various standard datasets, StandardScaler is used for standardizing data, and MeanShift is used for MeanShift clustering.
Next, the Iris dataset is loaded from the datasets module, and its feature data is extracted. The Iris dataset contains 150 samples, each with four features.
Then, the StandardScaler is used to standardize these features. Standardization transforms the features to have a mean of 0 and a standard deviation of 1, which makes comparisons between different features more reasonable.
After that, a MeanShift clustering object is created. Here, n_jobs=-1 is used, which allows MeanShift to use all available CPU cores for parallel processing to speed up the computation.
Finally, the standardized feature data is used to train the MeanShift model, and the results are stored in the model variable.
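A minimal sketch of what cell In 7 likely contains (the exact notebook code may differ):

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift

# Load and standardize the Iris features
features = datasets.load_iris().data
features_std = StandardScaler().fit_transform(features)

# Create mean shift object using all CPU cores
cluster = MeanShift(n_jobs=-1)

# Train model
model = cluster.fit(features_std)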
But in 8 we can see that the predicted labels do not show a good result.
One of the disadvantages of k-means clustering we discussed previously is that we needed to set the number of clusters, k, prior to training, and the method made assumptions about the shape of the clusters. One clustering algorithm without these limitations is mean shift.
Mean shift is a simple concept, but it’s somewhat difficult to explain. Therefore, an analogy might be the best approach. Imagine a very foggy football field (i.e., a two-dimensional feature space) with 100 people standing on it (i.e., our observations). Because it is foggy, a person can see only a short distance. Every minute each person looks around and takes a step in the direction of the most people they can see. As time goes on, people start to group together as they repeatedly take steps toward larger and larger crowds. The end result is clusters of people around the field. People are assigned to the clusters in which they end up.
scikit-learn’s actual implementation of mean shift, MeanShift, is more complex but follows the same basic logic. MeanShift has two important parameters we should be aware of. First, bandwidth sets the radius of the area (i.e., kernel) an observation uses to determine the direction to shift. In our analogy, bandwidth is how far a person can see through the fog. We can set this parameter manually, but by default a reasonable bandwidth is estimated automatically (with a significant increase in computational cost). Second, sometimes in mean shift there are no other observations within an observation’s kernel. That is, a person on our football field cannot see a single other person. By default, MeanShift assigns all these “orphan” observations to the kernel of the nearest observation. However, if we want to leave out these orphans, we can set cluster_all=False, wherein orphan observations are given the label of -1.
19.4 Clustering Using DBSCAN
This section will group observations into clusters of high density.
In 9, we use Python's sklearn library to perform DBSCAN clustering and to standardize the features of the Iris dataset.
First, the necessary libraries are loaded. The datasets module provides various standard datasets, StandardScaler is used for standardizing data, and DBSCAN is used for density-based clustering analysis.
Next, the Iris dataset is loaded from the datasets module, and its feature data is extracted. The Iris dataset contains 150 samples, each with four features.
Then, the StandardScaler is used to standardize these features. Standardization transforms the features to have a mean of 0 and a standard deviation of 1, which makes comparisons between different features more reasonable.
After that, a DBSCAN clustering object is created. Here, n_jobs=-1 is used, which allows DBSCAN to use all available CPU cores for parallel processing to speed up the computation.
Finally, the standardized feature data is used to train the DBSCAN model, and the results are stored in the model variable.
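A minimal sketch of what cell In 9 likely contains (the exact notebook code may differ):

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load and standardize the Iris features
features = datasets.load_iris().data
features_std = StandardScaler().fit_transform(features)

# Create DBSCAN object using all CPU cores
cluster = DBSCAN(n_jobs=-1)

# Train model
model = cluster.fit(features_std)

# Cluster labels; -1 marks outliers
print(model.labels_)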
DBSCAN is motivated by the idea that clusters will be areas where many observations are densely packed together and makes no assumptions of cluster shape. Specifically, in DBSCAN:
A random observation, xi, is chosen.
If xi has a minimum number of close neighbors, we consider it to be part of a cluster.
Step 2 is repeated recursively for all of xi’s neighbors, then neighbor’s neighbor, and so on. These are the cluster’s core observations.
Once step 3 runs out of nearby observations, a new random point is chosen (i.e., restart at step 1).
Once this is complete, we have a set of core observations for a number of clusters. Finally, any observation close to a cluster but not a core sample is considered part of a cluster, while any observation not close to the cluster is labeled an outlier.
DBSCAN has three main parameters to set:
eps : The maximum distance from an observation for another observation to be considered its neighbor.
min_samples : The minimum number of observations less than eps distance from an observation for it to be considered a core observation.
metric : The distance metric used by eps—for example, minkowski or euclidean (note that if Minkowski distance is used, the parameter p can be used to set the power of the Minkowski metric).
In 10, if we look at the clusters in our training data, we can see that two clusters have been identified, 0 and 1, while outlier observations are labeled -1.
19.5 Clustering Using Hierarchical Merging
This section will group observations using a hierarchy of clusters.
In 11, we use Python's sklearn library to perform agglomerative hierarchical clustering and to standardize the features of the Iris dataset.
First, the necessary libraries are loaded. The datasets module provides various standard datasets, StandardScaler is used for standardizing data, and AgglomerativeClustering is used for performing agglomerative hierarchical clustering.
Next, the Iris dataset is loaded from the datasets module, and its feature data is extracted. The Iris dataset contains 150 samples, each with four features.
Then, the StandardScaler is used to standardize these features. Standardization transforms the features to have a mean of 0 and a standard deviation of 1, which makes comparisons between different features more reasonable.
After that, an agglomerative clustering object is created. Here, three clusters are specified (n_clusters=3), which means the data will eventually be grouped into three clusters.
Finally, the standardized feature data is used to train the agglomerative clustering model, and the results are stored in the model variable.
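A minimal sketch of what cell In 11 likely contains (the exact notebook code may differ):

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Load and standardize the Iris features
features = datasets.load_iris().data
features_std = StandardScaler().fit_transform(features)

# Create agglomerative clustering object with three clusters
cluster = AgglomerativeClustering(n_clusters=3)

# Train model
model = cluster.fit(features_std)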
Agglomerative clustering is a powerful, flexible hierarchical clustering algorithm. In agglomerative clustering, all observations start as their own clusters. Next, clusters meeting some criteria are merged. This process is repeated, growing clusters until some end point is reached. In scikit-learn, AgglomerativeClustering uses the linkage parameter to determine the merging strategy to minimize:
Variance of merged clusters (ward)
Average distance between observations from pairs of clusters (average)
Maximum distance between observations from pairs of clusters (complete)
Two other parameters are useful to know. First, the affinity parameter (renamed metric in recent versions of scikit-learn) determines the distance metric used for linkage (minkowski, euclidean, etc.). Second, n_clusters sets the number of clusters the clustering algorithm will attempt to find. That is, clusters are successively merged until only n_clusters remain.
In 12, as with the other clustering algorithms we have covered, we can use labels_ to see the cluster to which every observation is assigned.
First, we introduce the Olivetti faces dataset.
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. The sklearn.datasets.fetch_olivetti_faces function is the data fetching / caching function that downloads the data archive from AT&T.
As described on the original website:
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
Data Set Characteristics:
Classes : 40
Samples total : 400
Dimensionality : 4096
Features : real, between 0 and 1
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms.
The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised perspective.
The original dataset consisted of 92 × 112 images, while the version available here consists of 64 × 64 images.
When using these images, please give credit to AT&T Laboratories Cambridge. [12]
The figure below shows the faces of the 40 people in the dataset.
Next we plot the first 10 images for the first two persons (face id=0 and face id=1) from the Olivetti Faces dataset.
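A hedged sketch of how such a plot can be produced with fetch_olivetti_faces (our own illustration; the notebook's plotting code may differ). It relies on the loader returning the images in order, ten consecutive images per subject.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()            # 400 images of 40 subjects, 64 x 64 pixels

fig, axes = plt.subplots(2, 10, figsize=(12, 3))
for person in (0, 1):                     # face id 0 and face id 1
    for i in range(10):                   # their first 10 images
        ax = axes[person, i]
        ax.imshow(faces.images[person * 10 + i], cmap="gray")
        ax.axis("off")
plt.show()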
In 14, we use Python's sklearn library to perform K-means clustering and to standardize the features of the Olivetti Faces dataset. We then split the dataset into training and test sets and calculate the accuracy of the clustering results against the true labels.
First, necessary libraries are imported, including sklearn datasets, preprocessing, and clustering tools. Additionally, matplotlib is used for visualization.
Next, the Olivetti Faces dataset is loaded from sklearn's datasets, and its features and labels are extracted. The dataset is then split into training and test sets with a ratio of 70:30.
StandardScaler is used to standardize the feature data so that each feature has a mean of 0 and a standard deviation of 1. This step is performed separately for the training and test sets.
A K-means clustering object is created with the number of clusters set to 40, and the model is trained using the standardized training set.
The trained model is then used to predict the clusters for both the training and test sets.
A function is defined to map cluster labels to true labels. The mode function is used to determine the most frequent true label in each cluster.
The accuracy for the training and test sets is calculated and printed.
Finally, the cluster centers are visualized. These images represent the average face for each cluster.
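A hedged sketch of the pipeline described above, including the mode-based mapping from cluster labels to true labels (function and variable names are our own; the notebook's code may differ). Mapping each cluster to the most frequent training label lets us score the unsupervised model with ordinary accuracy.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_olivetti_faces
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

faces = fetch_olivetti_faces()
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

kmeans = KMeans(n_clusters=40, random_state=0, n_init="auto").fit(X_train_std)
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(X_test_std)

# Map each cluster id to the most frequent true label among its training members
mapping = {c: np.bincount(y_train[train_clusters == c]).argmax()
           for c in np.unique(train_clusters)}

train_pred = np.array([mapping[c] for c in train_clusters])
test_pred = np.array([mapping.get(c, -1) for c in test_clusters])

print("train accuracy:", accuracy_score(y_train, train_pred))
print("test accuracy:", accuracy_score(y_test, test_pred))

# Visualize the cluster centers as "average faces" (undo the standardization first)
fig, axes = plt.subplots(5, 8, figsize=(10, 7))
for center, ax in zip(kmeans.cluster_centers_, axes.ravel()):
    face = scaler.inverse_transform(center.reshape(1, -1)).reshape(64, 64)
    ax.imshow(face, cmap="gray")
    ax.axis("off")
plt.show()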
Because we have the target values, we can calculate the accuracy: the training accuracy is 0.62 and the testing accuracy is 0.62.
The accuracy is computed as accuracy(y, ŷ) = (1 / n_samples) Σ_{i=1}^{n_samples} 1(ŷ_i = y_i), where 1(x) is the indicator function. [7]
Below, we tried different numbers of clusters to get better accuracy; after experimenting, the best value was n_clusters = 185, with training accuracy = 1.0 and test accuracy = 0.97. Why, when there are 40 different people in the dataset and the expected number of clusters k is 40, did we get 185? This may be because even the same person looks very different across the images in the dataset.
Next we changed the model to mean shift and tested different bandwidths; the table below (P7 different bandwidth) shows the training and test accuracy. Why did we vary the bandwidth from 2 all the way to 50000? First, mean shift takes a long time to train and makes the CPU run hot. We tested 2, 5, and 0.1 in sequence but obtained the same training and test accuracy each time, which is why we jumped to much larger values.
Then we changed the model to DBSCAN; the table below (P7 DBSCAN) shows the experiments we tried. With eps=0.1 and min_samples=1 we obtained training accuracy = 1.0 and testing accuracy = 1.0.
Finally, we tried hierarchical merging; the table below (P7 AgglomerativeClustering) shows the parameters we tried. We get 100% test accuracy when we set n_clusters=100, which seems good.
Here we compare all the algorithms we have learned; it looks like unsupervised methods fit this dataset well!
The 20 newsgroups dataset comprises around 18000 newsgroup posts on 20 topics split into two subsets: one for training (or development) and the other for testing (or for performance evaluation). The split between the train and test sets is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors such as CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor. [13]
Data Set Characteristics:
Classes : 20
Samples total : 18846
Dimensionality : 1
Features : text
Data Considerations
The Cleveland Indians is a major league baseball team based in Cleveland, Ohio, USA. In December 2020, it was reported that “After several months of discussion sparked by the death of George Floyd and a national reckoning over race and colonialism, the Cleveland Indians have decided to change their name.” Team owner Paul Dolan “did make it clear that the team will not make its informal nickname – the Tribe – its new team name.” “It’s not going to be a half-step away from the Indians,” Dolan said.”We will not have a Native American-themed name.”
We follow the tips from [14] for our implementation and use the following metrics:
homogeneity, which quantifies how much clusters contain only members of a single class;
completeness, which quantifies how much members of a given class are assigned to the same clusters;
V-measure, the harmonic mean of completeness and homogeneity;
The V-measure is the harmonic mean between homogeneity and completeness:
v = (1 + beta) * homogeneity * completeness
/ (beta * homogeneity + completeness)
These metrics quantify the quality of the clustering results.
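All three are available in sklearn.metrics; a small example with made-up label vectors:

from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print("Homogeneity:  %.3f" % homogeneity_score(labels_true, labels_pred))
print("Completeness: %.3f" % completeness_score(labels_true, labels_pred))
print("V-measure:    %.3f" % v_measure_score(labels_true, labels_pred))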
In 137, we perform unsupervised learning on the 20 newsgroups dataset using the KMeans clustering method and evaluate the quality of the clustering results.
First, the 20 newsgroups dataset, which is a text dataset for classification tasks, is loaded from sklearn.datasets. Then, TfidfVectorizer is used to convert the text data into TF-IDF feature vectors. We set max_df to 0.5 and min_df to 5, meaning we ignore words that appear in more than 50% of the documents and those that appear fewer than 5 times, excluding stop words like "the" and "and".
After the transformation, a function evaluate_clustering is defined to evaluate the clustering results. This function calculates three metrics: homogeneity, completeness, and V-measure, which are used to measure the quality of clustering.
Next, we loop through different numbers of clusters (from 10 to 190, increasing by 10 each time) and perform KMeans clustering for each number of clusters. For each specific number of clusters, a KMeans object is created, and the model is trained to predict cluster labels for each sample.
For each clustering result, the previously defined evaluation function is used to calculate homogeneity, completeness, and V-measure, and these results are stored in the results list. The specific evaluation scores for each clustering result are also printed out.
Finally, we visualize all the results by plotting the relationship between the number of clusters and the three evaluation metrics (homogeneity, completeness, V-measure). This helps in observing the impact of different numbers of clusters on clustering performance.
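A hedged sketch of the pipeline in cell In 137 as described above; the choice of subset="train" and the removal of headers, footers, and quotes are our own assumptions, and the exact notebook code may differ.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Load the raw training texts and their true topic labels
newsgroups = fetch_20newsgroups(subset="train",
                                remove=("headers", "footers", "quotes"))
labels_true = newsgroups.target

# TF-IDF features, ignoring very frequent words, very rare words, and stop words
vectorizer = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
X = vectorizer.fit_transform(newsgroups.data)

def evaluate_clustering(labels_true, labels_pred):
    return (homogeneity_score(labels_true, labels_pred),
            completeness_score(labels_true, labels_pred),
            v_measure_score(labels_true, labels_pred))

results = []
for n_clusters in range(10, 200, 10):
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
    labels_pred = km.fit_predict(X)
    h, c, v = evaluate_clustering(labels_true, labels_pred)
    results.append((n_clusters, h, c, v))
    print(f"n_clusters={n_clusters}: homogeneity={h:.2f} "
          f"completeness={c:.2f} v-measure={v:.2f}")

# Plot the three metrics against the number of clusters
ks = [r[0] for r in results]
for idx, name in ((1, "homogeneity"), (2, "completeness"), (3, "v-measure")):
    plt.plot(ks, [r[idx] for r in results], marker="o", label=name)
plt.xlabel("n_clusters")
plt.ylabel("score")
plt.legend()
plt.show()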
The table below shows the parameters we tested. We can see that with n_clusters = 240 we obtain the highest V-measure of 0.31, and the score does not keep increasing beyond that.
Next we tried to change the model to mean shift, but we failed because this dataset is too big: we get "MemoryError: Unable to allocate 2.82 GiB for an array with shape (15670, 24164) and data type float64". We then used PCA-style dimensionality reduction to decrease the dimensionality, and below is what we got. This result does not tell us anything helpful, so we gave up on this method.
Homogeneity: 0.00
Completeness: 1.00
V-measure: 0.00
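A hedged sketch of the dimensionality-reduction attempt described above, reusing X and labels_true from the previous sketch. We assume TruncatedSVD (a PCA-like decomposition that works directly on sparse matrices) for the reduction step; the number of components is an arbitrary choice.

from sklearn.cluster import MeanShift
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Reduce the sparse TF-IDF matrix X to 100 dense dimensions
lsa = make_pipeline(TruncatedSVD(n_components=100, random_state=0),
                    Normalizer(copy=False))
X_reduced = lsa.fit_transform(X)

ms = MeanShift(n_jobs=-1)
labels_pred = ms.fit_predict(X_reduced)

print("Homogeneity: %.2f" % homogeneity_score(labels_true, labels_pred))
print("Completeness: %.2f" % completeness_score(labels_true, labels_pred))
print("V-measure: %.2f" % v_measure_score(labels_true, labels_pred))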
Then we changed the model to DBSCAN. The table below (P7 20newsgroup DBSCAN) shows the parameters we tested; we get a V-measure of 0.46, but it did not change much while tuning. We think this is because this text dataset, once vectorized, is high-dimensional and sparse and therefore not a good fit for DBSCAN.
Finally, we tried AgglomerativeClustering; it seems this method also does not work well.
Here are a few reasons why AgglomerativeClustering and MeanShift might not work well on the 20 Newsgroups dataset:
High Dimensionality: The 20 Newsgroups dataset consists of textual data, which is typically represented as high-dimensional sparse vectors (e.g., using TF-IDF or word embeddings). AgglomerativeClustering algorithms are known to suffer from the curse of dimensionality, as the distance or similarity measures become less reliable and discriminative in high-dimensional spaces.
Large Dataset Size: AgglomerativeClustering algorithms have a time complexity of O(n^2) or O(n^3), depending on the linkage method used, where n is the number of data points. This makes them computationally expensive for large datasets like the 20 Newsgroups, which contains thousands of documents. The quadratic or cubic time complexity can make the clustering process extremely slow or even infeasible for large datasets.
Sensitivity to Noise and Outliers: AgglomerativeClustering algorithms are sensitive to noise and outliers in the data, as they rely on pairwise distances or similarities to merge or split clusters. Text data often contains noise, such as misspellings, irrelevant words, or outlier documents, which can negatively impact the clustering performance.
Lack of Flat Geometry: AgglomerativeClustering works well when the clusters have a flat geometry, meaning that the data points within a cluster are roughly equidistant from each other. However, text data often exhibits a more complex, non-flat geometry, where documents within a cluster can have varying degrees of similarity or dissimilarity.
Difficulty in Determining the Number of Clusters: AgglomerativeClustering requires specifying the number of clusters in advance or using a stopping criterion based on a distance threshold. Determining the optimal number of clusters for text data can be challenging, as the true number of topics or categories may not be known a priori.
Lack of Interpretability: While AgglomerativeClustering can group similar documents together, it may not provide interpretable insights into the underlying topics or themes of the clusters, which is often desirable for text data analysis.
Instead of AgglomerativeClustering, other unsupervised learning methods like topic modeling (e.g., Latent Dirichlet Allocation, Non-negative Matrix Factorization), document clustering using algorithms like K-Means or DBSCAN, or dimensionality reduction techniques (e.g., Truncated SVD, t-SNE) are often preferred for text data analysis tasks, including the 20 Newsgroups dataset.
These methods can better handle high-dimensional sparse data, are more scalable to large datasets, and can provide more interpretable results by capturing the underlying topics or themes in the text data. However, the choice of the most suitable method ultimately depends on the specific goals and requirements of the analysis, as well as the characteristics of the dataset.
[1] Chapter 19. Clustering , Machine Learning with Python - Theory and Implementation.
[2] Clustering in Machine Learning, GeeksforGeeks, Surya Priy, 20 Mar 2024.
[3] Cluster analysis, Wikipedia.
[4] k-means clustering, Wikipedia.
[5] sklearn.cluster.KMeans, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[6] Mean shift, Wikipedia.
[7] sklearn.cluster.MeanShift, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[8] DBSCAN, Wikipedia.
[9] sklearn.cluster.DBSCAN, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[10] Hierarchical clustering, Wikipedia.
[11] sklearn.cluster.AgglomerativeClustering, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[12] The Olivetti faces dataset , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[13] The 20 newsgroups text dataset, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[14] Clustering text documents using k-means, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.