Clustering is a machine learning technique that involves grouping a set of data points in such a way that the data points in the same group, or cluster, are more similar to each other than to those in other clusters. The goal of clustering is to identify patterns or structures in the data that may not be apparent at first glance.
Cite: https://www.javatpoint.com/clustering-in-machine-learning
Clustering is an unsupervised learning technique, which means that there is no predetermined outcome or labelled data to guide the clustering process. Instead, the clustering algorithm examines the data and tries to identify groups of data points that share common characteristics or properties.
Clustering algorithms work by measuring the similarity between data points using some distance or similarity metric. The metric used depends on the nature of the data being clustered and the goals of the clustering analysis. For example, Euclidean distance is a common choice for continuous numerical data, Jaccard distance is used for binary or set-valued data, and cosine similarity is common for text documents represented as term vectors.
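As a quick illustration, the snippet below computes three of these metrics with SciPy (a minimal sketch; the vectors are made up for demonstration):

```python
# A minimal sketch of common distance metrics using SciPy.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))                # straight-line distance for continuous data
print(distance.cosine(a, b))                   # 1 - cosine similarity; ~0.0 here, since a and b are parallel
print(distance.jaccard([1, 0, 1], [1, 1, 0]))  # dissimilarity between binary vectors
```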
Once the similarity between data points has been measured, the clustering algorithm assigns each data point to an initial cluster, either randomly or based on some heuristic. The algorithm then iteratively reassigns data points to different clusters based on their similarity until the assignments converge, meaning that the data points in each cluster are more similar to each other than to those in other clusters.
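To make this iterate-and-reassign loop concrete, here is a minimal sketch in Python with NumPy, assuming made-up 2-D points and two heuristic initial centroids (this mirrors the k-means procedure discussed below):

```python
# A minimal sketch of the assign-and-update loop (k-means style),
# using made-up 2-D points and two arbitrary initial centroids.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # heuristic initial guesses

for _ in range(10):  # iterate until the centroids stop moving (here: fixed cap)
    # Assignment step: each point goes to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # converged
    centroids = new_centroids

print(labels)     # e.g. [0 0 1 1]
print(centroids)  # final cluster centers
```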
Clustering has many applications, including data analysis, customer segmentation, anomaly detection, image segmentation, and recommendation systems. Clustering can also be used as a preprocessing step for other machine learning algorithms such as classification or regression.
There are several different types of clustering algorithms, each with its own strengths and weaknesses. Some of the most common types of clustering are:
Partitional clustering (k-means clustering): This is an algorithm that divides the data into a predetermined number of clusters (k) based on the similarity of data points. The algorithm works by iteratively updating the cluster centroids and assigning data points to the nearest centroid.
Hierarchical clustering: This is a clustering algorithm that creates a hierarchy of clusters based on the similarity of data points. Hierarchical clustering can be agglomerative, starting with each data point as its own cluster and then merging them together, or divisive, starting with all the data points in one cluster and then recursively dividing them.
Density-based clustering: This is a clustering algorithm that identifies clusters based on regions of high density in the data. Data points in regions of high density are assigned to the same cluster, while data points in regions of low density are considered outliers or noise (see the sketch after this list).
Model-based clustering: This is a clustering algorithm that assumes that the data is generated from a probabilistic model, such as a mixture of Gaussian distributions. The algorithm estimates the parameters of the model and assigns data points to the most likely cluster based on their probability under the model (also demonstrated in the sketch after this list).
Fuzzy clustering: This is a clustering algorithm that allows data points to belong to multiple clusters with different degrees of membership. The degree of membership is based on the similarity of the data point to the cluster centroid.
Subspace clustering: This is a clustering algorithm that identifies clusters in subspaces of the data, rather than the entire dataset. This can be useful for high-dimensional data where the underlying structure may only be apparent in certain dimensions.
Each type of clustering has its own strengths and weaknesses, and the choice of clustering algorithm depends on the nature of the data and the goals of the analysis.
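As a concrete illustration of the density-based and model-based approaches above, here is a minimal sketch using scikit-learn on synthetic blob data (the eps, min_samples, and n_components values are illustrative assumptions, not tuned settings):

```python
# A minimal sketch of density-based (DBSCAN) and model-based (Gaussian mixture)
# clustering with scikit-learn; the blob data is synthetic, and the parameter
# values are illustrative, not tuned.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Density-based: points in dense regions are grouped; sparse points get label -1 (noise).
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Model-based: fit a mixture of 3 Gaussians and assign each point to its most likely component.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print(set(db_labels))   # cluster labels, possibly including -1 for noise
print(set(gmm_labels))  # component indices 0..2
```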
Cite: https://medium.com/@chyun55555/how-to-find-the-optimal-number-of-clusters-with-r-dbf84988388b
Figure 1. Working visualization of the k-means clustering algorithm. Image credit: GIF via Wikimedia Commons.
Partitional and hierarchical clustering are the two main approaches for clustering data.
Partitional clustering: Partitional clustering is a clustering method in which data points are divided into non-overlapping groups or clusters. The most popular partitional clustering algorithm is k-means clustering. In k-means clustering, the data points are partitioned into k clusters, where k is a predetermined number of clusters. The algorithm works by iteratively updating the cluster centroids and assigning each data point to the nearest centroid. The distance metric used in k-means clustering is typically Euclidean distance.
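A minimal k-means sketch with scikit-learn, assuming synthetic blob data and k chosen to match the number of generated blobs (in practice, k is often picked with the elbow method or silhouette scores):

```python
# A minimal sketch of k-means with scikit-learn; the data is synthetic and
# k=4 is chosen to match the number of generated blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)       # final centroids
print(kmeans.labels_[:10])           # cluster assignment of the first 10 points
print(kmeans.predict([[0.0, 0.0]]))  # nearest-centroid assignment for a new point
```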
Hierarchical clustering: Hierarchical clustering is a clustering method that creates a hierarchy of clusters. There are two main types of hierarchical clustering: agglomerative and divisive. In agglomerative hierarchical clustering, each data point starts as a separate cluster, and the algorithm iteratively merges the most similar clusters until all the data points belong to a single cluster. In divisive hierarchical clustering, all the data points start in a single cluster, and the algorithm recursively divides the cluster until each data point belongs to its own cluster. The distance metric used in hierarchical clustering can vary, but common choices include Euclidean distance, Manhattan distance, and cosine similarity.
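And a minimal agglomerative sketch using SciPy, again on synthetic data, with Ward linkage as one common choice among single, complete, and average linkage:

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy;
# the data is synthetic and Ward linkage (Euclidean distance) is one
# common choice among several.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                    # build the merge hierarchy bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into (at most) 3 flat clusters

print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy if matplotlib is available.
```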
By clustering this NYC Taxi dataset, one can answer a few questions:
Identify the most popular pickup and drop-off points (sketched below).
Predict the total amount of a trip by clustering on this dataset.
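As an illustration of the first question, here is a minimal sketch that clusters pickup coordinates with k-means; the file name, the pickup_latitude/pickup_longitude column names, and k=10 are assumptions about the dataset, not details confirmed above:

```python
# A minimal sketch of finding popular pickup points by clustering coordinates;
# the file name and the pickup_latitude/pickup_longitude column names are
# assumptions about the dataset's schema.
import pandas as pd
from sklearn.cluster import KMeans

trips = pd.read_csv("nyc_taxi.csv")  # hypothetical file name

coords = trips[["pickup_latitude", "pickup_longitude"]].dropna()

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(coords)
trips.loc[coords.index, "pickup_cluster"] = kmeans.labels_

# Rank clusters by trip count: the largest clusters mark the most popular pickup areas.
print(trips["pickup_cluster"].value_counts().head())
print(kmeans.cluster_centers_)  # approximate centers of the busiest zones
```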