For the NYC Taxi data, one can find where the cluster density of pick-up and drop-off points is highest. To explain this variability, k-means clustering was employed.
To find the best number of clusters, the Elbow and Silhouette methods were used.
The Silhouette method is a technique used in clustering analysis to evaluate the quality of the clusters produced by a clustering algorithm. It provides a measure of how well each data point fits into its assigned cluster, and also gives an overall assessment of the clustering quality.
The Silhouette score for each data point is calculated by measuring two distances:
The average distance between the point and all other points in its own cluster (called "intra-cluster distance").
The average distance between the point and all other points in the nearest neighboring cluster (called "nearest-cluster distance").
The Silhouette score for a data point is then calculated as the difference between the nearest-cluster distance and the intra-cluster distance, divided by the maximum of the two values: s = (b - a) / max(a, b), where a is the intra-cluster distance and b is the nearest-cluster distance. The Silhouette score ranges from -1 to 1, with values close to 1 indicating that the point is well-matched to its own cluster and poorly matched to neighboring clusters, while values close to -1 indicate the opposite.
The overall Silhouette score for a clustering algorithm is then calculated as the average Silhouette score across all data points in the dataset. A higher Silhouette score indicates better clustering quality, with a score of 1 indicating that each data point is assigned to the perfect cluster and a score of -1 indicating that each data point is assigned to the wrong cluster.
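As a sketch of this calculation, the following uses scikit-learn on two well-separated synthetic blobs (standing in for pick-up coordinates; not the actual taxi dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated synthetic 2-D blobs stand in for pick-up coordinates.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-point scores s = (b - a) / max(a, b), then their mean as the overall score.
per_point = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)

print(overall)  # close to 1 for well-separated clusters
```

For these clearly separated blobs the overall score lands near 1; for overlapping clusters it would shrink toward 0.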
Fig: Silhouette plot is drawn for k = 2, 3, 4, 5, 6, and 7.
From the above Silhouette plots, it is clear that for k values ranging from 2 to 7, all clusterings pass the minimum threshold. Thus, for this dataset, the Silhouette plots alone are not decisive in choosing the optimal number of clusters.
Thus, we look at another method for determining an optimum number of clusters called the Elbow method.
The Elbow method is a graphical technique used in clustering analysis to determine the optimal number of clusters to use for a given dataset. It involves plotting the relationship between the number of clusters and the within-cluster sum of squares (WSS) or the sum of squared distances between each point and its centroid.
The within-cluster sum of squares is a measure of the total distance between each data point in a cluster and its centroid. The Elbow method involves plotting the WSS for each number of clusters, and looking for a point on the graph where the rate of decrease in WSS starts to level off, forming an elbow shape. This point represents the optimal number of clusters to use for the dataset.
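This procedure can be sketched as follows, again on synthetic data rather than the taxi dataset, using scikit-learn's `inertia_` attribute as the WSS:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs; WSS should drop sharply until k = 3, then level off.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in centers])

wss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

# WSS is non-increasing in k; the elbow is where the decrease flattens out.
print([round(w, 1) for w in wss])
```

Plotting `wss` against k and reading off the elbow is exactly the graphical step described above; here the large drop ends at k = 3, matching the three generating blobs.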
From the above implementation, one can say that the optimal number of clusters for K-Means clustering is around 5.
Upon comparing the NYU Land Usage data with the taxi data, we can find clear similarities in the distribution of pick-up and drop-off points. These similarities are very evident in the clustering results.
Fig: Left image is the NYU Land Usage with the pick-up points overlaid on the same map.
Below are the results of K-Means for k = 4, 5, and 6. The results are consistent with the Elbow and Silhouette analyses above.
The results from hierarchical clustering are also broadly consistent with the K-means clustering. Hierarchical clustering was implemented in R with the `hclust` function, using a cosine-based distance due to the high dimensionality of the data.
Hierarchical clustering is a bottom-up approach, where each data point starts in its own cluster and clusters are merged until all data points belong to a single cluster. Hierarchical clustering can be agglomerative (merging small clusters into larger ones) or divisive (splitting large clusters into smaller ones). Hierarchical clustering is useful when the data has a nested structure or when there is no prior knowledge about the number of clusters.
K-Means, by contrast, is a partitional (flat) approach: the data points are initially assigned to K clusters, and the cluster centres are then iteratively updated to minimize the sum of squared distances between each point and its nearest cluster centre. K-Means is useful when there is prior knowledge about the number of clusters or when the data has a clear separation between clusters.
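The iterative assign-then-update loop described above (Lloyd's algorithm) can be sketched in a few lines of NumPy; this is a minimal illustration, not the implementation used in the report:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20):
    """Minimal Lloyd's algorithm: assign points to centres, then move centres."""
    # Simple deterministic initialization for this demo: first and last points.
    centres = X[[0, -1]].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])
labels, centres = lloyd_kmeans(X, k=2)
```

Each pass alternates the two steps until the assignments stabilize; production implementations add smarter initialization (e.g. k-means++) and a convergence check.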
In terms of their results, hierarchical clustering and K-Means can sometimes coincide in terms of the number of clusters identified, but this is not always the case. The number of clusters in K-Means is determined by the user, while in hierarchical clustering it is determined by the dendrogram.
The dendrogram generated by hierarchical clustering can be used to estimate the number of clusters in the data. One way to do this is to look for the longest vertical distance that does not intersect any horizontal line and draw a horizontal line at that point. The number of clusters is equal to the number of vertical lines that intersect the horizontal line.
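A Python analogue of this workflow (the original analysis used R's `hclust`; this sketch uses SciPy with a cosine distance on synthetic data, and cuts the tree at the largest gap between merge heights, which is the programmatic version of drawing the horizontal line):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Three synthetic groups of near-orthogonal feature vectors.
rng = np.random.default_rng(3)
protos = ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0])
X = np.vstack([rng.normal(c, 0.1, (30, 4)) for c in protos])

# Cosine distance matrix, then average-linkage agglomerative clustering.
Z = linkage(pdist(X, metric="cosine"), method="average")

# Cut at the midpoint of the largest gap between successive merge heights.
heights = Z[:, 2]
gaps = np.diff(heights)
cut = heights[gaps.argmax()] + gaps.max() / 2
labels = fcluster(Z, t=cut, criterion="distance")
print(len(np.unique(labels)))
```

Counting the distinct flat-cluster labels after the cut recovers the number of clusters suggested by the dendrogram, here 3 for the three generating groups.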
From the result below, one can infer that k = 6 is the optimal number of clusters. This does not coincide with the K-means result, since our analysis using the Elbow method found k = 5 to be optimal.
The main reason for this difference is the different measures of cluster quality. Hierarchical clustering and the Elbow method use different measures of cluster quality to determine the optimal number of clusters. Hierarchical clustering uses the dendrogram to determine the number of clusters, which is based on the hierarchical structure of the data. The Elbow method uses the sum of squared distances of the data points to their nearest cluster centres, and the optimal number of clusters is determined by looking for the "elbow" point in the plot of the sum of squared distances versus the number of clusters.