In K-means clustering, Euclidean distance is used to measure the dissimilarity between data points. K-means requires the number of clusters to be specified in advance, so methods such as the Elbow method and the Silhouette method are used here to determine the optimal number of clusters.
The Elbow method indicates that the optimal number of clusters is 3, marked by the point where the curve begins to bend. The Silhouette method, however, yields its maximum score at K = 2 and its minimum at K = 4. Given this disagreement between the two methods, clustering is tested at K = 2, 3, and 4 to observe how the results vary.
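The sketch below shows how these two diagnostics could be computed with scikit-learn. The input file and the column names (`origin_temperature`, `origin_weather_delay`) are hypothetical placeholders standing in for the report's actual data, not its real identifiers.

```python
# Minimal sketch of the Elbow and Silhouette analyses.
# File and column names below are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("florida_flights.csv")  # hypothetical input file
X = df[["origin_temperature", "origin_weather_delay"]].to_numpy()

ks = range(2, 9)
inertias, sil_scores = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                        # within-cluster sum of squares
    sil_scores.append(silhouette_score(X, km.labels_))  # mean silhouette coefficient

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, marker="o")
ax1.set(title="Elbow method", xlabel="K", ylabel="Inertia")
ax2.plot(list(ks), sil_scores, marker="o")
ax2.set(title="Silhouette method", xlabel="K", ylabel="Mean silhouette score")
plt.tight_layout()
plt.show()
```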
Similarly, for hierarchical clustering, cosine similarity is used as the distance measure, and dendrograms are plotted to determine an effective number of clusters. Multiple linkage criteria can be employed for hierarchical clustering; in this case, Ward linkage and complete linkage are used to plot the dendrograms.
Based on these dendrograms, complete linkage begins to separate the data cleanly at 3 clusters, whereas Ward linkage separates it into 2 clusters at the top level. Which linkage is preferable depends on the use case; here, Ward linkage is used for clustering.
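A sketch of the dendrogram comparison is given below, reusing the feature matrix X from the previous sketch. One caveat: SciPy's Ward linkage is defined only for Euclidean distances, so in this sketch the cosine metric is applied to the complete-linkage tree while Ward uses Euclidean.

```python
# Dendrograms under two linkage criteria, reusing X from the sketch above.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z_complete = linkage(X, method="complete", metric="cosine")
Z_ward = linkage(X, method="ward")  # Ward linkage requires Euclidean distance

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
dendrogram(Z_complete, ax=ax1, truncate_mode="lastp", p=20)  # show last 20 merges
ax1.set(title="Complete linkage (cosine)")
dendrogram(Z_ward, ax=ax2, truncate_mode="lastp", p=20)
ax2.set(title="Ward linkage (Euclidean)")
plt.tight_layout()
plt.show()
```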
With K = 2, the data is divided into two clusters, split between delays of less than 5 hours and delays of 5 hours and above. Beyond this split, however, no major insights can be drawn from the clustering.
With K = 3, the data is divided into three clusters. Delays above 5 hours are grouped into one cluster. Delays of less than 5 hours are split by temperature into two further clusters: one in the 40°F–75°F range, where delays are sparse and minimal, and a third above 75°F, which is dense and tightly packed, signifying a higher occurrence of delays. This indicates a relationship between delays and temperature in Florida: as the temperature increases, the chance of a flight delay rises. Overall, K = 3 provides more meaningful insights than K = 2.
With K = 4, the data is divided into four clusters. Delays above 10 hours are grouped into one cluster. Just below it, a dense cluster covers delays of 3 to 6 hours at temperatures above 70°F; another cluster contains delays of less than 3 hours at temperatures above 75°F; and the last cluster holds sparsely spread delays in the 40°F–75°F range. The extra cluster contributes little, implying that K = 3 is the optimal value of K for K-means clustering. This is supported by the Elbow method, and the Silhouette method's minimum score at K = 4 likewise argues against the additional cluster.
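A sketch of how these three K-means partitions could be produced and plotted side by side, again reusing the hypothetical feature matrix X, follows.

```python
# Fit and visualize K-means for K = 2, 3, 4 on temperature vs. weather delay.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, k in zip(axes, (2, 3, 4)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10)
    ax.set(title=f"K-means, K = {k}", xlabel="Origin temperature (°F)")
axes[0].set(ylabel="Origin weather delay (hours)")
plt.tight_layout()
plt.show()
```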
Turning to hierarchical clustering, with K = 2 the data is divided into two clusters separated by temperature. The red cluster (cluster 1) spans the 40°F–75°F range and is wide and sparsely distributed, while the green cluster (cluster 2), above 75°F, is comparatively dense. These are the only insights that can be drawn from this clustering.
With K = 3, the red cluster (cluster 1) again spans 40°F–75°F, as with K = 2. Cluster 2 from K = 2, however, is split in two: above 75°F, delays below 3 hours form the green cluster (cluster 2), and delays above 3 hours form the blue cluster (cluster 3). Nothing further can be read from this plot, making K = 2 appear more suitable for the given data.
With K = 4, the red cluster (cluster 1) again spans 40°F–75°F, as with K = 2 and 3. Cluster 2 from K = 3 is split further: delays below 3 hours in the 75°F–80°F range form the green cluster (cluster 2), while delays below 3 hours at temperatures above 80°F form the blue cluster (cluster 3). The gold cluster (cluster 4) is the same as cluster 3 from K = 3. Again, nothing further can be observed, so K = 2 remains the more suitable choice for the given data.
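The corresponding hierarchical partitions can be obtained by cutting the Ward-linkage tree at K = 2, 3, and 4; a sketch with scikit-learn is shown below (linkage="ward" again implies the Euclidean metric, per the caveat above).

```python
# Cut the Ward-linkage hierarchy at K = 2, 3, 4 and plot the clusters.
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, k in zip(axes, (2, 3, 4)):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10)
    ax.set(title=f"Hierarchical (Ward), K = {k}", xlabel="Origin temperature (°F)")
axes[0].set(ylabel="Origin weather delay (hours)")
plt.tight_layout()
plt.show()
```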
Comparing the clustering results of K-means and hierarchical clustering across the values of K yields several insights. In terms of performance, K-means was computationally faster than hierarchical clustering. The choice of distance measure also plays a significant role: K-means used Euclidean distance, making it well suited to low-dimensional data where the magnitudes of both origin temperature and origin weather delay matter, whereas hierarchical clustering used cosine similarity, which considers the angle between points rather than their magnitudes.
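The runtime gap can be illustrated with a rough timing sketch such as the one below; the absolute numbers depend on dataset size and hardware and are indicative only.

```python
# Rough timing comparison; results vary with data size and hardware.
import time
from sklearn.cluster import KMeans, AgglomerativeClustering

t0 = time.perf_counter()
KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(f"K-means:      {time.perf_counter() - t0:.3f} s")  # scales roughly linearly in n

t0 = time.perf_counter()
AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(f"Hierarchical: {time.perf_counter() - t0:.3f} s")  # quadratic in n (time and memory)
```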
Comparing the cluster results from both methods, K-means takes both origin temperature and origin weather delay into account, while hierarchical clustering separates the data primarily by temperature. Another distinction is that K-means requires a predefined number of clusters, whereas hierarchical clustering does not.
In terms of effective clustering, K-means with K = 3 appears optimal, whereas hierarchical clustering works best with K = 2, a result strongly influenced by the chosen linkage method. These differences highlight the importance of understanding the nature of the data and the goals of the analysis when choosing between K-means and hierarchical clustering.