Before K-means clustering can be performed, the data must be prepared. Here, K-means is run with Python's scikit-learn library, which uses Euclidean distance. scikit-learn requires numeric input for K-means because each centroid is computed as the mean of the points in its cluster, and a mean is not meaningful for categorical variables.
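As a minimal sketch of why numeric input matters, the example below runs scikit-learn's `KMeans` on a small made-up array. The values, the two-column layout, and `n_clusters=2` are illustrative assumptions, not figures from the actual dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric input: each row is one observation with two numeric features.
X = np.array([
    [30.0, 5.0],
    [32.0, 7.0],
    [31.0, 6.0],
    [85.0, 60.0],
    [88.0, 65.0],
    [86.0, 62.0],
])

# n_clusters=2 is an illustrative choice, not a value taken from the text.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Each centroid is the arithmetic mean of the points assigned to that cluster,
# which is why every column has to be numeric.
centroids = model.cluster_centers_
print(labels)
print(centroids)
```

A categorical column such as a state code would have to be encoded or dropped before this call, since no mean can be taken over category labels.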
Similarly, hierarchical clustering is performed in R using cosine similarity as the distance measure, and the data must be transformed before it is passed as input. Because the performance of K-means and hierarchical clustering will be compared, both methods must use the same data. R, however, cannot cluster the full dataset: with large inputs it fails with a memory error. To work around this, the data is filtered to a subset covering one of the 51 states before being passed as input.
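One likely reason for the memory error, stated here as an assumption rather than a diagnosis, is that hierarchical clustering in R works from a pairwise distance structure, which grows quadratically with the number of rows. The short calculation below estimates the size of such a structure for a few hypothetical row counts, assuming 8 bytes per stored distance.

```python
def condensed_distance_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store one distance for every unordered pair of rows."""
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs * bytes_per_value


# Hypothetical row counts; the size of the real dataset is not given in the text.
for n in (1_000, 100_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} rows -> {gib:,.2f} GiB")
```

Because the cost depends on the square of the row count, filtering to a single state reduces the memory requirement far more than proportionally.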
Initial data before processing for clustering
In the previous data preparation for visualization, the temperature column was discretized/binned into categories. In this case, the data used as input for K-means and hierarchical clustering will use the original temperature values before transformation.
Next, the input features for clustering are selected. To address R's memory issue, weather delay patterns are analyzed for a single state, Florida, which has the most delays. Moreover, since the percentage of weather delays is higher at the origin than at the destination, only the origin is considered. The number of features also matters: with more than three features, the clusters cannot be plotted directly and are harder to interpret. PCA could reduce the dimensionality, but for simplicity, and to focus on the main feature behind the delays, only two features are considered:
Origin Temperature
Origin Weather Delay
The data is therefore prepared by filtering the rows whose origin state is Florida and storing them in a new data frame. From that data frame, only the two features selected for K-means and hierarchical clustering, origin temperature and origin weather delay, are kept. Since both columns are numeric, they can be used as input for both clustering methods.
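The filtering and feature selection described above can be sketched with pandas as follows. The column names (`origin_state`, `origin_temperature`, `origin_weather_delay`, `dest_weather_delay`), the state code `"FL"`, and all values are hypothetical, since the actual schema is not shown in the text. Dropping rows with missing values is also an assumption, added because scikit-learn's K-means does not accept missing values.

```python
import pandas as pd

# Hypothetical stand-in for the full data frame; names and values are made up.
flights = pd.DataFrame({
    "origin_state": ["FL", "FL", "TX", "FL", "NY"],
    "origin_temperature": [78.0, 81.5, 90.0, None, 40.0],
    "origin_weather_delay": [15.0, 42.0, 10.0, 5.0, 30.0],
    "dest_weather_delay": [0.0, 3.0, 0.0, 1.0, 12.0],
})

# The two features chosen for clustering.
features = ["origin_temperature", "origin_weather_delay"]

# Step 1: keep only the rows whose origin state is Florida.
florida = flights[flights["origin_state"] == "FL"]

# Step 2: keep only the two selected columns.
# Dropping missing values is an assumption, not a step described in the text.
clustering_input = florida[features].dropna().reset_index(drop=True)

print(clustering_input)
print(clustering_input.dtypes)
```

The same `clustering_input` frame could then be written to a file and read in R, so that both clustering methods receive identical data.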
Data after filtering the state of Florida
Snapshot of a preprocessed final data frame for clustering