Clustering is commonly used to group data into structures that are easier to understand and manipulate. The clustering algorithms used here are k-means and agglomerative clustering. Plotting the clusters helps to detect anomalies or outliers and to group the data by similarity. The information obtained from clustering supports the EDA.
First, determine the optimal number of clusters (k) using the elbow method or the Davies-Bouldin score.
Based on the graph, the elbow is not sharp or clear, so k could be 3 or 4.
Based on the Davies-Bouldin score graph, 3 is the best k value, since a lower score indicates better-separated clusters.
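The search for k can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the notebook's actual code: the data is synthetic (randomly generated delay values in minutes), and the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for the delay data (an assumption, not the original dataset):
# one large low-delay group, one smaller mid-delay group, a few extreme values.
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal(100, 30, 300),
    rng.normal(900, 100, 40),
    rng.normal(2500, 20, 5),
]).reshape(-1, 1)

inertias = {}   # within-cluster sum of squares, plotted against k for the elbow method
db_scores = {}  # Davies-Bouldin score per k, lower is better

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    db_scores[k] = davies_bouldin_score(X, model.labels_)

# The k with the lowest Davies-Bouldin score is the candidate "best" k.
best_k = min(db_scores, key=db_scores.get)
print("inertia per k:", inertias)
print("Davies-Bouldin per k:", db_scores)
print("k with lowest Davies-Bouldin score:", best_k)
```

Plotting inertias against k gives the elbow curve. When the elbow is ambiguous, as it is here, the Davies-Bouldin score offers a single number to compare per k.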
Scatter Plot of k-means clustering
Based on the scatter plot, cluster 2 is considered an outlier. The biggest cluster is cluster 0 with 6017 items, followed by cluster 1 with 28 items.
The table above shows the cluster centroids.
Based on the cluster table, the rows with a total delay duration of 2565 minutes belong to cluster 2, so these 2 rows are outliers and are removed. In cluster 0, the total delay duration is less than 480 minutes, whereas in cluster 1 it is between 480 and 1454 minutes.
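The steps above (fit k-means with k = 3, inspect cluster sizes and centroids, then drop the outlier cluster) might look something like this. It is a sketch on synthetic data, so the counts and ranges will not match the figures reported here. Treating the cluster with the highest centroid as the outlier cluster is an assumption made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic delay durations in minutes (an assumption, not the original dataset).
rng = np.random.default_rng(1)
delay = np.concatenate([
    rng.uniform(0, 200, 500),     # many short delays
    rng.uniform(800, 1200, 20),   # a few long delays
    [2565.0, 2565.0],             # two extreme rows
])
X = delay.reshape(-1, 1)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Cluster sizes, centroids, and the min/max delay within each cluster.
sizes = np.bincount(labels, minlength=3)
centroids = model.cluster_centers_.ravel()
ranges = {c: (delay[labels == c].min(), delay[labels == c].max()) for c in range(3)}

# Assumption for this sketch: the cluster with the highest centroid holds the outliers.
outlier_cluster = int(np.argmax(centroids))
cleaned = delay[labels != outlier_cluster]

print("sizes:", sizes)
print("centroids:", centroids)
print("ranges:", ranges)
print("rows kept after removing outliers:", cleaned.size)
```

Note that k-means assigns cluster numbers arbitrarily, so the outlier cluster should be identified from its size or centroid rather than from a fixed label.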
Agglomerative Clustering
The scatter plot of agglomerative clustering is almost the same as that of k-means clustering. The 2 red dots with a delay of 2565 minutes are the outliers.
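One way to check how closely the two methods agree is to compare their label assignments directly. The sketch below uses scikit-learn's AgglomerativeClustering with Ward linkage (the linkage used in the original notebook is not stated, so this is an assumption) on synthetic data, and measures agreement with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic delay durations in minutes (an assumption, not the original dataset).
rng = np.random.default_rng(2)
delay = np.concatenate([
    rng.uniform(0, 200, 300),
    rng.uniform(800, 1200, 20),
    [2565.0, 2565.0],             # the two extreme rows, placed last
])
X = delay.reshape(-1, 1)

# Ward linkage is an assumption; the source does not say which linkage was used.
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means identical groupings, regardless of label numbering.
agreement = adjusted_rand_score(km_labels, agg_labels)

# Rows that share a cluster with the last (extreme) row.
outliers = delay[agg_labels == agg_labels[-1]]

print("agreement between k-means and agglomerative:", agreement)
print("rows in the outlier cluster:", outliers)
```

The adjusted Rand index ignores the cluster numbering, which matters because the two algorithms number their clusters independently.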
Cluster table set
This analysis describes the range of the data so that it can be understood and investigated further. For example, the highest total delay time is 2565 minutes, and records whose destination is outside Southeast Asia are mostly delayed.
Click this link to view the Google Colab: