Clustering is one of the most widely used machine learning techniques. It falls into the category of unsupervised learning, where labels are not provided and the model learns structure from the data without human intervention. The primary goal of clustering is to group data points based on their similarity. It is a popular method with a wide range of applications. Some common applications of clustering include:
Market Segmentation
Anomaly Detection
Image Segmentation
Customer Segmentation
Clustering can be employed in various ways, with prominent applications including:
Data Segregation or Grouping: Group data based on their similarity.
Data Validation: Validating labeled data by uncovering patterns or anomalies, ensuring the correctness of labels.
Data Labeling: After clustering, data can be labeled accordingly, enhancing interpretation and understanding.
Because data formats and usage differ across industries, clustering methods vary as well. There are many clustering methods, and each uses a different criterion for grouping the data.
In most cases, similarity between data points is measured using the distance between them, and different distance measures suit different scenarios. Commonly used measures include Euclidean distance, Manhattan distance, and cosine similarity.
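As a minimal sketch, the measures above can be computed directly with NumPy (the vectors here are hypothetical illustrative values, not data from the flight-delay analysis):

```python
import numpy as np

# Two illustrative feature vectors (hypothetical values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Euclidean distance: straight-line distance between the points.
euclidean = np.linalg.norm(a - b)

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: compares direction, ignoring magnitude.
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine_sim)
```

Note that cosine similarity is a similarity score (1 means identical direction), whereas the other two are distances (0 means identical points); cosine *distance* is commonly defined as one minus the cosine similarity.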
In the context of flight delays caused by weather, clustering is a valuable analytical tool for examining spatial and meteorological patterns. Applying clustering makes it possible to investigate how delays are distributed across temperature values. This approach also aids in determining the optimal number of clusters, providing a more complete understanding of the factors contributing to weather-related flight delays.
First, the K-Means algorithm, a form of partitional clustering, is employed to see whether any distinctive patterns emerge. Before running K-Means, the optimal number of clusters is determined using techniques such as the Elbow method and the Silhouette method. The distance measure used in this context is Euclidean distance.
Following that, hierarchical clustering is performed to verify whether the results match those of the partitional clustering and whether any new patterns are observable. The similarity measure used for this clustering is cosine similarity.
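This step can be sketched with SciPy's agglomerative clustering, again on synthetic stand-in data (the real analysis would reuse the same feature matrix as the K-Means step). Average linkage is assumed here, since Ward linkage requires Euclidean distance rather than cosine:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in features with two clearly different directions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[1, 5], scale=0.2, size=(50, 2)),
    rng.normal(loc=[5, 1], scale=0.2, size=(50, 2)),
])

# Agglomerative (bottom-up) clustering using cosine distance
# (1 - cosine similarity) with average linkage.
Z = linkage(X, method="average", metric="cosine")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels))
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to visualize the merge hierarchy and compare the resulting groups against the K-Means partitions.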