Overview:
Clustering is a powerful unsupervised learning technique for identifying patterns or groups within data. In this project, we focus on two main types: partitional clustering (specifically, k-means) and hierarchical clustering. Partitional clustering divides the data into distinct groups based on their attributes, aiming for homogeneity within clusters and heterogeneity between them. Hierarchical clustering, on the other hand, builds a tree of clusters that shows hierarchical relationships among the data points. We use Euclidean distance as the metric for k-means and cosine similarity for hierarchical clustering. The goal is to uncover natural groupings within weather data, revealing insights into different weather patterns.
There are two main types of clustering:
Partitional Clustering: This divides the data into non-overlapping subsets (clusters) such that each data point is in exactly one subset. K-means is a popular partitional clustering algorithm.
Hierarchical Clustering: Builds a hierarchy of clusters using a tree-like structure called a dendrogram. It can be agglomerative (bottom-up approach) or divisive (top-down approach).
Distance Metrics play a crucial role in determining the similarity between data points. Common metrics include Euclidean distance, Manhattan distance, and Cosine similarity. Cosine similarity measures the cosine of the angle between two non-zero vectors, useful for assessing similarity in high-dimensional data.
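As a quick illustration, all three metrics can be computed with NumPy (a minimal sketch; the two vectors are made up for demonstration). Note that y is a scaled copy of x, so the cosine similarity is 1 even though the Euclidean and Manhattan distances are large:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # y = 2x: same direction, different magnitude

# Euclidean distance: straight-line distance between the points
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x - y))

# Cosine similarity: cosine of the angle between the vectors
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, cosine)  # cosine ≈ 1.0 despite the large distances
```

This is exactly the distinction the project exploits: Euclidean distance reacts to magnitude, cosine similarity only to orientation.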
Application of Clustering for Discovery: Clustering can help in discovering inherent groupings within the data, such as customer segments, document categories, or natural groupings in biological data.
Clustering is used in this project to group weather data into distinct clusters. By analyzing Temperature, Humidity, and Wind Speed, we can observe how similar weather patterns group together. This helps in:
Anomaly Detection: Identifying outliers or unusual weather patterns.
Pattern Discovery: Uncovering hidden relationships between weather variables.
Data Segmentation: Grouping similar weather conditions for predictive modeling or further analysis.
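A minimal sketch of the anomaly-detection idea, using hypothetical readings for the three features the project analyses (Temperature, Humidity, Wind Speed; the numbers and the 1.5-standard-deviation threshold are illustrative assumptions, not values from the project):

```python
import numpy as np

# Hypothetical weather readings: [Temperature (°C), Humidity (%), Wind Speed (km/h)]
data = np.array([
    [21.0, 60.0, 10.0],
    [22.0, 58.0, 12.0],
    [20.0, 62.0, 11.0],
    [35.0, 20.0, 40.0],  # an unusual reading
])

# Euclidean distance of each point to the centroid (mean) of the data
centroid = data.mean(axis=0)
dist = np.linalg.norm(data - centroid, axis=1)

# Flag points far from the centroid as anomalies (threshold is an assumption)
threshold = dist.mean() + 1.5 * dist.std()
anomalies = np.where(dist > threshold)[0]
print(anomalies)
```

In the real project the distances would be measured to each point's own cluster centroid rather than to a single global mean.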
1. Euclidean Distance (Used in K-Means):
Measures the straight-line distance between points.
Best for datasets where the magnitude of differences between points matters.
Formula: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ), summing over the n features i = 1, …, n.
This image visualises how Euclidean distance is used in K-Means to measure the proximity between points.
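The role of Euclidean distance inside k-means can be sketched as a single iteration of the algorithm's two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points (a minimal NumPy sketch; the points and initial centroids are made up):

```python
import numpy as np

def assign_clusters(points, centroids):
    """Assign each point to its nearest centroid by Euclidean distance."""
    # Pairwise distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]])
centroids = np.array([[0.0, 0.0], [9.0, 9.0]])

# One k-means iteration: assign, then update
labels = assign_clusters(points, centroids)
centroids = update_centroids(points, labels, k=2)
print(labels)  # → [0 0 1 1]
```

Repeating these two steps until the assignments stop changing gives the full k-means algorithm.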
2. Cosine Similarity (Used in Hierarchical Clustering):
Measures the cosine of the angle between vectors, focusing on orientation rather than magnitude.
Ideal when the direction of the data points matters more than their magnitude.
Formula:
cosine_similarity(x, y) = (x · y) / (‖x‖ · ‖y‖)
This image visualises how cosine similarity is used in hierarchical clustering to measure the similarity between points based on orientation.
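A minimal sketch of agglomerative (bottom-up) hierarchical clustering using cosine distance (1 − cosine similarity) with single linkage; the vectors are made up, and a real project would more likely use a library routine such as SciPy's linkage:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerative(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (0, 1, np.inf)
        # Single linkage: cluster distance = distance of their closest members
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Two orientation groups: scaled copies of a vector share its direction
points = np.array([[1.0, 0.1], [2.0, 0.2], [0.1, 1.0], [0.3, 3.0]])
print(agglomerative(points, 2))
```

Because cosine distance ignores magnitude, the scaled copies end up in the same cluster even though they are far apart in Euclidean terms.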