Clustering is an unsupervised machine learning technique that groups data points based on their similarities, using distance metrics such as Euclidean distance to measure how closely related the data points are. In this project, clustering is used to identify patterns within coal shipments by analyzing attributes such as ash content, heat content, price, quantity, and sulfur content. Three clustering methods (K-means, hierarchical clustering, and DBSCAN) are employed to reveal distinct groups of coal shipments. Each method uses a different approach: K-means groups data by minimizing the distance between points and centroids, hierarchical clustering builds a tree of clusters, and DBSCAN identifies clusters based on data density, which also helps detect outliers. Through clustering, the project gains insights into coal quality variations, pricing strategies, and potential anomalies within the shipments. The process of clustering not only highlights underlying patterns but also aids in optimizing business decisions related to coal procurement and distribution.
Types of Clustering:
Partition Clustering (K-means): A method that divides data into mutually exclusive groups, where each data point belongs to only one group. It tries to make the data points within each cluster (intra-cluster) as similar as possible while keeping the clusters themselves as different as possible.
Hierarchical Clustering: This method builds a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes. Strategies for hierarchical clustering generally fall into two types: divisive and agglomerative.
Density-Based Clustering (DBSCAN): Groups together closely packed points, marking as outliers points that lie alone in low-density regions. This differs from K-means by not requiring prior specification of the number of clusters to form.
DATA PREPARATION
In this clustering analysis, the 3D PCA-reduced dataset was used as the input to all three clustering methods: K-means, hierarchical clustering, and DBSCAN. Reducing the dataset to three principal components lowered its complexity while still retaining a significant share of the variance, which simplified the clustering process and made the results easier to interpret.
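The preparation step described above can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic stand-in data; the variable `X` is a hypothetical placeholder for the five coal shipment attributes (ash content, heat content, price, quantity, sulfur content), not the project's actual dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical stand-in for the coal shipment attributes:
# ash content, heat content, price, quantity, sulfur content.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# Normalize so each attribute contributes equally to the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to three principal components for 3D clustering and plotting.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)  # (200, 3)
print(pca.explained_variance_ratio_)  # variance retained per component
```

The `explained_variance_ratio_` attribute reports how much of the total variance each component preserves, which is the basis for the claim that the 3D projection retains a significant amount of variance.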
NORMALIZED DATA
PCA DATA FOR 2D AND 3D
K-MEANS CLUSTERING
SILHOUETTE SCORES FOR DIFFERENT K
The optimal number of clusters was determined using the Silhouette Method, which evaluates the quality of clustering. Based on the silhouette scores, three distinct cluster numbers (K=3, K=5, and K=7) were chosen for deeper analysis.
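The Silhouette Method described above can be sketched as a loop over candidate values of K, scoring each fit. This is an illustrative example on synthetic 3D data with three planted groups (not the project's data), assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 3D stand-in with three well-separated planted groups.
rng = np.random.default_rng(0)
X_pca = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 3))
                   for c in ([0, 0, 0], [4, 4, 0], [0, 4, 4])])

# Fit K-means for each candidate K and record the silhouette score,
# which ranges from -1 (poor) to 1 (dense, well-separated clusters).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On data with three planted groups the score peaks at K=3; on the real shipment data the same loop produced the scores that motivated examining K=3, K=5, and K=7.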
PLOTS FOR K = 3, K = 5 AND K = 7
RESULTS:
Each plot of the 3D PCA data with a different K value shows the centroids (marked with an 'X') that represent the center of each cluster. These visualizations help in interpreting the spatial distribution of data points and understanding how they aggregate into meaningful clusters.
For K=3: Three clusters show clear segmentation, suggesting a strong distinction in the dataset's inherent characteristics.
For K=5: Increasing the number of clusters provides a more nuanced separation, allowing for finer distinctions between data groups.
For K=7: The additional clusters reveal more detailed divisions, which may correspond to more specific characteristics or conditions within the dataset.
These visual analyses are crucial for identifying patterns and variances in the dataset, enabling predictions and decisions based on the detected clusters. The clustering results, coupled with the centroids, offer a practical approach to categorizing complex datasets into manageable and interpretable segments.
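A fit like the K=3 case above can be sketched as follows. This is a hedged example on synthetic stand-in data (the real input is the 3D PCA projection); it shows how the centroids marked 'X' in the plots are obtained and how each point maps to its nearest centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 3D stand-in data with three planted groups.
rng = np.random.default_rng(1)
X_pca = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 3))
                   for c in ([0, 0, 0], [5, 0, 0], [0, 5, 5])])

# Fit K-means with K=3; cluster_centers_ are the 'X' markers in the plots.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_pca)
centroids = km.cluster_centers_

# Each point's label is the index of its nearest centroid.
print(centroids.round(1))
# A 3D scatter colored by km.labels_, with centroids overplotted as 'X',
# reproduces the kind of figure shown for K = 3, 5, and 7.
```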
HIERARCHICAL CLUSTERING
Hierarchical clustering was applied to the three-dimensional PCA-reduced dataset to investigate the natural groupings without pre-specifying the number of clusters, as is required in K-means. The Ward method was employed, which at each merge joins the pair of clusters that least increases the total within-cluster sum of squares. This method is particularly effective for identifying compact, roughly spherical clusters and is often preferred for its intuitiveness and general applicability.
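The Ward-linkage computation behind the dendrogram can be sketched with SciPy. This is a minimal example on synthetic stand-in data, assuming `scipy.cluster.hierarchy`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic 3D stand-in with two planted groups.
rng = np.random.default_rng(2)
X_pca = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 3))
                   for c in ([0, 0, 0], [5, 5, 0])])

# Ward linkage: each merge minimizes the increase in the total
# within-cluster sum of squares.
Z = linkage(X_pca, method="ward")

# Z has one row per merge: (cluster_i, cluster_j, merge_height, new_size).
print(Z.shape)  # (59, 4): n_samples - 1 merges
# scipy.cluster.hierarchy.dendrogram(Z) renders this merge tree.
```

The merge heights in the third column of `Z` are exactly the heights of the horizontal lines in the dendrogram, and with Ward linkage they are non-decreasing from one merge to the next.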
RESULTS:
The dendrogram generated provides a visual representation of the clustering process, illustrating how individual samples are merged into clusters based on their distance or similarity. Each merge is represented by a horizontal line connecting the clusters, with the height of the line indicating the distance or dissimilarity between clusters being merged.
Comparative Analysis with K-Means Clustering:
Hierarchical clustering reveals a more granular view of the data's structure, showing not just the formations of clusters but also the hierarchy and proximity between different groups.
Unlike K-means, which requires a predefined number of clusters and can force data into these categories, hierarchical clustering allows for a more flexible understanding of data relationships.
The dendrogram can be particularly useful for determining the appropriate number of clusters by observing the 'distance' at which large jumps in linkage occur, which often signify natural divisions within the data.
This approach provides deeper insight into the underlying structure of the dataset, which can complement the findings from K-means clustering by highlighting subtleties in data grouping that K-means might overlook due to its centroid-based approach.
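The dendrogram-cutting idea described above (choosing the number of clusters at a large jump in linkage height) can be sketched with SciPy's `fcluster`. Again the data is a synthetic stand-in with three planted groups, not the project's dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 3D stand-in with three well-separated planted groups.
rng = np.random.default_rng(3)
X_pca = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 3))
                   for c in ([0, 0, 0], [6, 0, 0], [0, 6, 6])])

Z = linkage(X_pca, method="ward")

# A large jump in merge height signals a natural division; cutting the
# tree below that jump yields flat cluster labels. Here we request the
# three clusters that the jump would suggest.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))  # [1 2 3]
```

Unlike K-means, the same `Z` can be re-cut at any height, so several granularities can be inspected from a single fit.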
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) provides a robust method for identifying clusters of varying shapes and sizes in a dataset, which is particularly beneficial when the data contains noise and outliers. This technique was applied to the three-dimensional PCA-reduced data to discern dense regions of data points, effectively highlighting core samples and distinguishing them from noise.
DBSCAN works on the principle of identifying 'core' samples within a specified radius (eps) that have a minimum number of points (min_samples) within their neighborhood. Points within these neighborhoods are classified as part of a cluster, whereas points in sparse areas are labeled as noise, thus not assigned to any cluster.
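The core/noise behavior described above can be sketched with scikit-learn's `DBSCAN`. This is an illustrative example on synthetic stand-in data (one dense region plus two isolated points); the `eps` and `min_samples` values are chosen for this toy data, not taken from the project:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense region plus two isolated outliers (synthetic stand-in data).
rng = np.random.default_rng(4)
dense = rng.normal(loc=0.0, scale=0.3, size=(80, 3))
outliers = np.array([[8.0, 8.0, 8.0], [-9.0, 5.0, 7.0]])
X_pca = np.vstack([dense, outliers])

# eps is the neighborhood radius; min_samples is the number of points
# required in that neighborhood for a point to count as a core sample.
db = DBSCAN(eps=1.0, min_samples=5).fit(X_pca)

# Points in sparse areas receive the label -1 (noise) rather than
# being forced into a cluster.
print(sorted(set(db.labels_.tolist())))  # [-1, 0]
```

No number of clusters is supplied: the single dense region is discovered as cluster 0, and the two isolated points are labeled -1.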
VISUALIZATION AND INSIGHTS
The 3D scatter plot of the PCA-reduced data clustered using DBSCAN reveals the spatial distribution of data points, highlighting clusters in distinct colors and noise in another, providing an intuitive understanding of group densities and separations.
This method proves advantageous over K-means and Hierarchical clustering by not requiring the number of clusters beforehand and by its ability to handle outliers effectively.
DBSCAN's ability to discover arbitrarily shaped clusters and its largely order-independent results make it a powerful tool for exploratory data analysis, especially when dealing with complex datasets where traditional clustering methods may falter.
COMPARISON WITH OTHER CLUSTERING METHODS
The DBSCAN clustering results in the 3D PCA plot provide a distinct contrast to K-means and hierarchical clustering:
Density vs Centroid-Based Clustering: DBSCAN, a density-based method, groups closely packed points and labels low-density points as outliers. In contrast, K-means assigns every point to the nearest centroid, often overlooking outliers. DBSCAN's approach highlights dense areas while isolating scattered outliers.
Outlier Handling: DBSCAN effectively identifies outliers, unlike K-means, which forces every point into a cluster. This is evident in the DBSCAN plot where outliers are clearly separated from the main cluster.
Cluster Shape Flexibility: Unlike K-means, which assumes spherical clusters, DBSCAN can capture clusters of varying shapes, making it better suited for complex data structures.
Comparison with Hierarchical Clustering: While hierarchical clustering provides varying levels of granularity, DBSCAN offers a clear partitioning of dense regions and sparse areas as outliers.
In summary, DBSCAN excels at identifying irregularly shaped clusters and outliers, offering a more flexible alternative to K-means and hierarchical clustering when clusters are irregular or the data contains noise.
LEARNINGS
The analysis of coal shipment data provided valuable insights into coal characteristics and distribution within the energy sector:
Coal Quality Diversity: Significant variations in coal quality were observed across states and coal types (e.g., Bituminous, Subbituminous, Lignite). This is crucial for energy companies and policymakers, as it impacts both efficiency and environmental outcomes.
Impact of Coal Properties on Pricing: The analysis revealed that higher heat content correlates with higher market prices, highlighting the value of energy-dense coal.
Temporal Changes in Coal Use: Shifts in coal shipments over time suggest changing trends in energy usage, possibly reflecting a move toward more sustainable or cost-effective energy sources.
Geographic Distribution of Shipments: The study provided insights into the logistics of coal distribution across the U.S., influencing regional energy strategies based on coal availability.
Clustering Techniques: Different clustering methods like K-means, hierarchical clustering, and DBSCAN each offered unique insights into coal data, from identifying typical clusters to detecting outliers.
Dimensionality Reduction and Visualization: PCA simplified the data while preserving key variance, enabling clearer visualizations and aiding in decision-making.
Practical Applications: These findings help energy companies optimize fuel selection, improve logistics, and comply with environmental regulations by focusing on lower-sulfur coal.
This analysis demonstrated how advanced data techniques can extract actionable insights to support strategic decisions in the energy sector.