Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset, increasing interpretability while minimizing information loss. By transforming a large set of variables into a smaller one that still contains most of the information in the large set, PCA helps highlight the most significant relationships and patterns. This is achieved by identifying the directions, called principal components, along which the variation in the data is maximized. PCA is particularly useful in processing data where multicollinearity exists or when simplifying datasets by reducing the number of dimensions without significant loss of information.
Principal Component Analysis (PCA) plays a critical role in reducing the complexity of the coal dataset while preserving the most important information. The coal dataset consists of numerous variables such as ash content, heat content, price, quantity, and sulfur content, which can introduce redundancy and correlation between variables. PCA simplifies this data by transforming it into a smaller set of uncorrelated principal components. These components capture the majority of the variance in the dataset, ensuring that the most significant relationships and patterns are retained, while the less relevant noise is minimized. This dimensionality reduction helps streamline the data for subsequent analyses, such as clustering or association rule mining, without losing key information.
The primary benefit of using PCA in the project is its ability to enhance interpretability and visualization while maintaining the underlying structure of the data. By reducing the data from high-dimensional space to just a few principal components, it becomes easier to visualize the coal characteristics and identify trends or groupings. This is particularly useful when dealing with high-dimensional data, where traditional visualization and analysis methods may struggle. In this project, PCA enabled a more efficient analysis, ensuring that insights drawn from clustering or association rules were based on the most significant coal attributes, leading to clearer and more meaningful conclusions.
This visualization comprises two elements illustrating the effects of PCA on the project's data. The left plot shows the distribution of the dataset along the first two principal components, highlighting how PCA condenses information; the red and blue points likely indicate different classifications or conditions prior to analysis. The right diagram simplifies this transformation into a network, where nodes represent principal components and edges represent the relationships between them, emphasizing the variance each component captures. This aids in identifying which components hold the most critical information, guiding further data-analysis decisions.
This biplot visualizes the data transformation using Principal Component Analysis (PCA), combining a scatter plot of the data in the space of the first two principal components with the vectors of the original variables. The scatter plot illustrates how the data is distributed along these components, capturing the most variance, while the arrows indicate the contribution of each variable. This visualization helps in understanding the underlying patterns and relationships in the data, essential for the project's analytical goals.
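A sketch of how such a biplot can be produced with Scikit-Learn and Matplotlib; the variable names and data here are hypothetical stand-ins for the coal dataset, not the project's actual values:

```python
import matplotlib
matplotlib.use("Agg")                   # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the coal variables and observations.
cols = ["ash", "heat", "price", "quantity", "sulfur"]
rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(80, 5)))

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)               # points of the scatter plot
loadings = pca.components_.T            # (n_features, 2): one arrow per variable

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.6)
for name, (dx, dy) in zip(cols, loadings):
    ax.arrow(0, 0, 3 * dx, 3 * dy, head_width=0.05, color="red")
    ax.annotate(name, (3 * dx, 3 * dy))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.savefig("pca_biplot.png")
```

The arrows are the columns of `components_` transposed, scaled only for legibility; their directions show how each original variable contributes to the two components.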
The data was prepared by isolating these specific numerical variables and ensuring no missing values were present. Labels and qualitative data were removed to maintain the quantitative integrity required for PCA application. This preparation stage is critical as PCA requires numeric data only.
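A minimal sketch of this preparation step, using a hypothetical DataFrame in place of the real coal records:

```python
import pandas as pd

# Hypothetical stand-in for the coal dataset; column names mirror the report.
df = pd.DataFrame({
    "mine": ["A", "B", "C", "D"],               # qualitative label, to be dropped
    "ash_content": [8.1, 9.4, None, 7.2],       # one missing value
    "heat_content": [24.0, 22.5, 23.1, 25.3],
    "price": [55.0, 48.2, 50.7, 60.1],
    "quantity": [120.0, 95.0, 110.0, 130.0],
    "sulfur_content": [0.8, 1.1, 0.9, 0.7],
})

# Keep numeric columns only, then drop rows with missing values,
# since PCA requires a complete, fully numeric matrix.
numeric = df.select_dtypes(include="number").dropna()
print(numeric.shape)  # (3, 5): one row lost to the missing ash_content value
```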
Data Normalization:
The data was normalized using the StandardScaler from Scikit-Learn to standardize the features onto a unit scale (mean = 0, variance = 1). This step is essential for PCA, since variables measured on larger scales would otherwise dominate the principal components.
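The normalization can be sketched as follows; the matrix here is a hypothetical two-variable excerpt standing in for the coal data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-variable excerpt standing in for the coal data.
X = np.array([[8.1, 24.0],
              [9.4, 22.5],
              [7.2, 25.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)     # each column: mean 0, variance 1

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```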
PCA Implementation
PCA was performed twice on the normalized data:
First, with n_components=2 to reduce the dataset to 2D.
Second, with n_components=3 to explore the data in 3D.
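The two runs above can be sketched as follows; the data is synthetic, standing in for the normalized coal matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five normalized coal variables.
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(100, 5)))

pca_2d = PCA(n_components=2).fit(X_scaled)   # first run: reduce to 2D
pca_3d = PCA(n_components=3).fit(X_scaled)   # second run: explore in 3D

scores_2d = pca_2d.transform(X_scaled)       # shape (100, 2), for 2D plots
scores_3d = pca_3d.transform(X_scaled)       # shape (100, 3), for 3D plots
```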
PCA 2D and 3D Visualizations
Information Retention
2D PCA: The two principal components retained 68.33% of the total variance in the dataset, highlighting a significant amount of the dataset's information.
3D PCA: Extending to three components increased the variance coverage to 83.6%, providing a deeper insight into the dataset’s dynamics.
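These retention figures come from summing `explained_variance_ratio_` over the leading components. A sketch with synthetic, deliberately correlated data (so the percentages here will differ from the 68.33% and 83.6% obtained on the coal dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, deliberately correlated stand-in for the coal matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA().fit(X)                       # fit all components once
ratios = pca.explained_variance_ratio_   # variance fraction per component

retained_2d = ratios[:2].sum()           # counterpart of the 68.33% figure
retained_3d = ratios[:3].sum()           # counterpart of the 83.6% figure
print(retained_2d, retained_3d)
```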
Output Visualizations for 2D and 3D PCA
Dimensionality Reduction:
To retain at least 95% of the data's original information, it was calculated that four principal components are necessary. This indicates that while the first three components capture the majority, a fourth dimension allows for a more comprehensive representation of the dataset's variance.
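Scikit-Learn can perform this calculation directly: passing a float to `n_components` keeps the smallest number of components whose cumulative variance meets the threshold. A sketch with synthetic data (on the actual coal data, this calculation yielded four components):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized coal matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Keep as many components as needed to explain at least 95% of the variance.
pca_95 = PCA(n_components=0.95).fit(X)

print(pca_95.n_components_)                    # number of components kept
print(pca_95.explained_variance_ratio_.sum())  # cumulative variance >= 0.95
```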
Eigenvalues: The top three eigenvalues from the PCA were found to be 2.30, 1.12, and 0.74, respectively, which quantitatively reflect the amount of variance captured by each principal component.
Eigenvectors: The top three eigenvectors for the PCA were calculated as shown in the array. These vectors represent the directions of maximum variance in the dataset, with each value in the eigenvector corresponding to the weight or contribution of one original feature in defining that principal component. The orientation of these eigenvectors, and the relative magnitudes of their entries, help identify the directions along which the coal dataset varies the most, aiding dimensionality reduction.
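In Scikit-Learn, the eigenvalues and eigenvectors of the covariance matrix are exposed as `explained_variance_` and `components_`. A sketch with synthetic data, so the values will not match the 2.30, 1.12, 0.74 reported above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the coal matrix.
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=3).fit(X)
eigenvalues = pca.explained_variance_   # variance captured along each component
eigenvectors = pca.components_          # rows: unit-length principal directions

# Each eigenvector entry is the weight of one original feature in that component.
print(eigenvalues)                      # sorted in decreasing order
print(eigenvectors.shape)               # (3, 5): 3 components x 5 features
```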
Cosine Similarity: The cosine similarity matrix reveals the similarity between data points based on the angle between their vector representations. Values close to 1 indicate high similarity, while values closer to -1 suggest dissimilarity. This matrix provides insight into how closely related different points are within the reduced-dimensional space, which is beneficial for clustering and identifying patterns in the coal dataset.
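A sketch of computing this matrix on the PCA scores, with synthetic data in place of the coal records:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Synthetic stand-in for the coal matrix.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 5))

scores = PCA(n_components=2).fit_transform(X)   # points in reduced space
sim = cosine_similarity(scores)                 # (50, 50) similarity matrix

# Diagonal entries are 1 (each point vs itself); all values lie in [-1, 1].
print(sim.shape)
```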
Conclusion:
Principal Component Analysis (PCA) has proven to be an invaluable tool in simplifying the coal dataset while retaining its most critical information. By reducing the dataset to its principal components, PCA enables a more manageable analysis without significant information loss. This dimensionality reduction has allowed for clearer visualizations and more effective clustering, ultimately supporting deeper insights into the patterns and relationships within the coal data. Through the use of eigenvalues, eigenvectors, and cosine similarity, PCA facilitates an understanding of the variance in each component and the similarity between data points, making it a foundational step in preparing the data for further analyses, such as clustering and association rule mining.