Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while retaining as much variability in the data as possible. It transforms the original dataset into a set of new uncorrelated variables, called principal components (PCs), which are ranked according to the variance they capture from the data. The first principal component (PC1) captures the largest variance, and each subsequent component captures the remaining variance while being orthogonal to the previous components.
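The variance-maximization idea above can be written compactly. For a centered data matrix $X$, each principal direction is the unit vector maximizing projected variance, subject to orthogonality with the directions already found (a standard formulation, not specific to this project):

```latex
% First principal direction: unit vector maximizing projected variance
w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw)

% k-th principal direction: same objective, orthogonal to all earlier ones
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} \operatorname{Var}(Xw)
```

Projecting the data onto $w_1, w_2, \dots$ yields the principal components PC1, PC2, and so on.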
Dimensionality Reduction: High-dimensional datasets are complex and challenging to visualize. PCA reduces the number of variables, allowing us to focus on the most significant ones while preserving important information.
Variance Maximization: PCA ensures that the maximum amount of variance is captured in as few components as possible, reducing the risk of overfitting while speeding up computations.
Data Visualization: By reducing the number of dimensions to two or three, we can visualize patterns in the data that were previously hidden in higher dimensions.
Before applying PCA, it is essential to standardize the dataset to ensure that each feature contributes equally. In this project, we standardized weather-related features such as Temperature, Humidity, Wind Speed, Pressure, and Visibility using StandardScaler to give them a mean of 0 and a standard deviation of 1. Once the data was scaled, we applied PCA to reduce the dimensionality.
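The standardize-then-project pipeline described above can be sketched as follows. This is a minimal example using scikit-learn; the weather data here is a synthetic stand-in (random values for the five features named above), since the project's actual dataset is not reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features: Temperature, Humidity,
# Wind Speed, Pressure, Visibility (500 samples, 5 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))

# Standardize each feature to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)  # (500, 3)
```

Standardizing first matters because PCA is driven by variance: an unscaled feature measured in large units (e.g. Pressure in hPa) would otherwise dominate the components.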
1. 2D PCA Scatter Plot
The 2D scatter plot below illustrates the distribution of the data projected onto the first two principal components. Each point represents a sample from the dataset, projected into the new 2D space where the largest variances are captured.
Principal Component 1 (PC1): Captures the maximum variance in the dataset.
Principal Component 2 (PC2): Captures the second-highest variance while being uncorrelated with PC1.

In this 2D projection, we can see how the data is spread across these two principal components. Although no distinct clusters are immediately apparent, this projection effectively reduces the dimensionality of the dataset while retaining significant variance.
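A 2D scatter plot like the one described can be produced as follows. This sketch assumes matplotlib and scikit-learn, and again uses random placeholder data in place of the project's weather dataset; the output filename `pca_2d.png` is arbitrary:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to file
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the standardized weather features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Each point is one sample projected onto the first two components.
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=10, alpha=0.6)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("2D PCA projection")
plt.savefig("pca_2d.png", dpi=150)
```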
2. 3D PCA Projection
The 3D projection below visualizes the dataset in a three-dimensional space defined by the first three principal components.
Principal Component 1 (PC1): Captures the most variance.
Principal Component 2 (PC2): Captures additional variance beyond PC1.
Principal Component 3 (PC3): Further adds variance while being orthogonal to both PC1 and PC2.
This 3D plot provides a clearer view of the data's structure, with more variance captured than the 2D projection. The spread of points across all three components gives a better sense of how the data varies across multiple dimensions.
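Extending the projection to three components only requires a 3D axes; a sketch using matplotlib's built-in 3D support, again with placeholder data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to file rather than a window
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the standardized weather features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X_3d = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

# Scatter the samples in the space spanned by PC1, PC2, PC3.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], s=10, alpha=0.6)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
fig.savefig("pca_3d.png", dpi=150)
```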
3. Explained Variance
In PCA, it is important to measure how much of the dataset’s variance is captured by each principal component. This is shown in an explained variance plot, where we can observe the proportion of total variance retained as the number of components increases.
2 Components: Captured 67.83% of the variance.
3 Components: Captured 85.60% of the variance.
From this, we can conclude that two or three principal components are sufficient to explain the majority of the variance in the dataset.
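The cumulative explained variance that drives this conclusion comes directly from the fitted PCA model's `explained_variance_ratio_`. A sketch (the percentages printed here come from the synthetic placeholder data, not the project's reported 67.83% / 85.60%):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the standardized weather features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

# Fit PCA with all components to see the full variance breakdown.
pca = PCA().fit(StandardScaler().fit_transform(X))

# Running total of variance explained as components are added.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, c in enumerate(cumulative, start=1):
    print(f"{k} components: {c:.2%} of total variance")
```

Choosing the number of components is then a matter of reading this table (or the corresponding plot) and keeping enough components to pass a chosen variance threshold.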
Dimensionality Reduction: By reducing the dataset to 2 or 3 principal components, we simplified the data while preserving the most important information.
Variance Explained: The first two principal components captured 67.83% of the variance, and the first three components captured 85.60%, allowing us to retain most of the dataset’s variability with fewer dimensions.
Data Visualization: The 2D and 3D PCA projections provide a clear and concise way to visualize the dataset in fewer dimensions, helping us identify patterns and relationships that were not visible in the original higher-dimensional space.
Principal Component Importance: The principal components identified through PCA reveal which directions of the data contain the most variance, providing insight into the most informative aspects of the dataset.
PCA proved to be a powerful tool for simplifying the dataset, improving its interpretability, and preparing it for further analysis, such as clustering or predictive modeling.