Dataset Overview
The dataset contains historical weather data with attributes such as temperature, humidity, wind speed, visibility, and more, with Precip Type serving as the target variable indicating the occurrence of precipitation (1) or lack thereof (0). To prepare this dataset for machine learning models, missing values were removed to ensure complete and unbiased data for training and testing. Categorical variables, such as the Summary column, were encoded using LabelEncoder, converting text into numeric representations suitable for machine learning algorithms. Additionally, numerical columns were scaled using StandardScaler to normalize the range of values, ensuring equal treatment of features regardless of their magnitude.
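The encoding and scaling steps described above can be sketched as follows. This is a minimal illustration on a toy frame; the column names ("Summary", "Temperature (C)", "Humidity") are stand-ins for the actual dataset's schema.

```python
# Sketch of the cleaning, encoding, and scaling steps, on a toy frame
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data standing in for the weather dataset
df = pd.DataFrame({
    "Summary": ["Clear", "Overcast", "Clear", "Foggy"],
    "Temperature (C)": [21.0, 9.5, 15.2, 7.8],
    "Humidity": [0.45, 0.89, 0.60, 0.95],
})

# Remove rows with missing values so training data is complete
df = df.dropna()

# Encode the categorical Summary column as integer labels
df["Summary"] = LabelEncoder().fit_transform(df["Summary"])

# Standardize numeric columns to zero mean and unit variance
num_cols = ["Temperature (C)", "Humidity"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```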
The dataset initially exhibited a significant class imbalance, with far more instances of No Precipitation compared to Precipitation. To address this, oversampling was applied to the minority class to equalize the dataset, enabling models to learn patterns from both classes effectively. Further, a small amount of random noise was added to the training data to mimic real-world variability and enhance robustness against overfitting. Lastly, the highly correlated feature Apparent Temperature (C) was removed, as it added redundancy due to its strong correlation with Temperature (C), which improved the overall efficiency and generalization of the models.
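The balancing, noise-injection, and feature-removal steps can be sketched as below. The values are illustrative, and the noise scale (standard deviation 0.1) is an assumption, not a figure from the study.

```python
# Sketch: oversample the minority class, drop the redundant feature,
# and add Gaussian noise to the features (toy values)
import numpy as np
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "Temperature (C)": [20.1, 18.3, 5.2, 21.7, 19.9, 4.1],
    "Apparent Temperature (C)": [20.0, 18.1, 3.0, 21.5, 19.7, 2.2],
    "Precip Type": [0, 0, 1, 0, 0, 0],
})

# Oversample the minority class until both classes are equal in size
majority = df[df["Precip Type"] == 0]
minority = df[df["Precip Type"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

# Drop the feature strongly correlated with Temperature (C)
balanced = balanced.drop(columns=["Apparent Temperature (C)"])

# Add small Gaussian noise to mimic real-world variability
rng = np.random.default_rng(0)
balanced["Temperature (C)"] += rng.normal(0, 0.1, size=len(balanced))
```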
Before and After Data Preprocessing
Data Preprocessing and Disjoint Split
The implementation was structured into logical steps to maintain clarity and ensure reproducibility. During preprocessing, irrelevant columns like Formatted Date and Daily Summary were removed, and missing values were eliminated to ensure a clean dataset. Categorical variables, such as Summary, were encoded into numeric values, and numerical features were standardized using StandardScaler to normalize their scales. To address the imbalance in the target variable, oversampling was performed using the resample method from sklearn.utils, ensuring that both classes had an equal number of instances, thereby enabling the models to learn patterns from both precipitation and non-precipitation cases effectively.
Noise was added to the training data using np.random.normal to mimic real-world variability. Any resulting NaN values were handled with SimpleImputer, replacing them with the mean of the respective columns. Ensemble models, including Random Forest, AdaBoost, and Gradient Boosting, were then trained on the processed data. Their performance was evaluated using accuracy scores, confusion matrices, and classification reports, while cross-validation was applied to ensure generalizability. Finally, the results were visualized through confusion matrices, an accuracy comparison bar plot, and a feature correlation heatmap to provide comprehensive insights into model performance and feature relationships.
A disjoint split was created to divide the dataset into training and testing subsets:
1. 80% Training Data:
• Used for training ensemble learning models.
2. 20% Testing Data:
• Reserved for evaluating model performance.
This split ensures that the training and testing data are mutually exclusive, simulating real-world scenarios and preventing overfitting.
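A minimal sketch of the 80/20 disjoint split, assuming scikit-learn's train_test_split is used (the report does not name the exact function); stratification keeps both classes represented in each subset.

```python
# Sketch of an 80/20 disjoint train/test split on toy data
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # toy feature matrix
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # toy binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Because the split is disjoint, no row appears in both subsets, so test metrics reflect performance on genuinely unseen data.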
Why Disjoint Split?
• Prevents the model from “memorizing” the test data.
• Provides realistic performance metrics.
• Simulates real-world scenarios where models are evaluated on unseen data.
Ensemble Learning Methods
Three popular ensemble learning methods were applied:
1. Random Forest
• A bagging technique that aggregates predictions from multiple decision trees.
• Reduces variance and provides feature importance for interpretability.
2. AdaBoost
• A boosting method that iteratively adjusts weights to focus on misclassified instances.
• Builds a strong classifier by combining weak learners like decision stumps.
3. Gradient Boosting
• Another boosting method that sequentially minimizes a loss function.
• Captures complex patterns and relationships in data effectively.
Each method was trained on the training set and evaluated on the test set.
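The training and evaluation loop can be sketched as below, using a synthetic binary dataset in place of the weather data; hyperparameters are scikit-learn defaults, which the report does not specify.

```python
# Sketch: train the three ensemble models and score them on held-out data
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    # confusion_matrix(y_test, y_pred) and classification_report(...)
    # supply the per-class breakdowns discussed in the results
```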
The performance of the three ensemble models, Random Forest, AdaBoost, and Gradient Boosting, was evaluated using accuracy, confusion matrices, and classification reports. These results highlight the strengths and limitations of each model in predicting precipitation.
Random Forest
Accuracy: 86.55%
The model correctly classified 93% of No Precipitation cases (Class 0) and 80% of Precipitation cases (Class 1). However, it misclassified 1,210 instances of Class 0 as Class 1 and 3,376 instances of Class 1 as Class 0.
Classification Report:
• Precision: 82% for Class 0 and 92% for Class 1, indicating that the model's predictions of precipitation are correct more often than its predictions of no precipitation.
• Recall: 93% for Class 0 and 80% for Class 1, showing the model’s effectiveness in detecting non-precipitation cases but slightly underperforming for precipitation cases.
• The overall F1-score is balanced, with a weighted average of 86%.
AdaBoost Results
Accuracy: 86.44%
Like Random Forest, AdaBoost achieved 93% recall for No Precipitation cases and 80% recall for Precipitation. However, it misclassified slightly more cases than Random Forest.
Classification Report:
• Precision: Comparable to Random Forest, with 82% for Class 0 and 92% for Class 1.
• Recall: 93% for Class 0 and 80% for Class 1, indicating comparable performance to Random Forest.
• The weighted average F1-score is 86%, demonstrating consistent performance with Random Forest.
Gradient Boosting Results
Accuracy: 87.05%
Gradient Boosting showed improved overall accuracy compared to the other models. It correctly classified 92% of No Precipitation cases and 82% of Precipitation cases. The number of misclassifications decreased slightly compared to the other two models.
Classification Report:
• Precision: 83% for Class 0 and 91% for Class 1, indicating a strong ability to correctly identify precipitation cases.
• Recall: 92% for Class 0 and 82% for Class 1, showing improved performance in detecting precipitation cases compared to the other models.
• The weighted average F1-score is the highest among the three models, at 87%.
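As a worked check of the arithmetic, the per-class F1-scores follow from the precision and recall figures via F1 = 2PR/(P + R); both land near 87%, consistent with the weighted average (the exact weighted figure also depends on class supports, which are not restated here).

```python
# Per-class F1 for Gradient Boosting from the quoted precision/recall
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

f1_class0 = f1(0.83, 0.92)   # Class 0 (No Precipitation)
f1_class1 = f1(0.91, 0.82)   # Class 1 (Precipitation)
```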
Comparative Analysis
The performance of the three ensemble learning models—Random Forest, AdaBoost, and Gradient Boosting—was analyzed based on accuracy, precision, recall, and F1-scores. Below is a detailed comparative analysis of the results:
1. Accuracy
• Gradient Boosting achieved the highest accuracy of 87.05%, indicating its ability to capture subtle patterns in the data more effectively than the other models.
• Random Forest followed closely with an accuracy of 86.55%, showcasing its strength in handling complex datasets with relatively low overfitting.
• AdaBoost had a similar accuracy of 86.44%, slightly lower than Random Forest, highlighting its potential in focusing on misclassified instances but struggling marginally with dataset complexity.
2. Precision
• All three models demonstrated comparable precision for both classes.
• For No Precipitation (Class 0), Gradient Boosting led with 83%, followed by Random Forest and AdaBoost at 82%.
• For Precipitation (Class 1), Random Forest and AdaBoost showed slightly higher precision (92%) than Gradient Boosting (91%).
• Precision scores indicate that the models were highly effective in minimizing false positives.
3. Recall
• Recall values differed more noticeably across the models:
• Gradient Boosting achieved the best recall for Precipitation (Class 1) at 82%, outperforming Random Forest and AdaBoost, both of which achieved 80%.
• For No Precipitation (Class 0), Random Forest and AdaBoost achieved the highest recall of 93%, slightly better than Gradient Boosting at 92%.
• Higher recall for Class 1 in Gradient Boosting suggests better identification of precipitation cases.
4. F1-Score
• Gradient Boosting outperformed with the highest weighted average F1-score of 87%, indicating its ability to balance precision and recall effectively across both classes.
• Random Forest and AdaBoost both achieved weighted F1-scores of 86%, reflecting strong but slightly less balanced performance compared to Gradient Boosting.
5. Confusion Matrices
• Gradient Boosting had the lowest number of misclassified instances for both classes, with 1,291 false positives and 3,125 false negatives.
• Random Forest and AdaBoost showed slightly higher misclassifications, particularly in predicting Precipitation cases (false negatives: 3,376 for Random Forest and 3,370 for AdaBoost).
Key Observations:
1. Random Forest: Achieved good overall performance, particularly in identifying No Precipitation cases. However, it struggled slightly with recall for Precipitation.
2. AdaBoost: Delivered performance similar to Random Forest but showed slightly lower accuracy and precision for No Precipitation cases.
3. Gradient Boosting: Outperformed the other models in overall accuracy and F1-score. It showed better balance in precision and recall, particularly for Precipitation.
Conclusion
The study demonstrated the effectiveness of ensemble learning models (Random Forest, AdaBoost, and Gradient Boosting) in classifying precipitation from historical weather data. Each model successfully leveraged the power of ensemble techniques to deliver robust predictions despite the inherent class imbalance in the dataset. Gradient Boosting emerged as the best-performing model, achieving the highest accuracy (87.05%) and balanced performance across precision, recall, and F1-scores for both classes. Random Forest and AdaBoost also showed competitive performance, with accuracy scores of 86.55% and 86.44%, respectively, but had slightly higher misclassification rates for precipitation cases.
The results highlight the importance of preprocessing steps, such as handling missing values, scaling features, and balancing the dataset, in improving model performance. Noise addition and feature reduction further enhanced the models’ generalizability and reduced overfitting. Overall, the study demonstrated the potential of ensemble learning models in tackling real-world classification tasks, particularly in weather prediction scenarios, where balanced performance across classes is critical for decision-making.