"ReneWind" is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors.
Our objective is to build various classification models, tune them and find the best one that will help identify failures so that the generator could be repaired before failing/breaking and the overall maintenance cost of the generators can be brought down.
Skills Covered:
Up and downsampling
Regularization
Hyperparameter tuning
Tools Used:
Python: Jupyter Notebook
Libraries: NumPy, pandas, Matplotlib, seaborn, scikit-learn, XGBoost.
Renewable energy is becoming a bigger player in how the world gets its energy, especially as we try to be less harmful to the environment.
Wind power is one of the most advanced renewable energy sources.
To keep wind turbines running reliably and cheaply, companies like ReneWind use sensor data to predict when components might fail (predictive maintenance).
This way, problems can be fixed before they become serious.
The sensors collect data on conditions such as the weather and how different parts of the turbine are performing.
The goal is to develop an accurate failure prediction system for generators using various models. This will minimize maintenance costs by:
Enabling proactive repairs (cheaper) instead of reactive replacements (expensive).
Balancing the cost of unnecessary inspections (from false alarms) with the benefit of catching real failures.
The predictions made by the classification model translate into costs as follows:
True positives (TP) are failures correctly predicted by the model. These result in repair costs.
False negatives (FN) are real failures that the model fails to detect. These result in replacement costs, which are higher than repair costs.
False positives (FP) are predicted failures where no failure actually occurs. These result in inspection costs.
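To make this cost trade-off concrete, here is a minimal sketch of how the confusion-matrix counts map to a total maintenance cost. The per-event cost values are hypothetical placeholders for illustration, not figures from the project:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-event costs (placeholders, not actual ReneWind figures);
# a missed failure (FN) forces a replacement, the costliest outcome.
REPAIR_COST = 1_000       # fixing a correctly predicted failure (TP)
REPLACEMENT_COST = 4_000  # replacing a generator after a missed failure (FN)
INSPECTION_COST = 100     # inspecting a false alarm (FP)

def total_maintenance_cost(y_true, y_pred):
    """Translate confusion-matrix counts into a total maintenance cost."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * REPAIR_COST + fn * REPLACEMENT_COST + fp * INSPECTION_COST

# Toy usage: 1 = failure, 0 = no failure
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])  # one FN, one FP
print(total_maintenance_cost(y_true, y_pred))  # 2*1000 + 1*4000 + 1*100 = 6100
```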
The distributions of all the variables are similar (V2 is shown as a representative example).
All the variables are close to normally distributed.
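A minimal sketch of how such a distribution check can be done with seaborn, using V2 as the representative column; the file name "Train.csv" is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("Train.csv")  # assumed file name

# Histogram with a KDE overlay to eyeball normality of one feature
sns.histplot(data=train, x="V2", kde=True)
plt.title("Distribution of V2")
plt.show()
```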
● The data shared is a ciphered version containing 20,000 observations in the train set and 5,000 in the test set.
● There are 40 features, but because the data is ciphered, the column names are anonymized.
● There were a few missing values in V1 and V2, which we imputed with the median. To avoid data leakage, imputation was performed after splitting the training data into train and validation sets (see the first sketch after this list).
● 94.5% of the observations are negative and only 5.5% are positive, representing failures. The dataset is highly imbalanced, so we tried undersampling and oversampling techniques to balance the training data (see the second sketch after this list).
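A minimal sketch of the leakage-safe imputation described above: the medians are learned from the training split only and then reused on the validation split. It assumes X and y hold the features and failure labels; the split fraction is illustrative:

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split first, so validation rows never influence the imputation statistics
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)  # medians computed on train only
X_val = imputer.transform(X_val)          # same medians reused, no leakage
```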
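For the balancing step, a common approach (and a plausible reading of what was done here, though the exact techniques are an assumption) is SMOTE for oversampling and random undersampling, both from the imbalanced-learn library:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (failure) class up to the majority class size
sm = SMOTE(random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# Alternatively, undersample the majority class down to the minority size
rus = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

# Resampling is applied to the training split only; the validation set keeps
# its original 94.5% / 5.5% class distribution.
```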
Which metric to optimize?
We need a metric that ensures the maximum number of generator failures are predicted correctly by the model.
We want to maximize Recall: the greater the Recall, the fewer the false negatives.
We want to minimize false negatives because if the model predicts no failure when a failure is actually coming, the generator must be replaced, which drives up maintenance cost.
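Since Recall = TP / (TP + FN), maximizing it directly minimizes missed failures. A minimal sketch of scoring a candidate model by Recall in cross-validation; the model choice and the use of the oversampled split are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=1)
# scoring="recall" evaluates recall of the positive (failure) class per fold
scores = cross_val_score(model, X_train_over, y_train_over,
                         cv=5, scoring="recall")
print(scores.mean())
```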
Recall is highest for XGBoost, followed by the Bagging Classifier, Random Forest, and the Gradient Boosting Classifier.
Recall is highest for Random Forest, Decision Tree, and the Bagging Classifier, but all of them are overfitting.
Oversampling the data improved performance considerably; now let's see how the models perform with undersampled data.
Recall is highest for Random Forest, followed by XGBoost and Gradient Boosting.
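A minimal sketch of the comparison loop implied above, cross-validating several candidate models on the resampled training data; the candidate list and settings are assumptions:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "XGBoost": XGBClassifier(random_state=1),
}

# Compare cross-validated Recall on, e.g., the oversampled training data
for name, model in models.items():
    scores = cross_val_score(model, X_train_over, y_train_over,
                             cv=5, scoring="recall")
    print(f"{name}: mean CV recall = {scores.mean():.3f}")
```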
After reviewing the performance of all the models, let's decide which ones can improve further with hyperparameter tuning.
The following four models have the highest Recall, and we will tune all of them (a tuning sketch follows this list):
Random Forest with undersampled data
Gradient Boosting with oversampled data
AdaBoost with oversampled data
XGBoost with oversampled data
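A minimal tuning sketch for one of these candidates, AdaBoost on the oversampled data, using RandomizedSearchCV with Recall as the objective; the search grid is illustrative, not the project's exact grid:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space (an assumption, not the project's exact grid)
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10,
    scoring="recall",  # tune for the metric we chose to optimize
    cv=5,
    random_state=1,
)
search.fit(X_train_over, y_train_over)
print(search.best_params_, search.best_score_)
```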
Model Performance Summary - Tuned models
The AdaBoost model trained on oversampled data has generalised well: its validation performance stays close to its training performance.
So, let's consider it the best model.
V30, V9, and V18 are the most important features.
They can be deciphered to identify the actual variables and understand their impact on the prediction task.
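A minimal sketch of how these importances can be extracted from the tuned model; `search` is the fitted search from the tuning sketch above, and `feature_names` (the ciphered V-columns) is an assumed variable:

```python
import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ is exposed by the fitted AdaBoost model
importances = pd.Series(
    search.best_estimator_.feature_importances_,
    index=feature_names,  # assumed list of ciphered column names (V1, V2, ...)
).sort_values(ascending=False)

importances.head(10).plot(kind="barh")
plt.title("Top 10 feature importances")
plt.gca().invert_yaxis()  # most important feature on top
plt.show()
```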
We build the pipeline with the following components (a sketch follows this list):
○ SimpleImputer for imputation
○ the tuned AdaBoost model, trained on oversampled data
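A minimal sketch of that pipeline; the hyperparameter values are placeholders standing in for the tuned ones, and X_test/y_test are assumed to hold the untouched test set:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # median imputation
    ("adaboost", AdaBoostClassifier(                # placeholder hyperparameters
        n_estimators=100, learning_rate=0.5, random_state=1)),
])

# Fit on the oversampled training data, then score Recall on the test set
pipeline.fit(X_train_over, y_train_over)
print(recall_score(y_test, pipeline.predict(X_test)))
```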
The AdaBoost pipeline performed well on the test data.
The AdaBoost Classifier tuned on oversampled data has the best performance.
V30, V9, and V18 are the most important features; once deciphered, the underlying variables can be analyzed to understand their impact on the prediction task.
This model can now be used to predict whether a wind turbine generator will fail, helping to reduce maintenance costs.
We also observed that some observations fall close to the classification threshold (0.5 by default); these borderline cases can be studied further by an engineer before a final call is made (see the sketch below).
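A minimal sketch of how those borderline cases can be flagged for engineer review; the ±0.1 band around the default 0.5 threshold is an arbitrary illustrative choice:

```python
import numpy as np

# Probability of the positive (failure) class for each test observation
proba = pipeline.predict_proba(X_test)[:, 1]

# Flag observations whose score falls near the 0.5 decision threshold
borderline = np.where(np.abs(proba - 0.5) < 0.1)[0]
print(f"{len(borderline)} observations flagged for manual review")
```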