Electric Pole Data Analytics Model
Phase 1: Problem Statement
Electric poles play an important role in delivering energy to customers, and with such a large population of poles across the states, it is getting difficult for utility businesses to predict pole hits and to track their critical poles. This project focuses on these problems.
The main problems are:
Utility businesses do not have an automated way to predict their electric pole hits
They do not know which electric poles are more critical and which are not
Aim of the Project:
To predict the probability of a pole being hit and to identify which poles are most critical
GitHub: Vinay Bollapu
How does this model help?
Energy providers spend millions of dollars on their electricity services to track customer impact during outages, to provide better customer service, and to repair outages as soon as possible. Recent years have seen major power outages such as the Texas incident. In February 2021, 290,000 Dominion Energy customers lost power due to adverse weather conditions [1].
If a model can predict the probability of a pole being hit, the energy provider can focus on the critical poles more often and take preventive action to reduce pole damage, which in turn reduces service cost and negative customer impact. This also eventually shortens the mitigation process.
Phase 2: Data Gathering and Model Data Preparation
Before diving into the model, we first need to understand where the data comes from and where it is stored. In any data-driven organization there are two types of data storage: transactional data and analytical data. In a nutshell, transactional stores are where you put data in, and analytical stores are where you get data out. Models are usually built on top of the analytical data warehouse.
Transactional databases focus on daily events and do not care about historical data or aggregation queries, whereas analytical data warehouses store historical data and are designed for analytical queries such as aggregations.
For this project the data is persisted in different formats and in different databases; I had multiple transactional and analytical data warehouses. The challenge was that none of these databases stored any information about pole hits. So, I used various Python ETL processes to extract the data and its features so that I could transform them and feed them into a machine learning model.
Various methods I used to extract the data:
Relational SQL joins across OLTP and OLAP databases
Text mining with spaCy and regular expressions, analyzing operator comments
Tracking customer orders about a pole and labelling them as hits using Oracle SQL
spaCy Text Mining:
This was one of the challenges I faced while developing the data set for the model. Since no pole hits were stored anywhere, I had to find sources of pole-hit information, and one of these was operator comments, which describe pole-hit events. A sample comment from my databases is:
"3153 4766 Aldgate Gn ,MD 21035, car broke pole 123456 40 class 4. Needs to be held while tow truck pulls car out, once car is removed may be able to be braces.... renewed same"
I cannot simply use regular expressions to extract the pole numbers because they cannot tell whether a given number is a pole number or not. So, I had to get the dependencies between the words in the sentence. For this purpose I used spaCy's dependency parse, visualized with displaCy. The parse output is stored in a Python variable from which the pole numbers are extracted.
[Figure: sample output of the displaCy dependency visualizer for a sentence.]
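To make this concrete, here is a minimal sketch of the idea, assuming the standard en_core_web_sm model. The matching rule (keep a number whose syntactic head is the word "pole") is an illustrative assumption, not the project's exact logic.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    comment = ("3153 4766 Aldgate Gn, MD 21035, car broke pole 123456 "
               "40 class 4. Needs to be held while tow truck pulls car out.")

    doc = nlp(comment)
    # A bare regex would match every number here (street number, ZIP code,
    # pole class). The dependency parse keeps only numbers attached to "pole".
    pole_numbers = [t.text for t in doc
                    if t.like_num and t.head.lower_ == "pole"]
    print(pole_numbers)  # expected: ['123456']

    # displaCy renders the same parse for visual inspection:
    # from spacy import displacy
    # displacy.render(doc, style="dep")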
Pole Criticality:
Every pole has devices attached to it. Some poles have transformers (which serve more people) and some have only a bus break (which just carries circuit wires). Based on these devices, I developed a formula that ranks each pole on a scale of 0-5 (5 being the most critical).
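The exact formula is in the implementation document referenced below; the following is only a hypothetical sketch of the idea, with made-up device weights.

    # Hypothetical device weights; the project's real formula and values
    # live in the referenced implementation document.
    DEVICE_WEIGHTS = {"transformer": 3, "switch": 2, "bus_break": 1}

    def criticality(devices):
        """Rank a pole 0-5 (5 = most critical) from its attached devices."""
        return min(sum(DEVICE_WEIGHTS.get(d, 0) for d in devices), 5)

    print(criticality([]))                         # 0: bare pole
    print(criticality(["transformer", "switch"]))  # 5: heavily loaded pole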
Why is Criticality Important?
Every pole is capable of taking hits; that is how they are designed. But some poles carry devices that may be damaged by a hit. This algorithm is used to narrow the search down further.
Instance: Imagine two poles, A and B. Pole A has no devices on it (criticality 0) and pole B has multiple devices on it (criticality 5). If both poles are predicted to have a 70% chance of being hit, the organization should focus on the pole with criticality 5 and try to reduce the impact as soon as possible.
Pole criticality goes beyond the machine learning predictions and adds business value to the project. It is important for any project to generate revenue, so this algorithm focuses on generating business value from the model's predicted results.
Refer to this document to see the pole criticality implementation: Document
What does my data look like?
My data contains 360 thousand unique poles; if you imagine this as a CSV file, it contains 360k rows. I collected data from the past five years, 2015-2020, and I will test my model on 2021 data.
This is a classification problem with two classes: pole hits (class 1) and pole no-hits (class 0). For any classification data set, it is important to check the ratio between these classes. The class-distribution figure makes it clear that my data is heavily imbalanced. In real-world examples this is almost expected, because you cannot imagine a utility business with a 98% hit ratio.
Data Set Numbers:
After cleaning, the data set was reduced to 246,481 rows.
Training Data:
My training data is the collection of pole hits from 2015-2020. The data set consists of 246,481 unique poles, with 3,819 pole hits and 242,662 pole no-hits.
Testing Data:
My testing data is the collection of poles from the year 2021. The data set contains 246,481 unique poles, with 219 pole hits and 246,262 pole no-hits.
Note: These stats are after data cleaning, so the math might not match the class-distribution figure.
Model Inputs and Targets:
To predict pole hits, the most important features are:
Road Distance (feet)
Distance to nearest bar/tavern (feet)
Speed limit (mph)
Road classification
Slope of the road (meters)
Demographic Information (population, income, and so on)
My target column is HIT_FLAG, which is either 0 or 1, where 1 indicates a pole hit.
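As a sketch, the inputs and target might be assembled as below; the file name and column names are hypothetical stand-ins for the features listed above.

    import pandas as pd

    df = pd.read_csv("pole_data.csv")  # hypothetical file name

    feature_cols = [
        "ROAD_DISTANCE_FT",   # road distance (feet)
        "BAR_DISTANCE_FT",    # distance to nearest bar/tavern (feet)
        "SPEED_LIMIT_MPH",    # speed limit (mph)
        "ROAD_CLASS",         # road classification
        "ROAD_SLOPE_M",       # slope of the road (meters)
        "POPULATION",         # demographic information
        "MEDIAN_INCOME",
    ]
    X = df[feature_cols]
    y = df["HIT_FLAG"]  # 1 = pole hit, 0 = no hit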
Imbalanced Data:
Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations between classes. Machine learning models do not work well with this kind of data: the model becomes biased toward the majority class and fails to predict the minority class correctly. We also cannot simply rely on the model's accuracy, as it is not the right metric in this situation.
Data scientists usually use under- and over-sampling techniques to solve this problem. The ideal approach is to oversample the training data, fit the model, and then use the model on untouched test data to get the predictions.
Based on my research, the ideal way to solve the imbalance problem is:
Reduce unnecessary data: One should decide whether every data point needs to be modelled. In my project, many poles are in locations where the probability of being hit is almost zero, so I do not need to model or predict these poles and simply dropped them. It is important to verify that dropping these poles is the right decision; I cross-checked them against pole hits from previous years.
Under-sampling technique: Resampling methods change the composition of a training data set for an imbalanced classification task. Most of the attention goes to oversampling the minority class; nevertheless, a suite of techniques has been developed for under-sampling the majority class that can be used in conjunction with effective oversampling methods. NearMiss refers to a collection of under-sampling methods that select examples based on the distance of majority-class examples to minority-class examples (see the sketch after this list).
SMOTE over-sampling technique: A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. One way to solve this is to oversample the minority class, which can be done by simply duplicating minority examples in the training data set prior to fitting a model; this balances the class distribution but gives the model no additional information. SMOTE instead selects examples that are close in the feature space, draws a line between them, and creates a new synthetic sample at a point along that line (see the sketch after this list).
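Below is a minimal sketch of both techniques using the imbalanced-learn library, assuming the X and y from the inputs sketch with numerically encoded features. The random split here stands in for the project's year-based 2015-2020 / 2021 split; resampling is applied to the training data only, never to the test set.

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import NearMiss
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    # NearMiss keeps the majority-class examples closest to the minority class.
    X_nm, y_nm = NearMiss(version=1).fit_resample(X_train, y_train)

    # SMOTE synthesizes new minority examples between near neighbors.
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)

    print(Counter(y_train), Counter(y_nm), Counter(y_sm))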
The confusion matrices below show how the model performed with the different sampling techniques; these results are on the test data set. It is clearly evident how the model's performance improved from the raw data to the SMOTE-oversampled data. As mentioned in the data set facts above, my test set had only 219 hits, and my model predicted 134 of them correctly. The overall distribution across the two classes is a decent score. Future improvements could include class-weight parameters or XGBoost.
Model Performance on Raw Data
Model Performance on NearMiss Data
Model Performance on SMOTE Data
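The document does not name the classifier, so the sketch below assumes a random forest trained on the SMOTE-resampled data from the previous sketch, to show how such a confusion matrix is produced.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix

    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_sm, y_sm)           # fit on SMOTE-resampled training data
    y_pred = clf.predict(X_test)  # evaluate on the untouched, imbalanced test set

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=3))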
Model Visualizations:
Since the model outputs probabilities of pole hits and no-hits, I explore the poles with the highest predicted probabilities.
How will these help?
Now that I have these graphs, I can narrow the search down to the poles with a predicted hit probability greater than 60% and combine those with criticality. This narrows the search toward the most critical poles.
*Please refer to the phase 3 slide
The model predicted 3,781 poles with above a 60% chance of being hit, based on the previous five years of data
Class 1 has an average predicted probability of 54.66%
Combining the criticality score with the model predictions creates the business value for this project. Imagine a pole with a 90% hit probability and a criticality score of 5: this pole is really important, so the model narrows down the approach and gives the organization a clear-cut idea of which poles to focus on.
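As a sketch, combining the two scores could look like the following; clf is the fitted classifier from the evaluation sketch, and criticality_scores is a hypothetical series produced by the criticality formula.

    import pandas as pd

    ranking = pd.DataFrame({
        "HIT_PROB": clf.predict_proba(X_test)[:, 1],  # probability of class 1
        "CRITICALITY": criticality_scores,            # hypothetical, 0-5 per pole
    }, index=X_test.index)

    # Keep poles above the 60% threshold, most critical and most likely first.
    priority = (ranking[ranking["HIT_PROB"] > 0.60]
                .sort_values(["CRITICALITY", "HIT_PROB"], ascending=False))
    print(priority.head(10))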
References
[1] Rivers, Megan. "Could the DMV See Power Outages like in Texas with Thursday's Storm?" wusa9.com, 17 Feb. 2021.
[2] "U.S. Energy System Factsheet." Center for Sustainable Systems.
[3] "State Energy Profile Data." U.S. Energy Information Administration (EIA).