Exploratory Data Analysis
To analyze supply and demand across cities, we compute the ratio of EV count (demand) to charging station count (supply) for each city.
We observe that a few cities have a very high demand-to-supply ratio, which indicates that these cities might need more charging stations.
To isolate these cities, we set a high ratio threshold (200) and selected the cities whose ratio exceeds it.
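The ratio and threshold step can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the city names, the counts, the field names, and the handling of cities with zero stations are all assumptions. Only the threshold of 200 comes from the text.

```python
# Hypothetical per-city counts; the numbers are made up for illustration.
cities = [
    {"city": "A", "ev_count": 12000, "station_count": 40},
    {"city": "B", "ev_count": 900, "station_count": 15},
    {"city": "C", "ev_count": 5000, "station_count": 10},
]

RATIO_THRESHOLD = 200  # threshold stated in the text


def demand_supply_ratio(ev_count, station_count):
    """EV count (demand) divided by charging station count (supply)."""
    if station_count == 0:
        # Assumption: a city with EVs but no stations has an unbounded ratio.
        return float("inf")
    return ev_count / station_count


def high_ratio_cities(rows, threshold=RATIO_THRESHOLD):
    """Return (city, ratio) pairs whose ratio exceeds the threshold."""
    flagged = []
    for row in rows:
        ratio = demand_supply_ratio(row["ev_count"], row["station_count"])
        if ratio > threshold:
            flagged.append((row["city"], ratio))
    return flagged


flagged = high_ratio_cities(cities)
print(flagged)
```

With these made-up counts, cities A and C exceed the threshold and city B does not.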
Statistical Analysis
We investigated whether there is a positive correlation between EV count and charging station count.
Spearman's rank correlation coefficient: 0.7135930417983278
Pearson correlation coefficient: 0.9327692584207435
P-value: 8.526123104138706e-85
Both coefficients are positive and closer to 1 than to 0, and the p-value is very small, so the EV population count (demand) is strongly and positively associated with the charging station count (supply).
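For readers who want to see what the two coefficients measure, here is a small standard-library sketch. The data is made up and the function names are hypothetical; the analysis itself most likely used a statistics library. Pearson measures linear association on the raw values, while Spearman is Pearson applied to the ranks, so it only looks at ordering.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: four small cities and one very large one.
ev_count = [10, 20, 30, 40, 1000]
stations = [4, 3, 2, 1, 100]

pearson_r = pearson(ev_count, stations)
spearman_r = spearman(ev_count, stations)
print(pearson_r, spearman_r)
```

In this toy data a single very large city dominates the Pearson coefficient, while the rank-based Spearman coefficient is much lower. A similar effect is one possible reason for a Pearson value above the Spearman value, though the source does not say whether that applies here.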
Types of Models Used
1. Linear Regression Model
The linear regression model captures linear relationships between the dependent variable y (charging station count) and the independent variables X (EV count, education, and income). It estimates one coefficient per independent variable, plus an intercept, by minimizing the sum of squared differences between the actual and predicted values of the dependent variable. It is a good starting point for regression tasks because each coefficient shows directly how the corresponding independent variable affects the prediction.
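As a minimal sketch of the least-squares idea, the example below fits a line with a single independent variable (EV count) using the closed-form solution. The data is made up, and restricting the fit to one variable is a simplification for illustration; the model described above uses three.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 0.02 * x + 1.
ev_count = [100, 200, 300, 400]
stations = [3, 5, 7, 9]

slope, intercept = fit_simple_linear(ev_count, stations)
predicted = slope * 250 + intercept  # prediction for a city with 250 EVs
print(slope, intercept, predicted)
```

The slope is read as the expected change in station count per additional EV, which is the kind of direct interpretation the paragraph above refers to.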
2. Random Forest Regressor Model
The Random Forest Regressor is a nonlinear regression model that can capture complex relationships between the dependent and independent variables. It is an ensemble learning method: it builds multiple decision trees, each on a random sample of the training data, and averages their predictions to improve accuracy and reduce overfitting. It is particularly effective for regression tasks where the relationship might be highly nonlinear.
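The ensemble idea can be sketched with a toy forest of one-split trees ("stumps"), each trained on a bootstrap sample and averaged at prediction time. This is a simplified illustration under several assumptions: the data is made up, there is a single feature, the trees have depth 1, and the per-split feature subsampling that real random forests use is omitted.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values equal: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Made-up data: station counts grow with EV counts.
ev_count = [100, 200, 300, 400, 5000, 6000, 7000, 8000]
stations = [2, 3, 2, 3, 40, 45, 50, 55]

forest = fit_forest(ev_count, stations)
low = predict_forest(forest, 150)
high = predict_forest(forest, 7500)
print(low, high)
```

Averaging over many trees trained on different samples is what smooths out the variance of any single tree.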
3. Gradient Boosting Regressor Model
Gradient Boosting Regressor is an ensemble machine learning algorithm used for regression tasks: it combines the predictions of multiple models to produce a more accurate prediction than any single model. It works by first fitting an initial model (such as a simple decision tree) to the data. It then fits additional models to the residuals (the differences between the observed and predicted values) from the previous step. In each step, it tries to reduce a loss function, such as mean squared error. The key advantage of Gradient Boosting Regressor is its ability to combine many weak predictive models into a strong one, which makes it a good choice for complex relationships. In our case, we use 100 estimators, each a decision tree. Each tree corrects the previous prediction, with its contribution scaled by a learning rate of 0.1.
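The residual-fitting loop can be sketched as below, using the 100 estimators and learning rate of 0.1 mentioned in the text. Everything else is an assumption made for illustration: the data is made up, there is a single feature, each estimator is a one-split tree, and the initial model is simply the mean of the targets.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values equal: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_estimators=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each stump contributes only a fraction (the learning rate) of its fit.
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up data: station counts grow with EV counts.
ev_count = [100, 200, 300, 400, 5000, 6000, 7000, 8000]
stations = [2, 3, 2, 3, 40, 45, 50, 55]

model = fit_gradient_boosting(ev_count, stations)
baseline = sum(stations) / len(stations)
baseline_mse = mean_squared_error(stations, [baseline] * len(stations))
boosted_mse = mean_squared_error(
    stations, [predict_boosted(model, x) for x in ev_count]
)
print(baseline_mse, boosted_mse)
```

The small learning rate means no single tree dominates, so the training error shrinks gradually across the 100 rounds.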
4. Support Vector Regressor Model
Support Vector Regression (SVR) is an extension of Support Vector Machines (SVM), a method primarily used for classification. Whereas an SVM classifier predicts categorical outcomes, SVR predicts continuous outputs. SVR is particularly useful when the relationship between the independent and dependent variables is not well understood or is highly complex. It can model nonlinear relationships effectively by mapping the inputs into a high-dimensional feature space, where it fits a hyperplane. The goal is to keep as many data points as possible within a margin of width epsilon around the hyperplane while keeping the hyperplane as flat as possible. Errors smaller than epsilon are ignored and larger errors are penalized linearly, which makes SVR relatively robust to outliers. The kernel trick lets it work in the high-dimensional space without computing the feature mapping explicitly.
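Two building blocks of SVR can be shown in a few lines: the epsilon-insensitive loss, which ignores errors inside the margin, and a kernel function, which measures similarity as if the points had been mapped into a high-dimensional space. This is only a sketch of those two pieces, not a full SVR solver. The RBF kernel and the values of epsilon and gamma are assumptions, since the source does not state which kernel or settings were used.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Zero inside the epsilon margin, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def rbf_kernel(a, b, gamma=0.1):
    """Similarity between two points; equals 1 when they coincide."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


small_error = epsilon_insensitive_loss(10.0, 10.25)  # inside the margin
large_error = epsilon_insensitive_loss(10.0, 13.0)   # outside the margin
same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far_points = rbf_kernel([1.0, 2.0], [9.0, 8.0])
print(small_error, large_error, same_point, far_points)
```

Because the penalty grows linearly rather than quadratically, a single extreme point pulls the fit less than it would under squared error.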
Regression of Charging Stations
We took charging station count as the dependent variable y, and education, income, and EV count as the independent variables X. We used 80% of the dataset for training and the remaining 20% for testing. The four models described above were fit to the training data.
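An 80/20 split of this kind can be sketched with the standard library as follows. The function name, the fixed seed, and the choice to shuffle before splitting are assumptions; the source does not say how the split was drawn.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-ins for ten city records
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so all four models are compared on the same held-out cities.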
We plotted actual against predicted values as a scatter plot for each model. Nonlinear regression models such as Gradient Boosting Regressor and Random Forest Regressor show good results.
From the results, we observe that Gradient Boosting Regressor and Random Forest Regressor perform better than the Linear Regression and Support Vector Regressor models.
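A comparison like this is usually backed by error metrics computed on the test set. The source does not state which metrics were used, so the sketch below assumes two common ones, RMSE and R squared, and applies them to made-up predictions from two hypothetical models.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """Share of variance explained: 1 is perfect, 0 matches the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up test-set values and predictions from two hypothetical models.
actual = [10, 20, 30, 40]
model_a = [12, 18, 33, 39]   # tracks the actual values closely
model_b = [25, 25, 25, 25]   # always predicts the mean

rmse_a, rmse_b = rmse(actual, model_a), rmse(actual, model_b)
r2_a, r2_b = r_squared(actual, model_a), r_squared(actual, model_b)
print(rmse_a, rmse_b, r2_a, r2_b)
```

A model that always predicts the mean scores an R squared of 0, which gives a useful floor when judging the four regressors.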
We observe that EV count is the most important variable, while the other two variables, median income and education, have minimal impact on the ability to predict the number of charging stations. Of the two algorithms, gradient boosting gives slightly more weight to income and education. In contrast, random forest gives education and income little to no weight and works best when it has only one independent variable, the EV count.