Electric Vehicles: Predicting Charging Demand
by
GitHub repository for this capstone project: https://github.com/umbc-data606-summer-2021/EV-Analysis-TylerE
The electric vehicle (EV) market is gaining traction as a viable alternative to internal combustion engine (ICE) vehicles. Approximately 60% of carbon pollution from transportation is due to personal ICE vehicles, and although the development and usage of EVs is not pollution-free, it offers a means to decrease the level of pollution that is created through transportation (Featherman, 2021). Barriers to widespread EV adoption include an increased electricity demand for charging, higher price points for the vehicles, insufficient charging infrastructure, and the inconvenience of charging rather than filling up a gas tank (Tarei, 2021). These hurdles currently impact consumers, manufacturers, and utility providers.
Consumers are hesitant to purchase EVs because there is a lack of familiarity, and the idea of not being able to go from an empty battery to a full charge in a matter of minutes is intimidating. Manufacturers have to accurately gauge public interest to ensure that producing a new line of EVs will be profitable, because without enough public interest, EV development poses more financial risk than ICE vehicle development. Additionally, EV charging stations are being installed on outdated power grids that were not designed to handle the demand that is generated by EV charging. Utilities have to consider the safest and most economical approach to modernizing their electrical grids so they can support EV charging demand.
The dataset for this project can be used to predict energy consumption as a result of EV charging. The ability to recognize how much electricity demand EVs are responsible for will allow utilities and property owners to plan for the electrical infrastructure that they are responsible for. With EV ownership growing in popularity, this is a critical issue that needs to be addressed before EVs outnumber ICE vehicles. Demand caused by EVs is represented by how many kilowatt-hours (kWh) are provided during a charging session, so the focus of this project is on predicting kWh values.
I am using four datasets from a database that I am in the process of building. Each dataset represents a table from a SQL database, with primary keys and foreign keys that can be used to link the tables together or, in this case, to merge the datasets in Python.
Station Dataset (0.9 MB) - electric vehicle supply equipment (EVSE)
id - primary key, an identifier that represents a unique station
city
state
local_time_zone
country
no_ports - the # of ports available at a station
charge_level - rate of charging (Level 1/2/3, DCFC)
venue - description of where the station is installed (Hospital, Retail, Residential, etc.)
access_type - Public, Private or Limited
rucc_id - foreign key to rucc table
Port Dataset (0.5 MB) - individual charging ports (EV charging stations may have 1 or 2 ports)
id - primary key, unique for every port
station_id - foreign key to station table column 'id'
port_number - the port number from the station that the port is attached to (Typically 1 or 2)
charge_level - level of charging that the port provides
connector - the type of hardware that connects the station to the vehicle
power_kw - rate at which electricity is transferred (kW)
Session Dataset (226 MB) - individual charging sessions
id - primary key, an identifier that represents a unique charging session
station_id - foreign key to the station table, represents the station where the session is being completed
port_id - foreign key to port table, represents which unique port is being used
start_datetime - session start date and time
start_time_zone
end_datetime - session charge end date and time
end_time_zone
total_duration - the amount of time between start_datetime and end_datetime (in hours)
charge_duration - the amount of time that energy was provided from the station to the vehicle (in hours)
energy_kwh - amount of energy provided (in kWh)
charge_level - records which level of charging is being used for that session (Level 1, 2, DCFC)
fee - records the amount paid for the session
currency
ended_by - what caused the end of the session (Plug Out, Timeout, etc.)
start_soc - % of battery capacity at the start of the session (only provided for stations with DCFC charging)
end_soc - % of battery capacity at the end of the session (only provided for stations with DCFC charging)
RUCC Dataset (1 KB) - public data that classifies locations by how rural or urban they are
https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/
id - 1 is most densely populated area. 9 is least densely populated area.
population - population size
area - description of population density
1 - Counties in metro areas of 1 million population or more
2 - Counties in metro areas of 250,000 to 1 million population
3 - Counties in metro areas of fewer than 250,000 population
4 - Urban population of 20,000 or more, adjacent to a metro area
5 - Urban population of 20,000 or more, not adjacent to a metro area
6 - Urban population of 2,500 to 19,999, adjacent to a metro area
7 - Urban population of 2,500 to 19,999, not adjacent to a metro area
8 - Completely rural or less than 2,500 urban population, adjacent to a metro area
9 - Completely rural or less than 2,500 urban population, not adjacent to a metro area
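The primary and foreign keys listed above can be used to join the four tables in pandas. The following is a minimal sketch, assuming each table has been loaded into a DataFrame with the column names described above. The rows are made-up stand-ins rather than real records, and the renaming of each `id` column is my own choice to avoid name collisions.

```python
import pandas as pd

# Made-up stand-in rows; the real tables come from the SQL database.
station = pd.DataFrame({
    "id": [1, 2],
    "state": ["Maryland", "Colorado"],
    "venue": ["Retail", "Hotel"],
    "rucc_id": [1, 3],
})
port = pd.DataFrame({
    "id": [10, 11, 20],
    "station_id": [1, 1, 2],
    "power_kw": [6.6, 6.6, 50.0],
})
session = pd.DataFrame({
    "id": [100, 101, 102],
    "station_id": [1, 1, 2],
    "port_id": [10, 11, 20],
    "energy_kwh": [7.2, 12.5, 31.0],
})
rucc = pd.DataFrame({
    "id": [1, 3],
    "area": [
        "Counties in metro areas of 1 million population or more",
        "Counties in metro areas of fewer than 250,000 population",
    ],
})

# Rename each primary key so it matches the foreign key that points to it.
station_keyed = station.rename(columns={"id": "station_id"})
port_keyed = port.rename(columns={"id": "port_id"}).drop(columns=["station_id"])
rucc_keyed = rucc.rename(columns={"id": "rucc_id"})

# Left joins keep every session; validate raises if a key is not unique
# on the lookup side, which would silently duplicate sessions.
merged = (
    session
    .merge(station_keyed, on="station_id", how="left", validate="many_to_one")
    .merge(port_keyed, on="port_id", how="left", validate="many_to_one")
    .merge(rucc_keyed, on="rucc_id", how="left", validate="many_to_one")
)
print(merged[["id", "venue", "power_kw", "area"]])
```

Using left joins from the session table keeps the row count equal to the number of sessions, which makes it easy to confirm that no sessions were lost or duplicated during the merge.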
The attributes available in this dataset can be used to address the issues facing utilities and electric infrastructure. This will be done by determining which attributes can be used to predict energy demand.
Can EV charging behavior be modeled based on the following attributes (combined and separately)?
Venue Type
RUCC classification
City
State
Charge Level
Charge start/end datetime
Charge duration
The amount of time that the EV battery is being filled
Session duration
The amount of time that the vehicle is connected to the charging station (may be longer than charge duration)
Which of the features are the most important for predicting the amount of energy that EVs consume during charging sessions?
EV charging behavior is represented by session start time, session end time, and the amount of energy that is provided to the vehicle during the session. Charging behavior varies depending on many things, such as location, time of day, day of the week, and how much charge the battery already has. This project will focus on determining which attributes are the most important when trying to predict the electricity demand that is generated by EVs. Being able to accurately model charging behavior will give utilities a better understanding of how to handle the increased grid demand that is created by EV charging.
Previous research efforts have attempted to predict EV charging behavior by using historical datasets and analyzing charge duration and energy consumption. In an ensemble machine learning-based approach, researchers found that prediction error increased as data entropy rose and data sparsity fell (Chung, 2019). In other words, predictions of charging behavior became less reliable as the data became less sparse. Sparsity is defined as the number of zero entries in a dataset divided by the total number of entries, and high sparsity implies that the data varies less (Chung, 2019). The entropy/sparsity ratio (R) in this paper serves as a guide to which models were more accurate than others. When R was low, support vector regression and random forest regression were the more accurate methods for predicting charge duration and energy consumption (Chung, 2019). A diffusion-based kernel density estimator performed better when R was high. The dataset in that study consisted of only 40,000 charging sessions, split into 70/20/10 subsets for training, validation, and testing.
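The sparsity definition and the 70/20/10 split described above can be illustrated with a short sketch. The sample values are made up, and the random shuffling before the split is my assumption; the text does not say how the subsets were drawn in the cited study.

```python
import numpy as np

def sparsity(values):
    """Fraction of entries that are exactly zero."""
    arr = np.asarray(values)
    return np.count_nonzero(arr == 0) / arr.size

def split_70_20_10(n_rows, seed=0):
    """Return shuffled row indices for training, validation, and testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = n_rows * 7 // 10   # integer arithmetic avoids float rounding
    n_val = n_rows * 2 // 10
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],
    )

# Made-up hourly energy readings: 7 of the 10 entries are zero.
hourly_energy = [0, 0, 3.5, 0, 7.1, 0, 0, 0, 2.2, 0]
sparsity_value = sparsity(hourly_energy)

train_idx, val_idx, test_idx = split_70_20_10(40_000)
print(sparsity_value, len(train_idx), len(val_idx), len(test_idx))
```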
A similar study tried to predict the available energy in a subsequent 24-hour window at any given time for a charging station, with the assumption that there is a function which relates future available energy to previously consumed energy (Majidpour, 2015). This study used a historical average, where the predicted energy in the future is the average of the energy consumed in the past for a given time of day. The researchers also used a k-nearest neighbors (k-NN) approach, where the input is a concatenation of the consumption records from (D) previous days and the output is the predicted energy consumption for the following 24 hours (Majidpour, 2015). A weighted k-NN approach was also developed, in addition to a lazy learning approach, but the researchers determined that the standard k-NN approach was the best performer after determining each model's p-values.
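The two simplest forecasters described above can be sketched as follows. This is an illustration rather than the cited study's implementation: the Euclidean distance, the unweighted mean over neighbors, the function names, and the sample data are all my assumptions.

```python
import numpy as np

def historical_average(history):
    """history has shape (n_days, 24); returns the mean energy per hour of day."""
    return np.asarray(history, dtype=float).mean(axis=0)

def knn_forecast(history, depth, k):
    """Forecast the next 24 hours from the k past windows closest to the last `depth` days."""
    history = np.asarray(history, dtype=float)
    n_days = history.shape[0]
    # Query: the most recent `depth` days, concatenated into one vector.
    query = history[n_days - depth:].ravel()
    candidates = []
    for start in range(n_days - depth):
        window = history[start:start + depth].ravel()
        following_day = history[start + depth]
        distance = float(np.linalg.norm(window - query))
        candidates.append((distance, start, following_day))
    # Sort by distance, breaking ties by the earlier window.
    candidates.sort(key=lambda item: (item[0], item[1]))
    neighbors = [item[2] for item in candidates[:k]]
    return np.mean(neighbors, axis=0)

# Made-up station history: days alternate between 1 kWh and 3 kWh per hour.
low_day = np.full(24, 1.0)
high_day = np.full(24, 3.0)
history = np.array([low_day, high_day, low_day, high_day, low_day, high_day])

avg_forecast = historical_average(history)
knn_result = knn_forecast(history, depth=1, k=2)
print(avg_forecast[0], knn_result[0])
```

On this alternating pattern the historical average predicts the midpoint for every hour, while the k-NN forecast recognizes that a high day was followed by a low day in the past and predicts accordingly.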
Random Forest
Use to determine which features are the most important
Several attributes can be used to forecast future demand, but random forest may be able to pinpoint the most effective ones
Linear Regression model
Energy is the output from EV charging sessions and the focus of what needs to be accurately predicted
Determine if there is a relationship between any of the input features and the response variable (Energy in kWh)
Use session attributes as features (separately and combined)
12,285 EVSE
1,747,018 sessions
The largest EV battery available to the public is roughly 100 kWh in capacity, so there is no reason for a charge session to deliver more than 100 kWh. There are instances where people will sit in their car as their vehicle charges, so I placed the maximum threshold at 115 kWh to ensure I do not remove too much data. This resulted in 11,627 sessions being dropped from the dataset. I also placed a charge duration threshold at 24 hours because charging stations can fully charge an EV battery in less than 24 hours. Any session that lasts longer than 24 hours is most likely an error in data recording.
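The two thresholds above could be applied with a boolean mask in pandas. This is a minimal sketch on made-up rows; treating the thresholds as inclusive is my assumption.

```python
import pandas as pd

MAX_ENERGY_KWH = 115     # energy threshold stated in the text
MAX_CHARGE_HOURS = 24    # charge duration threshold stated in the text

# Made-up sessions: id 2 exceeds the energy limit, id 3 the duration limit.
sessions = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "energy_kwh": [12.4, 140.0, 55.0, 8.0],
    "charge_duration": [2.1, 5.0, 30.0, 1.5],
})

keep = (
    (sessions["energy_kwh"] <= MAX_ENERGY_KWH)
    & (sessions["charge_duration"] <= MAX_CHARGE_HOURS)
)
filtered = sessions[keep].reset_index(drop=True)
n_dropped = len(sessions) - len(filtered)
print(f"dropped {n_dropped} of {len(sessions)} sessions")
```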
A distribution of energy by session count can be seen below, where 75% of the sessions fall between 0-17 kWh. The peak on the left edge of the histogram represents sessions that record less than 1 kWh and that peak accounts for 5% of the dataset. It is unlikely that an EV owner would intentionally charge for less than 1 kWh during a session. These sessions most likely represent an owner who quickly realized they had somewhere to be, or a brief session that was accidentally cut short and quickly restarted by a user.
This dataset includes sessions that took place from June 2019 to March 2021, but most of the charging activity does not start until September 2019. As seen below, COVID-19 had a significant impact on charging activity at the beginning of 2020, which is to be expected. Towards the end of 2020 and in early 2021, as vaccines became publicly available and the fear of COVID-19 began to subside, EV owners started to travel and charge their vehicles more. The graph also shows that weekday charging is more popular than weekend charging, but they both follow a similar trend during the COVID-19 dip in activity. I was hoping to see an observable trend as the seasons of the year change, but given the timeframe of the dataset, COVID-19 had too much of an impact.
After aggregating the data to the monthly level, it is clear that most charging happens during the week. The next step is to determine what the dataset looks like on a daily basis rather than a monthly one. The visual below illustrates an hour-by-hour charging breakdown and provides a more detailed picture of how charging behavior changes throughout the day, with the blue bars representing weekday sessions and the orange bars representing weekend sessions. The x-axis represents charging start times based on the hour that the charge begins. The weekday trend is slightly more sporadic than the weekend trend, with start times peaking in the mid-morning and remaining relatively constant into the early evening. The weekend distribution is more symmetrical, with a gradual increase from the late morning to mid-afternoon, followed by fewer sessions starting in the evening.
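The hour-by-hour breakdown described above depends on two derived columns, the start hour and a weekend flag. A minimal sketch of how they could be computed is shown here, assuming `start_datetime` is already expressed in the station's local time; the sample timestamps are made up.

```python
import pandas as pd

sessions = pd.DataFrame({
    "start_datetime": pd.to_datetime([
        "2020-11-02 08:15",  # Monday
        "2020-11-02 08:50",  # Monday
        "2020-11-07 13:05",  # Saturday
        "2020-11-08 13:40",  # Sunday
        "2020-11-04 17:30",  # Wednesday
    ]),
})

# Hour of day (0-23) in which the session started.
sessions["start_hour"] = sessions["start_datetime"].dt.hour
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
sessions["weekend"] = (sessions["start_datetime"].dt.dayofweek >= 5).astype(int)

# Session counts per start hour, one column for weekdays (0) and one for weekends (1).
hourly_counts = (
    sessions.groupby(["weekend", "start_hour"])
    .size()
    .unstack("weekend", fill_value=0)
)
print(hourly_counts)
```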
There are 11 different venue categories in the dataset:
Multi-Unit Dwelling
Single Family Residential
Business Office
Fleet
Municipal Building
Multi-use Parking Garage/lot
Leisure Destination
Medical or Educational Campus
Retail
Hotel
Transit Facility
The two image carousels pictured below illustrate the distribution of charging session start and stop times based on time of day, where the x-axis represents each hour of the day (midnight starts at 0). The title of each visual denotes which venue type is being pictured and the number of sessions (n) that took place at that venue type.
The two residential venues, Single-Family and Multi-Unit, display similar behavior, with a peak in session start times during the early evening and a peak in session stop times in the morning before normal business hours. The Business Office venue shows a large peak in session start times at the beginning of the business day and a large peak in session stop times at the end of the business day. These three venue types follow an observable trend that is to be expected based on their descriptions.
Charge Session Start Time Distributions by Venue Type
Charge Session End Time Distributions by Venue Type
It is important to understand the geographic breakdown of charging activity by state. The graph below illustrates a session count for every state and provides a clear indicator that this dataset does not accurately represent each state in the country. After seeing this distribution, it is clear that the dataset needs to be filtered down to a smaller subset with the understanding that any findings from this project only pertain to the states that were included in the dataset.
The dataset is filtered down to include sessions from the following states:
Colorado
Georgia
Iowa
Kansas
Maryland
Michigan
Nevada
New York
Oregon
Pennsylvania
Rhode Island
Vermont
Washington
After filtering out states with less charging activity, the total session count drops from 1,747,018 to 1,492,983 sessions.
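The state filter above can be expressed with `isin`. This sketch assumes the `state` column has already been merged onto the sessions and stores full state names; if the data uses abbreviations, the list would need to match. The sample rows are made up.

```python
import pandas as pd

# The 13 states kept in the text.
KEPT_STATES = [
    "Colorado", "Georgia", "Iowa", "Kansas", "Maryland", "Michigan",
    "Nevada", "New York", "Oregon", "Pennsylvania", "Rhode Island",
    "Vermont", "Washington",
]

# Made-up sessions; Texas and Ohio are not in the kept list.
sessions = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "state": ["Maryland", "Texas", "Oregon", "Ohio"],
})

subset = sessions[sessions["state"].isin(KEPT_STATES)].reset_index(drop=True)
print(f"{len(sessions)} sessions before filtering, {len(subset)} after")
```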
Input features used from session dataset:
station_id
port_id
weekend
start_hour
end_hour
total_duration
charge_duration
rucc_id
Feature importance is helpful when working with machine learning models because it scores how much each input feature contributes to the model's predictions. Linear regression, decision tree, and random forest approaches were all used to determine which features are the most influential in predicting the target label (energy_kwh). The results from each of the feature importance tests were used to guide each model's development. Two approaches were used for each model: one that only includes the most important features as determined by that model, and one that includes every session attribute as a feature.
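As an illustration of how feature importance can be read from a random forest, here is a minimal sketch using scikit-learn's `feature_importances_` attribute. The data is synthetic and the relationship between `charge_duration` and energy is made up for the example; the hyperparameters are arbitrary and are not the settings used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic features named after the session attributes in the text.
X = pd.DataFrame({
    "charge_duration": rng.uniform(0.1, 8.0, n),
    "start_hour": rng.integers(0, 24, n),
    "weekend": rng.integers(0, 2, n),
})
# Made-up target: energy grows with charge duration, plus noise.
y = 6.0 * X["charge_duration"] + rng.normal(0.0, 2.0, n)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
top_feature = importances.index[0]
print(importances)
```

Impurity-based importances can favor features with many distinct values, such as identifiers, so it may be worth comparing them against another measure before dropping features.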
Linear Regression
Important features:
'charge_duration'
'weekend' (0 for weekday, 1 for weekend)
Important features only:
MSE: 74.028
RMSE: 8.604
R2: 0.562
All session attributes:
MSE: 71.081
RMSE: 8.431
R2: 0.579
The results from the two linear regression approaches differ slightly. The approach that used every session attribute as an input feature produced a small number of negative 'energy_kwh' predictions, which is not physically possible. This issue did not appear in the approach that used only the most important features, 'weekend' and 'charge_duration'; however, that approach resulted in a lower R2 value and higher MSE/RMSE values.
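The metrics reported above, and the check for negative predictions, can be reproduced in a few lines. The sketch below fits an ordinary least squares line with NumPy on made-up values; it illustrates the calculations and is not the model or data used in this project.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    mse = ss_res / y_true.size
    return {"mse": mse, "rmse": mse ** 0.5, "r2": 1.0 - ss_res / ss_tot}

# Made-up sessions where energy grows faster than linearly with duration.
charge_duration = np.array([0.2, 1.0, 2.0, 3.0, 4.0])
energy_kwh = np.array([0.1, 2.0, 9.0, 16.0, 25.0])

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones_like(charge_duration), charge_duration])
coef, *_ = np.linalg.lstsq(design, energy_kwh, rcond=None)
predictions = design @ coef

# A straight line with a negative intercept predicts negative energy
# for very short sessions.
n_negative = int(np.sum(predictions < 0))
metrics = regression_metrics(energy_kwh, predictions)
print(metrics, n_negative)
```

Because a linear model places no bound on its output, a negative intercept or negative coefficients can push short sessions below zero even when the overall fit is reasonable.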
Decision Tree
Important Features:
charge_duration
station_id
port_id
Important features only:
MSE: 41.362
RMSE: 6.431
R2: 0.755
All session attributes:
MSE: 45.686
RMSE: 6.759
R2: 0.7299
The results from both decision tree approaches are fairly similar when looking at the distribution of points on each predicted label vs. actual label visual. The approach that relied on every session attribute as input features resulted in a lower R2 value, which is not what was observed with the linear regression comparison. In this instance, it makes more sense to only consider the features with the highest weights when training the model. The decision tree model performs better than the linear regression model when trying to predict 'energy_kwh' values, but it still does not have an impressive R2 value.
Random Forest
Important Features:
charge_duration
station_id
port_id
Important features only:
MSE: 23.646
RMSE: 4.863
R2: 0.8602
All session attributes:
MSE: 23.627
RMSE: 4.861
R2: 0.8603
Random forest performs considerably better than the linear regression and decision tree approaches. Similar to the linear regression approach, the random forest approach that utilizes every feature performs slightly better than the approach that only uses the important features as inputs. The feature importance visuals provide an idea of which features are the most impactful when trying to predict 'energy_kwh' values, but each model behaved differently when comparing the outcome between experiments with all features included and experiments with only the most important features.
Low p-values for each regression method, displayed below, indicate that the fitted relationships are statistically significant and unlikely to be due to chance.
Model Results and Research Answers
The ability to predict energy consumption based on input features can be measured by considering the R2 value of each model. R2 represents the proportion of the variance in the observed 'energy_kwh' values that the model explains, so a value closer to 1 means that the predictions track the observed data more closely. By this measure, the random forest approach is the most reliable of the three models used throughout these experiments.
Can EV charging behavior be modeled based on charging session attributes (combined and separately)?
Two different approaches used
All session attributes as input features
Most important attributes used as input features (based on each model's feature importance results)
Random forest is the most effective, with the approach that utilizes all input features having the highest R2 value
Which of the session features is the most important for predicting the amount of energy that EVs consume during charging sessions?
'charge_duration' is the most important
All three modeling approaches gave this attribute the highest importance level
Future Work
Create subsets of the data by venue type and see if the models are more or less accurate at predicting charging behavior when only considering data from a specific venue
Explore the residual plots from each experiment to determine if there is bias in the independent variables
Are the residual plots random or do they display a pattern
Ideally they would be random
This will provide deeper insight into the reliability of the models
R2 is a good identifier of a model's ability to fit the data, but it does not analyze if the predictions are biased
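The residual check proposed above can be sketched numerically before any plots are drawn. The example below uses made-up values and a hypothetical model that under-predicts every session by 20% to show that R2 can remain high while the residuals are clearly not random.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarize residuals: mean, correlation with the predictions, and R2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mean_residual": float(residuals.mean()),
        "corr_with_prediction": float(np.corrcoef(y_pred, residuals)[0, 1]),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up actual energy values and a hypothetical biased model.
actual = np.array([2.0, 5.0, 9.0, 14.0, 20.0])
biased_predictions = actual * 0.8

summary = residual_summary(actual, biased_predictions)
print(summary)
```

A mean residual far from zero, or residuals that correlate with the predictions, suggests a systematic pattern that a residual plot would show as a slope or curve rather than a random scatter.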
Featherman, M., Jia, S. (J.), Califf, C. B., & Hajli, N. (2021). The impact of new technologies on consumers beliefs: Reducing the perceived risks of electric vehicle adoption. Technological Forecasting and Social Change, 169. https://doi.org/10.1016/j.techfore.2021.120847
Tarei, P. K., Chand, P., & Gupta, H. (2021). Barriers to the adoption of electric vehicles: Evidence from India. Journal of Cleaner Production, 291. https://doi.org/10.1016/j.jclepro.2021.125847
Chung, Y.-W., Khaki, B., Li, T., Chu, C., & Gadh, R. (2019). Ensemble machine learning-based algorithm for electric vehicle user behavior prediction. Applied Energy, 254, 113732. https://doi.org/10.1016/j.apenergy.2019.113732
Majidpour, M., Qiu, C., Chu, P., Gadh, R., & Pota, H. R. (2015). Fast Prediction for Sparse Time Series: Demand Forecast of EV Charging Stations for Cell Phone Applications. IEEE Transactions on Industrial Informatics, 11(1), 242–250. https://doi.org/10.1109/tii.2014.2374993