Predicting Outcomes of Major League Baseball Games
by
GitHub Repository for your capstone project: Insert Link (e.g. https://github.com/UMBC-Data-Science/ )
1. Introduction
Provide background of the matter or issue. Why it is important - to you and to others.
As an avid baseball fan, I am intrigued by the sport and by the sabermetrics that accompany it. Data analytics are now used throughout baseball, from pitch spin rates to the angle of the bat at contact. Baseball has always been deeply rooted in statistics, but recently there has been a push toward advanced statistics that better capture team and player performance. I would like to see whether I can apply my Data Science knowledge to a combination of basic and advanced baseball statistics to predict the outcomes of baseball games.
What may have been done already in this area by others?
Beyond the advances in baseball analytics that attempt to better capture player and team performance, I have come across several attempts to create algorithms that predict baseball game outcomes. I will have ample resources to draw on, including college-level Data Science courses on baseball sabermetrics and several Data Science projects that attempt to predict baseball game outcomes.
What are the gaps?
Although I have come across several attempts to use Data Science to predict baseball outcomes, I have not seen any algorithms that attempt to use advanced baseball stats. For example, I have seen batting average used in predictive algorithms but not on-base plus slugging plus (OPS+), an advanced stat that credits batters for getting on base and rewards them for extra-base hits. I think these advanced statistics will strengthen my predictive linear regression models.
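To make concrete what such a stat measures, the sketch below computes on-base percentage, slugging percentage, and a simplified OPS+ from counting stats. The function names and the sample numbers are illustrative assumptions, not data from any source, and this version leaves out the ballpark adjustment that published OPS+ figures include.

```python
def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP: times on base divided by the plate appearances that count toward it."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def slugging_pct(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at-bat, so extra-base hits are worth more."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops_plus(obp, slg, league_obp, league_slg):
    """Simplified OPS+ (no park adjustment): 100 means league average."""
    return 100 * (obp / league_obp + slg / league_slg - 1)


# Made-up batting line, for illustration only.
obp = on_base_pct(hits=150, walks=50, hit_by_pitch=5, at_bats=500, sac_flies=5)
slg = slugging_pct(singles=100, doubles=30, triples=5, home_runs=15, at_bats=500)

# Made-up league averages, for illustration only.
player_ops_plus = ops_plus(obp, slg, league_obp=0.320, league_slg=0.410)
```

A batter who exactly matches the league averages scores 100, so values above 100 indicate above-average production.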
Are you trying to close the gaps or trying to create something novel?
Beyond using advanced baseball statistics to strengthen my predictive regression model, I would like to incorporate run differential into the model. The idea that winning a close game counts the same as winning a blowout came to me while watching the documentary AlphaGo. AlphaGo is a machine learning system that beat one of the best Go players in the world. Most human Go players tend to want to dominate their opponents, whereas AlphaGo simply wanted to win the game. In terms of baseball, I want to figure out how to give a 1-run victory the same credit as a 10-run victory, since my goal is to predict a win.
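One simple way to give every win equal credit is to collapse each final score into a binary label while keeping the run differential as a separate quantity. This is only a sketch of that idea; the function name and the scores are made up for illustration.

```python
def win_label(runs_scored, runs_allowed):
    """Return 1 for a win and 0 for a loss, ignoring the margin of victory."""
    return 1 if runs_scored > runs_allowed else 0


# Made-up final scores as (runs scored, runs allowed).
games = [(3, 2), (12, 2), (1, 4)]

labels = [win_label(scored, allowed) for scored, allowed in games]
run_diffs = [scored - allowed for scored, allowed in games]
```

Here the 1-run win and the 10-run win both receive a label of 1, while `run_diffs` still records the margins (1, 10, and -3) in case they are useful as a feature built from earlier games.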
2. Data Set
This is where more details are required. Data sources: make sure they are from reputable sources. Briefly describe each data source. If you are using multiple sources, what needs to be done to merge and consolidate them?
I have found a plethora of baseball statistics on the MLB.com website and advanced stats at https://baseballsavant.mlb.com/league?season=2021#statcastHitting, which is connected to the MLB.com site. It should be relatively easy to pull these standard and advanced statistics into dataframes; I will then merge them into a single dataframe containing only the stats I find most pertinent to predicting the outcomes of baseball games.
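The merge step could look roughly like the pandas sketch below. The column names, team codes, and values are all placeholders invented for illustration; the real tables would come from the sources named above and would need their own key columns checked.

```python
import pandas as pd

# Placeholder standard stats (made-up teams and values).
standard = pd.DataFrame({
    "team": ["AAA", "BBB", "CCC"],
    "season": [2021, 2021, 2021],
    "runs_scored": [700, 650, 720],
    "runs_allowed": [680, 700, 640],
    "batting_avg": [0.250, 0.240, 0.260],
})

# Placeholder advanced stats (made-up values).
advanced = pd.DataFrame({
    "team": ["AAA", "BBB", "CCC"],
    "season": [2021, 2021, 2021],
    "woba": [0.315, 0.305, 0.325],
    "ops_plus": [100, 95, 108],
})

# Join on team and season; validate raises if either side has duplicate keys.
merged = standard.merge(advanced, on=["team", "season"], how="inner",
                        validate="one_to_one")

# Keep only the columns intended for modeling.
model_df = merged[["team", "season", "runs_scored", "runs_allowed", "woba", "ops_plus"]]
```

Using an inner join with `validate="one_to_one"` is a deliberate choice here: it drops team-seasons missing from either source and fails fast if a key is accidentally duplicated.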
Data elements: what are they? Which elements are you planning to use as the features (predictors) for your predictive analytics (ML)? What are your planned targets (to be predicted)?
At this point, I am going to focus on the basic stats of walks, runs scored, runs allowed, and batting average, and on the advanced stats of weighted on-base average (wOBA), on-base plus slugging plus (OPS+), and run differential. My target is ‘wins’, since I want to use predictive analytics (ML) to predict which baseball team will win a game. Along the way, I would like to identify the key characteristics that will help me pick out the team most likely to win a given game. For instance, the team with the better batting average may have the better chance of winning, or the team with more runs scored and a higher OPS+ may have the higher probability of winning. Since these standard and advanced statistics will be used to build my regression model, I want to identify the most statistically relevant ones.
What are the types of each data element you plan to use: categorical, numerical, text, audio, video, image, etc.?
Other than the categorical outcome of ‘win’ or ‘loss’, all of the data elements being used will be numerical.
What is your unit of analysis? - person, country, a medical device, a data breach incident, etc. How many observations are available in the data?
I have data sets ranging from the 2015 season through the current 2021 season. Because of the COVID pandemic, the 2020 season was shortened to 60 games, so its statistics will be fairly limited, and the 2021 season is still being played, so that data set will be incomplete. Regardless, this is a lot of data. I will also have to consider splitting the data into two data sets to account for the designated hitter (DH), which the American League uses and the National League does not.
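A split along the DH rule might be sketched as follows. The column names and rows are placeholders, and the sketch rests on two assumptions that should be checked against the actual schedule data: that the DH rule follows the home team's league, and that every game in 2020 used the DH.

```python
import pandas as pd

# Placeholder game log (made-up teams and results).
games = pd.DataFrame({
    "home_team": ["AAA", "BBB", "CCC", "AAA", "DDD"],
    "home_league": ["AL", "NL", "NL", "AL", "NL"],
    "season": [2019, 2019, 2020, 2021, 2021],
    "home_win": [1, 0, 1, 1, 0],
})

# Assumption: DH is in effect in AL home parks, and in all 2020 games.
games["dh_used"] = games["home_league"].eq("AL") | games["season"].eq(2020)

dh_games = games[games["dh_used"]]
no_dh_games = games[~games["dh_used"]]
```

Flagging games rather than teams keeps interleague games in the correct subset, since the rule in effect depends on where the game is played.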
3. Hypothesis / Research Question(s)
This tells your scope of work. List potential questions/hypotheses you are thinking about. Give much thought to this, tying back to the introduction: why are these hypotheses or questions important to investigate? You can list many, and later, as you explore and learn more, you can decide which ones to focus on.
As previously stated, my main goal is to create a predictive model that will best predict the winner of a baseball game. Within this main goal, there are several questions/hypotheses that may arise.
1) Does home field advantage exist in baseball? That is, does the home team have a better chance of winning simply because it is playing at home, and not because it is statistically the better team?
2) Are specific baseball statistics and advanced statistics key to predicting the outcome of a baseball game?
3) Is there a difference between predicting American League games where a DH is allowed versus National League games where a DH is not allowed?
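As a first look at question 1, the home team's win count can be compared with what a fair coin would produce. The sketch below computes a one-sided exact binomial p-value using only the standard library; the function name and the win count are made up for illustration. It tests the raw home win rate only and does not yet control for the home team being the stronger team, which the question also asks about.

```python
from math import comb


def home_advantage_p_value(home_wins, games):
    """P(at least home_wins wins in `games` games) if each game were a 50/50 coin flip."""
    favorable = sum(comb(games, k) for k in range(home_wins, games + 1))
    return favorable / 2 ** games


# Made-up example: the home team won 54 of 100 games.
p_value = home_advantage_p_value(54, 100)
```

A small p-value would suggest that the home win rate is unlikely under a 50/50 assumption. Exact integer arithmetic is used so that the same function stays numerically safe for full-season game counts.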
4. Implementation (Model)
Potential models to use. List what they are and why they may be useful to answer your questions. As you explore and learn more later, you can pick the most practical and effective ones. Going into this level of detail not only helps me make my investment decision but also helps you plan your work. After this is done, the rest of your effort will be much smoother, and you will have less uncertainty and rework. This becomes a roadmap for your journey.
At this point I am focusing on a regression model and will attempt to identify the key features to use with it. I am also considering Random Forest Regression and plan to research other regression models. So far my research has focused on baseball statistics and advanced statistics, with the aim of choosing which to use as features. I want to use the most informative statistics possible so as to eliminate unnecessary noise from the model. I am in the very early stages of building the actual regression model.
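A starting point for the modeling step might look like the scikit-learn sketch below. Because the target is a win or a loss, the sketch swaps in logistic regression and a random forest classifier rather than Random Forest Regression. The data are synthetic stand-ins generated at random, not baseball statistics, and the three feature columns are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_games = 400

# Synthetic stand-ins for three features, e.g. home-minus-away stat differences.
X = rng.normal(size=(n_games, 3))

# Synthetic outcome: only the first feature carries signal, the rest is noise.
y = (X[:, 0] + rng.normal(size=n_games) > 0).astype(int)

# Hold out a quarter of the games to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logit = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

logit_acc = logit.score(X_test, y_test)    # share of held-out games predicted correctly
forest_acc = forest.score(X_test, y_test)

# Relative importance of each feature according to the forest (sums to 1).
importances = forest.feature_importances_
```

Comparing held-out accuracy across models and inspecting `importances` is one way to judge which statistics add signal and which mostly add noise.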
Your Linkedin Profile
Your GitHub Repo: https://github.com/umbc-data606-summer-2021/Kenneth_Reading/upload/main