The 2011 film "Moneyball" highlighted the power of analytics to build a salary-friendly roster capable of winning in the playoffs and the World Series. Optimizing financial decisions has always been a priority for a multi-billion-dollar league, with team salaries ranging from $59 million to $254 million in 2020. Additional revenue from TV deals (e.g., Turner Sports) in 2020 was expected to be valued at $3.75 billion. More recently, the Supreme Court has allowed individual states to decide whether to legalize sports betting. In Nevada, where sports betting is legal, it generated $61.81 million in revenue in November 2020. As applications such as DraftKings and FanDuel drive mass adoption of sports betting, leveraging the ocean of data to accurately predict outcomes becomes invaluable to the league, to casual and professional bettors, and to spectators.
In MLB sports betting, bettors can wager on several kinds of outcomes. Several related works have used the retrosheet.org dataset to predict the winner of a matchup. In addition to bets on the winner (money line bets), there are over/under bets on the total runs scored in a game. The downside of money line bets is their price variability: the more heavily favored a team is, the higher the price, up to $400 risked to win $100. Focusing on the total runs in a game restricts that variability, since the price is usually $110 to win $100.
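To make the price comparison concrete, the sketch below converts a "risk X to win Y" price into the break-even win probability, i.e., the fraction of bets that must win for the bettor to come out even. The function name `implied_probability` is illustrative and not part of the original project.

```python
def implied_probability(risk: float, win: float) -> float:
    """Break-even win probability for a bet that risks `risk` to win `win`.

    A winning bet gains `win`; a losing bet forfeits `risk`. Breaking even
    requires p * win == (1 - p) * risk, which gives p = risk / (risk + win).
    """
    if risk <= 0 or win <= 0:
        raise ValueError("risk and win must both be positive")
    return risk / (risk + win)


# Prices quoted in the text: a heavy money line favorite vs. a typical over/under.
heavy_favorite = implied_probability(400, 100)
over_under = implied_probability(110, 100)

print(f"Money line at $400 to win $100: {heavy_favorite:.3f}")  # 0.800
print(f"Over/under at $110 to win $100: {over_under:.3f}")      # 0.524
```

Under these prices, a bettor on a heavy favorite must win about 80% of the time to break even, while an over/under bettor needs only about 52.4%, so the bar for a useful model stays roughly constant across over/under bets.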
Project Objective: Compare machine learning models for predicting the outcome of an MLB game
Research Methods: K-Nearest Neighbors, Decision Tree, Logistic Regression
Data Sources:
Season logs from the years 2010-2020 [https://www.retrosheet.org/gamelogs/index.html]
- Data Dictionary [https://www.retrosheet.org/gamelogs/glfields.txt]
EDA help from alum Liam Preis: https://github.com/ljpreis/CapstoneProject
Related Works:
Predicting winners: http://cs229.stanford.edu/proj2013/JiaWongZeng-PredictingTheMajorLeagueBaseballSeason.pdf
Predicting salaries: https://www.shsu.edu/academics/general-business-and-finance/general-business-conference/documents/GeneralBusiness2011Proceedings.pdf#page=222
Predicting career performance: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1600-0838.2011.01408.x
Phase 2: Exploratory Data Analysis
Looking at the general trend over time, home teams hold a small advantage, winning slightly more than 50% of games.
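The home-field trend can be checked directly from the game logs. The following is a minimal sketch, not the notebook's actual code: it assumes the comma-separated Retrosheet game-log layout in which field 1 is the date (yyyymmdd), field 10 is the visitor score, and field 11 is the home score, as described in the data dictionary. The function name and the sample rows (with placeholder team codes `AAA` and `BBB`) are made up for illustration.

```python
import csv
from collections import defaultdict

# 0-based column positions, assumed from the Retrosheet game-log data dictionary.
DATE_COL, VISITOR_SCORE_COL, HOME_SCORE_COL = 0, 9, 10


def home_win_rate_by_season(lines):
    """Return {season: share of decided games won by the home team}."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for row in csv.reader(lines):
        if len(row) <= HOME_SCORE_COL:
            continue  # skip blank or truncated rows
        visitor, home = row[VISITOR_SCORE_COL], row[HOME_SCORE_COL]
        if not (visitor.isdigit() and home.isdigit()):
            continue  # skip rows with missing scores
        if int(visitor) == int(home):
            continue  # skip ties so every counted game has a winner
        season = row[DATE_COL][:4]
        games[season] += 1
        wins[season] += int(home) > int(visitor)
    return {season: wins[season] / games[season] for season in sorted(games)}


# Made-up rows in the assumed layout, for illustration only (not real results).
sample = '''"20100405","0","Mon","AAA","AL",1,"BBB","AL",1,3,5
"20100406","0","Tue","AAA","AL",2,"BBB","AL",2,4,2
"20110401","0","Fri","BBB","AL",1,"AAA","AL",1,1,6'''

print(home_win_rate_by_season(sample.splitlines()))
```

With real data, passing an open file handle for each season's game log (or their concatenation) in place of `sample.splitlines()` would give one win rate per season to plot over time.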
I chose four main features:
- Average Home Score Relative to Average
- Average Visitor Score Relative to Average
- Average Home Score Conceded Relative to Average
- Average Visitor Score Conceded Relative to Average
I then plotted the features against one another in a scatter-plot correlation matrix. The team-level data show no correlations except in the top-right and bottom-left panels (AvgHscoreRA and AvgVscoreConcededRA), where the two features are inversely related to one another.
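The numbers behind a correlation matrix like this one can be sketched in a few lines. The example below is illustrative only and rests on two assumptions: that "relative to average" means dividing each team's average by the league-wide mean (the text does not define it), and that the pairwise statistic is the Pearson correlation coefficient. The helper names and the toy values are hypothetical.

```python
from math import sqrt


def relative_to_average(values):
    """Scale values by their mean (assumed meaning of 'relative to average')."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(features):
    """Return {name_a: {name_b: r}} for every pair of feature columns."""
    return {
        a: {b: pearson(col_a, col_b) for b, col_b in features.items()}
        for a, col_a in features.items()
    }


# Toy team-level averages (runs per game), made up for illustration only.
features = {
    "AvgHscoreRA": relative_to_average([4.0, 4.5, 5.0, 5.5]),
    "AvgVscoreConcededRA": relative_to_average([5.2, 4.9, 4.4, 4.1]),
}

for name, row in correlation_matrix(features).items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A coefficient near -1 between two columns corresponds to the inverse relationship visible in the off-diagonal panels, while values near 0 correspond to panels with no visible pattern.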
Phase 3: Machine Learning Models