Baseball is notorious for its inherent variance, shifting rosters, and increasingly efficient betting markets, so predicting the outcome of an MLB game is difficult. Machine learning models can extract weak predictive signals from game data; however, turning those signals into actionable wagers requires evaluating them differently from how predictive performance is traditionally measured.
The goal of this project is to develop an outcome prediction model for baseball games and to test it with a full-season simulation. Although building a profitable betting system is not the main objective, the simulation helps show how model behavior and deployment logic interact under real-world constraints.
The modeling pipeline draws on three sources of data. Statcast pitch-level data (retrieved with PyBaseball) provides detailed information on pitcher handedness and pitch-level outcomes; game-level data identifies the teams, the home and away sides, and the starting pitchers; and betting odds and final scores place the predictions in the context of the betting market. All data used to build and test the model came from the 2015-2021 seasons. The 2025 season was used only for the out-of-sample simulation.
All data passed through standard preprocessing. Numerical features (such as velocity) were normalized based on their distribution, and categorical variables (e.g., team name) were one-hot encoded. Binary variables were created to indicate home/away status and pitcher handedness. The historical dataset was split into an 80% training set and a 20% test set; the 2025 season was excluded from training so that the simulation is not biased by data the model has already seen.
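The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: it assumes z-score standardization as the normalization method (the text does not name one), and the function names and toy values are hypothetical. The scaling parameters are estimated from the training values only and then reused on held-out values.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Estimate mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply a previously fitted z-score transform."""
    return [(v - mu) / sigma for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector over a fixed category list."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical toy data, for illustration only.
train_velocity = [92.0, 94.0, 96.0]
test_velocity = [95.0]

mu, sigma = fit_standardizer(train_velocity)
train_z = standardize(train_velocity, mu, sigma)
test_z = standardize(test_velocity, mu, sigma)  # reuses training statistics

team_vectors = one_hot(["NYY", "BOS", "NYY"], categories=["BOS", "NYY"])
```

Fitting the scaler on the training portion and reusing it elsewhere keeps information from the test set and the simulation season out of the features.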
Feature engineering combined core data points with baseball-specific context, with the aim of keeping features as stable over time as possible. Team-level offensive production was measured with rolling averages over the last ten games, split by the handedness of the opposing starting pitcher, for runs scored, OPS, strikeouts, and walks.
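A rolling average split by opposing pitcher handedness could be computed along the following lines. This is a sketch under assumptions: the tuple layout, the function name, and the choice to exclude the current game from its own average (so the feature only uses information available before first pitch) are illustrative and not confirmed by the project.

```python
from collections import defaultdict, deque


def rolling_prior_mean_by_key(games, window=10):
    """For each game, average the stat over the previous `window` games
    that share the same (team, opposing starter hand) key.

    `games` must be in chronological order. The current game is excluded,
    and None is returned when there is no prior history for the key.
    """
    histories = defaultdict(lambda: deque(maxlen=window))
    features = []
    for team, opp_hand, value in games:
        hist = histories[(team, opp_hand)]
        features.append(sum(hist) / len(hist) if hist else None)
        hist.append(value)  # update history only after computing the feature
    return features


# Hypothetical games: (team, opposing starter hand, runs scored).
games = [
    ("NYY", "L", 3),
    ("NYY", "R", 5),
    ("NYY", "L", 7),
    ("NYY", "L", 2),
]
features = rolling_prior_mean_by_key(games, window=2)
# features == [None, None, 3.0, 5.0]
```

The same function can be reused for OPS, strikeouts, and walks by changing the value passed in each tuple.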
Team strength was based on Pythagorean win percentage (an estimate of a team's winning percentage from its runs scored and runs allowed), which describes the underlying strength of a team. Pitcher performance was measured through FIP and K–BB%, both over a rolling window of the last ten games and over the full season. Bullpen performance was measured through WHIP, K–BB%, and ERA estimates based on hits and walks allowed, using both season totals and rolling estimates over the last ten games.
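For reference, the standard definitions of these metrics can be written as small functions. The exponent of 2 in the Pythagorean formula and the FIP constant of 3.10 are assumptions for illustration: the text does not say which exponent was used, and the FIP constant varies by season. The sample inputs are made up.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and allowed.
    exponent=2.0 is the classic form; other values are also in common use."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching.
    `constant` is a placeholder; the real value is season-specific."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def whip(walks, hits, ip):
    """Walks plus hits per inning pitched."""
    return (walks + hits) / ip


def k_minus_bb_pct(k, bb, batters_faced):
    """Strikeout rate minus walk rate, as a fraction of batters faced."""
    return (k - bb) / batters_faced


# Hypothetical season lines, for illustration only.
team_strength = pythagorean_win_pct(runs_scored=800, runs_allowed=700)
starter_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180)
bullpen_whip = whip(walks=50, hits=150, ip=180)
starter_k_bb = k_minus_bb_pct(k=200, bb=50, batters_faced=750)
```

Rolling versions of the pitcher metrics follow by summing the component counts over the window before applying the same formulas.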
Some attempts were made to group pitchers into “types” based on pitch characteristics and handedness. This approach required more data than was consistently available and was not used in the final model.
Several model types were tested, including Random Forest, XGBoost, neural networks, and stacking methods. The final model was a stacking classifier that predicts whether the home team wins each game. Model selection prioritized generalization and stability over maximizing in-sample accuracy.
On a test sample withheld from training, the final model achieved 55.65% accuracy and an AUC of 0.569. This suggests that the model has some predictive power but is not strong enough to be used as is; additional tuning would be required before deployment.
A Receiver Operating Characteristic (ROC) curve was created for the final model to analyze its discriminatory performance across probability thresholds.
Figure 1. ROC curve for the final stacking classifier. The area under the curve (AUC) is 0.569, meaning the model has some ability to discriminate between classes but not enough to serve as a standalone predictor.
The ROC curve confirms that the final stacking classifier captures some signal and can therefore partly differentiate between "wins" and "losses", but the separation is weak. The model therefore needs to be evaluated in its deployment context, beyond classification metrics alone.
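To make the AUC figure concrete: AUC equals the probability that a randomly chosen home win receives a higher predicted probability than a randomly chosen home loss, with ties counted as half. The sketch below computes it directly from that definition. It is an illustration with made-up labels and scores, not the evaluation code used in the project, and the pairwise loop is quadratic, so it suits small samples only.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties contribute 0.5. Labels must be 0 or 1."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels (1 = home win) and predicted home-win probabilities.
y_true = [1, 0, 1, 0]
y_score = [0.60, 0.40, 0.45, 0.50]
auc = roc_auc(y_true, y_score)
# Three of the four positive/negative pairs are ordered correctly: auc == 0.75
```

Under this reading, an AUC of 0.569 means the model ranks a win above a loss only slightly more often than a coin flip would.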
To test the model's predictions, I ran a complete simulation of the 2025 Major League Baseball (MLB) season. The simulated bettor placed a bet whenever the expected value implied by the model's predicted probability and the market odds exceeded a minimum threshold. The bettor's bankroll was tracked across the entire season to see how the model performed over the year and how much risk and return were associated with its predictions.
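The expected-value filter can be sketched as follows. The text does not give the exact EV formula, odds format, or threshold, so this assumes American moneyline odds, EV expressed per unit staked, and a 2% minimum (chosen to match the lowest EV bracket reported later). All function names are hypothetical.

```python
def american_to_decimal(odds):
    """Convert American moneyline odds to decimal odds."""
    if odds == 0:
        raise ValueError("American odds cannot be zero")
    if odds > 0:
        return 1 + odds / 100
    return 1 + 100 / abs(odds)


def expected_value(p_win, decimal_odds):
    """EV per unit staked: win (decimal_odds - 1) with probability p_win,
    lose the 1-unit stake otherwise."""
    return p_win * (decimal_odds - 1) - (1 - p_win)


def should_bet(p_win, american_odds, min_ev=0.02):
    """Bet only when the model-implied EV clears the threshold.
    min_ev=0.02 is an assumed value, not taken from the project."""
    return expected_value(p_win, american_to_decimal(american_odds)) >= min_ev


# Hypothetical example: the model gives the home team a 55% win probability.
ev_underdog_price = expected_value(0.55, american_to_decimal(+110))
ev_favorite_price = expected_value(0.55, american_to_decimal(-120))
bet_at_plus_110 = should_bet(0.55, +110)   # EV is about +15.5% per unit
bet_at_minus_120 = should_bet(0.55, -120)  # EV is about +0.8%, below 2%
```

The same predicted probability leads to a bet at one price and a pass at another, which is why the filter, and not the classifier alone, shapes the simulated results.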
The simulation produced an estimated bankroll increase of approximately $1k to $2.39k across approximately 1,100 bets. On its own, the model had very little ability to forecast individual game results. The strategy's results were largely driven by the process used to filter and execute the predicted bets.
To better understand the risk associated with the strategy, I tracked the bankroll over the course of the entire season. Plotting the bankroll's movements makes it possible to analyze volatility, drawdowns, and long-term performance characteristics, which cannot be assessed from the total return alone.
The bankroll graph (see Figure 2) shows that the season ended with a positive bankroll, but the path was volatile and included considerable drawdowns. The season's performance was due not to the reliable dominance of any single edge but to selective participation in favorable conditions over time. The success of this strategy is therefore more episodic than consistent, which makes it important to evaluate risk alongside overall returns.
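One common way to quantify the drawdowns visible in a bankroll curve is the maximum drawdown: the largest peak-to-trough decline, expressed as a fraction of the peak. The sketch below is illustrative; the bankroll path is made up and is not the simulated season.

```python
def max_drawdown(bankroll):
    """Largest fractional decline from a running peak.
    Assumes a non-empty sequence of positive bankroll values."""
    peak = bankroll[0]
    worst = 0.0
    for value in bankroll:
        peak = max(peak, value)                   # highest bankroll so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst


# Hypothetical bankroll after each bet, for illustration only.
path = [1000, 1200, 900, 1500, 1350]
drawdown = max_drawdown(path)
# The drop from 1200 to 900 is the deepest: drawdown == 0.25
```

Two strategies with the same final bankroll can have very different maximum drawdowns, which is the kind of difference the total return hides.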
To assess how the model's confidence level affected performance, results were broken down into three ranges of expected value (EV) (see Figure 3). For each EV bracket, betting volume, win percentage, profit, average stake size, and return on investment (ROI) were analyzed.
Grouping bets by EV showed that bets in the lowest EV range (2–5%) produced an overall negative return despite high volume. Bets in the 5–10% EV range generated small but positive returns, while the majority of profits came from the highest-confidence bets (EV ≥ 10%). Furthermore, win rates remained consistent across the 5–10% and 10%+ segments, suggesting that profitability was driven less by predictive accuracy and more by the combination of model confidence, market odds, and stake size.
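A per-bracket breakdown of this kind can be produced with a simple aggregation. The sketch below assumes each bet is stored as a dictionary with `ev`, `stake`, `profit`, and `won` fields and that a bet at exactly 5% falls in the 5–10% bracket; both are illustrative choices. The sample bets are invented and do not reproduce the project's results.

```python
def ev_bucket(ev):
    """Map an EV (per unit staked) to one of the three brackets."""
    if ev < 0.02:
        return None  # below the assumed minimum; such bets are not placed
    if ev < 0.05:
        return "2-5%"
    if ev < 0.10:
        return "5-10%"
    return "10%+"


def summarize_by_bucket(bets):
    """Aggregate volume, win rate, profit, average stake, and ROI per bracket."""
    summary = {}
    for bet in bets:
        label = ev_bucket(bet["ev"])
        if label is None:
            continue
        row = summary.setdefault(
            label, {"bets": 0, "wins": 0, "staked": 0.0, "profit": 0.0}
        )
        row["bets"] += 1
        row["wins"] += 1 if bet["won"] else 0
        row["staked"] += bet["stake"]
        row["profit"] += bet["profit"]
    for row in summary.values():
        row["win_rate"] = row["wins"] / row["bets"]
        row["avg_stake"] = row["staked"] / row["bets"]
        row["roi"] = row["profit"] / row["staked"]
    return summary


# Invented bets, for illustration only.
bets = [
    {"ev": 0.03, "stake": 10.0, "profit": -10.0, "won": False},
    {"ev": 0.04, "stake": 10.0, "profit": 9.0, "won": True},
    {"ev": 0.07, "stake": 20.0, "profit": 22.0, "won": True},
    {"ev": 0.12, "stake": 40.0, "profit": -40.0, "won": False},
    {"ev": 0.15, "stake": 40.0, "profit": 50.0, "won": True},
]
report = summarize_by_bucket(bets)
```

Reporting ROI alongside win rate matters here because two brackets with the same win rate can still differ in profit when the odds and stake sizes differ.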
Overall, the bankroll trajectory and the results across EV segments highlight the need to be selective about confidence when deploying a probabilistic model: participating only in higher-confidence scenarios, rather than betting on every opportunity equally, has a meaningful impact on overall performance. The rules that govern how decisions are made and wagers are executed also have a substantial impact on realized outcomes, even though they do not change the model's underlying probability predictions.
A predictive model's operational value depends on more than its accuracy measured in isolation. The final model for MLB regular-season games showed little ability to discriminate between outcomes. However, the simulated deployment revealed that choices about when to deploy the model, which confidence thresholds to apply, and how bets are implemented could all contribute materially to realized performance.
The analysis also highlights the challenge of extensive feature engineering in high-variance environments. The baseball-informed features added interpretability and realism; however, the short rolling windows and matchup splits introduced so much variance relative to the underlying signal that they often overwhelmed it. Model performance is therefore sensitive not only to the variables used to build the model but also to how its predictions are filtered, adjusted, and applied at deployment.
This project illustrates how critical end-to-end validation is. The model's real-world behavior differed substantially from what traditional metrics (e.g., accuracy and AUC) would imply. Examining bankroll trends and EV buckets surfaced dynamics that typical evaluation would have missed. Overall, understanding how a model behaves (and where it fails) is often more informative than minor improvements in predictive ability.
In conclusion, this project illustrates that effective use of analytics in baseball requires disciplined processes, transparency, and thoughtful validation of the entire decision chain. By concentrating on calibration, robustness, and deployment mechanics, analysts can derive useful insights from imperfect models. These principles also apply to baseball operations, where understanding uncertainty and managing risk are as important as prediction itself.