For this project I worked in a team of three: Stevie Kantor, Jonathan Davies, and myself.
The dataset used in our final project provides comprehensive information on the 2024 Major League Baseball (MLB) postseason, capturing a wide range of data related to player performance, game dynamics, and statistics. The data includes details such as game dates, player names, pitch characteristics, and advanced metrics like launch speed, spin rate, and bat speed. Our project aims to clean this dataset and explore patterns and trends that can offer insights into player strategies, team performance, and situational outcomes during high-stakes postseason games.
Upon inspection of our dataset, we found that it contains 36 columns and 3,734 rows. The dataset was sourced from advanced tracking systems used during MLB games, offering a unique perspective on the interactions between pitchers and batters. Through data cleanup, exploration, and analysis, our report will uncover insights and present a data story that highlights the nuances of postseason baseball.
The goals of this project include cleaning and organizing the data, visualizing key trends, and answering complex questions about game outcomes. By combining statistical techniques and data storytelling, our report aspires to provide a deeper understanding of the factors contributing to success in the postseason.
To gain insights into the 2024 MLB Postseason dataset, a series of targeted questions was formulated by our group to thoroughly assess the data's structure, quality, and potential for analysis. These questions guided our exploration, ensuring the dataset was suitable for meaningful conclusions. For example, the dataset was carefully evaluated to confirm that mixed data types aligned with their respective columns, ensuring numerical fields were appropriately formatted for calculations, categorical fields supported proper categorization, and datetime fields allowed for chronological analysis. It was observed that all columns adhered to their intended formats, confirming the dataset’s structural integrity.
Another key focus was identifying missing values and their distribution across columns. Significant gaps were found in metrics such as events (74.52% missing) and bb_type (83.18% missing), leading us to exclude these heavily incomplete columns to avoid skewing results. For partially incomplete fields like bat_speed, which had 50.98% missing values, imputation techniques were applied to fill these gaps and preserve the column's analytical potential.
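The missing-value check described above can be sketched in pandas as follows. This is a minimal illustration, not the exact code used in the project: the 70% drop threshold and the median imputation are assumptions (the report does not state which imputation technique was applied), and the sample frame contains made-up values.

```python
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).round(2).sort_values(ascending=False)


def drop_and_impute(df, drop_threshold=70.0, impute_cols=("bat_speed",)):
    """Drop heavily incomplete columns, then fill the listed columns.

    drop_threshold and the median fill are illustrative assumptions.
    """
    pct = missing_percentages(df)
    to_drop = pct[pct > drop_threshold].index
    cleaned = df.drop(columns=to_drop)
    for col in impute_cols:
        if col in cleaned.columns:
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned


# Tiny made-up frame, not the project data
sample = pd.DataFrame({
    "release_speed": [95.1, 88.4, 91.0, 93.2],
    "bat_speed": [70.0, None, 74.0, None],
    "bb_type": [None, None, None, "ground_ball"],
})

pct = missing_percentages(sample)
cleaned = drop_and_impute(sample)
print(pct)
print(cleaned)
```

Median imputation keeps the column usable for summaries, but with roughly half of bat_speed missing it also compresses the column's variance, so results based on the filled values should be read with caution.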
Our exploration extended to understanding categorical variables, which revealed intriguing insights. The dataset included 99 unique player names, emphasizing the diverse range of contributors during the postseason. Additionally, there were 11 distinct pitch types, with Four-Seam Fastballs (FF) being the most common, and 17 event types, with field_out occurring most frequently (387 times). Frequency distributions also highlighted that the if_fielding_alignment and of_fielding_alignment fields were predominantly set to "Standard," reflecting consistent defensive strategies across games.
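Counts like these can be produced with a small helper around value_counts. The sketch below is illustrative only: the helper name summarize_categorical is hypothetical and the sample values are made up.

```python
import pandas as pd


def summarize_categorical(df: pd.DataFrame, column: str) -> dict:
    """Number of unique values and the most frequent value in a column."""
    counts = df[column].value_counts()  # missing values are excluded by default
    if counts.empty:
        return {"n_unique": 0, "most_common": None, "most_common_count": 0}
    return {
        "n_unique": int(counts.size),
        "most_common": counts.index[0],
        "most_common_count": int(counts.iloc[0]),
    }


# Made-up sample, not the project data
sample = pd.DataFrame({"pitch_type": ["FF", "SL", "FF", "CH", "FF", "SL", None]})
summary = summarize_categorical(sample, "pitch_type")
print(summary)
```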
Further analysis focused on summarizing numeric columns, identifying outliers, and exploring correlations. Descriptive statistics provided central tendencies and variability, such as a mean release_speed of 91.3 mph and a standard deviation of 4.5 mph. Outliers, like those in launch_speed (16) and release_spin_rate (187), signaled either exceptional plays or potential errors in data collection, warranting deeper investigation. Correlation analysis uncovered a weak positive relationship (0.22) between release_speed and launch_angle, suggesting only a slight link between pitch mechanics and batted ball trajectories.
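A minimal sketch of these three steps is shown below. It assumes the common 1.5 × IQR rule for flagging outliers, which may differ from the rule used in the project, and it runs on a made-up sample rather than the real data.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return int(mask.sum())


# Made-up sample, not the project data
sample = pd.DataFrame({
    "release_speed": [88.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0],
    "launch_speed": [88.0, 90.0, 91.0, 92.0, 93.0, 95.0, 140.0],
    "launch_angle": [10.0, 25.0, -5.0, 30.0, 12.0, 18.0, 22.0],
})

# Central tendency and spread for one numeric column
summary = sample["release_speed"].agg(["mean", "std"])

# Outlier count for launch_speed under the IQR rule
n_outliers = iqr_outlier_count(sample["launch_speed"])

# Pearson correlation; rows with a missing value in either column are skipped
r = sample["release_speed"].corr(sample["launch_angle"])

print(summary, n_outliers, r)
```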
Finally, advanced metrics like bat_speed and launch_angle were partially available, with only 49.02% and 34.29% completeness, respectively. Despite their limited availability, these metrics provided valuable insights into key performance aspects. The dataset spanned games from October 1 to October 5, 2024, capturing a brief but focused snapshot of early postseason action. Together, these targeted questions and their answers informed a structured approach to evaluating and analyzing the dataset, setting the foundation for extracting meaningful insights.
The raw dataset, containing 3,733 rows and 36 columns, required significant cleaning and preparation before analysis. Several issues were identified, including missing values, redundant columns, inconsistent naming conventions, and potential encoding errors. To address these challenges, we implemented a structured data-cleaning pipeline in Python using pandas.
Certain columns were deemed redundant or irrelevant to the analysis and were dropped. hit_location and bb_type were removed due to high levels of missing data. game_type and game_pk were excluded because they did not add analytical value. release_pos_y, release_extension, and effective_speed were removed because they either lacked sufficient data or were redundant (e.g., effective_speed correlated strongly with pitch_speed). Columns with partial missing data, such as bat_speed and launch_angle, were retained because they provided valuable insights into player performance.
To improve readability and align column names with their content, several columns were renamed: release_speed became pitch_speed(MPH), launch_speed became exit_velocity, and description became pitch_description. The player_name column was also split into two new columns, pitcher_first_name and pitcher_last_name, facilitating a more granular analysis of player performance.
Some text fields, such as outcome_description, contained encoding issues that caused improper character rendering. A custom function was applied to resolve these issues and standardize the text. To enhance the analytical depth, an average pitch speed column was added, calculated per pitcher across all games. This metric provided an aggregate view of a pitcher's performance.
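The drop, rename, split, and aggregate steps can be sketched as one pandas function. This is an illustration under stated assumptions rather than the project's actual pipeline: it assumes player_name is stored as "Last, First", the new column name avg_pitch_speed is hypothetical, and the sample rows are made up. The encoding fix is not shown because the report does not describe the custom function.

```python
import pandas as pd

# Columns the report lists as dropped
DROP_COLUMNS = [
    "hit_location", "bb_type", "game_type", "game_pk",
    "release_pos_y", "release_extension", "effective_speed",
]

# Renames the report lists
RENAME_MAP = {
    "release_speed": "pitch_speed(MPH)",
    "launch_speed": "exit_velocity",
    "description": "pitch_description",
}


def clean_pitches(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, rename, split pitcher names, add per-pitcher mean speed."""
    out = df.drop(columns=DROP_COLUMNS, errors="ignore").rename(columns=RENAME_MAP)

    # Assumption: names are formatted as "Last, First"
    names = out["player_name"].str.split(",", n=1, expand=True)
    out["pitcher_last_name"] = names[0].str.strip()
    out["pitcher_first_name"] = names[1].str.strip()

    # Hypothetical column name for the per-pitcher average
    out["avg_pitch_speed"] = (
        out.groupby("player_name")["pitch_speed(MPH)"].transform("mean")
    )
    return out


# Made-up rows with placeholder names, not the project data
raw = pd.DataFrame({
    "player_name": ["Doe, John", "Doe, John", "Roe, Sam"],
    "release_speed": [96.0, 98.0, 90.0],
    "description": ["ball", "called_strike", "foul"],
    "game_pk": [1, 1, 2],
})

cleaned = clean_pitches(raw)
print(cleaned)
```

Using transform("mean") rather than a separate aggregate keeps one row per pitch, so the per-pitcher average sits alongside each individual pitch speed.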
After cleaning, the dataset was reduced to 29 columns, each relevant and ready for analysis. Missing data was appropriately addressed, with heavily incomplete columns removed and others retained for further exploration. Encoding issues and naming inconsistencies were resolved, improving the dataset's accessibility.
Two of the key metrics explored were pitch speed (MPH) and spin rate, which are critical to understanding a pitcher’s effectiveness. The mean pitch speed was approximately 92.5 MPH, with fastballs ("FF") contributing the highest speeds. Spin rates varied widely, with an average of 2,200 RPM. Spin rate correlated positively with certain pitch outcomes, such as swings and misses, indicating its importance in deceptive pitching.
For batted balls, exit velocity and launch angle were evaluated to assess their impact on offensive outcomes. Balls hit with an exit velocity above 95 MPH and a launch angle between 15 and 30 degrees were most likely to result in extra-base hits or home runs. Ground balls, characterized by low launch angles (<10 degrees), often resulted in outs, except at exceptionally high exit velocities.
The dataset included 11 unique pitch types, with "FF" (Four-Seam Fastball) being the most common, followed by "SL" (Slider) and "CH" (Changeup). A breakdown of pitch effectiveness showed that sliders yielded the highest swing-and-miss rates but were also prone to being hit hard when contact was made, while changeups were most effective in generating ground balls.
The if_fielding_alignment and of_fielding_alignment columns predominantly featured "Standard" alignments, accounting for 85% of the entries. Alternative shifts, such as "Infield Shift," were used strategically against specific hitters but had varying success rates. Graphical analysis of alignment effectiveness highlighted how shifts impacted batting averages.
A key aspect of postseason play is individual performance under pressure. Among the 99 unique players, pitchers with the lowest ERA in high-leverage situations showed a consistent ability to generate weak contact or strikeouts, and batters with the highest exit velocities had the most postseason RBIs, showcasing the impact of power hitting in critical games.
Finally, the event types column was explored to identify trends in game-defining plays. "Home Run" events occurred in 12% of games, often turning the tide in tight matchups. A bar chart of event frequencies indicated that strikeouts and groundouts were the most common outcomes, reflecting the dominance of pitching in postseason baseball.
Our report contained four advanced visualizations. The first is a scatter plot, “Pitch Speed vs Spin Rate,” colored by pitch type.
This scatter plot shows the relationship between pitch speed and spin rate, two critical metrics that influence pitch effectiveness. Each point represents an individual pitch, and the color represents the pitch type. Fastballs (e.g., Four-Seam Fastballs, "FF") are clustered at higher pitch speeds (90-100 mph) and moderate spin rates. High-speed pitches are harder for batters to react to but may be easier to hit if they lack movement (spin). Curveballs (e.g., "CU") display slower speeds (70-80 mph) but higher spin rates, which creates dramatic movement, often deceiving hitters. Sliders (e.g., "SL") occupy a middle ground, with moderate speeds and spin, used strategically to generate swings-and-misses. This first visualization illustrates how pitchers use spin rate and speed to vary their pitches, keeping hitters off balance.
The second visualization is a bar chart that shows “Average Pitch Number in Each At-Bat by Pitch Type.”
This displays the average pitch number for each pitch type, indicating when during an at-bat a particular pitch is most often thrown. Fastballs are typically thrown earlier in at-bats (average pitch number ~2.3). This aligns with their role in establishing the count and setting up off-speed pitches. Changeups (e.g., "CH") and sliders are more common later in at-bats (average pitch number ~3.2), likely used as "out" pitches when the pitcher needs a strikeout or weak contact. The strategic deployment of pitches highlights the importance of sequencing in a pitcher’s approach. Understanding pitch sequence can aid teams in preparing scouting reports and adjusting strategies during a game.
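The values behind this bar chart come from a single group-by. The sketch below assumes the pitch-within-at-bat counter is stored in a column named pitch_number (the report does not name the column) and uses made-up rows.

```python
import pandas as pd


def average_pitch_number(df: pd.DataFrame) -> pd.Series:
    """Mean position in the at-bat for each pitch type, earliest first."""
    return df.groupby("pitch_type")["pitch_number"].mean().sort_values()


# Made-up sample, not the project data
sample = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "SL", "SL", "CH"],
    "pitch_number": [1, 2, 3, 3, 4, 3],
})

avg_by_type = average_pitch_number(sample)
print(avg_by_type)
```

Calling avg_by_type.plot(kind="bar") would turn the result into a bar chart, provided matplotlib is installed.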
The third visualization uses a box plot to show “Hit Distance by Pitch Type.”
This visualization provides insights into how different pitches affect the quality of contact by hitters. In the data we analyzed, fastballs show a wider range and higher median hit distance, reflecting their susceptibility to being hit hard when poorly located. Curveballs and sliders result in shorter hit distances, indicating their effectiveness in inducing weak contact or limiting hard hits. Outliers (extremely long hit distances) represent standout plays like home runs or deep fly balls. This third visualization underscores the importance of pitch location and movement in limiting a batter’s power.
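The quartiles that a box plot draws can also be computed directly, which makes the comparison between pitch types easier to check. In this sketch the column name hit_distance_sc is an assumption (the report does not name the hit-distance column), and the rows are made up.

```python
import pandas as pd


def box_stats(df: pd.DataFrame,
              value_col: str = "hit_distance_sc",
              group_col: str = "pitch_type") -> pd.DataFrame:
    """First quartile, median, and third quartile of value_col per group."""
    grouped = df.dropna(subset=[value_col]).groupby(group_col)[value_col]
    return grouped.agg(
        q1=lambda s: s.quantile(0.25),
        median="median",
        q3=lambda s: s.quantile(0.75),
    )


# Made-up sample, not the project data
sample = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "SL", "SL", "SL", "SL"],
    "hit_distance_sc": [150.0, 250.0, 350.0, 100.0, 120.0, 140.0, None],
})

stats = box_stats(sample)
print(stats)
```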
The fourth and final visualization shows the “MLB Pitch Zone Heat Map.”
The heat map showcases the frequency of pitches thrown to different zones within and around the strike zone, providing insights into pitcher tendencies. The lower-middle zones (zones 4, 5, and 6) are the most targeted areas, as pitches here are more likely to induce ground balls or weak contact. High pitches (zones 1-3) are used sparingly, likely as a strategy to induce swings and misses on fastballs or to exploit specific batter weaknesses. The clear patterns in the heat map highlight how pitchers tailor their approach to maximize effectiveness against different hitters. By focusing on frequently targeted zones, this visualization reveals pitching strategies used during high-pressure situations in postseason games.
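The grid behind such a heat map can be built by counting pitches per zone. This sketch makes two assumptions that the report does not state: the zone number is stored in a column named zone, and zones 1 to 9 are laid out row by row from the top of the strike zone to the bottom. Zones outside 1 to 9 are ignored here, and the sample is made up.

```python
import pandas as pd


def zone_grid(df: pd.DataFrame, zone_col: str = "zone") -> list:
    """3x3 grid of pitch counts for zones 1-9 (top row first, assumed layout)."""
    zones = df[zone_col].dropna().astype(int)
    counts = zones.value_counts().reindex(range(1, 10), fill_value=0)
    return counts.to_numpy().reshape(3, 3).tolist()


# Made-up sample, not the project data; 14 stands in for a zone outside the grid
sample = pd.DataFrame({"zone": [5, 5, 4, 6, 1, 5, 14, None]})

grid = zone_grid(sample)
for row in grid:
    print(row)
```

The nested list can be passed to a plotting function such as matplotlib's imshow to render the heat map.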
One thing we were curious about was how variables like spin rate, pitch type, and exit velocity correlated with successful pitching outcomes. We examined this by computing the correlations between these variables and comparing them with the performance data we found on the pitchers. We discovered that lower spin rates are associated with higher hitter success rates, and vice versa. Another question we had was whether certain batters would improve their performance against specific fielding alignments or against certain pitchers. Combining the success events with the pitchers' performance data, we found that a hitter's success rate depends on how the infield alignment interacts with that hitter's tendencies. For example, a left-handed hitter with a tendency to pull the ball (hit to the right side) is more likely to have a lower success rate against an infield shade or strategic positioning than a hitter who hits to each field at a more balanced rate. After putting specific hitters and pitchers head to head, we concluded that left-handed hitters facing left-handed pitchers have a tougher matchup than any other combination.
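A handedness comparison of this kind can be sketched as a grouped hit rate. The sketch rests on several assumptions that are not in the report: batter and pitcher handedness are stored in columns named stand and p_throws, "success" means the plate appearance ended in a hit, and only rows with a recorded event are counted. The rows are made up and do not reproduce the project's result.

```python
import pandas as pd

# Assumed definition of a successful outcome for the hitter
HIT_EVENTS = {"single", "double", "triple", "home_run"}


def hit_rate_by_handedness(df: pd.DataFrame) -> pd.Series:
    """Share of recorded events that were hits, per (batter side, pitcher hand)."""
    ended = df.dropna(subset=["events"]).copy()
    ended["is_hit"] = ended["events"].isin(HIT_EVENTS)
    return ended.groupby(["stand", "p_throws"])["is_hit"].mean()


# Made-up sample, not the project data
sample = pd.DataFrame({
    "stand": ["L", "L", "L", "R", "R", "R"],
    "p_throws": ["L", "L", "R", "R", "L", "L"],
    "events": ["field_out", "strikeout", "single", "home_run", "field_out", None],
})

rates = hit_rate_by_handedness(sample)
print(rates)
```

With only a few days of games in the dataset, some handedness pairs will have very few events, so the group sizes should be reported alongside the rates.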
Our analysis of the MLB player performance during the 2024 postseason reveals significant patterns between pitch characteristics and player outcomes. Specifically, players facing higher pitch speeds, particularly fastballs, demonstrated stronger performance metrics such as increased exit velocities. Additionally, pitchers with higher spin rates were found to generate more effective pitches, leading to better outcomes in the postseason. These findings suggest that teams should prioritize pitchers with high spin rates and fastball capabilities to maximize their performance in high-stakes games. By understanding the relationship between pitch speed, spin rate, and player performance, teams can make more informed decisions during the postseason, ultimately improving their chances of success. This analysis highlights the importance of using advanced metrics to evaluate player and pitcher effectiveness, ensuring that teams are making data-driven decisions that enhance their overall performance.
While this analysis offers valuable insights into MLB postseason performance, there are several avenues for future research and development that could further enhance our understanding of the game. One potential direction for expanding this analysis is the inclusion of additional data sources. For example, incorporating player fatigue data, such as workloads from previous games or the number of innings pitched, could offer more accurate insights into player performance under pressure. Weather conditions, including temperature and humidity, also play a significant role in game outcomes and could be factored into future models. Furthermore, the impact of defensive shifts on batting outcomes could be explored in more depth, particularly as teams continue to evolve their strategies to counter specific hitters. Combining these additional variables would allow for a more holistic view of player performance, potentially unveiling new trends that influence postseason success.
Another exciting direction is the application of predictive modeling and machine learning techniques to forecast future player performance in postseason games. By training algorithms on historical data, we could predict outcomes such as a player's likelihood to hit a home run or a pitcher's potential to induce strikeouts based on specific pitch types or game situations. Simulations could also be conducted to model different scenarios, such as how a team might perform with different player lineups or against specific pitchers. These tools could be used not only for research purposes but also for teams to develop more informed strategies and scouting reports. Incorporating these elements would significantly enhance the analytical depth of future studies and provide actionable insights for teams and analysts seeking to optimize their postseason performance.