Goals:
1. Create a way to highlight defensive performance on the individual level with spray chart incorporated (good visual)
2. Make something in R that can be iterated through multiple different players and statistics
3. Use some of what I have learned over the past two months in class to contribute to something that I would like to see more of (fielding reports)
4. Continue to test myself in R with what I can produce. Trial and error baby.
batted_ball_data <- statcast_search(start_date = "2024-08-18", end_date = "2024-08-24")
out_outcomes <- c("field_out", "force_out", "sac_bunt", "fielders_choice_out", "other_out",
"grounded_into_double_play", "sac_fly", "double_play", "sac_fly_double_play")
out_prob_dash <-batted_ball_data %>%
mutate(location = as.factor(hit_location),
infield_align = as.factor(if_fielding_alignment),
of_align = as.factor(of_fielding_alignment),
description = as.factor(description),
out_hit = ifelse(events %in% out_outcomes, 1, 0)) %>%
select(launch_speed, launch_angle, home_team, away_team, inning_topbot, estimated_woba_using_speedangle,
location, bat_speed, swing_length, release_speed, description, delta_home_win_exp, delta_run_exp,hit_distance_sc, infield_align, of_align, hc_x, hc_y, events, out_hit, pitcher, fielder_2, fielder_3, fielder_4, fielder_5, fielder_6, fielder_7, fielder_8, fielder_9, game_date, bb_type) %>%
filter(description == "hit_into_play")
out_prob_dash <- out_prob_dash %>%
mutate(defteam = case_when(
inning_topbot == "Top" ~ home_team,
inning_topbot == "Bot" ~ away_team
))
fielder_mapping <- function(location) {
case_when(
location == 1 ~ "fielder_1",
location == 2 ~ "fielder_2",
location == 3 ~ "fielder_3",
location == 4 ~ "fielder_4",
location == 5 ~ "fielder_5",
location == 6 ~ "fielder_6",
location == 7 ~ "fielder_7",
location == 8 ~ "fielder_8",
location == 9 ~ "fielder_9",
TRUE ~ "unknown_fielder"
)
}
out_prob_dash <- out_prob_dash %>%
mutate(fielder = sapply(location, fielder_mapping))
out_prob_dash <- out_prob_dash %>%
mutate(key_mlbam = case_when(
location == 1 ~ pitcher,
location == 2 ~ fielder_2,
location == 3 ~ fielder_3,
location == 4 ~ fielder_4,
location == 5 ~ fielder_5,
location == 6 ~ fielder_6,
location == 7 ~ fielder_7,
location == 8 ~ fielder_8,
location == 9 ~ fielder_9,
TRUE ~ NA
)) %>%
mutate(key_mlbam = as.factor(key_mlbam))
# creating a vector for the player registry
player_name <- chadwick_player_lu()
# selecting relevant columns in this data
player_name <- player_name %>%
select(key_mlbam, name_last, name_first, birth_year) %>%
mutate(key_mlbam = as.factor(key_mlbam)) %>%
filter(!is.na(key_mlbam),
!is.na(birth_year))
# joining the data on the key_mlbam and including player names
list_with_names <- right_join(out_prob_dash, player_name, "key_mlbam") %>%
na.omit()
final_out_prob <- list_with_names %>%
select(launch_speed, launch_angle, estimated_woba_using_speedangle,
location, release_speed, hc_x, hc_y, description, delta_home_win_exp, delta_run_exp,
hit_distance_sc, infield_align, of_align, events, out_hit, pitcher, game_date,
key_mlbam, name_last, name_first, out_hit, bb_type, defteam)
# comnbining first and last name
final_out_prob$name_full <- paste(list_with_names$name_first, list_with_names$name_last)
# this is the model we will use to get our predicted out probabilities
out_model <- glm(out_hit ~ launch_speed + launch_angle + estimated_woba_using_speedangle +
location +
hit_distance_sc + infield_align + hc_x + hc_y + bb_type,
data = final_out_prob,
family = binomial())
summary(out_model)
##
## Call:
## glm(formula = out_hit ~ launch_speed + launch_angle + estimated_woba_using_speedangle +
## location + hit_distance_sc + infield_align + hc_x + hc_y +
## bb_type, family = binomial(), data = final_out_prob)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.476465 256.744527 0.025 0.979875
## launch_speed 0.031349 0.004547 6.895 5.39e-12 ***
## launch_angle -0.013556 0.004153 -3.264 0.001097 **
## estimated_woba_using_speedangle -5.209523 0.270008 -19.294 < 2e-16 ***
## location2 0.718266 0.722170 0.995 0.319934
## location3 0.806410 0.353213 2.283 0.022426 *
## location4 1.114143 0.333354 3.342 0.000831 ***
## location5 0.266520 0.290409 0.918 0.358755
## location6 0.995840 0.317761 3.134 0.001725 **
## location7 -3.922872 0.481062 -8.155 3.50e-16 ***
## location8 -3.758301 0.489817 -7.673 1.68e-14 ***
## location9 -4.794911 0.489349 -9.799 < 2e-16 ***
## hit_distance_sc 0.025683 0.001917 13.395 < 2e-16 ***
## infield_alignInfield shade -11.848985 256.741227 -0.046 0.963190
## infield_alignStandard -11.731634 256.741209 -0.046 0.963554
## infield_alignStrategic -11.811571 256.741253 -0.046 0.963306
## hc_x 0.008608 0.003682 2.338 0.019380 *
## hc_y 0.028127 0.005124 5.490 4.02e-08 ***
## bb_typeground_ball -1.543769 0.408181 -3.782 0.000156 ***
## bb_typeline_drive -1.677490 0.189683 -8.844 < 2e-16 ***
## bb_typepopup 0.717697 1.088981 0.659 0.509861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4788.3 on 3947 degrees of freedom
## Residual deviance: 2267.4 on 3927 degrees of freedom
## AIC: 2309.4
##
## Number of Fisher Scoring iterations: 12
final_out_prob$predicted_prob <- predict(out_model, newdata = final_out_prob, type = "response")
# we need to create a vector for l teams abbreviations and match them to the full team name
team_name_full <- data.frame(
defteam = c("AZ", "ATL", "BAL", "BOS", "CHC", "CWS", "CIN", "CLE", "COL", "DET",
"HOU", "KC", "LAA", "LAD", "MIA", "MIL", "MIN", "NYM", "NYY", "OAK",
"PHI", "PIT", "SD", "SFG", "SEA", "STL", "TBR", "TEX", "TOR", "WSN"),
team = c("diamondbacks", "braves", "orioles", "red sox", "cubs", "white sox", "reds",
"guardians", "rockies", "tigers", "astros", "royals", "angels", "dodgers",
"marlins", "brewers", "twins", "mets", "yankees", "athletics", "phillies",
"pirates", "padres", "giants", "mariners", "cardinals", "rays", "rangers",
"blue jays", "nationals")
)
# joining in the full team names to use on spraychart
final_out_prob <- final_out_prob %>%
left_join(team_name_full, by = "defteam")
# narrowing down our score and grouping by player name
grouped_by_player <- final_out_prob %>%
select(team, name_full, key_mlbam, events, estimated_woba_using_speedangle, delta_home_win_exp, delta_run_exp,location, hc_x, hc_y, predicted_prob, out_hit, game_date) %>%
group_by(name_full)
# This will turn the x and y coordinates to be representative of proper distance from home plate
grouped_by_player <- mlbam_xy_transformation(grouped_by_player)
players_to_use <- c("571970") #!! Replace with your specific player IDs !!
# Filter the data for the specified players
filtered_data <- grouped_by_player %>%
filter(key_mlbam %in% players_to_use) %>%
mutate(out_hit = as.factor(out_hit))
filter <- final_out_prob %>%
filter(key_mlbam %in% players_to_use) %>%
mutate(out_hit = as.factor(out_hit))
has_field_errors <- filter %>%
filter(events == "field_error") %>%
nrow() > 0
# Create the table_grob and add title if there are field errors
error_add_grob <- if (has_field_errors) {
table_data <- filter %>%
filter(events == "field_error") %>%
rename(weighted_error = delta_home_win_exp,
delta_runexp = delta_run_exp) %>%
select(weighted_error, delta_runexp)
}
# Create the error table if there are field errors
error_add_grob <- if (has_field_errors) {
tableGrob(
filter %>%
filter(events == "field_error") %>%
rename(weighted_error = delta_home_win_exp,
delta_runexp = delta_run_exp) %>%
select(weighted_error, delta_runexp)
)
} else {
NULL
}
This is the code chunk for the table in our dashboard, where we will look at the readily available Baseball Savant leaderboard information. This will allow us to make a table to put at the bottom of our dashboard.
# we are going to scrape savant for directional and overall outs above average
savant <- baseballr::scrape_savant_leaderboards(leaderboard = "outs_above_average", min_field = 100, year = 2024)
savant_dir_out <- baseballr::scrape_savant_leaderboards(leaderboard = "directional_oaa", min_field = 5, year = 2024)
# we join these so that each player_id has all information
total_savant <- right_join(savant, savant_dir_out, by = "player_id")
# ((OUTFIELDERS)) doing some clean up for what we want to look at and what our columns should be called
total_savant_of <- total_savant %>%
select(player_id, `last_name, first_name.x`, primary_pos_formatted, outs_above_average, attempts, fielding_runs_prevented,
n_oaa_slice_back_all, display_team_name,
n_oaa_slice_in_all) %>%
rename(name = `last_name, first_name.x`,
Team = display_team_name,
position = primary_pos_formatted,
OAA = outs_above_average,
Attempts = attempts,
FRP = fielding_runs_prevented,
OAA_Back = n_oaa_slice_back_all,
OAA_In = n_oaa_slice_in_all)
player_stats <- total_savant_of %>%
filter(player_id == 641658) %>% # !! adjust with the proper player_id !!
select(Team, position, OAA, Attempts, FRP, OAA_Back, OAA_In)
table_grob <- tableGrob(player_stats)
average_by_position <- total_savant_of %>%
group_by(position) %>%
filter(position == "CF") %>% # filter on the position of the player to show comparison
summarise(
Avg_OAA = mean(OAA, na.rm = TRUE),
Avg_Attempts = mean(Attempts, na.rm = TRUE),
Avg_FRP = mean(FRP, na.rm = TRUE),
Avg_OAA_Back = mean(OAA_Back, na.rm = TRUE),
Avg_OAA_In = mean(OAA_In, na.rm = TRUE))
average_by_position_rounded_of <- average_by_position %>%
mutate(across(c(Avg_OAA, Avg_Attempts, Avg_FRP, Avg_OAA_Back, Avg_OAA_In), ~ round(.x, 2)))
# Create a table grob with the rounded values
table_grob_avg_of <- tableGrob(average_by_position_rounded_of)
grid.draw(table_grob_avg_of)
# ((INFIELDERS)) doing some clean up for what we want to look at and what our columns should be called
total_savant_infield <- savant %>%
select(player_id, `last_name, first_name`, primary_pos_formatted, outs_above_average, fielding_runs_prevented, outs_above_average_lateral_toward3bline, outs_above_average_lateral_toward1bline, display_team_name) %>%
rename(name = `last_name, first_name`,
Team = display_team_name,
position = primary_pos_formatted,
OAA = outs_above_average,
FRP = fielding_runs_prevented,
OAA_to_1st = outs_above_average_lateral_toward1bline,
OAA_to_3rd = outs_above_average_lateral_toward3bline)
player_stats <- total_savant_infield %>%
filter(player_id == 571970) %>% # !! adjust with the proper player_id !!
select(Team, position, OAA, FRP, OAA_to_1st, OAA_to_3rd)
# Display the table
table_grob <- tableGrob(player_stats)
# creating a way to see difference between the player and league averages for same criteria
average_by_position <- total_savant_infield %>%
group_by(position) %>%
filter(position == "3B") %>% # filter on the position of the player to show comparison
summarise(
Avg_OAA = mean(OAA, na.rm = TRUE),
Avg_FRP = mean(FRP, na.rm = TRUE),
Avg_OAA_to_1st = mean(OAA_to_1st, na.rm = TRUE),
Avg_OAA_to_3rd = mean(OAA_to_3rd, na.rm = TRUE))
average_by_position_rounded_if <- average_by_position %>%
mutate(across(c(Avg_OAA, Avg_FRP, Avg_OAA_to_1st, Avg_OAA_to_3rd), ~ round(.x, 2)))
# Create a table grob with the rounded values
table_grob_avg_if <- tableGrob(average_by_position_rounded_if)
grid.draw(table_grob_avg_if)
This code chunk uses MLB.com to find the headshots for each player that we want to look at. Insert the player_id into the key_mlbam at the bottom to save the headshot as a distinct plot.
# get the MLB headshot for each player I am looking at, copy and paste the key_mlbam in the bottom of the code
player_headshot_plot <- function(key_mlbam) {
# Define a grob for the image
headshot_grob <- function() {
url <- paste0('https://img.mlbstatic.com/mlb-photos/image/',
'upload/d_people:generic:headshot:67:current.png/',
'w_640,q_auto:best/v1/people/', key_mlbam,
'/headshot/silo/current.png')
temp_file <- tempfile(fileext = ".png")
download.file(url, temp_file, mode = "wb")
img <- readPNG(temp_file)
rasterGrob(img, width = unit(1, "npc"), height = unit(1, "npc"))
}
ggplot() +
annotation_custom(headshot_grob(), xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
theme_void()
}
# Create the standalone headshot plot by inserting the player_id into the key_mlbam
headshot_plot <- player_headshot_plot(key_mlbam = "571970") ## !!!!
headshot_plot_resized <- ggdraw() +
draw_plot(headshot_plot, x = .05, y = -1.25, width = .21, height = 2)
This will allow us to create a de facto text box for the best play of the week that we can highlight to the audience
# Load necessary libraries
library(ggplot2)
library(cowplot) # For ggdraw() and plot_grid()
game_date_plot <- " August 24th "
predicted_prob_plot <- " 72.6% "
dashboard_text <- ggplot() +
geom_text(
aes(
x = -0.50,
y = 0.5,
label = paste(
"Best play of the Week came on\n",
game_date_plot,
"\nwhere the out probability was\n",
predicted_prob_plot
)
),
size = 5,
hjust = 0,
vjust = 0.5,
fontface = "bold"
) +
theme_void() +
theme(plot.margin = margin(.5, .5, .5, .5, "cm"))
# creating a name plot for the final dashboard
name_plot <- ggplot() +
geom_text(
aes(x = 0.15, y = 0.5, label = "Max Muncy"), ## !!
size = 13, # Adjust size as needed
hjust = 0.5,
vjust = 0.5,
fontface = "bold"
) +
theme_void()
This will combine all the plots for a more finished product and will let you customize the appearance of each.
# this is the code for the player spray chart for the selected player
player_spray_gg <- ggplot(filtered_data, aes(x = hc_x_, y = hc_y_, color = predicted_prob, shape = out_hit)) +
geom_spraychart(stadium_ids = unique(filtered_data$team),
stadium_transform_coords = TRUE,
stadium_segments = "all", size = 5,
stadium_aes = TRUE) +
geom_point(size = 3, alpha = 0.6) +
# stat_density2d(color='gray') +
scale_shape_manual(
values = c(17, 16), # Manually specify the shape values
labels = c("Out", "Hit"), # Labels for the legend
breaks = c(1, 0) # Order in the legend
) +
scale_color_gradient(low = "lightpink", high = "darkblue", name = "Predicted Probability") +
labs(
title = " Defensive Spray Chart \n with Predicted Out Probability and Outcomes",
subtitle = " Week of August 18 - August 25"
) +
theme(
plot.title = element_text(size = 20, face = "bold"),
plot.subtitle = element_text(size = 14, color = "blue", margin = margin(t = 10))
) +
theme_void() +
coord_fixed() +
facet_wrap(~team) +
theme(legend.position = "bottom",
plot.title = element_text(size = 18, face = "bold"),
strip.text = element_text(size = 12, face = "bold"))
full_plot <- ggplot() +
theme_void() +
coord_fixed(ratio = 1) # Adjust ratio if needed to fit your plots
# Use draw_plot to place individual plots
finished_dash <- ggdraw(full_plot) +
draw_plot(headshot_plot_resized, x = .01, y = 0.6, width = 1, height = 0.2) +
draw_plot(dashboard_text, x = -.18, y = -0.0, width = 0.5, height = 0.5) +
draw_plot(player_spray_gg, x = 0.2, y = 0.2, width = .75, height = .75) +
draw_plot(table_grob, x = 0.3, y = -.1, width = 0.5, height = 0.5) +
draw_plot(table_grob_avg_if, x = 0.3, y = -.2, width = 0.5, height = 0.5) + #have to change this depending on infielder or outfielder
draw_plot(name_plot, x = -.080, y = 0.55, width = 0.5, height = 0.5)
# Conditionally add `error_add_grob` if it is not NULL
if (!is.null(error_add_grob)) {
finished_dash <- finished_dash +
draw_plot(as_grob(error_add_grob), x = 0.6, y = .4, width = 0.5, height = 0.3)
}
ggsave(filename = "muncy_max_finished_dash.jpg", plot = finished_dash, width = 10, height = 6, units = "in", dpi = 300)
I continue to be drawn to the defensive side of the game and wonder what different ways I can tinker with creating visualizations that can tell a story about a certain player. I decided to create a defensive fielding dashboard with data from BaseballSavant through the baseballR package in R. My code can be found at the GitHub link and also is listed above. Let's quickly dive into the process:
Process and Structure:
Baseball Savant Statcast_search
First, I collected data for each fielder for the last week. This is something that I will continue to update over the course of the season and hopefully find an effective way to query more data using the statcast_search function. I created a vector that highlighted different outcomes of play that resulted in outs, which will be important in making our linear regression model for predicting outs.
After cleaning the data a lot to return the desired columns, as well as creating ways to see which team is on defense and what player the ball was hit to, I ran a basic linear regression model for out prediction. My limitations will come up later in the write up, but the out model factored in different categories that I thought would play a factor in determining what would be an out or not. Listed out they include:
Launch _speed -> exit velocity
Launch_angle
Estimated_woba_using_speedangle -> xWOBA
Location -> to which fielder the ball was hit
Hit_distance_sc -> the distance of the hit
Infield_align -> if there was a specific shading of the infielders
Hc_x -> batted ball coordiante (x)
Hc_y -> batted ball coordinate (y)
Bb_type -> ex. (line drive, ground ball, fly ball)
The results of the model are shown in the code and in the markdown file. Using launch speed, launch angle, and estimated wOBA may seem redundant, but I was curious to see if any of these played a bigger part individually.
I then used the model to run predicted values for outs and added that to my data. Grouping the data by each player would allow me to make my dashboards.
I wanted to find a way to make an illustration that would be both interesting and informational. There have been a bunch of graphics published by many people on hitting and pitching statistics (just recently read TJStats' pitching summary tutorial which I used for inspiration in parts of my graphics) and I wanted a way to engage a user more so than just spitting numbers at them. That is why I incorporated the spray chart aspect. I think the spray chart is a crucial, yet flawed piece of my dashboard. I also wanted to look at raw numbers for easier comparison, which I also accomplished.
Baseball Savant Leaderboards
I used another function in baseballr to pull certain metrics pertaining to defensive fielding in the Savant leaderboard section. More well known metrics like Outs Above Average (OAA) and Fielding Runs Prevented (FRP) give the reader better context in seeing the big picture of the fielder's yearly performance. I also referenced league averages in each corresponding metric for fielders who have over 100 attempts at the respective position.
Headshot
I used a very similar code structure to what Thomas Nestico used for his headshot in Python, but made it available to run in R, with some help from Co-Pilot :) . This is another way to give the dashboard life and make a better connection through a visual and not just words.
Extras:
I had to teach myself how to use different functions in R to print the tables that I created and add in text. I also wanted to make the dashboard unique to what I have researched in defensive metrics. This is why some players, if they have committed an error that week, have another table to the right of their spray chart that shows the value lost in that play as well as the change in run expectancy. I have written a lot about this topic in my research. Read more here.
Room for Improvement:
There are numerous things I want to continue to iterate as I progress with these dashboards.
1. Appearance -
I am pretty happy for the most part with how the dashboards came together. Considering it was my first time using R for dashboards (and yes, it is not as easy or pretty as Tableau - maybe I'll try that next), I was happy that all of the main points came across with what I hoped to convey. I am happy that I could get each team's ballpark on the background, that the predicted probability scale is easy to read, and that the differentiator between out and hit is clean. I want to find better ways to position each chart, get rid of the index numbers on the tables, and make a more slick dashboard overall, but that will come with time. I also know that I created a binary for outs and hits as shown on the spray chart and there are other plays (like errors) that do not classify as either, but I will continue to fix that as well.
2. Code -
I am well aware that I will not win any prizes for most concise and efficient code for this project. Again, it was more of a test run to teach myself a lot of what these functions do and how they can be improved going forward. My code for some QA reading this is probably a nightmare, so I apologize. But, it worked for me! I will continue to clean my movements and work on learning more about the language to expedite all of this without bugs as I continue.
3. Data -
One of, if not the most difficult thing to learn about public defensive data is that it is very scarce and not very indicative. First, a better defensive model would include player positioning metrics. How fast is a player moving to get to a certain ball, where is he originally positioned, how much ground does he have to cover, etc. The day HawkEye data becomes public will be a glorious day for me. That is the real issue with my linear regression model. Using xWOBA as a focal point in the model is way better than nothing, but it is too inconsistent. There have been xWOBA's of 1.000+ that look like "routine" outs to some outfielders because they are positioned in a perfect spot. Second, the difference between infielder and outfielder data is hard to ignore. I used "location" as the basis for most of my research. There are many problems with this, but mostly the fact that outfielders get many more opportunities because the ball ends up in their hands first, meaning they are credited with the location. If a ball is hit to the left fielder, regardless of the chance that the third baseman or shortstop had to field the ball, the left fielder gets credited with the touch if he is the one to retrieve it. This creates a very inconsistent and unfair view between infielders and outfielders. Most infielders have a lot of triangles on their charts (signifying outs) because the only time it is credited with a hit is if the fielder cannot make a play on a ball he fielded. So, yes, very flawed on that front.
Conclusions:
Overall, I am happy with the production of the dashboard. I think it is an interesting and relatively informative tool to use when looking at defensive player performance over the selected time frame. I hope that I will be able to clean up the appearance, process, and data as I get more comfortable with R and even look into using different software (Tableau and Python) in the future. I will be on the hunt for more data sources that can give me a more accurate representation of a fielder's overall ability, but for now it is a good start to get out in the eyes of others. Keep following along on Twitter/X @DigestDuffy and on this website. Would love to hear any thoughts and suggestions that you may have!
-Drew