library(tidyverse)
library(stringr)
library(ggplot2)
library(baseballr)
library(sjPlot)
library(vip)
# loading in the two CSV for usage, data from Fangraphs for fielding statistics (2022-2024, filter on 1B data)
sheet_1 <- read.csv("C:/Users/Drew Duffy/Downloads/MLB Def for Phillies (as of 8_15_24) - picks.csv")
sheet_2 <- read.csv("C:/Users/Drew Duffy/Downloads/MLB Def for Phillies (as of 8_15_24) - split.csv")
#joining the sheets on season and names
clean <- left_join(sheet_1, sheet_2, by = c("Season", "name_last", "name_first"))
#loading previous error data I have used
load("first_base.RData")
#renaming to season for a join
clean <- clean %>%
rename(game_year = "Season")
#joining the two datasets
total <- left_join(clean, first_base, by = c("game_year", "name_last", "name_first"))
#picking the columns I want to look at
total <- total %>%
select(game_year, name_last, name_first, G, wpa, age, OAA, DRS, FRV,
Scp, FE, TE, PO, A, avg_xwoba, sum_run_exp)
# assigning weights to each commonly used metric. WPA gets a x2 due to it's small values
def_rat_weights <- data.frame(
weight_OAA = c(".4"),
weight_DRS = c(".4"),
weight_wpa = c("2"))
#changing all from string to numeric
def_rat_weights <- def_rat_weights %>%
mutate(weight_OAA = as.numeric(weight_OAA),
weight_DRS = as.numeric(weight_DRS),
weight_wpa = as.numeric(weight_wpa))
#converting the values into new columns with their weighted value
weight_OAA <- def_rat_weights$weight_OAA
weight_DRS <- def_rat_weights$weight_DRS
weight_wpa <- def_rat_weights$weight_wpa
total$weighted_OAA <- total$OAA * weight_OAA
total$weighted_DRS <- total$DRS * weight_DRS
total$weighted_wpa <- total$wpa * weight_wpa
# Duffy Defensive Rating
ddr <- total %>%
mutate(duffy_d_rating = (weighted_OAA + weighted_OAA + sum_run_exp)) %>%
filter(!is.na(duffy_d_rating))
#finding the 3rd quartile to establish an above average threshold (that is well above 50%)
summary(ddr$duffy_d_rating)
abv_avg <- quantile(ddr$duffy_d_rating, 0.75)
# Add a binary classification based on the above average limit
ddr <- ddr %>%
mutate(is_above_avg = ifelse(duffy_d_rating > abv_avg, 1, 0))
#creating a linear model for above average first base defenders and seeing the trend for age
ddr_model <- lm(is_above_avg ~ weighted_OAA + weighted_DRS + weighted_wpa + age,
data = ddr)
summary(ddr_model)
plot_model(ddr_model, type = "pred")
Goal: Create a more composite way to view defensive performance on positional levels. I created a weighted metric that uses OAA, DRS, and weighted errors, my way of quantifying the impact of an error on a certain play
Process: I found data from FanGraphs on first basemen for 2022-2024 (as of August 15) seasons for OAA and DRS and used Statcast data for their error data.
After finding relevant columns to extract insights from, I then began work on their composite score. Since errors represent a small number of plays and its number values are so small to begin with (representing percentages), I used their multiplier as .2 while OAA and DRS both are .4 multipliers.
I set a threshold for the model at .75 which determined an "above average" defender. Then I ran the model to see which variables contribute most to that threshold. Not surprisingly, the OAA and DRS metrics have the most prominent impact (since they are weighed the highest). I also added age into the model, but nothing substantial came from it which is telling for first basemen as they tend to be a bit older than their infield counterparts. This bodes well for the position, as age is not a great determinant of wether the player trends toward becoming a better or worse fielder over his years.
Here, we see that Christian Walker is in fact the top defensive first baseman in the game and has been one of the best overall defenders in the past three seasons.
I am interested in the progression and evolution of Nathaniel Lowe as a player. He has been one of the only players to consistently improve his defensive rating year to year, wonder if it continues in the future.
Model for Defensive Rating Outputs