To the left is Bill James, arguably the godfather of the sports analytics revolution. He devised a way to estimate how many games a team should win based on the points it scores and the points scored against it. It is called the "Pythagorean Expectation" or "Pythagorean Win Percentage." It predicts a team's future win-loss percentage better than the team's past win-loss percentage does! Amazing!
It can also be used to determine which teams are lucky frauds and which are good teams that have won fewer games than expected.
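Here is the simple formula: win percentage = PF^k / (PF^k + PA^k), where PF is points for (points scored), PA is points against (points allowed), and k is an exponent.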
Bill James used k=2, so win percentage = PF^2 / (PF^2 + PA^2)... hence the similarity to the Pythagorean Theorem. We'll see later that different values for k are needed for different sports.
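As a rough sketch, the formula can be wrapped in a small R function. The function name pythag and the run totals below are made up for illustration; they are not from the class dataset.

```r
# Pythagorean win expectation.
# pf = points (or runs) scored, pa = points (or runs) allowed, k = exponent.
# Works on single numbers or on whole columns, because R arithmetic is vectorized.
pythag <- function(pf, pa, k = 2) {
  pf^k / (pf^k + pa^k)
}

# Made-up example: a team that scores 800 runs and allows 700.
pythag(800, 700)            # Bill James's k = 2, about 0.566
pythag(800, 700, k = 1.83)  # same team with a different exponent
```

A team that scores exactly as much as it allows comes out at 0.500 no matter which exponent is used, which is a handy sanity check.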
Let's try this. Start by downloading the data here. The file has data for the 2023 MLB and 2023 NFL seasons. Look through the columns: the ones that start with "f" are football, and the ones that start with "b" are baseball. Let's compute the Pythagorean win percentage for the baseball teams and compare it to the actual win percentage. We're going to start by copying our dataset into a shorter variable name to make things easier.
s <- sportsPythagorean
s$bPythag <- NA
s$bPythag <- #your code here
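How does that compare to the actual win percentages? Let's plot them against each other. The second line below draws the line y = x (intercept 0, slope 1), so a team sitting right on that line has a Pythagorean win percentage equal to its actual win percentage.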
plot(s$bPerc, s$bPythag)
abline(0,1)
Okay, now let's use a slightly better exponent. Baseball-reference.com uses 1.83 instead.
s$bPythag2 <- NA
s$bPythag2 <- #your code here
plot(s$bPerc, s$bPythag2)
abline(0,1)
Which seems to work better, in your opinion?
Analysis: Which teams seem to be undervalued and overvalued according to this statistic?
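One possible way to approach this question is to subtract the Pythagorean percentage from the actual percentage and sort by the difference. The sketch below uses a tiny made-up data frame; the team labels, the numbers, and the column names perc, pythag, and luck are all hypothetical stand-ins, not the real data.

```r
# Made-up teams, purely for illustration.
toy <- data.frame(
  team   = c("A", "B", "C"),
  perc   = c(0.60, 0.50, 0.45),  # actual win percentage
  pythag = c(0.55, 0.50, 0.52)   # Pythagorean win percentage
)

# Positive luck: the team won more than its scoring suggests (possibly overvalued).
# Negative luck: the team won less than its scoring suggests (possibly undervalued).
toy$luck <- toy$perc - toy$pythag

# Sort from luckiest to unluckiest.
toy[order(toy$luck, decreasing = TRUE), ]
```

On the plot, the teams with positive luck are the ones that sit to the right of the y = x line when actual percentage is on the horizontal axis.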
Compute a Pythagorean Win Expectation for football too. Plot it against the actual values. Then, experiment with the exponent value until you find one that you think is accurate. I'll tell you which one the pros use once you think you have a good value.
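If judging the fit by eye gets tedious, one option is to score each exponent with a number, such as the root mean squared error (RMSE) between actual and expected win percentage, and try a range of exponents. This is only a sketch on made-up numbers: the data frame toy, its columns pf, pa, and perc, and the helper rmse_for_k are hypothetical, and RMSE is just one reasonable choice of error measure.

```r
# Made-up teams: points for, points against, actual win percentage.
toy <- data.frame(
  pf   = c(450, 380, 300, 410),
  pa   = c(350, 390, 400, 330),
  perc = c(0.65, 0.47, 0.29, 0.65)
)

# RMSE between actual win percentage and the Pythagorean expectation for one k.
rmse_for_k <- function(k, pf, pa, actual) {
  expected <- pf^k / (pf^k + pa^k)
  sqrt(mean((actual - expected)^2))
}

# Try exponents from 1 to 4 in steps of 0.05.
ks <- seq(1, 4, by = 0.05)
errors <- sapply(ks, rmse_for_k, pf = toy$pf, pa = toy$pa, actual = toy$perc)

# The exponent with the smallest error on this toy data.
best_k <- ks[which.min(errors)]
best_k
```

To use this on the real data, swap in the football columns from s in place of the toy columns. Plotting errors against ks is a quick way to see how sharply the error changes as the exponent moves.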