As you're going through this exercise, keep everything in an RMarkdown file to save and send to me/refer to later! Template way down below.
Install and load the packages that you'll need to access and manipulate the data.
install.packages(c("tidyverse","ggrepel","nflreadr","nflplotR")) # once ever
# loading the libraries
library(tidyverse)
library(ggrepel)
library(nflreadr)
library(nflplotR)
Load the 2023 data:
data <- load_pbp(2023)
Last class, we loaded data, and it was a bit overwhelming! The 2023 data had ~50,000 plays with 372 features about each play. Maybe too much!!! Here's how to pull specific columns. There's a new symbol in there called a "pipe" (%>%). A pipe lets you sort out each piece of a complicated expression one step at a time instead of writing it as one big nested expression.
So basically, think of the expression not as function( of data ) but as data %>% into function ( )
OLD WAY THAT WOULD BE HARD TO LOOK AT:
View(select(data, home_team, away_team, posteam, desc, rush, pass))
NEW WAY THAT'S EASIER TO READ AND ADD TO:
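The piped snippet for this step is not reproduced here, so the following is only a minimal sketch of what the piped form looks like. It assumes a small made-up table called `toy` in place of the real play-by-play `data`, so it runs without downloading anything:

```r
library(dplyr)  # part of the tidyverse; provides select() and %>%

# Hypothetical three-play table standing in for the real `data` object
toy <- data.frame(
  home_team    = c("DAL", "DAL", "NE"),
  away_team    = c("NYG", "NYG", "MIA"),
  posteam      = c("DAL", "NYG", "NE"),
  desc         = c("run left for 4 yards", "pass short right", "punts 45 yards"),
  rush         = c(1, 0, 0),
  pass         = c(0, 1, 0),
  yards_gained = c(4, 7, 0)
)

# Nested: you have to read from the inside out
nested <- select(toy, home_team, away_team, posteam, desc, rush, pass)

# Piped: read top to bottom, one step per line
piped <- toy %>%
  select(home_team, away_team, posteam, desc, rush, pass)

identical(nested, piped)  # checks that both forms build the same table
```

On the real data you would start from `data` instead of `toy`, and you can add one more step, `%>% View()`, to open the result in the viewer.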
Check out the result. Phew! Much less overwhelming. Fewer columns to look at. Let's save this into a new variable:
data1 <- data %>%
  select(home_team, away_team, posteam, desc, rush, pass)
1. Sum the rush column and pass column to see how many passes vs. rushes there were in 2023.
2. Sum the rushes and passes (separately) by the posteam (team with possession) to see how many rush plays/pass plays each team ran in 2023. Some code below from a different data set might be helpful.
aggregate(drivingDeaths$deaths, list(drivingDeaths$region), FUN=median)
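To see what that `aggregate` call hands back, here is a minimal sketch. The `drivingDeaths` table below is a made-up stand-in with invented values, not the real data set:

```r
# Hypothetical stand-in for the drivingDeaths data set (values are made up)
drivingDeaths <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  deaths = c(10, 14, 7, 9, 20)
)

# Group the deaths column by region and apply one function to each group
byRegion <- aggregate(drivingDeaths$deaths, list(drivingDeaths$region), FUN = median)
print(byRegion)   # columns come back named Group.1 and x

# Formula interface: same grouping, but it keeps the original column names
byRegion2 <- aggregate(deaths ~ region, data = drivingDeaths, FUN = median)
print(byRegion2)
```

The pattern is always (column to summarize, column to group by, function to apply), so swapping `FUN` for a different function changes what gets computed per group.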
Okay, now let's select a few columns but then also filter the rows. Notice that in the rush and pass columns there are some plays that are listed as neither. These are probably special teams plays. We are going to use the filter function (which is sort of an easier way to subset!). This is when the %>% pipe will be helpful because we can chain a few steps of piping:
There are 7,535 plays that seem to fit this description. Scroll through the descriptions just to see what you're working with!
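The select-then-filter pipeline described above can be sketched roughly like this. The `toy` table is hypothetical (five invented plays), so the count it gives will not match the real 2023 number:

```r
library(dplyr)

# Hypothetical plays; the real `data` has the same 0/1 rush and pass flags
toy <- data.frame(
  posteam = c("DAL", "DAL", "NYG", "NYG", "DAL"),
  desc    = c("run up the middle", "pass deep left", "punts 48 yards",
              "kicks off", "field goal is GOOD"),
  rush    = c(1, 0, 0, 0, 0),
  pass    = c(0, 1, 0, 0, 0)
)

neither <- toy %>%
  select(posteam, desc, rush, pass) %>%   # step 1: keep a few columns
  filter(rush == 0, pass == 0)            # step 2: keep rows that are neither

nrow(neither)   # how many plays matched
neither$desc    # skim the descriptions
```

Separating conditions with a comma inside `filter()` means all of them have to be true for a row to be kept.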
Edit the code above to find out how many plays fit each of these descriptions:
4. Plays that are on 4th down.
5. Plays that are on 4th down and are a special teams play.
6. Plays that are either on 4th down or a special teams play.
Note that you'll need to make use of: AND (thing1 == 1 & thing2 == 1) and OR (thing1 == 1 | thing2 == 1)
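A minimal sketch of the difference between AND and OR, using a made-up table whose columns are named after the `thing1`/`thing2` placeholders above:

```r
library(dplyr)

# Hypothetical table with two 0/1 flags
toy <- data.frame(
  thing1 = c(1, 1, 0, 0, 1),
  thing2 = c(1, 0, 1, 0, 1)
)

both   <- toy %>% filter(thing1 == 1 & thing2 == 1)  # AND: both must be true
either <- toy %>% filter(thing1 == 1 | thing2 == 1)  # OR: at least one is true

nrow(both)     # count of rows passing the AND condition
nrow(either)   # count of rows passing the OR condition
```

The OR version can never return fewer rows than the AND version, which is a handy sanity check on your answers.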
One thing that will be helpful will be to group that data and summarize it. This is similar to the aggregate feature above, so we can group plays by player, by down, by... whatever we want. Again... the piping is going to be super helpful!!!! In the summarize step, the "ypc" and "plays" are variables to head the column, and then the piece after it is how the things in that column are created.
Below is code that keeps only the Dallas Cowboys rush plays and then computes each rusher's yards per carry. The result is then sorted by the number of plays each rusher ran (the negative sign puts the biggest number at the top).
data %>%
  filter(posteam == "DAL", rush == 1) %>%
  group_by(rusher) %>%
  summarize( ypc = mean(yards_gained), plays = n() ) %>%
  arrange(-plays)
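To make each step of that pipeline easier to see, here is a minimal sketch on a made-up table (the player names and yardage are invented):

```r
library(dplyr)

# Hypothetical rushing plays
toy <- data.frame(
  rusher       = c("A.Back", "A.Back", "A.Back", "B.Runner", "B.Runner"),
  yards_gained = c(3, 5, 10, 12, 2)
)

perRusher <- toy %>%
  group_by(rusher) %>%
  summarize(
    ypc   = mean(yards_gained),  # left of "=" is the new column's name
    plays = n()                  # n() counts the rows in each group
  ) %>%
  arrange(desc(plays))           # desc(plays) sorts the same way as -plays

print(perRusher)
```

Each rusher ends up as one row, and every `name = expression` pair inside `summarize()` becomes one column of the result.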
Edit the code above to find out the following:
7. A list of the New England receivers with their TOTAL yards gained sorted by the number of plays they ran.
8. A list of the Washington Commanders receivers with their average yards gained... on 4th down, again sorted by the plays they ran.
9. Sort the initial Dallas rusher data by yards per carry instead of number of carries.
A couple of things that might be helpful:
table(data$posteam) can help you just see the abbreviations for team
the columns are named very intuitively, but you can also see all of them with names(data). There is also this site which has a list of all the columns which is kinda cool! It's searchable too.
One of the most important ways to evaluate the value of a play in football is by expected points added. At each place on the field and down marker, there is an expected number of points that a team will score. For example, reading below, if you are 15 yards away from the endzone on 2nd down, you are expected to score ~4.2 points. If you are 65 yards away from the endzone on 4th down (so on your own 35 yard line), then you are expected to *lose* 0.7 points (presumably the other team is likely to get the ball with good field position!).
So a play can be judged on its effectiveness by comparing the number of expected points before and after the play. For example, if you go from a game situation where you are expected to score 5 points to a place where you're expected to score 4.3 points, even though your team still probably will score, it was a bad play because you lost 0.7 expected points on the play!! Let's pull some main data, including the epa.
epaData <- data %>%
  select(posteam, rush, pass, special, epa, rusher, passer, receiver)
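As a quick check of the arithmetic in the example above: EPA is just the expected points after the play minus the expected points before it. A minimal sketch with the numbers from the text (the variable names are made up for illustration):

```r
ep_before <- 5.0   # expected points before the play
ep_after  <- 4.3   # expected points after the play

epa <- ep_after - ep_before
round(epa, 1)      # negative, so the play cost the offense expected points
```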
Then, take that data and pipe it into other functions. The code below finds the top 20 quarterbacks by EPA per play! Note that I'm only printing the top 20 because a lot of people passed the ball this season. I'm also filtering the data so that only passers with more than 50 plays are included.
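epaData %>%
  filter( pass==1 ) %>%
  group_by(passer) %>%
  summarize( epaPerPlay = mean(epa, na.rm = TRUE), plays = n() ) %>% # na.rm skips any plays with missing epa
  filter(plays>50) %>% # drop small samples BEFORE ranking and printing
  arrange(-epaPerPlay ) %>%
  print(n=20)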
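To see why a minimum-plays cutoff matters, here is a minimal sketch on a made-up table. The passer names, EPA values, and the cutoff of 2 are all invented for this toy example:

```r
library(dplyr)

# Hypothetical per-play EPA for three made-up passers
toy <- data.frame(
  passer = c(rep("Starter", 4), rep("Backup", 3), "Trick.Play"),
  epa    = c(0.5, -0.2, 0.3, 0.2,  0.1, -0.4, 0.0,  2.5)
)

ranked <- toy %>%
  group_by(passer) %>%
  summarize(epaPerPlay = mean(epa), plays = n()) %>%
  filter(plays > 2) %>%          # cutoff of 2 is arbitrary for this toy table
  arrange(desc(epaPerPlay))

print(ranked)
```

Without the `filter(plays > 2)` step, the passer with a single lucky play would sit at the top of the list, which is why the cutoff goes before the sort.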
Edit the code above to find out the
10. top 20 rushers...
11. & top 20 receivers that added the most expected points per play last season.
HAND IN YOUR WORK ON THE CODING CHALLENGES ABOVE (HTML output from RMarkdown), then
ANALYSIS: Using the nflfastR data, make a short slide-based presentation (~6 slides, 3 for each of 2a and 2b) in this slide deck. Pick one question from the first list below (2a), then come up with one question of your own to answer (2b).
2a: DECISION ANALYSIS
1. It’s 4th and X and we've decided to go for it. Should I have my team run or pass the ball?
2. You’re on the X yard line on 4th down and you've decided not to go for it. Should you punt or kick a field goal?
3. It’s first down - what is the best play choice? Does it vary by team?
4. Does a touchback or a kickoff return help the kicking or receiving team more?
5. It's X and goal. Should you run or pass?
6. Do teams change their run/pass strategy when they are losing or winning by certain amounts?
7. Generally, should you fair catch a punt or try to return it?
8. Should teams go for 2 after a touchdown more often?
2b: YOUR ANALYSIS
Then, answer a question of your own choosing based on the dataset. A suggestion might be to look at certain players/teams and look at their strategies/effectiveness.
A helpful resource: R Graph Gallery. You can also make graphs in Tableau if you want.
RUBRIC FOR ANALYSIS:
Included in quarter 2:
11/11: 10 coding challenges
9/9: Rough Draft of analyses:
Final Exam Presentation (10% of overall grade for course, 45%x2 for each quarter)
6/6: Does your mini-presentation *answer the direct question at hand* for your assigned question?
6/6: Does your mini-presentation *answer the direct question at hand* for your own question?
6/6: Does your mini-presentation do so with *good visualizations, measures, clear communication*?
RMARKDOWN FILE BELOW... Make a new one, then DELETE EVERYTHING, and copy paste my template in.
---
title: "NFL Play-by-Play Coding Challenges"
author: "Bowman Dickson"
date: "2024-12-09"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r include=FALSE}
library(tidyverse)
library(ggrepel)
library(nflreadr)
library(nflplotR)
```
Load 2023 data:
```{r}
data <- load_pbp(2023)
```
Selecting only a few columns:
```{r}
data1 <- data %>%
select(home_team, away_team, posteam, desc, rush, pass)
```
## INTROS
1. Sum the rush column and pass column to see how many passes vs. rushes there were in 2023.
```{r}
```
2. Sum the rushes and passes (separately) by the posteam (team with possession) to see how many rush plays/pass plays each team ran in 2023. Some code below from a different data set might be helpful.
```{r}
```
## FILTERING
How many plays fit each of these descriptions from 2023?
4. Plays that are on 4th down.
```{r}
```
5. Plays that are on 4th down and are a special teams play.
```{r}
```
6. Plays that are either on 4th down or a special teams play.
```{r}
```
## GROUP BY & SUMMARISE
Find the following:
7. A list of the New England receivers with their TOTAL yards gained sorted by the number of plays they ran.
```{r}
```
8. A list of the Washington Commanders receivers with their average yards gained... on 4th down, again sorted by the plays they ran.
```{r}
```
9. Sort the initial Dallas rusher data by yards per carry instead of number of carries.
```{r}
```
## EXPECTED POINTS ADDED
Determine the following based on EPA:
10. The top 20 rushers from 2023.
```{r}
```
11. The top 20 receivers from 2023.
```{r}
```