Use scouting reports on NFL draft prospects to analyze the players and their potential NFL success.
I used scouting data written by Dane Brugler of The Athletic in his well-known "Beast" draft guide. I downloaded every edition of the guide going back to the 2019 season.
I first converted the PDF files into text files for easier parsing, then used regex to separate the player profiles from each other and from the rest of the text in the documents. I also went through manually to add any players who slipped through the cracks or had name issues.
I standardized all the names by removing punctuation and capitalization so that joins against other datasets would match more reliably. I then used my clone of PFF's Wins Above Replacement (WAR) metric to add each player's NFL WAR.
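As a rough illustration of the name standardization described above (not the project's actual code), a join key could be built like this. The function name `standardize_name` is hypothetical, and treating every non-alphanumeric, non-space character as punctuation is an assumption of this sketch.

```python
import re


def standardize_name(name: str) -> str:
    """Lowercase a player name and strip punctuation to build a join key.

    Assumption: anything that is not a letter, digit, underscore, or
    whitespace counts as punctuation (periods, apostrophes, hyphens).
    """
    no_punct = re.sub(r"[^\w\s]", "", name)
    # Collapse repeated whitespace so "D.J.  Moore" and "DJ Moore" match.
    return " ".join(no_punct.lower().split())


print(standardize_name("Ja'Marr Chase"))
print(standardize_name("D.J.  Moore"))
```

Applying the same function to the name column of both datasets before joining keeps the keys consistent on each side.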
The conversion from PDF to text also introduced some issues that I had to clean up during preprocessing. The letter f, particularly in combinations like ff, fi, or fl, was often encoded as a different character, which then got dropped from the main dataframe. I did my best to build a conversion table to add those missing characters back in.
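A conversion table like the one described above might look like the following sketch. It assumes the stray characters were the standard Unicode ligature code points (U+FB00 through U+FB04); the characters actually produced by a given PDF converter may differ, so the table would need to be adjusted to whatever shows up in the extracted text. `LIGATURES` and `repair_ligatures` are hypothetical names.

```python
# Assumed mapping from Unicode ligature characters to plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Replace ligature characters with the letters they stand for."""
    return text.translate(_LIGATURE_TABLE)


print(repair_ligatures("plays with great e\ufb00ort and \ufb02uid hips"))
```

Running the repair before any step that filters out non-ASCII characters keeps words like "effort" and "fluid" intact.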
From there, I ran the basic NLP steps of standardization, tokenization, and lemmatization to convert the text from regular writing into a consistent baseline that is more useful to the computer.
All of these calculations were performed within a single position group, so no other position group affected the analysis.
Bi-gram Analysis
This looks at pairs of adjacent words (bi-grams) found in the scouting reports and their correlation with NFL Wins Above Replacement, which is built on PFF grades.
I also plotted the most common bi-grams across the position group among players with NFL WAR above the 75th percentile (good players) and those with WAR below the 25th percentile (bad players) to see what each group had in common.
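The two bi-gram views above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: it scores each bi-gram by whether it appears in a report (0 or 1) and uses a Pearson correlation against WAR, and the function names, the `min_players` filter, and the `(tokens, war)` input shape are all hypothetical.

```python
import math
import statistics
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs in a list of tokens."""
    return list(zip(tokens, tokens[1:]))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def bigram_war_correlation(players, min_players=2):
    """Correlate each bi-gram's presence (0/1) in a report with WAR.

    players: list of (tokens, war) pairs from ONE position group.
    Using presence rather than raw counts is an assumption of this sketch.
    """
    wars = [war for _, war in players]
    seen = [set(bigrams(tokens)) for tokens, _ in players]
    support = Counter(bg for s in seen for bg in s)
    return {
        bg: pearson([1.0 if bg in s else 0.0 for s in seen], wars)
        for bg, n in support.items()
        if n >= min_players
    }


def common_bigrams_by_tier(players, top_n=10):
    """Most common bi-grams above the 75th and below the 25th WAR percentile."""
    wars = [war for _, war in players]
    q1, _, q3 = statistics.quantiles(wars, n=4)  # needs at least 2 players
    good, bad = Counter(), Counter()
    for tokens, war in players:
        if war > q3:
            good.update(bigrams(tokens))
        elif war < q1:
            bad.update(bigrams(tokens))
    return good.most_common(top_n), bad.most_common(top_n)
```

Both functions expect the already lemmatized tokens for a single position group, which matches the note above that every calculation was run within one position group at a time.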
Sentiment Analysis
I ran sentiment analysis on the words in the strengths and weaknesses sections to get a sense of how positive, negative, or neutral the language in each section was.
Similarity Prediction
I took the combined strengths and weaknesses sections and used latent semantic analysis (LSA) to compute a similarity score between players. I then used these similarity scores, along with the players' NFL WAR values, to compute a prediction for each player, weighted toward the most similar player profiles.
For the WAR, I took a player's first four seasons and blended the first season 50/50 with an average rookie at the position. I then took the average of those four seasons as that player's WAR. This is done to cut down on survivorship bias and to highlight the importance of the rookie contract.
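One reading of the WAR target described above, written out as a sketch: the rookie season is replaced by a 50/50 blend of the player's own rookie WAR and the positional rookie average, and the result is averaged with the following seasons. The function and parameter names are hypothetical, and averaging over however many seasons are available (when a player has fewer than four) is an assumption, since the text does not say how missing seasons are handled.

```python
def rookie_contract_war(season_wars, avg_rookie_war, seasons=4, rookie_weight=0.5):
    """Average WAR over a player's first `seasons` seasons.

    season_wars: per-season WAR values in chronological order.
    avg_rookie_war: average rookie WAR at the player's position.
    Assumption: players with fewer than `seasons` seasons are averaged
    over the seasons they do have.
    """
    wars = list(season_wars[:seasons])
    if not wars:
        raise ValueError("need at least one season of WAR")
    # Blend the rookie year with the positional rookie average.
    wars[0] = rookie_weight * wars[0] + (1 - rookie_weight) * avg_rookie_war
    return sum(wars) / len(wars)


# Made-up numbers: rookie year becomes 0.5 * 0.2 + 0.5 * 0.1 = 0.15
print(rookie_contract_war([0.2, 0.4, 0.6, 0.8], avg_rookie_war=0.1))
```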
For the LSA, I used a TF-IDF (term frequency-inverse document frequency) vectorizer and then SVD (singular value decomposition) to build the multi-dimensional space. I settled on 15 components for the SVD and normalized the resulting LSA vectors.
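A minimal sketch of that LSA pipeline, assuming scikit-learn is the library in use (the text does not name one). `build_lsa_vectors` is a hypothetical helper, and the vectorizer is left at its default settings because the original parameters are not given.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def build_lsa_vectors(reports, n_components=15, random_state=0):
    """Turn scouting-report strings into unit-length LSA vectors.

    reports: one string per player (combined strengths and weaknesses).
    n_components must not exceed the vocabulary size of the corpus.
    """
    lsa = make_pipeline(
        TfidfVectorizer(),  # TF-IDF weighting of each report
        TruncatedSVD(n_components=n_components, random_state=random_state),
        Normalizer(copy=False),  # scale every row to unit L2 norm
    )
    return lsa.fit_transform(reports)
```

Because the rows come out with unit length, the cosine similarity between two players reduces to the dot product of their vectors.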
To actually compute the similarity scores, I raised the cosine similarity to the third power.
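To make the similarity-weighted prediction concrete, here is a sketch under stated assumptions. The cubed cosine similarity comes from the text; treating the prediction as a weighted average of WAR over the `top_k` most similar players, and clipping negative similarities to zero before cubing, are assumptions of this example. `predict_war` and its parameters are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_war(target_vec, comps, top_k=10, power=3):
    """Similarity-weighted average WAR of the most similar players.

    comps: list of (lsa_vector, war) pairs for players with known WAR.
    Returns None when no comparison player has a positive similarity.
    """
    scored = []
    for vec, war in comps:
        # Assumption: negative similarities are clipped to zero so that
        # every weight is non-negative after cubing.
        weight = max(cosine_similarity(target_vec, vec), 0.0) ** power
        scored.append((weight, war))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:top_k]
    total = sum(weight for weight, _ in top)
    if total == 0:
        return None
    return sum(weight * war for weight, war in top) / total
```

Cubing shrinks moderate similarities much more than high ones (0.5 becomes 0.125, while 0.9 becomes 0.729), so the closest profiles dominate the weighted average.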
Overall I feel pretty good about this project. It is something I have wanted to do for a while, but I had always dropped it for one reason or another. I am very pleased to have finished it and gotten it out by the draft.
As far as the actual output of the analyses, I am less pleased with the predictions. I had a hard time getting the model to differentiate between players rather than bunching them together, so the scores weren't as clean as I wanted them to be. I also don't think they are terribly predictive of future NFL success.
I based some of this project on work done by Ben Brown when he was at PFF. They mentioned using data from 2015, whereas I could only find the Beast going back to 2019. Perhaps they were using broader scouting reports than I was, since I limited the scope to just the strengths and weaknesses.