Source:
Introduction:
Soccer is one of my passions, and the MLS is the league closest to me, so applying my data analysis skills to the MLS was a no-brainer. The dataset I picked came from Kaggle and can be found here. Its author is Aiden Flynn. The dataset is packed with player statistics. Specifically, it includes player name, nationality, position, secondary position, club, age, birth year, matches played, matches started, minutes played, goals, assists, goal contributions, and yellow and red cards.
Data Description:
The data is organized into rows and columns, so the dataset is rectangular, also known as tabular. Each row represents an individual MLS player, and each column represents a player variable such as nationality, goals, or assists. The columns include both qualitative variables, such as player name, nationality, position, and club, and quantitative variables, such as age, matches played, goals, and assists. The granularity is fine, since each row holds an individual player's statistics rather than the stats of a team or the league. The scope of the dataset is MLS players and their statistics. The dataset is fairly complete for player-level analysis: it contains enough information to analyze players without pulling in other sources. As for temporality, the dataset covers the 2024 MLS season, and each row represents a player's statistics for that season only. The dataset also includes time-related variables such as age, birth year, and minutes played, but these all refer to the 2024 season and do not track changes over multiple seasons. Overall, the dataset appears reliable aside from a few minor issues. The secondary position column has a large number of missing values, but this makes sense because not all players have a secondary position. The age, birth year, and country columns also have two missing values each. These issues are few, so they do not meaningfully reduce the reliability of the data.
Data Cleaning:
1.)
I first looked at the dataset and immediately saw that it had missing data. To find all the missing values, I used the Pandas library to count the NA values in each column.
2.)
Next, since the secondary position column had 562 missing values, I decided to drop it and create a new DataFrame without it. I also excluded 'id', 'born', 'yellow_cards', and 'red_cards', since I have no plans to use them in my analysis.
3.)
Lastly, I removed the rows with missing data in the 'country' and 'age' columns. Each column had only 2 missing values, so dropping those rows will not have a big impact on the analysis.
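The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up four-row table, not the actual notebook code: the column names 'name' and 'secondary_position' are assumptions, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle file; all values are made up for illustration.
# "name" and "secondary_position" are assumed column names.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "country": ["USA", None, "CAN", "ARG"],
    "secondary_position": [None, "MF", None, None],
    "age": [24.0, 31.0, np.nan, 27.0],
    "born": [2000.0, 1993.0, np.nan, 1997.0],
    "yellow_cards": [2, 0, 5, 1],
    "red_cards": [0, 0, 1, 0],
})

# Step 1: count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Step 2: drop the mostly-empty column and the columns not used later.
unused = ["secondary_position", "id", "born", "yellow_cards", "red_cards"]
df_small = df.drop(columns=unused)

# Step 3: drop the rows that are missing a country or an age.
df_clean = df_small.dropna(subset=["country", "age"]).reset_index(drop=True)
print(df_clean)
```

Building a new DataFrame at each step, rather than modifying `df` in place, keeps the raw data available in case a dropped column turns out to be useful later.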
Single Variable Graphs
This bar chart shows the 25 most common player nationalities in the MLS, ranked by the number of players from each nation. It is clear that the United States has by far the most players, with a large gap to the next country, Canada. The counts are heavily concentrated in that one category, followed by a long tail of countries that each contribute only a few players. Many believe that American players are not at the same level as players from other nations. Since this chart shows that American players make up the largest share of the league, that perception leads many to underrate the league as a whole: if American players are rated low, then the league will be as well.
This histogram shows the distribution of ages in the MLS. Most players are between 20 and 30 years old. The graph is roughly bell-shaped, with a peak at around age 23 to 24, which means that most players are in their mid 20s. There are, of course, younger and older players, but the tails on both sides show that there are fewer of them than players in their mid 20s. This is surprising, as the MLS has a stereotype of being a retirement home for older players past their prime. Based on this graph alone, that does not seem to be the case.
This histogram shows the distribution of goals scored. The graph has a heavy right skew, which means that most players scored few or no goals and so contributed little to the league's scoring totals. Toward the higher goal totals, the number of players decreases drastically. The players in that tail are the outliers and the league's top goal scorers. Overall, the graph suggests that a large share of the goals in the MLS are scored by a small group of players.
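A histogram shows how many players fall in each goal range, not what share of all goals the top scorers account for. One way to check the "small group of players" claim directly is to compute that share. The sketch below is an illustration only: `top_share` is a hypothetical helper, and the goal counts are made up rather than taken from the dataset.

```python
def top_share(values, top_fraction=0.10):
    """Return the share of the total contributed by the top `top_fraction` of entries."""
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    # Always include at least one player in the "top" group.
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Made-up goal counts for 20 players, not taken from the dataset.
goals = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 6, 12, 18]
share = top_share(goals, top_fraction=0.10)
print(f"Top 10% of players scored {share:.0%} of the goals")
```

The same function could be applied to the assists column to test the matching claim about playmakers.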
This histogram shows the distribution of assists by MLS players. The distribution is strongly right-skewed, which means that most players in the league assisted only a small number of goals. As the number of assists increases, the number of players decreases. The players with high assist numbers are the outliers and the league's top playmakers. Overall, the graph suggests that a large share of the assists in the MLS are created by a small group of players.
Multi-Variable Plots
Looking at this chart, it is clear that, for the most part, the highest goal contributors are forwards and midfielders. This is expected, although midfielders contribute more offensively than I expected and are close to forwards in goal contributions. The chart also shows the role that age plays. Most goal contributions come from players in their mid 20s to early 30s, with some exceptions. While it is no surprise that position plays a role in goal contributions, age also appears to influence when players reach their peak.
This box plot shows the distribution of player ages for each position. Goalkeepers appear to be the oldest group, as they have the highest median age, which is in the late twenties. This may suggest that goalkeeping rewards experience more than other positions do. On the other hand, forwards and midfielders have the lowest median ages, while defenders fall in the middle. This could be because forwards and midfielders rely more on qualities often associated with youth, like speed and endurance.
Prediction Models
(Age and minutes played were used to predict goals and assists.)
Variation 1 - Max Depth = 3
This Decision Tree Regressor has a max_depth equal to 3. The model had a training MSE of 18.06 and a test MSE of 19.53. Taking the square root of each MSE, the model's predictions were typically off by about 4.2 goals and assists on the training data and 4.4 on the test data. There is only a small difference between the train MSE and the test MSE, which means the model is not overfitting by a big margin.
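The "off by about" figures come from taking the square root of the MSE, which gives the root mean squared error (RMSE) in the same units as the target. A quick check of that arithmetic for the numbers above (`rmse_from_mse` is a helper name introduced here for illustration):

```python
import math


def rmse_from_mse(mse):
    """Root mean squared error: the square root of the MSE, in the target's units."""
    return math.sqrt(mse)


# MSE values reported for the max_depth = 3 tree.
train_mse, test_mse = 18.06, 19.53
train_rmse = rmse_from_mse(train_mse)
test_rmse = rmse_from_mse(test_mse)
print(f"train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")
```

RMSE weights large misses more heavily than small ones, so it is best read as a typical error size rather than a strict average.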
Variation 2 - Max Depth = 4
This Decision Tree Regressor has a max_depth equal to 4. The model had a training MSE of 16.86 and a test MSE of 18.90. Taking the square root of each MSE, the model's predictions were typically off by about 4.1 goals and assists on the training data and 4.3 on the test data. There is only a small difference between the train MSE and the test MSE, which means the model is not overfitting by a big margin.
Variation 3 - Max Depth = 2
This Decision Tree Regressor has a max_depth equal to 2. The model had a training MSE of 19.43 and a test MSE of 20.94. Taking the square root of each MSE, the model's predictions were typically off by about 4.4 goals and assists on the training data and 4.6 on the test data. There is only a small difference between the train MSE and the test MSE, which means the model is not overfitting by a big margin.
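The three variations above follow the same pattern: fit a tree at a given depth, then compare train and test MSE. A minimal sketch of that loop with scikit-learn is shown below. It runs on synthetic data, so its numbers will not match the ones reported here, and the 80/20 split and `random_state` values are assumptions rather than the settings actually used.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real features would be age and minutes played.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(17, 39, size=n)
minutes = rng.integers(0, 3060, size=n)
# Made-up target: contributions grow with minutes played, plus noise.
contributions = rng.poisson(minutes / 500)

X = np.column_stack([age, minutes])
y = contributions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

results = {}
for depth in (2, 3, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

A widening gap between train and test MSE as the depth grows would be the sign of overfitting to watch for.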
Linear Regression
Variation 1 - (minutes)
This Linear Regression model was trained using only minutes played. The model had a training R² of 0.2 and a test R² of 0.141. This means the model explained about 20% of the variance in goals and assists on the training data and 14% on the test data. There is a small difference between the train and test R², which means the model is not overfitting by a lot.
Variation 2 - (age)
This Linear Regression model was trained using only age. The model had a training R² of 0.01 and a test R² of 0. This means the model explained almost none of the variance in goals and assists on either the training or the test data. Age alone is essentially not a useful predictor of goal contributions in the MLS.
Variation 3 - (age + minutes)
This Linear Regression model was trained using age and minutes played. The model had a training R² of 0.201 and a test R² of 0.154. This means the model explained about 20% of the variance in goals and assists on the training data and 15% on the test data. There is a small difference between the train and test R², which means the model is not overfitting by a big margin. Adding age to minutes barely changed the training R² (0.2 to 0.201), which is consistent with age contributing very little on its own.
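The three linear regression variations can be sketched as one loop over feature sets. As with any illustration on synthetic data, the R² values it prints will not match the ones reported here, and the split settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the MLS dataset.
rng = np.random.default_rng(1)
n = 400
age = rng.integers(17, 39, size=n).astype(float)
minutes = rng.integers(0, 3060, size=n).astype(float)
contributions = rng.poisson(minutes / 500).astype(float)

features = {"age": age, "minutes": minutes}
feature_sets = {
    "minutes": ["minutes"],
    "age": ["age"],
    "age + minutes": ["age", "minutes"],
}

# Use one shared split so the three variations are evaluated on the same rows.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.2, random_state=42  # assumed split settings
)

scores = {}
for label, cols in feature_sets.items():
    X = np.column_stack([features[c] for c in cols])
    model = LinearRegression().fit(X[idx_train], contributions[idx_train])
    train_r2 = r2_score(contributions[idx_train], model.predict(X[idx_train]))
    test_r2 = r2_score(contributions[idx_test], model.predict(X[idx_test]))
    scores[label] = (train_r2, test_r2)
    print(f"{label}: train R²={train_r2:.3f}, test R²={test_r2:.3f}")
```

Because the "age + minutes" model contains the "minutes" model as a special case, its training R² can never be lower. Only the test R² shows whether the extra feature actually helps.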
Decision Tree Regressor & Linear Regression Summary
Overall, both models performed badly, which suggests that age and minutes played are not good predictors of goal contributions in the MLS. Of the two, however, the decision tree regressor came out on top. More specifically, the decision tree regressor with a max depth of 4 did best, because it had the lowest test MSE of the tree variations, 18.9. The best test R² for Linear Regression was 0.154. MSE and R² are different metrics and cannot be compared directly, so these numbers only give a rough idea, and a firm comparison would need the same metric computed for both models on the same test set. In the end, the recommended model is the decision tree regressor with a max depth equal to 4.
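One way to put both models on the same scale is to use the identity R² = 1 - MSE / Var(y), where Var(y) is the variance of the true target values on the data the MSE was computed on. The sketch below illustrates that conversion with made-up targets and predictions. The function names and all numbers are assumptions for illustration, and nothing here is computed from the MLS data.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_from_mse(mse_value, y_true):
    """R² = 1 - MSE / Var(y), using the population variance of y_true."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    variance = sum((y - mean_y) ** 2 for y in y_true) / n
    if variance == 0:
        raise ValueError("R² is undefined when the target is constant")
    return 1.0 - mse_value / variance


# Made-up test targets and predictions from two hypothetical models.
y_test = [0, 1, 0, 3, 8, 2, 0, 12, 1, 5]
pred_tree = [1, 1, 1, 4, 6, 1, 1, 9, 1, 4]
pred_linear = [2, 2, 1, 3, 5, 2, 2, 6, 2, 4]

for name, pred in (("tree", pred_tree), ("linear", pred_linear)):
    m = mse(y_test, pred)
    print(f"{name}: MSE={m:.2f}, R²={r2_from_mse(m, y_test):.3f}")
```

On a shared test set, the model with the lower MSE always has the higher R², so reporting either metric for both models would settle the comparison.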