Regression-based Machine Learning modeling

Project done as part of the course 24-787 (Machine Learning and Artificial Intelligence for Engineers) at Carnegie Mellon University (Fall 2022)

Transfer fees for sports players have become astronomical. The increase can be attributed to clubs paying to bring in players of great future value, who in turn can help a club thrive in the ever-evolving transfer market. This makes informed decisions on player acquisitions essential.

The aim of this study is to use a Machine Learning (ML) regression framework to model the relationship between a player and their transfer value from features such as Age, Nationality, and Position played, narrowed down by Feature Engineering to the features that truly affect the player's value. With this framework, the market value of a player can be estimated at different stages of their career. My responsibilities were dataset collection, feature engineering, and applying regression-based models to the data.

The work was divided into three phases: researching relevant data, feature engineering, and modeling on the selected features. The first step was to select the dataset. The dataset used for the study is sourced from FIFA 20 (sofifa.com) and covers player attributes such as age, height, weight, nationality, club, overall rating, potential, market value, salary, skill moves, international reputation, weak foot, release clause, and team/player position. The next step used Pearson's correlation coefficient to perform feature engineering and select the most relevant features, which were also visualized to understand the distribution of the data. The final step, predicting the transfer value, was done by benchmarking 40 regression-based models and selecting the best of them.

Two types of evaluation were carried out in two different phases: one for the Feature Engineering and one for benchmarking the regression models.

Feature Engineering:

Pearson's correlation coefficient was chosen as the metric for identifying the features most linearly correlated with the transfer value of a player.
We found that the Release Clause was the most linearly related to the final value (r = 0.98).
We found that Age was, surprisingly, the least linearly related to the final value (r = 0.082).
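
The ranking step above can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: the helper names `pearson_r` and `rank_features`, the feature keys, and the toy numbers are all assumptions made for illustration, and the values are not taken from the FIFA 20 dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, r) pairs sorted by |r|, strongest linear relation first."""
    scores = [(name, pearson_r(values, target)) for name, values in features.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Toy data invented for illustration only (hypothetical feature names).
value = [10.0, 20.0, 30.0, 40.0, 50.0]
features = {
    "release_clause": [19.0, 41.0, 59.0, 82.0, 99.0],
    "age": [24, 31, 19, 28, 22],
}

ranking = rank_features(features, value)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Sorting by the absolute value of r keeps strongly negative correlations near the top as well. Note that r only measures linear association, so a feature with a low r may still carry non-linear signal.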

Benchmarking of the best-performing regression-based models:

Benchmarking involved testing various machine learning models, with a specific focus on linear regression, random forest, and support vector regression (SVR).
Performance metrics such as mean absolute error (MAE), mean squared error (MSE), median absolute error, R-squared score, and adjusted R-squared score were used for evaluation.
The Random Forest Regressor was found to be one of the best algorithms for this task due to its high accuracy and relatively low error (R² score of 0.99).
The Support Vector Regressor was not an appropriate algorithm for this task due to its high error and low accuracy (R² score of -0.11, i.e. worse than always predicting the mean value).
The XGB Regressor, the best-performing regressor, outperformed the Linear SVR by ~100%.
Training time was highest for the Kernel Ridge Regressor and lowest for the Transformed Target Regressor.
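
The evaluation metrics listed above can be computed as in the following sketch, which uses only the Python standard library. The function name `regression_metrics`, its signature, and the toy values are assumptions for illustration and are not the project's actual evaluation code.

```python
import statistics


def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, MSE, median absolute error, R² and adjusted R²."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if n <= n_features + 1:
        raise ValueError("adjusted R² needs more samples than n_features + 1")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]

    mae = sum(abs_errors) / n
    mse = sum(e * e for e in errors) / n
    median_ae = statistics.median(abs_errors)

    # R² compares the model's squared error with that of predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    r2 = 1.0 - ss_res / ss_tot

    # Adjusted R² penalises the score for the number of features used.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {
        "mae": mae,
        "mse": mse,
        "median_ae": median_ae,
        "r2": r2,
        "adjusted_r2": adjusted_r2,
    }


# Toy values invented for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred, n_features=1)
for name, score in metrics.items():
    print(f"{name}: {score:.4f}")
```

Returning all metrics in one dictionary makes it straightforward to collect one row per model when comparing many regressors side by side.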

The project successfully identifies critical features impacting a player's market value, with the release clause emerging as a pivotal factor. Among the regression models, Random Forest exhibits high predictive accuracy. The findings emphasize the significance of feature engineering and appropriate model selection in predicting football player transfer values. Future enhancements may include incorporating real-world data, refining missing-value handling, and considering dynamic factors such as player performance boosts over time.
