In this project, I perform a detailed statistical analysis using a real-world dataset that includes both categorical and numerical variables. The aim is to apply key statistical concepts such as:
Association between categorical variables
Measures of central tendency
Measures of dispersion
Relationships between numerical variables
Calculation of covariance and correlation
This activity demonstrates my ability to extract insights from data using appropriate mathematical reasoning, visualizations, and interpretation.
FIFA-15 DATASET
ㅤㅤㅤ
The dataset used here is from FIFA 15 and includes over 15,000 player profiles. This part of the analysis explores the association between two categorical variables in that dataset:
Body Type (Lean, Normal, Stocky) and Work Rate (High/High, High/Medium, etc.).
We extracted the Body Type and Work Rate columns and built a frequency table of their combinations.
The aim is to see whether certain body types are more likely to have specific work rates, a useful insight when evaluating player roles and gameplay behavior.
➤Frequency Table (Top Table in Sheet):
Shows the count of players for each combination of Body Type and Work Rate.
Example: How many Lean players have a Medium/Medium work rate?
➤Percentage Table (Bottom Table):
Shows percentage distribution of Work Rates within each Body Type.
All rows (Lean, Normal, Stocky) sum up to 100%, making comparison easier.
➤100% Stacked Column Chart:
Visualizes the % table.
Each bar represents a Body Type, split into different Work Rate categories.
Helps you quickly spot trends, like which Work Rate is dominant.
Medium/Medium is the most common Work Rate across all Body Types, especially among Normal players.
High/Medium is more prevalent among Lean and Stocky players than among Normal players.
Work Rates like Low/High and Low/Medium are relatively rare for all Body Types.
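The frequency and percentage tables described above were built in a spreadsheet; the same logic can be sketched in a few lines of Python. This is only an illustration: the `players` list is made-up placeholder data, and the function names `frequency_table` and `row_percentages` are hypothetical, not part of the original sheet.

```python
from collections import Counter, defaultdict

# Hypothetical (body_type, work_rate) pairs standing in for the two dataset columns.
players = [
    ("Lean", "Medium/Medium"),
    ("Lean", "High/Medium"),
    ("Lean", "Medium/Medium"),
    ("Normal", "Medium/Medium"),
    ("Normal", "Medium/Medium"),
    ("Normal", "High/Medium"),
    ("Normal", "Medium/Medium"),
    ("Stocky", "High/Medium"),
    ("Stocky", "Medium/Medium"),
]


def frequency_table(pairs):
    """Count players for each (body type, work rate) combination."""
    table = defaultdict(Counter)
    for body_type, work_rate in pairs:
        table[body_type][work_rate] += 1
    return table


def row_percentages(table):
    """Convert each body-type row into percentages that sum to 100."""
    result = {}
    for body_type, counts in table.items():
        total = sum(counts.values())
        result[body_type] = {wr: 100 * n / total for wr, n in counts.items()}
    return result


freq = frequency_table(players)
pct = row_percentages(freq)
for body_type, row in pct.items():
    print(body_type, {wr: round(p, 1) for wr, p in row.items()})
```

Because each row is divided by its own total, body types with very different player counts can still be compared directly, which is the same idea behind the 100% stacked column chart.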
ㅤㅤㅤ
In this section, we calculated the Mean, Median, and Mode for three numerical variables:
Age (Column E in the dataset)
Weight (in kg) (Column H)
Overall Rating (Column M)
The calculated values were presented in a table, followed by a column chart comparing them visually.
Reason:
The mean, median, and mode summarize each of the selected numerical variables (Age, Weight, and Overall Rating) in a single representative value.
This helps in:
Understanding the general profile of players in the dataset (e.g., typical age, typical weight, and average skill rating).
Identifying whether the data is symmetric or skewed by comparing mean, median, and mode.
Providing a baseline reference for further analysis in dispersion, correlation, and other statistical measures.
Interpretation:
The measures of central tendency (mean, median, and mode) are quite close for all three variables:
For Age, the values are tightly grouped, suggesting a fairly symmetric distribution.
For Weight, a mean slightly above the median and mode would indicate a slight right skew (a few heavier players pulling the mean up).
For Overall Rating, the closeness of all three values suggests a balanced distribution of player performance in the dataset.
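As a rough illustration of how comparing the mean and median hints at skew, here is a minimal Python sketch. The `ages` and `weights` lists are invented placeholder values, not figures from the dataset, and the `tolerance` threshold in the hypothetical `skew_hint` helper is an arbitrary assumption.

```python
import statistics


def central_tendency(values):
    """Return the mean, median, and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


def skew_hint(summary, tolerance=0.5):
    """Rough skew indicator: compare mean and median.

    `tolerance` is an arbitrary cut-off chosen for this sketch.
    """
    diff = summary["mean"] - summary["median"]
    if abs(diff) <= tolerance:
        return "roughly symmetric"
    return "right-skewed" if diff > 0 else "left-skewed"


# Hypothetical values, for illustration only.
ages = [21, 23, 24, 24, 25, 26, 27]
weights = [68, 70, 70, 72, 74, 75, 96]  # one much heavier player

age_summary = central_tendency(ages)
weight_summary = central_tendency(weights)

print("Age:", age_summary, skew_hint(age_summary))
print("Weight:", weight_summary, skew_hint(weight_summary))
```

In the placeholder weights, a single heavy value pulls the mean above the median, which is the pattern described for a right-skewed variable.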
ㅤㅤㅤ
▶To understand how spread out the data is for each variable, we calculated appropriate measures of dispersion. These help interpret the variability in the player attributes. The three numerical variables chosen were:
Age
Weight (in kg)
Overall Rating
▶We used the following measures:
Range – the difference between the maximum and minimum values.
Variance – the average of the squared deviations from the mean.
Standard Deviation – the square root of the variance.
Each of these was calculated directly in the sheet using formulas, and the results are presented below the central tendency values.
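For reference, the three measures can be reproduced outside the sheet with a short Python sketch. The `ratings` list is placeholder data, and `dispersion` is a hypothetical helper name. The population forms (`pvariance`, `pstdev`) are used here because they match the definition given above (average squared deviation); whether the sheet used the population or the sample formula is not stated, so treat that choice as an assumption.

```python
import statistics


def dispersion(values):
    """Return range, population variance, and population standard deviation."""
    return {
        "range": max(values) - min(values),
        # Average of the squared deviations from the mean (population form).
        "variance": statistics.pvariance(values),
        # Square root of the population variance.
        "std_dev": statistics.pstdev(values),
    }


# Hypothetical overall ratings, for illustration only.
ratings = [60, 62, 64, 66, 68]

result = dispersion(ratings)
print(result)
```

Swapping in `statistics.variance` and `statistics.stdev` would give the sample versions, which divide by n - 1 instead of n.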
▶These measures help us interpret how consistent or varied the data is. For example:
A higher standard deviation in Weight would indicate players vary a lot in physical build.
A low variance in Overall Rating suggests most players have a similar skill level.
The range, variance, and standard deviation were calculated for Age, Weight (kg), and Overall Rating. A column chart is also added to compare their standard deviations visually.
▶Interpretation:
Weight (kg) shows the highest standard deviation, indicating greater variability in player body mass.
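Overall Rating has moderate variation, while Age has the lowest variability, suggesting that most players fall within a similar age group. Because the three variables are measured in different units, this comparison of standard deviations is only indicative.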
ㅤㅤㅤ
(Scatter Plot – Age vs Overall Rating)
We selected a random sample of 99 players from the FIFA 15 dataset to explore the relationship between a player's age and their overall performance rating.
A scatter plot was used to visualize this relationship. Each point on the plot represents a player, where the x-axis shows the Age, and the y-axis shows the Overall Rating.
Interpretation:
The data does not show a strong linear correlation between age and overall rating.
High-rated players are found across a wide age range.
This indicates that factors other than age—like experience, skill set, and potential—may influence overall performance.
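How the 99 players were picked is not described beyond "random sample". One way such a draw could be done is sketched below, under the assumption that rows are selected without replacement. The `draw_sample` helper, the seed value, and the row count of 15,000 are illustrative placeholders.

```python
import random


def draw_sample(row_indices, k=99, seed=None):
    """Pick k distinct rows at random (sampling without replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(row_indices, k)


# Placeholder row indices; the real dataset has "over 15,000" profiles.
all_rows = list(range(15000))

sample_rows = draw_sample(all_rows, k=99, seed=15)
print(len(sample_rows), "rows selected")
```

Fixing the seed means the same 99 rows come back on every run, so the scatter plot and the statistics computed from it stay reproducible.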
ㅤㅤㅤ
We analyzed the relationship between a player's Age and Overall Rating using a sample of 99 FIFA 15 players.
Covariance value shows the direction of the relationship.
(Covariance for above is 0.05876951331)
Correlation value shows the strength and direction.
(Correlation for above is 0.008939160707)
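Both values are positive but very close to zero (correlation ≈ 0.009), so this sample shows essentially no linear relationship between age and overall rating, which matches the scatter plot. A negative correlation would have meant that overall rating tends to decrease as age increases; that is not what this sample shows.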
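To make the two calculations concrete, here is a minimal Python sketch of covariance and the Pearson correlation coefficient. The `ages` and `ratings` lists are made-up placeholder values, not the 99-player sample. The `sample` flag is an assumption of this sketch, since the text does not say whether the sheet used the sample or the population covariance formula.

```python
import math


def covariance(xs, ys, sample=True):
    """Covariance of two equal-length lists.

    sample=True divides by n - 1, sample=False divides by n.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    cov = covariance(xs, ys, sample=False)
    std_x = math.sqrt(covariance(xs, xs, sample=False))
    std_y = math.sqrt(covariance(ys, ys, sample=False))
    return cov / (std_x * std_y)


# Hypothetical values, for illustration only.
ages = [20, 22, 24, 26, 28]
ratings = [70, 66, 72, 65, 71]

cov = covariance(ages, ratings)
r = correlation(ages, ratings)
print(f"covariance = {cov:.3f}, correlation = {r:.3f}")
```

The sign of the covariance gives the direction of the relationship, while the correlation rescales it to the range from -1 to 1 so that its strength can be judged. A value close to 0, as in the placeholder data here, indicates little or no linear relationship.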
ㅤㅤㅤ