Statistics 2 is a foundational course in the BS in Data Science program at the Indian Institute of Technology Madras. The course focuses on probability theory, random variables, discrete and continuous probability distributions, expectation, variance, and conditional probability. It develops the analytical thinking and mathematical reasoning required for data science, artificial intelligence, and statistical modeling. Through real-world datasets and practical activities, the course strengthens the ability to analyze uncertainty and draw data-driven conclusions.
For this activity, I used the dataset titled Students Performance in Exams, obtained from Kaggle. This dataset contains academic performance data of students along with demographic and background variables.
The dataset consists of 1,000 rows and 8 columns, including both categorical and numerical variables. This makes it suitable for descriptive statistics as well as probability-based analysis.
Each row represents one student and records that student's demographic details and scores in the mathematics, reading, and writing exams.
The dataset contains the following variables:
Gender – Male or Female
Race/Ethnicity – Group classification of students
Parental Level of Education – Education level of parents
Lunch – Standard or Free/Reduced
Test Preparation Course – Completed or None
Math Score – Marks obtained in mathematics (0–100)
Reading Score – Marks obtained in reading (0–100)
Writing Score – Marks obtained in writing (0–100)
The first five variables are categorical, while the last three are numerical.
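The split between categorical and numerical variables can be checked programmatically. The following is a minimal sketch using only the Python standard library. The sample rows are hypothetical values made up for illustration (they are not copied from the Kaggle file), only six of the eight columns are shown, and the lowercase column names are an assumption about how the CSV headers are spelled. Values are kept as strings because that is how `csv.DictReader` would return them.

```python
# Hypothetical sample rows for illustration only; not actual dataset records.
# Column names are assumed spellings of the CSV headers.
sample_rows = [
    {"gender": "female", "lunch": "standard", "test preparation course": "none",
     "math score": "72", "reading score": "72", "writing score": "74"},
    {"gender": "male", "lunch": "free/reduced", "test preparation course": "completed",
     "math score": "69", "reading score": "90", "writing score": "88"},
    {"gender": "female", "lunch": "standard", "test preparation course": "none",
     "math score": "47", "reading score": "57", "writing score": "44"},
]


def is_numeric(values):
    """Return True if every value in the column parses as a number."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


def split_columns(rows):
    """Split column names into (categorical, numerical) lists."""
    categorical, numerical = [], []
    for name in rows[0]:
        column = [row[name] for row in rows]
        if is_numeric(column):
            numerical.append(name)
        else:
            categorical.append(name)
    return categorical, numerical


categorical, numerical = split_columns(sample_rows)
print("Categorical:", categorical)
print("Numerical:", numerical)
```

With the real file, the rows could be read with `csv.DictReader` and passed to `split_columns` unchanged.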
📊 Visualizations and Analysis
To understand the impact of the test preparation course on student performance, the average scores in Mathematics, Reading, and Writing were calculated separately for students who completed the preparation course and those who did not.
From the analysis, it is observed that students who completed the test preparation course scored higher on average in all three subjects compared to students who did not complete the course.
In Mathematics, the average score was 69.70 for students who completed the course, compared with 64.08 for those who did not.
In Reading, the average score was 73.89, compared with 66.53.
In Writing, the gap was the largest: 74.42, compared with 64.50, a difference of about 9.9 points.
This indicates that participation in the test preparation course is associated with better academic performance. The difference is particularly noticeable in writing scores.
The visualization shows that, on average, students who completed the preparation course outperformed those who did not in all three subjects. This is consistent with structured preparation improving exam outcomes, although a comparison of group averages alone cannot establish causation.
CHART-1
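The group averages discussed above are conditional means: the mean score given the value of the test preparation variable. A minimal sketch of that calculation follows, using only the Python standard library. The records and the short key names (`prep`, `math`, `reading`, `writing`) are hypothetical and chosen for illustration; they are not rows from the dataset and will not reproduce the averages reported above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records for illustration only; not actual dataset rows.
records = [
    {"prep": "completed", "math": 70, "reading": 75, "writing": 76},
    {"prep": "completed", "math": 68, "reading": 73, "writing": 72},
    {"prep": "none", "math": 65, "reading": 66, "writing": 64},
    {"prep": "none", "math": 63, "reading": 68, "writing": 66},
]


def group_means(rows, group_key, score_keys):
    """Return {group: {score_key: mean score}} for each value of group_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in score_keys}
        for group, members in grouped.items()
    }


means = group_means(records, "prep", ["math", "reading", "writing"])
for group, scores in means.items():
    print(group, scores)
```

If pandas is available, the same result could be obtained with a `groupby` on the preparation column followed by `mean()` on the three score columns.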
To analyze the relationship between reading and writing performance, a scatter plot was created using reading scores on the x-axis and writing scores on the y-axis.
The scatter plot shows a clear upward linear pattern, indicating that as reading scores increase, writing scores also increase.
The correlation coefficient between reading and writing scores was calculated as:
r = 0.9546
Since the correlation coefficient is very close to +1, this indicates a very strong positive linear relationship between the two variables.
This means that students who score well in reading tend to score well in writing too. The strength of the correlation suggests that the two skills are closely related.
From a statistical perspective, this demonstrates how correlation quantifies the strength and direction of association between two random variables. The high correlation value confirms that reading and writing scores move together in a consistent and predictable manner.
CHART-2
Scatter Plot
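To make the correlation calculation concrete, here is a minimal sketch of the Pearson correlation coefficient, computed as the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` and the two short score lists are hypothetical and used only for illustration; they are not the dataset, so the result will differ from the value of r reported above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (used in the denominator of r).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for illustration only; not actual dataset values.
reading = [72, 90, 95, 57, 78]
writing = [74, 88, 93, 44, 75]

r = pearson_r(reading, writing)
print(round(r, 4))
```

The common factor of 1/n (or 1/(n-1)) in the covariance and variances cancels in the ratio, which is why the sketch works with raw sums.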
This dataset demonstrates how statistical tools and probability concepts can be applied to real-world educational data. By analyzing score distributions, relationships between variables, and conditional probabilities, we gain meaningful insights into factors affecting student performance. This activity strengthened my understanding of how probability theory connects with practical data analysis.