EDA (Exploratory Data Analysis) is the process of examining and summarizing a dataset to understand its key characteristics before applying any modeling or advanced analysis.
We have a data set of 768 people who were or were not diagnosed with diabetes.
There are eight input variables and one output variable.
The variables can be summarized as follows:
Input Variables (X):
Pregnancies=Number of times pregnant
Glucose=Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure=Diastolic blood pressure (mm Hg)
SkinThickness=Triceps skin fold thickness (mm)
Insulin=2-Hour serum insulin (mu U/ml)
BMI=Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction=a function which scores likelihood of diabetes based on family history
Age=Age (years)
Output Variable (y):
Outcome=Class variable (0 or 1)
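As a minimal sketch, the variable list above can be written down as a schema check in pandas. The helper name `check_schema`, the file name in the comment, and the two sample rows are assumptions made for illustration; they are not taken from the actual data set.

```python
import pandas as pd

# Column names as listed above: eight inputs and one output.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"


def check_schema(df: pd.DataFrame) -> None:
    """Raise ValueError if a column is missing or Outcome is not 0/1."""
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not df[TARGET].isin([0, 1]).all():
        raise ValueError("Outcome must contain only 0 or 1")


# In practice the frame would come from a file, for example:
#   df = pd.read_csv("diabetes.csv")   # hypothetical path
# The two rows below are made up so the sketch runs on its own.
sample = pd.DataFrame(
    [[1, 100, 70, 20, 80, 25.0, 0.3, 30, 0],
     [4, 150, 80, 30, 0, 33.5, 0.6, 45, 1]],
    columns=FEATURES + [TARGET],
)

check_schema(sample)
X, y = sample[FEATURES], sample[TARGET]  # inputs (X) and output (y)
print(X.shape, y.shape)
```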
STEP 2: Understanding the Data
The “Outcome” feature is the target variable we aim to predict:
0 = No Diabetes
1 = Diabetes
Out of 768 total records:
500 cases are labeled as 0 (No Diabetes)
268 cases are labeled as 1 (Diabetes)
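The class balance can be checked with `value_counts`. The sketch below rebuilds the Outcome column from the counts reported above, as a stand-in for the real column, and shows that roughly 65% of records are non-diabetic and 35% are diabetic.

```python
import pandas as pd

# Stand-in for the real Outcome column, rebuilt from the reported counts.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts().sort_index()   # records per class
shares = (counts / counts.sum()).round(3)      # fraction per class

print(counts.to_dict())
print(shares.to_dict())
```

A split of about 65/35 is a moderate imbalance, which is worth keeping in mind when choosing evaluation metrics later.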
STEP 3: Statistical Analysis
Invalid Zeros Detected In:
Glucose, BloodPressure, SkinThickness, Insulin, BMI
These columns should not have zero values → treat them as missing.
Action Taken:
Replace zero values with NaN → makes it easier to count and handle missing data.
Later, impute NaNs with appropriate values (e.g. mean or median)
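The zero-handling step above might look like the following in pandas. This is a sketch under assumptions: the four rows are made up for illustration, and median imputation is only one of the options mentioned above.

```python
import numpy as np
import pandas as pd

# Columns where a zero is treated as a missing measurement.
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Replace zeros with NaN so that missing values can be counted.
df[ZERO_INVALID] = df[ZERO_INVALID].replace(0, np.nan)
missing_counts = df.isna().sum()
print(missing_counts)

# Impute each column with its own median (computed ignoring NaN).
imputed = df.fillna(df.median())
print(imputed)
```

The median is often preferred over the mean for skewed columns such as Insulin, because it is less affected by extreme values.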
Outliers in Insulin:
Some Insulin values lie more than 3 standard deviations from the mean, indicating extreme outliers.
To reduce their impact, remove data points that lie more than 2 standard deviations from the mean.
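A standard-deviation filter of this kind can be sketched as below. The helper name `drop_outliers` and the sample values are assumptions for illustration. Rows with NaN in the column are kept here so that they can still be imputed later.

```python
import pandas as pd


def drop_outliers(df: pd.DataFrame, column: str, n_std: float = 2.0) -> pd.DataFrame:
    """Keep rows within n_std standard deviations of the column mean.

    Rows where the column is NaN are kept unchanged.
    """
    mean, std = df[column].mean(), df[column].std()
    within = (df[column] - mean).abs() <= n_std * std
    return df[within | df[column].isna()]


# Made-up values for illustration: nine typical readings and one extreme one.
df = pd.DataFrame({"Insulin": [80, 90, 100, 110, 120, 95, 105, 85, 115, 846]})

filtered = drop_outliers(df, "Insulin", n_std=2.0)
print(len(df), len(filtered))
```

Note that the mean and standard deviation are themselves pulled upward by the extreme value, so a median-based rule (for example the interquartile range) is a common alternative.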
STEP 4: Data Visualization
Data visualization is the process of representing data using graphs and charts to make patterns, trends, and insights easier to understand. It plays a crucial role in exploratory data analysis (EDA) by helping us:
Spot distributions, outliers, and imbalances
Understand relationships between variables
Compare trends across different groups (e.g., diabetic vs. non-diabetic)
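As a small sketch of these ideas, the following uses matplotlib to draw a histogram and a box plot of Glucose split by Outcome. The eight rows are made up for illustration, and the choice of chart types is an assumption rather than the set of plots used in this analysis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Glucose": [85, 89, 116, 110, 148, 183, 137, 168],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms: one distribution per class.
for label, group in df.groupby("Outcome"):
    ax_hist.hist(group["Glucose"], bins=5, alpha=0.6, label=f"Outcome={label}")
ax_hist.set_xlabel("Glucose")
ax_hist.set_ylabel("Count")
ax_hist.legend()

# Box plots: compare spread and potential outliers between classes.
groups = [g["Glucose"].to_numpy() for _, g in df.groupby("Outcome")]
ax_box.boxplot(groups)
ax_box.set_xticklabels(["No Diabetes (0)", "Diabetes (1)"])
ax_box.set_ylabel("Glucose")

fig.tight_layout()
# Use fig.savefig(...) or plt.show() to view the result.
```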
DASHBOARD
Diabetes Correlation: Strong statistical significance (p < 0.001) in hemoglobin, hematocrit, and skeletal metrics for diabetic vs. non-diabetic individuals.
Pregnancy Impact: A higher number of pregnancies correlates with increased diabetes risk.
Glucose Levels: Diabetic individuals show a distinct glucose level distribution compared to non-diabetic individuals.
Age & Blood Pressure: Diabetes prevalence rises with age, often paired with elevated blood pressure.
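Relationships like these can be checked numerically with per-class averages and correlations against Outcome. The sketch below assumes the column names listed earlier and uses made-up rows, so its numbers illustrate the method only and are not results from the data set.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Pregnancies":   [1, 0, 2, 1, 6, 8, 3, 7],
    "Glucose":       [85, 89, 116, 110, 148, 183, 137, 168],
    "BloodPressure": [66, 66, 74, 70, 72, 64, 80, 88],
    "Age":           [31, 21, 30, 26, 50, 32, 33, 53],
    "Outcome":       [0, 0, 0, 0, 1, 1, 1, 1],
})

# Mean of every feature within each class (row 0 = non-diabetic, row 1 = diabetic).
group_means = df.groupby("Outcome").mean()
print(group_means)

# Pearson correlation of each feature with Outcome, strongest first.
corr_with_outcome = (
    df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
)
print(corr_with_outcome)
```

Correlation with a binary outcome only describes association in the sample; it does not by itself establish that a feature causes diabetes.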