Exploratory Data Analysis (EDA)

EDA (Exploratory Data Analysis) is the process of examining and summarizing a dataset to understand its key characteristics before applying any modeling or advanced analysis.

STEP 1: Explaining the data set

The data set contains records for 768 people, each labeled as diagnosed or not diagnosed with diabetes.

There are eight input variables and one output variable.

The variables can be summarized as follows:

Input Variables (X):

Pregnancies=Number of times pregnant
Glucose=Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure=Diastolic blood pressure (mm Hg)
SkinThickness=Triceps skin fold thickness (mm)
Insulin=2-Hour serum insulin (mu U/ml)
BMI=Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction=a function which scores likelihood of diabetes based on family history
Age=Age (years)

Output Variable (y):

Outcome=Class variable (0 or 1)
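
The split into eight inputs and one output described above can be sketched with pandas as follows. This is a minimal illustration, not the original analysis code: the two data rows are made-up placeholders, and the file name mentioned in the comment is a hypothetical path.

```python
import io
import pandas as pd

# Column names as listed above. The two data rows are made-up placeholders,
# not records from the real data set.
CSV_TEXT = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,100,28.5,0.45,33,0
5,160,80,35,0,33.1,0.62,50,1
"""

# With the real file this would be pd.read_csv("diabetes.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(CSV_TEXT))

X = df.drop(columns="Outcome")  # the eight input variables
y = df["Outcome"]               # the output variable (0 or 1)

print(X.shape, y.shape)
```

On the full data set, X would be expected to have 768 rows and 8 columns.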

STEP 2: Understanding the Data

The “Outcome” feature is the target variable we aim to predict:

0 = No Diabetes
1 = Diabetes

Out of 768 total records:

500 cases are labeled as 0 (No Diabetes)
268 cases are labeled as 1 (Diabetes)
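
The class balance follows directly from these counts. The short sketch below only redoes the arithmetic on the two numbers stated above; with a loaded DataFrame, a call such as df["Outcome"].value_counts() would be the usual way to obtain the counts (df is an assumed name).

```python
# Class counts as stated above.
counts = {0: 500, 1: 268}

total = sum(counts.values())

# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total, shares)  # 768 {0: 65.1, 1: 34.9}
```

Roughly two thirds of the records are non-diabetic, so the data set is moderately imbalanced.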

STEP 3: Statistical Analysis

Invalid Zeros Detected In:

Glucose, BloodPressure, SkinThickness, Insulin, BMI
A value of zero is not physiologically plausible in these columns → treat zeros as missing values.

Action Taken:

Replace zero values with NaN → makes it easier to count and handle missing data.
Later, impute the NaNs with an appropriate value (e.g. the column mean or median).
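
The two actions above can be sketched as follows. This is one possible implementation under stated assumptions: the helper name zeros_to_nan_and_impute is hypothetical, the median is chosen here only as an example of an "appropriate value", and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Columns in which a zero is treated as a missing value (from the list above).
ZERO_INVALID = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def zeros_to_nan_and_impute(df):
    """Hypothetical helper: mark zeros as NaN, then fill with the column median."""
    out = df.copy()
    out[ZERO_INVALID] = out[ZERO_INVALID].replace(0, np.nan)
    # Count missing values per column before filling them.
    missing = out[ZERO_INVALID].isna().sum()
    # median() skips NaN by default, so it uses only the valid values.
    out[ZERO_INVALID] = out[ZERO_INVALID].fillna(out[ZERO_INVALID].median())
    return out, missing

# Made-up toy rows, not taken from the real data set.
toy = pd.DataFrame({
    "Glucose": [120, 0, 140],
    "BloodPressure": [70, 80, 0],
    "SkinThickness": [30, 0, 20],
    "Insulin": [0, 100, 200],
    "BMI": [28.0, 0.0, 32.0],
})

cleaned, missing = zeros_to_nan_and_impute(toy)
print(missing)
print(cleaned)
```

The median is less sensitive to extreme values than the mean, which may matter for a skewed column such as Insulin.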

Outliers in Insulin:

Some Insulin values lie more than 3 standard deviations from the mean, indicating extreme outliers.
To reduce their impact, data points more than 2 standard deviations from the mean are removed.
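
A standard-deviation filter of this kind could look like the sketch below. The function name drop_outliers and the toy values are assumptions for illustration; the threshold of 2 standard deviations is the one stated above.

```python
import pandas as pd

def drop_outliers(df, column, n_std=2.0):
    """Hypothetical helper: keep rows within n_std standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    keep = (df[column] - mean).abs() <= n_std * std
    # Rows where the column is NaN compare as False and are dropped too,
    # so missing values should be imputed before this step.
    return df[keep]

# Made-up toy values with one extreme reading.
toy = pd.DataFrame({"Insulin": [80, 90, 100, 110, 120, 95, 105, 85, 115, 900]})

filtered = drop_outliers(toy, "Insulin", n_std=2.0)
print(len(toy), len(filtered))
```

Because the mean and standard deviation are themselves pulled upward by extreme values, a stricter threshold such as 2 removes outliers that a threshold of 3 can miss in small samples.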

STEP 4: Data Visualization

Data visualization is the process of representing data using graphs and charts to make patterns, trends, and insights easier to understand. It plays a crucial role in exploratory data analysis (EDA) by helping us:

Spot distributions, outliers, and imbalances
Understand relationships between variables
Compare trends across different groups (e.g., diabetic vs. non-diabetic)
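
As a numeric counterpart to a grouped chart, the comparison between diabetic and non-diabetic groups can be sketched with a pandas groupby. The rows below are made-up toy values, not results from the data set, and the choice of Glucose and Age is only an example.

```python
import pandas as pd

# Made-up toy rows, not taken from the real data set.
toy = pd.DataFrame({
    "Outcome": [0, 0, 1, 1],
    "Glucose": [100, 110, 150, 170],
    "Age": [25, 35, 45, 55],
})

# Mean of each feature per outcome group (0 = No Diabetes, 1 = Diabetes).
group_means = toy.groupby("Outcome")[["Glucose", "Age"]].mean()
print(group_means)
```

The same grouping is what a side-by-side histogram or box plot would display visually.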

DASHBOARD

Diabetes Correlation: Strong statistical significance (p < 0.001) in hemoglobin, hematocrit, and skeletal metrics for diabetic vs. non-diabetic individuals.
Pregnancy Impact: A higher number of pregnancies correlates with increased diabetes risk.
Glucose Levels: Diabetic individuals show a distinct glucose level distribution compared to non-diabetic individuals.
Age & Blood Pressure: Diabetes prevalence rises with age, often paired with elevated blood pressure.
