We used the Stroke Prediction Dataset from Kaggle, consisting of 5,110 patient records and 12 attributes. This dataset is widely used in medical AI research for classifying stroke risk based on patient history.
Total Records: 5,110 patients
Attributes: 12
Target Variable: stroke (Binary: 0 = No, 1 = Yes)
age (float): Patient's age
hypertension (int): 1 (Yes), 0 (No)
heart_disease (int): 1 (Yes), 0 (No)
avg_glucose_level (float): Measured in mg/dL
bmi (float): Body Mass Index (some missing values)
gender, ever_married, work_type, Residence_type, smoking_status: All categorical
Insights from Exploration
Insight: The dataset is heavily imbalanced, with the majority of samples labeled as stroke = 0. The minority class (stroke = 1) is significantly underrepresented (≈4%).
Implication: Without handling this imbalance, most models will default to predicting the majority class, leading to high accuracy but poor recall for stroke prediction.
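The accuracy/recall gap described above can be illustrated with a minimal sketch. The labels below are synthetic (96 negatives, 4 positives) and only mimic the roughly 4% positive rate; the "model" is a hypothetical baseline that always predicts the majority class.

```python
import pandas as pd

# Synthetic labels standing in for the real `stroke` column (assumption):
# 96 negatives and 4 positives mimic the roughly 4% positive rate.
stroke = pd.Series([0] * 96 + [1] * 4, name="stroke")

# Share of each class in the data.
class_share = stroke.value_counts(normalize=True)

# Baseline "model" that always predicts the majority class.
majority_label = class_share.idxmax()
predictions = pd.Series(majority_label, index=stroke.index)

# Accuracy looks excellent because negatives dominate.
accuracy = (predictions == stroke).mean()

# Recall = true positives / actual positives; the baseline finds none.
true_positives = ((predictions == 1) & (stroke == 1)).sum()
recall = true_positives / (stroke == 1).sum()

print(class_share)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

This is why recall (or a similar minority-class metric) is a more informative yardstick than accuracy for this target.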
Insight: bmi contains missing (NULL) values in fewer than 4% of records.
Implication: These can be imputed with the median bmi, which is less sensitive than the mean to extreme values.
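A minimal sketch of median imputation for bmi follows. The values are synthetic, and the extra `bmi_missing` flag column is our own illustrative addition, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; values are illustrative only.
df = pd.DataFrame({"bmi": [22.0, 28.5, np.nan, 31.0, 45.0, np.nan, 26.0]})

# Fraction of rows where bmi is missing.
missing_share = df["bmi"].isna().mean()

# Hypothetical helper column: remember which rows were imputed.
df["bmi_missing"] = df["bmi"].isna().astype(int)

# Series.median() skips NaN by default, so it uses observed values only.
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)

print(f"missing share: {missing_share:.1%}, median used: {bmi_median}")
```

In a modeling pipeline, the median would normally be computed on the training split only and then reused for validation and test data, to avoid leakage.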
Age
Distribution: Fairly uniform with slight peaks at middle-aged and elderly ranges.
Note: Stroke cases increase with age, which is consistent with age being a critical risk factor.
Avg Glucose Level
Distribution: Right-skewed. Many patients have glucose levels below 150, but there are long tails up to 250+.
Implication: Outliers may influence model behavior — consider transformation or binning if needed.
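One possible transformation is a log transform, sketched below on synthetic glucose values. The choice of `np.log1p` is an assumption for illustration; binning would be an equally valid option.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for avg_glucose_level (mg/dL).
glucose = pd.Series(
    [70.0, 85.0, 92.0, 105.0, 110.0, 130.0, 210.0, 260.0],
    name="avg_glucose_level",
)

# Sample skewness before the transform (positive means right-skewed).
skew_before = glucose.skew()

# log1p(x) = log(1 + x); compresses the long right tail.
glucose_log = np.log1p(glucose)
skew_after = glucose_log.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Tree-based models are largely insensitive to monotonic transforms like this one, so the step matters mainly for linear or distance-based models.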
BMI
Distribution: Approximates a normal distribution but with a long right tail.
Implication: A few extreme outliers exist (e.g., BMI > 60). May consider capping/extreme value treatment for robustness.
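Capping can be sketched as follows. The data are synthetic, and the cap of 60 is a hypothetical threshold taken from the example above rather than a validated clinical limit.

```python
import pandas as pd

# Synthetic BMI values with two extreme outliers.
bmi = pd.Series(
    [18.5, 22.0, 24.5, 27.0, 29.5, 31.0, 35.0, 41.0, 64.0, 92.0],
    name="bmi",
)

# Hypothetical cap (assumption): values above this are treated as extreme.
BMI_CAP = 60.0

# clip(upper=...) replaces anything above the cap with the cap itself.
bmi_capped = bmi.clip(upper=BMI_CAP)

# Number of values that were affected by the cap.
n_capped = int((bmi > BMI_CAP).sum())

print(f"capped {n_capped} values; new max = {bmi_capped.max()}")
```

A quantile-based cap (for example, `bmi.quantile(0.99)`) is a common alternative when no fixed threshold is available.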
Gender
Balanced between Male and Female — no clear gender bias.
Hypertension
Those with hypertension show a higher stroke proportion despite being a minority — strong predictor.
Heart Disease
Heart disease patients show an elevated stroke rate — strong predictor.
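The comparisons for hypertension and heart disease boil down to a stroke rate per group. The sketch below uses a tiny synthetic frame, and `stroke_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# Synthetic rows; only the column names match the real dataset.
df = pd.DataFrame({
    "hypertension":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "heart_disease": [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    "stroke":        [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
})


def stroke_rate_by(frame, column):
    """Return group size and stroke proportion for each level of `column`."""
    return frame.groupby(column)["stroke"].agg(n="count", stroke_rate="mean")


by_hypertension = stroke_rate_by(df, "hypertension")
by_heart_disease = stroke_rate_by(df, "heart_disease")

print(by_hypertension)
print(by_heart_disease)
```

Reporting the group size next to the rate makes it clear when a high proportion rests on only a few patients.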
Ever Married
Most stroke patients were married. Could be confounded by age, as older individuals are more likely to be married and at higher risk.
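One way to probe the suspected age confounding is to compare stroke rates by marital status within age bands. The sketch below uses synthetic rows, and the single cut at age 60 is an arbitrary assumption.

```python
import pandas as pd

# Synthetic rows constructed so that married patients skew older.
df = pd.DataFrame({
    "age":          [25, 30, 35, 28, 62, 70, 75, 68, 80, 66],
    "ever_married": ["No", "No", "Yes", "Yes", "No",
                     "Yes", "Yes", "Yes", "Yes", "No"],
    "stroke":       [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Hypothetical age bands (assumption): at most 60 versus above 60.
df["age_band"] = pd.cut(df["age"], bins=[0, 60, 120], labels=["<=60", ">60"])

# Unadjusted comparison: stroke rate by marital status.
overall = df.groupby("ever_married")["stroke"].mean()

# Stratified comparison: same rate, but within each age band.
stratified = (
    df.groupby(["age_band", "ever_married"], observed=True)["stroke"]
    .mean()
    .unstack()
)

print(overall)
print(stratified)
```

In this synthetic example the married group has a higher overall rate, but the difference disappears within each age band, which is the pattern confounding by age would produce.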
Work Type
The Private category dominates, with stroke cases also relatively high in Govt_job and Self-employed. This could relate to lifestyle stress or healthcare access.
Residence Type
Stroke cases between the two appear proportionally similar, so this might not be a significant predictor.
Smoking Status
The distribution shows strokes slightly more frequent among the formerly smoked and never smoked groups, though this may again be age-related. Unknown is a large group, so it needs an explicit handling decision (e.g., impute it or keep it as a flagged category).
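Keeping Unknown as an explicit category, rather than imputing it, can be sketched in two ways. The rows are synthetic, and the `smoking` prefix and `smoking_unknown` flag are illustrative names.

```python
import pandas as pd

# Synthetic values using the category labels found in smoking_status.
smoking = pd.Series(
    ["never smoked", "Unknown", "formerly smoked",
     "smokes", "Unknown", "never smoked"],
    name="smoking_status",
)

# Option 1: one-hot encode, so "Unknown" becomes its own indicator column.
one_hot = pd.get_dummies(smoking, prefix="smoking", dtype=int)

# Option 2: a single explicit flag marking rows with unknown status.
smoking_unknown = (smoking == "Unknown").astype(int)

print(one_hot)
print(smoking_unknown.tolist())
```

Either option lets a model learn whether missing smoking information is itself informative, which imputing the most frequent category would hide.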
Key Correlations with Stroke:
Age (0.23) – the strongest correlation (positive), expected for stroke prediction.
Hypertension (0.14) and Heart Disease (0.13) – weaker but still positive correlations.
Avg Glucose Level and BMI – weak positive correlation with stroke.
Work Type (children) shows a negative correlation with stroke, most likely an age effect, since this category covers the youngest patients.
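Correlations of this kind can be computed by one-hot encoding the categorical columns and correlating each column with stroke. The sketch below assumes Pearson correlation (the pandas default) and uses synthetic rows, so its numbers will not match the values reported above.

```python
import pandas as pd

# Synthetic rows; column names and work_type labels follow the dataset.
df = pd.DataFrame({
    "age":          [5, 10, 34, 45, 52, 61, 67, 72, 79, 81],
    "hypertension": [0, 0, 0, 0, 1, 0, 1, 0, 1, 1],
    "work_type":    ["children", "children", "Private", "Private",
                     "Govt_job", "Private", "Self-employed", "Private",
                     "Self-employed", "Private"],
    "stroke":       [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# One-hot encode work_type so each category gets its own 0/1 column.
encoded = pd.get_dummies(df, columns=["work_type"], dtype=int)

# Pearson correlation of every column with the target, strongest first.
corr_with_stroke = (
    encoded.corr()["stroke"]
    .drop("stroke")
    .sort_values(ascending=False)
)

print(corr_with_stroke)
```

For binary features, Pearson correlation with a binary target is the phi coefficient, so small values such as 0.13 can still correspond to a clear difference in stroke rates between groups.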