Before modeling, we preprocessed the data:
Handling Missing Values
Dropping Non-Predictive Columns
Handling Categorical Data
Feature Engineering
1. Handling Missing Values
We checked every column for missing (null) values. Only the bmi column had any (about 3.9% of its values). Because this share is below 5%, we filled the gaps with the column median.
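The check and the median fill can be sketched as follows, assuming the data sits in a pandas DataFrame. Only the bmi column name comes from the dataset; the toy values and the variable names are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "age": [67, 61, 80, 49, 79],
    "bmi": [36.6, None, 32.5, 34.4, None],
})

# Percentage of missing values in each column.
missing_pct = df.isna().mean() * 100

# Median of the observed bmi values; NaN entries are skipped by default.
bmi_median = df["bmi"].median()

# Replace the missing bmi entries with that median.
df["bmi"] = df["bmi"].fillna(bmi_median)
```

The median is a common choice here because, unlike the mean, it is not pulled toward extreme values.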
2. Dropping Non-Predictive Columns
The id column serves purely as a unique identifier and does not contribute to the prediction task. We removed it from the dataset to avoid introducing noise or misleading the model.
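In pandas this step is a single call. The sketch below assumes a DataFrame with an id column; the other columns and all values are illustrative.

```python
import pandas as pd

# Toy frame; the id values are placeholders, not real records.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [67, 61, 80],
    "bmi": [36.6, 28.1, 32.5],
})

# Drop the identifier column. errors="ignore" makes the step safe to
# re-run on a frame where id has already been removed.
df = df.drop(columns=["id"], errors="ignore")
```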
3. Handling Categorical Data
Several columns (gender, marital status, work type, and smoking status) hold text values. Since most machine learning models expect numeric input, we applied Label Encoding, which maps each category in a column to an integer code so that the models can use these columns.
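One way to apply Label Encoding is scikit-learn's LabelEncoder, shown here as a sketch rather than the exact code used. The column names gender and smoking_status and the category strings are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; column names and category strings are assumed, not confirmed.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["never smoked", "smokes", "formerly smoked", "never smoked"],
})

categorical_cols = ["gender", "smoking_status"]
encoders = {}  # keep one fitted encoder per column for later decoding

for col in categorical_cols:
    enc = LabelEncoder()
    # fit_transform learns the sorted set of categories and returns integer codes.
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc
```

Keeping the fitted encoders makes it possible to recover the original labels with `encoders[col].inverse_transform(...)`. LabelEncoder assigns codes in sorted order of the category strings, so the numeric order carries no inherent meaning.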
4. Feature Engineering
To make the dataset more meaningful for prediction, we created new features:
Grouped Features: We binned age, glucose, and BMI into categories (for example, age groups and glucose-level bands) to make trends easier to spot.
Interaction Features: We combined important factors, such as age × glucose and glucose × BMI, to capture more complex patterns.
Risk Scores: We added new features like:
cardiovascular_risk (based on hypertension and heart disease),
lifestyle_risk (from smoking behavior),
health_score (a combined score based on multiple health risks).
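The three kinds of features above could be built along these lines. This is a sketch under stated assumptions: the bin edges, bin labels, column names other than the three risk-score names, the smoking point values, and the scoring formulas are all hypothetical, since the text does not specify them.

```python
import pandas as pd

# Toy frame; column names and values are assumed for illustration.
df = pd.DataFrame({
    "age": [25, 52, 71],
    "glucose": [85.0, 130.0, 210.0],
    "bmi": [22.0, 31.5, 27.0],
    "hypertension": [0, 1, 1],
    "heart_disease": [0, 0, 1],
    "smoking_status": ["never smoked", "formerly smoked", "smokes"],
})

# Grouped features: pd.cut assigns each value to a right-closed interval.
# Bin edges and labels are illustrative, not clinical definitions.
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 60, 120],
                         labels=["young", "middle", "senior"])
df["glucose_group"] = pd.cut(df["glucose"], bins=[0, 100, 140, 1000],
                             labels=["low", "medium", "high"])
df["bmi_group"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                         labels=["underweight", "normal", "overweight", "obese"])

# Interaction features: element-wise products of two numeric columns.
df["age_x_glucose"] = df["age"] * df["glucose"]
df["glucose_x_bmi"] = df["glucose"] * df["bmi"]

# Risk scores (assumed formulas).
# cardiovascular_risk: count of the two binary cardiovascular flags.
df["cardiovascular_risk"] = df["hypertension"] + df["heart_disease"]

# lifestyle_risk: hypothetical points per smoking category; unknown
# categories fall back to 0.
smoking_points = {"never smoked": 0, "formerly smoked": 1, "smokes": 2}
df["lifestyle_risk"] = df["smoking_status"].map(smoking_points).fillna(0).astype(int)

# health_score: simple sum of the two risk scores above.
df["health_score"] = df["cardiovascular_risk"] + df["lifestyle_risk"]
```

Binned columns produced by `pd.cut` are categorical, so they would need the same encoding step as the other text columns before being passed to a model.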