Data Preprocessing

Before modeling, we preprocessed the data in four steps:

Handling Missing Values
Dropping Non-Predictive Columns
Handling Categorical Data
Feature Engineering

1. Handling Missing Values

We checked the dataset for missing (null) values. Only the bmi column had any, with about 3.9% of its values missing. Because fewer than 5% of the values were missing, we imputed them with the column median.
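The check and the median fill can be sketched with pandas roughly as follows. This is a minimal illustration on toy data, not the project's actual code: only the bmi column name comes from the text, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented).
df = pd.DataFrame({"bmi": [22.0, np.nan, 31.5, 27.0, np.nan, 24.5]})

# Percentage of missing values in each column.
missing_pct = df.isna().mean() * 100

# The median is computed over the non-missing values only.
bmi_median = df["bmi"].median()

# Replace every missing bmi value with that median.
df["bmi"] = df["bmi"].fillna(bmi_median)
```

The median is less sensitive to extreme values than the mean, which is a common reason to prefer it for a skewed measurement such as BMI.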

2. Dropping Non-Predictive Columns

The id column is purely a unique identifier and contributes nothing to the prediction task. We removed it from the dataset so that it would not introduce noise or mislead the model.
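In pandas, removing the identifier is a one-line operation. The sketch below uses a toy frame; the age column and all values are placeholders chosen for illustration.

```python
import pandas as pd

# Toy frame with an identifier column and one real feature (values invented).
df = pd.DataFrame({"id": [101, 102, 103], "age": [67, 61, 80]})

# Drop the identifier; drop() returns a new frame, so reassign it.
df = df.drop(columns=["id"])
```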

3. Handling Categorical Data

Columns such as gender, marital status, work type, and smoking status hold text values. Because most machine learning models require numeric input, we used Label Encoding to map each category to an integer code. Every distinct category keeps its own code, so the information in the column is preserved in a form the models can use.
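A minimal sketch of this kind of encoding is shown below. It uses pandas category codes rather than a dedicated encoder class; for string columns without missing values this assigns integers in sorted category order, which is the same scheme scikit-learn's LabelEncoder uses. The column names and category values are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names and category values are assumed for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["never smoked", "smokes", "formerly smoked", "never smoked"],
})

mappings = {}
for col in ["gender", "smoking_status"]:
    cat = df[col].astype("category")
    # Remember which integer stands for which category so codes can be decoded later.
    mappings[col] = dict(enumerate(cat.cat.categories))
    # Replace the text values with their integer codes.
    df[col] = cat.cat.codes
```

Keeping the mappings dictionary makes it possible to translate the codes back into readable labels when interpreting model results. The codes carry no real order, which matters mostly for models that treat numeric features as magnitudes.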

4. Feature Engineering

To make the dataset more meaningful for prediction, we created new features:

Grouped Features: We binned age, glucose, and BMI into categories (for example, age groups and glucose levels) to make trends easier to spot.
Interaction Features: We combined important factors as products, such as age × glucose and glucose × BMI, to capture more complex patterns.
Risk Scores: We added composite features, including:
- cardiovascular_risk (based on hypertension and heart disease),
- lifestyle_risk (from smoking behavior),
- health_score (a combined score based on multiple health risks).
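The three kinds of features can be sketched as follows. The feature names cardiovascular_risk, lifestyle_risk, and health_score come from the list above; everything else here is an assumption made for illustration, including the input column names, the bin edges, the smoking point values, and the way the scores are summed. The exact formulas used in the project are not given in this section.

```python
import pandas as pd

# Toy data; input column names and values are assumed for illustration.
df = pd.DataFrame({
    "age": [25, 52, 71],
    "avg_glucose_level": [85.0, 130.0, 210.0],
    "bmi": [22.0, 28.5, 33.0],
    "hypertension": [0, 1, 1],
    "heart_disease": [0, 0, 1],
    "smoking_status": ["never smoked", "formerly smoked", "smokes"],
})

# Grouped features: bin edges and labels are illustrative, not clinical thresholds.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 40, 60, 120], labels=["young", "middle", "senior"]
)
df["glucose_group"] = pd.cut(
    df["avg_glucose_level"], bins=[0, 100, 140, 1000], labels=["low", "medium", "high"]
)

# Interaction features: element-wise products of two numeric columns.
df["age_x_glucose"] = df["age"] * df["avg_glucose_level"]
df["glucose_x_bmi"] = df["avg_glucose_level"] * df["bmi"]

# Risk scores: hypothetical formulas chosen only to show the idea.
df["cardiovascular_risk"] = df["hypertension"] + df["heart_disease"]
smoking_points = {"never smoked": 0, "formerly smoked": 1, "smokes": 2}  # assumed weights
df["lifestyle_risk"] = df["smoking_status"].map(smoking_points)
df["health_score"] = df["cardiovascular_risk"] + df["lifestyle_risk"]
```

If smoking status has already been label encoded, the mapping would need to use the integer codes instead of the text labels.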
