PREPROCESSING

Since the original dataset was not fully cleaned, several preprocessing steps were crucial before model development. These steps ensured that our machine learning algorithms received clean, reliable input data and could generate meaningful results.

Check Data Distribution

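After an initial exploratory data analysis (EDA) and a review of each feature's distribution, we determined that replacing zero values with the feature's median was the most appropriate approach. Zeros in clinical measurements typically stand in for missing entries rather than real readings, and the median is less sensitive to skew and extreme values than the mean.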
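A minimal sketch of median imputation for zero values, assuming pandas. The column names and numbers are hypothetical placeholders, not the dataset's actual schema, and computing the median from the non-zero values only is an assumption about the procedure rather than something stated above.

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders for illustration.
df = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137],
    "bmi": [33.6, 26.6, 23.3, 0.0, 43.1],
})


def replace_zeros_with_median(frame, columns):
    """Return a copy in which zeros in `columns` become the column median.

    Assumption: the median is taken over non-zero values only, so the
    zero placeholders do not pull it downward.
    """
    out = frame.copy()
    for col in columns:
        values = out[col].astype(float)
        non_zero = values.mask(values == 0)  # zeros become NaN
        out[col] = non_zero.fillna(non_zero.median())
    return out


cleaned = replace_zeros_with_median(df, ["glucose", "bmi"])
print(cleaned)
```

Only columns where a zero cannot be a real reading should be passed in; a count-like feature that can legitimately be zero would be left out of `columns`.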

Class Imbalance

The original dataset was imbalanced, with 65.1% non-diabetic and 34.9% diabetic cases. To prevent the model from being biased toward the majority class, we applied various resampling techniques:

SMOTE
Random Oversampling
Random Undersampling
SMOTE + Tomek Links
SMOTE + ENN

Best Resampling Method

Resampling improved model performance, most of all with SMOTE + ENN, which we recommend for final model training because of its high accuracy and consistency.

Outlier Detection and Treatment

Outliers are extreme values that deviate significantly from the rest of the dataset. In clinical datasets like this one, they can:

Skew the distribution
Distort model training
Mislead feature importance and correlations

To ensure data quality and improve model performance, we performed outlier detection and treatment.
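
The text does not name the detection or treatment method, so the sketch below uses one common choice purely as an illustration: the interquartile range (IQR) rule for detection and capping (clipping to the fences) for treatment. The 1.5 multiplier, the function name, and the sample values are assumptions.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and clip them to the fences.

    Returns the capped series and a boolean mask of the flagged values.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return series.clip(lower=lower, upper=upper), is_outlier


# Hypothetical feature values with one extreme reading.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="hypothetical_feature")
capped, flagged = cap_outliers_iqr(values)
print(capped.tolist())
print(flagged.tolist())
```

Capping keeps every row, which matters for a small clinical dataset; dropping flagged rows is the alternative when the extreme values are believed to be recording errors.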

Feature Scaling

We used StandardScaler from scikit-learn to standardize all numerical features. This transformation:

Centers each feature around mean = 0
Scales it to have standard deviation = 1

This puts all features on a comparable scale, so that features with large numeric ranges do not dominate models that are sensitive to feature magnitude.
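
A short sketch of the transformation with scikit-learn's StandardScaler. The two-column array is made-up data chosen so that the columns differ in scale by two orders of magnitude.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column is on a much larger scale than the first.
X = np.array([
    [1.0, 200.0],
    [2.0, 300.0],
    [3.0, 400.0],
    [4.0, 500.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z = (x - mean) / std, per column

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

With a train/test split, the scaler is usually fitted on the training data alone and then applied to the test data with `scaler.transform`, so that test statistics do not leak into training.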
