Check Data Distribution
After an initial exploratory data analysis (EDA) and a review of each feature's distribution, we chose to replace zero values with the feature's median as the most appropriate approach.
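A minimal sketch of median replacement with pandas is shown below. The column names and values are hypothetical, and the sketch assumes the median is computed from the non-zero entries only, so that the invalid zeros do not pull the median down; the original analysis may have computed it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137],
    "bmi": [33.6, 26.6, 23.3, 0.0, 43.1],
})

# Assumption: these are the columns where a zero is not a plausible measurement.
cols_with_invalid_zeros = ["glucose", "bmi"]


def replace_zeros_with_median(frame, columns):
    """Return a copy in which zeros in `columns` are replaced by the
    median of that column's non-zero values."""
    out = frame.copy()
    for col in columns:
        # Mark zeros as missing so they are excluded from the median.
        non_zero = out[col].replace(0, np.nan)
        out[col] = non_zero.fillna(non_zero.median())
    return out


cleaned = replace_zeros_with_median(df, cols_with_invalid_zeros)
print(cleaned)
```

The median is less sensitive to extreme values than the mean, which is a common reason to prefer it for skewed features.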
Outlier Detection and Treatment
Outliers are extreme values that deviate significantly from the rest of the dataset. In clinical datasets like this one, they can:
Skew the distribution
Distort model training
Mislead feature importance and correlations
To ensure data quality and improve model performance, we performed outlier detection and treatment.
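The text does not say which detection rule was applied. As one common illustration only, the sketch below uses the interquartile range (IQR) rule with the conventional 1.5 multiplier and treats outliers by clipping them to the fences. Both the rule and the clipping treatment are assumptions here, not the method used in this analysis, and the data is made up.

```python
import pandas as pd


def clip_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences.

    Returns the clipped series and a boolean mask of the original outliers.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return series.clip(lower=lower, upper=upper), is_outlier


# Hypothetical feature with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="hypothetical_feature")
clipped, is_outlier = clip_outliers_iqr(values)

print(is_outlier.sum(), "outlier(s) detected")
print(clipped.tolist())
```

Clipping keeps every row in the dataset, which can matter for small clinical samples; dropping the flagged rows is an alternative when the extreme values are believed to be recording errors.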
We used StandardScaler from scikit-learn to standardize all numerical features. This transformation:
Centers each feature around mean = 0
Scales it to have standard deviation = 1
This puts all features on a comparable scale, so that features with larger numeric ranges do not dominate model training.
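The standardization described above can be sketched as follows. The array is a made-up example with two features on very different scales; after scaling, each column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numerical features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then applies z = (x - mean) / std column by column.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

When the data is split into training and test sets, a common practice is to call `fit` on the training set only and then `transform` both sets, so that statistics from the test set do not leak into training.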