An anomaly refers to a pattern in the data that does not conform to expected behavior. Anomalies can be indicative of novel, rare, or unexpected events.
An outlier is a data point that significantly deviates from the other observations in a dataset. Outliers can occur due to variability in the data or experimental errors.
1. Anomaly Detection
To improve data quality and remove rare or extreme records, an Isolation Forest was applied for anomaly detection. It flagged about 2.02% of the data (103 rows) as anomalous, and these rows were removed. PCA projections of the data before and after removal were plotted to check the effect of this step visually.
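As a rough illustration of this step, the sketch below runs scikit-learn's IsolationForest on synthetic data and projects the result with PCA. The contamination value of 0.02, the random seed, the feature columns, and the helper name remove_anomalies are all assumptions chosen for the example; the original configuration is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest


def remove_anomalies(df, numeric_cols, contamination=0.02, seed=42):
    """Drop rows that Isolation Forest labels as anomalies.

    Hypothetical helper: contamination and seed are assumed values.
    """
    model = IsolationForest(contamination=contamination, random_state=seed)
    labels = model.fit_predict(df[numeric_cols])  # 1 = inlier, -1 = anomaly
    inlier_mask = labels == 1
    return df[inlier_mask].reset_index(drop=True), inlier_mask


# Synthetic stand-in for the real dataset (assumed columns and distributions).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.normal(45, 15, 1000),
    "bmi": rng.normal(28, 6, 1000),
    "avg_glucose_level": rng.normal(105, 30, 1000),
})

cleaned, inlier_mask = remove_anomalies(demo, list(demo.columns))
removed_share = 1 - inlier_mask.mean()

# Fit PCA once on the full data so both projections share the same axes.
pca = PCA(n_components=2)
before_2d = pca.fit_transform(demo)
after_2d = pca.transform(cleaned)
```

Plotting before_2d and after_2d as scatter plots side by side gives the kind of before/after comparison described above. Standardizing the features before PCA may be preferable when the columns are on very different scales.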
2. Capping Extreme Values
Two numeric attributes, bmi and avg_glucose_level, had long right-tailed distributions. To limit the influence of outliers on model performance, especially for algorithms that are sensitive to extreme values such as Logistic Regression, both were capped: BMI values at 60 and glucose levels at 250. Values above a threshold are replaced by the threshold itself, so no rows are dropped in this step. This is intended to make the models more robust and stable.
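A minimal sketch of the capping step, assuming the data lives in a pandas DataFrame. The thresholds come from the text above; the helper name cap_values and the sample rows are hypothetical.

```python
import pandas as pd

# Upper caps stated in the text: BMI at 60, glucose at 250.
CAPS = {"bmi": 60, "avg_glucose_level": 250}


def cap_values(df, caps):
    """Return a copy of df with each listed column clipped at its upper cap."""
    out = df.copy()
    for col, upper in caps.items():
        out[col] = out[col].clip(upper=upper)
    return out


# Made-up rows, used only to show the effect of capping.
demo = pd.DataFrame({
    "bmi": [22.5, 71.0, 97.6],
    "avg_glucose_level": [85.0, 271.7, 140.2],
})
capped = cap_values(demo, CAPS)
```

Because clip only replaces values above the cap, the number of rows is unchanged and values already below the thresholds are left as they are.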