We trained several machine learning models to predict stroke occurrence using the cleaned dataset. To identify the best-performing approach, we initially screened eight algorithms (a sketch of how these candidates can be set up follows the list):
Logistic Regression
SVM (Support Vector Machine)
K-Nearest Neighbors (KNN)
Decision Tree
Random Forest
XGBoost
Gaussian Naive Bayes
Bernoulli Naive Bayes
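As an illustration, these candidate classifiers could be set up as follows with scikit-learn and the xgboost package; this is a minimal sketch with placeholder settings, not the exact configurations used in our experiments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from xgboost import XGBClassifier

# Candidate classifiers for the initial comparison.
# Hyperparameters shown are illustrative defaults, not tuned values.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "Bernoulli Naive Bayes": BernoulliNB(),
}
```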
From these candidates we selected three models for closer evaluation, Logistic Regression, Random Forest, and XGBoost, based on our familiarity with them and the strengths outlined below.
Logistic Regression was chosen for its interpretability and suitability as a baseline classifier.
Random Forest was selected due to its ability to capture non-linear relationships and handle mixed data types.
XGBoost was included for its strong predictive performance and ability to handle imbalanced data using built-in regularization and boosting techniques.
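The sketch below shows one way the three selected models could be configured to cope with class imbalance (class weighting for Logistic Regression and Random Forest, scale_pos_weight and regularization for XGBoost); the specific values are illustrative assumptions, not our tuned hyperparameters.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Class weighting / scale_pos_weight counteract the rarity of stroke cases.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=300, class_weight="balanced", random_state=42
    ),
    "XGBoost": XGBClassifier(
        n_estimators=300,
        learning_rate=0.1,
        reg_lambda=1.0,        # L2 regularization
        scale_pos_weight=20,   # approx. ratio of negative to positive samples (placeholder)
        eval_metric="logloss",
        random_state=42,
    ),
}
```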
The dataset was split into training and test sets using a stratified 75/25 split, so that both sets preserved the original stroke class distribution. Because stroke cases were rare, we applied ADASYN to the training set to generate synthetic minority samples and reduce the imbalance.
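A minimal sketch of this step, assuming scikit-learn for the split and imbalanced-learn for ADASYN; the names X and y stand in for the cleaned feature matrix and the stroke labels.

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN

# Stratified 75/25 split keeps stroke prevalence equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# ADASYN is applied to the training set only, so the test set stays untouched.
X_train_res, y_train_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
```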
Before training, we reduced the feature space to the 15 most relevant features using a feature-selection step and scaled them to improve model performance.
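Since the exact selection criterion is not detailed here, the sketch below assumes a univariate SelectKBest with mutual information as an example; scaling is fit on the training data only and then applied to the test data.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

# Keep the 15 highest-scoring features (mutual information used here as an example criterion).
selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_train_sel = selector.fit_transform(X_train_res, y_train_res)
X_test_sel = selector.transform(X_test)

# Scale features so models such as Logistic Regression are not dominated by large-valued columns.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_sel)
X_test_scaled = scaler.transform(X_test_sel)
```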
Each of the three models was tuned to handle the imbalanced data. Instead of using the default 0.5 prediction threshold, we adjusted the decision threshold to prioritize high recall (≥80%), which is crucial in medical applications to avoid missing stroke cases.
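One way to implement the threshold adjustment is to scan the precision-recall curve and keep the highest threshold whose recall still meets the 80% target; the helper below is a hypothetical sketch, not the exact procedure we used, and would normally be applied to a validation set rather than the test set.

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(model, X_val, y_val, min_recall=0.80):
    """Return the highest probability threshold whose recall on (X_val, y_val) is >= min_recall."""
    probs = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    # thresholds has one fewer entry than precision/recall, so drop the final point.
    ok = recall[:-1] >= min_recall
    return thresholds[ok].max() if ok.any() else 0.5

# Usage (assumed names): classify as "stroke" when the predicted probability exceeds the threshold.
# threshold = pick_threshold(fitted_model, X_val_scaled, y_val)
# y_pred = (fitted_model.predict_proba(X_test_scaled)[:, 1] >= threshold).astype(int)
```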
Model performance was evaluated using the F1 score, ROC AUC, and cross-validation to obtain a full picture of how well each model performs.
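Continuing the earlier sketches (a fitted model, the transformed arrays, and thresholded predictions y_pred are assumed), the metrics could be computed as follows; the 5-fold setup is an assumption for illustration.

```python
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Threshold-based and ranking-based metrics on the held-out test set.
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, fitted_model.predict_proba(X_test_scaled)[:, 1])

# Stratified cross-validation on the resampled training data (fold count assumed).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(fitted_model, X_train_scaled, y_train_res, scoring="roc_auc", cv=cv)

print(f"F1: {f1:.3f}  ROC AUC: {roc_auc:.3f}  CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```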
All three models achieved high recall, with Logistic Regression and Random Forest both reaching 82.8%. Logistic Regression stood out with the highest F1 Score (0.219) and the highest ROC AUC (0.851), indicating a better balance between sensitivity and false positive rate. Although Random Forest and XGBoost performed similarly in recall, their overall discriminative power was slightly lower. These results suggest that Logistic Regression may be preferable when balancing stroke detection sensitivity and overall model robustness.