Uriel Nguefack Yefou - African Institute for Mathematical Sciences (AIMS), Data Scientist
Nji Ruth Mbikang - African Institute for Mathematical Sciences (AIMS), Data Scientist
Meryem El Bouz - Harouchi Mother-Children’s Hospital, University Hospital Ibn Rochd, Faculty of Medicine and Pharmacy, University Hassan II, Pediatrician
Charles Lugaaju - Mbarara University of Science and Technology (MUST), Divine Mercy Hospital, Father Bash Foundation, Medical Doctor
Our team is made up of two women and two men from two backgrounds, medicine and data science, with two members in each field. We come from three African countries (Cameroon, Morocco, and Uganda) and reside in four different countries (Cameroon, Canada, Morocco, and Uganda). We all signed up with one common goal: to use our knowledge to advance medicine. Recognizing how versatile technology is and how data science and machine learning can contribute to the advancement of medicine, we partnered to pool our knowledge and research how to make this goal happen.
Our approach focused on maximizing sensitivity when predicting in-hospital mortality in pediatric sepsis patients, thus prioritizing the identification of positive cases and minimizing missed diagnoses. We used LightGBM, one of the most popular gradient boosting frameworks, optimized with Optuna for hyperparameter tuning and evaluated with 5-fold cross-validation to ensure the robustness of the model. We engineered clinically relevant features, such as BMI and a severity score, and selected the top 50 features based on feature importance from an initial model. The key innovation of our approach was the threshold tuning strategy, where we targeted a minimum sensitivity of 0.85 to address the severe class imbalance in the data. Overall, our approach combined feature engineering and model optimization to achieve high sensitivity and specificity.
For missing data, we imputed numerical features with their median values, because the median is robust to outliers, and categorical features with the mode to maintain consistency. We removed 32 features with high missingness (>80%), low clinical relevance, or redundancy, including spo2other_adm (93.2% missing) and nonexclbreastfedd_adm (96.9% missing). Outliers in numerical features were addressed by log-transforming skewed variables (skewness > 1) to reduce the impact of extreme values.
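The preprocessing steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the team's code: the helper names, the use of None to mark missing values, the population-skewness formula, and the choice of log(1 + x) as the log transform are all assumptions.

```python
import math
import statistics

def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def impute_numeric(values):
    """Replace None with the median of the observed values."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]

def impute_categorical(values):
    """Replace None with the most frequent observed value (the mode)."""
    mode = statistics.mode([v for v in values if v is not None])
    return [mode if v is None else v for v in values]

def skewness(values):
    """Population skewness (third standardized moment); 0.0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def log_transform_if_skewed(values, threshold=1.0):
    """Apply log(1 + x) when skewness exceeds the threshold.

    Assumes non-negative values; log(1 + x) is an assumed variant of the
    log transform, chosen here because it is defined at zero.
    """
    if skewness(values) > threshold:
        return [math.log1p(v) for v in values]
    return list(values)

def preprocess_numeric(values, max_missing=0.8):
    """Drop (return None), or impute and then log-transform, one numeric column."""
    if missing_fraction(values) > max_missing:
        return None  # column is removed, mirroring the >80% missingness rule
    return log_transform_if_skewed(impute_numeric(values))
```

A column such as `[1.0, None, 2.0, 1.5, 40.0]` would first have its gap filled with the median of the observed values and would then be log-transformed, because the single extreme value pushes its skewness above 1.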
We engineered several features to enhance predictive power: BMI, age in years, a clinical severity score, and two vital-sign interactions, hr_rr_interaction (heart rate x respiratory rate) and spo2_hr_interaction (oxygen saturation x heart rate), to capture combined physiological effects. We then applied log transformations to numerical features with skewness > 1, one-hot encoded the categorical features, and selected the final 50 features.
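As a rough illustration of the derived features, the sketch below computes them for a single patient record. Only hr_rr_interaction and spo2_hr_interaction are names taken from the text; the input keys (weight_kg, height_cm, age_months, hr, rr, spo2), their units, and the output keys bmi and age_years are hypothetical. The clinical severity score is left out because its definition is not given here.

```python
def add_engineered_features(row):
    """Return a copy of one patient record with derived features added.

    All input keys and units are assumptions made for this example.
    """
    out = dict(row)
    height_m = row["height_cm"] / 100.0
    out["bmi"] = row["weight_kg"] / height_m ** 2   # standard BMI: kg / m^2
    out["age_years"] = row["age_months"] / 12.0     # assumes age is recorded in months
    out["hr_rr_interaction"] = row["hr"] * row["rr"]       # heart rate x respiratory rate
    out["spo2_hr_interaction"] = row["spo2"] * row["hr"]   # oxygen saturation x heart rate
    return out

def one_hot(value, categories):
    """One-hot encode a single categorical value against a fixed category list."""
    return {f"is_{c}": int(value == c) for c in categories}
```

Returning a new dictionary rather than mutating the input keeps the raw record available for later checks.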
In this challenge, we used the LightGBM model, trained with 5-fold stratified cross-validation to ensure generalizability. We chose LightGBM for its speed, efficiency, ability to handle imbalanced data, and proven performance in recent healthcare research.
Initial model and feature selection: We trained an initial LightGBM model, using Optuna to optimize hyperparameters with 5-fold stratified cross-validation. We then ranked the features by importance and selected the top 50 based on their contribution to model performance.
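The selection step itself reduces to ranking features by an importance score and keeping the first 50. The sketch below assumes the scores are already available as a mapping from feature name to importance (for example, built from a fitted LightGBM classifier's feature_importances_ attribute). The function name and the example scores are hypothetical.

```python
def select_top_k(importances, k=50):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance scores, for illustration only.
example_scores = {"hr_rr_interaction": 42, "bmi": 17, "spo2_hr_interaction": 35}
top_two = select_top_k(example_scores, k=2)
```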
Final model tuning: Using the selected 50 features, we retrained the LightGBM model with Optuna to fine-tune hyperparameters, maximizing the combination of two metrics: AUPRC and F1-score. For each fold, we applied threshold tuning to achieve a minimum sensitivity of 0.85, and selected the model with the highest combined AUPRC and F1-score.
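One plausible reading of the threshold tuning step is sketched below: among the predicted probabilities seen on a validation fold, pick the largest decision threshold that still reaches the target sensitivity, which keeps as many negatives as possible below the cut. This selection rule and the function names are assumptions, and the example labels and probabilities are made up.

```python
def sensitivity(y_true, y_prob, threshold):
    """True positive rate when predicting positive for probabilities >= threshold."""
    positives = sum(1 for y in y_true if y == 1)
    if positives == 0:
        return 0.0
    true_positives = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    return true_positives / positives

def tune_threshold(y_true, y_prob, min_sensitivity=0.85):
    """Largest candidate threshold meeting the sensitivity target, or None.

    Candidates are the distinct predicted probabilities, scanned from high to low.
    None is returned only when the fold contains no positive cases.
    """
    for threshold in sorted(set(y_prob), reverse=True):
        if sensitivity(y_true, y_prob, threshold) >= min_sensitivity:
            return threshold
    return None

# Made-up validation fold: four deaths (1) and four survivors (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]
chosen = tune_threshold(y_true, y_prob)
```

With severe class imbalance, a default cut-off of 0.5 tends to miss positive cases, so lowering the threshold trades some specificity for the required sensitivity.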
GitHub Repository: https://github.com/nguefackuriel/The-2024-Pediatric-Sepsis-Challenge-Team-AIMS