Qiuyan (Crystal) Qin
Jin Peng
Sven Bambach
Steve Rust
We are part of the data science team at Nationwide Children’s Hospital. Our team focuses mainly on predictive modeling with electronic health record data and large language model applications on clinical notes.
To develop an open-source machine learning model that accurately predicts in-hospital mortality using routinely collected admission data from pediatric patients in Uganda.
We engineered features across six major domains based on a literature review and the available variables. From clinical indicators at admission, we included hypoxemia (SpO₂ < 90%), deep coma (Blantyre Coma Scale ≤ 2), and severe malnutrition (MUAC < 11.5 cm). Under demographic and socioeconomic factors, we derived a composite socioeconomic status score from variables such as maternal education, cooking fuel type, household size, living environment, and caregiver status. For comorbid conditions, binary flags captured HIV, cardiac disease, tuberculosis, and others. We extracted laboratory and physiological markers such as lactate, glucose, heart rate, and blood pressure. Symptoms such as coma, seizures, and swelling of both feet were flagged. We created perinatal and nutritional risk scores using birth history and feeding data. Finally, we generated clinically meaningful interaction terms identified in prior studies, including SpO₂ with its measurement quality and deep coma with glucose level; a sketch of such derived features is shown below.
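For illustration, here is a minimal pandas sketch of how these admission flags and an interaction term might be derived. The column names (`spo2`, `bcs`, `muac_cm`, `glucose`) are assumptions for this example, not the dataset's actual schema:

```python
import pandas as pd

def add_clinical_flags(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["hypoxemia"] = (out["spo2"] < 90).astype(int)                 # SpO2 < 90%
    out["deep_coma"] = (out["bcs"] <= 2).astype(int)                  # Blantyre Coma Scale <= 2
    out["severe_malnutrition"] = (out["muac_cm"] < 11.5).astype(int)  # MUAC < 11.5 cm
    out["coma_x_glucose"] = out["deep_coma"] * out["glucose"]         # interaction term
    return out
```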
Feature engineering produced 53 additional variables; after converting categorical variables to dummy variables, we had 282 variables in total. Some of these variables could help prediction, while others may add noise. We therefore combined a machine learning model with a variable selection strategy to keep the variables that contribute most to prediction. We fit a default XGBoost model on the training dataset and calculated the SHAP value for each feature: XGBoost handles sparse, structured data well, which suits rare-event prediction, and SHAP values quantify each feature's contribution to the model's predictions. We then kept the features whose cumulative SHAP importance accounted for 95% of the total, leaving 110 variables.
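A sketch of this cumulative-SHAP selection, assuming `X_train` and `y_train` are a pandas DataFrame and Series from the training split:

```python
import numpy as np
import shap
from xgboost import XGBClassifier

model = XGBClassifier().fit(X_train, y_train)          # default hyperparameters
shap_values = shap.TreeExplainer(model).shap_values(X_train)

importance = np.abs(shap_values).mean(axis=0)          # mean |SHAP| per feature
order = np.argsort(importance)[::-1]                   # most to least important
cum_share = np.cumsum(importance[order]) / importance.sum()
n_keep = int(np.searchsorted(cum_share, 0.95)) + 1     # smallest set covering 95%
selected = X_train.columns[order[:n_keep]]
```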
To address the class imbalance, we used borderline SMOTE to generate synthetic samples for the minority class, focusing on the hard-to-classify borderline region to improve prediction performance. However, the default borderline SMOTE generates synthetic minority samples until the classes are balanced 50/50, which could lead to substantial over-prediction of the minority class. We therefore set a target ratio of 30/70 when generating synthetic samples.
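With imbalanced-learn, the 30/70 target corresponds to a `sampling_strategy` of 3/7, i.e. the desired minority-to-majority ratio after resampling. A minimal sketch:

```python
from imblearn.over_sampling import BorderlineSMOTE

# Oversample the minority class to a 30/70 minority-to-majority ratio
smote = BorderlineSMOTE(sampling_strategy=30 / 70, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
```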
For our predictive model, we selected a stacking ensemble with XGBoost and CatBoost as base learners and a random forest classifier as the final estimator. Complementing XGBoost, CatBoost handles categorical variables natively and is less prone to overfitting on noisy data. Because each model may overfit or underfit in different ways, stacking can balance bias and variance across models.
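A sketch of this architecture with scikit-learn's StackingClassifier; the hyperparameters here are placeholders, since the actual values were tuned with Optuna as described below:

```python
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),                # placeholder hyperparameters
        ("cat", CatBoostClassifier(verbose=0)),  # tuned values came from Optuna
    ],
    final_estimator=RandomForestClassifier(),
    stack_method="predict_proba",  # base learners feed probabilities to the meta-model
    cv=5,
)
stack.fit(X_res, y_res)  # X_res, y_res: SMOTE-resampled training data from above
```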
We used 5-fold cross-validation and Optuna, a hyperparameter optimization framework, to select the hyperparameters of XGBoost and CatBoost. In each training round, we applied borderline SMOTE to the training folds only, generating synthetic minority samples before fitting the model. We used the weighted raw factor score as the optimization metric, computed on each validation fold under the constraint that sensitivity ≥ 0.8.
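A simplified Optuna objective illustrating this setup. Here `weighted_score` is a hypothetical helper standing in for the challenge metric with its sensitivity constraint, and only a few XGBoost hyperparameters are shown:

```python
import numpy as np
import optuna
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
    }
    scores = []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for tr_idx, va_idx in cv.split(X, y):
        # Oversample the minority class on the training fold only (no leakage)
        smote = BorderlineSMOTE(sampling_strategy=30 / 70, random_state=42)
        X_tr, y_tr = smote.fit_resample(X.iloc[tr_idx], y.iloc[tr_idx])
        model = XGBClassifier(**params).fit(X_tr, y_tr)
        y_prob = model.predict_proba(X.iloc[va_idx])[:, 1]
        # weighted_score: hypothetical helper implementing the challenge metric
        # under the sensitivity >= 0.8 constraint
        scores.append(weighted_score(y.iloc[va_idx], y_prob, min_sensitivity=0.8))
    return float(np.mean(scores))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```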
After selecting the hyperparameters for the stacking model, we calibrated it with isotonic regression, a non-parametric method, using 5-fold cross-validation. This step substantially reduced the model's calibration error.
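A minimal sketch using scikit-learn's CalibratedClassifierCV, continuing the variable names from the earlier sketches:

```python
from sklearn.calibration import CalibratedClassifierCV

# Wrap the tuned stacking model in isotonic calibration with 5-fold internal CV
calibrated = CalibratedClassifierCV(stack, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
```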
Finally, we trained the model on the entire training dataset and calculated the SHAP value for each variable. We created a list of SHAP cut-off values, selected the subset of variables corresponding to each cut-off, and trained a model on each subset. We applied each model to the test dataset with its selected variables, evaluated performance, and kept the variable set that performed best on the test dataset.
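A sketch of this parsimony sweep, reusing `order` and `cum_share` from the SHAP-selection sketch above; the cut-off grid and the `evaluate` helper are illustrative assumptions:

```python
import numpy as np
from xgboost import XGBClassifier

results = {}
for cutoff in [0.80, 0.85, 0.90, 0.95, 0.99]:   # illustrative cut-off grid
    n_keep = int(np.searchsorted(cum_share, cutoff)) + 1
    cols = X_train.columns[order[:n_keep]]
    # Placeholder model; in practice this would be the calibrated stacking model
    model = XGBClassifier().fit(X_train[cols], y_train)
    # evaluate: hypothetical helper returning the weighted score on the test set
    results[cutoff] = evaluate(model, X_test[cols], y_test)
best_cutoff = max(results, key=results.get)
```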
Evaluating the model also requires choosing a classification threshold. We applied 5-fold cross-validation on the training data with the selected variables: for each split, we trained the model on four folds, generated predictions on the held-out fold, and determined the threshold probability that achieved sensitivity ≥ 0.8 on that fold. We used the mean of the five threshold probabilities as the final threshold.
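A sketch of this threshold search; `final_pipeline` is a hypothetical stand-in for the full calibrated model:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def threshold_for_sensitivity(y_true, y_prob, target=0.8):
    """Largest threshold that keeps sensitivity >= target on this fold."""
    pos_probs = np.sort(np.asarray(y_prob)[np.asarray(y_true) == 1])
    # Classifying with prob >= threshold, this index guarantees that at least
    # a `target` fraction of true positives clear the threshold.
    idx = int(np.floor((1 - target) * len(pos_probs)))
    return pos_probs[idx]

thresholds = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, va_idx in cv.split(X, y):
    model = clone(final_pipeline).fit(X.iloc[tr_idx], y.iloc[tr_idx])
    y_prob = model.predict_proba(X.iloc[va_idx])[:, 1]
    thresholds.append(threshold_for_sensitivity(y.iloc[va_idx], y_prob))
threshold = float(np.mean(thresholds))  # mean of the five fold-level thresholds
```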
Our final model results on the test dataset (run completed 2025-07-24T13:52:12Z):
- AUC: 0.8216
- AUPRC: 0.2606
- Net benefit: 0.0281
- ECE: 0.0099
- F1: 0.1522
- Sensitivity: 0.8571
- Specificity: 0.5817
- Parsimony score: 0.4853
- Inference time: 0.000152 s
- Threshold used: 0.0205
- Confusion matrix: TP = 42, FP = 461, FN = 7, TN = 641
- Weighted score: 0.5936 (scaled weighted score: 0.3014)
GitHub repository: https://github.com/qiuyuanqin/2024-sepsis-data-challenge/tree/main