Sierra Strutz - University of Wisconsin-Madison Biostatistics and Medical Informatics PhD Student
Guanhua Chen - Associate Professor, Biostatistics and Medical Informatics, University of Wisconsin-Madison
Neil Munjal - Assistant Professor, Pediatrics, University of Wisconsin-Madison
Anoop Mayampurath - Assistant Professor, Biostatistics and Medical Informatics, University of Wisconsin-Madison
Briefly, our modeling pipeline comprises a nested 5-fold cross-validated gradient boosted machine model whose predictions undergo isotonic probability calibration to improve prognostic performance. We tuned model hyperparameters in Optuna using a custom weighted multi-objective score, which permitted simultaneous optimization across all challenge metrics of interest. We ultimately moved to an ensemble of 5 gradient boosted machine models to stabilize final prediction confidence and improve performance across the challenge metrics.
As gradient-boosted tree-based models implicitly handle missing data during model development, data imputation was not necessary. Guided by clinician feedback, we grouped the available features according to overarching clinical themes. We then developed a series of 5-fold cross-validated gradient boosted machine models using successive combinations of these feature groups, evaluating the relative performance of each new model by AUC. The final feature subset was the combination that maximized AUC. Binary and categorical variables in this final subset were one-hot encoded; to remove collinearity, the first encoded level of each feature was dropped, yielding n-1 indicator variables per encoded binary or categorical feature.
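The drop-first one-hot encoding described above can be sketched with pandas; the column names and values here are illustrative stand-ins, not the challenge's actual features:

```python
import pandas as pd

# Hypothetical frame with one binary and one multi-level categorical feature
df = pd.DataFrame({
    "age_months": [4, 18, 30],
    "sex": ["F", "M", "F"],
    "admission_source": ["ED", "ward", "ICU"],
})

# One-hot encode the binary/categorical columns, dropping the first level
# of each so an n-level feature yields n-1 indicator columns; this removes
# the perfect collinearity among the full set of indicators.
encoded = pd.get_dummies(df, columns=["sex", "admission_source"], drop_first=True)
```

With `drop_first=True`, the two-level `sex` feature contributes a single indicator and the three-level `admission_source` feature contributes two.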
Gradient boosted machine learning models are robust to missing data and do not require extensive data preprocessing, making them particularly well suited to prognostic modeling of medical outcomes, especially in resource-constrained settings. We therefore used the computationally efficient lightgbm Python package to implement gradient boosted machine models for this machine learning challenge. As the competition progressed, we found that the raw predictions output by our gradient boosted model benefited from further probability calibration. We tested both Platt scaling and isotonic regression from the sklearn Python package; isotonic calibration ultimately yielded the best overall model performance.
Hyperparameter optimization of our lightgbm models was performed with Optuna, a straightforward and convenient Python package that seamlessly enables optimization across multiple criteria. Our Optuna objective monitored a composite weighted score derived from the individual metrics used to evaluate final challenge submissions. We originally performed a single round of nested 5-fold cross-validation to determine the best set of hyperparameters for final development on the whole training dataset. However, the hyperparameters selected per fold varied extensively, and averaging them seemed incongruous. We therefore submitted a nested 5-fold cross-validation ensemble of 5 lightgbm models as our final machine learning challenge entry, which achieved our best and most stable overall performance.
GitHub Repository: https://github.com/SAStrutz/BadgerPediatricSepsisChallenge
Please visit the University of Wisconsin-Madison’s ICU Data Science Lab website at https://icudatascience.medicine.wisc.edu/#/ to learn more about our lab’s active projects and recent publications.