Jon Cyrus - Coalesce Health, Developer
Munaf Noorani - Coalesce Health, Ideas & Technical Expertise
Mike Poplwaski - Coalesce Health, Ideas
Coalesce Health helps healthcare organizations, health technology companies, and pharmaceutical companies solve data infrastructure and analytics problems.
We developed a lightweight, interpretable predictive model using CatBoost, chosen for its efficiency and native handling of heterogeneous clinical data. Our approach combined manual grid search for hyperparameter tuning, embedded feature selection via model-based importance scores, and fine-grained threshold optimization to meet strict task constraints on sensitivity, resource use, and performance. All modeling and evaluation workflows were containerized to approximate the challenge’s production environment.
The only true preprocessing step was the removal, prior to training, of any disallowed, hard-to-collect feature as defined by the data dictionary (specifically, those centering on antibiotic administration). Keeping preprocessing this minimal avoided computationally intensive steps, in keeping with the hardware limitations expected in the real-world use case.
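A minimal sketch of this step, assuming hypothetical column names for the antibiotic-related features (the actual names are defined in the challenge's data dictionary):

```python
import pandas as pd

# Hypothetical names for the disallowed, antibiotic-centered features; the
# real column names come from the challenge's data dictionary.
DISALLOWED_FEATURES = [
    "antibiotic_administered",
    "antibiotic_name",
    "antibiotic_start_time",
]

def drop_disallowed(df: pd.DataFrame) -> pd.DataFrame:
    """Drop disallowed columns before training, ignoring any that are absent."""
    return df.drop(columns=[c for c in DISALLOWED_FEATURES if c in df.columns])
```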
We selected CatBoost, a gradient boosting algorithm that builds an ensemble of decision trees in sequence, with each tree correcting the errors of its predecessors. CatBoost was chosen for its strong out-of-the-box performance and its robustness to overfitting, especially on smaller datasets.
CatBoost provided practical advantages for our use case, particularly its native support for categorical variables and its ability to automatically handle missing values in both numerical and categorical features without requiring explicit imputation or other preprocessing steps. Early exploratory data analysis revealed a high volume of categorical features and inconsistent feature collection across training records. These factors, combined with the challenge’s focus on performance in low-resource compute environments, led us to favor an approach that minimized the need for computationally expensive preprocessing. This approach allowed us to produce a model with strong discrimination (AUC), calibration (ECE), and practical utility (Net Benefit), while adhering to real-world constraints on inference time and model parsimony.
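As an illustration, here is a minimal fitting sketch with placeholder hyperparameter values (the tuned values came from the grid search described below). CatBoost accepts categorical columns directly via a Pool, and missing numeric values are handled natively, so no imputation step is required:

```python
from catboost import CatBoostClassifier, Pool
import pandas as pd

def fit_catboost(X: pd.DataFrame, y, iterations=500, depth=6, learning_rate=0.05):
    """Fit a CatBoost classifier; the hyperparameter defaults are placeholders."""
    # Treat non-numeric columns as categorical. CatBoost expects categorical
    # values as strings or integers, so cast them (NaN becomes the string "nan").
    cat_features = X.select_dtypes(exclude="number").columns.tolist()
    X = X.copy()
    X[cat_features] = X[cat_features].astype(str)

    model = CatBoostClassifier(
        iterations=iterations,
        depth=depth,
        learning_rate=learning_rate,
        eval_metric="AUC",
        verbose=False,
    )
    # Missing numeric values are handled natively by CatBoost, so no
    # imputation step precedes the fit.
    model.fit(Pool(X, y, cat_features=cat_features))
    return model
```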
We implemented a structured, manual grid search to optimize model performance across four key dimensions (a grid-definition sketch follows this list):
Iterations (number of boosting rounds)
Tree depth (model complexity and interaction depth)
Learning rate (gradient step size)
Feature importance threshold (used for embedded feature selection via CatBoost's built-in importance scores)
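A sketch of the grid definition, with hypothetical value ranges (the ranges actually searched are recorded in the repository linked below):

```python
from itertools import product

# Hypothetical grid values for illustration only; the searched ranges are in
# the linked repository.
GRID = {
    "iterations": [200, 500, 1000],
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
    "importance_threshold": [0.0, 0.5, 1.0],
}

def grid_combinations(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```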
For each combination of hyperparameters, we:
Performed embedded feature selection by excluding features below a chosen importance threshold, which helped reduce noise and improve generalization.
Executed post-training threshold optimization, evaluating 3,000 probability thresholds per model to identify the cutoff that maximized task-relevant metrics (specifically Net Benefit, F1, and AUPRC) while ensuring that sensitivity remained ≥ 0.8, in line with the challenge constraints; both steps are sketched in code after this list.
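A sketch of these two steps under simplifying assumptions: feature selection reads CatBoost's built-in importance scores, and the threshold sweep scores Net Benefit only (F1 and AUPRC would be computed analogously), using the standard decision-curve form with the cutoff standing in for the threshold probability:

```python
import numpy as np

def select_features(model, feature_names, importance_threshold):
    # Embedded feature selection: keep features whose CatBoost importance
    # score meets the chosen threshold; the model is then refit on the subset.
    importances = model.get_feature_importance()
    return [f for f, imp in zip(feature_names, importances)
            if imp >= importance_threshold]

def sweep_thresholds(y_true, y_prob, n_thresholds=3000, min_sensitivity=0.8):
    # Evaluate n_thresholds cutoffs strictly between 0 and 1 and return the
    # one maximizing Net Benefit subject to the sensitivity floor.
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    best = None
    for t in np.arange(1, n_thresholds + 1) / (n_thresholds + 1):
        pred = (y_prob >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        if sensitivity < min_sensitivity:
            continue  # challenge constraint: sensitivity must stay >= 0.8
        # Net Benefit = TP/n - (FP/n) * (t / (1 - t))
        net_benefit = tp / n - (fp / n) * (t / (1 - t))
        if best is None or net_benefit > best["net_benefit"]:
            best = {"threshold": t, "net_benefit": net_benefit,
                    "sensitivity": sensitivity}
    return best
```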
This manual grid search approach, combined with per-model threshold tuning, proved more interpretable and controllable than automated hyperparameter optimization tools, which may be computationally infeasible to reproduce in resource-constrained settings.
GitHub repository: https://github.com/CasualJon/PSDC-Phase2-Epiphany
Team website: https://www.coalesce.health/