Lorne Walker MD, PhD – Pediatric Infectious Diseases physician, clinical informatics
Ben Orwoll MD, MS – Pediatric Critical Care physician, clinical informatics
Matthew Hudkins MD – Pediatric Critical Care physician, clinical informatics
Umair Zaidi BS – Health and Clinical Informatics PhD student
We are a group affiliated with the Oregon Health and Science University Division of Medical Informatics and Clinical Epidemiology and the Department of Pediatrics. Our interests include clinical pediatrics, use of EHR data, clinical predictive modeling, and informatics education. We have participated in prior clinical data science competitions (https://pubmed.ncbi.nlm.nih.gov/37745933/) and enjoyed the experience.
Initially, we approached the prediction problem in a domain-agnostic way, using off-the-shelf predictive models such as logistic regression, random forests, support vector machines, and gradient boosted methods. We then held brainstorming sessions and iteratively experimented with enhancements to this baseline. Some ideas came from our clinical background (e.g., normalizing vital signs to age-based norms) and others were not specific to the problem (dimensionality reduction, feature selection, etc.).
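A minimal sketch of this kind of domain-agnostic baseline comparison is shown below. It is illustrative only: the synthetic dataset, model settings, and AUROC metric are placeholders, not the code or data from our repository.

```python
# Illustrative sketch: compare several off-the-shelf classifiers with
# cross-validated AUROC. The synthetic dataset is a placeholder for the
# challenge data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.85, 0.15], random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "svm": SVC(probability=True, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    # 5-fold cross-validated AUROC for each off-the-shelf model
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUROC {scores.mean():.3f} +/- {scores.std():.3f}")
```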
Our initial preprocessing used simple methods such as median imputation and centering/scaling. Our main data processing effort drew on domain expertise. As clinicians, we knew that some clinical data, particularly vital signs, are highly age dependent, so we normalized these variables using published age norms. We also manually identified categorical variables that could be assigned ordinal values (e.g., duration of breastfeeding). As our work progressed, we also experimented with other preprocessing tools such as model-based imputation and feature selection/dimensionality reduction.
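The sketch below illustrates the general idea of age-based vital sign normalization and manual ordinal encoding. The column names, norm values, and category labels are hypothetical placeholders, not the challenge's actual schema or the published reference tables we used.

```python
# Hypothetical preprocessing sketch: age-based vital-sign normalization and
# manual ordinal encoding. All names and values are illustrative.
import pandas as pd

# Example age-based norms (mean and SD of heart rate by age band); in
# practice these would come from published pediatric reference tables.
hr_norms = pd.DataFrame({
    "age_years": [0, 1, 5, 12],
    "hr_mean":   [140, 120, 100, 80],
    "hr_sd":     [20, 18, 15, 12],
})

def normalize_heart_rate(df: pd.DataFrame) -> pd.DataFrame:
    # Match each patient to the closest age band, then express heart rate
    # as a z-score relative to the age-specific norm.
    df = df.copy()
    idx = hr_norms["age_years"].searchsorted(df["age_years"], side="right") - 1
    idx = idx.clip(0, len(hr_norms) - 1)
    norms = hr_norms.iloc[idx].reset_index(drop=True)
    df["hr_z"] = (df["heart_rate"].values - norms["hr_mean"].values) / norms["hr_sd"].values
    return df

# Manually mapped ordinal encoding, e.g. breastfeeding duration categories.
bf_order = {"never": 0, "<6 months": 1, "6-12 months": 2, ">12 months": 3}

df = pd.DataFrame({
    "age_years": [0.5, 3, 10],
    "heart_rate": [150, 110, 90],
    "breastfeeding": ["<6 months", ">12 months", "never"],
})
df = normalize_heart_rate(df)
df["breastfeeding_ord"] = df["breastfeeding"].map(bf_order)
# Remaining missing numeric values could then be filled with column medians,
# e.g. df.fillna(df.median(numeric_only=True)).
```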
As described above, we imported vital sign norms to augment our data. We also experimented with augmenting the training data with additional simulated data, which did not yield improved results in our hands.
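As one illustration of simulated-data augmentation (not necessarily the method we used), synthetic minority-class samples can be generated with SMOTE from the third-party imbalanced-learn package; in our hands this type of augmentation did not improve results.

```python
# Illustrative example of augmenting training data with simulated samples
# using SMOTE (imbalanced-learn). Dataset is a synthetic placeholder.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X_train, y_train = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
# Oversample the minority class with synthetic interpolated examples.
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(X_train.shape, "->", X_aug.shape)
```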
Our selection of modeling tools was largely based on empirical experimentation. As mentioned above, we tried several different off-the-shelf predictive modeling tools, and those that yielded the best test-set results were singled out for further evaluation. We also experimented with ensemble combinations of these tools when their incorrectly classified samples were non-overlapping (i.e., when the models made complementary errors).
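A rough sketch of this idea follows: use out-of-fold predictions to check how much two models' errors overlap, and combine them (here with soft voting, as one possible combination strategy) when the errors are largely complementary. The models, data, and overlap threshold are placeholders.

```python
# Hypothetical sketch: measure overlap between two models' misclassified
# samples and ensemble them when errors are largely non-overlapping.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Out-of-fold predictions give an honest picture of each model's errors.
err_lr = cross_val_predict(lr, X, y, cv=5) != y
err_rf = cross_val_predict(rf, X, y, cv=5) != y

# Fraction of all errors that both models make on the same samples.
overlap = np.mean(err_lr & err_rf) / max(np.mean(err_lr | err_rf), 1e-12)
print(f"Shared fraction of errors: {overlap:.2f}")

if overlap < 0.5:  # errors are largely complementary -> try an ensemble
    ensemble = VotingClassifier([("lr", lr), ("rf", rf)], voting="soft")
    ensemble.fit(X, y)
```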
We used standard cross-validation approaches for hyperparameter tuning and evaluation of model performance. At the beginning of our modeling process we held out 500 stratified samples as an external validation set, used only when considering a model for submission. Otherwise, we separated 20% of the data as a test set, trained a model via stratified cross-validation on the remaining data, and evaluated it on the test set. Despite using inversely class-weighted models, we found optimizing the decision cutoff to be particularly challenging.
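The outline below sketches this validation scheme: a 500-sample stratified hold-out reserved for pre-submission checks, an 80/20 development split with stratified cross-validation, inverse class weighting (assumed here to mean scikit-learn's class_weight="balanced"), and a simple grid search over decision thresholds. The dataset, model, and F1 criterion are placeholders.

```python
# Illustrative sketch of the validation scheme and threshold tuning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)

# 500 stratified samples reserved as an external validation set.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=500, stratify=y, random_state=0
)

# Routine development: 80/20 split, stratified CV on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.2, stratify=y_dev, random_state=0
)
model = LogisticRegression(max_iter=1000, class_weight="balanced")  # inverse class weighting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("CV AUROC:", cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean())

# Choose a decision cutoff on the test set rather than defaulting to 0.5.
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_test, (probs >= t).astype(int)))
print("Best threshold by F1:", best)
```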
GitHub Repository: https://github.com/lwwalker/UBCdataChallenge_submission