Garik Kazanjian - Villanova University, Student
Kirtana Kunzweiler - Villanova University, Student
Venkat Margapuri - Villanova University, Associate Professor of Computing Sciences
C. Nataraj - Villanova University, Professor of Mechanical Engineering
Sanjiv D. Mehta - Children’s Hospital of Philadelphia, MD
Julie C. Fitzgerald - Children’s Hospital of Philadelphia, MD
Robert B. Lindell - Children’s Hospital of Philadelphia, MD
Daniel Balcarcel - Children’s Hospital of Philadelphia, MD
Nadir Yehya - Children’s Hospital of Philadelphia, MD
The team consists of students and faculty from Villanova University across several departments, including mechanical engineering and computing sciences. The Villanova researchers were supported by physicians from the Children’s Hospital of Philadelphia (CHOP) with backgrounds including pediatrics. Members of the NOVACHOP team had prior experience in predictive diagnostics for pediatrics and were motivated to pursue this challenge to apply that experience in an area they believed was not being adequately addressed.
Our modeling pipeline began with gaining an understanding of the data and identifying which features were believed to be most medically relevant. We generated heat maps, which allowed us to detect patterns and correlations in the dataset.
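A correlation heat map of the kind described is just an image of the pairwise correlation matrix. The sketch below uses a small synthetic stand-in for the dataset (the column names are hypothetical, not the challenge's actual features) to show the computation behind it:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the challenge data: a few numeric vitals.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "heart_rate": rng.normal(100, 15, 200),
    "resp_rate": rng.normal(30, 5, 200),
    "sbp": rng.normal(95, 12, 200),
})
# Make one column deliberately correlated with another, so the
# correlation structure a heat map would surface is visible here.
df["map"] = 0.5 * df["sbp"] + rng.normal(0, 3, 200)

corr = df.corr()
print(corr.round(2))
# A heat map is then an image of this matrix, e.g. with matplotlib:
#   plt.imshow(corr, cmap="coolwarm"); plt.colorbar()
```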
The original dataset contained nearly 140 features. We calculated the importance of each remaining feature, i.e., its contribution to the predictive accuracy of an initial benchmark model. The top 37 features were retained; features ranked below these had importance scores too low to improve the models’ performance.
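Importance-based selection of this kind can be sketched with scikit-learn's built-in feature importances. This is a minimal illustration on synthetic data, not the study's actual benchmark model; the feature count and cutoff here are placeholders (the study kept the top 37 of ~140):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in: 60 features, only some of which are informative.
X, y = make_classification(n_samples=400, n_features=60, n_informative=10,
                           random_state=0)

# Fit a benchmark model and read off its impurity-based importances.
bench = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance and keep the top k.
k = 15
top_idx = np.argsort(bench.feature_importances_)[::-1][:k]
X_selected = X[:, top_idx]
print(X_selected.shape)  # → (400, 15)
```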
After conducting these feature engineering tasks, we implemented an ensemble model combining a Random Forest classifier and an XGBoost classifier (XGBClassifier).
We handled missing data by removing features with more than 30% missing values. In addition, we mitigated outliers by combining related features (e.g., blood pressure measurements).
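The missingness filter reduces to a one-line column mask in pandas. A minimal sketch, with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy frame: one column is 20% missing, the other 80% missing.
df = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0, 5.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Drop any feature whose fraction of missing values exceeds 30%.
threshold = 0.30
keep = df.columns[df.isna().mean() <= threshold]
df_reduced = df[keep]
print(list(df_reduced.columns))  # → ['mostly_present']
```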
Feature engineering and transformation methods included numerical transformation with the StandardScaler and categorical transformation with the OneHotEncoder. We additionally performed numerical imputation with the KNNImputer and categorical imputation with the SimpleImputer using the ‘most_frequent’ strategy.
The modeling framework was an ensemble consisting of a Random Forest classifier and an XGBoost classifier. The ensemble's focus was to optimize sensitivity (true positive rate) while maintaining acceptable specificity (true negative rate).
Techniques used included iterating over class weights for the Random Forest classifier to balance sensitivity and specificity, threshold tuning via ROC curve analysis, and ensemble weighting in which the Random Forest classifier carried 70% of the weight and the XGBoost classifier 30%. Additionally, SMOTE was used for class balancing with a 0.5 sampling strategy.
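The weighted soft vote and ROC-based threshold tuning described above can be sketched as follows. This is an illustration under several assumptions: GradientBoostingClassifier stands in for XGBoost to keep the example self-contained; the class weights and data are invented; SMOTE (from imbalanced-learn, sampling_strategy=0.5) would be applied to the training split before fitting but is only noted in a comment; and Youden's J is one common threshold criterion, not necessarily the study's exact one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~15% positives).
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# In the pipeline described above, SMOTE(sampling_strategy=0.5) from
# imbalanced-learn would resample (X_tr, y_tr) here before fitting.

# Class-weighted Random Forest (weights here are illustrative only).
rf = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 3},
                            random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# 70/30 weighted soft vote over predicted positive-class probabilities.
p = 0.7 * rf.predict_proba(X_te)[:, 1] + 0.3 * gb.predict_proba(X_te)[:, 1]

# Threshold tuning from the ROC curve: maximize Youden's J = TPR - FPR.
fpr, tpr, thresholds = roc_curve(y_te, p)
best = thresholds[np.argmax(tpr - fpr)]
y_pred = (p >= best).astype(int)
print(float(best), y_pred.mean())
```

Lowering the decision threshold below the default 0.5 is what trades specificity for the higher sensitivity the ensemble was designed around.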