Barathi L - Post Baccalaureate Research Fellow, Indian Institute of Technology, Madras
Maziya Ibrahim - Senior Project Scientist, Indian Institute of Technology, Madras
Hari Priya Narahari - Post Baccalaureate Research Fellow, Indian Institute of Technology, Madras
We are a team of data enthusiasts with diverse academic backgrounds. Maziya holds a Ph.D. in Computational Biology, and Barathi has a Master’s degree in Molecular Virology. Their strong foundation in biology was instrumental in interpreting the biological aspects of the project and analysing clinical data. Hari Priya, who holds a Master’s degree in AI and Data Science, brought her expertise in machine learning and data science to design and develop the overall analytical pipeline. All of us are associated with the Computational Systems Biology Lab at the Wadhwani School of Data Science & AI, IIT Madras, India.
Our approach aimed to strike the right balance between the biological relevance of the features and their correlation with the target variable. At every step, we ensured a clear understanding of which features were used and the rationale behind their selection. We deliberately chose traditional machine learning algorithms because established techniques exist to make their results explainable and interpretable.
We addressed missing values using both Iterative Imputer and Simple Imputer:
For numerical features, we used IterativeImputer from sklearn, which models each feature with missing values as a function of the other features and typically yields more accurate imputations than a single-value fill such as the mean or median.
For categorical features, we used SimpleImputer with the strategy set to 'most_frequent' to fill in missing values with the most common category.
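The two-imputer setup described above can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the column names and toy values are hypothetical, and the imputers are left at their defaults apart from the 'most_frequent' strategy named above.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical toy data; the real feature names live in the challenge dataset.
df = pd.DataFrame({
    "age_months": [12.0, 30.0, np.nan, 48.0, 20.0],
    "heart_rate": [140.0, np.nan, 120.0, 110.0, 135.0],
    "sex": pd.Series(["F", "M", np.nan, "M", "M"], dtype=object),
})
numeric_cols = ["age_months", "heart_rate"]
categorical_cols = ["sex"]

imputed = df.copy()

# Numerical features: each column is regressed on the other numerical columns.
imputed[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])

# Categorical features: fill with the most common category in the column.
imputed[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[categorical_cols]
)

print(imputed)
```

In a real workflow the imputers would be fitted on the training split only and then applied to held-out data, to avoid leaking information across splits.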
A comprehensive set of features was selected based on clinical relevance and domain expertise. These included:
Demographics (e.g., age, sex, height, weight, MUAC)
Vitals (e.g., heart rate, respiratory rate, blood pressure, temperature, SpO₂)
Neurological and physical exam findings (e.g., BCS, capillary refill, respiratory distress)
Vaccination status, comorbidities, and symptom checklist items (multi-hot encoded)
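As an illustration of the multi-hot encoding mentioned for checklist items, the sketch below uses MultiLabelBinarizer from sklearn. The symptom names are hypothetical, and the team may have built the indicator columns differently; the point is only that each patient's list of items becomes one 0/1 column per item.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical symptom checklists, one list per patient (an empty list means no items ticked).
symptoms = [
    ["fever", "cough"],
    ["vomiting"],
    [],
    ["fever", "vomiting", "diarrhoea"],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(symptoms)  # shape: (n_patients, n_distinct_symptoms)

print(list(mlb.classes_))  # column order (sorted alphabetically)
print(encoded)
```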
We implemented an ensemble learning approach using a soft voting classifier that combines the predictive strengths of the following machine learning algorithms:
Random Forest Classifier (RandomForestClassifier)
Histogram-based Gradient Boosting (HistGradientBoostingClassifier)
Extreme Gradient Boosting (XGBClassifier from the xgboost library)
Ensemble Learning (VotingClassifier):
We opted for a soft-voting ensemble to leverage the strengths of multiple diverse classifiers.
Combining different models typically improves generalization performance and reduces model variance.
Soft voting (averaging predicted probabilities) retains each model's confidence and tends to give smoother probability estimates than hard majority voting, which is especially helpful on imbalanced datasets.
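To make the difference between soft and hard voting concrete, here is a small numerical sketch. The probabilities are made up for illustration and do not come from the project's models; the 0.5 threshold is likewise an assumption.

```python
import numpy as np

# Hypothetical positive-class probabilities: rows are patients, columns are three models.
proba = np.array([
    [0.90, 0.40, 0.45],
    [0.10, 0.20, 0.30],
    [0.55, 0.60, 0.45],
    [0.48, 0.49, 0.95],
])

# Soft voting: average the probabilities, then threshold the mean.
soft_score = proba.mean(axis=1)
soft_label = (soft_score >= 0.5).astype(int)

# Hard voting: threshold each model first, then take the majority (at least 2 of 3).
hard_label = ((proba >= 0.5).sum(axis=1) >= 2).astype(int)

print(soft_label)  # [1 0 1 1]
print(hard_label)  # [0 0 1 0]
```

For the first and last patients, a single very confident model outweighs two borderline ones under soft voting, whereas hard voting discards that confidence.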
Random Forest:
Known for its robustness and ability to handle non-linear relationships and missing data.
It is less prone to overfitting than a single decision tree and performs well even with a large number of input features.
Histogram-based Gradient Boosting:
Efficient and scalable for large datasets.
Provides built-in handling of missing values and captures complex interactions between features.
Often outperforms traditional gradient boosting in terms of speed and memory efficiency.
XGBoost:
A state-of-the-art gradient boosting algorithm optimized for speed and performance.
Hyperparameters:
Random Forest Classifier:
n_estimators = 123 – number of trees
max_leaf_nodes = 456 – limits tree size for generalization
random_state = 789 – ensures reproducibility
XGBoost Classifier:
scale_pos_weight = 10 – addresses class imbalance by up-weighting the positive class, so that false negatives are penalized more heavily
eval_metric = 'logloss' – aligns with the probabilistic output requirement
use_label_encoder = False – disables deprecated behavior for cleaner output
HistGradientBoostingClassifier:
Used with default parameters and random_state = 42 for consistency
GitHub Repository: https://github.com/Haripriya-Narahari/PediatricSepsis2024