Se Won Oh - Electronics and Telecommunications Research Institute, Data Scientist
Hyuntae Jeong - Electronics and Telecommunications Research Institute, Data Scientist
Seungeun Chung - Electronics and Telecommunications Research Institute, Data Scientist
Jeong Mook Lim - Electronics and Telecommunications Research Institute, Data Scientist
Kyoung Ju Noh - Electronics and Telecommunications Research Institute, Data Scientist
Our team is conducting a research project called HELP (Human Experience Learning and Prediction). We study human behavior in daily life by collecting sensor data and analyzing contextual information from everyday activities. Our research focuses on how people behave and adapt in real-world environments over time. Building on our experience with machine learning-based prediction studies, we joined this competition to contribute to identifying critical factors involved in sepsis prediction.
We believed the key was to develop a machine learning model from a carefully curated training dataset, which meant selecting highly relevant variables from the large feature set and handling missing values appropriately.
From the full set of variables, we initially selected 27 that showed strong associations with sepsis mortality. Next, all non-numeric variables were converted into categorical variables, with "Unknown" or "Other" treated as distinct categories. To handle missing values, we imputed numeric variables using the mean value within each mortality group and categorical variables using the most frequent category.
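A minimal sketch of this preprocessing step in pandas is shown below. The column names, the target label, and the variable lists are hypothetical placeholders for illustration, not the actual 27 selected variables.

```python
import pandas as pd

# Placeholder names; the real selection comprised 27 variables.
NUMERIC_COLS = ["age", "heart_rate", "lactate"]       # hypothetical numeric variables
CATEGORICAL_COLS = ["admission_type", "icu_unit"]     # hypothetical categorical variables
TARGET = "mortality"                                  # hypothetical label column

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Treat non-numeric variables as categoricals, keeping "Unknown"/"Other"
    # as distinct categories rather than dropping or merging them.
    for col in CATEGORICAL_COLS:
        df[col] = df[col].astype("category")

    # Impute numeric variables with the mean within each mortality group.
    for col in NUMERIC_COLS:
        df[col] = df[col].fillna(df.groupby(TARGET)[col].transform("mean"))

    # Impute categorical variables with the most frequent category.
    for col in CATEGORICAL_COLS:
        mode = df[col].mode(dropna=True)
        if not mode.empty:
            df[col] = df[col].fillna(mode.iloc[0])

    return df
```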
We selected the random forest algorithm for training our model, as it is known for relatively fast computation and strong performance. It also allows us to interpret how individual variables influence the prediction outcome.
We set the number of estimators (n_estimators) to 300 and the maximum depth (max_depth) to 10, while keeping all other hyperparameters at their default values.
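The sketch below illustrates this configuration with scikit-learn's RandomForestClassifier, including the importance-based interpretation mentioned above. The one-hot encoding step, the random seed, and the function interface are assumptions for illustration rather than our exact pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_model(df: pd.DataFrame, target: str = "mortality") -> RandomForestClassifier:
    # One-hot encode categoricals so the forest receives numeric features only
    # (encoding choice is an assumption, not stated in the text).
    X = pd.get_dummies(df.drop(columns=[target]))
    y = df[target]

    clf = RandomForestClassifier(
        n_estimators=300,  # number of estimators, as stated in the text
        max_depth=10,      # maximum depth, as stated in the text
        random_state=0,    # seed is an assumption; not specified in the text
    )
    clf.fit(X, y)

    # Impurity-based importances give a rough view of how individual
    # variables influence the prediction outcome.
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))
    return clf
```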
Regarding the final phase leaderboard, we initially understood that only three submission attempts were allowed. As a result, we were very cautious to avoid flagged or errored submissions and conservatively set the prediction threshold to ensure an on-time submission. However, after reviewing other teams' results, we learned that flagged or errored submissions were not counted toward the official submission limit. In retrospect, had we known this earlier, we could have taken a more aggressive approach (e.g., experimenting with a wider range of thresholds and hyperparameters), which might have led to better performance.