Max Rerisi
Class of 2026
Semifinalist, New York-Metro Junior Science and Humanities Symposium ’25
Finalist, Terra NYC STEM Fair ’25
Type 2 diabetes is prevalent worldwide, and any effort to lessen its impact is critical to billions. Machine learning (ML) is a technique gaining traction in several fields, including healthcare. ML can help us identify diabetes earlier and more cheaply by using data to find variables associated with the disease on an extensive scale.
My study used ML to predict the incidence of type 2 diabetes and then experimented with maximizing the models’ performance while minimizing dataset size. My models attempted to classify whether or not a patient has diabetes. A smaller dataset that still yields strong performance is valuable because it would make implementing these methods in real healthcare systems cheaper, easier, and faster.
To my knowledge and my mentor’s, the methods I used for dataset reduction have not been utilized in this field for this specific application. My method of dataset reduction involved categorizing all the variables into groups and then retraining the models using different subsets of these groups. Comparing the results of all these models allowed me to see which categories were unhelpful in training a model to predict diabetes.
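The group-subset procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the group names beyond those mentioned in the text, every column name, and the `evaluate` callable (which would train a classifier on the given columns and return a score) are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical variable groups and column names, for illustration only.
GROUPS: Dict[str, List[str]] = {
    "medical_history": ["family_history", "hypertension"],
    "social_determinants": ["income_bracket", "education_level"],
    "medical_tests": ["bmi", "cholesterol"],
    "lifestyle": ["physical_activity", "smoking_status"],
}


def group_subsets(
    groups: Dict[str, List[str]],
) -> Iterator[Tuple[Tuple[str, ...], List[str], float]]:
    """Yield (group names, selected columns, fraction of all columns kept)
    for every non-empty combination of groups."""
    total = sum(len(cols) for cols in groups.values())
    names = list(groups)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            columns = [col for name in combo for col in groups[name]]
            yield combo, columns, len(columns) / total


def rank_subsets(
    groups: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[Tuple[str, ...], float, float]]:
    """Score each subset with evaluate(columns) and sort best-first.
    Ties in score are broken in favor of the subset that keeps less data."""
    results = [
        (combo, evaluate(columns), fraction)
        for combo, columns, fraction in group_subsets(groups)
    ]
    return sorted(results, key=lambda r: (-r[1], r[2]))


if __name__ == "__main__":
    # List every candidate subset and how much of the variable set it keeps.
    for combo, columns, fraction in group_subsets(GROUPS):
        print(f"{fraction:.0%} of variables: {', '.join(combo)}")
```

With four groups there are 15 non-empty subsets to retrain on. In practice, `evaluate` would fit a model such as XGBoost on only the selected columns and return a held-out accuracy, so that `rank_subsets` surfaces small subsets that score close to the full dataset.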
ML has been applied extensively in medical studies but far less often in actual practice. Other studies have used the same models I did, including eXtreme Gradient Boosting (XGBoost). That model achieved an accuracy of 93.3% when trained on all of the data. The best model I trained overall used only 65% of the dataset and achieved a similarly strong score. That data subset contained variables about medical history, social determinants of health, and medical tests. Reducing the dataset by around a third is significant because less data makes it easier to move experimental methods into real-world use in modern healthcare systems.