Askari Amir

DATA 606- Capstone Project

Spring 2021

Murat Guner

Predicting Diabetic Patients by Leveraging Data

Part 1- Introduction

Of the 34 million people in the U.S. living with diabetes, one in five is unaware of their condition. More than a third of U.S. adults have prediabetes, and eighty-four percent of them do not know it. In the past two decades, the number of adults diagnosed with diabetes has more than doubled. The cost of diabetes to the U.S. population is substantial and rises each year. Medical costs for people with diabetes are twice as high as for people without it. CDC data attributes over $327 billion to medical costs, lost work, and lost wages for people diagnosed with diabetes. With the help of machine learning, data sets of diabetic patients can be leveraged to predict whether a person is diabetic or at risk of becoming diabetic.

Data Set

The data set was provided by the UCI Machine Learning Repository (Center for Machine Learning and Intelligent Systems). Diabetes patient records were obtained from an automatic electronic recording device and from paper records. The diabetes files consist of four fields per record. Fields are separated by tabs and records are separated by newlines. The value in each record is interpreted according to the record's code. Each code denotes an insulin dose, a glucose measurement, an activity, or a symptom, with twenty possible codes in total. The diabetes.csv file used has nine columns and 769 rows of data.
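The record layout described above (four tab-separated fields per line) can be read with Python's standard csv module. The following is a minimal sketch, not the project's actual loading code. The field names date, time, code, and value, the Record type, the parse_records helper, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class Record(NamedTuple):
    # Field names are assumed for illustration; the text only says
    # that each record has four tab-separated fields.
    date: str
    time: str
    code: int
    value: str


def parse_records(stream):
    """Parse newline-delimited records with four tab-separated fields."""
    records = []
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        date, time, code, value = row
        records.append(Record(date, time, int(code), value))
    return records


# Made-up sample rows, for illustration only (not real patient data).
sample = (
    "04-21-1991\t9:09\t58\t100\n"
    "04-21-1991\t9:09\t33\t9\n"
    "04-21-1991\t17:08\t62\t119\n"
)

records = parse_records(io.StringIO(sample))
code_counts = Counter(r.code for r in records)  # how often each code occurs
print(code_counts)
```

Counting records per code in this way gives a quick check of which of the twenty code types actually occur in a given file.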

Methodology

To predict a diabetes diagnosis, I will use a logistic regression classifier. After loading and cleaning the data, I will select features and split the data into training and test sets so that the model's performance can be measured on records it has not seen. The classifier will be fit on the training set and then used to make predictions on the test set. To evaluate the model, I will use a confusion matrix, from which the classification rate (accuracy) is computed.
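The workflow above (split, fit, predict, evaluate) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the data is synthetic, all function names are hypothetical, and the model is fit with simple batch gradient descent. In practice a library such as scikit-learn would normally supply these pieces.

```python
import math
import random


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices, then return X_train, y_train, X_test, y_test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx) + pick(test_idx)


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
            for xi in X]


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def accuracy(matrix):
    """Classification rate: correct predictions over all predictions."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


# Synthetic stand-in data (an assumption): two features, label 1 when
# their sum is positive. Real features would come from diabetes.csv.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [int(x1 + x2 > 0) for x1, x2 in X]

X_train, y_train, X_test, y_test = train_test_split(X, y)
w, b = fit_logistic(X_train, y_train)
matrix = confusion_matrix(y_test, predict(X_test, w, b))
print(matrix, accuracy(matrix))
```

The diagonal of the confusion matrix holds the correct predictions, so the classification rate is the diagonal sum divided by the total number of test records.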

Part 1 Presentation

GitHub

Link

References

https://archive.ics.uci.edu/ml/datasets/Diabetes

https://www.cdc.gov/diabetes/basics/quick-facts.html

https://covid19.census.gov/