Data Management

To determine which variables correlate with diabetes rates, we found scatter plots to be the most effective technique: they are quick to analyze and simple to produce. In total, eight variables were analyzed. If a variable's scatter plot showed a correlation, the variable was incorporated into DANN (our artificial neural network). The graphs can be found in the Appendix. The diabetes data, stored as a .csv file, are the 2017 diabetes rates for all thirty-three New Mexico counties from the CDC [16]. The demographic data compared against the diabetes rates are from the U.S. Census Bureau [14]; the Census figures range from 2017 to 2019.

Using Python to Analyze .csv Files:

We found .csv to be a useful format for our data collection. Python has a csv module that makes it easy to extract data from a .csv file, and the format is widely used by sources such as the U.S. Census Bureau and the CDC [2]. We decided that Python was the best tool for creating the scatter plots because a program can handle the data for all thirty-three New Mexico counties, a task that would be tedious with any method requiring manual entry. Eight programs were written to make the scatter plots, one for each variable. Each program works similarly; the only parameters that differ are the names of the .csv files, the name of the row being extracted, and cosmetic details of the scatter plot such as the title. Matplotlib is a library that allows for comprehensive data visualization in Python. Once the programs have read the data from the .csv files, they use Matplotlib to produce a scatter plot. After each scatter plot was made, we judged manually whether a correlation was present.

Scatter Plot Overview:

Given the ambiguity of diabetes rates, we found several interesting correlations in the scatter plots. The variables that we concluded correlate with diabetes rates in New Mexico are persons without health insurance, percent of the population that is American Indian and Alaska Native alone, mean commute time, education (percent of the population with a high school education or higher), and poverty rates.

Note that the diabetes rate data includes both type 1 and type 2 diabetes. As stated in “Diabetes”, roughly 95 percent of diabetes diagnoses are type 2. Because data containing only the rate of type 2 diabetes was unavailable, we had to compare the variables to the overall diabetes rate. One place where this could have interfered with our results is the education level scatter plot: education would not affect type 1 diabetes, since no one knows how to prevent it.
Since ethnicity has been found to play a role in both types of diabetes, the plot comparing diabetes to percent of American Indian and Alaska Native may be more viable than plots for variables that do not also contribute to type 1. The plot for American Indian and Alaska Native shows almost no correlation. It was surprising that the correlation for that variable was not more significant, since ethnicity plays a major role in both types of diabetes: American Indian/Alaska Native adults are nearly three times more likely than Caucasian adults to be diagnosed with diabetes [15]. Because of the known connection between that ethnicity and diabetes, and because numerous outliers support a positive correlation, we concluded that the county data on percent of American Indian and Alaska Native would be used in our neural network.

Persons without health insurance under age 65 is an interesting demographic. Since diabetes is often the result of how someone lives their life, it was necessary to have a way of measuring how accessible medical services are to each county's population. We also wanted a potential measure of how much individuals in the area value their health, whether deliberately or not. Unexpectedly, there is an apparent positive correlation between the insurance data and diabetes rates.
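
The plotting workflow described under "Using Python to Analyze .csv Files" can be illustrated with a minimal sketch. This is not our original code. The file layout (one row per county, one column per variable), the column names, the function names, and the sample values are all assumptions made for illustration. The actual CDC and Census files may be organized differently; for example, a variable may be stored as a named row rather than a column.

```python
import csv
import io


def read_column(csv_file, county_field, value_field):
    """Return {county: value} for one variable read from an open CSV file."""
    values = {}
    for row in csv.DictReader(csv_file):
        raw = row[value_field].strip().rstrip("%")
        if raw:  # skip counties with no reported value
            values[row[county_field].strip()] = float(raw)
    return values


def paired_points(x_by_county, y_by_county):
    """Pair two variables by county, keeping only counties present in both."""
    shared = sorted(set(x_by_county) & set(y_by_county))
    return [x_by_county[c] for c in shared], [y_by_county[c] for c in shared]


def save_scatter(xs, ys, title, xlabel, ylabel, out_path):
    """Draw one scatter plot with Matplotlib and write it to out_path."""
    # Imported here so the CSV helpers above run without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.savefig(out_path)
    plt.close(fig)


# Made-up placeholder rows, NOT real CDC or Census figures.
demo_diabetes = io.StringIO(
    "County,Diabetes Rate\nCounty A,8.0\nCounty B,11.5\nCounty C,\n"
)
demo_poverty = io.StringIO(
    "County,Poverty Rate\nCounty A,14.0%\nCounty B,21.0%\nCounty C,18.0%\n"
)

diabetes = read_column(demo_diabetes, "County", "Diabetes Rate")
poverty = read_column(demo_poverty, "County", "Poverty Rate")
xs, ys = paired_points(poverty, diabetes)
```

With real files, the in-memory samples would be replaced by `open(path, newline="")`, and a call such as `save_scatter(xs, ys, "Poverty Rate vs. Diabetes Rate", "Poverty rate (%)", "Diabetes rate (%)", "poverty_vs_diabetes.png")` would write the plot to disk. Pairing values by county name, rather than by row order, guards against the two sources listing counties in different orders or omitting a county.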