Computational Method

About Mellitus:

Our goal when creating Mellitus was to build a simple, easy-to-interpret model that is as close to a real community as possible, and in many respects this has been achieved. The name comes from diabetes mellitus; mellitus is Latin for "honey-sweet." The model is built in Netlogo 6.1.1 and has gone through seven versions, each gradually improving on the last. Figure 4 shows the difference between the first and final versions of Mellitus: there is a significant improvement in the usability, functionality, visual appearance, and overall accuracy of the model. Version 1.0 was based on little real-world data, whereas version 5.1 has proven quite useful in predicting a plausible diabetes rate, given appropriate variable values. A central element of Mellitus is DANN (Diabetes Artificial Neural Network), which is imported with the py extension built into Netlogo.

We needed a way to properly measure the number of people active in a community, whether they are travelers or residents. For example, a town like Tucumcari, NM, has a relatively low number of residents but a high number of travelers due to its proximity to Interstate 40; this skews the community's demographic figures and can be an obstacle when analyzing data. To compare demographics appropriately, we created a metric called TPR (total population rating): the area's annual average daily traffic (AADT), taken from the New Mexico Department of Transportation, plus the community's population (a small worked example appears at the end of this section).

Mellitus Variables:

Factors that Mellitus can analyze include education, commute time, poverty, percent of the population without health insurance, and percent American Indian and Alaska Native. These are all treated as independent variables, each controlled by the user, and the result is a change in DANN's diabetes-rate prediction. The factors fed into the neural network were determined by producing and analyzing scatter plots; see "Scatter Plot Overview" for more information on this process. Broadband internet access and median household income did not have a viable correlation with diabetes rates in New Mexico.

Custom Turtles:

Custom turtle shapes were created to replace the default shapes in Netlogo. They symbolize the demographics of the town; for example, when the TPR is raised, more residential facilities appear in the interface.
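Because TPR is a simple sum, it can be computed in one line of Python. The following is a minimal sketch; the town figures are hypothetical, not actual NMDOT data.

    # TPR (total population rating) = AADT + resident population
    def total_population_rating(aadt, population):
        return aadt + population

    # Hypothetical small town on a busy interstate
    print(total_population_rating(aadt=9000, population=1500))  # 10500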

Mellitus(5.0):

Mellitus 5.0 is the most polished version of Mellitus that doesn't utilize DANN, which means the py extension is not required. Instead of DANN, Mellitus 5.0 uses linear equations derived from the lines of best fit of the scatter plots we created. Scatter plots also determined the visual dynamics of the model. Since our project aims for Mellitus to represent a real-world town as accurately as possible, we made scatter plots comparing TPR with variables like the number of fast-food restaurants or the number of healthy-food restaurants, using a method similar to the one discussed in "Data Management". Lines of best fit were then drawn for the scatter plots, and their linear equations were implemented. For example, the scatter plot comparing the number of fast-food restaurants in a town with its TPR showed that one fast-food restaurant tends to serve about 5,000 TPR, so in Mellitus 5.0 a fast-food restaurant appears in the town for every 5,000 TPR. The lines of best fit were converted from slope-intercept form by solving for x so that their outputs could be used in Mellitus. Mellitus 5.0 also uses this method for the independent variables that contribute to the diabetes rate, like education, poverty, and commute time. Its diabetes-rate prediction is not accurate to the real world, but that was not our intention for this version; it outputs a reasonable number that changes appropriately as the user adjusts the variables. We created it to have a version of Mellitus that doesn't require the Python extension, which greatly improves the accessibility of our research: Netlogo Web, or a computer without Python, becomes an option for anyone who would otherwise be unable to use Mellitus. That is also why 5.0 has essentially the same interface as 5.1 and can analyze many of the same variables.
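The conversion from a line of best fit to on-screen behavior can be sketched in a few lines of Python. The 5,000-TPR-per-restaurant slope comes from the scatter plot described above; treating the intercept as zero is an assumption of this sketch, not a figure from our data.

    # Solve y = m*x + b for a feature count given a TPR value;
    # here one fast-food restaurant serves roughly 5,000 TPR (b assumed 0)
    def fast_food_count(tpr, tpr_per_restaurant=5000):
        return int(tpr // tpr_per_restaurant)

    print(fast_food_count(23000))  # 4 fast-food restaurants appear in town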

Mellitus(5.1):

The central result of our project is Mellitus 5.1. It has many similarities to Mellitus 5.0, but its main upgrade is support for DANN 2.0. The scatter-plot-derived linear equations still control the visual functionality of the model, like how many restaurants are open in the town, but DANN lets Mellitus compute far more powerfully, with a non-linear approach to the diabetes rate. Refer to "DANN(2.0)- Sigmoid" for an in-depth explanation of DANN 2.0; this section focuses on its execution in Mellitus 5.1. The program takes the value of each variable from the sliders and puts them into a format DANN can read using the py:set command, which assigns a Netlogo value to a variable in a Python script. DANN 2.0 is imported into Mellitus with the py:run command, which allows the required Python script to run inside Netlogo. The diabetes-rate prediction is made with py:runresult, which converts Python values to Netlogo values; in Mellitus, it stores DANN's float output in a Netlogo variable (a Python-side sketch of this handoff appears at the end of this section).

DANN runs continuously in Mellitus, and its output is displayed in the interface on a monitor and a plot. When the sliders are adjusted, the data is easily analyzed by the user, and correlations can be read off the plot. Refer to the "Evaluation and Testing" section of "DANN(2.0)- Sigmoid" for an overview of DANN 2.0's performance. Mellitus proves to be an effective medium for exhibiting DANN 2.0's output, and it shows far more, and more detailed, information than would be possible in native Python. The plot tool in Netlogo allows correlations between the slider-controlled independent variables and changes in the virtual diabetes rate to be made clearly. Since DANN is constantly running in Mellitus, the virtual diabetes rate is constantly adapting, and the factors link visually before the user's eyes. We could not have achieved this using any other method.

In many situations, the correlations made by Mellitus are striking and even somewhat contradict the correlations suggested by our scatter plots. The scatter plot with poverty data suggests a positive correlation between county poverty rates and diabetes rates, while Mellitus generally suggests a negative correlation between the two; remember that Mellitus does not function linearly and does not always behave this way. Previously, we mentioned that it was surprising to see such an insubstantial correlation between the percent of American Indian and Alaska Native county data and diabetes rates. This variable is widely known to affect one's odds of being diagnosed with diabetes, so we were not surprised when Mellitus showed that it had a positive correlation with diabetes rates; however, even in Mellitus, it does not appear to affect the diabetes rate dramatically. The other variables mostly behave in Mellitus in the ways the scatter plots suggested. For instance, Mellitus exhibits a strong positive correlation between education levels and diabetes rates.
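From the Python side, the handoff can be pictured as a script whose global variables Netlogo fills in with py:set and whose prediction function Netlogo evaluates with py:runresult. The sketch below is illustrative only: the variable names, the stand-in weights, and the single-neuron forward pass are assumptions, not Mellitus's actual code.

    import numpy as np

    # Filled in from the Netlogo sliders, e.g. py:set "education" education
    # (these names are hypothetical)
    education = poverty = uninsured = native = commute = 0.0

    # Stand-in weights; in Mellitus these come from DANN 2.0's training
    weights = np.array([0.1, 0.2, 0.3, 0.2, 0.2])

    def predict_rate():
        # Evaluated from Netlogo with py:runresult "predict_rate()"
        x = np.array([education, poverty, uninsured, native, commute])
        return float(1 / (1 + np.exp(-np.dot(x, weights))))  # sigmoid output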

Neural Network

We had originally planned to use a Chi-Squared Test to analyze the variables and their effect on diabetes rates; however, a few problems arose. A Chi-Squared Test is routinely used to test whether categorical variables are independent, and that approach won't work here because we aren't analyzing categorical data; we're analyzing a numerical disease rate. A Neural Network is a useful alternative because diabetes rates are determined by the lifestyle of a population, a wide range of factors to study, and it could potentially find enough regularity in the data to make a viable prediction from it. A Neural Network could also be implemented in Mellitus fairly easily with the Netlogo py extension. Our Neural Network is named DANN (Diabetes Artificial Neural Network).

About Neural Networks

Artificial Neural Networks are loosely based on the biological networks that compose a human brain. They can solve problems less like a computer and more like a human, without the need for explicit commands [4]; instead, they learn by analyzing example data. This process is especially useful when evaluating abstract figures like the factors that contribute to diabetes rates. A Neural Network is made up of three kinds of layers: an input layer, some number of hidden layers, and an output layer [18]. If there is more than one hidden layer, it is considered a Deep Neural Network. As values are transmitted through the network, each neuron applies weights to its inputs, adds a bias, and passes the result into the activation function. It is standard for weights and biases to be set to random values initially so that the program can adjust them later.

Parameters that cannot be adjusted by the program itself are called hyperparameters; these include the learning rate, activation function, loss function, and more. The learning rate controls how quickly the network adapts to a pattern in the data [1]. If the learning rate is too high, training can overshoot, and the output may not be as accurate as anticipated; if it is too low, training can become extremely slow or stall before reaching a good result. The activation function is very important. Some of the most popular activation functions are Sigmoid, RELU, Tanh, and Softmax; Softmax is usually used in the output layer, often when classifying objects. A neuron calculates a weighted sum of its inputs and adds a bias, and the activation function transforms the result; activation functions are the reason Neural Networks are capable of abstract thinking [6]. They are so important that there is a dedicated section in this paper for them. The loss function is used to optimize the parameters in a Neural Network [10]: it compares the "correct" value from the training data with the value the network predicted so that backpropagation can occur and the parameters can be updated.

More hyperparameters that are important to this project include the optimizer, the number of epochs, and the batch size. Optimizers update the weight parameters to minimize the loss function; the loss function advises the optimizer on how accurate the network currently is [17]. An epoch is one pass of the neural network through all of the training data. Note that there is a significant difference between an iteration and an epoch: an iteration is one pass (one forward pass plus one backward pass) over a single batch, while an epoch means the entire training set has been viewed by the program once. Batch size is the number of data samples propagated through the network in one iteration.

Activation Function:

The choice of activation function is vital to receiving desirable results, and the right choice depends on the problem being solved. For example, the Softmax Function is practical in situations where classification is required. We determined that the RELU (rectified linear unit) function is best for our project because it can work with numbers higher than one, unlike the Sigmoid Function. The Sigmoid Function is useful for percentage data, which can be converted to decimals between zero and one; it is what DANN 2.0 uses. The Sigmoid Function can also be built manually in Python, without the need for external APIs like Keras, which means execution in Netlogo is possible. Due to a bug in the Netlogo py extension, DANN 3.1, which uses the RELU function, cannot be applied to Mellitus. See "DANN(3.1)- RELU" for information on DANN 3.1.
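For reference, the two activation functions most relevant to this project can each be written in one line of NumPy. This is a generic sketch, not code taken from DANN.

    import numpy as np

    def sigmoid(x):
        # Squashes any real input into the range (0, 1)
        return 1 / (1 + np.exp(-x))

    def relu(x):
        # Passes positive values through unchanged, so outputs can exceed 1
        return np.maximum(0, x)

    print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # [0.119 0.5 0.881]
    print(relu(np.array([-2.0, 0.0, 2.0])))     # [0. 0. 2.]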
The linear activation function is simple yet essential to our project. It is essentially a linear regression model (y = ax). It would not be beneficial in the hidden layers because it is not capable of advanced calculations, but it is useful in the output layer.

About Tensorflow and Keras:

Tensorflow is an open-source software library developed by Google for machine learning, including neural networks. Keras is also an open-source library whose primary purpose is creating neural networks; it works on top of Tensorflow [5]. Both APIs were used in Python. Keras has been implemented in DANN 3.1 so that the RELU activation function can be used. Using Keras is relatively simple: we made a Sequential model, which is a linear stack of layers. The Keras methods used in our project include add, compile, fit, and predict. Add inserts a new layer into the sequential model, compile configures the model for training, fit trains the model for a number of epochs, and predict returns predictions from the model.
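A minimal example shows how these four methods fit together. This is a generic sketch with made-up data, not DANN 3.1 itself (its actual configuration is described in "DANN(3.1)- RELU").

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    x = np.random.random((20, 5))  # made-up inputs
    y = np.random.random((20, 1))  # made-up targets

    model = Sequential()
    model.add(Dense(5, activation='relu', input_dim=5))  # add: hidden layer
    model.add(Dense(1))                                  # add: output layer
    model.compile(optimizer='sgd', loss='mse')           # compile: configure
    model.fit(x, y, epochs=10, batch_size=5, verbose=0)  # fit: train
    print(model.predict(x[:1]))                          # predict: estimate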

DANN(2.0) - Sigmoid:

DANN 2.0 is the first stable edition of DANN and was built in Python 3.7. It uses Sigmoid as the activation function, a decision that brings both advantages and disadvantages. The Sigmoid Function has a domain of all real numbers and a range of 0 to 1; as long as the input values are proportional, it monotonically maps them into that range so that the network can use them. This is practical for comparing percentage variables, and all of our significant variables were percentages or had a percentage alternative. For example, percent of persons in poverty, percent of persons with a high school education or higher, and percent of persons without health insurance are all variables that contribute notably to diabetes.

No external machine learning modules were used in this version. However, NumPy, a package for scientific computing with Python [7], was used: the Sigmoid Function was implemented manually by writing the Sigmoid equation into the program with NumPy. One of the functions used is exp, which calculates the exponential of the elements in an array [9]; it is based on Euler's number, the irrational number that is the base of the natural logarithm. Another is array, which is useful for handling data and placing values into arrays. Dot is also a very important tool, the equivalent of matrix multiplication. The data was entered as arrays.

Twenty counties were used as training data. There are thirty-three counties in New Mexico, and we determined that twenty training counties were most desirable so that we could retain a large and diverse set of testing data. DANN 2.0 analyzes education, poverty, persons without health insurance, percent American Indian and Alaska Native, and commute time divided by 100. The commute time data is divided by 100 for the Sigmoid Function; since the commute time dataset used to train the network is also divided by 100, this should not interfere with the accuracy of the prediction. Implementing the percent of American Indian and Alaska Native into DANN should also help eliminate the potential interference of type 1 diabetes. As stated before, there is nothing known to prevent type 1 diabetes, and about 95 percent of diabetes cases are type 2; since ethnicity contributes to both types of diabetes, and the data used to train the network covers the rate of both types, the American Indian and Alaska Native variable may contribute extensively toward DANN's output.

DANN 2.0 is especially minimalistic: one neuron is modeled, with five input connections and one output connection. When operating in a native Python IDE, we set the iterations to 300,000, which is more than enough for the program to come to a reliable conclusion; we have observed a noticeable decline in output accuracy when lowering the iterations. In Mellitus, the iterations are set to 100,000 to minimize lag. This version of DANN plays an important role in our project because it is the only version that is functional in Mellitus (see "Netlogo Execution Shell Bug" for more information about why other versions of DANN cannot be used in Mellitus).

Evaluation and Testing:

The performance of DANN 2.0 varies. In some situations, it predicts the diabetes rate almost exactly, while in others it falls short of this goal. Note that Table 1 shows the results from one trial, inputting the appropriate data for each county.
We chose Los Alamos and De Baca counties to test with because each has relatively unique statistics; for instance, Los Alamos County has a distinctively high education level, while De Baca County has the highest diabetes rate in New Mexico. The most prominent figure in Table 1 is the predicted diabetes rate for Los Alamos County: DANN was essentially able to predict it with perfect accuracy, though this may be an anomaly of the single trial. We used seed() from the Python random module so that the random number output would remain the same. As stated in "About Neural Networks", weights and biases are initially set to random values and adjusted later by the network itself; using a seed means the program begins with the same weights and biases every time it runs and therefore outputs a similar value every run, given comparable data. If the seed were removed, we could see more or less accurate predictions, but they would certainly be far less consistent.
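A stripped-down sketch of a seeded, single-neuron sigmoid network of the kind described in this section follows. The toy arrays and the NumPy-based seeding are stand-ins for illustration; they are not DANN 2.0's actual training data or code.

    import numpy as np

    np.random.seed(1)  # fixed seed: the same starting weights on every run

    # Toy stand-ins: 20 counties x 5 variables, and their diabetes rates,
    # all expressed as decimals between 0 and 1 for the Sigmoid Function
    inputs = np.random.random((20, 5))
    targets = np.random.random((20, 1))

    weights = 2 * np.random.random((5, 1)) - 1  # one neuron, five connections

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    for _ in range(100000):  # Mellitus lowers 300,000 iterations to 100,000
        output = sigmoid(np.dot(inputs, weights))  # forward pass
        error = targets - output                   # compare to training data
        weights += np.dot(inputs.T, error * output * (1 - output))  # adjust

    print(sigmoid(np.dot(inputs[:1], weights)))  # prediction for one county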

DANN(3.1) - RELU:

DANN 3.1 is the most advanced and complex version of DANN. It utilizes the RELU activation function, allowing it to output whole numbers rather than values confined between zero and one. Keras is implemented, introducing access to many hyperparameters that are not available in DANN 2.0 and allowing us to fine-tune the neural network to produce the best results. In contrast to DANN 2.0, this version is also a Deep Neural Network, with two hidden layers of five nodes each. It works with the same dataset as 2.0, with twenty counties and five variables, but it keeps the values in their original format instead of requiring them to be divided by 100.

The RELU function is used in all layers except the output layer, which makes use of a classic linear activation function. After observing the performance of the network in several circumstances, we found that it was most accurate when the output layer lacks the RELU function; this means the main calculations take place in the input and hidden layers, and the output layer reflects the result of those calculations with a linear regression. The program runs for 45 epochs with a batch size of 5; with 20 training examples, it takes four iterations to complete one epoch, so the program performs 180 iterations in total. This number is far less than in 2.0, but 3.1 completes its iterations more effectively due to many improvements, like the additional adjustable hyperparameters. Its loss function is a commonly used regression loss function, mean-squared error (MSE), also known as mean-squared deviation (MSD). The program's optimizer is the Stochastic Gradient Descent (SGD) optimizer, which brings support for new hyperparameters like momentum and learning rate. Since there is no exact method for determining where the hyperparameters should ideally be set, various configurations were experimented with, and we eventually concluded that the most desirable results are achieved with the learning rate set to 0.001 and the momentum set to 0.7. DANN 3.1 is a Sequential Keras model (a sketch of this configuration appears at the end of this section); refer to "About Tensorflow and Keras" for explanations of the Keras methods that were used to construct the network. Like 2.0, it was built in Python 3.7.

Evaluation and Testing:

DANN 3.1 has proven to be considerably more accurate than DANN 2.0. In this version, the program tends to output a different prediction for each county every run, because it adjusts the parameters differently each time; unlike DANN 2.0, no seed is used to fix the random values between runs. Although this version contains many more hyperparameters, the weights and biases still start from random values. DANN 3.1 was more accurate than DANN 2.0 in most of the counties in which it was tested, likely due to the improvements discussed above, like an additional hidden layer with more neurons and the RELU activation function. Table 2 shows the behavior of DANN 3.1 from one trial.
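Based on the configuration described in this section, DANN 3.1's model definition would look roughly like the sketch below. The placeholder arrays stand in for the real county data, and the exact code is an assumption; only the layer sizes, activations, loss, optimizer settings, epochs, and batch size are taken from the text above.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD

    # Placeholders for the 20 training counties and their 5 variables
    x_train = np.random.random((20, 5))
    y_train = np.random.random((20, 1))

    model = Sequential()
    model.add(Dense(5, activation='relu', input_dim=5))  # hidden layer 1
    model.add(Dense(5, activation='relu'))               # hidden layer 2
    model.add(Dense(1, activation='linear'))             # linear output layer

    # MSE loss; SGD with learning rate 0.001 and momentum 0.7
    model.compile(loss='mse', optimizer=SGD(learning_rate=0.001, momentum=0.7))

    # 45 epochs at batch size 5 over 20 examples = 180 iterations in total
    model.fit(x_train, y_train, epochs=45, batch_size=5, verbose=0)

    print(model.predict(x_train[:1]))  # predicted diabetes rate for a county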