Weeks 4, 5, and 6

The goal of these weeks was to finalize our models and improve their accuracy as much as we could. Previously, we were attempting to predict whether a CS student would graduate within 6 years. To increase the size of our dataset and make the study more generalizable to other fields, we broadened our scope to STEM students, which vastly increased the size of the dataset. However, we noticed that many students had not taken the SAT, and my script for extracting STEM students was also removing any student with a blank column (i.e., any student without an SAT score). As a result, many students were being left out of the dataset. To compensate for individuals who did not take the SAT (but did take the ACT), we decided to fill in their SAT scores with the median SAT score. Although this step did not improve our overall accuracy, it validated the results we were getting.
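
As a rough sketch of that imputation step (the DataFrame and the sat_total column below are made-up placeholders, not our actual schema), the fill looks something like this in pandas:

import pandas as pd

# Toy data standing in for the real student records (sat_total is a placeholder name)
students = pd.DataFrame({"sat_total": [1310, None, 1180, None, 1250]})

# Fill missing SAT scores with the median of the observed scores
median_sat = students["sat_total"].median()
students["sat_total"] = students["sat_total"].fillna(median_sat)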

Another major step I took during these weeks was to massively simplify my scripts. When you first work on a project, you focus on getting an output, and you fail to realize that getting an output is only 50% of the battle. Confirming that your outputs are right, and then fixing possible mistakes, is much harder. I realized my scripts were causing massive confusion and chaos. Whenever we analyzed a model or created a visualization, we were always initially suspicious that somewhere we had forgotten to account for a year or cohort. I was responsible for creating the scripts that produced these datasets, so most questions about errors came to me. Not only was the process of checking my scripts incredibly time-consuming and frustrating (I spent 70% of my time in these weeks checking them), it was highly error-prone. My code is very simple (mostly just boolean conditions), but because there were so many conditions, the scripts were hard to fix. We were running many experiments (CS students vs. STEM, humanities, etc.) and we didn't have a robust system for quickly obtaining unique datasets.

Professor Synder suggested splitting up my conditions into separate named pieces.

So instead of having a jumbled-up boolean condition like this:

cohort_2009_grad_2011 = degrees_total[(degrees_total.GRADTERM.isin(year_2011) & (degrees_total.degmaj1.isin(deg_majors) | degrees_total.degmaj2.isin(deg_majors))) & (degrees_total.cohort.isin(year_2009))]

I could split it up into various conditions, like this:

cond_2011 = degrees_total.GRADTERM.isin(year_2011)

deg_major1 = degrees_total.degmaj1.isin(deg_majors)

deg_major2 = degrees_total.degmaj2.isin(deg_majors)

cohort_2009 = degrees_total.cohort.isin(year_2009)

cohort_2009_grad_2011 = degrees_total[(deg_major1 | deg_major2) & cohort_2009 & cond_2011]

To those of you with some programming experience, this may seem very obvious. In my case, I felt that writing the bare minimum number of lines of code would be better. In an attempt to reduce the amount of code I wrote, I gave up simplicity and understanding. Fixing this massively reduced the amount of time I spent checking and debugging my scripts.

Another achievement during these weeks was using preprocessing techniques to improve the accuracy of our models. I tried standardization, MinMaxScaler, and normalization. The only preprocessing technique that improved our accuracy was the KBinsDiscretizer, which partitions continuous variables into intervals. Continuous variables such as age are often better understood when put into intervals. For example, the relationship between age and mortality due to the flu is not linear: children and the elderly are the most likely to die from conditions like the flu, yet this pattern would not be captured without discretization.
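
Here is a minimal sketch of how the discretizer can be used (the age values and bin count below are purely illustrative, not our actual features):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative continuous feature (e.g. age); our real features differ
ages = np.array([[18], [22], [35], [47], [63], [71], [80]])

# Partition the continuous values into 3 ordinal intervals based on quantiles
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = discretizer.fit_transform(ages)
print(age_bins.ravel())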

My advisor (Huzefa Rangwala) suggested that we try as many models as we could.

As a result, I tried KNN, SVM, Decision Trees, Random Forest, Neural Networks, AdaBoost, Naive Bayes, Stochastic Gradient Descent, Logistic Regression, Gradient Boosting, and Deep Learning. Overall, Logistic Regression had the highest accuracy of any model. Logistic Regression is specifically built for binary classification problems.
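
A rough sketch of how a comparison like this can be set up with scikit-learn (the synthetic X and y below stand in for our real student data, and only a few of the models are shown):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data; the real X and y come from our student dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))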

We also ran GridSearch on all of our models except Deep Learning (DL was a bit complicated to run), yet this did not improve our accuracy. Any accuracy we gained was probably due to variance. For example, running GridSearch with KNN gave me an optimal value of 29 for K, yet when I ran the model on a train_test_split, the accuracy was at times slightly higher (+1) or slightly lower (-1) than with the default K=5 value. This trend continued for algorithms like SVM (where we tried the linear kernel and altered the C value). Tuning the hyperparameters did not result in a meaningful increase in accuracy for any of the models.
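
For reference, a KNN grid search along these lines looks roughly like the following (the parameter grid and synthetic data are illustrative, not the exact setup we used):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data; the real X and y come from our student dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Search over odd values of K with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 50, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))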

For our accuracy metric, we decided to use F1-macro. Since we wanted to penalize misclassified dropouts (the minority class) more heavily, F1-macro was appropriate. According to a paper I read, "Macro- and micro-averaged evaluation measures": "Because the F1 measure ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate small classes in microaveraging". In our charts, we include AUC and accuracy values as well.
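
In scikit-learn, switching to F1-macro is a small change; here is a minimal sketch with placeholder labels:

from sklearn.metrics import f1_score

# Placeholder labels: 1 = graduated, 0 = dropped out (the minority class)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# Macro averaging gives the dropout class equal weight to the majority class
print(f1_score(y_true, y_pred, average="macro"))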

Unrelated, but we had a major scare during our data analysis process. From our dataset, we were getting the impression that as GPA declined, the acceptance rate also declined. This was a very confusing trend.

Year    Dropouts    Acceptance Rate
2009    51          64
2010    64          49.8
2011    46          49.3

Year    Average SAT    Average GPA
2009    1231           4.531
2010    1202           3.636
2011    1215           3.683

Using Python, I decided to find the min and max values of the GPA column. From there, I discovered that someone (not from the REU site) had incorrectly entered 80.0 for a student's GPA. As a result, the average GPA for 2009 was inflated by this one outlier. Learning from this experience, I decided to remove extreme outliers from our dataset: if a column had a value whose z-score exceeded 3 in absolute value (roughly the most extreme 0.3% of a normally distributed column), the value was removed from the dataset.
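
A minimal sketch of that filter (the GPA values below are made up, with one deliberately bad 80.0 entry; one way to apply the rule is to drop the offending rows):

import pandas as pd
from scipy import stats

# Toy GPA column with one mis-entered value (80.0); the real data has many columns
df = pd.DataFrame({"gpa": [3.2, 3.8, 2.9, 3.5, 3.1, 3.9, 3.4, 3.6,
                           2.8, 3.3, 3.7, 3.0, 3.45, 2.95, 3.55, 80.0]})

# Keep only rows whose GPA lies within 3 standard deviations of the mean
df = df[abs(stats.zscore(df["gpa"])) <= 3]
print(df["gpa"].max())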

We're at the publication stage right now, so we're busy writing the discussion and conclusion. We have already finished the literature review, introduction, and methods sections. We are also focused on finishing the project and the poster.

Next week, you'll hear about how we analyze the fairness of the models we ran. Preliminary results indicate our model might be more accurate in predicting one demographic than another. More analysis is needed to confirm these results. Stay tuned!