Week 3

Our third week was the culmination of our data manipulation and modeling efforts. More specifically, I created scripts that produce .csv files of students who graduated from CS/ACS, with their demographics, student data, and admissions data merged together. Although this sounds relatively simple, there were quite a few formatting issues at first. For example, we were tasked with merging the datasets so that each student had exactly one ID. However, many students had multiple entries under the same ID, and we were unsure whether or not to drop the duplicates. We ultimately decided that removing duplicates was the best approach, accepting some data loss as a result. We also had cases where the demographic IDs and the IDs of students who dropped out did not match, which caused additional issues.
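For anyone curious what the merging step looks like, here is a minimal sketch of the idea. The file names, the "student_id" key, and the inner-join choice are all assumptions for illustration, not the exact columns in our data.

```python
import pandas as pd

# Hypothetical file and column names -- the real scripts use the project's
# own demographics, student, and admissions extracts.
demographics = pd.read_csv("demographics.csv")
students = pd.read_csv("student_data.csv")
admissions = pd.read_csv("admissions.csv")

# Keep one row per student ID before merging; duplicate IDs would otherwise
# multiply rows in the merged result (this is where some data loss happens).
demographics = demographics.drop_duplicates(subset="student_id")
students = students.drop_duplicates(subset="student_id")
admissions = admissions.drop_duplicates(subset="student_id")

# Inner joins also drop students whose IDs do not appear in all three files,
# which is where mismatched IDs cause trouble.
merged = (
    students
    .merge(demographics, on="student_id", how="inner")
    .merge(admissions, on="student_id", how="inner")
)
merged.to_csv("cs_acs_graduates_merged.csv", index=False)
```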

However, these issues were eventually resolved. When we finally merged all of our data together and built our model, we noticed an extremely high accuracy with Random Forest (90%). However, some colleagues reminded me that I was using cross-validation instead of splitting the data into train and test sets. I realized I had used the wrong tool to evaluate accuracy: cross-validation is meant for deciding which features (and model optimizations) to use in the final model--it is not meant to serve as the final accuracy metric. As a result, I switched to a train/test split (from the sklearn library) and used the F1 score to measure performance. Our score was still high (86%), though not as high as before.
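A rough sketch of that evaluation setup is below, assuming `merged` is the combined dataset, the features are already numeric, and `graduated` is a hypothetical 0/1 label for graduated vs. dropped out.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Assumed names: `merged` from the merging step, `graduated` as the label.
X = merged.drop(columns=["graduated", "student_id"])
y = merged["graduated"]

# Hold out a test set so the model is scored on data it never saw during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# F1 balances precision and recall, which is useful if the classes turn out uneven.
print("F1 score:", f1_score(y_test, clf.predict(X_test)))
```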

We were still skeptical of the high accuracy--everyone suspected an imbalance in the dataset (since the majority of students at most universities graduate). However, the graduate-to-dropout ratio was still relatively balanced. I then decided to investigate which features the Random Forest model considered most important (sklearn exposes this through the feature_importances_ attribute of a fitted model). Consistently, we noticed that the term in which the student began, i.e. the cohort year, was the most important variable. The easy solution would have been to remove the "cohort year" feature. However, with the help of a colleague, I identified the real issue in the dataset: I had included students who graduated from 2009-2017. As a result, the 2009 cohort had many graduates, while the 2013 cohort had very few. From this, we noticed an imbalance in how the cohorts were being judged--I was not being impartial to all the cohorts. I am still working on fixing the scripts to produce appropriate numbers of graduates and dropouts per cohort. I am also discussing the possibility of including all STEM majors (instead of only CS/ACS) in case the final dataset is not balanced.
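The importance check itself is only a couple of lines. This sketch reuses the `clf` and `X_train` names from the evaluation sketch above; note that feature_importances_ is an attribute of the fitted model rather than a function you call.

```python
import pandas as pd

# Rank the features the fitted Random Forest relied on most.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```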

This week, I also read a research paper related to my current project, "Why the High Attrition Rate for Computer Science Students: Some Thoughts and Observations". Although the article does not feature machine learning models, it did provide some helpful hints on why computer science students drop out. First, the paper laments how many entering computer science students lack basic problem solving skills. It also mentions poor project management skills among students, the choice of language and "objects early" vs. "objects late" curricula, graduate student teachers, poor advising, poorly designed labs, and a lack of practice. In my opinion, all of these factors are true to a certain degree. From my own experience, I can definitely attest to having poor problem solving skills when entering my university's computer science program, and to sometimes having poor advisors; many times, the advisors did not know what to suggest to a struggling student. Both of these factors could be major contributors to overall student retention. Using the student admissions data, I will attempt to see whether the SAT Math score (a measure of how quickly one can solve problems) is predictive of CS retention. I'll also try to see how "advising" could be measured numerically.
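One simple first pass at the SAT Math question might look like the sketch below: a point-biserial correlation between the graduation label and the score. The `sat_math` column name is an assumption, and this is only an exploratory check, not the final analysis.

```python
from scipy.stats import pointbiserialr

# Assumed names: `merged` from the merging step, `graduated` as a 0/1 label,
# and a hypothetical `sat_math` column from the admissions data.
subset = merged.dropna(subset=["sat_math", "graduated"])

# Point-biserial correlation between a binary outcome and a continuous score.
r, p = pointbiserialr(subset["graduated"], subset["sat_math"])
print(f"point-biserial r = {r:.3f}, p = {p:.3g}")
```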