George Mason REU Blog
Week 6
I spent the past several weeks continuing to develop our research questions and testing differences in retention between different groups of students based on various factors. Our data is extensive enough that I have been able to explore several different angles, but I have also run into some challenges. One of the biggest has been the loss in accuracy that comes with a higher-dimensional dataset; as much as we would like to include every single variable we have, introducing too many actually reduces the ability of ML models to find the patterns that really matter. That fact has directed our research questions and helped us home in on more specific traits in the student body.
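To give a flavor of what that looks like in practice, here is a tiny, made-up illustration (not our actual data or pipeline) of how piling uninformative columns onto a dataset can drag down a model's cross-validated accuracy:

```python
# Hypothetical sketch: synthetic data standing in for our retention dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 500 "students" described by 10 genuinely informative features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)

# The same samples with 200 pure-noise columns tacked on.
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

model = RandomForestClassifier(n_estimators=200, random_state=0)
print("informative features only:", cross_val_score(model, X, y, cv=5).mean())
print("with noise columns added: ", cross_val_score(model, X_noisy, y, cv=5).mean())
```

The exact numbers will vary from run to run, but the noisy version tends to score lower, which is the same pressure that has been pushing us toward a smaller, more deliberate set of variables.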
On the recreation side, a bunch of us went to Six Flags on Wednesday and had an absolute blast. On the side are pictures of us terrified out of our minds on the ride, and a snazzy selfie Noah took.
One thing I was not aware of coming into this project was that the scope of one's research can be constantly changing as one makes new findings. For instance, we started out our project looking at all George Mason students and trying to find patterns in that larger group. As we progressed, we examined various sub-populations, including CS majors, graduated students, and students from different academic backgrounds. In the process, our project has taken some turns we didn't initially expect. Above all, this has taught me the value of being flexible and constantly curious: we can't know a pattern is there unless we take the time to look for it.
Week 3
This week, I focused my readings on previous research and statistical analyses of course evaluations. One article I found pertinent to our work is “Course Evaluations: What are Social Work Students Telling Us about Teaching Effectiveness?” by Jirovec et al., which centers on course evaluations of social work classes at Wayne State University. The paper seeks to uncover the teaching qualities that relate most closely to overall teacher ratings. The evaluation questions they examine include organization, classroom rapport, respect for students, and promptness of feedback.
The statistical analyses reveal that the traits most clearly related to overall teaching quality were an instructor’s organizational skills, rapport, and grading skills. From this correlation, the authors conclude that their findings “support the premise that teaching effectiveness is closely related to concrete, identifiable teaching skills” (235). This was where I had the greatest qualms: they describe these skills as “concrete” and “identifiable” while relying on the very evaluations they are investigating as evidence that these concrete skills exist. I would argue that in order to actually demonstrate a correlation between the perceived quality of an instructor and these skills, they needed to collect objective metrics that did not rely on student opinion, such as grading turnaround times and the presence of a syllabus and/or calendar.
Another qualm I had was with the authors’ characterization of certain courses as popular or unpopular. In their introduction, they describe “methods” and elective classes as popular, and “research,” “policy,” and required classes as unpopular. Their only source for this characterization is a footnote noting that the distinction is “drawn from a student perspective.” This wording is both ambiguous and unconvincing: is that a single student’s perspective or a broader student consensus, and if the latter, where is the evidence that those perceptions exist? The fact that they continue to use this delineation throughout to draw conclusions, and even to prescribe certain methodologies to administrators, is concerning.
All that aside, however, the paper provided many insights into how Alexandra and I might go about our own statistical analysis of course evaluations. The authors use t-tests, p-values, zero-order correlations, and multi-step regression to characterize associations between variables. While their methodology is not outlined in much detail, these broad tools give us a jumping-off point when it comes to proving and disproving correlations ourselves. And, given that one of our main questions right now is how to demonstrate a lack of correlation, their statistically rigorous illustration of a lack of correlation between certain variables will be a good resource to refer back to.
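To make those tools concrete for myself, here is a toy example, with entirely made-up numbers rather than the paper's (or our) data, of the two tests we will probably lean on most: a zero-order (Pearson) correlation and a two-sample t-test.

```python
# Illustrative only: placeholder evaluation scores, not real data.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "organization": [4.2, 3.8, 4.5, 2.9, 3.6, 4.8, 3.1, 4.0],
    "overall":      [4.0, 3.5, 4.7, 3.0, 3.4, 4.9, 2.8, 4.1],
    "required":     [1, 0, 1, 1, 0, 0, 1, 0],  # 1 = required course
})

# Zero-order correlation between one teaching trait and the overall rating.
r, p = stats.pearsonr(df["organization"], df["overall"])
print(f"Pearson r = {r:.2f}, p = {p:.3f}")

# Two-sample t-test: do required and elective courses differ in overall rating?
required = df.loc[df["required"] == 1, "overall"]
elective = df.loc[df["required"] == 0, "overall"]
t, p = stats.ttest_ind(required, elective, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```

A large p-value on a test like the second one is a first step toward arguing for a lack of correlation, though we will need to think about effect sizes and sample size before claiming that outright.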
Beyond reading more literature, we also continued to refine our problem description. The updated paper outline, with our problem description included, is to the right!
Week 2
This past Wednesday, we spent the day in a hackathon, trying to predict wine quality based on various metrics such as pH and citric acid. We were given roughly five hours to develop predictions based on various machine learning models, statistical techniques, and neural networks. The highest accuracy any of our groups scored was 77%; our group achieved an accuracy of 76%. In other words, the data was not easy to work with. As someone who had, in the past, mostly followed tutorials or used the default parameters for machine learning models, I found this experience really enlightening. We were given hours to tinker with these models endlessly, with instant feedback on how our tinkering affected the final results. Here are a few of the lessons I learned.
- In predictive models, there is room for both quantitative and qualitative analysis. One of my group members, Gennie, was a statistical wizard: she was able to run the data through various R functions that could pinpoint extraneous variables and outlier instances in the wine. With her human eye for interpreting these variables, we were able to run a more nuanced model with a fuller understanding of why we were making the choices we made, instead of simply sticking the entire thing through a classifier and waiting to see what came out on the other end. Of course, there was also room for the bigger, quantitative side of things, as a lot of what we did was running classifiers dozens of times to see how they performed in aggregate.
- Explaining things to another person really helps one's understanding. One of our group members had less experience with machine learning, so explaining to her why and how we were making the choices we did enabled us to crystallize that for ourselves, as well. For instance, I was fiddling with the parameters of the Random Forest classifier and realized I knew very little about why those parameters mattered and how changing them affected the classifier. What I realized is that the biggest test of understanding is not whether you can type something up and run it, but whether you can explain what that model, parameter, or function does to another person. And that made me want to question and re-learn a lot of things I thought I knew about machine learning.
- Sometimes, good enough is good enough. As I mentioned before, no one was able to get above a 77% accuracy. That felt, to me, incredibly low, but the more I adjusted things, the more I realized that was a semi-hard cap. At a certain point, our tinkering stopped creating improvements; in fact, sometimes it seemed to decrease the accuracy of our predictions. Part of this was likely due to time and knowledge constraints, which prohibited us from adjusting the model to the fullest extent possible. Some of the problem, however, came from the data just being wonky. The human element in determining wine quality made the data too subjective to truly get an accurate handle on.
- Data visualization helps, even on the code-writer's end. In explaining the redundancy of certain variables, Gennie showed us a grid containing each variable's correlation with each other variable; strong colors showed where two variables were highly correlated. This grid really helped us understand why certain columns were unnecessary to include in our final calculations. In another instance, I graphed several parameters across a spectrum of values. This enabled us to understand how those parameters, such as min_samples_split and max_depth, affected over- and under-fitting of the model. For instance, a larger max_depth leads to more overfitting because the tree is making more decisions, pigeonholing itself to better fit the training set. This is visible on the graph because at large max_depths, the model's performance on the training data diverges widely from its performance on test sets. So while a larger max_depth allows for more specific decision-making, looking at it visually shows us exactly where it stops being useful (a rough sketch of both visualizations is below).
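Here is roughly what those two visualizations look like in code, using sklearn's built-in wine dataset as a stand-in for the hackathon's wine-quality data (our actual notebook, features, and parameter ranges were somewhat different):

```python
# Sketch of the correlation grid and the max_depth sweep described above.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 1. Correlation grid: strongly colored cells flag redundant columns.
plt.figure(figsize=(8, 6))
plt.imshow(X.corr(), cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.xticks(range(len(X.columns)), X.columns, rotation=90)
plt.yticks(range(len(X.columns)), X.columns)
plt.title("Feature correlation grid")
plt.tight_layout()

# 2. Sweep max_depth and compare train vs. test accuracy to spot overfitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
depths = range(1, 21)
train_acc, test_acc = [], []
for d in depths:
    clf = RandomForestClassifier(max_depth=d, n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))

plt.figure()
plt.plot(depths, train_acc, label="train accuracy")
plt.plot(depths, test_acc, label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.title("Where extra depth stops helping")
plt.show()
```

The point where the two curves pull apart is exactly the divergence between training and test performance I described above.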
At the end of the day, our Random Forest model yielded 76% accuracy. What was most interesting, however, was how different other groups' approaches were: while each group earned similar accuracy rates, I can't think of a single method that was used twice. Some used grid search, others used regression, and others, like us, experimented with neural networks, as well. That, to me, is the most interesting part: the fact that we can all arrive at similar results from extremely different paths.
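For reference, a bare-bones version of the grid-search route some groups took might look something like this (the parameter grid here is purely illustrative, not what anyone actually ran, and the built-in wine dataset again stands in for the real one):

```python
# Hypothetical grid search over a few Random Forest hyperparameters.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [5, 10, None],
        "min_samples_split": [2, 5, 10],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

It automates the same tinker-and-check loop we were doing by hand, which may be part of why such different workflows all landed in the same accuracy range.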
We also began work on our research papers this week; our outline is to the left!
Week 1
Welcome to my blog! Here, I will be documenting my progress through the NSF Research Experience for Undergraduates at George Mason University. Prior to arriving, I wrote an initial piece on what research means to me; you can find it on the right. In addition, my research partner Alexandra Plukis and I have begun our project exploration phase, in which we start to narrow down a research topic. Our research proposal is also visible on the right. I'm excited for the journey ahead, and can't wait to keep you posted!