This week, with the results section of our research write-up quickly becoming a priority, I had a good opportunity to consider the broader meaning of metrics as they pertain to data mining. The Python library scikit-learn (sklearn) is an incredibly powerful tool: it makes machine learning experiments as simple to execute as typing five lines of code (after data pre-processing, at least!). However, the truly meaningful results from machine learning don't come from just running models--they come from the metrics by which we test them.
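To give a sense of what "five lines" looks like, here is a minimal sketch of that kind of sklearn experiment. The feature matrix X and label vector y are placeholders standing in for pre-processed student data, not our actual pipeline:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: pre-processed feature matrix, y: pass/fail labels (placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

It really is that short to get a number out--which is exactly why the choice of that number matters so much.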
This week, Anlan and I explored the various metrics for testing our predictions on student success. Although the default performance metric for models is accuracy, the more meaningful metrics are really much more specific to the project at hand. For us, F1 is a better metric than accuracy because the classes we're predicting are imbalanced: many more students pass a course than fail it, and many more students graduate college with a non-CS degree than with a CS degree. If we relied on accuracy alone, a model could simply predict that everyone passes, or that no one graduates in CS, and still score well by exploiting this imbalance. This issue concerns me when reading papers in the data mining field--often papers do not mention the "baseline" score, or they don't establish the balance of the classes the models are predicting.
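As a concrete illustration of why accuracy can mislead here, consider a dummy model that always predicts the majority class. The numbers below are made up for the example, but the pattern is the one that worries me:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 1 = pass (90%), 0 = fail (10%)
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((100, 1))  # features don't matter for this baseline

# Baseline that always predicts the most frequent class ("everyone passes")
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))         # 0.90 -- looks impressive
print(f1_score(y, y_pred, pos_label=0))  # 0.0  -- never identifies a failing student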
Reporting that 0.90 without the baseline would make a result look far stronger than it is. Without clear context, it's difficult to determine the true efficacy of the methods used. In education data mining, where the stakes are so high for students and schools alike, result transparency is especially important--choosing and presenting the most appropriate metrics is crucial for the research to have lasting meaning. Keeping all of this in mind, Anlan and I gained insight into the research standards for metric reporting from Dr. Rangwala, and discussed the metrics used with our REU cohort. This focus on meaningful results will hopefully continue as we work on the research write-up and create appropriate visuals of our findings!