This week the REU group took part in a data mining hackathon with students from the ASL research team. The entire group was split into random teams; we got our assignment and then had three hours to try different approaches to the problem (and eat pizza).
The Assignment: predicting the quality of a Portuguese red wine given certain physicochemical attributes.
The Methods: any and all that we could come up with!
My teammates for this project were one of the ASL research students and a graduate student working with Dr. Rangwala. All of us came from different skill levels and computing backgrounds, so after receiving the problem statement we first had to convene and discuss how we could each contribute to the project. We decided that the best way to begin the mining would be to try out different models on the data and compare their performances.

My teammates began on a neural network approach after we consolidated our data. Because this was a classification problem and I had done some coding with Support Vector Machine (SVM) classifiers over the summer before arriving at the REU site, I started with that approach. The results were already decent--around 70% accuracy--but I decided to try other models as well to see how they compared. I tried a random forest classifier, which improved the accuracy score. After deciding to continue with the random forest classifier, I began looking at the attributes of the data to figure out whether some of them could be dropped. I made a correlation graph to guide this, and experimented with excluding combinations of variables. In addition to the variable selection work, I looked at individual records to see if there were outliers that could be dropped to improve the quality of the training data. However, these changes did not significantly improve our accuracy, which peaked at 75% on the test set.
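The model comparison above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not our hackathon code: sklearn's built-in wine dataset stands in for the UCI Portuguese red wine quality data so the example is self-contained, and all model settings are defaults.

```python
# Hedged sketch: compare an SVM and a random forest on a wine dataset.
# load_wine is a stand-in for the actual red wine quality data we used.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

results = {}
for name, model in [("SVM", SVC()),
                    ("Random forest", RandomForestClassifier(random_state=42))]:
    model.fit(X_train, y_train)                      # train on the training split
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {results[name]:.3f}")
```

As in the hackathon, the random forest tends to come out ahead here, partly because tree ensembles are insensitive to the very different scales of the attribute columns, while an unscaled SVM is not.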
If I were to continue with this project, I would use a grid search to tune the models' hyperparameters more efficiently. In addition, my group used 0-1 (min-max) normalization for the attributes, and I would like to explore how changing the normalization method affects the results. Two of the top groups used the StandardScaler method in sklearn, which standardizes each attribute column individually, making its mean 0 and standard deviation 1.
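Both follow-ups can be combined by putting StandardScaler and the classifier in a pipeline and handing it to GridSearchCV. This is a sketch under assumptions: the built-in wine dataset stands in for the hackathon data, and the parameter grid values are illustrative, not ones any team actually used.

```python
# Hedged sketch: StandardScaler + SVM in a pipeline, tuned by grid search.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),  # per-column mean 0, standard deviation 1
    ("clf", SVC()),
])
# Illustrative grid; GridSearchCV cross-validates every combination.
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```

Putting the scaler inside the pipeline matters: it is re-fit on each cross-validation training fold, so the held-out fold never leaks into the scaling statistics.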
I really enjoyed taking part in this hackathon because it was an opportunity to work on a data mining problem without a prescribed path. Although we could test our models ourselves to determine their efficacy, there were no requirements on the types of models or data cleaning methods we should try. In addition, working with teams in the same room provided motivation to keep competing to improve the accuracy of our models. Before beginning to prepare for the REU site I didn't have much experience in Python or data mining, so being able to achieve a good level of accuracy just by looking up DM documentation and exploring different methods strengthened my confidence that I will be able to follow the same protocol with the research data this summer.
The most valuable takeaway from the hackathon was understanding the path of researching, trying, evaluating, and modifying DM models for use on a test case. I come away from the experience with a stronger vision of and more excitement for the work ahead!