// Wk2:

HACKATHON REFLECTIONS

This week we worked together with some of the SCIP research students to compete in a data mining hackathon. The problem was wine quality prediction: given a set of physicochemical features, can we accurately predict whether a wine will be good or bad? We were allowed to use any data mining technique as well as any libraries we pleased; Mark and Huzefa suggested using the scikit-learn SVM tutorial we completed for homework as a starting point, but otherwise we were totally in charge of finding the model with the best predictive ability.

How to describe my first hackathon? Let me start with the expectation: according to the calendar, the hackathon was scheduled to last 5 hours, so naturally I expected it to be long. I also expected some element of competition and, logically, some degree of competition-induced stress. I envisioned myself working with my project partner and referring somewhat frequently to the data mining tutorial that we completed as homework prior to week 1, but otherwise not encountering unfamiliar territory. At this point the reader has probably assumed that I generally enjoy coding (safe and accurate assumption, Reader!), so lastly I expected it to be a fun experience.

As for the reality? The hackathon actually lasted closer to 3.5 hours, so it wasn't quite as long as scheduled, but even then it felt much shorter than I anticipated. I thought that near the end I would surely be losing steam, but it was very much the opposite. By the end of the 3.5 hours, I found myself still gaining steam and wishing we had more time to experiment with different methods. There was in fact a degree of competition, as expected, to see who could achieve the greatest accuracy, but what contributed most to my personal stress levels was expecting that I would know exactly what to do and then having to confront the reality that a solution (for me) wasn't as straightforward as I was hoping. We ended up working in randomly assigned groups, so my team consisted of three people working in three different stages of three different SCIP/REU projects, and I ended up referring to much more than just the data mining tutorial we completed for homework. So, needless to say, I encountered some relatively unfamiliar territory after all.

My team tried a few different approaches. Making sure to cross-validate each time, we tried SVM, k-nearest neighbors, and random forest. Random forest seemed to work best, so we tried to improve our model by eliminating individual outliers as well as redundant features. For example, citric acid and pH were the most negatively correlated features, so we tested how removing either of them affected the accuracy of our model. Unfortunately we were unable to improve beyond ~71% accuracy given the time constraint, but given more time, we would have liked to try more approaches. One of my teammates remarked that his main takeaway was 'sometimes simpler is better,' and that in another world we would have begun with a simple regression and worked our way up from there.
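
For anyone curious what that workflow looks like in code, here is a minimal sketch rather than our actual hackathon script. It assumes scikit-learn and NumPy are installed, and it uses a synthetic dataset from make_classification as a stand-in for the wine data, so the feature columns, the hyperparameters, and the helper name mean_cv_accuracy are all illustrative assumptions on my part. It cross-validates the three model types, picks the one with the highest mean accuracy, then finds the most negatively correlated pair of features and checks what happens when either one is dropped.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in for the wine data (11 numeric features,
# binary good/bad label). Swap in the real feature matrix and labels here.
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_redundant=2, random_state=0
)

# The three model families we tried. Hyperparameters are illustrative defaults.
# SVM and k-NN are distance-based, so they get feature scaling in a pipeline.
models = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}


def mean_cv_accuracy(model, features, labels, folds=5):
    """Hypothetical helper: mean accuracy over k-fold cross-validation."""
    return cross_val_score(model, features, labels, cv=folds, scoring="accuracy").mean()


# Cross-validate every model and keep the best-scoring one.
scores = {name: mean_cv_accuracy(model, X, y) for name, model in models.items()}
best_name = max(scores, key=scores.get)
print("CV accuracy per model:", scores)
print("Best model:", best_name)

# Find the most negatively correlated pair of features. The diagonal of the
# correlation matrix is 1, so the minimum is always an off-diagonal entry.
corr = np.corrcoef(X, rowvar=False)
i, j = np.unravel_index(np.argmin(corr), corr.shape)
print(f"Most negatively correlated features: {i} and {j} (r = {corr[i, j]:.2f})")

# Re-score the best model with each member of that pair removed in turn.
drop_scores = {}
for col in (i, j):
    X_reduced = np.delete(X, col, axis=1)
    drop_scores[int(col)] = mean_cv_accuracy(models[best_name], X_reduced, y)
print("CV accuracy after dropping each feature:", drop_scores)
```

On synthetic data the numbers themselves mean nothing; the point is the shape of the loop. Scoring every candidate with the same cross-validation setup is what makes the comparison between models, and between feature sets, a fair one.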

The experience of my first hackathon left me with two main insights into the way I prefer to work. First: Practice is good. I can listen to Mark and Huzefa talk about data mining as attentively as I can, and I can re-read their slides as much as I want, but at the end of the day, it's putting those principles and concepts to the test through actual practice that solidifies my understanding. Second: I love teamwork. I love not only collaborating on projects with others, but also just working in a shared space with others who are working towards similar goals. Despite feeling mildly out of my comfort zone because the reality differed from my expectations, I was extremely happy with the experience. I left feeling like I had learned a lot and could replicate the same work process later on.