Week 2

Week 2 started off with a Python overview/tutorial run by Mark Snyder, our resident programming languages expert and a co-PI for this REU. I already have a fair amount of experience with Python, since I used it heavily during my research project last summer. Still, he covered things I hadn't seen before, including some weird edge cases (like the fact that CPython keeps integers from -5 through 256 as pre-built cached objects, while every other value gets a brand-new object each time? weird...).
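
For anyone curious, here is a minimal sketch of that caching behavior in CPython. I construct the integers with int() on strings so the interpreter can't fold them into a shared constant, and then compare identities with `is`:

    # CPython caches small integers (-5 through 256) as singleton objects.
    a = int("256")
    b = int("256")
    print(a is b)   # True: both names point at the one cached 256 object

    # Values outside that range get a fresh object on each creation.
    c = int("257")
    d = int("257")
    print(c is d)   # False: two distinct int objects that merely compare equal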

I spent the next few days trying to figure out the direction of my project. I'll be working with another undergraduate, Peter Cherniavsky, and we have Mark as our advisor. The focus of our project is going to be developing a model that can predict in-class performance based on prior activity and performance in the class. We are working with a data set of a couple hundred GMU students who took an introductory CS course over the past two years. The data includes grades on all graded activities during the course (such as homework, quizzes, tests, and labs), submission information and timestamps from the online textbook zyBooks, and submissions to the in-class polling system Pytania.

On Wednesday we had a data mining 'Hackathon'/competition. We were split into teams of three and given the task of creating a model for predicting wine quality from a variety of physical attributes (e.g. pH, citric acid content, density). We were given a training data set of 800 samples, each labelled on a binary scale as good or bad. Our first thought was to try to get a feel for the data and look for any obvious correlations between features and the target variable, so we made a bunch of graphs plotting each feature against the target. Sadly, there was essentially no correlation evident in any of our graphs.
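
A rough sketch of the kind of exploratory plotting we did is below. The file name and column names (like 'pH' and 'quality') are just placeholders for whatever the actual competition CSV used:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the training data; 'train.csv' and its column names are assumptions.
    wine = pd.read_csv("train.csv")
    target = "quality"  # binary label: good (1) or bad (0)

    # Scatter each feature against the label to eyeball any obvious correlation.
    features = [col for col in wine.columns if col != target]
    fig, axes = plt.subplots(len(features), 1, figsize=(6, 3 * len(features)))
    for ax, col in zip(axes, features):
        ax.scatter(wine[col], wine[target], alpha=0.3)
        ax.set_xlabel(col)
        ax.set_ylabel(target)
    plt.tight_layout()
    plt.show()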

Next, we decided to try out a few of the classification algorithms provided by scikit-learn, including a neural network and a support vector classifier (SVC). We got the best cross-validation results with the support vector classifier, at about 72% accuracy. We then thought it might be a good idea to normalize the data with sklearn.preprocessing.Normalizer, but this led to a sizable drop in accuracy (in hindsight, Normalizer rescales each sample, i.e. each row, to unit norm rather than standardizing each feature). Fortunately, we persevered and tried another approach with StandardScaler, which standardizes each feature to zero mean and unit variance. This time our results improved all the way up to 76% accuracy, which put us near the top of the leaderboard. In the time we had left we tried a few other things, such as replacing 0 values, which we assumed were missing data, with the mean of the column. Unfortunately, this significantly decreased the accuracy of our predictive model. We also tried playing around with the SVC's parameters, but in the end we were unable to improve upon the results from the default values.
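
Roughly what our best-scoring setup looked like: a StandardScaler feeding into a default SVC, evaluated with cross-validation. Details like the file name, the 'quality' label column, and the fold count are my reconstruction, not the exact competition code:

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # 'train.csv' and the 'quality' label column are assumptions about the data layout.
    wine = pd.read_csv("train.csv")
    X = wine.drop(columns=["quality"])
    y = wine["quality"]

    # Standardize each feature (zero mean, unit variance), then fit the default SVC.
    model = make_pipeline(StandardScaler(), SVC())

    # 5-fold cross-validation accuracy, in the same ballpark as the ~76% we saw.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(scores.mean())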

One of my big takeaways is that data mining involves a lot of trial and error. A number of things we tried, expecting them to improve our model, actually had no effect or even hurt our accuracy. Still, as long as you don't give up and keep trying new things, you eventually make improvements in your predictions. I was also surprised by how easy it was to create a predictive model using the sklearn package. It almost felt like cheating when I was able to implement a neural network, an incredibly complex algorithm, in only a few lines of code. Of course, building a thorough understanding of how these algorithms actually work would let me apply them more thoughtfully and effectively, and that is one of my goals for the summer.
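
To give a sense of what I mean by "a few lines," here is a hedged sketch (not our actual competition code) of training a small neural network with scikit-learn's MLPClassifier, assuming X and y are loaded as in the earlier snippet:

    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # One hidden layer of 50 units; max_iter raised so training actually converges.
    net = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
    )
    net.fit(X, y)
    print(net.score(X, y))  # training accuracy, just to show how little code it takes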