Week 3

This week I started working with our data for the first time. I spent some time figuring out how to load all the relevant data sets, and I made a bunch of simple graphs using Python's Matplotlib library to help me visualize and better understand the basics of our data. The short-term goal now is to create a linear regression model for predicting grades on the final exam. To start, I'll make a separate model for each of our three sources of data (gradebook, zyBook, and Pytania). Each model should be set up so that a single parameter can easily be tuned to change how early in the semester we make the prediction (e.g. I might want to predict final grades right after the first test, or I might want to predict them halfway through the course).
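To make the "tunable cutoff" idea concrete, here is a minimal sketch of what I have in mind. It is not our actual pipeline: the function names, the `cutoff_week` parameter, and the toy numbers are all made up for illustration, and it uses a single feature (the average score up to the cutoff) fitted with ordinary least squares in plain Python.

```python
from statistics import mean

def fit_cutoff_model(weekly_scores, final_scores, cutoff_week):
    """Fit final = slope * (average score up to cutoff_week) + intercept.

    weekly_scores: one list of per-week scores per student (hypothetical layout).
    cutoff_week:   how many weeks of data the model is allowed to see.
    """
    # Only use information available up to the cutoff.
    xs = [mean(row[:cutoff_week]) for row in weekly_scores]
    x_bar, y_bar = mean(xs), mean(final_scores)
    # Closed-form ordinary least squares for one feature.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, final_scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict_final(model, scores_so_far):
    """Predict a final exam grade from the scores seen so far."""
    slope, intercept = model
    return slope * mean(scores_so_far) + intercept

# Toy, made-up data: four students, three weeks of scores each.
toy_weekly = [[50, 60, 70], [80, 80, 90], [60, 70, 65], [90, 100, 95]]
toy_finals = [66, 82, 71, 90]

# Changing cutoff_week is the only thing needed to predict earlier or later.
early_model = fit_cutoff_model(toy_weekly, toy_finals, cutoff_week=1)
later_model = fit_cutoff_model(toy_weekly, toy_finals, cutoff_week=3)
print(predict_final(early_model, [75]))
print(predict_final(later_model, [75, 80, 85]))
```

The real version would presumably use many features per data source rather than one average, but the cutoff parameter would work the same way: slice the data first, then fit.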

Our deliverable for this week was to critique a research paper related to our project. I chose the paper 'Mining Educational Data to Analyze Students' Performance' (Baradwaj, Pal). The paper was published in 2011 and aimed to "justify the capabilities of data mining techniques in the context of higher education". The data set was obtained from VBS Purvanchal University and consisted of course records from about 50 students. Features of the data set included grades in previous semesters, test and seminar scores, and attendance, among others. Surprisingly, all of these features were recorded as discrete categories with only a small number of options. For example, attendance was labelled as 'Good', 'Average', or 'Poor' rather than with a more precise metric.

Baradwaj and Pal decided to use a decision tree classifier for their predictions. Other methods, such as KNN and neural networks, were described in the definitions section, but only the decision tree was actually applied. To construct the tree they used ID3, a greedy algorithm that repeatedly splits the data on whichever attribute maximizes information gain. The resulting classification rules were presented; however, no cross-validation was performed to measure the accuracy of the decision tree classifier.
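To get a feel for the splitting criterion, here is a small sketch of the information gain calculation that ID3 uses to pick an attribute. This is my own illustration rather than code from the paper: the function names are mine, and the toy rows are made up (only the attribute names and the 'Good'/'Poor' style of values echo the paper's data).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting the rows on one categorical attribute."""
    total = len(labels)
    # Group the class labels by the value each row has for this attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """The greedy ID3 step: choose the attribute with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Made-up toy records, not data from the paper.
toy_rows = [
    {"attendance": "Good", "seminar": "Good"},
    {"attendance": "Good", "seminar": "Poor"},
    {"attendance": "Poor", "seminar": "Good"},
    {"attendance": "Poor", "seminar": "Poor"},
]
toy_labels = ["Pass", "Pass", "Fail", "Fail"]
print(best_attribute(toy_rows, toy_labels, ["attendance", "seminar"]))
```

A full ID3 implementation would apply `best_attribute` recursively to each resulting subset until the labels in a subset are all the same or no attributes remain.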

The general idea of this paper is similar to what Peter and I are trying to accomplish. We wish to predict in-class performance by applying data mining techniques to student data. However, we have access to a wider array of attributes (such as interaction with the online textbook and the in-class polling system) as well as a much larger sample size (around 400 students vs. the 50 that they used). We also aim to apply a variety of classification techniques and compare their efficacy in order to create the best possible classifier. Beyond the classification task, we want to better understand the factors that contribute to student success. The decision tree method employed in this paper should allow for interpretable results, but the authors did not discuss what their tree revealed. Hopefully we will be able to analyze our classifier to discover exactly which aspects of the course or which student behaviors correlate with better or worse performance. These results could then be used to improve the course in the future.