Week 2

Our second week was much more hands-on from a technical standpoint. We received the dataset from one of our colleagues, Xavier. Our direction for this week was still a bit vague, so our mentor (Rangwala) helped us come up with questions to define the scope of the problems we were trying to solve. He also gave us the option of either building a dashboard for administrative uses or working on predicting the retention rate of students. I personally felt more drawn to working on student retention.

This week, we also had our first Hackathon! In this Hackathon, we were given the task of predicting whether a wine was good or bad (represented by G and B, respectively), using a dataset of wine attributes (such as citric acid, pH, etc.).

Our first step was to convert the categorical variables to numerical ones, so I mapped the G's and B's to 1's and 0's (with only two classes, this is simple label encoding rather than true one-hot encoding). Next, I made a correlation matrix, a visualization of how strongly each feature is correlated with every other feature. I was mainly looking for features with no variance (constant features, such as a column of all 1's or all 0's), since a feature with no variance gives the model nothing to distinguish the classes with; such features show up as white cells in the correlation matrix because their correlations are undefined. However, no features turned out to be constant, so nothing needed to be removed right away.

As a result, I decided to test some preliminary models with train_test_split, a function in scikit-learn that automatically splits your data into training and test sets so that you can evaluate your model and see what the accuracy is. At first, I tried linear regression. However, I realized that linear regression did not make sense for a classification problem, as it is used to predict continuous variables. So I switched to LogisticRegressionCV, and on my first try we actually managed to get a score of 76%. However, I was not satisfied with my model, so I also tried Gaussian Naive Bayes (43% accuracy) and two K-means runs (73% and 63%). The model that gave us the highest accuracy was an SVM tuned with GridSearchCV. Grid search is an algorithm that automatically chooses the best parameters by trying each candidate combination through cross-validation. With the SVM, we ended up with an accuracy of 77%.
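The encoding and zero-variance check described above can be sketched like this. This is a minimal illustration on made-up toy data; the column names and values are assumptions, not the Hackathon's actual schema.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
# (column names and values are invented for illustration).
df = pd.DataFrame({
    "citric acid": [0.0, 0.3, 0.3, 0.5],
    "pH": [3.3, 3.2, 3.1, 3.0],
    "quality": ["G", "B", "G", "B"],
})

# Map the binary G/B label to 1/0.
df["quality"] = df["quality"].map({"B": 0, "G": 1})

# Correlation matrix: a constant (zero-variance) feature would
# produce undefined (NaN) correlations, which heatmaps typically
# render as blank/white cells.
corr = df.corr()

# Explicit check for constant features.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)
```

Here `constant_cols` comes back empty, mirroring the situation in the Hackathon where no feature needed to be dropped.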
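The modeling loop (train_test_split, a cross-validated logistic regression baseline, then an SVM tuned with GridSearchCV) can be sketched as below. Since the competition data isn't included here, this uses scikit-learn's synthetic make_classification as a stand-in, so the scores it prints won't match the 76%/77% from the Hackathon.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and G/B labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out a test set so the model can be evaluated honestly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Baseline: logistic regression with built-in cross-validated
# selection of the regularization strength.
logreg = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)
print("LogisticRegressionCV accuracy:", logreg.score(X_test, y_test))

# SVM with GridSearchCV: cross-validation picks the best
# parameter combination from the grid.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best SVM params:", grid.best_params_)
print("SVM accuracy:", grid.score(X_test, y_test))
```

One design note: GridSearchCV refits the best estimator on the full training split by default, so `grid.score` evaluates the tuned model directly.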
As a result, we ended up winning the competition. Technically speaking, we tied with the other teams, but for this blog, let's say it was a win!