Groups:
Group 1 (xena01-03) - Usiri, Calvin; Bomer, Dan; Holloway, Taylor
Group 2 (xena04-06) - Burnett, Jesse; Skogman, Brett; Andres, Robbie
Group 3 (xena07-09) - Croxton, John; Yang, Mary; Reyes, Miguel
Group 4 (xena10-12) - Chang, Stephen; Koeller, Jordan; Walker, Blair
Group 5 (xena13-15) - Newton, Michael; Whitten, Marcus; Witecki, Ian
Group 6 (xena16-18) - Fordin, Sarah; Ang, Sam; Burton, Craig
Group 7 (xena19-21) - Viltoft, Jorgen; Samoray, Nicholas; Herbert, Emily
Data Set:
This week's data set comes from the Trinity office of admissions. Your goal is to write a predictor that tells, based on a variety of factors from a prospective student's record, whether that student will choose to attend Trinity. The data file is called AdmissionAnon.tsv and it is in the /data/BigData/admissions directory. (Note that by our standards this file is actually quite small, with slightly under 3000 records.) Your overall goal for the week is to come up with the best classifier for the last column in that file based on the other columns.
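As a starting point, here is a minimal sketch of reading a tab-separated file into a Spark DataFrame. The object name LoadAdmissions, the helper readTsv, and the local master setting are hypothetical choices for illustration. Whether the file has a header row is not stated here, so the hasHeader flag is an assumption you should check against the file itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadAdmissions {
  // Path taken from the data set description.
  val dataPath = "/data/BigData/admissions/AdmissionAnon.tsv"

  /** Reads a tab-separated file and lets Spark infer the column types.
    * ASSUMPTION: the caller decides hasHeader after looking at the file. */
  def readTsv(spark: SparkSession, path: String, hasHeader: Boolean): DataFrame =
    spark.read
      .option("sep", "\t")
      .option("header", hasHeader.toString)
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    // ASSUMPTION: running locally; adjust the master for your machine.
    val spark = SparkSession.builder()
      .appName("Admissions")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val data = readTsv(spark, dataPath, hasHeader = false)
    data.printSchema() // check that the inferred types look reasonable
    data.show(5)

    spark.stop()
  }
}
```

If there is no header row, Spark names the columns _c0, _c1, and so on, so the last column can always be found with data.columns.last.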
In Class Questions:
1. How many rows and columns does the data have?
2. How many different values does the last column have?
3. How many rows are there with each of those values?
4. Using the corr method of the org.apache.spark.ml.stat.Correlation object, calculate the correlation matrix for the numeric columns. The example at https://spark.apache.org/docs/latest/ml-statistics.html could be very helpful for figuring out how to do this. Copy the matrix into the email.
5. Based on the correlation matrix, which three columns are most highly correlated (or anti-correlated) with the last column?
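The questions above can be approached with a handful of DataFrame operations. The following is a rough sketch, not a required solution. The object and method names are hypothetical, it assumes the DataFrame was read with inferred types, it skips rows with missing numeric values (one possible choice), and it assumes the last numeric column is the one you are trying to predict.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.NumericType

object ExploreAdmissions {
  /** (number of rows, number of columns). */
  def shape(df: DataFrame): (Long, Int) = (df.count(), df.columns.length)

  /** One row per distinct value of the last column, with its row count. */
  def lastColumnCounts(df: DataFrame): DataFrame =
    df.groupBy(df.columns.last).count()

  /** Names of the columns whose inferred type is numeric. */
  def numericColumns(df: DataFrame): Array[String] =
    df.schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map(_.name)

  /** Pearson correlation matrix over the numeric columns.
    * ASSUMPTION: rows with null or NaN values are skipped. */
  def correlationMatrix(df: DataFrame): (Array[String], Matrix) = {
    val cols = numericColumns(df)
    val assembled = new VectorAssembler()
      .setInputCols(cols)
      .setOutputCol("features")
      .setHandleInvalid("skip")
      .transform(df)
    val Row(m: Matrix) = Correlation.corr(assembled, "features").head
    (cols, m)
  }

  /** The n columns with the largest absolute correlation to the last
    * numeric column. Undefined (NaN) correlations are dropped. */
  def topCorrelatedWithLast(cols: Array[String], m: Matrix, n: Int): Seq[(String, Double)] = {
    val last = cols.length - 1
    cols.indices
      .filter(_ != last)
      .map(i => cols(i) -> m(i, last))
      .filterNot(_._2.isNaN)
      .sortBy { case (_, c) => -math.abs(c) }
      .take(n)
  }
}
```

Printing the Matrix with println(m.toString(cols.length, Int.MaxValue)) gives a text form that can be pasted into the email.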
Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.
Between Class Questions:
All the code that you write to answer these questions should be put in a package called sparkml in the in-class repository. You should also make a file called sparkml-classify.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.
1. Find the best classification scheme you can for the last column. There are two ways to interpret the classification, and I want you to do them independently. (See a and b below.) Also include an explanation of which columns are the key factors in the classification. You should probably try a few of the classifiers to see which ones work best, but your explanation of the key factors may use a classifier that isn't optimal.
a. Classify all values for the last column.
b. Classify the last column into two sets: 0 and 1 vs 2, 3, and 4.
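To make the two interpretations concrete, here is a hedged sketch of how the label column might be built for each case and then fed to one classifier. The random forest is only an example, not necessarily the best choice, and the names ClassifyAdmissions, withLabel, and evaluate are hypothetical. It assumes the last column holds the values 0 through 4, that every feature column is numeric, and that an 80/20 train/test split is acceptable.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

object ClassifyAdmissions {
  /** Adds a "label" column built from the target column.
    * binary = false: keep every value (part a).
    * binary = true: 0 and 1 become 0.0; 2, 3, and 4 become 1.0 (part b).
    * Rows with a missing target are dropped. */
  def withLabel(df: DataFrame, target: String, binary: Boolean): DataFrame = {
    val raw = col(target).cast("double")
    val label = if (binary) when(raw <= 1.0, 0.0).otherwise(1.0) else raw
    df.filter(col(target).isNotNull).withColumn("label", label)
  }

  /** Trains a random forest on 80% of the rows and returns the accuracy on
    * the remaining 20%, plus feature importances sorted from high to low.
    * ASSUMPTION: featureCols are numeric and do not include the target. */
  def evaluate(df: DataFrame, featureCols: Array[String], seed: Long): (Double, Seq[(String, Double)]) = {
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .setHandleInvalid("skip") // drop rows with missing feature values
      .transform(df)
    val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed)

    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSeed(seed)
      .fit(train)

    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(model.transform(test))

    val importances = featureCols.zip(model.featureImportances.toArray).sortBy(-_._2).toSeq
    (accuracy, importances)
  }
}
```

Tree-based models expose featureImportances, which is one way to support an explanation of the key factors even if a different classifier ends up with the best accuracy. A single random split gives a noisy accuracy estimate on a data set this small, so repeating with several seeds or using cross-validation may be worth considering.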