Classification Problems

Data Set:

This week's data set comes from the Trinity office of admissions. Your goal is to write a predictor that tells, based on a variety of factors from a prospective student's records, whether that student will choose to attend Trinity. The data file is called AdmissionAnon.tsv and it is in the /data/BigData/admissions directory. (Note that by our standards this file is quite small, with slightly under 3000 records.) Your overall goal for the week is to come up with the best classifier for the last column in that file based on the other columns.
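As a starting point, here is a minimal sketch of reading the tab-separated file into a Spark DataFrame. It makes several assumptions that the description above does not confirm: that the file has no header row, that letting Spark infer the column types is acceptable, and that you are running with a local master. The object name LoadAdmissions is only illustrative.

```scala
package sparkml

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical entry point; the name is illustrative, not part of the assignment.
object LoadAdmissions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparkml-classify")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val data: DataFrame = spark.read
      .option("sep", "\t")           // the file is tab-separated
      .option("header", "false")     // assumption: no header row; change if the file has one
      .option("inferSchema", "true") // let Spark guess numeric vs. string columns
      .csv("/data/BigData/admissions/AdmissionAnon.tsv")

    // Inspect what Spark inferred before building anything on top of it.
    data.printSchema()
    data.show(5)

    spark.stop()
  }
}
```

With no header row, Spark names the columns _c0, _c1, and so on, so the last column can be looked up with data.columns.last rather than by a hard-coded name.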

Questions:

All the code that you write to answer these questions should be put in a package called sparkml in the assignment repository. You should also make a file called sparkml-classify.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. How many rows and columns does the data have?

2. How many different values does the last column have?

3. How many rows are there with each of those values?

4. Using the corr method of the org.apache.spark.ml.stat.Correlation object, calculate the correlation matrix for the numeric columns. The example at https://spark.apache.org/docs/latest/ml-statistics.html could be very helpful for figuring out how to do this. Copy the matrix into the write-up.

5. Based on the correlation matrix, which three columns are most highly correlated (or anti-correlated) with the last column?

6. Find the best classification scheme you can for the last column. There are two ways to interpret the classification, and I want you to do both independently. (See a and b below.) Also, include an explanation of which factors matter most for the classification. You should probably try several of the classifiers to see which ones work best, but your explanation of the factors may be based on a classifier that isn't the optimal one.

a. Classify all values for the last column.

b. Classify the last column into two sets: 0 and 1 vs 2, 3, and 4.
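Two of the steps above can be sketched in code: building the correlation matrix for question 4 and deriving the two-class label for question 6b. This is only one possible approach, not a reference solution. It assumes Spark 2.4 or later (for setHandleInvalid on VectorAssembler), that the last column was read as a numeric type, and that skipping rows with missing values is acceptable. The names AdmissionsSketch, correlationMatrix, withBinaryLabel, "features", and "binaryLabel" are hypothetical.

```scala
package sparkml

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.NumericType

// Hypothetical helpers; all names here are illustrative.
object AdmissionsSketch {

  // Pearson correlation matrix over every numeric column of `data`.
  // Returns the column names in the same order as the matrix rows and columns.
  def correlationMatrix(data: DataFrame): (Array[String], Matrix) = {
    val numericCols = data.schema.fields
      .filter(_.dataType.isInstanceOf[NumericType])
      .map(_.name)

    // Correlation.corr works on a single vector column, so pack the
    // numeric columns into one.
    val assembled = new VectorAssembler()
      .setInputCols(numericCols)
      .setOutputCol("features")
      .setHandleInvalid("skip") // assumption: rows with null or NaN values are dropped
      .transform(data)

    val Row(matrix: Matrix) = Correlation.corr(assembled, "features").head
    (numericCols, matrix)
  }

  // Adds a two-class label: 0.0 for original values 0 and 1, 1.0 for 2, 3, and 4.
  // Assumes `labelCol` is numeric; a null label falls through to 1.0, so
  // filter out nulls first if the data has any.
  def withBinaryLabel(data: DataFrame, labelCol: String): DataFrame =
    data.withColumn("binaryLabel", when(col(labelCol) <= 1, 0.0).otherwise(1.0))
}
```

If the last column is numeric, it is the last entry of the returned name array, so the final row (or column) of the matrix holds its correlation with every other numeric column. The label is stored as a double because the Spark ML classifiers expect a numeric label column.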