MLlib and Regression Problems

Data Set:

For these problems, you will be working with CalCOFI Oceanographic data (https://www.kaggle.com/sohier/calcofi). This data set has 60 years of data from bottles that have been thrown into the Pacific Ocean. I have a copy of it in /data/BigData/Oceans/. There are two files. The bottle.csv file has data from the individual bottles that were used for sampling. The cast.csv file has information on the casts of the bottles. Multiple bottles are generally thrown out at each cast. The Cst_Cnt column appears in both files and can be used to join them.

All the code that you write to answer these questions should be put in a package called sparkml in the assignment repository. You should also make a file called sparkml-regression.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

Questions:

1. How many columns in bottle.csv have data for at least half the rows?
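To make the criterion in question 1 concrete, here is a minimal plain-Python sketch, not a Spark solution. It assumes a missing value shows up as an empty field, and it runs on a tiny made-up sample rather than the real bottle.csv; the name `columns_at_least_half_full` and the sample values are hypothetical. The same count-and-compare logic should carry over to DataFrame operations.

```python
import csv
import io

# Hypothetical miniature stand-in for bottle.csv; the values are made up.
SAMPLE = """Cst_Cnt,Depthm,T_degC,Salnty,O2ml_L
1,0,10.5,33.44,
1,8,10.46,33.44,
2,0,,,
2,10,10.3,,5.9
"""


def columns_at_least_half_full(csv_text):
    """Return the names of columns that are non-empty in at least half the rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames
    filled = {name: 0 for name in names}
    total = 0
    for row in reader:
        total += 1
        for name in names:
            value = row[name]
            # Assumption: a missing measurement is an empty (or blank) field.
            if value is not None and value.strip() != "":
                filled[name] += 1
    # "At least half" means filled / total >= 1/2, written without division.
    return [name for name in names if 2 * filled[name] >= total]


if __name__ == "__main__":
    print(columns_at_least_half_full(SAMPLE))
```

Note that the comparison is inclusive, so a column filled in exactly half the rows counts.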

2. How many bottles were in each cast on average? (Don't make this harder than it needs to be.)
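One simple reading of question 2 is the number of bottle rows divided by the number of distinct Cst_Cnt values. The plain-Python sketch below illustrates that arithmetic on made-up cast ids. The function name `average_bottles_per_cast` is hypothetical, and the approach assumes every cast of interest has at least one row in bottle.csv.

```python
def average_bottles_per_cast(cast_ids):
    """Average rows per distinct cast id: total rows / number of distinct ids."""
    ids = list(cast_ids)
    if not ids:
        raise ValueError("need at least one bottle row")
    return len(ids) / len(set(ids))


if __name__ == "__main__":
    # Made-up Cst_Cnt values: cast 1 has 2 bottles, cast 2 has 3, cast 3 has 1.
    print(average_bottles_per_cast([1, 1, 2, 2, 2, 3]))
```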

3. Plot the locations of the casts using Lat_Dec and Lon_Dec. Make the point size indicate depth and the point color indicate temperature.

4. Using linear regression, make a prediction of salinity based only on temperature for the bottles data. What is the average error in your predictions?
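As a reminder of what a one-variable linear regression computes, here is a minimal plain-Python sketch of an ordinary least-squares line fit, together with one possible reading of "average error": the mean absolute difference between predicted and actual values. That reading is an assumption (root-mean-square error is another common choice), the function names are hypothetical, and the numbers are made up. For the assignment itself you would use the regression classes in SparkML rather than this hand-rolled version.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(xs, ys, slope, intercept):
    """Average of |actual - predicted| over all points."""
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up values standing in for (temperature, salinity) pairs.
    temps = [0.0, 1.0, 2.0]
    salts = [0.0, 1.0, 0.0]
    slope, intercept = fit_line(temps, salts)
    print(slope, intercept, mean_absolute_error(temps, salts, slope, intercept))
```

Whichever error measure you choose, state it in your write-up so the numbers for questions 4 and 5 can be compared.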

5. Now do a linear regression that also includes Depth and O2ml_L. What is the average error for this?

6. Using any regression algorithm you want in SparkML, make a prediction of O2ml_L from the other columns, excluding O2Sat, O2Satq, and any other column that has O2 in the name. If you restrict yourself to a 3-D input (three input columns), what is the best prediction you can make? What method and set of input columns produces it?

7. Make a plot that demonstrates what you found in #5.