Spark/RDD Problems

Data Set:

The file /users/mlewis/workspaceF18/CSCI3395-F18/data/ghcn-daily/2017.csv was pulled from the "by_year" directory at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. That same directory also has files called ghcnd-stations.txt and ghcnd-countries.txt that you will need to answer the questions for this week. You will need to read some of the support files on the web site in order to get information on the format of the data files.
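As a starting point for reading the data, here is a minimal sketch of two parsing functions in plain Scala. It assumes the layouts given in the GHCN-Daily readme files: the yearly CSV holds station ID, date (YYYYMMDD), element, and value in its first four comma-separated fields, and ghcnd-stations.txt is fixed-width with ID in columns 1-11, latitude in 13-20, longitude in 22-30, elevation in 32-37, state in 39-40, and name in 42-71. Check those column positions against the readme before relying on them. The names `GhcnParsing`, `Observation`, `Station`, `parseObservation`, and `parseStation` are hypothetical and are not part of the assignment.

```scala
object GhcnParsing {
  // One row of the by_year CSV. Field order is assumed from the readme.
  // For TMAX the value is assumed to be in tenths of a degree C,
  // and for PRCP in tenths of a millimeter.
  case class Observation(stationId: String, date: Int, element: String, value: Int)

  // One row of ghcnd-stations.txt. Column positions are assumed from the readme.
  case class Station(id: String, lat: Double, lon: Double,
                     elevation: Double, state: String, name: String)

  def parseObservation(line: String): Observation = {
    // The -1 limit keeps trailing empty flag fields instead of dropping them.
    val p = line.split(",", -1)
    Observation(p(0), p(1).toInt, p(2), p(3).toInt)
  }

  def parseStation(line: String): Station = {
    // Pad short lines so the fixed-width substring calls cannot run off the end.
    val s = line.padTo(71, ' ')
    Station(
      s.substring(0, 11),
      s.substring(12, 20).trim.toDouble,
      s.substring(21, 30).trim.toDouble,
      s.substring(31, 37).trim.toDouble,
      s.substring(38, 40).trim,
      s.substring(41, 71).trim
    )
  }
}
```

Because both functions take a single line of text, they could be passed to `map` on the `RDD[String]` that `sc.textFile` returns, or tried out on an ordinary `Seq[String]` first.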

Questions:

All the code that you write to answer these questions should be put in a package called sparkrdd in the assignment repository. (Note that a package is just a subdirectory. Since all of your code is going in src/main/scala for sbt, the code for this should be in src/main/scala/sparkrdd.) You should also make a file called sparkrdd.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. How many stations are there in the state of Texas?

2. How many of those stations have reported some form of data in 2017?

3. What is the highest temperature reported anywhere in 2017? Where was it reported, and when?

4. How many stations in the stations list haven't reported any data in 2017?

5. What is the maximum rainfall for any station in Texas during 2017? What station and when?

6. What is the maximum rainfall for any station in India during 2017? What station and when?

7. How many weather stations are there associated with San Antonio, TX?

8. How many of those have reported temperature data in 2017?

9. What is the largest daily increase in high temperature (from one day to the next) for San Antonio in this data file?

10. What is the correlation coefficient between high temperatures and rainfall for San Antonio? Note that you can only use values from the same date and station for the correlation.

11. Make a plot of temperatures over time for five different stations, each separated by at least 10 degrees in latitude. Make sure you tell me which stations you are using.
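For question 10, one common choice is the Pearson correlation coefficient. The assignment does not name a specific coefficient, so treat that choice as an assumption. The sketch below computes it for a plain Scala sequence of (high temperature, rainfall) pairs that have already been matched by date and station. The names `Correlation` and `pearson` are hypothetical.

```scala
object Correlation {
  // Pearson correlation coefficient for paired samples (x, y):
  // sum of (x - meanX)*(y - meanY) divided by the square root of
  // (sum of (x - meanX)^2 times sum of (y - meanY)^2).
  def pearson(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.size >= 2, "need at least two pairs")
    val n = pairs.size.toDouble
    val meanX = pairs.map(_._1).sum / n
    val meanY = pairs.map(_._2).sum / n
    val cov = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = pairs.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val varY = pairs.map { case (_, y) => (y - meanY) * (y - meanY) }.sum
    // If either series is constant the denominator is zero and the result is NaN.
    cov / math.sqrt(varX * varY)
  }
}
```

With an RDD, the pairs would typically come from keying the high-temperature and rainfall records by (station, date) and joining them. The same sums can then be computed on the joined values, or the joined pairs can be collected if they fit in memory.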