Spark/RDD Problems

Data Set:

The file /data/BigData/ghcn-daily/2019.csv was pulled from the "by_year" directory at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. This file can only be accessed on the Pandora machines. That same directory also has files called ghcnd-stations.txt and ghcnd-countries.txt, which you will need to answer the questions for this week. You will need to read some of the support files on the web site to learn the format of the data files. In the Markdown file, tell me who you partnered with for the in-class questions.
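As a starting point for reading these files, here is a minimal parsing sketch in plain Scala. It assumes the layouts described in the GHCN-Daily readme: ghcnd-stations.txt is fixed-width, and each line of the by_year CSV begins with station ID, date (YYYYMMDD), element, and value. The case classes, field names, and the GhcnParsing object are hypothetical names chosen for illustration, so check the column positions against the readme before relying on them.

```scala
// Hypothetical record types; the names are illustrative, not part of the assignment.
case class Station(id: String, lat: Double, lon: Double, elev: Double,
                   state: String, name: String)
case class Observation(id: String, date: Int, element: String, value: Int)

object GhcnParsing {
  // ghcnd-stations.txt is fixed-width. Assumed 1-based columns from the readme:
  // ID 1-11, LATITUDE 13-20, LONGITUDE 22-30, ELEVATION 32-37, STATE 39-40, NAME 42-71.
  def parseStation(line: String): Station = Station(
    line.substring(0, 11),
    line.substring(12, 20).trim.toDouble,
    line.substring(21, 30).trim.toDouble,
    line.substring(31, 37).trim.toDouble,
    line.substring(38, 40).trim,                       // empty for stations without a state
    line.substring(41, math.min(71, line.length)).trim
  )

  // by_year CSV lines: ID,YYYYMMDD,ELEMENT,VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME.
  // The limit of -1 keeps trailing empty fields. Per the readme, TMAX/TMIN values
  // are in tenths of a degree C and PRCP values are in tenths of a mm.
  def parseObservation(line: String): Observation = {
    val p = line.split(",", -1)
    Observation(p(0), p(1).toInt, p(2), p(3).toInt)
  }
}
```

Once the lines are loaded into RDDs, these functions can be passed to map, for example stationLines.map(GhcnParsing.parseStation). Keeping the parsing in plain functions also makes it easy to check them on a few sample lines before running on the full file.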

In Class Questions (done in groups):

1. How many stations are there in the state of Texas?

2. How many of those stations have reported some form of data in 2019?

3. What is the highest temperature reported anywhere in 2019? Where was it and when?

4. How many stations in the stations list haven't reported any data in 2019?

Between Class Questions (done alone):

All the code that you write to answer these questions should be put in a package called sparkrdd in the in-class repository. (Note that a package is just a subdirectory. Since all of your code is going in src/main/scala for sbt, the code for this should be in src/main/scala/sparkrdd.) You should also make a file called sparkrdd.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots. Once you do your last push of the files, send me an email with links to both files to let me know.
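To make the package layout concrete, a minimal skeleton might look like the following. This is a sketch under assumptions: the file name src/main/scala/sparkrdd/RDDQuestions.scala, the object name RDDQuestions, the app name, and the local[*] master are all hypothetical choices, and it assumes spark-core is already listed as a dependency in your build.sbt.

```scala
// Assumed location: src/main/scala/sparkrdd/RDDQuestions.scala (hypothetical file name)
package sparkrdd

import org.apache.spark.{SparkConf, SparkContext}

object RDDQuestions {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores (an assumption;
    // use whatever master setting was given in class).
    val conf = new SparkConf().setAppName("sparkrdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Each element of the RDD is one line of the CSV file.
    val lines = sc.textFile("/data/BigData/ghcn-daily/2019.csv")
    println(s"Lines in 2019.csv: ${lines.count()}")

    sc.stop()
  }
}
```

With this in place, sbt "runMain sparkrdd.RDDQuestions" should start the program, since the package declaration matches the subdirectory name.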

5. What is the maximum rainfall for any station in Texas during 2019? Which station, and when?

6. What is the maximum rainfall for any station in India during 2019? Which station, and when?

7. How many weather stations are associated with San Antonio, TX?

8. How many of those have reported temperature data in 2019?

9. What is the largest daily increase in high temperature for San Antonio in this data file?

10. Make a plot of temperatures over time for five different stations, each separated by at least 10 degrees in latitude. Make sure you tell me which stations you are using.