Special RDD Problems

Data Set:

Several year files, pulled from the "by_year" directory at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/, are in /data/BigData/ghcn-daily/ on the Pandora machines. That same directory also has a file called ghcnd-stations.txt that you will need to answer the questions for this week. You will need to read some of the support files on the website to learn the format of the data files.
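As a starting point for reading the files, here is a minimal parsing sketch in plain Scala. It assumes the layout described in the GHCN-Daily readme as I understand it: the by_year files are comma-separated (station id, date as yyyymmdd, element, value, then flag fields), temperature values are in tenths of a degree Celsius, and ghcnd-stations.txt is fixed-width with the id in columns 1-11, latitude in 13-20, longitude in 22-30, and name in 42-71. Check these offsets against the readme before relying on them. The names GhcnParse, Observation, and Station are made up for illustration.

```scala
object GhcnParse {
  // One row of a by_year CSV file (only the fields used here).
  case class Observation(id: String, date: Int, element: String, value: Int, qFlag: String)

  // One row of ghcnd-stations.txt (only the fields used here).
  case class Station(id: String, lat: Double, lon: Double, name: String)

  // Assumed column order: id, date, element, value, mFlag, qFlag, sFlag, obsTime.
  def parseObservation(line: String): Observation = {
    val p = line.split(",", -1) // -1 keeps trailing empty fields
    Observation(p(0), p(1).toInt, p(2), p(3).toInt, p(5))
  }

  // Assumed fixed-width offsets (0-based, end-exclusive) taken from the readme.
  def parseStation(line: String): Station = {
    Station(
      line.substring(0, 11),
      line.substring(12, 20).trim.toDouble,
      line.substring(21, 30).trim.toDouble,
      line.substring(41, math.min(71, line.length)).trim
    )
  }

  // Temperature elements are assumed to be stored in tenths of a degree Celsius.
  def tenthsCToF(value: Int): Double = value / 10.0 * 9 / 5 + 32
}
```

With Spark, these functions can be passed directly to map on the RDD returned by sc.textFile. A non-empty quality flag marks a value that failed a quality check, so you may want to filter on qFlag before doing any statistics.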

All the code that you write to answer these questions should be put in a package called sparkrdd2 in the assignment repository. You should also make a file called sparkrdd2.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots. Make sure to mention in your write-up who you worked with during the in-class portion.

In-Class Questions:

1. What location has reported the largest temperature difference in one day (TMAX-TMIN) in 2019? What is the difference and when did it happen?

2. What location has reported the largest difference between min and max temperatures overall in 2019? What was the difference?

3. What is the standard deviation of the high temperatures for all US stations? What about low temperatures?

4. How many stations reported data in both 1987 and 2019?
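For question 3, one way to get a standard deviation in a single pass is to carry a count, a sum, and a sum of squares that can be merged across partitions. The sketch below shows that idea on a plain Scala collection. The names StdDevSketch and Acc are hypothetical, and it computes the population standard deviation (dividing by n), which is an assumption about what the question wants.

```scala
object StdDevSketch {
  // Running totals that can be combined, so the same shape fits RDD.aggregate.
  case class Acc(n: Long, sum: Double, sumSq: Double) {
    def add(x: Double): Acc = Acc(n + 1, sum + x, sumSq + x * x)
    def merge(o: Acc): Acc = Acc(n + o.n, sum + o.sum, sumSq + o.sumSq)
    def mean: Double = sum / n
    // Population standard deviation: sqrt(E[x^2] - E[x]^2).
    def popStdDev: Double = math.sqrt(sumSq / n - mean * mean)
  }

  def stdDev(xs: Iterable[Double]): Double =
    xs.foldLeft(Acc(0L, 0.0, 0.0))((acc, x) => acc.add(x)).popStdDev
}
```

On an RDD the same accumulator could be used as rdd.aggregate(Acc(0L, 0.0, 0.0))(_.add(_), _.merge(_)). Spark also provides stdev() and sampleStdev() on an RDD[Double], which are simpler and avoid the loss of precision that the sum-of-squares formula can suffer when the mean is large relative to the spread.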

Out-of-Class Questions:

5. Does temperature variability change with latitude in the US in 2019? Consider three groups of latitude: lat<35, 35<lat<42, and 42<lat. Answer this question in the following three ways.

a. Standard deviation of high temperatures.

b. Standard deviation of average daily temperatures (when you have both a high and a low for a given day at a given station).

c. Make histograms of the high temps for all stations in each of the regions so you can visually inspect the breadth of the distribution.

6. Plot the average high temperature for every station that has reported temperature data in 2019 with a scatter plot using longitude and latitude for x and y and the average daily temperature for color. Make 100F or higher solid red, 50F solid green, and 0F or lower solid blue. Use a SwiftVis2 ColorGradient for the scatter plot color.

7. How much has the average land temperature changed from 1897 to 2019? We will calculate this in a few ways.

a. Calculate the average of all temperature values of all stations for 1897 and compare to the same for 2019.

b. Calculate the average of all temperature values only for stations that reported temperature data for both 1897 and 2019.

c. Plot data using approach (a) for all years I give you data for from 1897 to 2019. (On the Pandora machines under /data/BigData/ghcn-daily you will find a file for every 10 years from 1897 to 2017 plus 2018 and 2019.)

d. Plot data using approach (b) for all years I give you data for from 1897 to 2019. (Use the files for every 10 years plus the most recent data.)

8. Describe the relative merits and flaws of approaches (a) and (b) for question 7. What would be a better approach to answering that question?

9. While answering any of the above questions, did you find anything that makes you believe there is a flaw in the data? If so, what was it, and how would you propose identifying and removing such flaws in a serious analysis?

10. What is the correlation coefficient between high temperatures and rainfall for San Antonio in 2019? Note that you can only use values from the same date and station for the correlation.
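For question 10, the two steps are pairing values that share a key (station and date) and then computing the Pearson correlation coefficient on the pairs. The sketch below shows both on plain Scala collections. CorrelationSketch, pairUp, and pearson are hypothetical names, and using Pearson's r is an assumption about which correlation coefficient is intended.

```scala
object CorrelationSketch {
  // Keep only keys present in both maps, e.g. K = (stationId, date).
  def pairUp[K](a: Map[K, Double], b: Map[K, Double]): Seq[(Double, Double)] =
    a.keySet.intersect(b.keySet).toSeq.map(k => (a(k), b(k)))

  // Pearson r = cov(x, y) / (sd(x) * sd(y)); the 1/n factors cancel.
  def pearson(pairs: Seq[(Double, Double)]): Double = {
    val n = pairs.size.toDouble
    val meanX = pairs.map(_._1).sum / n
    val meanY = pairs.map(_._2).sum / n
    val cov = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = pairs.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val varY = pairs.map { case (_, y) => (y - meanY) * (y - meanY) }.sum
    cov / math.sqrt(varX * varY)
  }
}
```

With RDDs, the pairing step corresponds to keying each element's observations by (station, date) and calling join on the two pair RDDs, which keeps only keys that appear in both. Note that the result is undefined (NaN) if either series is constant.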