Special RDD Problems

Groups:

Group 1 (xena01-03) - Yang, Mary; Reyes, Miguel; Burton, Craig

Group 2 (xena04-06) - Usiri, Calvin; Whitten, Marcus; Herbert, Emily

Group 3 (xena07-09) - Viltoft, Jorgen; Croxton, John; Burnett, Jesse

Group 4 (xena10-12) - Samoray, Nicholas; Ang, Sam; Witecki, Ian

Group 5 (xena13-15) - Skogman, Brett; Holloway, Taylor; Walker, Blair

Group 6 (xena16-18) - Newton, Michael; Andres, Robbie; Fordin, Sarah; Taylor, Zachary

Group 7 (xena19-21) - Bomer, Dan; Chang, Stephen; Koeller, Jordan

Data Set:

Data files for several years are in /users/mlewis/CSCI3395-F17/DataSets/ghcn-daily/. They were pulled from the "by_year" directory at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. That same directory also has a file called ghcnd-stations.txt that you will need to answer the questions for this week. You will need to read some of the support files on the web site to learn the format of the data files.

Note that there is a copy of this data in /data/BigData/ghcn-daily/, but it is only visible on the Pandora machines. If you ssh to those machines and run from there, you can use that copy. It will probably be faster to load.
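As a starting point for reading the files, here is a minimal parsing sketch in plain Scala. It assumes the by-year files are comma separated as ID, date (YYYYMMDD), element, value, followed by flag fields, and that ghcnd-stations.txt is fixed width with the ID in columns 1-11, latitude in 13-20, longitude in 22-30, elevation in 32-37, state in 39-40, and name in 42-71. Check these against the readme files on the site before relying on them. The names Observation, Station, and GhcnParse are made up for illustration.

```scala
// Field layouts here are assumptions taken from the GHCN-Daily readme files; verify them.
case class Observation(stationId: String, date: Int, element: String, value: Int)
case class Station(id: String, lat: Double, lon: Double, elev: Double, state: String, name: String)

object GhcnParse {
  // By-year CSV line (assumed): ID,YYYYMMDD,ELEMENT,VALUE,MFLAG,QFLAG,SFLAG,OBSTIME
  // Temperature values are stored as integers in tenths of degrees Celsius.
  def parseObservation(line: String): Observation = {
    val p = line.split(",", -1) // -1 keeps trailing empty flag fields
    Observation(p(0), p(1).toInt, p(2), p(3).toInt)
  }

  // Fixed-width station line; substring indices are 0-based and end-exclusive.
  def parseStation(line: String): Station = {
    val padded = line.padTo(71, ' ') // guard against short lines
    Station(
      padded.substring(0, 11),
      padded.substring(12, 20).trim.toDouble,
      padded.substring(21, 30).trim.toDouble,
      padded.substring(31, 37).trim.toDouble,
      padded.substring(38, 40).trim,
      padded.substring(41, 71).trim)
  }
}
```

In a Spark program, functions like these would typically be passed to map on the RDD returned by sc.textFile for each file.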

In Class Questions:

1. What location has reported the largest temperature difference in one day (TMAX-TMIN)? What is the difference and when did it happen?

2. What location has reported the largest difference between min and max temperatures overall in 2017? What was the difference?

3. What is the standard deviation of the high temperatures for all US stations? What about low temperatures?

4. How many stations reported data in both 1987 and 2017?

Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.
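The in-class questions come down to two patterns: pairing TMAX and TMIN records that share a station and date, and computing a standard deviation. The sketch below shows both on plain Scala collections so the logic is easy to check by hand. The tuple shape and the names DailyRange, largestDailyRange, and stdDev are hypothetical. It is not a Spark solution; on RDDs the same pairing is usually done by keying on (station, date) and using join.

```scala
object DailyRange {
  // Simplified, hypothetical record: (stationId, date as YYYYMMDD, element, value in tenths of deg C)
  type Obs = (String, Int, String, Int)

  // Largest TMAX - TMIN over all (station, date) keys that have both elements.
  // Returns (stationId, date, difference in tenths of deg C), or None if no key has both.
  def largestDailyRange(obs: Seq[Obs]): Option[(String, Int, Int)] = {
    val tmax = obs.collect { case (id, d, "TMAX", v) => (id, d) -> v }.toMap
    val tmin = obs.collect { case (id, d, "TMIN", v) => (id, d) -> v }.toMap
    val diffs = for {
      (key, hi) <- tmax.toSeq
      lo <- tmin.get(key).toSeq // skip days that lack a TMIN
    } yield (key._1, key._2, hi - lo)
    if (diffs.isEmpty) None else Some(diffs.maxBy(_._3))
  }

  // Population standard deviation (divides by n, not n - 1).
  def stdDev(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one value")
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  }
}
```

Whether a population or a sample standard deviation is wanted is not stated in the questions; with this many records the two differ very little.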

Between Class Questions:

All the code that you write to answer these questions should be put in a package called sparkrdd2 in the in-class repository. You should also make a file called sparkrdd2.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. Does temperature variability change with latitude in the US? Consider three groups of latitude: lat<35, 35<lat<42, and 42<lat. Answer this question in the following three ways.

a. Standard deviation of high temperatures.

b. Standard deviation of average daily temperatures (when you have both a high and a low for a given day at a given station).

c. Make histograms of the high temps for all stations in each of the regions so you can visually inspect the breadth of the distribution.

2. Plot the average high temperature for every station that has reported temperature data in 2017 with a scatter plot using longitude and latitude for x and y and that average high temperature for color. Make 100F or higher solid red, 50F solid green, and 0F or lower solid blue. If you are using ScalaFX plots, you can have bins of 20 degrees with different colors. If you are using SwiftVis2, you can use a ColorGradient for the scatter plot color.

3. How much has the average land temperature changed from 1897 to 2016? We will calculate this in a few ways.

a. Calculate the average of all temperature values of all stations for 1897 and compare to the same for 2016.

b. Calculate the average of all temperature values only for stations that reported temperature data for both 1897 and 2016.

c. Plot data using approach (a) for all years I give you data for from 1897 to 2016.

d. Plot data using approach (b) for all years I give you data for from 1897 to 2016.

4. Describe the relative merits and flaws of approaches (a) and (b) for question 3. What would be a better approach to answering that question?

5. While answering any of the above questions, did you find anything that makes you believe there is a flaw in the data? If so, what was it, and how would you propose identifying and removing such flaws in serious analysis?
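A few small helpers come up repeatedly in the questions above: converting stored values to Fahrenheit, assigning a station to one of the three latitude groups, and mapping a temperature to a color. The sketch below is one possible version in plain Scala. It assumes values are stored in tenths of degrees Celsius and that colors blend linearly in RGB between the three anchor points. The names PlotHelpers, tenthsCToF, latBand, and tempColor are hypothetical, and the returned color is a plain (red, green, blue) tuple with components in [0, 1], not a ScalaFX or SwiftVis2 type.

```scala
object PlotHelpers {
  // Convert a stored value (assumed tenths of degrees Celsius) to degrees Fahrenheit.
  def tenthsCToF(v: Int): Double = v / 10.0 * 9.0 / 5.0 + 32.0

  // Latitude groups from question 1: lat<35, 35<lat<42, 42<lat.
  // The inequalities are strict as written, so exactly 35 or 42 falls in no group.
  def latBand(lat: Double): Option[String] =
    if (lat < 35.0) Some("south")
    else if (lat > 35.0 && lat < 42.0) Some("middle")
    else if (lat > 42.0) Some("north")
    else None

  // Map degrees Fahrenheit to (red, green, blue):
  // 0F or lower is solid blue, 50F solid green, 100F or higher solid red.
  def tempColor(f: Double): (Double, Double, Double) = {
    val t = math.max(0.0, math.min(100.0, f)) // clamp to [0, 100]
    if (t <= 50.0) {
      val a = t / 50.0 // 0 at 0F, 1 at 50F
      (0.0, a, 1.0 - a)
    } else {
      val a = (t - 50.0) / 50.0 // 0 at 50F, 1 at 100F
      (a, 1.0 - a, 0.0)
    }
  }
}
```

If you would rather include the boundary latitudes in a group, change one of the strict comparisons and say so in your write-up.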