Dataset and UDF Problems

Data Set:

For these problems, you will be working with the BLS (Bureau of Labor Statistics) unemployment data. I have put the data on the Pandora cluster under /data/BigData/bls/la/. Note that this directory is only accessible from the Pandora machines. To help keep balance, I suggest that each of you log into the Pandora machine with the number of your group as assigned above. This will prevent any one machine from having too much load.

Note that the BLS likes to use tab-separated files, not comma-separated files. You can still read them with the Spark CSV reader; you just have to set the delimiter to be a tab. That can be done with an option like the following.

.option("delimiter", "\t")
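For context, here is a minimal sketch of how that option might fit into a complete read. It assumes the file has a header line and takes the file path as a command-line argument; the object name `ReadBlsTsv` and the helper `readTsv` are made up for illustration. BLS files often pad fields with extra spaces, so you may find you need to trim column names and values after loading.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadBlsTsv {
  // Reads one tab-separated file into a DataFrame using the CSV reader.
  def readTsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // assumption: the first line holds column names
      .option("delimiter", "\t")     // tab instead of the default comma
      .option("inferSchema", "true") // optional; an explicit schema avoids an extra pass
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadBlsTsv").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val data = readTsv(spark, args(0)) // path to one file under /data/BigData/bls/la/
    data.printSchema()
    data.show(5)
    spark.stop()
  }
}
```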

If you want local copies for home machines, the data was pulled from https://download.bls.gov/pub/time.series/la/. If you want to find more information on BLS statistics, including these files, go to https://www.bls.gov/data/.

To help with geographic plotting, I put a file called Geocodes_USA_with_Counties.csv one level up in /data/BigData/bls/ that contains latitude and longitude information. This data file came from https://data.healthcare.gov/dataset/Geocodes-USA-with-Counties/52wv-g36k. Note that this is not a BLS file, so you are going to have to do a little data wrangling in order to line up entries in this file with the BLS areas. Not all BLS areas will line up. That's fine. You just need to do a good enough job that you can produce latitude and longitude plots that look reasonable.
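One possible way to approach the wrangling is to reduce both sources to a common key and average the zip-code coordinates within each county. The sketch below is plain Scala with no Spark. The `ZipRow` shape, the key format, and the assumption that BLS county names look like "Bexar County, TX" are all illustrative guesses, so check them against the actual files before relying on them.

```scala
object CountyCentroids {
  // Hypothetical row shape; the real column names in the geocode file may differ.
  case class ZipRow(state: String, county: String, lat: Double, lon: Double)

  // Builds a join key such as "TX|bexar" from a state abbreviation and a county name.
  def key(state: String, county: String): String = {
    val cleaned = county.trim.toLowerCase
      .replaceAll("\\s+(county|parish|borough|census area|municipality)$", "")
      .replaceAll("[^a-z0-9 ]", "") // drop punctuation such as periods and apostrophes
      .replaceAll("\\s+", " ")
    state.trim.toUpperCase + "|" + cleaned
  }

  // Splits an area name of the assumed form "Bexar County, TX" into a key.
  def keyFromAreaText(areaText: String): Option[String] = {
    val i = areaText.lastIndexOf(',')
    if (i < 0) None
    else Some(key(areaText.substring(i + 1), areaText.substring(0, i)))
  }

  // Averages the zip-code coordinates in each county to get one point per county.
  def centroids(rows: Seq[ZipRow]): Map[String, (Double, Double)] =
    rows.groupBy(r => key(r.state, r.county)).map { case (k, rs) =>
      (k, (rs.map(_.lat).sum / rs.size, rs.map(_.lon).sum / rs.size))
    }
}
```

Areas that produce no key, or a key with no match on the other side, simply drop out of the join, which is consistent with not every BLS area lining up.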

You will also use results from the 2016 presidential election (https://github.com/tonmcg/County_Level_Election_Results_12-16). There is a file called 2016_US_County_Level_Presidential_Results.csv in the /data/BigData/bls/ directory. This has the results of the 2016 presidential election broken out by county. Ignore Alaska, as its results seem to be pretty messed up.

All the code that you write to answer these questions should be put in a package called sparksql2 in the assignment repository. You should also make a file called sparksql2.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

In Class Questions:

1. What fraction of counties had a Republican majority?

2. What fraction of counties went Republican by a margin of 10% or more? What about Democratic?
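As a rough sketch of the arithmetic behind questions 1 and 2, the plain Scala below computes fractions over a sequence of county records. The `CountyResult` fields are hypothetical names rather than the file's actual headers, the sample values are invented, and "margin of 10%" is interpreted here as a gap of at least 10 percentage points between the two parties' shares of all votes cast.

```scala
object CountyFractions {
  // Hypothetical record; field names are illustrative, not the file's actual headers.
  case class CountyResult(state: String, votesDem: Long, votesGop: Long, totalVotes: Long)

  // Fraction of counties for which the predicate holds.
  def fraction(counties: Seq[CountyResult])(p: CountyResult => Boolean): Double =
    if (counties.isEmpty) 0.0 else counties.count(p).toDouble / counties.size

  // Majority here means strictly more than half of all votes cast.
  def gopMajority(c: CountyResult): Boolean = c.votesGop * 2 > c.totalVotes

  // Margin in percentage points; positive values favor the Republican candidate.
  def gopMarginPoints(c: CountyResult): Double =
    100.0 * (c.votesGop - c.votesDem) / c.totalVotes

  def main(args: Array[String]): Unit = {
    // Invented sample values, only to show the calls.
    val sample = Seq(
      CountyResult("TX", 30, 70, 100),
      CountyResult("CA", 60, 35, 100),
      CountyResult("AK", 40, 50, 100))
    val usable = sample.filter(c => c.state != "AK" && c.totalVotes > 0) // skip Alaska
    println(fraction(usable)(gopMajority))
    println(fraction(usable)(c => gopMarginPoints(c) >= 10.0))
    println(fraction(usable)(c => gopMarginPoints(c) <= -10.0))
  }
}
```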

3. Plot the election results with an X-axis of the number of votes cast and a Y-axis of the percent Democratic votes minus the percent Republican votes.

4. Using both the election results and the zip code data, plot the election results by county geographically. So X is longitude, Y is latitude, and the color is based on percent Democratic, with 40% or lower being solid red and 60% or higher being solid blue.
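The color rule in question 4 only pins down the two ends of the scale. One reasonable reading is a linear blend between red and blue for values between 40% and 60%, which the sketch below implements. The packed 0xAARRGGBB integer format and the name `colorFor` are assumptions; adapt the output to whatever color type your plotting code expects.

```scala
object PartyColor {
  // Returns a packed 0xAARRGGBB color for a Democratic share given in percent (0 to 100).
  def colorFor(percentDem: Double): Int = {
    // t is 0.0 at 40% or lower, 1.0 at 60% or higher, and linear in between.
    val t = math.max(0.0, math.min(1.0, (percentDem - 40.0) / 20.0))
    val red = math.round(255 * (1.0 - t)).toInt
    val blue = math.round(255 * t).toInt
    (0xff << 24) | (red << 16) | blue // fully opaque, no green component
  }
}
```

If your data stores the share as a fraction between 0 and 1, multiply by 100 before calling it.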

Out of Class Questions:

5. In this question, I want to look at the impact of some recent recessions on the unemployment distributions for the US. In particular, I want you to look at the following recessions:

a. 7/1990 - 3/1991

b. 3/2001 - 11/2001

c. 12/2007 - 6/2009

For each one, I want you to make a number of histograms for two different months. One is the month before the recession started, and the other is the last month of the recession. You will make a grid of histograms of the unemployment rates for all states combined (so you can use the big combined file I made) with bins of (0.0 to 50.0 by 1.0) for the types of series listed below for those six months. Note that the Plot.histogramGrid method can help you to make the grid. How has the distribution of unemployment rates changed over time?

a. Metropolitan Areas

b. Micropolitan Areas

c. Counties and Equivalents

I suggest that you make a grid with three rows and six columns. The rows are the different types of areas while the columns are the six different months. You might want to use colors to differentiate the month before a recession from the last month of the recession. Perhaps green for the month before and red for the last month. So you would have six columns that alternate green and red histograms. What do you observe in these plots?
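To make the month selection and the binning concrete, here is a small sketch in plain Scala. It derives the six months from the recession dates listed above and counts rates into the 50 bins from 0.0 to 50.0. The `periodCode` helper assumes the data labels months as "M01" through "M12", which you should confirm against the files, and all names here are illustrative. It deliberately does not call Plot.histogramGrid; it only prepares the values you would hand to a plotting routine.

```scala
import java.time.YearMonth

object RecessionHistograms {
  // (first month, last month) of each recession listed above.
  val recessions: Seq[(YearMonth, YearMonth)] = Seq(
    (YearMonth.of(1990, 7), YearMonth.of(1991, 3)),
    (YearMonth.of(2001, 3), YearMonth.of(2001, 11)),
    (YearMonth.of(2007, 12), YearMonth.of(2009, 6))
  )

  // The six months to plot: the month before each start, then the last month.
  val months: Seq[YearMonth] =
    recessions.flatMap { case (start, end) => Seq(start.minusMonths(1), end) }

  // Assumed period code for a month, e.g. June -> "M06".
  def periodCode(m: YearMonth): String = f"M${m.getMonthValue}%02d"

  // Bin edges 0.0, 1.0, ..., 50.0, giving 50 bins of width 1.0.
  val binEdges: Seq[Double] = (0 to 50).map(_.toDouble)

  // Counts rates per bin; bin k covers [k, k + 1). Values outside [0, 50) are dropped.
  def binCounts(rates: Seq[Double]): Array[Int] = {
    val counts = new Array[Int](50)
    for (r <- rates if r >= 0.0 && r < 50.0) counts(r.toInt) += 1
    counts
  }
}
```

You would call `binCounts` once per cell of the three-by-six grid, using the rates for one area type in one of the six months.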

6. I am interested in correlations between employment status and voting tendencies. Let's look at this in a few different ways.

a. For all counties, calculate the correlation coefficient between the unemployment rate and percent Democratic vote for November 2016. (Note that this will be a single number.) What does that number imply?

b. Make a scatter plot that you feel effectively shows the three values of population, party vote, and the unemployment rate (again in November 2016). For the population, you can use the labor force or the number of votes cast. Your scatter plot should have one point per county. In addition to X and Y, you can use size and color. What does your plot show about these three values?
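For part a, it may help to see what the correlation coefficient actually computes. The sketch below is a plain Scala version of the Pearson coefficient over two equally sized sequences; the name `pearson` is illustrative. In Spark itself you would more likely use the built-in `corr` aggregate in `org.apache.spark.sql.functions` on the joined DataFrame, so treat this as a way to check your understanding rather than a required implementation.

```scala
object Correlation {
  // Pearson correlation coefficient of paired samples; NaN if either side has zero variance.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "need equally sized, non-empty samples")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations, and sums of squared deviations.
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

The result lies between -1 and 1, with values near 0 indicating little linear relationship between the two columns.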

7. Look at the relationships between voter turnout (approximated by votes/labor force), unemployment, and political leaning. Are there any significant correlations between these values? If so, what are they? You can use plots or statistical measures to draw and illustrate your conclusions.
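The turnout approximation in question 7 amounts to a join followed by a division. A minimal sketch in plain Scala, assuming both inputs have already been reduced to maps keyed by the same county identifier (the key format and the name `turnout` are hypothetical):

```scala
object Turnout {
  // Joins votes and labor force on a shared county key and returns votes / labor force.
  // Counties missing from either side, or with a zero labor force, are left out.
  def turnout(votes: Map[String, Long], laborForce: Map[String, Long]): Map[String, Double] =
    votes.toSeq.flatMap { case (county, v) =>
      laborForce.get(county).filter(_ > 0).map(lf => (county, v.toDouble / lf))
    }.toMap
}
```

Because the labor force is not the same as the voting-eligible population, this ratio is only a proxy and can exceed 1.0 for some counties.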