SparkSQL and DataFrame Problems

Data Set:

For these problems, you will be working with the BLS (Bureau of Labor Statistics) unemployment data. I have put the data on the Pandora cluster under /data/BigData/bls/la/. Note that this directory is only accessible from the Pandora machines. To help keep balance, I suggest that each of you log into a random Pandora machine (00-08). This will prevent any single machine from having too much load.

Note that the BLS likes to use tab separated files, not comma-separated files. However, you can read it using the Spark CSV reader, you just have to set the delimiter to be a tab. That can be done with an option like the following.

.option("delimiter", "\t")

If you want local copies for home machines, the data was pulled from https://download.bls.gov/pub/time.series/la/. If you want to find more information on BLS statistics, including these files, go to https://www.bls.gov/data/.

To help with geographic plotting, I put a file called Geocodes_USA_with_Counties.csv one level up in /data/BigData/bls/ that contains latitude and longitude information. This data file came from https://data.healthcare.gov/dataset/Geocodes-USA-with-Counties/52wv-g36k. Note that this is not a BLS file, so you are going to have to do a little data wrangling in order to line up entries in this file with the BLS areas. Not all BLS areas will line up. That's fine. You just need to do a good enough job that you can produce latitude and longitude plots that look reasonable. (There is an older file called zip_codes_states.csv with roughly the same data, but the data moved so it was harder for people not using the Pandora machines to access.)

In-class Questions:

All the code that you write to answer these questions should be put in a package called sparksql in the assignment repository. You should also make a file called sparksql.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. How many data series does the state of New Mexico have?

2. What is the highest unemployment level (not rate) reported for a county or equivalent in the time series for the state of New Mexico?

3. How many cities/towns with more than 25,000 people does the BLS track in New Mexico?

Out of Class Questions:

4. What was the average unemployment rate for New Mexico in 2017? Calculate this in three ways:

a. Averages of the months for the BLS series for the whole state.

b. Simple average of all unemployment rates for counties in the state.

5. This is a continuation of the last question. Calculate the unemployment rate in a third way and discuss the differences.

c. Weighted average of all unemployment rates for counties in the state where the weight is the labor force in that month.

d. How do your two averages compare to the BLS average? Which is more accurate and why?

6. What is the highest unemployment rate for a series with a labor force of at least 50,000 people in the state of Texas? When and where?

The following questions involve all the states.

7. What is the highest unemployment rate for a series with a labor force of at least 50,000 people in the full dataset? When and where?

8. Which state has most distinct data series? How many series does it have?

9. We will finish up by looking at unemployment geographically and over time. I want you to make a grid of scatter plots for the years 2000, 2005, 2010, and 2015. Each point is plotted at the X, Y for latitude and longitude and it should be colored by the unemployment rate. If you are using SwiftVis2, the Plot.scatterPlotGrid method is particularly suited for this. Only plot data from the continental US, so don't include data from Alaska, Hawaii, or Puerto Rico.