SparkSQL and DataFrame Problems

Groups:

Group 1 (xena01-03) - Skogman, Brett; Koeller, Jordan; Bomer, Dan

Group 2 (xena04-06) - Croxton, John; Herbert, Emily; Burnett, Jesse

Group 3 (xena07-09) - Samoray, Nicholas; Witecki, Ian; Yang, Mary

Group 4 (xena10-12) - Newton, Michael; Chang, Stephen; Fordin, Sarah

Group 5 (xena13-15) - Walker, Blair; Holloway, Taylor; Viltoft, Jorgen

Group 6 (xena16-18) - Burton, Craig; Andres, Robbie; Whitten, Marcus; Taylor, Zachary

Group 7 (xena19-21) - Reyes, Miguel; Usiri, Calvin; Ang, Sam

Data Set:

For these problems you will be working with the BLS (Bureau of Labor Statistics) unemployment data. I have put the data on the Pandora cluster under /data/BigData/bls/la/. Note that this directory is only accessible from the Pandora machines. To help keep balance, I suggest that each of you log into the Pandora machine with the number of your group as assigned above. This will prevent any one machine from having too much load.

Note that the BLS likes to use tab-separated files, not comma-separated files. However, you can still read them using the Spark CSV reader; you just have to set the delimiter to be a tab. That can be done with an option like the following.

.option("delimiter", "\t")
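As a minimal sketch of where that option fits, the helper below reads one tab-separated file into a DataFrame. The object name BlsReader, the method name readTsv, and the "header" and "inferSchema" settings are assumptions made for illustration; check them against the actual files, since values may also carry padding whitespace that you need to trim.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of the assignment: the names and the
// header/inferSchema settings are assumptions for illustration.
object BlsReader {
  // Reads one tab-separated file into a DataFrame using the CSV reader.
  def readTsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // assumes the first line holds column names
      .option("inferSchema", "true") // let Spark guess numeric column types
      .option("delimiter", "\t")     // BLS files are tab separated
      .csv(path)
}
```

You would call it with your SparkSession and the path of one of the files under /data/BigData/bls/la/, then use printSchema() and show() to confirm the columns were split the way you expect.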

If you want local copies for home machines, the data was pulled from https://download.bls.gov/pub/time.series/la/. If you want to find more information on BLS statistics, including these files, go to https://www.bls.gov/data/.

To help with geographic plotting, I put a file called zip_codes_states.csv one level up in /data/BigData/bls/ that contains latitude and longitude information. This data file came from https://www.gaslampmedia.com/download-zip-code-latitude-longitude-city-state-county-csv/. Note that this is not a BLS file, so you are going to have to do a little data wrangling in order to line up entries in this file with the BLS areas. Not all BLS areas will line up. That's fine. You just need to do a good enough job that you can produce latitude and longitude plots that look reasonable.

In Class Questions:

1. How many data series does the state of New Mexico have?

2. What is the highest unemployment level (not rate) reported for a county or equivalent in the time series for the state of New Mexico?

3. How many cities/towns with more than 25,000 people does the BLS track in New Mexico?

4. What was the average unemployment rate for New Mexico in 2017? Calculate this in the following two ways (a third way is part of the between-class questions):

a. Averages of the months for the BLS series for the whole state.

b. Simple average of all unemployment rates for counties in the state.

Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.

Between Class Questions:

All the code that you write to answer these questions should be put in a package called sparksql in the in-class repository. You should also make a file called sparksql.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. This is a continuation of the last in-class question. Calculate the unemployment rate in a third way and discuss the differences.

c. Weighted average of all unemployment rates for counties in the state where the weight is the labor force in that month.

d. How do your two averages compare to the BLS average? Which is more accurate and why?
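To make the difference between the simple average and the weighted average in part c concrete, here is a small sketch in plain Scala with no Spark involved. The Obs case class, the function names, and the sample numbers in the usage note are all made up for illustration; in your solution the same arithmetic would be expressed with DataFrame aggregations over the real data.

```scala
// Hypothetical illustration only: names and data are invented.
object Averages {
  // One monthly observation for a county:
  // unemployment rate (percent) and labor force (people).
  final case class Obs(rate: Double, laborForce: Double)

  // Every observation counts equally, regardless of county size.
  def simpleAverage(obs: Seq[Obs]): Double =
    obs.map(_.rate).sum / obs.size

  // Each rate is weighted by the labor force for that observation:
  // sum(rate * laborForce) / sum(laborForce).
  def weightedAverage(obs: Seq[Obs]): Double =
    obs.map(o => o.rate * o.laborForce).sum / obs.map(_.laborForce).sum
}
```

For example, with Obs(4.0, 100000.0) and Obs(10.0, 1000.0) the simple average is 7.0, while the weighted average is about 4.06 because the large county dominates.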

2. What is the highest unemployment rate for a series with a labor force of at least 10,000 people in the state of Texas? When and where?

The following questions involve all the states.

3. What is the highest unemployment rate for a series with a labor force of at least 10,000 people in the full data set? When and where?

4. Which state has the most distinct data series? How many series does it have?

5. We will finish up by looking at unemployment geographically and over time. I want you to make a grid of scatter plots for the years 2000, 2005, 2010, and 2015. Each point should be plotted with longitude as X and latitude as Y, and it should be colored by the unemployment rate. If you are using SwiftVis2, the Plot.scatterPlotGrid method is particularly suited for this. Only plot data from the contiguous US, so don't include data from Alaska, Hawaii, or Puerto Rico.
