Clustering Problems

Data Set:

This week you are going to use several datasets that you have used previously, along with one new dataset. The old datasets are the BLS unemployment data (in /data/BigData/bls/la/), the zip code locations data (in /data/BigData/bls/), and the 2016 elections data (in /data/BigData/bls/). The new data is a different BLS dataset, the Quarterly Census of Employment and Wages (https://www.bls.gov/cew/datatoc.htm). You can find the data in /data/BigData/bls/qcew/. A description of the fields in the main file is at https://data.bls.gov/cew/doc/layouts/csv_quarterly_layout.htm. As the name implies, this dataset looks at employment and wage information in different sectors of the economy.
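Since the main file is a CSV with a header row, it is easiest to refer to fields by name rather than by position. The following is a minimal sketch in plain Python (not Spark) of counting records per value of one column. The column names are assumed from the layout page linked above and should be checked against it; the rows and code values are made-up placeholders, not real data.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; the real file has many more columns.
# Column names are ASSUMED from the layout page and should be verified there.
# The values (X1, X2, 00001, ...) are placeholders, not real codes.
sample = io.StringIO(
    "area_fips,industry_code,agglvl_code,year,qtr,total_qtrly_wages\n"
    "00001,10,X1,2016,1,1000\n"
    "00001,11,X2,2016,1,250\n"
    "00002,10,X1,2016,1,4000\n"
)

def count_by_column(lines, column):
    """Count rows per distinct value of one named column."""
    reader = csv.DictReader(lines)  # uses the header row for field names
    return Counter(row[column] for row in reader)

counts = count_by_column(sample, "agglvl_code")
print(counts.most_common())
```

For a real file you would pass an open file handle instead of the in-memory sample. In Spark the same idea is a group-by on the column followed by a count.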

Questions:

All the code that you write to answer these questions should be put in a package called sparkml in the assignment repository. You should also make a file called sparkcluster.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. Which aggregation level codes are for county-level data? How many entries are in the main data file for each of the county-level codes?

2. How many entries does the main file have for Bexar County?

3. What are the three most common industry codes by the number of records? How many records for each?

4. What three industries have the largest total wages for 2016? What are those total wages? (Consider only NAICS 6-digit County values.)

5. Explore clustering options on the BLS data to try to find clusters that align with voting tendencies. You will do this for two clusters and for more clusters. (I won't tell you exactly how many, but something like 3, where you have clusters for strongly Democratic, strongly Republican, and neutral, would be an option.) You need to tell me what values you used for dimensions, and how good a job the clustering does of reproducing voting tendencies. You can do that by giving the fraction of counties that were in the appropriate cluster.

6. Make scatter plots of the voting results of the 2016 election (I suggest ratio of votes for each party) and each of your groupings. Use color schemes that make it clear how good a job the clustering did.
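One thing to watch for in question 5 is that cluster ids are arbitrary, so cluster 0 does not automatically mean any particular party. The following is a minimal sketch, in plain Python rather than Spark, of one possible way to compute the fraction of counties in the appropriate cluster: treat each cluster's most common voting label as its "appropriate" label and count the counties that match. The function name, the label strings, and the toy data are all hypothetical, and this majority-label rule is an assumption, not the required method.

```python
from collections import Counter, defaultdict

def cluster_agreement(cluster_ids, labels):
    """Fraction of items whose label equals the majority label of their cluster."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one county")
    # Tally the voting labels seen inside each cluster.
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid][label] += 1
    # A county "matches" if it carries its cluster's most common label.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(labels)

# Hypothetical toy data: six counties, their cluster ids, and the party
# that won each county. These are made up for illustration.
clusters = [0, 0, 0, 1, 1, 1]
winners = ["R", "R", "D", "D", "D", "R"]
print(cluster_agreement(clusters, winners))
```

Note that with this rule two clusters can end up with the same majority label, which inflates the score when one party wins most counties. If that happens, it is worth reporting the per-cluster label counts as well.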

!!! Work with temperature data. Combine clustering with linear regression to make a map of linearized temperature change over the timescale of the dataset.