Clustering Problems

Data Set:

This week you are going to use a number of datasets that you have used previously, in conjunction with one new dataset. The old datasets are the BLS unemployment data (in /data/BigData/bls/la/), the zip code locations data (in /data/BigData/bls/), and the 2016 elections data (in /data/BigData/bls/). The new data is the coronavirus dataset from the New York Times (https://github.com/nytimes/covid-19-data). You can find the data in /data/BigData/coronavirus/.

In-Class Questions:

All the code that you write to answer these questions should be put in a package called sparkml in the assignment repository. You should also make a file called sparkcluster.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. What county has had the most total cases to date? What about deaths?

2. How many cases and deaths have been recorded for Bexar County?

3. Which county has the most deaths per case?

4. Make a scatter plot showing the number of cases in Bexar County over time. Note that you can get the day of the year with code like java.time.LocalDate.parse("2020-05-21").getDayOfYear. This works well as the x-axis value for the plot.
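
As a minimal sketch of the day-of-year conversion mentioned in question 4, the snippet below turns ISO date strings into x-axis values. The helper name dayOfYear and the sample dates are illustrative assumptions, not part of the assignment. Keep in mind that getDayOfYear restarts at 1 every January 1, so it only gives an increasing axis within a single year; LocalDate.toEpochDay is one alternative if your data spans more than one year.

```scala
import java.time.LocalDate

object DayOfYearSketch {
  // Hypothetical helper: converts an ISO-8601 date string (yyyy-MM-dd)
  // to its day of the year (1 to 366).
  def dayOfYear(isoDate: String): Int = LocalDate.parse(isoDate).getDayOfYear

  def main(args: Array[String]): Unit = {
    // Made-up sample dates, used only to show the conversion.
    val dates = Seq("2020-01-21", "2020-03-01", "2020-05-21")
    dates.foreach(d => println(s"$d -> ${dayOfYear(d)}"))
  }
}
```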

Out-of-Class Questions:

5. Explore clustering options on the coronavirus and BLS data to try to find clusters that align with voting tendencies. You will do this for two clusters and for more clusters. (I won't tell you exactly how many, but something like 3, where you have clusters for strongly Democratic, strongly Republican, and neutral, would be an option.) You need to tell me what values you used for dimensions, and how good a job the clustering does of reproducing voting tendencies. You can do that by giving the fraction of counties that were in the appropriate cluster. Note that breaking out infection rates by time might be useful here.

6. Make scatter plots of the voting results of the 2016 election (I suggest ratio of votes for each party) and each of your groupings. Use color schemes that make it clear how good a job the clustering did.

7. Make an animation of plots showing coronavirus and unemployment geographically over time. You can do this by outputting multiple files with images of the plots and putting them together in an animation. I would recommend using SwingRenderer.saveToImage(plot: Plot, fileName: String, format: String = "PNG", width: Int = 800, height: Int = 800) to save the plot to an image. I like to use mencoder to put together multiple images into a movie. It is installed on the Pandora machines, so you can use it there. For example, "mencoder "mf://movie.*.png" -mf fps=10 -o movie.avi -ovc xvid -xvidencopts fixed_quant=10" will make a movie file called movie.avi from a bunch of files called movie.*.png that shows the frames at 10 frames per second. Markdown doesn't display AVI files nicely (though I believe if you upload it to YouTube you can embed it on GitHub), so you could consider making a GIF instead with the ImageMagick convert command. Something like "convert -delay 20 movie.*.png -loop 0 movie.gif" could do it. Note that in both cases, if you number your frames, you will want the numbers zero-padded, because the files are sorted ASCII-betically, so movie.10.png would otherwise come before movie.9.png. You can do this with the Scala f string interpolator: f"movie.$index%05d.png" will make a file name with the index padded to 5 digits with 0s.
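
The frame-naming point in question 7 can be sketched as follows. The helper name frameName and the two sample indices are assumptions chosen for illustration. The snippet compares how unpadded and zero-padded names sort as plain strings, which is the order a shell glob like movie.*.png will typically hand to mencoder or convert.

```scala
object FrameNameSketch {
  // Hypothetical helper: builds a frame file name with the index
  // zero-padded to 5 digits using the f string interpolator.
  def frameName(index: Int): String = f"movie.$index%05d.png"

  def main(args: Array[String]): Unit = {
    // Unpadded names: plain string ordering compares "1" with "9",
    // so frame 10 sorts ahead of frame 9.
    val unpadded = Seq(9, 10).map(i => s"movie.$i.png").sorted
    // Zero-padded names: string ordering matches numeric ordering.
    val padded = Seq(9, 10).map(frameName).sorted
    println(unpadded.mkString(", "))
    println(padded.mkString(", "))
  }
}
```

Five digits leaves room for up to 99,999 frames before the ordering breaks again; a narrower width would also work if you know you will produce fewer frames.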