Data Set Reports

One of the requirements for this course is that you find three data sets that you think would be interesting to analyze. Early in the semester, you will turn in an initial description and analysis, written as the README.md in your Spark assignments GitHub repository. (Note that you will have to learn a little Markdown in order to write this.) This first write-up will include the following for each data set:

    • A link to the data set. Note that you might need more than one source of data to answer the questions that you want to ask.

    • A brief description of the data set in your own words.

    • Several questions that you would like to answer using the data set.

    • An explanation of why you find this data set and the questions you listed interesting.

All data sets need to be at least 1 MB in total size, and at least one of them needs to be at least 100 MB. Although you won't be printing it, the write-up should run about 3-6 pages if it were printed, so each data set will get 1-2 pages of writing.
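If you want to double-check the size requirements before committing to a data set, a small script can do the arithmetic. The following is only a sketch in Python: the helper names `total_size_bytes` and `meets_size_requirements` are made up for illustration, and it assumes 1 MB means 10^6 bytes, which the assignment does not specify.

```python
from pathlib import Path

MB = 1_000_000  # assumption: 1 MB = 10^6 bytes (the assignment does not say)


def total_size_bytes(paths):
    """Sum the sizes of the given files; directories are walked recursively."""
    total = 0
    for p in map(Path, paths):
        if p.is_dir():
            total += sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
        elif p.is_file():
            total += p.stat().st_size
    return total


def meets_size_requirements(sizes_bytes):
    """sizes_bytes holds one total size per data set, in bytes."""
    every_set_big_enough = all(s >= 1 * MB for s in sizes_bytes)
    one_set_is_large = any(s >= 100 * MB for s in sizes_bytes)
    return every_set_big_enough and one_set_is_large
```

To use it, you would call `total_size_bytes` once per data set with the files or folders you downloaded, then pass the three totals to `meets_size_requirements`.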

At the end of the semester, you will add one paragraph to the write-up for each data set in which you evaluate your original write-up. In particular, I want you to tell me whether the data set is really capable of answering the questions that you put forward, along with a brief description of either how you would do it or why it wouldn't work.