MLlib and Regression Problems

Groups:

Group 1 (xena01-03) - Herbert, Emily; Skogman, Brett; Andres, Robbie

Group 2 (xena04-06) - Chang, Stephen; Burton, Craig; Walker, Blair

Group 3 (xena07-09) - Samoray, Nicholas; Koeller, Jordan; Yang, Mary

Group 4 (xena10-12) - Bomer, Dan; Holloway, Taylor; Whitten, Marcus

Group 5 (xena13-15) - Witecki, Ian; Usiri, Calvin; Ang, Sam

Group 6 (xena16-18) - Burnett, Jesse; Newton, Michael; Reyes, Miguel

Group 7 (xena19-21) - Fordin, Sarah; Croxton, John; Viltoft, Jorgen

Data Set:

For these problems you will be working with the CDC's Behavioral Risk Factor Surveillance System (BRFSS) dataset from 2016, available at https://www.cdc.gov/brfss/. I have put the files you need in the /data/BigData/brfss/ directory. I also copied the column layout into a file called Columns.txt in that directory. That file was simply copied and pasted from the Column Layout link below, but you need it to parse the main data file, which is not in a standard format.
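As a rough illustration of how a column layout can drive the parsing, here is a minimal sketch in plain Python (not Spark; the slicing logic carries over to whatever language your code uses). It assumes each layout row has the form "starting column, variable name, field length" with a 1-based starting column, and that the data file is fixed-width text. Both are assumptions to check against Columns.txt and the real file. The layout rows and the data line in the sketch are made up.

```python
from typing import Dict, List, NamedTuple, Optional


class Column(NamedTuple):
    name: str
    start: int   # 1-based starting column (assumed convention)
    length: int  # number of characters in the field


def parse_layout(lines: List[str]) -> List[Column]:
    """Turn 'start name length' rows into Column records, skipping other rows."""
    columns = []
    for line in lines:
        parts = line.split()
        # Keep only rows that look like: <int> <name> <int>
        if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
            columns.append(Column(parts[1], int(parts[0]), int(parts[2])))
    return columns


def parse_record(line: str, columns: List[Column]) -> Dict[str, Optional[float]]:
    """Slice one fixed-width line; blank or unparseable fields become None."""
    record: Dict[str, Optional[float]] = {}
    for col in columns:
        raw = line[col.start - 1 : col.start - 1 + col.length].strip()
        try:
            record[col.name] = float(raw)
        except ValueError:
            record[col.name] = None
    return record


# Made-up layout rows and data line, for illustration only.
layout = parse_layout([
    "Starting Column Variable Name Field Length",  # header row, skipped
    "1 STATE 2",
    "3 GENHLTH 1",
    "4 PHYSHLTH 2",
])
row = parse_record("01388", layout)
blank_row = parse_record("01 88", layout)
```

Representing blank fields as None rather than 0 keeps missing answers from being mistaken for real responses later on.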

2016 Data - https://www.cdc.gov/brfss/annual_data/annual_2016.html

Column Layout - https://www.cdc.gov/brfss/annual_data/2016/LLCP_VarLayout_16_OneColumn.html

Full description of survey data fields - https://www.cdc.gov/brfss/annual_data/2016/pdf/codebook16_llcp.pdf

In Class Questions:

1. This data set is challenging to read in, so all I am asking is that you tell me the meaning of each of the following fields and give me the statistics that "describe" reports for each.

a. GENHLTH

b. PHYSHLTH

c. MENTHLTH

d. POORHLTH

e. EXERANY2

f. SLEPTIM1
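For reference, Spark's "describe" reports a count, mean, standard deviation, minimum, and maximum for a numeric column. The plain-Python sketch below computes the same five numbers by hand so you can sanity-check your output on a small sample. It assumes missing values have already been turned into None and that at least one value is present; the input list is made up. Keep in mind that any special codes the codebook defines for a field are treated as ordinary numbers unless you filter them out first.

```python
import math
from typing import Dict, List, Optional


def describe(values: List[Optional[float]]) -> Dict[str, float]:
    """Count, mean, sample standard deviation, min, and max of non-missing values."""
    present = [v for v in values if v is not None]
    n = len(present)  # assumed to be at least 1
    mean = sum(present) / n
    if n > 1:
        # Sample standard deviation: divide by n - 1.
        stddev = math.sqrt(sum((v - mean) ** 2 for v in present) / (n - 1))
    else:
        stddev = float("nan")  # undefined for a single value
    return {
        "count": n,
        "mean": mean,
        "stddev": stddev,
        "min": min(present),
        "max": max(present),
    }


# Made-up values, for illustration only.
stats = describe([1.0, 2.0, None, 3.0, 4.0])
```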

Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.

Between Class Questions:

All the code that you write to answer these questions should be put in a package called sparkml in the in-class repository. You should also make a file called sparkml-regression.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

1. The fields GENHLTH, PHYSHLTH, MENTHLTH, and POORHLTH each capture a respondent's self-assessment of their health in a different way. Using regression analysis, try to predict each of these values from other values in the data set. Which other survey questions are most significant in predicting each of the four?
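One crude way to get a first feel for which fields matter is to fit a one-variable least-squares line for each candidate predictor and compare the R-squared values. The sketch below does that in plain Python. It is a stand-in for illustration, not the Spark MLlib API, and it is not a substitute for a multi-variable fit or a proper significance test. The predictor names and numbers are made up, and the code assumes neither the predictor nor the target is constant.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Assumes xs and ys have the same
    length and that neither is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable fit, R-squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up numbers, for illustration only: one target, two candidate predictors.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates: Dict[str, List[float]] = {
    "predictor_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly linear in the target
    "predictor_b": [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related
}

# Rank candidates from best to worst single-variable fit.
ranking = sorted(
    ((name, fit_line(xs, target)[2]) for name, xs in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

A ranking like this only says how well each field predicts the target on its own. Fields that look weak alone can still matter in combination, which is why the full regression is still needed.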

!!! For next year, expand on this. Consider making it competitive. Best fit wins.