MLlib and Regression Problems

Data Set:

For these problems, you will be working with the CDC's Behavioral Risk Factor Surveillance System (BRFSS) dataset from 2016, available at https://www.cdc.gov/brfss/. I have put the files you need in the /data/BigData/brfss/ directory. I also copied the column layout into a file called Columns.txt in that directory. That file was simply copied and pasted from the Column Layout link below, but you need it to parse the main data file, which is not in a standard format.

2016 Data - https://www.cdc.gov/brfss/annual_data/annual_2016.html

Column Layout - https://www.cdc.gov/brfss/annual_data/2016/LLCP_VarLayout_16_OneColumn.html

Full description of survey data fields - https://www.cdc.gov/brfss/annual_data/2016/pdf/codebook16_llcp.pdf
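
Because the main data file has to be parsed using the column layout, here is a minimal sketch of that slicing logic in plain Python. It is only an illustration of the idea, not the required approach for the assignment. It assumes each layout row has the form "starting column, variable name, field length" separated by whitespace, and that starting columns are 1-based; check Columns.txt to confirm both. The names Field, parse_layout, and parse_record are hypothetical, and the sample layout and record are made up rather than real BRFSS positions.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    name: str
    start: int   # assumed 1-based starting column
    length: int  # field width in characters


def parse_layout(lines: List[str]) -> List[Field]:
    """Read layout rows of the assumed form 'start name length'."""
    fields = []
    for line in lines:
        parts = line.split()
        # Skip headers, blank lines, and any row that is not
        # exactly three tokens with numeric start and length.
        if len(parts) != 3 or not (parts[0].isdigit() and parts[2].isdigit()):
            continue
        fields.append(Field(parts[1], int(parts[0]), int(parts[2])))
    return fields


def parse_record(record: str, fields: List[Field]) -> Dict[str, str]:
    """Slice one fixed-width record into a name -> raw string mapping."""
    row = {}
    for f in fields:
        begin = f.start - 1  # convert 1-based column to 0-based index
        row[f.name] = record[begin:begin + f.length].strip()  # blanks become ""
    return row


# Made-up layout and record, for illustration only.
layout_lines = [
    "Starting Column Variable Name Field Length",
    "1 AAA 2",
    "3 BBB 1",
    "4 CCC 2",
]
fields = parse_layout(layout_lines)
row = parse_record("12345", fields)
print(row)
```

Values are kept as strings here because blank fields are common in fixed-width survey data; converting to numbers, and deciding how to treat blanks, is left as a separate step.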

All the code that you write to answer these questions should be put in a package called sparkml in the assignment repository. You should also make a file called sparkml-regression.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

Questions:

1. This dataset is challenging to read in, so all I'm going to ask is that you tell me the meaning of each of the following fields and give me the statistics you get from "describe" for each.

a. GENHLTH

b. PHYSHLTH

c. MENTHLTH

d. POORHLTH

e. EXERANY2

f. SLEPTIM1

2. The fields GENHLTH, PHYSHLTH, MENTHLTH, and POORHLTH give the respondents' self-assessments of their health in different ways. Using regression analysis, I want you to try to predict the values of these fields using other values in the dataset. Which other survey questions are most significant in predicting each of these four values? Do these results make sense?

3. How do the results change if, when predicting each of these four values, you are also allowed to use the other three as predictors?

!!! Next year find some better data for regression.