Ok, y'all. This data set is pretty straightforward. You have two days of class time to work on it, as we have no class on Friday. In addition, you can work on it Friday on your own time.
This data set is due to me Friday, March 15th by midnight.
Every person in your group should be able to explain the math, the code AND the reasoning behind each question.
I may give you a quiz Monday on one of these questions or one very similar, so be able to answer the pieces.
Question One.
Attached is a data set called NAEP. There are four columns of data:
State--the state abbreviation.
NAEP_score--the mean score on the NAEP for fourth graders, out of 500.
percent_proficient--the percent of students deemed capable of solving real world problems.
free_lunch--percent of students eligible for free lunch at school.
1) Investigate the correlation between NAEP_score and percent_proficient.
a) Find a line of best fit, correlation, and any other numbers that are beneficial to us for this problem.
b) Make a graph, label it properly, and include your line of best fit.
c) Write a few sentences describing the correlation between the two variables and decide if it makes sense.
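As a rough sketch of the workflow for parts a) and b), something like the following works in R. It is one way to do it, not the only one. The data frame `toy` and its numbers are made up for illustration and are NOT the NAEP values; putting NAEP_score on the x axis is also just an assumption you would need to justify.

```r
# Made-up placeholder data, NOT the real NAEP values
toy <- data.frame(
  NAEP_score         = c(228, 233, 237, 241, 245, 249),
  percent_proficient = c(25, 31, 34, 40, 43, 50)
)

# Least squares line, written as response ~ explanatory
fit <- lm(percent_proficient ~ NAEP_score, data = toy)
coef(fit)    # intercept and slope of the line of best fit

# Correlation, and its square (fraction of variation explained by the line)
r <- cor(toy$NAEP_score, toy$percent_proficient)
r
r^2

# Labeled scatterplot with the fitted line drawn on top
plot(toy$NAEP_score, toy$percent_proficient,
     xlab = "Mean NAEP score (out of 500)",
     ylab = "Percent proficient",
     main = "Toy data: percent proficient vs. NAEP score")
abline(fit)
```

`summary(fit)` prints the same slope and intercept along with other numbers that may be worth reporting.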
2) Investigate the correlation between free_lunch and percent_proficient
a) Find a line of best fit, correlation, and any other numbers that are beneficial to us for this problem.
b) Make a graph, label it properly, and include your line of best fit.
c) Three states at the far right of the graph, all with free lunch percentages over 60%, appear to be possible outliers. Make a graph that shows those points as red dots. On the same graph, color any other possible influential points light blue.
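One possible way to color points by a condition is to build a vector of colors, one per row, and hand it to `plot`. The sketch below uses made-up numbers (NOT the real data), and the choice of which "other" points to color light blue (`other.rows`) is purely hypothetical; deciding which points are influential is your job.

```r
# Made-up placeholder data, NOT the real values
toy <- data.frame(
  free_lunch         = c(30, 35, 42, 48, 55, 62, 66, 70),
  percent_proficient = c(48, 45, 40, 38, 33, 30, 22, 20)
)

# One color per row: red where free_lunch is over 60, black otherwise
pt.col <- ifelse(toy$free_lunch > 60, "red", "black")

# Hypothetical pick of another possibly influential point (row 1 here)
other.rows <- c(1)
pt.col[other.rows] <- "lightblue"

# pch = 19 draws solid dots so the colors are easy to see
plot(toy$free_lunch, toy$percent_proficient,
     col = pt.col, pch = 19,
     xlab = "Percent eligible for free lunch",
     ylab = "Percent proficient",
     main = "Toy data: percent proficient vs. free lunch")
abline(lm(percent_proficient ~ free_lunch, data = toy))
```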
3) How many states have a NAEP_score over 240? DO NOT COUNT. Use R code to find the answer.
* As a hint, once you have created a new list with all the values over 240 (say you call it something creative like over.240), you can use this function:
length(over.240$NAEP_score)
to get an answer.
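To make the hint concrete, here is a minimal sketch of building `over.240` by subsetting and then counting. The data frame `toy`, its state codes, and its scores are invented for illustration only.

```r
# Made-up placeholder data, NOT the real NAEP values
toy <- data.frame(
  State      = c("AA", "BB", "CC", "DD", "EE"),
  NAEP_score = c(236, 241, 240, 244, 239)
)

# Keep only the rows where the score is strictly over 240
over.240 <- toy[toy$NAEP_score > 240, ]

# The function from the hint
length(over.240$NAEP_score)

# Two other ways to get the same count
nrow(over.240)
sum(toy$NAEP_score > 240)
```

Note that `>` leaves out a score of exactly 240; use `>=` only if "over" is meant to include it.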
Question Two: Heavy Info.
Get the list of mammals from me. It gives their weight in kg and the average lifespan.
1) Create an appropriate graph and model for this data, getting the best correlation you can. Be sure to explain any transformations or changes you've made to the graph and the reasons for them. AS A CAUTION--try using logarithms on your x axis as well as on your y axis.
a) What does your correlation tell you about the information?
b) Are there any points on this graph you would consider removing?
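A sketch of how you might compare correlations with and without logarithms is below. The column names `weight_kg` and `lifespan` are assumptions (your file may use different names), and the numbers are made up, NOT the real mammal list.

```r
# Made-up placeholder data, NOT the real mammal list
toy <- data.frame(
  weight_kg = c(0.02, 0.5, 4, 30, 250, 3000),
  lifespan  = c(2, 5, 10, 14, 22, 40)
)

# Correlation on the raw scale, with log on x only, and with log on both axes
r.raw    <- cor(toy$weight_kg, toy$lifespan)
r.logx   <- cor(log(toy$weight_kg), toy$lifespan)
r.loglog <- cor(log(toy$weight_kg), log(toy$lifespan))
c(raw = r.raw, log.x = r.logx, log.both = r.loglog)

# Fit and plot the model on the log-log scale
fit.log <- lm(log(lifespan) ~ log(weight_kg), data = toy)
plot(log(toy$weight_kg), log(toy$lifespan),
     xlab = "log(weight in kg)",
     ylab = "log(average lifespan)",
     main = "Toy data: lifespan vs. weight, log-log scale")
abline(fit.log)
```

`log()` in R is the natural log; `log10()` works just as well here, as long as you use the same one throughout and say which.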
2) Answer these questions:
a) What is the residual given by the point for dogs?
b) What point has the largest residual?
c) Humans have an average weight of 65 kg. What does our least squares line predict our average life span to be?
d) Human average life span is actually 67.2. Based on that, how much does our least squares line predict that we weigh? If you are unfamiliar with kg, convert to pounds.
e) Take the data from the beginning, and add a point for humans (65, 67.2). Re-create a plot and a new line of best fit. Compare these and comment on what the outlier does to the data.
f) Hypothesize why the human point might be such an outlier compared to the values of the other mammals. There is no right answer here, but make sure you are convinced of your answer and that you can convince me.
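The pieces of part 2 can be sketched roughly as follows. Everything here is illustrative: the data frame `toy`, the animal labels, and the column names `weight_kg` and `lifespan` are made up, and the fit is on the raw (untransformed) scale. If your model uses logarithms, you would need to transform the inputs and back-transform the predictions.

```r
# Made-up placeholder data, NOT the real mammal list
toy <- data.frame(
  animal    = c("A", "B", "C", "D", "E"),
  weight_kg = c(1, 5, 20, 100, 500),
  lifespan  = c(4, 9, 12, 20, 28)
)
fit <- lm(lifespan ~ weight_kg, data = toy)

# Residual = observed value minus value predicted by the line
res <- resid(fit)
res[toy$animal == "C"]                 # residual for one chosen animal
toy$animal[which.max(abs(res))]        # animal with the largest residual in size

# Predicted lifespan for a weight of 65 kg
predict(fit, newdata = data.frame(weight_kg = 65))

# Going backwards: solve y = a + b*x for x when y = 67.2
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])
x.for.y <- (67.2 - a) / b
x.for.y                # weight in kg
x.for.y * 2.20462      # same weight in pounds

# Add the human point and refit to see how the line moves
toy2 <- rbind(toy, data.frame(animal = "human", weight_kg = 65, lifespan = 67.2))
fit2 <- lm(lifespan ~ weight_kg, data = toy2)
rbind(before = coef(fit), after = coef(fit2))
```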
3) From your book, page 248, question 19.