Part One: working with me!
Part Two:
Question 1:
Included below is a .csv file with information on the different states: their total populations and their populations over the age of 18. This data is approximate and from 2009; it was found when looking for potential voters in the 2010 elections.
Load this data into R. Save it as pop.
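A minimal sketch of the loading step. The file name pop.csv and the 'name' column are assumptions (use whatever name you saved the attachment under), and the rows written here are made-up placeholders so the sketch runs on its own; they are not the real data.

```r
# Hypothetical file name and made-up rows, used only so this sketch runs by itself.
demo_file <- file.path(tempdir(), "pop.csv")
writeLines(c("name,popestimate2009,popest18plus2009",
             "StateA,1000,750",
             "StateB,2000,1600"), demo_file)

# read.csv() returns a data frame; header = TRUE treats the first row as column names.
pop <- read.csv(demo_file, header = TRUE)

str(pop)    # column names and types
head(pop)   # first few rows
```

With the real file, you would skip the writeLines() step and pass the path of the attachment to read.csv() directly.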
a) Find the general shape and spread for the columns 'popestimate2009' and 'popest18plus2009'. Comment on the shape and overall trends of the 50 states.
b) Let's get a new list, one that gives the percentage of each state's population that is over 18.
percent18 = (pop$popest18plus2009/pop$popestimate2009)*100
Explain what R did for us with that command. Explain what each piece of the command did.
c) THIS SECTION IS ONLY FOR PART B: Should we use 1.5*IQR to check for outliers or mean and sd? Justify your reasoning, and then proceed to do the test. Be sure to include any graphs that you make.
As an aside, you can scale the size of your graphs. Go ahead and do that: imagine you made a five-number summary, a boxplot, and a histogram to choose your outlier test (hint: probably a good idea). The two graphs should fit side by side fairly easily, which makes them easier to see and compare.
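One way to get two graphs side by side is to split the plotting window with par(mfrow = ...). This is a sketch on a made-up vector x (not the state data), and it is only one of several ways to lay out plots.

```r
# Made-up values, for illustration only.
x <- c(70.1, 72.4, 73.0, 74.2, 74.8, 75.1, 75.5, 76.0, 76.3, 77.9, 81.2)

# Five-number summary: min, lower hinge, median, upper hinge, max.
fivenum(x)

# Split the plotting window into 1 row and 2 columns, remembering the old setting.
old_par <- par(mfrow = c(1, 2))
hist(x, main = "Histogram", xlab = "percent")
boxplot(x, horizontal = TRUE, main = "Boxplot", xlab = "percent")

# Put the plotting window back to a single panel.
par(old_par)
```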
d) Are there any significant points or outliers? Be sure that if you want to call something an outlier and remove the point you make a good argument for doing so.
e) Can you think of any reasons or explanations for the significant points/outliers you found in part d?
f) THIS CHALLENGE IS OPTIONAL. COME BACK TO THIS QUESTION AFTER THE OTHER THINGS ARE DONE. There are two other columns in the data, labeled 'region' and 'division'. Both of these break the country up geographically. Using these breaks, make some statement or observation about the data relating to the over-18 population or population in general.
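If you want a starting point for the optional challenge, here is a sketch of summarizing one column by the groups in another. The data frame demo and its numbers are made up; with the real data you would use the 'region' or 'division' column of pop in place of demo$region.

```r
# Made-up data frame: a grouping column and a percentage for each row.
demo <- data.frame(
  region  = c(1, 1, 2, 2, 3, 3),
  percent = c(76.0, 78.0, 74.0, 75.0, 73.0, 77.0)
)

# tapply() splits the first argument by the second and applies the function to each group.
means_by_region <- tapply(demo$percent, demo$region, mean)
means_by_region

# The formula percent ~ region draws one boxplot per group on a shared axis.
boxplot(percent ~ region, data = demo, xlab = "region", ylab = "percent")
```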
Question 2:
Below are the survival times of 72 guinea pigs after they were injected with infectious bacteria in a medical experiment. The times are measured in days.
guinea pigs: 43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79, 80, 80, 81, 81, 81, 82, 83, 83, 84, 84, 88, 89, 91, 91, 92, 92, 97, 99, 99, 100, 101, 102, 102, 102, 103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128, 137, 138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184, 191, 198, 211, 214, 243, 249, 329, 380, 403, 511, 522, 598
a) Get the data into R. Describe the shape of the data--what does it tell us about the infected individuals?
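One way to get a short list like this into R is to type it into c(). The sketch below uses the values listed above; the name pigs is just a choice, not a requirement.

```r
# Survival times in days, copied from the list in this question.
pigs <- c(43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
          74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
          84, 84, 88, 89, 91, 91, 92, 92, 97, 99,
          99, 100, 101, 102, 102, 102, 103, 104, 107, 108,
          109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
          139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
          191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
          522, 598)

length(pigs)    # check that all of the values made it in
summary(pigs)   # min, quartiles, median, mean, max
hist(pigs, main = "Guinea pig survival times", xlab = "days")
```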
b) Decide whether to check for outliers using the IQR method or the sd method. Explain your steps and reasoning for your decisions, including any needed graphs.
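For reference, here is a sketch of how both checks can be computed, on a made-up vector x rather than the guinea pig data. The cutoff of 2 standard deviations is an assumption; use whatever cutoff we settled on in class.

```r
# Made-up values, for illustration only.
x <- c(70.1, 72.4, 73.0, 74.2, 74.8, 75.1, 75.5, 76.0, 76.3, 77.9, 81.2)

# IQR method: flag anything more than 1.5*IQR below Q1 or above Q3.
q <- quantile(x, c(0.25, 0.75))
lower_iqr <- q[1] - 1.5 * IQR(x)
upper_iqr <- q[2] + 1.5 * IQR(x)
flagged_iqr <- x[x < lower_iqr | x > upper_iqr]

# sd method: flag anything more than k standard deviations from the mean.
k <- 2   # assumed cutoff, not taken from the assignment
lower_sd <- mean(x) - k * sd(x)
upper_sd <- mean(x) + k * sd(x)
flagged_sd <- x[x < lower_sd | x > upper_sd]

flagged_iqr
flagged_sd
```

Note that quantile() and fivenum() can give slightly different quartiles for small data sets, so your fences may shift a little depending on which one you use.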
c) What points have you decided are outliers that should be removed? Did you find any possible outliers you chose to keep? Give your reasoning, and include any graphs or computations that you did to arrive at your conclusion.
***NOTE: QUESTIONS ONE AND TWO ARE DUE FRIDAY***
QUESTION THREE AND THE EXTENSION ARE DUE BEFORE CLASS/AT CLASS ON MONDAY.
Question 3:
The data for this question can be found on page 49, question 1.86.
a) Search each of the three data sets for possible outliers, using the methods we have discussed. Allow me to follow your steps in the write-up.
b) Compare and contrast the differences in the data sets.
c) Create some type of graphic that allows us to compare all of the data sets at once.
Consider the following:
boxplot(redflower, col="pink", horizontal=TRUE, at=0.8, ylim=c(33,50))
boxplot(yellowflower, col="red",horizontal=TRUE, add=TRUE, at=0.6)
That's not the only way to do it, but it gives you a few more commands to play with.
Extension (due Monday):
Ask 25 different people on campus what time they go to bed and what time they wake up on a specific day (specify Monday, Tuesday, etc. so the data is all about one evening--there are big differences usually between a Saturday and a Tuesday, so we're trying to avoid graphing that).
a. Using the information collected, create a third set of data of how long each person slept.
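The tricky part here is that most people go to bed before midnight and wake up after it. One possible approach, sketched below on made-up times, is to record each time as hours on a 24-hour clock (so 11:30 pm is 23.5) and use the remainder operator %% to wrap around midnight. The names bed, wake, and slept are just placeholders.

```r
# Made-up times in hours on a 24-hour clock (23.5 means 11:30 pm).
bed  <- c(23.5, 1.0, 22.0)
wake <- c(7.0, 9.5, 6.0)

# wake - bed is negative when bedtime is before midnight;
# %% 24 wraps that difference back into the range 0 to 24 hours.
slept <- (wake - bed) %% 24
slept
```

This assumes nobody slept 24 hours or more, which seems safe for this survey.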
b. Create a histogram, 5 point summary, and boxplot for the data created in part a. If there are any influential points, identify them. If you believe them to be outliers, make a case and then remove them from your data.
c. Create ONE GRAPHIC that easily shows when each person went to sleep, when they woke up, and how long they slept. Try to use R to do this--but it will stretch your abilities. If that is not possible, send me the data and work from parts a and b, and then make a graphic by hand and bring it to class on Monday.