1: Old Faithful
Below is a data set for Old Faithful eruptions. The first data column is the duration of the eruption (in seconds). The second column is the interval of time before the next eruption (in minutes). We are going to use duration as the explanatory variable and interval as the response variable.
Find the correlation between the two variables, find the line of best fit, and plot your data (and the line).
Make a residual plot. Comment on any possible issues the residual plot reveals and what it might tell you about this data.
Does a linear relationship make sense for this data? Explain any possible problems with using a linear relationship with this data.
If an eruption were to last for 256 seconds, what would the expected interval be?
If an interval was 88 minutes, what would we expect the eruption duration to be?
One of the points given was (207, 78). What is the residual of this point?
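The steps in this problem can be sketched in R as follows. This is only a rough outline, not the expected answer: it uses R's built-in faithful data as a stand-in for the class file (an assumption; the numbers will differ from yours), converts eruption time to seconds, and uses the made-up column names duration and interval.

```r
# Stand-in data: R's built-in `faithful` set, NOT the class file.
# `eruptions` is in minutes, so convert to seconds to match the worksheet.
faith <- data.frame(duration = faithful$eruptions * 60,  # seconds
                    interval = faithful$waiting)         # minutes

r   <- cor(faith$duration, faith$interval)     # correlation
fit <- lm(interval ~ duration, data = faith)   # interval = b0 + b1 * duration
b   <- unname(coef(fit))                       # b[1] = b0, b[2] = b1

# Scatterplot with the fitted line, then the residual plot
plot(faith$duration, faith$interval,
     xlab = "Duration (s)", ylab = "Interval (min)")
abline(fit)
plot(faith$duration, resid(fit),
     xlab = "Duration (s)", ylab = "Residual (min)")
abline(h = 0, lty = 2)

# Expected interval for a 256-second eruption
pred_256 <- predict(fit, newdata = data.frame(duration = 256))

# One way to go from an interval back to a duration: solve the line for x.
# (Fitting duration ~ interval instead is another option and gives a
# different answer.)
dur_88 <- (88 - b[1]) / b[2]

# Residual = observed - predicted, here for the point (207, 78)
res_pt <- 78 - predict(fit, newdata = data.frame(duration = 207))
```

A positive residual means the point sits above the line, and a negative one means it sits below.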
2: Anscombe's Quartet
This data set was made up by a statistician. It contains four groups of data (found below). Please work through the steps in order, as they should lead you to an interesting conclusion. The columns pair up as follows: xA and yA go together, xB and yB go together, xC and yC go together, and xD and yD go together.
Find the correlation for each of the four data sets.
Find the line of best fit for each of the four data sets. Make sure to write it in the new form (y = b0 + b1*x).
Create a graph for each of the data sets, making sure to include the line of best fit for each. These four graphs should be on the same page and should all be viewable at the same time.
Create a residual plot for each of the four data sets. These four graphs should be on the same page and should all be viewable at the same time.
Compare and contrast how the line of best fit interacts with the data.
Which data set does the line make the most sense for?
For each data set where a line of best fit doesn't make sense, explain why.
Interested in one graph for all four? Try this (for it to work, name your data table aq):
par(mfrow=c(2,2))
# draw each scatterplot first, then add its line of best fit
plot(aq$xA,aq$yA,main="Quartet A",xlim=c(0,20),ylim=c(0,15),xlab="xA",ylab="yA")
abline(lm(aq$yA~aq$xA))
plot(aq$xB,aq$yB,main="Quartet B",xlim=c(0,20),ylim=c(0,15),xlab="xB",ylab="yB")
abline(lm(aq$yB~aq$xB))
plot(aq$xC,aq$yC,main="Quartet C",xlim=c(0,20),ylim=c(0,15),xlab="xC",ylab="yC")
abline(lm(aq$yC~aq$xC))
plot(aq$xD,aq$yD,main="Quartet D",xlim=c(0,20),ylim=c(0,15),xlab="xD",ylab="yD")
abline(lm(aq$yD~aq$xD))
To get back to one plot per page, enter the code:
par(mfrow=c(1,1))
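The correlations, fitted lines, and residual plots can be produced in a similar way. The sketch below is one possible approach, not the required one. It assumes the class file holds the classic quartet values, so it builds aq from R's built-in anscombe data and renames the columns to xA through yD. The names groups and summary_tab are made up for this example.

```r
# Stand-in for the class table: R's built-in `anscombe` data, renamed to the
# xA..yD column names used in this problem (assumes the class file matches).
aq <- setNames(anscombe, c("xA", "xB", "xC", "xD", "yA", "yB", "yC", "yD"))
groups <- c("A", "B", "C", "D")

# Correlation and fitted coefficients (b0, b1) for each group
summary_tab <- t(sapply(groups, function(g) {
  x <- aq[[paste0("x", g)]]
  y <- aq[[paste0("y", g)]]
  fit <- lm(y ~ x)
  c(r = cor(x, y), b0 = unname(coef(fit)[1]), b1 = unname(coef(fit)[2]))
}))
print(round(summary_tab, 3))

# Four residual plots on one page
par(mfrow = c(2, 2))
for (g in groups) {
  x <- aq[[paste0("x", g)]]
  y <- aq[[paste0("y", g)]]
  plot(x, resid(lm(y ~ x)), main = paste("Residuals", g),
       xlab = paste0("x", g), ylab = "Residual")
  abline(h = 0, lty = 2)   # reference line at zero
}
par(mfrow = c(1, 1))       # back to one plot per page
```

Compare the rows of summary_tab with what the plots show before drawing any conclusions.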
3: Detroit
Below is a data set dealing with statistics from Detroit. The explanation of the different pieces can be found here.
Load the csv file into R. Save the data as detroit.
Use plot(detroit). Explain what R gave you in this graph. If you are unsure, call me over and we will look at it together.
Find two variables you think would be described well by a line of best fit. Find the correlation between your two variables, find the line of best fit, and then create a plot that includes the points, correct axis labels, and the line of best fit.
Explain what relationship you found and what it means in real life.
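A rough sketch of this workflow is below. So that it runs on its own, it uses simulated placeholder data with made-up column names (var1, var2, var3), not the real Detroit file. The file name in the comment is also an assumption.

```r
# Placeholder data so the sketch runs by itself. With the real file you would
# start from something like:
#   detroit <- read.csv("detroit.csv")   # hypothetical file name
set.seed(1)
detroit <- data.frame(var1 = 1:13)                 # hypothetical columns
detroit$var2 <- 2 * detroit$var1 + rnorm(13)       # simulated, roughly linear
detroit$var3 <- rnorm(13)                          # simulated, unrelated

plot(detroit)                    # scatterplot matrix: one panel per column pair
print(round(cor(detroit), 2))    # correlation matrix helps shortlist a pair

# Fit and plot one chosen pair with labeled axes
fit <- lm(var2 ~ var1, data = detroit)
plot(detroit$var1, detroit$var2,
     xlab = "var1 (units)", ylab = "var2 (units)")
abline(fit)
print(coef(fit))                 # b0 and b1 for y = b0 + b1*x
```

With the real data, replace var1 and var2 with the two columns you picked and give the axes their real names and units.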
4: A friend of yours has been doing some research in the area of photosynthesis. They have collected the information and placed it into a .csv for you. They explain the four different variables as follows:
Irradiance: the amount of light that was shining on the plant leaf.
CO2 Concentration: how much CO2 was in the air around the plant when the data was taken.
Leaf Resistance: the resistance the leaf has to gases (how resistant the holes are that let air, water, and gases in and out).
Photosynthesis Rate: The rate at which the plant is currently photosynthesizing.
They ask you to help them answer the following question: "Which of the three factors (Irradiance, CO2 Concentration and Leaf Resistance) seem to have the strongest correlation with the photosynthesis rate?"
Using your knowledge of statistics, help them answer that question. I'm leaving it up to you to decide what to show, including the proper graphs and equations. But give me enough that I will be convinced.
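One way to organize the comparison is sketched below. The data here is simulated so the code runs by itself, and the column names (irradiance, co2, resistance, rate) are assumptions. Use the names from your friend's file instead.

```r
# Simulated placeholder data, NOT the friend's measurements.
set.seed(2)
n <- 30
photo <- data.frame(irradiance = runif(n, 0, 2000),
                    co2        = runif(n, 300, 500),
                    resistance = runif(n, 1, 10))
photo$rate <- 0.01 * photo$irradiance - 0.5 * photo$resistance + rnorm(n)

# Correlation of each factor with the photosynthesis rate
predictors <- c("irradiance", "co2", "resistance")
r <- sapply(predictors, function(p) cor(photo[[p]], photo$rate))
print(round(r[order(abs(r), decreasing = TRUE)], 3))  # strength is |r|, not sign

# One scatterplot per factor, each with its line of best fit
par(mfrow = c(1, 3))
for (p in predictors) {
  plot(photo[[p]], photo$rate, xlab = p, ylab = "rate",
       main = paste("r =", round(r[p], 2)))
  abline(lm(photo$rate ~ photo[[p]]))
}
par(mfrow = c(1, 1))
```

A strongly negative r counts as a strong correlation too, which is why the sketch sorts by absolute value.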
Extensions:
Pets!
The .csv below gives how old pets are in human years (for example, a 4-year-old dog is like a 34-year-old human). Your job is to make one graphic that shows both cat years and dog years. Be sure to include lines of best fit for both. Make it a nice-looking graph.
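One possible layout for such a graphic is sketched below. The values are made-up placeholders and the column names (age, cat, dog) are assumptions, so swap in the real file before reading anything into the picture.

```r
# Made-up placeholder values, NOT the class file.
pets <- data.frame(age = 1:8,
                   cat = c(15, 24, 28, 32, 36, 40, 44, 48),
                   dog = c(15, 24, 28, 34, 38, 42, 47, 51))

cat_fit <- lm(cat ~ age, data = pets)
dog_fit <- lm(dog ~ age, data = pets)

# Start with the dog points, sizing the y-axis so both series fit
plot(pets$age, pets$dog, col = "blue", pch = 16,
     ylim = range(pets$cat, pets$dog),
     xlab = "Actual age (years)", ylab = "Human-equivalent age (years)",
     main = "Cat and dog years")
points(pets$age, pets$cat, col = "orange", pch = 17)  # add the cat points
abline(dog_fit, col = "blue")                         # one line per animal
abline(cat_fit, col = "orange")
legend("topleft", legend = c("Dog", "Cat"),
       col = c("blue", "orange"), pch = c(16, 17), lty = 1)
```

Setting ylim from both columns keeps either series from being cut off, and the legend ties each color to its animal.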
Let's take another look at the Old Faithful data set. As we saw in the regression plot, it really breaks down into two different groups: those with short duration and interval, and those with longer values of both.
Break our data into two different sets. Call them short_faith and long_faith. Explain how you found the two different sets and where you chose to break them up (there can be different results based upon how you think about this problem).
Run a brief summary on each of these two different sets and get the following: correlation, line of best fit, regression plot and plot of the data. Comment on what appears to be happening.
Create a graph that has all the data points of both sets in different colors and both lines of best fit. Comment on the graph and whether or not you feel it is useful.
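A sketch of the split is below. It again uses R's built-in faithful data as a stand-in for the class file, and the 180-second cutoff is only an assumed split point. Choose and justify your own.

```r
# Stand-in data: built-in `faithful`, with eruption time converted to seconds.
faith <- data.frame(duration = faithful$eruptions * 60,
                    interval = faithful$waiting)

cutoff <- 180   # assumed split point in seconds; pick your own from the plot
short_faith <- subset(faith, duration <  cutoff)
long_faith  <- subset(faith, duration >= cutoff)

# Correlation and line of best fit within each group
short_fit <- lm(interval ~ duration, data = short_faith)
long_fit  <- lm(interval ~ duration, data = long_faith)
print(c(short = cor(short_faith$duration, short_faith$interval),
        long  = cor(long_faith$duration,  long_faith$interval)))

# Both groups in different colors on one graph, with both lines
plot(faith$duration, faith$interval, type = "n",
     xlab = "Duration (s)", ylab = "Interval (min)")
points(short_faith$duration, short_faith$interval, col = "red")
points(long_faith$duration,  long_faith$interval,  col = "blue")
abline(short_fit, col = "red")
abline(long_fit,  col = "blue")
legend("topleft", legend = c("short_faith", "long_faith"),
       col = c("red", "blue"), pch = 1, lty = 1)
```

Splitting on interval instead of duration, or using a different cutoff, is just as defensible and may move a few points between groups.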