Chapter 10/11 notes

The video above is from a past semester. The material is the same, but any dates mentioned will be incorrect. For the current term, please refer to the due dates in the syllabus.

Section 10-1 Correlation

Linear Correlation- A linear correlation exists between two variables when there is a correlation and the plotted points of the paired data form a pattern that can be approximated by a straight line.

Properties of the linear correlation coefficient r.
1)  -1 <= r <= 1
2)  r measures the strength of a linear relationship

The value of r^2 is the proportion of the variation in y that is explained by the linear relationship between x and y.

If |r| > critical value from table A-6, reject Ho (no linear correlation) and conclude there is a linear correlation.  Otherwise, fail to reject Ho: there is not sufficient evidence of a linear correlation.


10-1 #23 Listed below are the overhead widths of seals measured from photographs and weights of the seals. The purpose of the study was to determine if the weights of the seals could be determined from overhead photos.  Is there sufficient evidence to conclude that there is a linear correlation between overhead widths of seals from photographs and the weights of the seals? Significance level = 0.05

Since |r| (.948) > critical value (.811 from table A-6), reject Ho, i.e., there IS a linear correlation between the variables.

There is sufficient evidence to support the claim of a linear correlation between the overhead width of a seal in a photo and the weight of a seal.
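
The .811 critical value can be reproduced from a t table, which may help show where table A-6 comes from. This is only a minimal sketch: n = 6 pairs is an assumption (inferred from the .811 table entry, not stated above), 2.776 is assumed to be the two-tailed t critical value for alpha = 0.05 with 4 degrees of freedom, and the function name critical_r is hypothetical.

```python
import math

def critical_r(t_crit, n):
    """Critical value of r for n pairs, from the two-tailed t critical value with n - 2 df."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Assumptions: n = 6 pairs; t = 2.776 is the alpha = 0.05 two-tailed value for df = 4.
r = 0.948
r_crit = critical_r(2.776, 6)
reject_h0 = abs(r) > r_crit  # True means "conclude there is a linear correlation"
print(round(r_crit, 3), reject_h0)  # prints: 0.811 True
```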

Section 10-2 Regression

Requirements/Assumptions
1) The paired (x,y) data is a random sample
2) The scatterplot shows that the points approximate a straight-line pattern.
3) Outliers are removed if known to be errors.

Important: In predicting a value of y based on some given value of x we have two cases:
1) If there is NOT a linear correlation, we simply use the mean of the y values (sometimes this is given and sometimes we need to calculate it).
2) If there IS a linear correlation, we plug the given x value into the regression equation.
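
The two cases above can be sketched as a small function. The name predict_y and its parameters are hypothetical. The numbers in the usage lines come from the Lingcod example later in these notes (the mean of 0.7645 ppm is computed from that example's Methyl-Mercury column), and r = 0.30 is a made-up value used only to show the "no correlation" case.

```python
def predict_y(x, r, r_crit, intercept, slope, y_mean):
    """Best predicted y: the regression equation if |r| exceeds the
    critical value, otherwise the mean of the sample y values."""
    if abs(r) > r_crit:
        return intercept + slope * x
    return y_mean

# Case 2: linear correlation (Lingcod values), so use the regression equation.
with_corr = predict_y(32, 0.80200706, 0.468, -2.8285835, 0.11659546, 0.7645)

# Case 1: hypothetical r = 0.30 is below the critical value, so use the mean of y.
no_corr = predict_y(32, 0.30, 0.468, -2.8285835, 0.11659546, 0.7645)

print(round(with_corr, 2), no_corr)  # prints: 0.9 0.7645
```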


10-2 #23 Using the same seal data from the problem above, find the best predicted weight of a seal if the overhead width measured from a photo is 2 cm. Can the prediction be correct? What is wrong with predicting the weight in this case?

The predicted value of -76.5 cannot be correct because a weight cannot be negative.  The reason for this result is that we are trying to predict for a picture width of 2 cm, which is well beyond the scope of the sample widths (the sample widths are between 7.2 and 9.8).  So basically, we built a model for adult seals and we're trying to plug a baby seal into it...this generally doesn't work.  This is called extrapolation.  For reliable predictions, we need to stay within the sample data (or very close to it).
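
A quick way to guard against extrapolation is to compare the x value with the range of the sample. The helper below is a hypothetical sketch, not part of StatCrunch. It uses a strict range check, so it does not allow for values "very close to" the sample; the 8.5 cm width is a made-up value for contrast.

```python
def is_extrapolation(x, sample_x_min, sample_x_max):
    """True if x falls outside the range of x values used to build the model."""
    return x < sample_x_min or x > sample_x_max

outside = is_extrapolation(2, 7.2, 9.8)    # the 2 cm width from the problem
inside = is_extrapolation(8.5, 7.2, 9.8)   # hypothetical width inside the sample range
print(outside, inside)  # prints: True False
```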

Video on using statcrunch for linear regression and prediction  

Back on the water again.  Is there a relationship between fish size and Methyl-Mercury levels?  We found out earlier in the course that Lingcod have high levels of Methyl-Mercury.  Do all Lingcod have high levels of Methyl-Mercury?  Does the size of fish have any relation to the amount of Methyl-Mercury?  Let's find out.

Is there sufficient evidence to conclude that there is a linear correlation between overall length of Lingcod and the Methyl-Mercury in Lingcod? Use a significance level of 0.05.  What is the predicted Methyl-Mercury level of a 32 inch Lingcod?

Length (in)    Weight (lbs)    Methyl-Mercury (ppm)
22.4           3.6             0.137
29.3           9.2             0.452
26             6.8             0.341
25.3           4.9             0.226
26.4           6.5             0.238
28.5           7.5             0.484
29             9.1             0.408
33.5           13.9            1.94
33.5           14.6            0.789
33.5           14.8            0.55
34             15.7            0.72
35             16.6            1.49
38.5           25.6            1.59
40             22.8            2.84
24.8           6.2             0.193
26.5           6.6             0.171
32.5           11.7            0.537
36             13.8            0.655

Note: I have included the weight of the Lingcod as well.  I would expect that weight would be correlated to mercury, but we found that it was very difficult to get accurate weights on these fish in the open ocean.  There is way too much movement in the seas to have a scale give an accurate weight.

To get the output below, do the following: enter the data, then click Stat, Regression, Simple Linear. Select the two variables and click Compute.  This returns a tremendous amount of output; we only need a bit of it (blue and red).

Simple linear regression results:

Dependent Variable: Methyl- Mercury (ppm)
Independent Variable: Total Length (inches)
Methyl- Mercury (ppm) = -2.8285835 + 0.11659546 Total Length (inches)
Sample size: 18
R (correlation coefficient) = 0.80200706
R-sq = 0.64321532
Estimate of error standard deviation: 0.45126449

Parameter estimates:

Parameter    Estimate      Std. Err.      Alternative    DF    T-Stat        P-value
Intercept    -2.8285835    0.67741192     ≠ 0            16    -4.1755738    0.0007
Slope        0.11659546    0.021709337    ≠ 0            16    5.3707518     <0.0001

Analysis of variance table for regression model:

Source    DF    SS           MS            F-stat       P-value
Model     1     5.8739802    5.8739802     28.844974    <0.0001
Error     16    3.2582343    0.20363964
Total     17    9.1322145
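
As a cross-check on the StatCrunch output, the slope, intercept, r, and R-sq can be recomputed from the length and Methyl-Mercury columns with the usual least-squares formulas. This is only a sketch in plain Python (the variable names are my own, not StatCrunch's); under those formulas the results should agree with the output above to several decimal places.

```python
import math

# (length in inches, Methyl-Mercury in ppm) pairs from the Lingcod table
data = [
    (22.4, 0.137), (29.3, 0.452), (26.0, 0.341), (25.3, 0.226),
    (26.4, 0.238), (28.5, 0.484), (29.0, 0.408), (33.5, 1.94),
    (33.5, 0.789), (33.5, 0.55), (34.0, 0.72), (35.0, 1.49),
    (38.5, 1.59), (40.0, 2.84), (24.8, 0.193), (26.5, 0.171),
    (32.5, 0.537), (36.0, 0.655),
]

n = len(data)
x_mean = sum(x for x, _ in data) / n
y_mean = sum(y for _, y in data) / n

# Sums of squares and cross-products about the means
sxx = sum((x - x_mean) ** 2 for x, _ in data)
syy = sum((y - y_mean) ** 2 for _, y in data)   # "Total" SS in the ANOVA table
sxy = sum((x - x_mean) * (y - y_mean) for x, y in data)

slope = sxy / sxx
intercept = y_mean - slope * x_mean
r = sxy / math.sqrt(sxx * syy)
r_sq = r ** 2   # proportion of the variation in y explained by the line

print(n, round(intercept, 4), round(slope, 4), round(r, 4), round(r_sq, 4))
```

Here r_sq is about 0.64, so roughly 64% of the variation in Methyl-Mercury is explained by the linear relationship with length.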


Is there sufficient evidence to conclude that there is a linear correlation between overall length of Lingcod and the Methyl-Mercury in Lingcod? Use a significance level of 0.05. 

From the above we can see that there IS linear correlation between length of Lingcod and Methyl-Mercury levels.  There are two ways to determine this (both will give the same conclusion):

1) P-value (of the slope row) (<0.0001) is less than alpha (0.05) 

2) R(correlation coefficient) (0.80200706) is more extreme than the critical value from table A-6 (.468) 
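
The two methods agree because the T-Stat in the slope row can be computed directly from r and the sample size. The sketch below shows that link. The value 2.120 is an assumption: the two-tailed t critical value for alpha = 0.05 with 16 degrees of freedom, taken from a standard t table.

```python
import math

r = 0.80200706   # from the StatCrunch output
n = 18
df = n - 2

# Test statistic for Ho: no linear correlation
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Assumption: 2.120 is the two-tailed alpha = 0.05 t critical value for df = 16.
t_crit = 2.120
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)   # same cutoff, expressed on the r scale

print(round(t_stat, 2), round(r_crit, 3))  # prints: 5.37 0.468
```

Comparing t_stat with t_crit and comparing |r| with r_crit are the same comparison on two different scales, which is why both give the same conclusion.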

What is the predicted Methyl-Mercury level of a 32 inch Lingcod?
Since there IS a linear correlation between the variables, we can use the regression equation to predict the level of Methyl-Mercury.  Just plug in 32 for the length.

Methyl- Mercury (ppm) = -2.8285835 + 0.11659546 Total Length (inches)

= -2.8285835 + 0.11659546(32) = 0.90 ppm of Methyl-Mercury, which is well above the do-not-consume level (0.44 ppm) for women under 45 and children.


Chapter 11

Section 11-1 Goodness-of-Fit

Multinomial Experiment Requirements:
1) fixed number of trials
2) independence
3) outcomes of each trial must be classified into exactly one of several different categories
4) constant probabilities

Notation:
O- Observed frequency
E- Expected frequency
k- number of different categories
n- total number of trials

Assumptions:
1)  randomly selected data
2) data consists of frequency counts for each of the different categories
3) each category has an expected frequency that is at least 5

Critical values can be found in table A-4 (below) with degrees of freedom = k-1 (notice that this is the number of categories minus 1, not the number of trials).

Goodness-of-Fit tests are always right-tailed.

11-1 #7 The author of the book drilled a hole in a die and filled it with a lead weight, then proceeded to roll it 200 times. Here are the observed frequencies for the outcomes of 1, 2, 3, 4, 5, and 6 respectively: 27, 31, 42, 40, 28, 32.  Use a 0.05 significance level to test the claim that the outcomes are not equally likely.  Does it appear that the loaded die behaves differently than a fair die?

Outcome               1     2     3     4     5     6
Observed Frequency    27    31    42    40    28    32
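
The test statistic for this problem can be worked out by hand or with a short script. This is only a sketch, assuming the usual chi-square statistic (the sum of (O - E)^2 / E over all categories) and a fair-die Ho, so that every expected frequency is n/k. The critical value 11.07 is an assumption: roughly the table A-4 entry for df = 5 with 0.05 in the right tail.

```python
observed = [27, 31, 42, 40, 28, 32]

n = sum(observed)        # total number of trials
k = len(observed)        # number of categories
expected = n / k         # equally likely outcomes under Ho

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
df = k - 1

# Assumption: about 11.07 is the right-tailed 0.05 critical value for df = 5.
critical_value = 11.07
reject_h0 = chi_sq > critical_value

print(n, df, round(chi_sq, 2), reject_h0)  # prints: 200 5 5.86 False
```

With these numbers the test statistic is below the critical value, so we fail to reject Ho: there is not sufficient evidence to support the claim that the outcomes are not equally likely.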

We can also look at the p value (on statcrunch) to determine if we need to reject or fail to reject Ho.  If the p value is less than the significance level, reject Ho.

Video -Goodness of Fit test using Statcrunch

Section 11-2 Contingency Tables

Requirements:
1) Randomly selected data that are represented as frequency counts in a two-way table
2) Ho: Row and column variables are INDEPENDENT (one does not affect the other)
      H1: Row and column variables are DEPENDENT
3) The expected frequency E is at least 5 in each cell

degrees of freedom= (r-1)(c-1)
r = number of rows
c = number of columns

Tests are all right tailed.

Expected frequency = (row total)(column total)/(grand total)
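
The expected frequency formula and the degrees of freedom can be sketched in a few lines. The function names are hypothetical, and the 2x2 table of counts is made up purely for illustration; it is NOT the Echinacea data.

```python
def expected_frequencies(table):
    """Expected count for each cell: (row total)(column total)/(grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical counts for illustration only (2 rows, 2 columns).
hypothetical_table = [[10, 20], [30, 40]]

expected = expected_frequencies(hypothetical_table)
df = (len(hypothetical_table) - 1) * (len(hypothetical_table[0]) - 1)
chi_sq = chi_square_statistic(hypothetical_table)

print(expected, df, round(chi_sq, 3))  # prints: [[12.0, 18.0], [28.0, 42.0]] 1 0.794
```

Every expected frequency in this made-up table is at least 5, so requirement 3 would be met.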


11-2 #15 In a clinical trial of the effectiveness of Echinacea for preventing colds, the results in the table below were obtained.  Use a 0.05 significance level to test the claim that getting a cold is independent of the treatment group.  What do the results suggest about the effectiveness of Echinacea as a prevention against colds? 

Video on running a contingency table test on statcrunch

Note: above I compared the test stat to the critical value.  When using statcrunch we can also come to a conclusion using the p value.  If the p value is less than the significance level, reject Ho.