4. Two Sample Comparison

Learning objectives (and summaries)

Numerically analyze experiments and observational studies with a quantitative response variable.

Distinguish between quantitative 2-sample and quantitative matched pairs data and choose the appropriate interval / test procedures.
- If the study uses the same individuals more than once or specially related individuals, then it is matched pairs. Use the list of differences to do an interval or test of a single quantitative variable.
- If not (if it is a random or block design), then use a 1-quantitative, 1-categorical variable test/interval.
Find and interpret a confidence interval for a matched-pairs study.
- Decide which order to subtract. Subtract all pairs. Use the resulting list of differences in a confidence interval of a single mean.
- Interpret: I am _(percent)_% confident that _(each pair of individuals)_ _(have/succeed)_ an average of _(low end of interval)_ to _(high end of interval)_ more _(response variable units)_ with _(better treatment)_ than _(worse treatment)_.
- For example (single individual in both treatments): I am 90% confident that each local girls basketball player makes an average of 5.1 to 7.4 more free throws with their right hand than their left hand (out of 25).
- For example (separate paired individuals): I am 99% confident that husbands make an average of 0.8 to 4.4 more free throws than their wives (out of 25).
- OR plus-minus form: I am 99% confident that husbands make an average of 2.6 more free throws than their wives (out of 25) with a margin of error of ±1.8 free throws.
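The plus-minus form is just the midpoint of the interval and half of its width. A minimal Python sketch of that arithmetic, using the husbands-and-wives interval above (the function name `to_plus_minus` is made up for illustration):

```python
def to_plus_minus(low, high):
    """Convert an interval (low, high) to (center, margin of error)."""
    center = (low + high) / 2   # midpoint of the interval
    margin = (high - low) / 2   # half the width of the interval
    return center, margin

center, margin = to_plus_minus(0.8, 4.4)
print(f"{center:.1f} ± {margin:.1f}")  # prints "2.6 ± 1.8"
```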
Perform and interpret a significance test for a matched-pairs study.
- Decide which order to subtract. Subtract all pairs. Use the resulting list of differences in a test.
- The null is always that the average of the list of differences = 0. Write as μ_{groupA-groupB} = 0. This makes it easy to know what μ stands for. Note that with matched pairs, there is only one μ.
Find and interpret a confidence interval for a difference of two distinct sample means using StatKey.
- Enter data as a series of (Treatment, Value) pairs. Generate bootstrap samples and find the middle __%.
- Interpret: I am _(percent)_% confident that _(bigger group)_ _(has/succeeds)_ an average of _(low end of interval)_ to _(high end of interval)_ _(response variable units)_ more than _(smaller group)_.
- For example: I am 95% confident that dogs weigh an average of 13.3 to 22.5 pounds more than cats.
- OR plus-minus form: I am 95% confident that dogs weigh an average of 17.9 pounds more than cats with a margin of error of ±4.6 pounds.
Perform and interpret a significance test for a difference of two distinct sample means using StatKey.
- Null hypothesis is always that the average of group A = the average of group B. Write as μ_groupA = μ_groupB. Since there are two separate lists, there are two μ's.
- Enter data as a series of (Treatment, Value) pairs. Generate randomized independent samples. Find the probability of your data (or higher/lower/more extreme) given the null hypothesis of independence. Use this p-value to make conclusions on dependence.
Find and interpret a confidence interval for a difference of two proportions using StatKey.
- Enter data as a series of (Treatment, Result) pairs. Generate bootstrap samples and find the middle __%.
- Interpret: I am _(percent)_% confident that _(bigger group)_ _(has/succeeds)_ _(low end of interval)_% to _(high end of interval)_% more _(response variable units)_ than _(smaller group)_.
- For example: I am 95% confident that NFC teams win 4.5% to 9.2% more games than AFC teams.
- OR plus-minus form: I am 95% confident that NFC teams win 6.8% more games than AFC teams with a margin of error of ±2.3%.
Perform and interpret a significance test for a difference of two proportions using StatKey.
- Null hypothesis is always that the proportion of group A = the proportion of group B. Write as p_groupA = p_groupB. NOTE the use of "p" for proportions!
- Enter data as a series of (Treatment, Result) pairs. Generate randomized independent samples. Find the probability of your data (or higher/lower/more extreme) given the null hypothesis of independence. Use this p-value to make conclusions on dependence.
Understand how StatKey simulates confidence intervals of one or two samples.
- Matched pairs: Before entering, you need to take the difference of each pair and enter the list of differences. StatKey resamples this list with replacement to produce a bootstrap sample. It then plots the average of each of these samples and finds the range of the middle 95% (or whatever percentage you choose).
- For intervals of two means, it creates separate bootstrap samples from the original data in group A and the original data in group B, averages each of them separately, subtracts the two averages, and plots the difference. Then it finds the range of the middle __% as your interval.
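The two resampling procedures described above can be sketched in a few lines of Python. This is only an illustration of the idea, not StatKey's actual code: the function names and the data are made up, and the interval is taken as the middle 95% of the simulated statistics.

```python
import random
from statistics import mean

def percentile_interval(stats, level=0.95):
    """Return the ends of the middle `level` fraction of simulated statistics."""
    ordered = sorted(stats)
    tail = (1 - level) / 2
    low = ordered[round(tail * len(ordered))]
    high = ordered[round((1 - tail) * len(ordered)) - 1]
    return low, high

def bootstrap_paired_ci(differences, reps=5000, level=0.95, seed=1):
    """Matched pairs: resample the single list of differences."""
    rng = random.Random(seed)
    n = len(differences)
    # Resample with replacement and record the average of each resample.
    means = [mean(rng.choices(differences, k=n)) for _ in range(reps)]
    return percentile_interval(means, level)

def bootstrap_two_means_ci(group_a, group_b, reps=5000, level=0.95, seed=1):
    """Two distinct samples: resample each group separately."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        # Average each resample, then subtract the two averages.
        diffs.append(mean(resample_a) - mean(resample_b))
    return percentile_interval(diffs, level)

# Made-up data, for illustration only.
differences = [2, 5, -1, 3, 4, 0, 6, 1]
group_a = [12, 15, 11, 18, 14, 16]
group_b = [9, 13, 10, 12, 8, 11]

print(bootstrap_paired_ci(differences))
print(bootstrap_two_means_ci(group_a, group_b))
```

As with StatKey, the exact endpoints depend on the random resamples, so a different seed gives a slightly different interval.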
Understand how StatKey simulates tests of independence.
- Since the null hypothesis is that the treatment group does not affect the response, it randomly mixes up the treatment groups and the responses for each randomization sample. For each sample, it takes the average of group A and group B, subtracts, and plots the difference on the big graph distribution. After thousands of repeats, you get a probability distribution of possible values if the null is true. Your p-value is the probability of finding data as high/low/extreme as the difference of your sample averages.
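The shuffling idea above can be sketched in Python as follows. This is a minimal illustration, not StatKey's implementation: the function name `randomization_test_means` and the data are made up, and the test shown is one-sided (group A higher than group B).

```python
import random
from statistics import mean

def randomization_test_means(group_a, group_b, reps=10000, seed=1):
    """One-sided p-value for H_A: mean of A > mean of B, found by shuffling labels."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(reps):
        # Under the null, the group label does not matter, so mix the responses
        # and deal them back out into two groups of the original sizes.
        rng.shuffle(pooled)
        simulated = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Count shuffles at least as extreme as the real data
        # (the tiny tolerance guards against floating-point ties).
        if simulated >= observed - 1e-9:
            count += 1
    return count / reps

# Made-up data, for illustration only.
group_a = [12, 15, 11, 18, 14, 16]
group_b = [9, 13, 10, 12, 8, 11]

print(randomization_test_means(group_a, group_b))
```

For a two-sided alternative, one option is to compare `abs(simulated)` with `abs(observed)` instead.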

Assessment

- Test (12pts): 9 questions (2 MC, 6 short response, 1 written) and one of these free response questions about the following scenario (3pts):
  - Explain how StatKey generates a confidence interval for matched pairs data.
  - Explain how StatKey generates a confidence interval for standard two-sample data.
  - Explain how StatKey runs a hypothesis test. Explain why this method makes sense and connects to the null and alternative hypotheses.

Practice

Answer the following for each of the problems below:

a) What is the explanatory variable in this scenario? What are its options (treatments)?

b) What is measured for each individual (the response variable in this scenario)? Is it quantitative or categorical?

c) If it is quantitative, is there matched pairs or two distinct samples?

d) Based on the last two responses, what type of interval/test will you perform in StatKey?

e) What is the null hypothesis in this scenario? Use the correct symbols (μ or p) and use subscripts so it is clear what each symbol means.

f) What is the alternative hypothesis in this scenario? Re-read the problem to see if there is an intended direction.

g) What is the p-value of your test?

h) Is this an observational study or experiment? Based on this and your p-value, what can you conclude?

i) What is the estimated difference between the two groups (95% interval)?

j) Convert this interval to plus-minus form.

k) Interpret the confidence interval of the difference in a sentence. Use plus-minus form because it is often far more readable in a sentence.

1. A group of disc golfers wants to compare two different putting techniques. In one, the player throws the disc with the forehand, just like a normal Frisbee. In the other, the player uses a backhand flick to throw. To test the methods, they get 20 volunteers to throw 8 discs each from 30 feet away. Half of the volunteers are randomly assigned to one of the techniques, and the rest to the other, before the throwing starts. The group wants to prove that throwing forehand will work better on average than throwing backhand. Results (number of shots made out of 8):

Backhand: 2, 7, 3, 8, 3, 2, 1, 4, 3, 4

Forehand: 3, 3, 7, 6, 8, 4, 5, 6, 4, 8

2. After taking some criticism from their study design, the disc golfers tried a new approach: each player throws 10 discs with one technique and 10 discs with the other technique. There were 8 volunteer players. The order of which technique comes first is randomized for each player. Again, they still want to prove that throwing forehand will work better. Results:

3. While watching the study take place, a couple of members of the local robotics teams decided to set up their own trial. One team had a frisbee shooter that fired using a straight track. The other had a circular frisbee shooter. They each had their robot take 40 shots. Results: the track robot made 16 and the circular robot made 23. Neither team was assumed to be better before running the test.

4. In a food testing experiment, students found 50 volunteers. They randomly blinded 25 taste-testers as they ate a piece of toast with margarine. They asked the participant whether they just ate butter on their toast. 16 out of 25 said yes. Then they gave the other randomly selected 25 taste-testers a piece of toast with butter and asked the same question. 19 out of 25 said yes.

5. A group of robot enthusiasts who were not impressed with the frisbee shooters made two t-shirt launchers. The first cannon, the larger one, has a 3" diameter barrel. The smaller one has a 2" barrel. 10 t-shirts are fired out of each one and the distances are recorded. The group is convinced that the large cannon is better.

Small cannon trials: 80, 56, 78, 61, 31, 64, 72, 66, 69, 78

Large cannon trials: 84, 97, 88, 77, 91, 83, 43, 89, 79, 67

6. A student group decided to compare how well players did using two different strategies of building a card tower. Each subject was first taught the specific method to use for their tower and told that using this strategy was required. Since the group didn’t want players to mix strategies, they tested two completely separate groups of people. People volunteered to play and were randomly assigned to a strategy using a coin flip on the day of the experiment. The results:

Strategy 1 (seconds required to build the tower): 33, 42, 59, 68, 73, 91, 33, 45

Strategy 2 (seconds required to build the tower): 73, 33, 49, 62, 65, 48, 47, 66

7. Another team compared how accurately people could kick a soccer ball into a goal with their left and right feet. Each person kicked the ball 10 times per foot to see how many they could get in the goal. The organizing team assumed that people would score more using their right foot, so they want to verify this in their test. Each person’s two results are stacked vertically in the same column.

Right: 8 8 4 4 9

Left: 3 2 5 3 8

Practice solutions

1. Frisbee #1:

a) Explanatory Variable: Method of throwing (categorical)

Treatments: Forehand, backhand (the options of the categorical)

b) Number of shots made (quantitative)

c) 2 distinct samples

d) Test for a difference in means. There is one categorical variable and one quantitative variable. The reason it is a difference in means is because there are two groups (the options) and each has a quantitative value associated with it. You take the difference of these means.

e) H₀: μ_f = μ_b

f) H_A: μ_f > μ_b

g) p= 0.054 (depending on your randomly generated data, it could be above or below .05)

h) Experiment; we cannot reject the null because our p-value was greater than 0.05. Collecting more data with a bigger sample size would give a better chance of detecting a difference if one exists.

i) -3.30 to 0.10

j) -1.60 ± 1.70 more backhand (1.60 ± 1.70 more forehand)

k) I am 95% confident that Frisbee players make an average of 1.6 ± 1.7 more shots with their forehand than their backhand.

2. Frisbee #2:

a) Explanatory Variable: Method of throwing

Treatment Groups: Forehand and backhand

b) number of shots made; quantitative

c) Matched Pairs (each person is doing both forehand and backhand)

d) Test for a single mean because once you find the difference in the two trials, it will become one number for each person instead of two.

e) H₀: μ_{f-b} = 0

f) H_A: μ_{f-b} > 0

g) p= .02

h) Experiment; we can reject the null because p < 0.05. We can conclude that throwing forehand instead of backhand causes a higher number of shots made.

i) .125 to 3.125

j) 1.625 ± 1.5 more forehand

k) I am 95% confident that each Frisbee player makes an average of 1.6 ± 1.5 more shots with their forehand than their backhand.

3. Robot Frisbee:

a) Explanatory Variable: Type of method used by the robot

Treatment Groups: Circular vs Straight Track

b) The individual is the shot, not the robot (in #1 and #2 the individual was the player). Thus, the variable is whether the shot is made or not (categorical).

c) N/A

d) Test for a difference in proportions because the response is categorical, so we compare the proportion of shots made by each robot

e) H₀: p_s = p_c

f) H_A: p_s ≠ p_c

g) p = .19

h) Not a proper experiment because each treatment "group" has one individual (no repetition) and there is no random assignment. Regardless, you cannot reject the null since p > .05

i) -.375 to .05

j) -.1625 ± .2125 more for straight (.1625 ± .2125 more for circular)

k) I am 95% confident that the circular shooter makes 16% ± 21% more shots than the straight track shooter.
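The numbers in g) and i) can be approximated with a short simulation. The sketch below uses the counts from problem 3 (16 of 40 for the track robot, 23 of 40 for the circular robot); the function name and the number of repetitions are made up, and this mimics the idea behind StatKey rather than reproducing its code.

```python
import random

def two_proportion_simulation(made_a, n_a, made_b, n_b, reps=5000, seed=1):
    """Return (observed difference, two-sided p-value, 95% bootstrap interval)."""
    rng = random.Random(seed)
    shots_a = [1] * made_a + [0] * (n_a - made_a)   # 1 = made, 0 = missed
    shots_b = [1] * made_b + [0] * (n_b - made_b)
    observed = made_a / n_a - made_b / n_b

    # Randomization test: shuffle all shots and deal them back out to the robots.
    pooled = shots_a + shots_b
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        simulated = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Two-sided: count differences at least as far from 0 as the real one.
        if abs(simulated) >= abs(observed) - 1e-9:
            extreme += 1
    p_value = extreme / reps

    # Bootstrap interval: resample each robot's shots separately with replacement.
    boot = sorted(
        sum(rng.choices(shots_a, k=n_a)) / n_a - sum(rng.choices(shots_b, k=n_b)) / n_b
        for _ in range(reps)
    )
    low = boot[round(0.025 * reps)]
    high = boot[round(0.975 * reps) - 1]
    return observed, p_value, (low, high)

# Track (straight) robot minus circular robot, as in the solution above.
observed, p_value, interval = two_proportion_simulation(16, 40, 23, 40)
print(observed, p_value, interval)
```

Because the samples are random, the p-value and interval will land near, but not exactly on, the values listed above.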

4. Food testing:

a) Explanatory Variable: What is on the toast

Treatment Groups: Butter or Margarine

b) Whether or not they think they just ate butter; categorical

c) n/a

d) Difference in proportions because the response is categorical (yes or no), so we compare the proportion of "yes" answers in each group

e) H₀: p_m = p_b

f) H_A: p_m ≠ p_b

g) p= 0.52

h) Experiment; the result is not statistically significant, so we cannot reject the null

i) -.360 to .120

j) -.12 ± .24 (.12 ± .24 more butter)

k) I am 95% confident that testers believe they are eating butter 12% ± 24% more often when they are actually eating butter than when they are secretly given margarine.

5. T-shirt launcher:

a) Explanatory Variable: Size of the cannon

Treatment Groups: 3" Diameter or 2" Diameter

b) Distance of the t-shirt; It is quantitative

c) 2 distinct samples

d) Test for a difference in means (one categorical variable, one quantitative variable)

e) H₀: μ_s = μ_L

f) H_A: μ_s < μ_L

g) p = .021

h) p-value is statistically significant and we can reject the null

i) -26.60 to -1.80

j) -14.2 ± 12.4

k) I am 95% confident that

6. Card tower:

a) Explanatory Variable: Method Used

Treatment Groups: Strategy 1 and Strategy 2

b) time required to build the tower; quantitative

c) 2 Distinct Samples

d) 1 Categorical, 1 Quantitative: There are two groups, a quantitative response variable, and no matched pairs

e) Null: the time required to build the tower with the two different strategies is the same

H₀: μ₁ = μ₂

f) Alt: the time required to build the tower with the two different strategies is different (because we don't know which one is supposed to be "better")

H_A: μ₁ ≠ μ₂

g) p = 0.989

h) Experiment: 2 different groups (and their only difference is the strategy used), volunteers are randomly assigned to each group

The data suggests that the 2 strategies are nearly identical. There is not a shred of evidence to support that they are different in the student population at this school.

i) Roughly -15.5 to 16.5

j) .5 ± 16

k) I am 95% confident that

7. Soccer kicks:

a) Explanatory Variable: What foot is used to kick the soccer ball

Treatment Groups: right foot or left foot

b) Number of goals made; Quantitative

c) Matched Pairs

d) Test for a single mean

e) Null: people kick equally well with both feet

H₀: μ_{right - left} = 0 (this is ALWAYS the null for matched pairs)

f) Alt: people kick better with their right foot

H_A: μ_{right - left} > 0

g) p = 0.027

h) Depends on how it is run. If the order in which people kick is not randomized, it is an observational study. If it is randomized, it is an experiment, but since people know which leg they are using, there might still be lurking variables that affect the results. Since p < 0.05, we can reject the null; if the order was randomized, we can conclude that kicking with the right foot causes people to make more goals.

i) .200 to 4.800

j) 2.5 ± 2.3

k) I am 95% confident that
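For a matched-pairs test like the soccer problem, the work starts with the list of differences. The sketch below subtracts the pairs from problem 7 and then simulates the null by shifting the differences so they average 0 and resampling with replacement. That shift-and-resample step is an assumption about how a single-mean test can be simulated, not something this page confirms about StatKey, and the function name is made up.

```python
import random

right = [8, 8, 4, 4, 9]
left = [3, 2, 5, 3, 8]

# Subtract every pair in the same order (right - left).
differences = [r - l for r, l in zip(right, left)]
observed = sum(differences) / len(differences)

def shifted_bootstrap_p_value(differences, reps=10000, seed=1):
    """One-sided p-value for H_A: mean difference > 0."""
    rng = random.Random(seed)
    n = len(differences)
    sample_mean = sum(differences) / n
    # Shift the list so its average is 0, which is what the null claims.
    shifted = [d - sample_mean for d in differences]
    count = 0
    for _ in range(reps):
        resample = rng.choices(shifted, k=n)
        # Count resamples whose average is at least as high as the real one
        # (the tiny tolerance guards against floating-point ties).
        if sum(resample) / n >= sample_mean - 1e-9:
            count += 1
    return count / reps

print(differences)  # [5, 6, -1, 1, 1]
print(observed)     # 2.4
print(shifted_bootstrap_p_value(differences))
```

With only five pairs the simulated p-value is coarse, and it will vary a little from run to run if the seed is changed.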

Notes

Build new intro video series:

1. distinguishing data types

2. calculate and interpret CI of difference

3. hypothesis statements and how they differ between data types

4. calculate p-values for differences, cause/link/no-link interpretation of result

Fix solutions to include a-j
