3. Sampling

Learning objectives (and summaries)

Identify a population of interest and develop a plan to collect data from a sample that yields varied, but unbiased, statistics.

    • Distinguish between a census and a sample

    • A census involves asking everyone, which gives complete information. When it is practical, a census beats any sample; sampling is used when the population is too big or too hard to reach.

    • Sampling makes it possible to predict things about an entire population when you only collect information from a few individuals.

    • Define a population of interest and use a sampling method that will produce a representative sample

    • For example: if you want to study high school boys in Minnesota (the population of interest), you could do a random sample of a few boys in Minnesota high schools (the sample).

    • Understand that a statistic, based on a sample, is used to estimate a parameter, based on the population. Memorize the meanings of common symbols such as μ, σ, x-bar, sx, p, and p-hat.

    • μ is the population mean (a parameter), x-bar is the sample mean (a statistic).

    • σ is the population standard deviation (a parameter), sx is the sample standard deviation (a statistic).

    • p is the population proportion (a parameter), p-hat is the sample proportion (a statistic).

    • (see link at the bottom of the page for more practice with these!!)

    • Understand that sampling is inexact but predictable in the long run

    • We EXPECT variation between samples, but on average our samples should find the true parameter. Larger samples have less variation than small samples in their ability to estimate the parameter.

    • Sample with a SRS (simple random sample)

    • Every individual has an equal chance of being picked. For example, you could pick names from a hat or randomly generate numbers.

    • Sample systematically and know when to do so

    • Choose a random starting point and survey every nth individual. For example, you could randomly select the 8th person to start and then ask every 15th person after them (8, 23, 38, ...). Best for transient populations (walking by you and you don't know what the eventual size will be).

    • Sample with a stratified random sample and know when it is appropriate

    • First break the population down into categories called “strata” (such as different grades or male/female). Then do an SRS within each category. If one category has a smaller population, they should have proportionately fewer people sampled. For example, if a school had 1000 students with 180 freshmen, a sample of 100 students should include 18 freshmen.

    • Use when groups are very different from each other, but within each group individuals are similar. Replacing an SRS with a stratified sample decreases variation if you use proportional representation and stratify by a variable that makes members of the population different in a relevant way.

    • Sample with a cluster sample and know when it is appropriate

    • Select an entire group at a time, such as one or two entire herds of elk.

    • Use when groups are very similar to each other, but within each group individuals are very different such that 1 or 2 groups produces a mixed sample. Be careful -- cluster samples tend to be biased because each cluster usually has similar types of people. For example, if choosing one section from a stadium, a sample of people in the luxury box probably holds different opinions than a sample of people in the cheapest bleachers.

    • Understand why voluntary samples and convenience samples do not produce samples representative of the population

    • Convenience samples are biased because the surveyor often chooses people who are easy to access, and those people usually do not represent everyone.

    • Voluntary samples are biased because the people who choose to volunteer often care more about the topic or behave differently than the typical person. This is why TV, internet, and radio call-in polls are typically worthless as a measure of the population.

    • Understand/recognize bias through undercoverage, non-response, and lying.

    • Undercoverage is when individuals are considered part of the population but do not have a chance to be sampled. For example, if you randomly sampled elderly residents and contacted them via email, the residents who do not have an email account would have no chance of being selected. The reason this is a form of bias is that the people who can be selected and the people who cannot often have a difference in opinion but only one side (the chosen ones) get to speak.

    • Non-response is when individuals are selected to participate in a survey but choose not to. This could lead to bias if the people who are more likely to respond think differently than those who sit out. If non-response gets high enough, it starts to look like a voluntary survey (bad). However, don’t over-react to non-response: the Pew Research group has found that non-response rates as high as 70% don’t have a noticeable effect on their surveys. As a rule of thumb, don’t worry about non-response if at least half of your sample responds.

    • Lying/forgetting is an obvious form of bias -- if a survey question is overly personal / embarrassing or the person asking is somehow intimidating the person answering the survey, you might run into untruthful data.
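
To see the objective "sampling is inexact but predictable in the long run" in action, here is a rough simulation sketch in Python. It is not part of the original notes: the population is made up (10,000 random values), and the function name sample_means is just an illustration. It draws many simple random samples of two different sizes and compares how much the sample means bounce around.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=0):
    """Draw many simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population: 10,000 made-up values (not real data).
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter we are trying to estimate

small = sample_means(population, sample_size=10, num_samples=2000)
large = sample_means(population, sample_size=100, num_samples=2000)

print(f"parameter mu = {mu:.2f}")
print(f"n = 10:  average x-bar = {statistics.mean(small):.2f}, "
      f"spread of x-bar = {statistics.stdev(small):.2f}")
print(f"n = 100: average x-bar = {statistics.mean(large):.2f}, "
      f"spread of x-bar = {statistics.stdev(large):.2f}")
```

Individual samples miss μ, but the average of the sample means should land very close to it, and the spread should be clearly smaller for the larger sample size.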

Assessment (22 core points)

    • Survey activity (team grade, 8 pts, rubric below with project description)

    • Test (14 pts): 11 questions (9 MC, 2 numeric), plus 1 of these free response questions (3 pts):

      • When collecting data, there are many places to make errors in predicting population parameters. In plain English (no stats vocab like undercoverage), clearly describe at least four distinct types of error/bias.

      • An SRS is the gold standard of sampling. Yet, we show over and over that sample statistics are not equal to the population parameters. Why aren't statisticians bothered by this?

      • Imagine that a research team called 2000 students to conduct a survey. However, they were concerned about bias due to the high non-response rate. To address this, they called another 2000 students to increase the sample size. Decide if this will help the problem of non-response bias and explain why or why not.

Instruction

Printable guided notes: version 1, version 2, PDF

Extra review on keeping your symbols straight.

Vocabulary

census - a survey given to every member of a population

cluster sampling - the population is divided into small, evenly mixed groups; a few whole groups are picked by SRS, and everyone in those groups serves as the sample

non-response - when individuals who were selected for a survey do not respond to it

parameter - a number that describes a population, such as its mean or median

population - who or what is being studied

sample - a small portion taken from a larger population

simple random sample (SRS) - a random sample in which every individual has an equal chance of being picked

strata - “layers” of a population, divided by some characteristic. Each layer (stratum) is a group of people who share that characteristic, and each layer is different from the others (Ex: grade)

stratified sample - a sample taken not from the population as a whole but from each stratum of the population. You need to do an SRS within each stratum (ex: a proportional number of students from each grade)

systematic sample - first you estimate the population size and decide how many people you want to sample, then divide the two numbers to get n; pick a random starting point and sample every nth person

undercoverage - when some members of the population have no chance of being surveyed, for example a person who was gone the day of the survey
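
The four sampling methods defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, the school is hypothetical (only the 1000 students and 180 freshmen come from the example in the objectives), and the homeroom clusters are invented for the demo.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every individual has an equal chance of being picked."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Random starting point, then every step-th individual after it."""
    start = rng.randrange(step)  # index 0..step-1, i.e. person 1..step
    return population[start::step]

def stratified_sample(strata, n, rng):
    """SRS inside each stratum, sized in proportion to the stratum."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        k = round(n * len(members) / total)  # proportional share
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, num_clusters, rng):
    """Pick whole groups by SRS; everyone in a chosen group is sampled."""
    chosen = rng.sample(list(clusters), num_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(1)  # fixed seed so the run is repeatable

# Made-up school of 1000 students; only the 180 freshmen come from the notes.
sizes = {"fr": 180, "so": 270, "ju": 260, "se": 290}
strata = {g: [f"{g}-{i}" for i in range(size)] for g, size in sizes.items()}
population = [p for members in strata.values() for p in members]
# Hypothetical clusters: 40 homerooms of 25 students each.
clusters = {f"room{j}": population[25 * j: 25 * (j + 1)] for j in range(40)}

srs = simple_random_sample(population, 100, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 100, rng)
cluster = cluster_sample(clusters, 4, rng)

freshmen_in_sample = sum(1 for p in stratified if p.startswith("fr-"))
print(len(srs), len(systematic), len(stratified), len(cluster))
print("freshmen in stratified sample:", freshmen_in_sample)
```

Note that the homerooms here are just consecutive chunks of the student list, so most of them hold a single grade. That is exactly the situation where a cluster sample tends to be biased; a real cluster sample wants each cluster to be well mixed.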

Survey activity

In teams, you need to design and conduct a survey. You must choose a target population that you want to study, such as Byron coffee drinkers, BHS 9th graders, etc. Then you need to decide how to take a random sample using one of the techniques above to yield at least 20 people in your sample. You must ask your sample at least two questions -- one with a numerical response (such as "how many cups of coffee have you purchased this week?"), and one with a yes/no/true/false response (such as "are you a Packer fan?"). You may add more questions if you're interested.

Once you collect your results for the numerical question, find the mean and standard deviation, using the proper symbols for a sample. Describe your results in a sentence. Repeat for the yes/no question by finding the sample proportion and describing it in context.

    • Rubric on survey activity (team grade, 8 pts):

      • 1 pt: selected a target population

      • 4 pts: chose a survey method to get a representative sample of the population AND executed this method

      • 3 pts: write a sentence that summarizes the results of each question on your survey. Each sentence should include the population you studied, the question you asked, and the result (either mean+standard deviation or proportion) with correct symbols for a sample. For example: We estimate that Byron homeowners have an average of x-bar=1.1 acres of land with a standard deviation of s = 0.8 acres.
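
If you want to check your calculator work for the summary step, here is a minimal Python sketch. The data below are invented purely for illustration (20 hypothetical responses to each example question); swap in your own survey results.

```python
import statistics

# Hypothetical answers to "how many cups of coffee have you purchased this week?"
cups = [0, 2, 5, 1, 3, 0, 4, 2, 1, 0, 6, 2, 3, 1, 0, 2, 4, 1, 3, 2]

# Hypothetical answers to "are you a Packer fan?" (True = yes)
packer_fan = [True] * 7 + [False] * 13

x_bar = statistics.mean(cups)             # sample mean, x-bar
s = statistics.stdev(cups)                # sample standard deviation (divides by n - 1)
p_hat = sum(packer_fan) / len(packer_fan) # sample proportion, p-hat

print(f"x-bar = {x_bar:.2f} cups, s = {s:.2f} cups, n = {len(cups)}")
print(f"p-hat = {p_hat:.2f}")
```

Note that statistics.stdev is the sample version (s); statistics.pstdev would give the population version (σ), which is not what you want for a sample.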

Practice

Do all people have an equal, random chance of being selected? Answer yes or no:

    • 1. You walk through a large crowd and interview the first person who makes eye contact with you.

    • 2. You write 100 names on slips of paper, put them in a hat, and draw out 4 of them.

    • 3. You number people 1-24 and then roll 4 dice and add them together to see which person to choose.

    • 4. You number people 1-35 and use a random number generator to create a number between 1 and 35.

    • 5. You break the population into 3 strata: in K-12 school, in tech post-secondary school, and not in school. You randomly select 40 people from each category.

You want to study the favorite juices of Olmsted County residents. To do this, you go through the phone book and randomly select 70 people to call. Of those you call, 32 pick up and answer your questions.

    • 6. What is the population? What symbols represent the mean and standard deviation?

    • 7. Is there any undercoverage? If yes, who?

    • 8. Is there any non-response? If yes, what is the non-response rate?

    • 9. What is the sample? What symbols represent the mean and standard deviation?

Bias is everywhere. Find it below.

    • 10. Manny wants to study the buying patterns of Walmart shoppers. To do so, he sets up a station from 1-4pm in the front of a Walmart location and interviews every 30th person entering the store to ask them what they intend to buy.

    • 11. A non-profit organization wants to get a picture of charity donation habits of the state. They use random digit dialing to computer-generate random phone numbers in the state’s area codes (this is a method used by many pollsters to avoid the unlisted number/cell phone problem). 43% of the 300 people called answered and responded to the question. The question asked people how much they gave to charity during the last tax year and how many different organizations they donated at least $40 to during that time.

    • 12. Fox News created an online poll to sample what proportion of Americans support ongoing military intervention in Afghanistan. 10,546 people answered the survey before it closed.

    • 13. Many fast food restaurants want feedback from their customers, so they offer a free cookie/Whopper/etc. on the back of their receipts if you call in and take a short phone survey.

A school athletic director wants to know how student athletes feel about the programs the school offers and the coaching it provides. He wants to be sure to hear balanced perspectives from all of the fall teams. Imagine there are 67 boys football players, 26 girls volleyball players, 22 cross country runners, 19 girls soccer players, and 17 boys soccer players.

    • 14. What type of sampling ensures that each group is appropriately represented?

    • 15. To produce a sample of roughly 20 students, how many should be sampled from each group?

    • 16. What is the stats name for a group like this?

    • 17. How could you decide specifically who to sample in each group?

A school of about 600 students wants to systematically sample 12 students as they enter the building.

    • 18. Every __th person gets sampled (fill in the blank)

    • 19. When sampling like this, you need to generate a random number. Why?

    • 20. Let’s say you generated the number 34. List which 12 people you will need to ask to be in your sample as they come through the door.

Know your sampling technique: list the method used in each scenario.

    • 21. The radio station reads off a number for you to call in and give your opinion.

    • 22. The lottery machine generates 6 numbers between 1-30 to determine the winner.

    • 23. A national firm breaks the country into groups by race and gender and chooses a few people from each group.

    • 24. Every 10th student is selected when they exit the mall.

    • 25. Names are drawn from a hat.

    • 26. You give every person a number and generate a random integer on your calculator.

    • 27. You break the county into groups by their mail carrier and make everyone in each of 3 randomly selected groups your sample.

Practice solutions

    1. No -- some people don't make eye contact very often, and you may not have picked a truly random place to enter the crowd.

    2. Yes (assuming all the pieces of paper are the same size, of course)

    3. No -- person 1, 2, and 3 will never be chosen (1 + 1 + 1 + 1 gives a lowest number of 4) and some of the middle values are much more likely than the end values.

    4. Yes

    5. No, because there are not equal numbers of people in each category/stratum, so people in a less common stratum would have a higher chance of being selected.

    6. The people of Olmsted County, mu, sigma

    7. Yes -- anyone not in the phone book

    8. Yes -- 38 people did NOT answer, so 38/70 = 54% non-response

    9. The 32 people who answered the phone (NOT all of the people that were called!), mean: x-bar, standard deviation: s

    10. Several sources of bias are possible:

      • People may not buy what they intended to.

      • People may lie about what they want to buy.

      • You may get a biased group of people -- only those who are not working / have unusual hours so that they could shop 1-4pm.

      • Many people may refuse to participate in the face-to-face study.

    11. Several sources of bias are possible:

      • Some people (like me!) have out-of-state area codes but live here.

      • Charity giving is seen as good by society, so some people may lie about their giving.

      • People might forget about how much they gave. They may tend to think they gave more than they did.

      • Many people never responded to the survey -- more than half of the people called (57% non-response).

    12. The survey was voluntary and only announced to people who watch Fox News. There is a good chance that the group of people who both knew about the survey and took the time to go online and vote have political views that don't represent the entire country.

    13. It is a voluntary survey and the most likely people to take it are between 16-25 (people who buy their own fast food but can't afford / would rather not pay for that extra cookie). Pro tip: if you want to just get the free cookie and answer the fewest number of questions possible, just press "5" for every question (out of 5) so it doesn't ask follow-up questions. Kids that want free cookies don't care much about giving quality feedback to the Subway corporate office.

    14. Stratified random sample

    15. There are 151 total players, so:

      • Football: (67/151) * 20 = 8.87 ≈ 9

      • Volleyball: (26/151) * 20 = 3.44 ≈ 3

      • Cross Country: (22/151) * 20 = 2.91 ≈ 3

      • G Soccer: (19/151) * 20 = 2.52 ≈ 3

      • B Soccer: (17/151) * 20 = 2.25 ≈ 2

      • Note that because of rounding with such a small sample, volleyball and boys soccer get a little bit under-represented and the others are a bit over-represented.

    16. strata!

    17. Do an SRS of each team: number from 1 to ___ and randomly generate numbers

    18. 600/12 = 50. Every 50th person.

    19. It's your random starting point -- with systematic samples, you can't just start at person #1 because that would be biased towards early people. Since there are ~600 students and you want a sample of 12, 600/12 = 50, so use a random number generator to generate your first number between 1 and 50, start with that person, and then go every 50th person from there.

    20. Select person #34, 84, 134, 184, 234, 284, 334, 384, 434, 484, 534, and 584 as they come through the door.

    21. Voluntary

    22. SRS -- it will randomly generate a number

    23. Stratified random sample -- because you sample a few people from EVERY group

    24. Systematic sample

    25. SRS

    26. SRS

    27. Cluster -- because you first pick the groups, then ask everyone in those groups
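
The arithmetic behind the athletic-director and school-door problems above (proportional allocation across the five teams, then a systematic sample starting at person 34) can be double-checked with a short Python sketch. The variable names are just for illustration.

```python
# Team sizes from the athletic director scenario.
teams = {
    "football": 67,
    "volleyball": 26,
    "cross country": 22,
    "girls soccer": 19,
    "boys soccer": 17,
}
total = sum(teams.values())  # 151 players
target = 20                  # desired sample size

# Proportional allocation: each team's share of the 20, rounded to a whole person.
allocation = {team: round(target * size / total) for team, size in teams.items()}
for team, size in teams.items():
    exact = target * size / total
    print(f"{team}: {exact:.2f} -> {allocation[team]}")

# Systematic sample: about 600 students, sample of 12, random start of 34.
step = 600 // 12             # every 50th person
start = 34                   # the randomly generated starting point
selected = [start + step * i for i in range(12)]
print(selected)
```

Comparing each exact value with its rounded value shows which teams end up slightly over- or under-represented.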

Other Practice Problems and Solutions

Do all people have an equal, random chance of being selected? Answer yes or no:

    1. You walk through the middle of Times Square and interview the first person that makes eye contact with you.

    2. You write 50 names on slips of paper, put them in a box, and draw 5 of them.

    3. You number 18 people 1-18 and then roll 3 dice and add them together to see which person to choose.

    4. You number 1-10 and use a random number generator to create a number between 1 and 10.

    5. You split the population into 3 categories: in K-12 school, in college, and not in school. You randomly select 40 people from each category.

You want to study the favorite fruit of Olmsted County residents. To do this, you go through the phone book and randomly select 50 people to call. Of those you call, 26 answer.

6. What is the population?

7. Is there any undercoverage? If yes, who?

8. Is there any non-response? If yes, what is the rate?

9. What is the sample?

Find the bias in these articles:

10. A student wants to see if Justin Bieber or Miley Cyrus is more popular. They go to a Justin Bieber concert and survey every 20th person that walks through the door.

11. Sports Illustrated created an online survey to sample what percentage of Americans liked watching football rather than baseball. 15,324 people answered the survey before it closed.

12. Many clothing stores want feedback from their customers, so they will give you a free hat or key chain if you complete a survey on the back of their receipts.

13. William Benjamin Borkenhagen III wants to find out the purchasing habits of people in his community. He uses random digit dialing to computer generate phone numbers in his community (he has a community phone book).

A representative at the School of Circus Acrobats (SCA) wants to know how students attending SCA feel about the program and the teachers it provides. He wants to be sure to hear balanced perspectives from all of the different groups of students. Imagine there are 201 senior students, 78 junior students, 66 sophomore students, 57 freshman students, and 51 high school students.

14. What type of sampling ensures that each group is represented?

15. To produce a sample of roughly 60 students, how many should be sampled from each group?

16. What is the stats name for a group like this?

17. How could you decide specifically who to sample in each group?

A team of about 100 players wants to systematically sample 10 players as they enter the gym.

18. Every __th player gets sampled (fill in the blank)

19. When sampling like this, you need to generate a random number. Why?

20. Let's say you generated the number 2. List the first 5 players you will need to ask to be in your sample.

Know your sampling technique: list the method used in each scenario.

21. The radio station reads off a number for you to call in and win concert tickets.

22. The powerball machine generates 1 powerball between 1-35.

23. A football team breaks the team into groups by race and weight and chooses a few from each group.

24. Every 20th student is selected when they exit the mall.

25. Names are drawn from a jar.

26. You give every person a number and generate a random integer on your calculator.

27. You break the city into groups by their mail carrier and make everyone in each of the 3 randomly selected groups your sample.

Extra Practice Answers

    1. No, some people make eye contact more often than others.

    2. Yes

    3. No, persons 1 and 2 cannot be chosen (the lowest possible total is 1 + 1 + 1 = 3), and middle totals are more likely than the extremes.

    4. Yes

    5. No, because there is not an equal number of people in each category.

    6. The residents of Olmsted County, mu, sigma

    7. Yes, anyone who is not in the phone book

    8. Yes, 24 out of 50 did not answer, so 48% non-response rate.

    9. The 26 people who answered the phone.

10. People at a Justin Bieber concert will most likely be biased towards him.

11. Not all Americans go on SI.com. Some people can’t go online. You could get a biased group of people.

12. People may come just to get a free hat or key chain. People may not care about the feedback.

13. Some people may have out-of-state phones. Some people may not have their number in the phone books.

14. Stratified Random Sample

15. 453 total students

Seniors: (201/453) * 60 = 26.62 ≈ 27

Juniors: (78/453) * 60 = 10.33 ≈ 10

Sophomores: (66/453) * 60 = 8.74 ≈ 9

Freshmen: (57/453) * 60 = 7.55 ≈ 8

High School: (51/453) * 60 = 6.75 ≈ 7

(After rounding, this adds up to 61 students, which is still roughly 60.)

16. Strata

17. Do an SRS of each class: number 1-? and randomly generate numbers

18. 100/10 = 10. Every 10th person

19. It's your random starting point. You can't just start with the first person because you would be biased towards early people.

20. Since there are 100 players and you want a sample of 10, do 100/10 = 10, so skip to every 10th person. Starting at player #2, the first five are players #2, 12, 22, 32, and 42.

21. Voluntary

22. SRS

23. Stratified random sample

24. Systematic sample

25. SRS

26. SRS

27. Cluster
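
For the dice questions (number 3 in both practice sets), it can help to see every person's exact chance of being chosen. The following Python sketch lists every possible roll and counts the totals; the function name sum_distribution is just an illustration, not something from the course materials.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(num_dice):
    """Chance that each numbered person is picked when dice totals choose them."""
    rolls = product(range(1, 7), repeat=num_dice)  # every possible roll
    counts = Counter(sum(roll) for roll in rolls)
    total = 6 ** num_dice
    # People are numbered 1 up to the largest possible total.
    return {
        person: Fraction(counts.get(person, 0), total)
        for person in range(1, 6 * num_dice + 1)
    }

# 18 people, 3 dice (the extra practice version of the question).
dist = sum_distribution(3)
for person, chance in dist.items():
    print(f"person {person:2d}: {float(chance):.4f}")
```

Persons 1 and 2 have zero chance, and the people in the middle of the list are far more likely to be picked than those at the ends, so this is not an equal-chance method. Calling sum_distribution(4) gives the 24-person version from the first practice set.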

Notes