Sampling and bias

The total length of the videos in this section is approximately 69 minutes, but you will also spend time answering short questions while completing this section.


You can also view all the videos in this section at the YouTube playlist linked here.

Sampling and bias - let's get started

SamplingAndBias.1.Introduction.mp4

Question 1: I am interested in the proportion of current Wellesley students who have ever taken a statistics course. What would I need to calculate the estimand?

Show answer

A list of all current Wellesley students, along with whether they have taken a statistics course. The first option allows us to calculate the estimand, the quantity of interest, rather than estimate it. The third option is one estimate of this estimand. The second option is a set of information we could use to generate an estimate of this estimand.

Question 2: Suppose that you are testing a new drug meant to relieve symptoms of menopause. You want to know whether people who take the drug experience headaches.

Which of the following best describes the target population?

Show answer

All women in menopause. Depending on who you are working for, you may want to take this a step further: your target population is likely the set of women in menopause to whom you plan to market the drug. For example, maybe you work for an American company that aims only at customers in North America. Then you would only need data on North American women in menopause in order to calculate the estimand.

The fourth option leads to an important issue: the units (people) available for study are often not representative of the units in the target population.

It turns out that the U.S. Census is controversial, statistically. Pause for a moment and think about why: if we set out to find every person in the target population, why might we end up with a set of people who are not representative of the target population?

One 2009 article in the Wall Street Journal, written in advance of the 2010 Census, said that an attempt to count every person would tend to miss people in "traditionally Democratic areas." (The article is here, but it is behind a paywall that you may or may not be able to get past with a library log-in.)

Question 3: Why would an attempt to count every person in the US tend to miss people in "traditionally Democratic areas"?

Show answer

Areas that tend to vote for Democrats also tend to have larger populations of hard-to-count people, because of poverty, homelessness, immigration status, limited English proficiency, etc. This is only one of many legal areas where the letter of the law (that everyone must be counted) contradicts the statistical technique needed to follow the spirit of the law (learning about the US population).

Sampling

SamplingAndBias.2.Sampling.mp4

Question 4: Suppose that we are interested in a very small target population - maybe we are interested in the final exam scores of students in a first year seminar at Wellesley. There are 6 students in the target population, and their final exam scores were 70%, 80%, 80%, 90%, 90%, 100%. The estimand is the mean final exam score - because we have the final exam scores for all 6 students in the target population, we can see that the mean final exam score is 85%.

Suppose that we don't have time to gather data from all 6 students, so we choose a simple random sample of 2 students from the 6 in the class. We will use the mean of these 2 students' exam scores to estimate the estimand. Which of the following are possible values of the estimate?

Show answer

All are possible except for 100%. The point is that each possible sample leads to an estimate. There are multiple possible samples, and therefore multiple possible estimates. This is why the process of sampling leads to uncertainty in our estimate: we know we could have obtained a different estimate of the estimand had we selected a different sample.
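The point about multiple possible samples can be checked by brute force. The following is a minimal sketch (the variable names are ours, not from the videos) that lists every possible sample of 2 students from the 6 scores given in the question and computes the estimate each one produces.

```python
from itertools import combinations

# Final exam scores (in percent) for the 6 students in the target population.
scores = [70, 80, 80, 90, 90, 100]

# The estimand: the mean over the whole target population.
estimand = sum(scores) / len(scores)

# Enumerate every possible sample of 2 distinct students. We sample by
# position, so the two students who scored 80 count as different people.
samples = list(combinations(range(len(scores)), 2))

# Each sample yields one estimate: the mean of its 2 scores.
estimates = [(scores[i] + scores[j]) / 2 for i, j in samples]

print("estimand:", estimand)
print("number of possible samples:", len(samples))
for value in sorted(set(estimates)):
    count = estimates.count(value)
    print(f"estimate {value:.0f}% arises from {count} sample(s)")
```

There are 15 equally likely samples under simple random sampling, and 100% never appears as an estimate because only one student scored 100%. Averaging the 15 estimates gives back 85%, the estimand, even though any single sample may miss it.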

We start with sampling in this course because sampling is the foundation of traditional statistical techniques (like t-tests, p-values, ANOVA/regression...). However, the buzz-phrase in the data analysis world these days is "big data." When handling giant, computer-generated data sets, sometimes we really do have a census, and we can calculate rather than estimate the estimand. As discussed in the video, maybe we're working for a big online retailer, and we have access to a database describing all customer transactions in the last month. We don't have to estimate the average amount of money paid in these transactions: we can just calculate the average. In contexts like these, techniques based on sampling are not appropriate (people use them anyway, though).

The biggest problem with big data analysis, in my opinion, is that it is hard for people trained in a certain field to gain the skills they need to be effective. Some statisticians don't know enough about computing to handle big data efficiently, and computer scientists are sometimes not aware of important statistics fundamentals. As the field of data science evolves, we seek to do better!

Sample population

SamplingAndBias.3.SamplePop.mp4

Question 5: Who might be in the sample population, but not the target population?

Show answer

If I set up a table in front of the Science Center, some of the people walking by might be from other colleges, or they might be non-students. If I circulate a survey link, some of the people who receive it might not be in the target population, either. Etc.

Sample and respondents

SamplingAndBias.4.SampleAndRespondents.mp4

Question 6: Which step do we have the most control over?

Show answer

We have the most control over choosing the sample from the sample population. Our choice of sample population is usually constrained by practical concerns (where will we get a list of possible participants?), and we can't tell people whether to respond. But we can choose whom we try to include.

For a particular study, there can be more than one way to define the sample and the respondents. Consider the examples below.

Example 1:

Kily is interested in surveying Wellesley students. She puts all Wellesley students' names in a hat and draws 50. When she tries to contact those 50, only 30 respond.

Question 7: How many people are in the sample?

Show answer

50. Here, the definitions are clear. The target population is all Wellesley students. The sample population is all Wellesley students. The sample consists of the 50 students whose names were drawn. The respondents are the 30 who respond.
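As a rough illustration of how the sample and the respondents differ, here is a small sketch of the names-in-a-hat procedure. The roster size and the response probability are made-up assumptions, so the number of respondents it prints will not match the 30 in the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical roster: the enrollment figure is made up for illustration.
roster = [f"student_{i}" for i in range(2400)]

# Sample population = target population = everyone on the roster.
# Drawing 50 names from the hat is a simple random sample without replacement.
sample = rng.sample(roster, 50)

# Assumed response behaviour: each sampled student responds independently
# with probability 0.6 (an assumption, not a figure from the example).
respondents = [student for student in sample if rng.random() < 0.6]

print(len(sample), "in the sample;", len(respondents), "respondents")
```

The sample is fixed at 50 by design, while the number of respondents depends on behaviour we do not control.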

Example 2:

Kily is interested in surveying Wellesley students. She posts a link to her survey on social media. Forty people respond.

Question 8: How many people are in her sample?

Show answer

Either could be correct. The target population is all Wellesley students. The sample population consists of anyone who can see her social media post (not a subset of the target population!). We could say that her sample is the same as the sample population, and there are 40 respondents; or, we could say that her sample consists of the 40 people (not selected randomly!), and all responded.

Sampling methods

SamplingAndBias.5.SamplingMethods.mp4

Question 9: Suppose that a lawyer wants to ask for feedback from clients who come in with a certain type of legal case. Which sampling methods might be appropriate?

Show answer

The clients arrive one at a time rather than all at once. So, a sequential sampling scheme like systematic or Bernoulli sampling might work well. If the lawyer decided to contact past clients, then she would have more options, such as simple random sampling from the list of past clients whose cases meet the target qualifications.
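The two sequential schemes mentioned here can be sketched in a few lines. This is only an illustration under assumed details: the function names, the client list, the interval of 5, and the inclusion probability of 0.2 are all hypothetical.

```python
import random

def systematic_sample(arrivals, k, rng):
    """Pick a random start among the first k arrivals, then take every k-th."""
    start = rng.randrange(k)
    return [unit for i, unit in enumerate(arrivals) if i % k == start]

def bernoulli_sample(arrivals, p, rng):
    """Include each arriving unit independently with probability p."""
    return [unit for unit in arrivals if rng.random() < p]

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical clients, listed in the order they come in.
clients = [f"client_{i}" for i in range(1, 41)]

every_fifth = systematic_sample(clients, 5, rng)
coin_flips = bernoulli_sample(clients, 0.2, rng)

print("systematic:", every_fifth)
print("Bernoulli: ", coin_flips)
```

Neither scheme needs the full list in advance, which is why both suit clients who arrive over time. Systematic sampling fixes the sample size once the number of arrivals is known, while the Bernoulli sample size is itself random.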

Bias

SamplingAndBias.6.Bias.mp4

Question 10: Suppose that I approach Wellesley students in the Science Center and ask them about their drug use. Which type of bias should I be most concerned about?

Show answer

I was thinking about non-response bias here. Non-response bias occurs when the people (or units) who don't respond may be different from those who do respond. I'd expect that those who refuse to answer questions about drug use may have different drug use patterns from those who are happy to describe their drug use.

Note that there is also selection bias, because the students I approach in the Science Center are not representative of any particular target population. Suppose that my target population is all Wellesley students. Then, the students in the Science Center are probably mostly science majors. The ones I approach might be the ones who look like they might have a moment to talk. These less-frazzled science majors are not representative of all Wellesley students.
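To see how non-response alone can distort an estimate, consider the following simulation sketch. Every number in it (population size, true rate of drug use, response probabilities) is an assumption chosen for illustration, not data about Wellesley students. To isolate non-response, it pretends that we approach every student, so there is no selection bias.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: all numbers here are assumptions for illustration.
N = 10_000
TRUE_RATE = 0.30          # assumed share of students who use drugs
P_RESPOND_USER = 0.30     # assumed chance that a user agrees to answer
P_RESPOND_NONUSER = 0.80  # assumed chance that a non-user agrees to answer

uses_drugs = [rng.random() < TRUE_RATE for _ in range(N)]

# Everyone is approached, but each person decides whether to respond.
responses = []
for user in uses_drugs:
    p_respond = P_RESPOND_USER if user else P_RESPOND_NONUSER
    if rng.random() < p_respond:
        responses.append(user)

population_rate = sum(uses_drugs) / N
respondent_rate = sum(responses) / len(responses)

print(f"rate in the population:    {population_rate:.3f}")
print(f"rate among respondents:    {respondent_rate:.3f}")
```

Under these assumptions the respondents understate drug use, because users are less likely to answer. Collecting more responses would not fix this, since the gap comes from who responds rather than from how many.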

The 2020 US Census has been controversial for a different statistical reason than the usual arguments about sampling v. census. The Trump administration proposed that the Census include a question about citizenship, even though the Census is supposed to count non-citizens as well as citizens, as congressional seats are determined based on total populations in each geographic region. The Supreme Court blocked this proposal after the Census Bureau argued that it would not be possible to gather representative data if this question were included. Earlier, we were worried about the fact that it would be hard to collect data on every undocumented immigrant. But imagine the non-response bias if the survey included a citizenship question! One example of a news article on this topic is here.

Comparing two groups, and summary table

SamplingAndBias.7.Assigning Units to Groups.mp4

Question 11: Suppose that I'm interested in comparing the effectiveness of two over-the-counter allergy medications, Sudafed and Zyrtec, for the patients who see a particular doctor. I obtain the doctor's records and look at whether each of her patients took Sudafed or Zyrtec. Then, I look at the patients' reported symptoms after a few months on their drug. I observe a difference and conclude that there is a causal effect of Sudafed v. Zyrtec on symptoms for this doctor's patients. What's the problem?

Show answer

The treatment groups are not representative of each other. I drew a conclusion about this doctor's patients only, and that's who I studied, so my target population is the same as my sample. However, it's not fair to draw a causal conclusion, because the people who take Sudafed instead of Zyrtec (and vice versa) are likely doing that for a reason. Perhaps the doctor suggests Sudafed to a certain type of patient, for example. Then, we don't know whether the differences in symptoms are due to the drug or the baseline characteristics of the patients.
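A small simulation can show how baseline characteristics alone produce a difference between groups. In this sketch, everything is hypothetical: the severity scale, the rule that the doctor suggests one drug to sicker patients, and the assumption that neither drug has any effect at all.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is reproducible
N = 10_000

def mean(values):
    return sum(values) / len(values)

# Hypothetical patients: baseline severity on a made-up 0-10 scale.
severity = [rng.uniform(0, 10) for _ in range(N)]

# Assumption: NEITHER drug changes symptoms, so later symptoms are just
# baseline severity plus noise. Any gap between groups is not a drug effect.
symptoms = [s + rng.gauss(0, 1) for s in severity]

def group_gap(takes_first_drug):
    """Mean symptoms of the first drug's group minus the other group's."""
    first = [y for y, t in zip(symptoms, takes_first_drug) if t]
    other = [y for y, t in zip(symptoms, takes_first_drug) if not t]
    return mean(first) - mean(other)

# Observational records: assume the doctor suggests the first drug to
# patients whose baseline severity is above 5.
doctor_choice = [s > 5 for s in severity]

# Randomized comparison: a coin flip decides the drug instead.
coin_flip = [rng.random() < 0.5 for _ in range(N)]

observed_gap = group_gap(doctor_choice)
randomized_gap = group_gap(coin_flip)

print(f"gap in the doctor's records: {observed_gap:.2f}")
print(f"gap under random assignment: {randomized_gap:.2f}")
```

With these assumptions, the doctor's records show a large gap even though the drugs do nothing, while random assignment gives a gap close to zero. This is the sense in which groups formed by choice are not comparable.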

Question 12: I want to study the effect of Wellesley students' majors on their income after college. I gather data from the college about all students who graduated in a particular class year. Will I be able to draw conclusions about the causal effect of major on income?

Show answer

No. The students who choose English are likely different in terms of background characteristics from the students who choose physics. I could draw a causal inference if I assigned major randomly or if major were uncorrelated with any personal characteristics that might relate to future income. Some of you asked if we could ever learn about causation from this data set. You could try, but you would have to find students of different majors who are exactly alike up until the point of choosing a major, and this will be difficult if not impossible. If this were my research problem, I might try to identify a subset of students whose majors ended up being essentially random - for example, maybe a key course was not offered in a particular year, so half of the students who wanted to be astrophysics majors had to be physics majors instead.

Now you are done!

During this tutorial you learned:


Terms and concepts:

Estimate, estimand, unit, respondents, sample, sample population, target population, causation, generalizability, census, simple random sampling (SRS), stratified random sampling, cluster sampling, systematic/sequential sampling, Bernoulli sampling, non-response bias, selection bias, hypothesis testing, test statistic, null hypothesis, p-value