Introduction
Why Sample?
Pool of possible cases is too large (e.g., 320 million Americans as of 2014) -- would cost too much and take too long
Don't want to use up the cases: e.g., if you are a manufacturer testing your light bulbs to see how long they last, you take some bulbs and leave them on until they burn out. You can't test all the bulbs this way, because there wouldn't be any left to sell.
It's not necessary to survey all cases: taking a sample yields estimates that are accurate enough for most purposes
The trade-off is that sampling does introduce some error. You didn't interview everybody, so certain opinions or combinations of opinions won't be represented in your data. When the population is very diverse, your sample can't include all the possible combinations of attributes that are found in the population, such as blacks and whites, men and women, cardiac patients and non-patients, black women, white men, white women with heart trouble who like Oprah but don't like Ally McBeal, etc.
Populations, Sampling Frames, and Elements
Population is the universe of cases. It is the group that you ultimately want to say something about. For example, if you want to report 'what Americans think about Clinton', then the population is all Americans.
Elements are the individual cases in the population (usually, persons)
Sampling ratio is size of sample divided by size of population. Contrary to popular belief, a large sampling ratio is not crucial.
Sampling frame is a specific list of names from which sample elements will be chosen. The Literary Digest poll in 1936 used a sample of 10 million, drawn from government lists of automobile and telephone owners. It predicted that Alf Landon would beat Franklin Roosevelt by a wide margin, but instead Roosevelt won by a landslide. The reason was that the sampling frame did not match the population: in 1936, automobile and telephone owners were disproportionately well-off, and they were the ones who favored Landon.
Replacement. Sampling with replacement means that after you draw a name out of the hat and record it, you put the name back and it can be chosen again. Sampling without replacement means that once you draw the name out, it is not available to be chosen again.
Bias. Systematic errors produced by your sampling procedure. For example, if you sample people and ask them whether they watch Ally McBeal, but the percentage always comes out too high (maybe because you are interviewing your friends and your whole group really likes Ally McBeal), then your procedure is biased. Unlike random error, bias does not shrink as the sample gets larger.
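The difference between sampling with and without replacement, and the sampling ratio, can be sketched in a few lines of Python. The ten-name frame and the sample size of 5 are made up for illustration; only the standard library's random module is used.

```python
import random

# Hypothetical sampling frame: 10 names standing in for a real list.
frame = [f"person_{i}" for i in range(1, 11)]

rng = random.Random(42)  # fixed seed so the draw is repeatable

# Without replacement: each name can appear at most once.
without_replacement = rng.sample(frame, k=5)

# With replacement: a name goes back in the hat and may be drawn again.
with_replacement = rng.choices(frame, k=5)

# Sampling ratio = size of sample / size of population.
sampling_ratio = len(without_replacement) / len(frame)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
print("sampling ratio:     ", sampling_ratio)
```

With replacement, the same name may show up more than once in the second list; without replacement, it never can.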
Non-Probability Sampling
Haphazard/Convenience
Whoever happens to walk by your office; whoever's on the street when the camera crews come out
If you have a choice, don't use this method. Often produces really wrong answers, because certain attributes tend to cluster with certain geographic and temporal variables. For example, at 8am in NYC, most of the people on the street are workers heading for their jobs. At 10am, there are many more people who don't work, and the proportion of women is much higher. At midnight, there are young people and muggers.
Quota
Haphazard sampling within categories (e.g., first 5 males to come by)
Is an improvement, but still has problems. How do you know which categories are key? How many do you get of each category?
Purposive/Judgement
Expert judgement picks useful cases for study
Good for exploratory, qualitative work, and for pre-testing a questionnaire.
Snowball
Recruiting people based on recommendation of people you have just interviewed
Useful for studying invisible/illegal populations, such as drug addicts
Probability Sampling
A simple random sample (SRS) is a sampling scheme in which every individual has the same probability of being chosen. It is one kind of probability sample, which is any sampling scheme in which each individual's probability of being chosen is known (and greater than zero), so that unequal probabilities can be corrected mathematically by weighting. Probability sampling requires more work than convenience sampling, but is much more accurate. Probability samples also allow the researcher to calculate the amount of error she can expect, which is very valuable.
Simple Random
Develop a sampling frame, then randomly select elements. For example, place all names on cards, then randomly draw cards from a hat. In Excel, the RAND() function can attach a random number to each row; you can then sort by that number and take the N largest.
Typically use sampling without replacement, but with replacement can be done (and is easier mathematically)
Any one sample is likely to yield statistics (such as the average income or the percentage of respondents that watch Ally McBeal) that are different from the population parameters
The average statistic from many random samples should equal the population parameter. In other words, if you took 150 different samples of Americans, each of 300 people, and calculated the percentage that like Ally McBeal in each of the samples, then averaged all those percentages together, that should equal the "real" percentage of all Americans that like Ally McBeal
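It is the Law of Large Numbers that guarantees that as the number of random samples increases, the average of the sample statistics converges on the population parameter. The Central Limit Theorem adds that those sample statistics are approximately normally distributed around the parameter.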
Because of these mathematical guarantees, we can estimate how far off a sample might be from the population, giving rise to confidence intervals
Random samples are unbiased and, on average, representative of the population.
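The claim about averaging many samples can be checked with a small simulation. This is only a sketch: the population of 100,000 and the "true" figure of 30% are invented for illustration, while the 150 samples of 300 people follow the numbers used above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 people; 1 = likes the show, 0 = does not.
# The 30% figure is invented purely for illustration.
population = [1] * 30_000 + [0] * 70_000
true_pct = 100 * sum(population) / len(population)

def sample_pct(n):
    """Percentage of 'likes' in one simple random sample of size n."""
    sample = rng.sample(population, k=n)
    return 100 * sum(sample) / n

# 150 independent samples of 300 people each.
pcts = [sample_pct(300) for _ in range(150)]

print(f"true percentage:     {true_pct:.1f}")
print(f"first sample:        {pcts[0]:.1f}")
print(f"mean of 150 samples: {statistics.mean(pcts):.1f}")
print(f"std dev of samples:  {statistics.stdev(pcts):.1f}")
```

Any single sample is typically off by a few percentage points, but the mean of the 150 sample percentages lands very close to the true value. The standard deviation of the sample percentages is the quantity that confidence intervals are built from.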
Example. A company of 680 employees wants to know whether to bother with instituting a program to deal with employee drug-taking. To find out, they will test a sample of employees on an anonymous basis: if a person tests positive, the company will not know who it is and will not try to find out. The objective is solely to estimate what percentage of the company might be doing drugs. If the percentage is high enough, the company will consider instituting a mandatory drug testing program. Given this objective, a simple random sampling design is perfect: the results will generalize to the whole company.
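For the 680-employee example, the "attach a random number, sort, take the N largest" procedure might look like the sketch below. The sample size of 100 is an assumption (the example does not say how many employees will be tested), and the employee IDs 1 to 680 stand in for a real personnel list.

```python
import random

rng = random.Random(2024)  # fixed seed so the draw is repeatable

# Sampling frame: employee IDs 1..680 (hypothetical stand-in for a real list).
frame = list(range(1, 681))

# Assumed sample size; not given in the example.
n = 100

# Attach a random number to each ID, sort by it, and keep the n largest.
keyed = [(rng.random(), emp_id) for emp_id in frame]
keyed.sort(reverse=True)
selected = sorted(emp_id for _, emp_id in keyed[:n])

print(selected[:10])
print(len(selected), "of", len(frame), "employees selected")
```

Because every ID gets its own independent random number, every employee has the same chance of ending up in the top n, which is what makes this a simple random sample without replacement.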
Stratified Sampling
Better than simple random sampling in terms of efficiency, but sometimes not possible (you must know each element's stratum in advance)
Procedure is this: Divide the population into strata (mutually exclusive classes), such as men and women. Then randomly sample within strata.
Suppose a company is 80% male and 20% female. To get a sample of 100 people, we could randomly choose 80 males (from the population of all males) and, separately, choose 20 random females. Our sample is then guaranteed to have exactly the correct proportion of sexes.
Especially important when one group is so small (say, 3% of the population) that a random sample might miss them entirely.
Stratified sampling is more efficient (enabling smaller samples) than simple random sampling when strata tend to be homogeneous within and heterogeneous between strata
Example. The VP for Human Resources of a large manufacturing company is considering creating a stress-management program for employees. To get an idea of what kinds of needs the program would have to fill, she will interview a sample of 50 employees first. If she does a simple random sample, it's possible that her sample will not include any representatives of some of the smaller departments, just by chance. Since she knows that different kinds of jobs within the company produce different kinds of stress, she wants to get separate samples from the workmen (who handle dangerous chemicals), the foremen (who balance the interests of the workmen with management), and the managers (who are responsible to shareholders). So she uses a stratified random sample.
See also the Wikipedia entry on stratified sampling.
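The 80% male / 20% female case can be sketched as a proportionate stratified sample. The company size of 1,000 and the helper name stratified_sample are assumptions made for illustration.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed so the draw is repeatable

# Hypothetical company of 1,000 employees: 80% male, 20% female.
employees = [(f"emp_{i}", "M" if i < 800 else "F") for i in range(1000)]

def stratified_sample(units, stratum_of, n, rng):
    """Proportionate stratified sample: a simple random sample within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        # Allocate the sample in proportion to stratum size.
        # (With awkward proportions, rounding may make the total differ slightly from n.)
        k = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_sample(employees, lambda e: e[1], 100, rng)
counts = Counter(sex for _, sex in sample)
print(counts)
```

The counts come out as exactly 80 males and 20 females on every run, whereas a simple random sample of 100 would only hit those proportions on average.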
Cluster Sampling
Used when (a) a sampling frame is not available or too expensive, and (b) the cost of reaching an individual element is too high
E.g., there is no list of automobile mechanics in the US. Even if I could construct it, it would cost too much money to reach randomly selected mechanics across the entire US: would have to have unbelievable travel budget
In cluster sampling, first define large clusters of people. These clusters should have a lot of heterogeneity within, but be fairly similar to other clusters. For example, cities make good clusters.
Then sample among the clusters. Then once you have chosen the clusters, randomly sample within the clusters.
Clusters might be cities. Once you've chosen the cities, might be able to get a reasonably accurate list of all the mechanics in each of those cities. Is also much less expensive to fly to just 10 cities instead of 2000 cities.
Cluster sampling is less expensive than other methods, but less accurate, because each stage introduces its own sampling error.
Suppose you want to sample college students. You start by sampling 300 colleges, then choose 10 students from each college. The problem is that if the colleges are of different sizes, a student at a big college is less likely to be chosen than a student at a small college. So you need to choose a fixed proportion of students from each college, not a fixed number. Alternatively, don't choose colleges with equal probability: let the big schools be more likely to be in the sample. This is called PPS, Probability Proportionate to Size sampling.
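The arithmetic behind the college example can be worked through directly. In this sketch the five enrolment figures are invented, 2 colleges are chosen in the first stage, and 10 students per college in the second. A student's overall chance is (chance the college is chosen) times (chance the student is chosen within it); for PPS the college's chance is taken to be proportional to its enrolment.

```python
from fractions import Fraction

# Hypothetical enrolments for five colleges (invented numbers).
sizes = {"A": 500, "B": 1000, "C": 2000, "D": 3000, "E": 3500}
total = sum(sizes.values())   # 10,000 students in all
n_colleges = 2                # colleges chosen in stage 1
per_college = 10              # students chosen per college in stage 2

probs = {}
for name, size in sizes.items():
    # Colleges equally likely, fixed number of students per college.
    p_equal = Fraction(n_colleges, len(sizes)) * Fraction(per_college, size)
    # PPS: college's chance proportional to its size, fixed number per college.
    p_pps = Fraction(n_colleges * size, total) * Fraction(per_college, size)
    probs[name] = (p_equal, p_pps)
    print(f"{name}: equal-chance colleges {float(p_equal):.5f}, PPS {float(p_pps):.5f}")
```

Under equal-chance selection, a student at the smallest college is several times more likely to be sampled than one at the largest. Under PPS the college size cancels out, so every student has the same chance.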
Example. Once a quarter, a large retail chain sends auditors to randomly chosen stores to check that proper procedures are being carried out. They look at the physical layout, the interactions between staff and customers, backroom procedures, and so on. A simple random sample could have an auditor visiting a California store one day, a New York store the next, then another California store, and so on. Using cluster sampling, the auditor might first select a random sample of states, then visit a random sample of stores within each state, thus reducing travel time.
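The auditor's two-stage procedure could be sketched as follows. The chain layout (20 states with 15 stores each), the choice of 4 states and 3 stores per state, and the helper name two_stage_sample are all assumptions for illustration.

```python
import random

rng = random.Random(11)  # fixed seed so the draw is repeatable

# Hypothetical chain: 20 states with 15 stores each (invented for illustration).
stores_by_state = {
    f"state_{s:02d}": [f"state_{s:02d}/store_{k:02d}" for k in range(15)]
    for s in range(20)
}

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: random sample of clusters. Stage 2: random sample within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

visits = two_stage_sample(stores_by_state, n_clusters=4, n_per_cluster=3, rng=rng)
for state, stores in visits.items():
    print(state, stores)
```

All 12 audited stores fall in just 4 states, which is where the travel savings come from. Since every state here has the same number of stores, each store has the same chance of selection; with unequal states, the size adjustment described above would be needed.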
Sample Size for Simple Random Samples
The bigger the better, up to 2500. Beyond 2500, it just doesn't matter (accuracy increases very slowly after this point)
The smaller the population, the bigger the sampling ratio that is needed.
For populations under 1000, you need a sampling ratio of about 30% (e.g., 300 elements from a population of 1000) to be really accurate.
For populations of about 10,000 need sampling ratio of about 10%
Samples smaller than 20 are too small for classical statistics. But you may be able to use permutation methods.
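The rules of thumb above can be related to the usual margin-of-error formula for a proportion estimated from a simple random sample. This is a sketch under stated assumptions: a 95% confidence level (z = 1.96), the worst case p = 0.5, and the standard finite population correction when the population size N is given. The function name margin_of_error is hypothetical.

```python
import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from an SRS of size n.

    p=0.5 is the worst case. If the population size N is given,
    the finite population correction is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return z * se

# Very large population: accuracy depends on n, not on the sampling ratio.
for n in (100, 500, 1000, 2500, 10000):
    print(f"n={n:>6}: +/- {100 * margin_of_error(n):.1f} points")

# Small populations: the correction rewards a large sampling ratio.
print(f"N=1000,  n=300:  +/- {100 * margin_of_error(300, N=1000):.1f} points")
print(f"N=10000, n=1000: +/- {100 * margin_of_error(1000, N=10000):.1f} points")
```

Going from 2500 to 10000 respondents only narrows the margin from about 2 points to about 1 point, which is why very large samples are rarely worth the cost.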