As a statistical investigation progresses, we should always think back to the investigative question, and the best way to answer it.
In an ideal world, we would collect data on every cookie produced in March. We'd spend a whole month weighing every cookie, and counting all the chocolate chips, in both factories.
But to count the chocolate chips, we need to break the cookies apart. A census of all of the cookies produced would destroy all of the cookies. It is unlikely the cookie company can afford to waste this many cookies to let us collect data.
So instead, we want to take a sample.
We want that sample to fairly represent the groups we are investigating.
How we sample in the real world depends on many factors. Some kinds of sampling are harder to implement than others, depending on the nature of the data set.
In this standard, we are given a large data set and asked to sample from it. Large data sets are hard to deal with, and often it's not possible to get data on the whole population.
It is very important that you take a SAMPLE and answer your investigative question using the SAMPLE. The point of this standard is to make an inference about the population from a sample, and you cannot do that unless you take one.
It is a bit artificial in this standard to take a sample from a large data set - why not use all the data available?
We need to see that you know how to sample, and can make an inference about the population from the sample that you take.
Also, every student's sample from the data provided will be different.
A simple random sample chooses a random collection from the population.
Every object in the population has the same chance of being selected for the sample.
A disadvantage is that if one group makes up only a small part of the population, it will usually make up only a small part of the sample too.
For instance, left-handers make up about 10% of the population. In a sample of 150, about 15 would be left-handed, and the rest would be right-handed.
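A simple random sample can be sketched in a few lines of Python. This is only an illustration: the population of 1000 people, the labels and the function name simple_random_sample are all made up for the example, and the seed is there just so the same sample can be drawn again.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n objects so that every object has the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population: 1000 people, 10% of them left-handed.
population = ["left"] * 100 + ["right"] * 900

sample = simple_random_sample(population, 150, seed=1)
left_count = sample.count("left")

# We expect roughly 15 left-handers, but the exact count varies by sample.
print(len(sample), "people sampled,", left_count, "left-handed")
```

Changing the seed gives a different sample, which is why the number of left-handers is only "about" 15.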
A stratified sample takes a specified number of objects from each group we are interested in. The sample is randomly chosen from within each group.
This is typically the same number from each group.
This is good for when one group is under-represented in the population.
It doesn't matter that the overall sample is not representative; each group in the stratified sample is representative of the population group.
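One possible way to take a stratified sample is sketched below. The data, the group labels and the function name stratified_sample are assumptions made for this example. It takes the same number from each group, chosen at random within the group.

```python
import random
from collections import defaultdict

def stratified_sample(records, group_of, n_per_group, seed=None):
    """Randomly pick n_per_group records from within each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)  # sort records into groups
    # Take a separate simple random sample inside every group.
    return {name: rng.sample(members, n_per_group)
            for name, members in groups.items()}

# Hypothetical population: 100 left-handers and 900 right-handers.
records = ([("left", i) for i in range(100)]
           + [("right", i) for i in range(900)])

strata = stratified_sample(records, lambda r: r[0], 75, seed=1)
for name, members in strata.items():
    print(name, len(members))
```

Here left-handers are half of the overall sample even though they are only 10% of the population, so the overall sample is not representative, but each group's sample is.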
A systematic sample is taken by using a preset pattern and taking objects from the population systematically.
For instance, the sample might use every 50th object in a list.
An advantage is that this is easy to do.
A disadvantage is that if there is a pattern in the data, the systematic sample might over-represent (or under-represent) some subset of the population.
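The sketch below shows a systematic sample that takes every 50th object, and how a pattern in the list can go wrong. The random starting point is an assumption added for this example, and the list that alternates between factories "A" and "B" is invented to show the problem.

```python
import random

def systematic_sample(items, step, seed=None):
    """Take every step-th object, starting from a random position."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # assumed: random start within the first step
    return items[start::step]

# A list of 1000 objects with no particular pattern.
items = list(range(1000))
every_50th = systematic_sample(items, 50, seed=1)
print(len(every_50th), "objects sampled")

# A hypothetical list that alternates between two factories: A, B, A, B, ...
alternating = ["A", "B"] * 500
patterned = systematic_sample(alternating, 50, seed=1)

# Because 50 is even, every sampled object comes from the same factory.
print(set(patterned))
```

The second sample contains only one factory, so it over-represents that factory and leaves the other out completely.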
Cluster sampling takes a sub-group from the population based on how it is grouped.
If the population was grouped randomly, then we can expect a cluster to be representative.
However, there might be an underlying pattern that makes it different. For example, more movies released in December are blockbusters, compared to movies released in January. Using a cluster of movies released in December would not be representative of the population.
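Cluster sampling can be sketched as picking one whole group and keeping everything in it. The movie list, the release months and the function name cluster_sample below are all hypothetical; they only show the mechanics, not real data.

```python
import random

def cluster_sample(records, cluster_of, seed=None):
    """Pick one cluster at random and return every record in it."""
    rng = random.Random(seed)
    clusters = sorted({cluster_of(r) for r in records})
    chosen = rng.choice(clusters)
    return chosen, [r for r in records if cluster_of(r) == chosen]

# Hypothetical population: 5 movies released in each of the 12 months.
movies = [("Movie %d-%d" % (month, i), month)
          for month in range(1, 13) for i in range(5)]

month, cluster = cluster_sample(movies, lambda m: m[1], seed=1)
print("Chosen month:", month, "- movies in cluster:", len(cluster))
```

Every movie in the sample comes from the same month, so if that month is unusual (such as December), the sample will be unusual too.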
Quota sampling involves giving quotas to be filled in certain subgroups. This can cause problems when samples are no longer 'random'. Bias can be introduced when we go looking for subgroups only in the places where we expect to find them the easiest.
Convenience sampling involves finding all of your sample in the same location, ignoring whether that location is representative of the population. Often this leaves out locations where the population is less common.
Self-selected sampling allows objects in the population to choose to be part of your sample. Those that choose to take part are often different from those that do not, so the sample is very unlikely to be representative of the whole population.
Your plan should outline how you intend to take your sample, and why you have chosen to sample in this way.
Using NZ Grapher (or similar software) we have two choices for sampling from a large data set: simple random or stratified sampling.
The sample size, or sizes of the samples from each group, should be specified.
Depending on the size of the data set provided, one sample of 200, or two samples of 100 each (one from each group), gives enough data to answer the investigative question.
There may be some circumstances where you can justify taking a small sample from one group and a larger sample from another.
The Plan section is important, but not a large part of your report. It should include:
Worksheet 3 has more practice on the Plan section. Look at it after you've worked through the section on Data.