We don't need to say much about the data, because it is provided for us. It is either a full population, or a large sample.
Before using a data set, we should check that it contains the data we expect. Look through it for blanks or unexpected values, and clean up any problems we find.
Since we intend to take a sample, it makes more sense to check and clean only the sample. It doesn't matter if one or two rows in the sample are removed in the cleaning process. We don't want to clean the entire data set if we don't have to!
We should correct an entry if we can see an obvious data entry error. For instance, if the engine size of a Toyota Corolla was recorded as 179 cubic centimetres, this could be corrected to match most of the other Corollas, since it was probably meant to be 1798 cubic centimetres.
There is one error in the GULLS data set.
This plot shows the full data with the first variable LENGTH and the second variable LOCATION, coloured by SEX.
All of the lengths look reasonable, and there are only two values for the SEX categorical variable.
However, we were expecting only four values of LOCATION, and there are five subplots. It looks like "MURWAI" should be "MURIWAI", and we could correct this error (it is on row 12 of the data set).
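The same check can be done without a plot. The sketch below is only an illustration, not part of NZ Grapher: it lists blank entries and counts any category values we did not expect. The rows are invented, and "SITE_B", "SITE_C" and "SITE_D" are placeholder names standing in for the other three locations.

```python
from collections import Counter

# Invented rows for illustration only; these are NOT the real GULLS data.
rows = [
    {"LENGTH": 41.2, "LOCATION": "MURIWAI", "SEX": "F"},
    {"LENGTH": 43.0, "LOCATION": "MURWAI", "SEX": "M"},   # misspelt location
    {"LENGTH": 39.8, "LOCATION": "SITE_B", "SEX": "F"},
    {"LENGTH": None, "LOCATION": "SITE_C", "SEX": "M"},   # blank length
    {"LENGTH": 42.5, "LOCATION": "SITE_D", "SEX": "M"},
]

# Placeholder names: only MURIWAI comes from the text above.
EXPECTED_LOCATIONS = {"MURIWAI", "SITE_B", "SITE_C", "SITE_D"}


def find_blanks(rows):
    """Return (row_number, column) pairs for blank entries, counting rows from 1."""
    return [
        (number, column)
        for number, row in enumerate(rows, start=1)
        for column, value in row.items()
        if value is None or value == ""
    ]


def unexpected_values(rows, column, expected):
    """Count each value in `column` that is not in the expected set."""
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if value not in expected}


blanks = find_blanks(rows)
odd_locations = unexpected_values(rows, "LOCATION", EXPECTED_LOCATIONS)
print("Blank entries:", blanks)
print("Unexpected locations:", odd_locations)

# Correct only the error we are confident about, leaving everything else alone.
corrections = {"MURWAI": "MURIWAI"}
cleaned = [
    dict(row, LOCATION=corrections.get(row["LOCATION"], row["LOCATION"]))
    for row in rows
]
```

After the correction, running `unexpected_values` on `cleaned` should report nothing for LOCATION. The blank LENGTH is left in place here; whether to remove that row is a separate decision.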
In NZ Grapher, take a sample by choosing [Sample] from the menu, then either:
"Sample With [Simple Random]" and specify a total sample size, or
"Sample With [VARIABLE_NAME]" and specify the sample size for each subgroup.
Then click [Sample].
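When we specify a variable, we are doing stratified sampling: a separate sample is taken from each subgroup of that variable. This can be useful if one group is much more common than another, because it guarantees that the smaller group is still represented in the sample.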
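To make the difference between the two options concrete, here is a minimal sketch of both kinds of sampling. It is not how NZ Grapher does it internally; the function names and the population of 100 rows are made up for this example.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=None):
    """Pick n rows at random, without replacement, from the whole data set."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def stratified_sample(rows, column, n_per_group, seed=None):
    """Pick n_per_group rows at random from each subgroup of `column`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    sample = []
    for value in sorted(groups):
        # rng.sample raises ValueError if a subgroup has fewer than n_per_group rows.
        sample.extend(rng.sample(groups[value], n_per_group))
    return sample


# Invented population: 90 rows in one group and only 10 in the other.
population = [{"ID": i, "SEX": "F" if i < 90 else "M"} for i in range(100)]

srs = simple_random_sample(population, 20, seed=1)
strat = stratified_sample(population, "SEX", 10, seed=1)

print("Simple random: M count =", sum(row["SEX"] == "M" for row in srs))
print("Stratified:    M count =", sum(row["SEX"] == "M" for row in strat))
```

With the simple random sample, the number of rows from the smaller group is left to chance. With the stratified sample, it is always exactly the size we asked for.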
We are taking a sample and investigating the sample so that we can answer a question about the population.
We'll see later that the larger the sample, the better we can answer a question about the population.
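A quick simulation gives a feel for why sample size matters. This is only a sketch under stated assumptions: the population is invented (5000 lengths drawn from a normal distribution), and the sample median is chosen here as the summary statistic. It takes many samples of each size and measures how much the sample medians vary.

```python
import random
import statistics


def spread_of_sample_medians(population, sample_size, repeats, rng):
    """Take `repeats` random samples and return the standard deviation of their medians."""
    medians = [
        statistics.median(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(medians)


rng = random.Random(42)

# Invented population of 5000 lengths; not real data.
population = [rng.gauss(40, 3) for _ in range(5000)]

spread_small = spread_of_sample_medians(population, 10, 200, rng)
spread_large = spread_of_sample_medians(population, 100, 200, rng)

print("Spread of medians, samples of 10: ", round(spread_small, 2))
print("Spread of medians, samples of 100:", round(spread_large, 2))
```

The medians from the larger samples should vary noticeably less from sample to sample, so any one large sample is likely to be closer to the population value.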
In this standard, making an inference from a sample about a population is a little bit artificial. In the real world, we need to be able to work with a small sample because it will almost always be too difficult to take a full census. It would be too time-consuming, too expensive, or both.
Once we are stuck with working with a sample instead of the whole population, we need tools to tell us when we can safely make an inference about the population.
Worksheet 3 gives more practice for Problem and Data.