A local biscuit factory has just expanded to a new site. The old site and the new site both use the same recipe to make what should be the same cookies. Each cookie is expected to weigh 12 grams, but the company knows that there will be some variation.
A natural question to ask is whether the new factory is producing chocolate chip cookies that are bigger than those produced by the old factory.
This is not a sufficiently statistical question; we can do better in describing how to investigate data. We should describe in detail what we want to investigate, and what exactly we are trying to answer.
The cookies data set is described here. We are interested in the values of two variables from the data set: a categorical variable (FACTORY, which is either "old" or "new") and a numerical variable (WEIGHT, the weight in grams of each cookie).
An investigative question must:
use medians of a numerical variable
include units
compare between two subsets of a variable
predict a direction
describe the population being investigated.
Notice that there are other questions we could be asking about this data, for instance whether there is a relationship between the two numerical variables (WEIGHT and CHIP_COUNT).
However, for this standard, we need to use only one of the numerical variables, and make comparisons between two groups.
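As a rough sketch of what comparing one numerical variable between two groups involves, the Python snippet below groups some WEIGHT values by FACTORY and finds the median of each group. The sample values and the function name median_by_group are assumptions made up for illustration; they are not taken from the cookies data set, and your own analysis will use whatever tool your teacher has provided.

```python
from statistics import median

# Hypothetical sample rows (FACTORY, WEIGHT in grams). These values are
# made up for illustration and are NOT taken from the cookies data set.
sample = [
    ("old", 11.8), ("old", 12.1), ("old", 12.4), ("old", 11.9),
    ("new", 12.3), ("new", 12.6), ("new", 12.0), ("new", 12.5),
]


def median_by_group(rows):
    """Return a dict mapping each group to the median of its values."""
    groups = {}
    for group, value in rows:
        # Collect the numerical values separately for each category.
        groups.setdefault(group, []).append(value)
    return {group: median(values) for group, values in groups.items()}


medians = median_by_group(sample)
for factory, value in medians.items():
    print(f"Median WEIGHT for the {factory} factory: {value:.1f} g")
```

Only one numerical variable (WEIGHT) and one categorical variable (FACTORY) appear in the sketch, which matches what this standard asks for.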
A good investigative question for this standard looks like this:
I wonder whether
the median [numerical variable]
(in [units])
of the [categorical variable 1st group]
is [greater/less] than
the median [numerical variable]
of the [categorical variable 2nd group]
in the [population].
The question should only have one numerical variable and one categorical variable.
The population will be described in detail with the data set provided. The question is about the whole population, not about the samples we will be taking from it.
If there are no units (for example, a count of the number of chocolate chips in a cookie) they can be left out of the question. Most numerical variables have units.
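The template above can be sketched as a small Python function that slots each part into place and leaves the units out when there are none. The function name investigative_question, its parameter names, and the wording used for the groups and the population are all assumptions for illustration; use the population description that comes with your own data set.

```python
def investigative_question(variable, group1, direction, group2,
                           population, units=None):
    """Fill in the question template; units are optional (e.g. for counts)."""
    if direction not in ("greater", "less"):
        raise ValueError("direction must be 'greater' or 'less'")
    # Only mention units when the numerical variable actually has them.
    unit_part = f" (in {units})" if units else ""
    return (
        f"I wonder whether the median {variable}{unit_part} "
        f"of the {group1} is {direction} than "
        f"the median {variable} of the {group2} "
        f"in the {population}."
    )


# Hypothetical wording for the groups and the population.
weight_question = investigative_question(
    variable="weight",
    units="grams",
    group1="cookies from the new factory",
    direction="greater",
    group2="cookies from the old factory",
    population="population of cookies made at both factories",
)
print(weight_question)

# A count has no units, so the units part is simply left out.
chip_question = investigative_question(
    variable="number of chocolate chips",
    group1="cookies from the new factory",
    direction="less",
    group2="cookies from the old factory",
    population="population of cookies made at both factories",
)
print(chip_question)
```

Each call uses exactly one numerical variable and one categorical variable with two groups, and it predicts a direction, so the printed questions cover every part of the template.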
A good Problem section of your report has the following:
An introduction to the data set and the variables you are interested in.
A reason to be investigating the variables you are considering.
Some research or discussion of the variables relevant to the context, where possible. Provide links to any extra research.
The investigative question itself - this is the guiding purpose of the investigation.
There has to be a believable reason to expect a difference between the median values of the two groups being compared.
For example, the following question, while it fits the template, doesn't make a lot of sense.
I wonder whether the median length of red-billed gulls (in cm) in winter tends to be smaller than the median length of red-billed gulls in summer, in the population of all Auckland red-billed gulls in 2017.
There's no reason to expect the length of red-billed gulls to be smaller because of the season. Although a gull might weigh less if food is harder to find, we wouldn't expect them to get shorter.
Although there's nothing wrong with this question statistically, it won't be very interesting to discuss, because the question doesn't make much sense within the context.
Why are we interested in these variables? You could do some background research into the variables, to be sure you know what they measure, how they are measured, and what they mean. At Level 2, this usually means looking at resources related to the context provided by your teacher.
For the GULLS data set, this is the NZ Birds Online Red-Billed Gull website.
If possible, find a 'typical' value, such as the expected 'average' weight of a red-billed gull, or the engine size of the car your parents let you drive.
Sometimes, the purpose of an investigation is clear. Other times, we might need to look into the background information surrounding the data set and context to find out more and determine who might find the answer to our question useful.
With the cookies data set, the purpose is clear. There is a commercial interest in checking that the new factory is doing as good a job (or a better job) of making 12 gram cookies for packaging and selling to consumers. Product consistency is an important factor to check.
Your investigative question should reflect what you have uncovered. You should also make a general hypothesis:
Based on my background research, I think my sample will show that male red-billed gulls are larger than female red-billed gulls.
Your Conclusion section should use the sample to answer the investigative question.
You should also say where the data came from - this does not mean "from NZGrapher". A data source will have been given. For instance, the COOKIES data set comes from the cookie factories' production runs of March 23 2018.
Worksheet 1 gives more practice on writing investigative questions correctly.
Worksheet 2 gives more practice on background research and context.