When defining our problem statement we will need the following information:
Categorical Variable: The two groups that we are comparing, e.g. 'boys and girls', 'rural and urban', 'cats and dogs'.
Numerical Variable (with units): The measured variable that we are comparing between the two groups, e.g. height, age, weight, speed.
Population: The larger group from which our sample is being taken. This is the larger group we wish to make an inference about.
Direction: Stating a direction that you predict the result will go. These are your "-er" words.
When writing our problem statement, we will use the following template:
I wonder if the median travel time (minutes) for students who catch the bus is
longer than the median travel time (minutes) for students who walk,
for all high school students in NZ, for data from Census at School in 2015.
Additional Notes:
When comparing two groups, we want to know whether there is a difference between the populations of those groups. Because of this, it matters how we use the word 'all'. We do not expect every person in one group to be bigger/smaller than every person in the other group. Instead, we expect many people in one group to be bigger/smaller, which makes the central tendency (median) of that group bigger/smaller.
Secondly, when we compare the medians of our two groups, we are comparing the medians of the numerical variable. We do not say the 'median boy' or 'median girl', but instead the 'median height of boys' and the 'median height of girls'.
This section of your problem statement is to identify the following:
Explain why you have chosen your variables
Discuss who else might be interested in your investigation
Discuss any prior research you have done including references.
DO NOT discuss what your graphs show.
Example: I wonder if it could be possible to estimate the age of a kea using its crown length. Frederick J Jannett has shown that weights of skeleton parts, including the skull, can be used to predict the age of chickens. Apart from this, there appears to be little research into the use of crowns and beaks to predict age. If this investigation yields significant results, this could be used by conservation workers to more easily identify the age of kea without overly disruptive methods.
The final part of your problem section is your hypothesis.
Do you expect group 1 or group 2 to be bigger? Why?
Here you'll want to go over what you think the data will do and come up with some reasons why you think that. It's also essential to do some research to back up your reasoning. You can also refer to this research later in the analysis and conclusion as to whether or not it supported your findings.
You can write your hypothesis using the following format:
I expect the median numerical variable of category 1 would be bigger/heavier/faster than category 2 because of reason.
Research shows...
Here you want to discuss what your variables are (in detail) and how the data has/will be collected.
We want to take a big enough sample size, so that the results are reliable and precise enough to represent the population. The more data we have, the greater the precision of our results, and the lower the variation.
With a small sample size, it is much harder to find differences. With a larger sample size, you can find differences more easily.
It is also important that the people/objects that data is collected from are randomly selected so that the data is representative of the population. If people/objects have different chances of being selected, this will lead to a biased sample.
It is also best to discuss the source of the data, and how the data was collected if possible. For Excellence you can critique the reliability of the source.
Example: Crown length is the distance from the top of the beak to the back of the skull (occiput) and is measured in millimeters. The juvenile kea is between 1 and 2.4 years old; an adult kea is over 4 years of age. Since the kea in our sample came from a variety of locations throughout New Zealand we can consider that the population is all juvenile and adult kea in New Zealand.
When we discuss our sample there are two pieces of information that must be included: the sampling method and the sample size. Below are four methods of random sampling that we can choose. A suitable sample size is 30 < n < 100 per group. This can be justified by stating that it gives reasonably precise results without requiring a large amount of resource investment.
The sample size is the number of participants we will have in each group. The number we select is important because if it is too low, our results will not be precise enough to support a conclusion, but if it is too high, the data will be too costly and time-consuming to collect.
Smaller sample sizes take a shorter time to collect data, but results are less precise.
Larger sample sizes take a longer time to collect data and results are more precise.
When sampling two different groups, they do not need to be an equal size. In fact, it is usually better to have a sample that is representative of the population proportions. That being said, if a population is particularly diverse and some groups would be under-represented and might not reach a sufficient minimum to be statistically valid, we will want to stratify our sample.
Different sample sizes for each group will lead to different sampling variation. Sampling variation refers to the natural tendency for different samples drawn from the same population to produce different results.
The larger the sample, the less the sampling variation.
The confidence interval (CI) is a range of values that is likely to contain the true population parameter (median). The width of a CI is influenced by the sampling variation.
The larger the sample, the less the sampling variation, and the narrower the confidence interval.
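The link between sample size and sampling variation can be checked with a small simulation. The sketch below is illustrative only: the population is made-up weights (not real data), and the function name spread_of_sample_medians is a hypothetical helper. It takes many simple random samples of a given size and measures how much their medians vary.

```python
import random
import statistics

def spread_of_sample_medians(population, n, repeats=500, seed=1):
    """Take `repeats` simple random samples of size n and return the
    standard deviation of their medians (a measure of sampling variation)."""
    rng = random.Random(seed)
    medians = [statistics.median(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(medians)

# Hypothetical population: 5000 made-up weights in kg (an assumption, not real data).
pop_rng = random.Random(0)
population = [pop_rng.gauss(100, 10) for _ in range(5000)]

small = spread_of_sample_medians(population, 30)
large = spread_of_sample_medians(population, 300)
print(f"n=30:  spread of sample medians = {small:.2f} kg")
print(f"n=300: spread of sample medians = {large:.2f} kg")
```

The medians from the larger samples should sit much closer together, which is why the confidence interval from a larger sample is narrower.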
Although there are four sampling methods mentioned, we will only be focusing on two - Simple Random Sampling and Stratified Random Sampling.
*Note: Next year, we will only be using Simple Random Sampling. As such, it would be best to use Simple Random this year too.
Simple Random Sampling:
Simple random sampling is a method where we use a randomization tool to randomly select our sample from a population. Using this method, all members of a population are equally likely to be selected.
This method allows our sample to be proportionally representative of the population. This may result in my two groups having different sample sizes and, as a result, different sampling variation.
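As a minimal sketch of simple random sampling, Python's random.sample can play the role of the randomization tool. The population below is made up for illustration (600 hypothetical students labelled 'bus' or 'walk'), and the seed is fixed only so the same sample can be drawn again.

```python
import random

# Hypothetical population: (student_id, travel_method) pairs, made up for illustration.
population = [(i, "bus" if i % 3 == 0 else "walk") for i in range(600)]

rng = random.Random(2015)             # fixed seed so the sample can be reproduced
sample = rng.sample(population, 90)   # every student is equally likely to be chosen

# Split the sample into the two groups we want to compare.
bus = [s for s in sample if s[1] == "bus"]
walk = [s for s in sample if s[1] == "walk"]
print(len(bus), len(walk))  # the two group sizes will usually differ
```

Because the whole population is sampled at once, the group sizes roughly follow the population proportions rather than being equal.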
Stratified Random Sampling:
Stratified random sampling is a method where we first separate the population into groups (strata) based on a certain characteristic (e.g. ethnicity, gender, income).
From here, we randomly select members from each stratum for our sample. Usually an equal number is selected from each stratum.
This method allows for a fair comparison of two groups by having both be relatively equal in size, especially when our population is diverse and some groups may be underrepresented. This will likely mean, however, that my sample is not proportionally representative of the population.
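The two steps of stratified random sampling (split into strata, then sample randomly within each stratum) can be sketched as follows. This is an illustration under assumptions: stratified_sample is a hypothetical helper, the population is made up, and an equal number is taken from each stratum.

```python
import random

def stratified_sample(population, get_stratum, n_per_stratum, seed=None):
    """Split the population into strata, then take a simple random
    sample of n_per_stratum members from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(get_stratum(member), []).append(member)
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

# Hypothetical population: (student_id, travel_method) pairs, made up for illustration.
population = [(i, "bus" if i % 3 == 0 else "walk") for i in range(600)]

groups = stratified_sample(population, lambda m: m[1], 40, seed=7)
print({name: len(chosen) for name, chosen in groups.items()})
# prints {'bus': 40, 'walk': 40}
```

Both groups end up the same size even though walkers outnumber bus users two to one in this made-up population, so the sample no longer reflects the population proportions.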
Example: I will be using a sample of n=100 for each of our groups. This will allow us to obtain sufficiently precise results, which is desirable due to the importance of our investigation, but will not require extensive resources in gathering and interpreting the data.
To obtain our sample we will be using stratified random sampling. As this is a randomized sampling method, it will reduce selection bias in our sample while also representing each group equally.
Template - Simple Random (preferred):
When investigating the difference in numerical for participants between category 1 and category 2, I will take a simple random sample of n participants. I have chosen a sample of n because...
I have chosen simple random sampling because it will allow my sample to be proportionally representative of the population while removing bias due to randomizing. This may result in my two groups having different sample sizes and as a result, different sampling variation.
Template - Stratified Random (allowed):
When investigating the difference in numerical for participants between category 1 and category 2, I will take a stratified random sample of n participants who category 1, and n participants who category 2. This gives a total of n participants in my sample. I have chosen a sample of n because...
I have chosen stratified random sampling because it allows for a fair comparison of two groups by having both be equal in size, especially when our population is diverse and some groups may be underrepresented. This will likely mean, however, that my sample is not proportionally representative of the population.
Dot plots will create various shapes which we call distributions. There are six distributions that we will be focusing on.
Normal (Bell Shaped) Distribution
Left/Right Skewed
Triangular
Bimodal
Uniform
Irregular
We can identify which distribution our data follows by looking at its key features:
Symmetry
Tails
Peaks
A Box and whisker graph is broken down into four quarters, giving us five distinct points.
Minimum: The lowest value in our data set.
Lower Quartile (Q1/LQ): The value 25% of the way through our ordered data set.
Median: The value mid-way (50%) through our ordered data set.
Upper Quartile (Q3/UQ): The value 75% of the way through our ordered data set.
Maximum: The highest value in our data set.
Interquartile Range (IQR): We can calculate the IQR by finding the difference between the upper and lower quartiles (UQ - LQ)
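The five points and the IQR can be computed with a short sketch like the one below. The weights are made up for illustration and five_number_summary is a hypothetical helper. Note that there are several conventions for calculating quartiles, so the values from this sketch (which uses the "inclusive" method of Python's statistics.quantiles) may differ slightly from those reported by NZGrapher or a calculator.

```python
import statistics

def five_number_summary(data):
    """Return the minimum, LQ, median, UQ, maximum and IQR of the data."""
    # quantiles(..., n=4) returns the three cut points: LQ, median, UQ.
    lq, median, uq = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "LQ": lq,
        "median": median,
        "UQ": uq,
        "max": max(data),
        "IQR": uq - lq,  # IQR = UQ - LQ
    }

# Made-up weights in kg, for illustration only.
weights_kg = [88, 90, 91, 92, 92, 93, 95, 97, 101]
summary = five_number_summary(weights_kg)
print(summary)
```

For these nine values the lower quartile is 91 kg, the median is 92 kg and the upper quartile is 95 kg, giving an IQR of 95 - 91 = 4 kg.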
As shown earlier, we can identify a shape using its features of symmetry, peaks, and tails. We will often use all three when defining a distribution.
The shape of weights for the sample of back rugby players is approximately normal, because the weights are roughly symmetrical, unimodal, and follow a bell-shaped curve.
The shape of weights for the forwards rugby players is skewed to the right because the weights are unimodal, asymmetric and more spread out to the right-hand side.
Additional Notes:
When using NZGrapher we need to remember that when we identify the shape, we only consider the shape between the minimum and maximum values. (Use the whiskers on the box plot to remind you of these minimums and maximums).
The best way to do this is to use a highlighter to colour in the data section of the graph to allow you to focus on only that section. You also need to ignore any of the tails that extend beyond the data.
Compare the centres by finding the mean/median of both groups, identifying which group has a larger median, and then calculating how much bigger the mean/median is.
For Merit, make sure that you include the mean/median values and units of each group as evidence.
For Excellence, tell the story and connect to research, explaining why one group may (or may not) be bigger or smaller than the second group.
In my sample, the median weight for forwards rugby players is heavier than backs by 18 kg. The median weight for my sample of backs is 92 kg. The median weight for my sample of forwards is 110 kg. Forwards need more muscle and weight to be able to both hold the line and push the line forwards, whereas backs tend to need to run fast, which often means carrying less bulk than forwards.
Here you will be comparing the spread of the samples using either the interquartile range (IQR) or the standard deviation (SD).
The IQR is the width of the middle 50% of your sample data, and in our box plot, it is the length of the box.
The standard deviation can be found on NZGrapher and represents how far, on average, the data points are from the centre (mean).
When comparing the spread, don’t calculate the difference between the SDs/IQRs, but instead use descriptive language (a little wider, significantly wider, much wider).
For Merit, make sure that you include the SD/IQR values and units of each group as evidence.
For Excellence, you may be able to tell the story, explaining why one group may (or may not) be more or less spread out than the second group.
In the sample, the spread of the middle 50% of weights of back rugby players is a little wider than the spread of the middle 50% of weights of forward rugby players. The IQR of weights for back rugby players is 8kg whereas the IQR of weights for forwards rugby players is 6.5kg.
A confidence interval (CI) is a range of values that is likely to contain the population median/mean.
The median weight of rugby players who play in the backs, is likely to be between 90.5kg and 93.6kg, for ALL rugby players.
The median weight of rugby players who play in the forwards, is likely to be between 108.4kg and 112.7kg, for ALL rugby players.
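This guide does not say how the intervals above were calculated; in practice they are read from NZGrapher. As an illustration only, one informal interval commonly taught for comparing medians is median ± 1.5 × IQR / √n. The sketch below assumes that formula, and the sample size n = 60 is a made-up value, so its output should not be expected to reproduce the intervals quoted above exactly.

```python
import math

def informal_ci(median, iqr, n):
    """Informal confidence interval for a population median,
    assuming the rule: median ± 1.5 × IQR / sqrt(n)."""
    margin = 1.5 * iqr / math.sqrt(n)
    return (median - margin, median + margin)

# Median and IQR taken from the backs example; n = 60 is a hypothetical sample size.
low, high = informal_ci(median=92, iqr=8, n=60)
print(f"Informal CI: {low:.1f} kg to {high:.1f} kg")
```

The formula also shows why larger samples give narrower intervals: n appears under a square root in the denominator, so increasing n shrinks the margin.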
In order to make the call whether there is enough evidence we have to look at the confidence intervals and see if they overlap.
If the confidence intervals overlap, this means that the population median of one group could plausibly be smaller than, the same as, or larger than that of the other group. As you can likely tell, this is not very useful information. As such, we do not have enough evidence to say that the population median of one group is larger than the other.
If the confidence intervals do not overlap, this means that the only likely outcome is that one group is larger than the other. As such, we have enough evidence to suggest that the population median of one group is larger than the other.
The confidence interval for weights of rugby players who are playing in the back positions DOESN’T overlap with the confidence interval for weights of rugby players who play in forwards positions. This means that my sample suggests that back in the population, the medians aren’t likely to be the same. I can make the call, so I DO have enough evidence that the median weight of rugby players in the back position is lighter than the median weight of rugby players who play in forwards positions, for ALL rugby players.
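The overlap rule can be written as a simple check. The sketch below uses the two confidence intervals from the rugby example; intervals_overlap is a hypothetical helper name, and each interval is given as a (lower, upper) pair in kg.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) intervals share any values."""
    (low_a, high_a), (low_b, high_b) = ci_a, ci_b
    return low_a <= high_b and low_b <= high_a

backs = (90.5, 93.6)       # CI for median weight of backs (kg)
forwards = (108.4, 112.7)  # CI for median weight of forwards (kg)

if intervals_overlap(backs, forwards):
    print("Intervals overlap: not enough evidence to make the call.")
else:
    print("Intervals do not overlap: enough evidence to make the call.")
# prints "Intervals do not overlap: enough evidence to make the call."
```

Here the upper limit for backs (93.6 kg) is below the lower limit for forwards (108.4 kg), so the intervals do not overlap and the call can be made.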
Taking different samples from the same set of data will almost certainly produce different results. This will be the case whether sampling with or without replacement. This is known as sampling variability.
When discussing sampling variability, you want to address the following questions:
How would the data change?
How would the analysis be impacted?
How does this affect your confidence interval?
Will this impact your conclusion (making the call)?
This can be discussed as follows:
"If I took another sample, I would get different weights for rugby players as I would be collecting data from different rugby players. I would expect, though, the summary statistics (minimum, LQ, median, UQ, and maximum) for the weights of forwards and backs to be similar to the values in my sample. Because the median weights for forwards and backs would be similar, this would lead to similar confidence intervals, and therefore the conclusion that the median weight of forwards is larger than the median weight of backs is likely to stay the same."
This ties into the idea of sampling error very closely. Sampling error is the degree to which the sample mean/median is different from the population mean/median.
The larger the sample size, the smaller the sampling error.
When doing a statistical analysis like this, it is important to recognise that we are only looking at part of the picture. In our analysis we are only looking at two variables where other variables may be equally or more important. It's important to add sections where you believe this may be the case.
Here are a couple of examples:
There are a number of different factors that might affect the weight of children in NZ. For example, if a child has parents who are both slim and short in stature, then because of the genetic link the child is also likely to be slim and short in stature. Equally, a child whose parents have bigger and heavier bones and a wide/tall build is likely to be taller and heavier. A study published in the UK supports this and discusses how there is a link between people's weight and their genetics, where being slim is a heritable trait.
Another factor that could affect the weight of children is the amount of exercise they do each week. I expect that a child who is more active and spends more time each week exercising would have less body fat than a child who is less active and spends less time exercising each week. The more body fat a child has the higher their weight will be. Kids Health suggest that “kids can reach a healthy weight by eating right and being active”.
Always add references when including information outside of our analysis.
The final section of your report will be to mention any areas that may have gone wrong with your investigation and what further investigating may be required to answer your initial question.
A good example of this would be to look at your data and how it was gathered. If we are unsure about how the data was measured, there may have been faults in the data-gathering process that may lead to inaccuracies.
Here is an example:
It is likely that a number of different people, including volunteers, took these measurements, and perhaps not all measured beak lengths in the same way. I would be skeptical that a kea would stay still long enough for someone to get an accurate measurement.