When defining our problem statement we will need the following information:
Categorical Variable: The two groups that we are comparing, e.g. 'boys and girls', 'rural and urban', 'cats and dogs'.
Numerical Variable (with units): The measured variable that we are comparing between the two groups, e.g. height, age, weight, speed.
Population: The larger group from which our sample is being taken. This is the larger group we wish to make an inference about.
When writing our problem statement, we will use the following template:
What is the difference between the median numerical variable (units) of category 1 and category 2 for all population?
Additional Notes:
When comparing two groups, we want to know whether there is a difference between the populations that the groups come from. Because of this, how we use the word 'all' matters. We do not expect that every member of one group will be bigger/smaller than every member of the other group. Instead, we expect that many members of one group will be bigger/smaller, which makes the central tendency of that group bigger/smaller.
Secondly, when we compare the medians of our two groups, we are using the medians of the numerical variable. We will not say the 'median boy' or 'median girl', but instead the 'median height of boys' and 'median height of girls'.
This section of your problem statement is to identify the following:
Explain why you have chosen your variables
Discuss who else might be interested in your investigation
Discuss any prior research you have done including references.
DO NOT discuss what your graphs show.
I would like to know if juvenile kea and adult kea have different crown lengths. I wonder if it could be possible to estimate the age of a kea using its crown length. Frederick J Jannett has shown that weights of skeleton parts, including the skull, can be used to predict the age of chickens. Apart from this, there appears to be little research into the use of crowns and beaks to predict age.
The final part of your problem section is your hypothesis.
Do you expect group 1 or group 2 to be bigger? Why?
Here you'll want to go over what you think the data will show and come up with some reasons why you think that. It's also essential to do some research to back up your reasoning. You can refer to this research again in the analysis and conclusion when discussing whether or not your findings supported it.
You can write your hypothesis using the following format:
I expect the median numerical variable of category 1 will be bigger/heavier/faster than that of category 2 because of reason.
Research shows...
Achieved:
Context: According to Wikipedia when the titanic sank it only had enough lifeboats to carry about half of those on board and third-class passengers who I think would have paid way less for their fares were largely left to fend for themselves, causing many of them to become trapped below decks as the ship filled with water. In fact “54% of those in third class died”.
Question: I wonder if there is a difference between the median fares of people that were on the Titanic in 1912, who survived and who did not, according to a sample provided from a Titanic passenger list. I am doing this investigation to discover whether paying more money increased a passenger's chances of survival.
Hypothesis: I think that if you were in first class on the Titanic, it would have increased your chances of survival because in Wikipedia it says that the first class passengers were closer to the lifeboats.
Merit:
Question: I wonder what is the difference between the median fare of the Titanic passengers who survived and those who unfortunately didn't survive the horrific Titanic accident in 1912.
Context: My research shows that the median fare price that was paid by the survivors was higher than the median fare price that was paid by non-survivors. For example, my research states that a first class ticket costs around £83200 today and the third class tickets cost around £298-£793. Of the first class riders, 60% survived. I also discovered that first-class facilities and accommodation were located on the top decks of the Titanic to avoid the vibrations and noise of the engines which were at the bottom of the ship. This also meant that first class passengers were closer to the lifeboats. Only 25% of the third class ticket buyers survived, as they were closer to the bottom of the boat and because of this they were further from the lifeboats. Also, the Titanic itself was built to hold 32 lifeboats, however, only 20 were on board at the time. One of the lifeboats had the capacity to hold 65 people, however, only 27 were carried on the lifeboat, so I feel that there should have been more lifeboats on board for caution. In total there were 2228 people on board the Titanic, but only 705 survived and 1523 unfortunately perished. Using this research and data, I will be able to conclude whether or not it is statistically true that those who paid a higher price were more likely to survive.
Hypothesis: My prediction is that the median price a person who survived paid for a ticket will be higher. The fare is the price that passengers of the Titanic paid to board the ship in British pounds.
Excellence:
Context: House prices vary between New Zealand's cities and towns. If a teacher is moving to New Zealand from overseas and has to choose where to take a position out of the four main centres of Wellington, Auckland, Dunedin and Christchurch, house prices may influence their final choice of destination, especially if they have limited funds available to purchase a house. Having the right information available will help them to choose a centre where they can afford a house, and this will make it more possible for them to feel at home in New Zealand and not feel financial pressure. This report investigates whether there is a difference in North Island house prices compared to South Island house prices for median house sales price (NZD) in Auckland, Wellington, Christchurch and Dunedin in 2016 to help teachers choose where in New Zealand they may want to look at purchasing a house. Thus, the variables for this investigation are location by island (categorical) and 2016 house sale price (NZD). This data has come from the Real Estate Institute of New Zealand.
Question: What is the difference between the median house sales price (NZD) in Auckland, Wellington, Christchurch and Dunedin in 2016 for North Island median sales price compared to South Island median sales price?
Hypothesis: Given that North Island prices are higher than South Island prices, Auckland prices are higher than prices in other cities, and the Auckland housing market is so ‘hot’, it is expected this report will show that the median house sales price (NZD) in Auckland, Wellington, Christchurch and Dunedin in 2016 will be higher for North Island properties than South Island properties.
Sometimes when conducting an investigation, we can choose variables which are somewhat vague. Within our plan, it is always best to provide further clarification on what our variables mean, how they might best be measured, and if they possess any limitations or constraints on our overall investigation.
If possible, always refer to prior or new research and be sure to include any citations where needed.
Example:
Variable: Crown Length (mm)
The measurement would start at the very tip of the upper beak (the maxilla) and continue to the fleshy area at the base of the beak where it meets the feathers of the forehead. Despite the kea's beak being curved, any measurements taken will be tip-to-tip.
Variable: Age (Juvenile and Adult)
The age of kea will be split into two groups: Juvenile and Adult. A juvenile kea is one that is considered to be between 1 and 2.4 years old. An adult kea is aged 4 years or older.
We want to take a big enough sample so that the results are reliable and precise enough to represent the population. The more data we have, the greater the precision of our results and the lower the variation between samples.
With a small sample size, it is much harder to detect a real difference between the groups. With a larger sample size, differences can be found more easily.
It is also important that the people/objects that data is collected from are randomly selected so that the data is representative of the population. If people/objects have different chances of being selected, this will lead to a biased sample.
Since the kea in our sample came from a variety of locations throughout New Zealand we can consider that the population is all juvenile and adult kea in New Zealand.
Dot plots will create various shapes which we call distributions. There are six distributions that we will be focusing on.
Normal (Bell Shaped) Distribution
Left/Right Skewed
Triangular
Bimodal
Uniform
Irregular
We can identify which distribution our data follows by looking at its key features:
Symmetry
Tails
Peaks
A Box and whisker graph is broken down into four quarters, giving us five distinct points.
Minimum: The lowest value in our data set.
Lower Quartile (Q1/LQ): The data point that is 25% of the way through our ordered data set.
Median: The data point that is mid-way (50%) through our ordered data set.
Upper Quartile (Q3/UQ): The data point that is 75% of the way through our ordered data set.
Maximum: The highest value in our data set.
Interquartile Range (IQR): We can calculate the IQR by finding the difference between the upper and lower quartiles (UQ - LQ)
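The five points and the IQR can also be worked out by hand or with a short program. The sketch below is only an illustration: the function name and the weights are made up, and it assumes the "median of each half" method for the quartiles, which may not match the method a graphing tool such as NZGrapher uses.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, LQ, median, UQ, maximum) for a list of numbers.

    Assumption: quartiles are the medians of the lower and upper halves,
    leaving out the middle value when there is an odd number of data points.
    """
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[n // 2 + n % 2 :]
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]

# Made-up weights (kg), for illustration only
weights = [85, 88, 90, 92, 94, 96, 100]
minimum, lq, med, uq, maximum = five_number_summary(weights)
iqr = uq - lq  # IQR = UQ - LQ

print(minimum, lq, med, uq, maximum)  # 85 88 92 96 100
print(iqr)                            # 8
```

Other quartile methods can give slightly different LQ and UQ values for the same data, so small differences from software output are expected.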
When finding the measure of central tendency, we will need to use either mean or median and justify our decision.
Median: The middle number (or the mean of the two middle numbers) when our data is put in order.
Mean: The sum of all data values divided by the number of pieces of data.
Mode: The most common data value.
Although the mean is often used as a better measure of the average, it can be heavily affected by extreme values. For example:
The following scores were achieved on a math test where the passing mark was 50:
21, 25, 32, 40, 97, 100
Mean = 52.5 Median = 36
If we were to use the mean as our measure of centre, we would conclude that on average, the class passed the test. If we were to use the median instead we would conclude that the class did not do so well.
In this case, it is more appropriate to use the median as the majority of the class scored below 50 and it is more reflective of the class's ability, whereas the mean was pulled up by the two high scores.
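The two values in this example can be checked with a few lines of Python. This is just a quick check using the standard library; the variable names are our own.

```python
from statistics import mean, median

scores = [21, 25, 32, 40, 97, 100]
pass_mark = 50

print(mean(scores))    # 52.5 -> (21 + 25 + 32 + 40 + 97 + 100) / 6
print(median(scores))  # 36.0 -> mean of the two middle scores, 32 and 40

# Count how many students actually scored below the pass mark
below = sum(1 for s in scores if s < pass_mark)
print(below)           # 4
```

Four of the six students scored below 50, which is why the median tells the more honest story here.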
For symmetrical distributions, both the mean and median are appropriate.
For asymmetrical distributions, the median should be used as it is not influenced by extreme values.
The mode is only used for central tendency when analysing categorical data. As we are comparing the centres of numerical data, the mode won't be used.
Use the Standard Deviation (SD) when the data is symmetrical. This is because all measures of spread are appropriate when there are no extreme values and SD is the most sophisticated measure.
Use the Interquartile Range (IQR) when the distribution is asymmetrical. This is because both SD and Range are affected by extreme values whereas IQR is not.
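To see why extreme values matter when choosing a measure of spread, the sketch below compares the SD and IQR of a small made-up data set before and after one value is replaced with an extreme value. The data, the function name, and the "median of each half" quartile method are all assumptions made for illustration.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range, assuming quartiles are the medians of each half."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[n // 2 + n % 2 :]
    return median(upper_half) - median(lower_half)

typical = [50, 52, 54, 56, 58, 60, 62]        # made-up, roughly symmetrical
with_extreme = [50, 52, 54, 56, 58, 60, 200]  # same data, one extreme value

# The SD grows a lot when the extreme value is added; the IQR does not move.
print(stdev(typical), iqr(typical))
print(stdev(with_extreme), iqr(with_extreme))
```

In this example the IQR is 8 for both data sets, while the SD becomes many times larger once the extreme value is included.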
As shown earlier, we can identify a shape using its features of symmetry, peaks, and tails. We will often use all three when defining a distribution.
The shape of weights for the sample of back rugby players is approximately normal, because the weights are roughly symmetrical, unimodal, and follow a bell-shaped curve.
The shape of weights for the forwards rugby players is skewed to the right because the weights are unimodal, asymmetric and more spread out to the right-hand side.
Additional Notes:
When using NZGrapher we need to remember that when we identify the shape, we only consider the shape between the minimum and maximum values. (Use the whiskers on the box plot to remind you of these minimums and maximums).
The best way to do this is to use a highlighter to colour in the data section of the graph to allow you to focus on only that section. You also need to ignore any of the tails that extend beyond the data.
Compare the centres by finding the mean/median of both groups, identifying which group has the larger mean/median, and then calculating how much bigger it is.
For Merit, make sure that you include the mean/median values and units of each group as evidence.
For Excellence, tell the story and connect to research, explaining why one group may (or may not) be bigger or smaller than the second group.
In my sample, the median weight for forwards rugby players is heavier than backs by 18 kg. The median weight for my sample of backs is 92 kg. The median weight for my sample of forwards is 110 kg. Forwards need more muscle and weight to be able to both hold the line and push the line forwards, whereas backs tend to need to run fast, which often means carrying less bulk than forwards.
Here you will be comparing the spread of the samples using either the interquartile range (IQR) or the standard deviation (SD).
The IQR is the range of the middle 50% of your sample data, and in our boxplot, it is the width of the box.
The standard deviation can be found on NZGrapher and represents how far, on average, the data points are from the mean.
When comparing the spread, don’t calculate the difference between the SDs/IQRs, but instead use descriptive language (a little wider, significantly wider, much wider).
For Merit, make sure that you include the SD/IQR values and units of each group as evidence.
For Excellence, you may be able to tell the story, explaining why one group may (or may not) be more or less spread out than the second group.
In the sample, the spread of the middle 50% of weights of back rugby players is a little wider than the spread of the middle 50% of weights of forward rugby players. The IQR of weights for back rugby players is 8kg whereas the IQR of weights for forwards rugby players is 6.5kg.
Unusual values could be single points or a cluster of points. They should be obvious, so only comment on them if they are.
Mathematically, an outlier is any point that is more than 1.5 x IQR below the LQ or more than 1.5 x IQR above the UQ. NZGrapher can identify these for you by clicking 'boxplot (no outliers)'. Any points that lie outside of the whiskers can be considered outliers.
There are two outliers, who both play in a forwards position in rugby. One has a lower than expected weight of around 96 kg, and the other has a higher than expected weight of around 136 kg.
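The 1.5 x IQR rule can be sketched in a few lines. The function names and the rent values below are made up for illustration; the quartiles ($485 and $640) are simply borrowed from the South Island rent example later in this guide.

```python
def outlier_fences(lq, uq):
    """Return the (lower, upper) cut-offs given by the 1.5 x IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def find_outliers(values, lq, uq):
    """Return the values that lie beyond either cut-off."""
    low, high = outlier_fences(lq, uq)
    return [v for v in values if v < low or v > high]

# LQ = 485 and UQ = 640 give an IQR of 155, so the cut-offs are
# 485 - 232.5 = 252.5 and 640 + 232.5 = 872.5
print(outlier_fences(485, 640))  # (252.5, 872.5)

# Made-up weekly rents ($): only 1000 is beyond the upper cut-off
print(find_outliers([300, 520, 610, 870, 1000], 485, 640))  # [1000]
```

A value only counts as an outlier under this rule if it is beyond a cut-off, so $870 is unusual-looking but not an outlier here.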
Excellence:
The shape distribution of weekly rental cost (NZD) of North Island properties in New Zealand appears to be right-skewed. This is because there is a longer tail on the right side of the distribution. Outliers are also visible on the far right of the main cluster of data points, around the $4500 range. This indicates a few properties with significantly higher weekly rental costs in the North Island. This may be due to property size, market demands, amenities, etc. Meanwhile, the shape distribution of weekly rental cost (NZD) of South Island properties in New Zealand also appears to be symmetrical. This is because the two sides of the distribution are a mirror image of each other. It has one peak and a bell-curved shape. There are no visible outliers. In context, this suggests that the weekly rental costs are relatively consistent across different properties on South Island. This might be because the features of the properties of South Island have similar characteristics, such as the size or quality of the construction. This similar nature can lead to more uniform, stable and predictable rental market prices.
The distribution for North Island is skewed right. Therefore, it’s an asymmetrical distribution. While the distribution for the South Island is symmetrical. Considering both of these features, I will be using the median as the measure of the centre. My reason for this is that the North Island distribution contains extreme outliers; therefore, if I use mean as my measure of centre, it will be heavily impacted by extreme values and outliers. Hence, using the median to measure the centre will provide a better representation, as it avoids distortion from extreme values. As for the distribution for the South Island is symmetrical; therefore, either mean or median can be used interchangeably. For these same reasons, I will be using the Interquartile Range (IQR) as my measure of spread because using IQR helps mitigate the influence of these outliers in the North Island distribution. While the South Island’s distribution is symmetrical, using the IQR provides a consistent measure of spread that can be compared directly to the North Island’s IQR.
The median weekly rental cost (NZD) of North Island is $750. Meanwhile, South Island's median weekly rental cost (NZD) is $570. This suggests that in my sample, the North Island median weekly rental cost is higher than the South Island median weekly rental cost by $180 on average. This further supports my research and prior research, highlighting the difference in regional rental costs between the North and South Islands of New Zealand. Once again, one major factor leading to this difference may be that North Island is home to larger cities like Auckland and Wellington, increasing demand for rental properties and driving up rental prices. According to a source, NZ Herald talks about house prices rising due to migration and demand for new-build homes, stating, “In an environment of strong population growth, this decline in new house supply can do only one thing to prices – push them higher.” This suggests that the surge in migration, both international and domestic, is significantly influencing rental prices and has intensified the competition for available rental properties, thus driving up prices.
North Island IQR
$950 - $580 = $370
South Island IQR
$640 - $485 = $155
The interquartile range of my sample of weekly rental cost (NZD) of North Island is $370. Meanwhile, the interquartile range of my sample of weekly rental costs (NZD) for South Island is $155. This means the middle 50% of both Islands show considerably wide rental cost variability. This may be due to economic conditions and urbanisation. The South Island’s lower IQR of $155 suggests that rental prices are clustered more closely around the median, showing less variability and more uniform stable rental prices. Meanwhile, North Island’s higher IQR of $370 suggests a broader spread of rental costs, thus resulting in greater variability and less stability in rental prices.
Within my sample of the North Island’s weekly rental cost (NZD), I can identify eight clear outliers. These outliers, ranging from $2000 to $4500 per week, are particularly intriguing. They may be due to different types of luxurious properties, ranging from high-end apartments to penthouses. The greatest outlier for North Island weekly rentals is NZD $4500, hinting at the possibility of a luxurious property, often located in prime urban or waterfront locations. Meanwhile, within my sample of South Island’s weekly rental cost (NZD), I can identify three outliers in the $1000 range. Although closely clustered around the upper quartile, they still qualify as outliers since they lie beyond 1.5 times the interquartile range above the upper quartile. A possible explanation is probably because of unique accommodations and properties with exclusive access to natural attractions, which drive up the rental prices.
High Merit:
The median of the survivors is £14.37 higher than the median of the non survivors, as it is £26.55 compared to the non-survivors £12.18. I am looking at the median rather than the mean because the extreme values affect what the center looks like, therefore the mean isn't used. So far this is lining up with my expectations as I expect that the people who paid a higher fare rate more likely survived. This supports my research because it showed that those who paid more were able to allow them to stay on one of the higher decks of the ship, therefore increasing their chances of survival as they were further away from the danger. For example, of the third class survivors, only 25% of the people that stayed on that level survived. Whereas the percentage of people that survived that stayed in first class was 60%.
The middle 50% of the non-survivors' fares are between £8 and £26, whereas the middle 50% of the survivors' fares are between £13.70 and £52.80. Saying this, there is quite a bit of overlap between the upper quartile of the non-survivors and the lower quartile of the survivors. So this suggests that there was quite a difference in the fares paid and that survivors on the Titanic in 1912 paid more for their ticket than the non-survivors. This once again can be linked to my research as this result lined up with my expectations, as I had predicted that those who survived paid a higher median fare price. This is because if they paid more, they would have been staying in one of the rooms on the upper deck.
The interquartile range for the survivors is £39.10 whereas the interquartile range for the non-survivors is £18, indicating that the survivors have more variation in the fare price that they had paid than the non-survivors. The standard deviation is also higher for the survivors. Overall, visually the survivors seem to be slightly more spread out than the non-survivors. The shapes of both graphs are different when compared to each other. The non-survivors fare paid seems to be a bimodal graph with a peak close to around £7-8 and another peak near £25.
On the other hand the survivors fare paid seems to be a unimodal graph with a mode around the median of £26.55. Although it is very skewed to the right it has a shorter tail stretching from the upper quartile to the highest value than the non- survivors graph. This supports my prediction because on the survivors data there is more people that paid more for their fare, which is why they were more likely to survive.
Sometimes, due to budget/time constraints, it is only possible to obtain a small sample from the population. The smaller the sample, the less likely it is that the sample median will be representative of the population median.
In order to make an inference about where the population median is likely to lie, many resamples (with replacement) are taken from our sample.
The medians/means of these resamples form the basis of our bootstrapped inference.
Resamples should contain the same number of data points as the original sample. Just like sampling, resamples are likely to get different medians each time.
In order to use this method in practice, a very large number of resamples need to be taken. As this is very time-consuming, we use programs to do this very rapidly. We will usually take 1000 resamples.
When resampling, we want to compare the medians/means of our two groups each time. This is called the difference between means/medians (DBM).
When calculating the DBM, we first identify which group in our original sample had the higher value and which had the lower value.
For each of our resamples, we will subtract using the same order:
DBM = Group 2 Median/Mean - Group 1 Median/Mean.
Because the order of subtraction is fixed, it is possible to get a negative DBM from a resample. A negative DBM means that the resample disagrees with our original sample about which group is larger.
Because we are repeating this 1000 times we will be able to identify a new distribution which is how we make our inference.
A confidence interval (CI) is a range of values that we are reasonably confident contains the population value we are estimating, here the difference between the population medians/means. The confidence level (for example 95%) describes how confident we are that the interval contains it.
From our 1000 resamples, we will end up with a bootstrapped distribution with 1000 data points, each of which is a DBM from a resample. From this, NZGrapher will be able to make a 95% confidence interval by finding the range in which the middle 950 of our data points lie. This means that the highest 2.5% and lowest 2.5% of DBMs are not included.
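The resampling process described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a description of how NZGrapher works internally: the function name, the seed parameter, and the rugby weights are all made up, and the interval is found by simply dropping the lowest 25 and highest 25 of 1000 sorted DBMs.

```python
import random
from statistics import median

def bootstrap_dbm_interval(group1, group2, resamples=1000, seed=None):
    """Return (low, high) for the middle 95% of bootstrapped DBMs.

    Each DBM is median(group 2 resample) - median(group 1 resample),
    always subtracted in the same order.
    """
    rng = random.Random(seed)
    dbms = []
    for _ in range(resamples):
        # Resample with replacement, keeping the original group sizes
        resample1 = rng.choices(group1, k=len(group1))
        resample2 = rng.choices(group2, k=len(group2))
        dbms.append(median(resample2) - median(resample1))
    dbms.sort()
    cut = int(resamples * 0.025)  # 25 values off each end when resamples = 1000
    return dbms[cut], dbms[resamples - cut - 1]

# Made-up weights (kg), for illustration only
backs = [88, 90, 91, 92, 93, 95, 96]
forwards = [104, 107, 109, 110, 112, 114, 118]

low, high = bootstrap_dbm_interval(backs, forwards, seed=42)
print(low, high)
```

The exact interval changes with the seed because the resamples are random, which is the same reason two runs of a bootstrap in any program rarely give identical intervals.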
Additional Notes:
Confidence intervals are affected by sampling error and therefore by sample size. The larger the sample size, the narrower the confidence interval; the smaller the sample size, the wider it is.
Although we are 95% confident that the population median/mean lies within our confidence interval, this does not mean that it is certain. It is entirely possible that the true population median/mean lies outside the interval, so it is important not to use definitive descriptions.
In order to make the call on whether your hypothesis was correct, you'll need to determine whether or not your bootstrapped confidence interval contains 0.
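If the CI contains 0, we cannot make the call, because a difference of zero (no difference between the groups) is one of the plausible values.
If the CI does not contain 0, we can make the call.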
You can write your formal inference using the following format:
We can be reasonably confident that for these populations of male and female kea from New Zealand, the median beak lengths will be between 5.8mm and 7.5mm longer for male kea than female kea. As zero isn't included in the bootstrapped confidence interval, we can make the call that there is a difference in the median beak lengths between male and female kea from New Zealand.
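The decision rule for making the call can be written as a one-line check. The function name is our own; the first interval is the kea example above and the second is a made-up interval that includes zero.

```python
def can_make_the_call(ci_low, ci_high):
    """Return True when the confidence interval does not contain 0."""
    return not (ci_low <= 0 <= ci_high)

print(can_make_the_call(5.8, 7.5))   # True: zero is not in the interval
print(can_make_the_call(-1.2, 3.4))  # False: zero is a plausible difference
```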
Taking different samples from the same set of data will almost certainly produce different results. This will be the case whether sampling with or without replacement. This is known as sampling variability.
When discussing sampling variability, you want to cover the following:
How would the data change?
How would the analysis be impacted?
How does this affect your bootstrapped confidence interval?
Will this impact your conclusion (making the call)?
This can be discussed as follows:
"If I took another sample, I would get different weights for rugby players as I would be collecting data from different rugby players. I would expect though that the summary statistics (minimum, LQ, median, UQ, and maximum) weights of forwards and backs to be similar to the values in my sample. Because the median weights for forwards and backs would be similar, this would lead to a similar bootstrap confidence interval, and therefore the conclusion that the weights of forwards are larger than the weights of backs, is likely to stay the same."
This ties into the idea of sampling error very closely. Sampling error is the degree to which the sample mean/median is different from the population mean/median.
The larger the sample size, the smaller the sampling error.
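The link between sample size and sampling error can be explored with a small simulation. Everything in this sketch is an assumption made for illustration: the population is 10,000 made-up values, and sampling error is measured as the average distance between the sample median and the population median over 500 repeated samples.

```python
import random
from statistics import mean, median

def average_sampling_error(population, sample_size, repeats=500, seed=1):
    """Average distance between the sample median and the population median."""
    rng = random.Random(seed)
    true_median = median(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)  # sample without replacement
        errors.append(abs(median(sample) - true_median))
    return mean(errors)

# Made-up population: 10,000 values centred on 100
rng = random.Random(0)
population = [rng.gauss(100, 15) for _ in range(10_000)]

small = average_sampling_error(population, 10)
large = average_sampling_error(population, 200)
print(small, large)  # the error for samples of 200 should be clearly smaller
```

In practice we never know the population median, which is why the bootstrap is used to estimate how far off a single sample might be.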
When doing a statistical analysis like this, it is important to recognise that we are only looking at part of the picture. In our analysis we are only looking at two variables where other variables may be equally or more important. It's important to add sections where you believe this may be the case.
Here are a couple of examples:
There are a number of different factors that might affect the weight of children in NZ. For example, if a child has parents who are both slim and short in stature, then because of the genetic link it is likely that the child will also be slim and short in stature. Equally, a child whose parents have bigger and heavier bones and a wide/tall build is likely to be taller and heavier. A study published in the UK supports this and discusses how there is a link between people's weight and their genetics, where being slim is a heritable trait.
Another factor that could affect the weight of children is the amount of exercise they do each week. I expect that a child who is more active and spends more time each week exercising would have less body fat than a child who is less active and spends less time exercising each week. The more body fat a child has the higher their weight will be. Kids Health suggest that “kids can reach a healthy weight by eating right and being active”.
Always add references when including information outside of our analysis.
The final section of your report will be to mention anything that may have gone wrong with your investigation and what further investigation may be required to answer your initial question.
A good example of this would be to look at your data and how it was gathered. If we are unsure about how the data was measured, there may have been faults in the data-gathering process that may lead to inaccuracies.
Here is an example:
It is likely that a number of different people, including volunteers, took these measurements, and perhaps not all measured beak lengths in the same way. I would be skeptical that a kea would stay still long enough for someone to get an accurate measurement.
Making the Call:
We can be reasonably confident that for these populations of people currently renting in the North and South Islands of New Zealand, the mean rent per week will be between $221.80 and $578.46 more expensive in the North Island than the South Island. As zero isn’t included in this bootstrapped confidence interval, there is enough evidence to make the call that there is a difference between the mean rent per week in the North and South Islands.
Connection to Hypothesis
This conclusion supports our hypothesis that people who rent a property in the South Island tend to have a cheaper median/mean weekly rent price (in $NZD) compared to those in the North Island. This is reinforced by our investigational data, including the results from our bootstrapping analysis, which provides strong evidence for the observed difference in rental prices between the two islands.
Sampling Variability:
If I took another sample, I would get different mean rent per week prices as I would be collecting data from different renters in the North and South Islands of New Zealand. I would expect my summary statistics (minimum, LQ, mean, UQ, maximum) to be similar to, but not the same as, my current investigational results. Because the mean rent per week price of North and South Island rent would be similar, this would lead to a similar bootstrap confidence interval, and therefore the conclusion that rent per week prices are higher in the North Island than in the South Island is likely to remain the same.
Mean vs Median:
I decided to select mean as my parameter for my investigation. This is because when removing outliers from the box and whisker graphs of both the North and South Island rent, the distributions of both were relatively normal and neither graph had any extreme values, meaning my results wouldn’t be greatly affected. If I were to use median instead I would get values of $700 per week for North Island and $565 per week for the South Island. The South Island value is similar to its mean, but the North Island value is not: the North Island mean for rent per week was $948.81, very different from the median. The South Island mean for rent per week was $571.92, which was very similar to the median. In the North Island, there is a big difference between the mean and median rent per week price. This could be because there are more outliers present, and the outliers can have a greater effect on the mean, which is a reason why it is greater than the median. In the South, the difference between mean and median for rent price per week is much smaller. This could be due to there being few or no extreme outliers, meaning there is little to no effect on the mean.
Sample Size and Sampling Error
If I were to use a smaller sample for my investigation, it would result in larger sampling error. A larger sampling error would mean that there would be a larger confidence interval on my bootstrapped graph. Bootstrapping is the process of resampling a dataset multiple times, and in the case of my experiment I resampled my data 1000 times. Each resample has a different mean to the next or last, and the difference between these means (DBM) is plotted to create the bootstrapped confidence interval. There are a number of factors that could affect the mean rent per week price of rental properties in the North and South Islands of New Zealand. One factor is that my sample size was a small sample. To help widen the data and show a greater amount of rental prices, a larger sample size would be required, such as 500 rental properties per group. My sample is not representative of all rental properties in the North and South Islands of New Zealand. Another factor that could affect the mean rental price per week is the after-effects of the rent freeze that happened in New Zealand during the beginning of the COVID-19 pandemic. After this rent freeze ended in September of 2020, the mean prices of rent nationwide went from $450 to $501 in December 2020. In my research, I can see that this increase of rent has continued, showing the after-effects of COVID-19 are still not over.
Next Steps
I believe that further research is required to see whether subsidies or different living situation options are required to alleviate the struggles people that rent a property are experiencing due to the cost of living crisis, as mentioned in my purpose. Government agencies like the Ministry of Housing and Development, Ministry of Social Development and Tenancy Services (falls under Ministry of Business, Innovation and Employment) could contact all of the people who are currently renting a residential property in New Zealand and conduct a survey similar to the NZ Census to find out what people are struggling to afford in order to maintain a good quality of life. This way, the various ministries can figure out strategies for how to help people who cannot afford rent along with bills and food. The ministries can calculate how they can provide support to those who need it, or remake the benefits strategy to include more groups of people and turn it towards helping rent decrease. Potentially, they could create a rent freeze similar to the one in 2020 that does not completely freeze rents, but helps them stay at similar levels for longer.