All statistical investigations follow a process called the Statistical Enquiry Cycle.
This cycle contains five sections known as PPDAC:
Problem: Understanding and defining the problem.
Plan: How we will measure and record data.
Data: How we will collect, measure and clean data.
Analysis: Sorting data into readable outputs and looking for patterns.
Conclusion: Interpretation and communication of findings.
There are three types of variables we will be working with which can be separated into two sets:
Quantitative/Numerical Variables: Variables that use numbers
Continuous Variables: These are measured values which usually include fractions or decimals (Height, Weight, Distance)
Discrete Variables: These are counted values which are usually whole numbers (Shoe size, Number of pets)
Qualitative/Categorical: Variables that use words.
Descriptive Variables: These are categorical variables and can be broken down into groups (Eye colour, type of pet)
When completing a Bivariate analysis, we will be primarily comparing two continuous variables.
In practice, however, if the range is large enough, discrete data may also be acceptable.
A scatter plot is often the best way to display bivariate data, as it allows us to see both variables at the same time as well as a line of best fit.
A scatter plot has multiple features:
Y-axis (Along the side): Response Variable
X-Axis (Along the bottom): Explanatory Variable
Line of best fit: A line showing the relationship between the two variables. It should pass through the middle of the data at both ends and have roughly the same number of points above and below it.
Data Points/Co-ordinates: the (x,y) values of each data point
Scale: Should be regular and include a break if not starting at zero.
Titles and Labels: Name and units of both variables and title explaining the purpose of the graph.
Your problem statement is the question that you are looking to investigate. You will be given this in your assessment.
This will include:
Two numerical variables
The population
The word 'relationship'
Example: What is the relationship between the height and arm span of students at Hornby High School?
We can break down our two variables into their explanatory variable and the response variable. By changing the explanatory variable, we will measure how the response variable changes.
Explanatory Variable: This is the variable that we will control and can change. This variable will always be placed on the X-axis.
Response Variable: This is the uncontrolled variable that we measure in response. This variable is always placed on the Y-axis.
You will need to:
- State what variables you will use
- State what units you will use
- Explain why you have chosen those variables
Example: Height
What: For each student I will measure the top of their head above the floor while they are standing against a wall.
Units: Centimetres (cm).
Why: This will give me a measure of their overall height.
When we are gathering data, we want to ensure the data is as good as possible. The main goal is to achieve data that is accurate, precise and reliable.
Accuracy: How close the data is to the true value.
Precision: How close the measurements are to each other.
Reliability: Meaning you can get similar results on repeated trials.
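The difference between accuracy and precision can be checked with a few repeated measurements. The sketch below is only an illustration: the true height, both sets of measurements, and the helper names `accuracy_error` and `precision_spread` are invented for this example and are not part of the assessment.

```python
from statistics import mean, stdev

# Hypothetical data: one student, assumed true height 165.0 cm,
# measured five times with two different tape measures.
TRUE_HEIGHT_CM = 165.0
tape_a = [165.2, 164.9, 165.1, 164.8, 165.0]  # centred on the true value
tape_b = [167.1, 167.0, 166.9, 167.2, 167.0]  # tightly grouped but offset


def accuracy_error(measurements, true_value):
    """How far the average measurement is from the true value (smaller = more accurate)."""
    return abs(mean(measurements) - true_value)


def precision_spread(measurements):
    """Sample standard deviation of the measurements (smaller = more precise)."""
    return stdev(measurements)


for name, data in [("Tape A", tape_a), ("Tape B", tape_b)]:
    print(name,
          "accuracy error:", round(accuracy_error(data, TRUE_HEIGHT_CM), 2),
          "spread:", round(precision_spread(data), 2))
```

In this made-up data, both tapes are precise (their measurements are close to each other), but only Tape A is accurate. Getting a similar spread again on another day would be evidence of reliability.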
To ensure that our data is both accurate and reliable, we must have step-by-step instructions to measure both the explanatory and response variable. It is important to include as in-depth instructions as possible. There should be specific mention of each part of your table.
Example: Standing time on each foot (seconds).
We will tell each subject to shut their eyes, lift one foot from the floor and keep it off the floor for as long as possible. We will use a stopwatch and record how long the foot is off the floor (in seconds), and whether they were standing on their left or right foot. After they have balanced on each foot, we will ask each person which of their feet is dominant (if they are unsure, we will use the first foot they step with when they start walking).
When gathering data, it is important to isolate the variable, minimising any other sources of variation. This way we can ensure that ONLY the change in our explanatory variable is impacting our response variable.
You want to keep conditions the same each time you collect data.
If possible, take repeat-measurements.
Keep each person on the same task (don't switch measures).
Use the same measuring tools/instruments.
Statistical experiments are often subject to limitations, usually time or budget. As such, it's important to choose a reasonable scope when selecting a sample size.
Smaller sample sizes take a shorter time to collect data but give less reliable results.
Larger sample sizes take a longer time to collect data but give more reliable results.
It will be up to you to choose your sample size based on the importance of the research and the allocated budget/time.
For your assessment you can use the following:
Discrete data: n = 50
Continuous data: n = 30
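One way to see why larger samples give more reliable results is a small simulation. The sketch below is an illustration under stated assumptions: the population of 1000 heights is simulated (a mean of 165 cm and a standard deviation of 8 cm are invented values), and `spread_of_sample_means` is a hypothetical helper. It draws many samples of the same size and measures how much the sample averages vary from one sample to the next.

```python
import random
from statistics import mean, stdev


def spread_of_sample_means(population, n, repeats=200, seed=1):
    """Draw `repeats` samples of size n and return how much their means vary."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    means = [mean(rng.sample(population, n)) for _ in range(repeats)]
    return stdev(means)


# Simulated population of heights in cm (assumed values, not real data).
pop_rng = random.Random(0)
population = [pop_rng.gauss(165, 8) for _ in range(1000)]

small = spread_of_sample_means(population, n=5)
large = spread_of_sample_means(population, n=50)
print("Spread of sample means, n = 5: ", round(small, 2))
print("Spread of sample means, n = 50:", round(large, 2))
```

The averages from the larger samples vary less between repeats, which is what "more reliable" means here. The cost is that each sample of 50 takes ten times as many measurements as a sample of 5.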
When gathering data, we must ensure our data is unbiased.
This means we remove any personal bias by selecting participants through a system where everyone has an equal chance to be selected.
If any item/person has an increased/decreased chance to be chosen, your data will be biased.
Systematic Sampling: This involves taking every nth member of the population, e.g. selecting every 5th person/item.
Random Sampling: Selecting a sample through random means such as a random number generator (RNG) or pulling names from a hat.
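Both sampling methods can be sketched in a few lines. This is a minimal illustration, not a required method: the class roll of 30 students and the function names `systematic_sample` and `random_sample` are made up for the example.

```python
import random


def systematic_sample(population, step, start=0):
    """Take every `step`-th member, beginning at index `start` (0-based)."""
    return population[start::step]


def random_sample(population, n, seed=None):
    """Pick n different members so that each has the same chance of selection."""
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    return rng.sample(population, n)


# Hypothetical class roll: "Student 01" to "Student 30".
roll = [f"Student {i:02d}" for i in range(1, 31)]

every_fifth = systematic_sample(roll, step=5, start=4)  # 5th, 10th, 15th, ...
ten_random = random_sample(roll, 10)

print(every_fifth)
print(ten_random)
```

`random.sample` never picks the same person twice, which matches pulling names from a hat without putting them back.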
When gathering your data you will need to ensure that you are recording your results through an organized system such as a table.
Be sure to include:
Titles
Variables
Units
Names/Identifiers.
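If the results are recorded on a computer, the same table can be kept as a CSV file. The sketch below is one possible layout, assuming a height and arm span investigation; the column names and the three rows of values are invented for illustration. Putting the unit in each column name keeps the variable and its unit together.

```python
import csv
import io

# Column headings: an identifier plus each variable with its unit (assumed names).
FIELDS = ["student_id", "height_cm", "arm_span_cm"]

# Hypothetical measurements, not real data.
rows = [
    {"student_id": "S01", "height_cm": 162.5, "arm_span_cm": 161.0},
    {"student_id": "S02", "height_cm": 170.0, "arm_span_cm": 172.5},
    {"student_id": "S03", "height_cm": 158.0, "arm_span_cm": 157.5},
]


def write_table(rows, out):
    """Write a header row followed by one line per participant."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Written to memory here; open("results.csv", "w", newline="") would save a file.
buffer = io.StringIO()
write_table(rows, buffer)
table_text = buffer.getvalue()
print(table_text)
```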
It is important to go through our data and remove/re-measure any datapoints that are obvious mistakes.
Examples:
Marks out of 20: 9, 13, 15, 27, 18, 4, 20
As 27 is an impossible mark, it can be removed.
Price of Cars: $3400, $10999, 13, $14399, $5400, $2100
The value 13 is both an unlikely price for a car and does not have the correct units.
You must always state when you have removed a datapoint and justify your decision.
Do not simply remove a datapoint because it is an unusual value or an outlier. Outliers can exist and be an accurate value.
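The marks example above can be checked with a short script. This is a sketch under one assumption: a value counts as an obvious mistake only if it falls outside the range that is physically possible (0 to 20 for a mark out of 20). The function name `flag_impossible` is hypothetical. Unusual but possible values, such as the mark of 4, are kept.

```python
def flag_impossible(values, lowest, highest):
    """Split values into those kept and those removed, with a reason for each removal."""
    kept, removed = [], []
    for value in values:
        if lowest <= value <= highest:
            kept.append(value)
        else:
            reason = f"outside the possible range {lowest} to {highest}"
            removed.append((value, reason))
    return kept, removed


marks = [9, 13, 15, 27, 18, 4, 20]  # marks out of 20, from the example above
kept, removed = flag_impossible(marks, lowest=0, highest=20)

print("Kept:", kept)
for value, reason in removed:
    print(f"Removed {value}: {reason}")
```

Recording the reason next to each removed value gives you the justification you need to state in your report.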
You want to reflect on what improvements you could suggest if you collected another set of data. How could you better control any sources of variation that you didn’t identify when you first wrote your plan?
Example: I could improve my investigation by getting students to remove not just their shoes, but also their socks. This is because I noticed that some students had quite thick socks, while some girls had very thin pantyhose. So, it is possible that the measurements for foot length are not as accurate as they could be.
This can also be included at the end of your report in the conclusion.
How to use NZ Grapher
When analysing the data and looking for patterns and features, we want to first visually examine the graph. To help us do this, we can use either the shade-in or shade-out methods.
Shade-in: Draw a line above and below the data, then shade the area between these lines.
Shade-out: Shade the areas where there is not any data.
The first section of our analysis is identifying whether there is a trend and if so, what kind of trend there is (linear/non-linear).
No-Trend: This can occur when there is no visible pattern or relationship within the data. Our points will simply look scattered along the whole graph.
Linear Trend: You can draw a straight line of best fit through the data, with the points evenly scattered on both sides. The data points tend to increase/decrease in value at a consistent rate along the graph (Constant Gradient).
Non-Linear Trend: You can draw a curved line of best fit through the data, with the points evenly scattered on both sides. The data points tend to increase/decrease in value at an increasing/decreasing rate along the graph (Changing Gradient).
Data that has a trend will either increase or decrease as we move along the graph. We call this the direction of the graph.
Positive/Increasing Relationship: As our explanatory variable increases, the response variable tends to also increase.
Negative/Decreasing Relationship: As our explanatory variable increases, the response variable tends to decrease.
When we discuss how data changes along the graph, we are always interpreting from left to right (using the explanatory variable).
Example: There is a positive relationship between how long a person can stand on their dominant foot while their eyes are closed and how long they can stand on their non-dominant foot. This means that as the time a person can stand on their dominant foot increases, the time they can stand on their non-dominant foot also tends to increase. This does not surprise me, as balance is unlikely to depend on which side of the body is used.
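The direction of a linear trend can also be read from the gradient of a fitted line. The sketch below uses the least-squares method, which is a standard way for software to fit a straight line; it is an assumption here and may not match a line drawn by eye. The standing times are invented for illustration, and `line_of_best_fit` and `direction` are hypothetical helper names.

```python
from statistics import mean


def line_of_best_fit(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = gradient * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = y_bar - gradient * x_bar
    return gradient, intercept


def direction(gradient):
    """Describe the trend reading left to right along the explanatory variable."""
    if gradient > 0:
        return "positive"
    if gradient < 0:
        return "negative"
    return "no direction"


# Hypothetical standing times in seconds (explanatory, response).
dominant = [12, 20, 25, 31, 40, 48]
non_dominant = [10, 15, 22, 24, 35, 39]

gradient, intercept = line_of_best_fit(dominant, non_dominant)
print("Gradient:", round(gradient, 2), "Direction:", direction(gradient))
```

A gradient only describes a straight-line trend, so check the scatter plot first to see whether a linear trend is reasonable.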
Next, we need to analyze the strength of the relationship. Generally, the closer the points are to the line of best fit, the stronger the relationship between the two variables.
Strong: There is very little scatter and the points lie close to the line of best fit.
Moderate: There is a moderate amount of scatter and the points lie somewhat close to the line of best fit.
Weak: There is a large amount of scatter and the points lie further from the line of best fit.
There may also be graphs that look like a funnel. Here you can say that there are parts of the graph that are weak and others that are strong.
Example: The relationship is quite strong because most points are fairly close to the line of best fit. However, I notice that the points are closer to the line of best fit for people who could stand on one foot for less than 40 seconds. Above 40 seconds, the data is more scattered. This means the relationship is initially stronger and weakens along the graph.
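Strength is judged visually in these notes, but for a linear trend it can be supported with a number. The sketch below computes the correlation coefficient r, which is a technique added here and not part of the notes above. Values of r near 1 or -1 mean the points lie close to a straight line, and values near 0 mean a lot of scatter. The cut-offs of 0.8 and 0.5 in `describe_strength` are illustrative assumptions, not official thresholds, and the data is invented.

```python
from math import sqrt
from statistics import mean


def correlation(xs, ys):
    """Correlation coefficient r between two lists of equal length."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe_strength(r, strong=0.8, moderate=0.5):
    """Turn r into a word using assumed cut-offs (adjust to suit your course)."""
    size = abs(r)  # the sign gives direction, the size gives strength
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"


# Hypothetical standing times in seconds.
dominant = [12, 20, 25, 31, 40, 48]
non_dominant = [10, 15, 22, 24, 35, 39]

r = correlation(dominant, non_dominant)
print("r =", round(r, 2), "->", describe_strength(r))
```

A single value of r cannot show a funnel shape, where strength changes along the graph, so it supports the visual description but does not replace it.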
Unusual features may be unusual points or clusters of points.
These are points that lie far from most other points.
These may be valid points with other variables contributing to them.
Sometimes these may be due to measurement/recording errors or reversed coordinates.
These points may still lie along the line of best fit, but much further from other points. Otherwise, they may be far off the line.
Unusual points and clusters must be very obvious. Don't say there are unusual features if there are none.
Example: I notice that there is a cluster of smaller values and one larger value that are significant distances from most of the data. This means that there are two students who have similarly disproportionate height and arm span (172cm to 150cm) compared with the majority of sample students. We can tell they are not the same proportions as the other students as they lie far from the line of best fit.
There is also one student who has a much shorter height and arm span (132cm to 136cm) than the rest of the class. Their height and arm span are proportionate to the rest of the class as they still lie along the line of best fit. This person could perhaps be much younger than the other students or has yet to go through their puberty growth spurt.
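Points that lie far from the line of best fit can be listed by measuring each point's vertical distance from the line. The sketch below makes several assumptions: the line is taken to be arm span = height (gradient 1, intercept 0), the cut-off of 10 cm is a judgement call, and all points are invented apart from the two students described in the example above. The function name `far_from_line` is hypothetical.

```python
def far_from_line(points, gradient, intercept, max_gap):
    """Return (x, y, gap) for points whose vertical gap from the line exceeds max_gap."""
    flagged = []
    for x, y in points:
        predicted = gradient * x + intercept
        gap = y - predicted  # positive = above the line, negative = below
        if abs(gap) > max_gap:
            flagged.append((x, y, gap))
    return flagged


# (height_cm, arm_span_cm); the first four points are hypothetical.
points = [
    (160, 161), (165, 166), (170, 169), (175, 176),
    (132, 136),  # much shorter student, but still close to the line
    (172, 150),  # arm span much smaller than expected for this height
]

flagged = far_from_line(points, gradient=1.0, intercept=0.0, max_gap=10)
for height, arm_span, gap in flagged:
    print(f"Height {height} cm, arm span {arm_span} cm: {gap:+.1f} cm from the line")
```

Only the student at (172, 150) is flagged. The student at (132, 136) is unusual in a different way, lying far from the other points but along the line, so this check does not pick them up and they still need to be spotted on the graph.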
To conclude your report, you will first need to refer back to your original problem statement and answer the question by presenting your findings.
There should not be any new information within this section as you are simply using the information you have already presented.
Example: I have found that there is quite a strong relationship between how long a year 11 student at Paradise High School can stand on their dominant foot compared to their non-dominant foot. Students who can stand for a long time on their dominant foot also tend to be able to stand for a long time on their non-dominant foot. I have also found that students tend to be able to stand on their dominant foot longer than their non-dominant foot.
When doing a statistical analysis like this, it is important to recognize that we are only looking at part of the picture. In our analysis, we are only looking at two variables, while other variables may be equally or more important. It's important to add sections where you believe this may be the case.
Here are a couple of examples:
There are a number of different factors that might affect the weight of children in NZ. For example, if a child has parents who are both slim and short in stature, then because of the genetic link the child is also likely to be slim and short in stature. Equally, a child whose parents have bigger and heavier bones and a wide/tall build is likely to be taller and heavier. A study published in the UK supports this and discusses how there is a link between people's weight and their genetics, where being slim is a heritable trait.
Another factor that could affect the weight of children is the amount of exercise they do each week. I expect that a child who is more active and spends more time each week exercising would have less body fat than a child who is less active and spends less time exercising each week. The more body fat a child has the higher their weight will be. Kids Health suggests that “kids can reach a healthy weight by eating right and being active”.
Always add references when including information outside of our analysis.
The final section of your report will be to mention any areas that may have gone wrong with your investigation and what further investigating may be required to answer your initial question.
A good example of this would be to look at your data and how it was gathered. If there are things that you would change about your data-gathering process, this is a good section to put them.
Example: I could have more confidence in my results if I took a bigger sample. I would also have more confidence in my results if I sampled from the whole of Year 11, and not just my math class. I would also be interested to know if this relationship is the same for all ages of students and adults. I would also be interested to know if the results are similar when the subjects have their eyes open.