Data investigation and interpretation: key ideas and important concept knowledge
Key ideas
The overarching key ideas of estimation, benchmarks, visualisation, equality and equivalence, language and strategies need to be considered when developing units of work in data investigation and interpretation. The specific key ideas in data investigation and interpretation are: classification, variation, expectation, distribution, randomness and informal inference.
Classification
Classification involves making decisions about the categorisation of data. Data is often sorted into categories by characteristic.
The following five key ideas are interrelated.
Variation
Variation describes the differences observed around us in every measurable aspect of life, such as age, height, eye colour and temperature. Variation is fundamental and directly connected to the other four key ideas.
Expectation
Expectation is a prediction based on patterns and differences in data.
Distribution
Distribution is a shape or relationship that represents a whole dataset. Common features of the shape of a distribution include centre, symmetry and non-symmetry (i.e. skewness), most frequent values or categories, and spread. Categorical data might be represented by a bar graph that shows how the data is distributed across the categories.
Numerical data might be displayed in a dot plot.
Randomness
Randomness occurs when all possible outcomes of a situation have an equal chance of being selected.
Informal inference
An informal inference is a generalised claim that is formulated from the data collected (Watson n.d.).
Important concept knowledge
Statistical inquiry
Statistics is the process of answering questions using data. The data may need to be collected or may already exist. Five interconnected stages make up the statistical inquiry cycle, as shown in Figure 67 (right).
Problem
Inquiry begins with identifying an issue or defining a problem. The problem is then refined into inquiry questions.
Types of questions include:
• summary (for example, how tall are seven-year-old students?)
• comparison (for example, are seven-year-old girls taller than seven-year-old boys?)
• relationship (for example, is there a relationship between children’s heights and their arm spans?).
Plan
Planning involves considering what data is needed to answer the inquiry question and how it is to be collected. Sometimes the data already exists and can be accessed without the need to collect new data. The critical consideration is whether the data can be used to answer the inquiry question.
Data
This stage involves collecting data and representing it in a communicable format. Data can be organised into two types: categorical and numerical.
Categorical data, also known as qualitative data, may be represented by a name, a symbol or a number code. Two types of categorical data are described below.
• Nominal data is a set of data that can be separated into distinct groups or categories that have no logical order.
• Ordinal data is a set of data that can be logically ordered or ranked.
Numerical data, also known as quantitative data, is data that can be expressed as counts (numbers) and specific measures (units).
• Discrete data is a set of data that can take distinct and specific number values.
• Continuous data is a set of data consisting of measurements that can take on any decimal value along a continuous scale.
Confusing outcomes with events
Related to the equi-probability bias, students often confuse outcomes with events. In the two-dice scenario, the event of getting a total of 7 has six associated outcomes. If the dice are assigned labels first and second, the outcomes are (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). The outcome (3,4) is not the same as the outcome (4,3).
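The difference between an outcome and an event can be checked by listing every outcome for two labelled dice and then selecting the outcomes that belong to the event. The Python sketch below is illustrative only; the variable names are assumptions and do not come from the source.

```python
from itertools import product

# Each outcome is an ordered pair: (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# The event "total of 7" is the collection of outcomes whose values sum to 7.
total_of_seven = [pair for pair in outcomes if sum(pair) == 7]

print(len(outcomes))        # 36 possible outcomes
print(total_of_seven)       # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(len(total_of_seven))  # 6 outcomes make up this one event
```

Because the pairs are ordered, (3, 4) and (4, 3) are counted separately, which is why the event has six outcomes rather than three.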
Data collection and display
Data collection tools
Data can be collected using a range of tools. These may include:
• surveys
• questionnaires
• interviews
• observations
• measurements
• experiments (usually associated with probability).
Data displays
Data displays are a tool for investigation, as well as a way to communicate findings. The choice of data display should be directly related to the investigation and the type of data collected.
All data
Categorical data
Pictographs
Each symbol in a pictograph represents a number of items within a category.
Pictographs with a key usually have icons that represent more than one item within a category. In the pictograph in Figure 72, each drop represents four people with that blood type.
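The role of the key can be sketched in a few lines of Python. The counts below are hypothetical (they are not the data in Figure 72), and the function name and the "*" symbol are assumptions made for illustration. Only the key of four people per symbol follows the text.

```python
# Hypothetical number of people with each blood type (not the Figure 72 data).
blood_type_counts = {"O": 20, "A": 16, "B": 8, "AB": 4}

# The key: one symbol stands for four people.
PEOPLE_PER_SYMBOL = 4


def pictograph_rows(counts, per_symbol, symbol="*"):
    """Return one row of symbols per category, using whole symbols only."""
    return {
        category: symbol * (count // per_symbol)
        for category, count in counts.items()
    }


for category, row in pictograph_rows(blood_type_counts, PEOPLE_PER_SYMBOL).items():
    print(f"{category:>2} | {row}")
```

In this sketch a count that is not a multiple of four is rounded down; a hand-drawn pictograph would normally show a partial symbol instead.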
Numerical data
Dot plot
Each dot on a dot plot represents one data item, so the height of each column of dots shows the number of items with that data value.
Dot plots are used to display discrete numerical data, such as the data shown in Figure 76.
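A text-only dot plot shows how the column heights arise from counting. This is a minimal sketch: the data is hypothetical (not the data in Figure 76), the function name is an assumption, and the layout only lines up for single-digit values.

```python
from collections import Counter

# Hypothetical discrete data: number of siblings reported by 12 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2, 4, 1]


def dot_plot_lines(values):
    """Build the rows of a text dot plot, top row first, axis labels last."""
    counts = Counter(values)  # how many items share each data value
    tallest = max(counts.values())
    axis = range(min(values), max(values) + 1)
    lines = []
    for level in range(tallest, 0, -1):
        # Place a dot wherever the column for that value reaches this level.
        lines.append(" ".join("o" if counts[v] >= level else " " for v in axis))
    lines.append(" ".join(str(v) for v in axis))
    return lines


print("\n".join(dot_plot_lines(siblings)))
```

Each "o" is one student, and the tallest columns (1 and 2 siblings) mark the most frequent values.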
Box and whisker plot
Note: Box and whisker plots are introduced in the Year 9 curriculum. The following has been included as supplementary information.
The box and whisker plot for the data of mothers’ ages in Figure 77 gives five measures of the distribution of ages. The extreme ends of the whiskers are the lowest and highest ages. The central measure is the median, and the left and right ends of the box are the lower and upper quartiles (LQ and UQ). This means that 50 per cent of the data lies inside the box.
Box and whisker plots can be used for both discrete and continuous numerical data.
Quartiles are the result of splitting a distribution into quarters. The lower quartile (LQ) is the value below which one-quarter of the data values lie. The upper quartile (UQ) is the value above which one-quarter of the data values lie. The interquartile range is the difference between the upper and lower quartiles.
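The five measures shown on a box and whisker plot can be sketched as follows. Several conventions exist for calculating quartiles; this example assumes the "median of each half" method often used in school mathematics, which the source does not specify. The ages are hypothetical (not the data in Figure 77) and the function name is an assumption.

```python
from statistics import median


def five_number_summary(values):
    """Return min, LQ, median, UQ and max (needs at least two values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]   # values below the middle position
    upper_half = ordered[-half:]  # values above the middle position
    return {
        "min": ordered[0],
        "LQ": median(lower_half),
        "median": median(ordered),
        "UQ": median(upper_half),
        "max": ordered[-1],
    }


# Hypothetical ages in years.
ages = [19, 22, 24, 25, 27, 28, 30, 31, 33, 36, 41]

summary = five_number_summary(ages)
iqr = summary["UQ"] - summary["LQ"]

print(summary)  # min 19, LQ 24, median 28, UQ 33, max 41
print(iqr)      # 9
```

With an odd number of values the middle value is left out of both halves. Other quartile conventions can give slightly different values for LQ and UQ.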
Appearance of data
• Symmetrical: the mean, median and mode are close together.
• Skewed: the mean, median and mode are not close together.
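The contrast between the two appearances can be illustrated with two small hypothetical datasets, using functions from Python's standard statistics module. The datasets and the helper name are assumptions made for illustration.

```python
from statistics import mean, median, mode

# Hypothetical datasets chosen to show each appearance.
symmetrical = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 16]


def centre_measures(values):
    """Collect the three measures so they can be compared side by side."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


print(centre_measures(symmetrical))  # mean 3, median 3, mode 3
print(centre_measures(skewed))       # mean 4, median 2, mode 1
```

In the skewed dataset the single large value (16) pulls the mean upwards, while the median and mode stay near the bulk of the data.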
Analysis
Data analysis is the process of making sense of the data with respect to the inquiry question. Statisticians use displays, such as graphs and tables, and measures, such as medians and ranges, to look for patterns (consistencies), differences among groups, and trends (patterns over time).
Data measures
Data measures are calculated to represent a single feature of a whole dataset. Usually the dataset is composed of numbers. Mean and median are measures that represent the centre of a dataset, whereas range and interquartile range (IQR) measure spread.
Measures of variation
Measures of variation describe the spread of data and include the:
• range – the difference between the greatest and smallest values in the dataset
• interquartile range – the difference between the upper and lower quartiles in the dataset. It measures the spread of the middle 50 per cent of the dataset.
Measures of central tendency
Measures of central tendency refer to averages, such as the:
• mean – the sum of all values in the dataset divided by the number of data points (e.g. scores or values)
• median – the middle value in an ordered set of numeric data (when there is an even number of values, the mean of the two middle values)
• mode – the value that occurs most frequently in the dataset.
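The three definitions above can be written directly as short Python functions. This is a sketch under stated assumptions: the function names and the scores are hypothetical, the median of an even number of values is taken as the mean of the two middle values, and every tied mode is returned.

```python
from collections import Counter


def mean(values):
    """Sum of all values divided by the number of data points."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    # Even count: take the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(values):
    """All values that occur most frequently (there may be more than one)."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical test scores.
scores = [4, 7, 7, 8, 9, 10]

print(mean(scores))    # 7.5
print(median(scores))  # 7.5
print(modes(scores))   # [7]
```

Returning a list of modes allows for datasets with more than one mode.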
The shape of distributions is also used to compare groups and to describe single distributions. A normal distribution is bell-shaped. However, distributions can be skewed positively or negatively, can have more than one mode and can even be rectangular (uniform).
Conclusion
The conclusion is an answer to the inquiry question that is supported by the data. Context informs the significance attached to findings.
The data is interpreted to develop inferences in relation to the original investigation and the findings are communicated.
Findings may lead to other questions that require further investigation, prompting a new data inquiry cycle.