Data investigation and interpretation: key ideas and important concept knowledge

Key ideas

The overarching key ideas of estimation, benchmarks, visualisation, equality and equivalence, language and strategies need to be considered when developing units of work in data investigation and interpretation. The specific key ideas in data investigation and interpretation are: classification, variation, expectation, distribution, randomness and informal inference.

Classification

Classification involves making decisions about the categorisation of data. Data is often sorted into categories by characteristic.

The following five key ideas are interrelated.

Variation

Variation describes the differences observed around us in every measurable aspect of life, such as age, height, eye colour and temperature. Variation is fundamental and directly connected to the other four key ideas.

Expectation

Expectation is a prediction based on patterns and differences in data.

Distribution

Distribution is a shape or relationship that represents a whole dataset. Common features of the shape of a distribution include centre, symmetry and non-symmetry (i.e. skewness), most frequent values or categories, and spread. Categorical data might be represented by a bar graph that shows how the data is distributed across the categories.

Numerical data might be displayed in a dot plot.

Randomness

Randomness occurs when all possible outcomes of a situation have an equal chance of being selected.

Informal inference

An informal inference is a generalised claim that is formulated from the data collected (Watson n.d.).

Important concept knowledge

Statistical inquiry



Statistics is the process of answering questions using data. The data may need to be collected or may already exist. Five interconnected stages (problem, plan, data, analysis and conclusion) make up the statistical inquiry cycle, as shown in Figure 67.

Problem

Inquiry begins with an issue or defining the problem. The inquiry problem is refined into inquiry questions.

Types of questions include:

• summary (for example, how tall are seven-year-old students?)

• comparison (for example, are seven-year-old girls taller than seven-year-old boys?)

• relationship (for example, is there a relationship between children’s heights and their arm spans?).

Plan

Planning involves considering what data is needed to answer the inquiry question and how it is to be collected. Sometimes the data already exists and can be accessed without the need to collect new data. The critical consideration is whether the data can be used to answer the inquiry question.

Data

The data stage involves collecting data and representing it in a communicable format. Data can be organised into two types: categorical and numerical.

Categorical data, also known as qualitative data, may be represented by a name, symbol or a number code. Below are some examples of data types.

Nominal data is a set of data that can be separated into distinct groupings or categories that cannot be organised in a logical sequence.

Ordinal data is a set of data that can be logically ordered or ranked.

Numerical data, also known as quantitative data, is data that can be expressed as counts (numbers) and specific measures (units).

Discrete data is a set of data that can take distinct and specific number values.

Continuous data is a set of data consisting of measurements that can take on any decimal value along a continuous scale.

Outcomes equal events confusion

Related to the equi-probability bias, students often confuse outcomes with events. In the two-dice scenario, the event of getting a total of 7 has six associated outcomes. If the dice are assigned labels first and second, the outcomes are (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). The outcome (3,4) is not the same as the outcome (4,3).
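The distinction between outcomes and events in the two-dice scenario can be checked by listing every outcome. The following is a minimal sketch in Python; the variable names are illustrative only and are not part of the original resource.

```python
from fractions import Fraction
from itertools import product

# Every ordered outcome for two labelled dice (first, second).
outcomes = list(product(range(1, 7), repeat=2))

# The event "total of 7" is the collection of outcomes whose faces sum to 7.
total_of_seven = [pair for pair in outcomes if sum(pair) == 7]

print(len(outcomes))    # 36 outcomes in total
print(total_of_seven)   # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]

# One event, six outcomes: (3, 4) and (4, 3) are counted separately.
print(Fraction(len(total_of_seven), len(outcomes)))  # 1/6
```

Listing the outcomes this way shows that a single event can contain several equally likely outcomes, which is why a total of 7 is more likely than a total of 2.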

Data collection and display

Data collection tools

Data can be collected using a range of tools. These may include:

• surveys

• questionnaires

• interviews

• observations

• measurements

• experiments (usually associated with probability).

Data displays

Data displays are a tool for investigation, as well as a way to communicate findings. The choice of data display should be directly related to the investigation and the type of data collected.

All data

Categorical data

Pictographs

Each symbol in a pictograph refers to a number within a category.

Pictographs with a key usually have icons that represent more than one item within a category. In the pictograph in Figure 72, each drop represents the blood type of four people.

Numerical data

Dot plot

Each dot on a dot plot represents the data of one item, so the height of each line of dots represents the number of items with that data value.

Dot plots are used to display discrete numerical data, such as the data shown in Figure 76.
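The idea that each dot stands for one item can be illustrated with a simple text version of a dot plot. This is a rough sketch in Python that assumes whole-number data; the function name and the sample data (number of siblings) are hypothetical and are not taken from Figure 76. The plot is drawn sideways, so the length of each row, rather than the height of each column, shows the number of items.

```python
from collections import Counter

def text_dot_plot(values):
    """Return a sideways dot plot: one row per data value, one '*' per item."""
    counts = Counter(values)
    rows = []
    # Include values with no items so that gaps in the data stay visible.
    for value in range(min(values), max(values) + 1):
        rows.append((f"{value:>2} | " + "*" * counts[value]).rstrip())
    return "\n".join(rows)

# Hypothetical data: number of siblings reported by 11 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2, 5]
print(text_dot_plot(siblings))
```

Rows with no dots (here, 4 siblings) are kept so that gaps and unusual values remain visible in the display.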

Box and whisker plot

Note: Box and whisker plots are introduced in the Year 9 curriculum. The following has been included as supplementary information.

The box and whisker plot for the data of mothers’ ages in Figure 77 gives five measures of the distribution of ages. The extreme ends of the whiskers are the lowest and highest ages. The central measure is the median, and the left and right ends of the box are the lower and upper quartile (LQ and UQ). This means that 50 per cent of the data lies inside the box.

Box and whisker plots can be used for both discrete and continuous numerical data.

Quartiles are the result of splitting a distribution into quarters. The lower quartile (LQ) is the value below which one-quarter of the data values lie. The upper quartile (UQ) is the value above which one-quarter of the data values lie. The interquartile range is the difference between the upper and lower quartiles.
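The five measures shown in a box and whisker plot can be calculated as sketched below. This example assumes the "median of each half" convention for quartiles, which is only one of several conventions in use, so other tools may give slightly different quartile values. The function name and the ages are hypothetical and are not the data from Figure 77.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, LQ, median, UQ, maximum).

    Assumes the median-of-halves convention: the middle value of an
    odd-sized dataset is left out of both halves.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two data values")
    half = len(data) // 2
    lower_half = data[:half]    # values below the middle position
    upper_half = data[-half:]   # values above the middle position
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]

# Hypothetical ages, for illustration only.
ages = [19, 22, 24, 25, 27, 28, 30, 31, 33, 36, 41]
lowest, lq, mid, uq, highest = five_number_summary(ages)

print(lowest, lq, mid, uq, highest)  # 19 24 28 33 41
print(uq - lq)                       # interquartile range: 9
```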

Appearance of data

• Symmetrical: the mean, median and mode are close together.

• Skewed: the mean, median and mode are not close together.
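The contrast between symmetrical and skewed data can be seen by comparing the three averages for two small datasets. The datasets below are invented for illustration and were chosen so that both have the same mean.

```python
from statistics import mean, median, mode

# Hypothetical datasets with the same mean but different shapes.
symmetrical = [2, 3, 3, 4, 4, 4, 5, 5, 6]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]

for name, data in [("symmetrical", symmetrical), ("skewed", skewed)]:
    print(name, mean(data), median(data), mode(data))

# symmetrical: mean 4, median 4, mode 4 (all close together)
# skewed:      mean 4, median 2, mode 1 (pulled apart by the large values)
```

In the skewed dataset, a few large values pull the mean above the median and mode, which is why the three measures separate.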

Analysis

Data analysis is the process of making sense of the data with respect to the inquiry question. Statisticians use displays, such as graphs and tables, and measures, such as medians and ranges, to look for patterns (consistencies), differences among groups, and trends (patterns over time).

Data measures

Data measures are calculated to represent a single feature of a whole dataset. Usually the dataset is composed of numbers. Mean and median are measures that represent the centre of a dataset, whereas range and interquartile range (IQR) measure spread.


Measures of variation

Measures of variation describe the spread of data and include the:

• range – the difference between the greatest and smallest values in the dataset

• interquartile range – the difference between the upper and lower quartiles in the dataset. It measures the spread of the middle 50 per cent of the data.

Measures of central tendency

Measures of central tendency refer to averages, such as the:

• mean – the sum of all values in the dataset divided by the number of data points (e.g. scores or values)

• median – the middle value in an ordered set of numeric data (when there is an even number of values, the mean of the two middle values is commonly used)

• mode – the value that occurs most frequently in the numeric dataset.
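The three definitions above can be written out step by step, as in the following sketch. The function names and the scores are illustrative only. The median function assumes the common convention of averaging the two middle values when the dataset has an even number of values, and the mode function returns every value that ties for most frequent.

```python
from collections import Counter

def mean(values):
    """Sum of all values divided by the number of data points."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data (average of the two middle values if even)."""
    data = sorted(values)
    mid = len(data) // 2
    if len(data) % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

def modes(values):
    """All values that occur most frequently (there may be more than one)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical scores, for illustration only.
scores = [2, 3, 3, 6, 8, 8]
print(mean(scores))    # 5.0
print(median(scores))  # 4.5
print(modes(scores))   # [3, 8]
```

This small dataset has two modes, so no single mode describes it well; the mean and median give different but equally valid descriptions of its centre.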

The shape of distributions is also used to compare groups and to describe single distributions. A normal distribution is bell shaped. However, distributions can be skewed positively or negatively, can have more than one mode and can even be rectangular.

Conclusion

The conclusion is an answer to the inquiry question that is supported by the data. Context informs the significance attached to findings.

The data is interpreted to develop inferences in relation to the original investigation and the findings are communicated.

Findings may lead to other questions that require further investigation, prompting a new data inquiry cycle.