4: Analysis

How does one group of data compare to the other?

Expected Knowledge

You should know (from previous years of learning) how to find the minimum, lower quartile, median, upper quartile and maximum of a data set.

You should know how to calculate the range and inter-quartile range (IQR).
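
If you want to check your by-hand working, here is a minimal Python sketch of the five-number summary, range and IQR. It assumes the "median of each half" rule for quartiles (software such as NZ Grapher may use a slightly different rule, so answers can differ a little). The function name and the scores are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: quartiles are the medians of the lower and upper halves,
    leaving out the middle value when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]           # values before the middle position
    upper_half = data[half + n % 2:]   # values after the middle position
    return (data[0], median(lower_half), median(data),
            median(upper_half), data[-1])

# Made-up word scores, for illustration only
scores = [9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 22]

minimum, lq, med, uq, maximum = five_number_summary(scores)
data_range = maximum - minimum   # range = max - min
iqr = uq - lq                    # IQR = upper quartile - lower quartile

print("Five-number summary:", minimum, lq, med, uq, maximum)
print("Range:", data_range, "IQR:", iqr)
```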

Plots

After taking a sample in NZ Grapher, make a dot plot and box-and-whisker plot.

If you've used NZ Grapher before, there are two new buttons to click in the options.

Later on, we're going to want to have the "Informal C-I" and "C-I Limits" selected. These are the blue bars centred at the median in the plot below. We don't need to discuss them yet.

Remember to give your plots useful titles!

In all of your Analysis section, remember that you should be comparing two groups. It is not enough to describe the features of the two groups separately.

Paragraph structure

There are many ways that we can write an analysis section. This site suggests one set of sentence starters and uses it throughout.

I notice that ... [compare the visual features.]

My evidence is ... [give numerical evidence for the visual comparison.]

This means that in this sample... [in context]

This might mean that for all ... [discuss the population in context]

These are repeated in each of three key subsections.

Revisions coming...

In an upcoming revision of this website (and the resources), there will be two context sentences.

First - what does it tell us about the SAMPLE?

Second - what might this mean about the POPULATION?

Centre & Shift

The best measure of the centre is usually the median. Because of the way NZ Grapher and iNZight work, we will use the median in this standard.

Visually, the shift is the distance between the medians (the centre line of the middle 50% box).
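
As a rough sketch, the shift can be calculated by subtracting one median from the other. The two lists below are made-up scores (not the sample used on this site), and the variable names are only for illustration.

```python
from statistics import median

# Made-up scores for two groups, for illustration only
rare_letter_scores = [14, 16, 17, 18, 19, 20, 21, 22, 24, 25, 31]
other_scores = [9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 22]

median_rare = median(rare_letter_scores)
median_other = median(other_scores)

# A positive shift means the first group's centre is further to the right
shift = median_rare - median_other

print(f"Median (rare letters): {median_rare} points")
print(f"Median (other words): {median_other} points")
print(f"Shift between medians: {shift} points")
```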

Example

I notice that the centre of the scores for words with rare letters is to the right of the centre of scores for other words.

My evidence is that the median score of words with rare letters is 20 points, which is 5 more than the median of the score of the other words (15 points).

This means that from this sample, playing a word with a rare letter will get you more points, which makes sense because rare letters are worth a lot of points.

This might mean that in the list of allowable Scrabble words, words with rare letters score more than words without.

Spread

Visually, we look to see if one dot plot is more spread out than the other.

The most reliable measure of spread is usually the interquartile range (IQR). It is more stable than the range (max - min) as it is not affected by outliers.
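
To see why the IQR is more stable, this small sketch compares the range and IQR of a made-up sample before and after one very large value is added. It assumes the "median of each half" rule for quartiles, and the helper names are hypothetical. In this made-up case the range changes a lot while the IQR stays the same.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartile, assuming the 'median of each half' rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[:half]), median(data[half + n % 2:])

def spread(values):
    """Return the range and IQR of a list of values."""
    lq, uq = quartiles(values)
    return {"range": max(values) - min(values), "iqr": uq - lq}

# Made-up scores, for illustration only
scores = [9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 22]
scores_with_outlier = scores + [60]   # add one unusually large score

print("Without outlier:", spread(scores))
print("With outlier:   ", spread(scores_with_outlier))
```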

Example

I notice that the scores for words with rare letters are more spread out than the scores of other words.

My evidence is that the IQR of the scores for words with rare letters is 7 points, which is 2 points more than the IQR for the scores of other words (5 points).

This means that in this sample there is not as much variation in the scores of words that don't have rare letters; the middle 50% of those scores are between 12 and 17 points. By comparison, having rare letters to use gives more variation in the scores.

This might mean that in the list of allowable Scrabble words, there is more variation in the scores of words with rare letters.

Shape & Skew

Look for evidence of skew - where the tail of the dot plot is more spread out in one direction.

A numerical measure of skew is how different the mean and the median are. The mean is pulled away from the median in the direction of the skew.

We can also compare distances between quartiles on either side.
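
Both checks can be sketched in a few lines of Python. This is only an illustration: the scores are made up, the function name is hypothetical, and the quartiles assume the "median of each half" rule. A mean above the median, or a longer distance from the median to the upper quartile than to the lower quartile, hints at right skew.

```python
from statistics import mean, median

def skew_evidence(values):
    """Return simple numbers that hint at the direction of skew."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lq = median(data[:half])             # lower quartile
    med = median(data)
    uq = median(data[half + n % 2:])     # upper quartile
    return {
        "mean_minus_median": mean(data) - med,  # positive hints at right skew
        "lq_to_median": med - lq,               # left side of the box
        "median_to_uq": uq - med,               # right side of the box
    }

# Made-up scores, for illustration only
rare_letter_scores = [14, 16, 17, 18, 19, 20, 21, 22, 24, 25, 31]

evidence = skew_evidence(rare_letter_scores)
print(evidence)
```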

It can be hard to be sure that a small amount of skew would also be seen in the population when the sample size is under 100.

Example

I notice that the scores for words with rare letters are slightly skewed to the right, while the scores for other words appear symmetrical.

My evidence is that the mean of the scores for words with rare letters is greater than the median of those scores (20.56 > 20 points), while the mean and median of the other scores are similar (14.79 ≈ 15).

This means that in this sample it may be slightly more likely to get some high-scoring words with rare letters. In this sample, they are words with TWO or more rare letters, like "QuinQuivalent" and "bliZZardy".

This might mean that in the list of allowable Scrabble words, there are a few high-scoring words with rare letters that right-skew the rare-letter word scores.

Unusual Features

If a dot plot is very clearly bimodal (separated into two quite distinct groups), this is worth discussing. Try to find another variable that might explain the grouping. (This could also be discussed under Shape, above.) Students often overstate an observation of a bimodal distribution: it is typically quite difficult to be sure of a bimodal pattern in the population based on a sample of fewer than 100.

One definition of an outlier is a point which is at least 1.5×IQR above the upper quartile or below the lower quartile. It should be glaringly obvious that it is an outlier, and you should be able to trace why that data point is so unusual.
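
The 1.5×IQR rule can be sketched as below. The scores are made up, the function name is hypothetical, and the quartiles assume the "median of each half" rule, so treat this as a way to check your working rather than a definitive method.

```python
from statistics import median

def find_outliers(values):
    """Return values at least 1.5 x IQR beyond the quartiles."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lq = median(data[:half])            # lower quartile
    uq = median(data[half + n % 2:])    # upper quartile
    iqr = uq - lq
    lower_limit = lq - 1.5 * iqr
    upper_limit = uq + 1.5 * iqr
    return [x for x in data if x <= lower_limit or x >= upper_limit]

# Made-up scores, for illustration only
scores = [9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 22, 30]

# Here LQ = 12.5 and UQ = 17.5, so IQR = 5 and the limits are 5 and 25
print("Possible outliers:", find_outliers(scores))
```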

Because the analysis of data should focus on describing tendency (what typically happens) and variation (how much difference there is within and between groups), outliers are not very important.

Don't discuss unusual features until after discussing the key features (centre, spread and shape).

Above, a bimodal sample, with two peaks separated by a gap. Below, that sample coloured by a third (categorical) variable, showing a possible explanation for this difference.

Example

I notice that the weights of the hatchbacks in my sample are bimodally distributed. My evidence is that there are two peaks, centred around 1450kg and around 1850kg. This might mean that there are two types of hatchback, smaller and larger types. I would need a larger sample to be sure that this bimodal pattern was also in the whole population.

Worksheets 4, 5, 6 and 7 give more practice on Analysis.

Inf Worksheet 4.pdf

Inf Worksheet 5.pdf

Inf Worksheet 6.pdf

Inf Worksheet 7.pdf
