Measures of Variation

Range

The range is the simplest measure of variation to find. It is simply the highest value minus the lowest value.

RANGE = MAXIMUM - MINIMUM

Example question 1: What is the range for the following set of numbers? 10, 99, 87, 45, 67, 43, 45, 33, 21, 7, 65, 98?

Step 1: Sort the numbers in order, from smallest to largest:

7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99

Step 2: Subtract the smallest number in the set from the largest number in the set:

99 – 7 = 92

The range is 92

That’s it!
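For readers who like to check by computer, the same two steps can be sketched in Python (this is not part of the original example; the built-in max and min functions do the work, so sorting is optional in code):

```python
# Range = maximum - minimum
data = [10, 99, 87, 45, 67, 43, 45, 33, 21, 7, 65, 98]

# max() and min() scan the list directly, so no sort is needed.
data_range = max(data) - min(data)
print(data_range)  # 99 - 7 = 92
```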

Importantly, since the range only uses the largest and smallest values, it is greatly affected by extreme values. In other words, the range is not a resistant statistic: a single outlier can change it dramatically.

Deviation Scores

The range only involves the smallest and largest numbers. It would be desirable to have a statistic that involves all of the data values. This would help us answer the question: How much variation is there in ALL of the data? To do this, we may want to start by looking at how much each point deviates from the mean, that is, how far a given point is from the average. For example, if the mean is 3 and the point of interest is 5, this point has a deviation score of 2 (5 - 3 = 2). We could do this for each of our data points and get a list of deviation scores. We could then sum up all of our deviation scores and end up with a single score that represents the total amount of deviation in our data. This is represented by the equation below:
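In standard notation, with x-bar for the mean, the sum of the deviation scores is:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
```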

However, there is a MAJOR problem with using the summation of deviation scores as a way to summarize variability. The problem is that the summation of deviation scores around the mean is always zero.

We will show this in an example below: Let's choose the following numbers:

1, 2, 3, 4, and 5

The average of these numbers is 3. Now we calculate the deviation scores:
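Working through each value, with the mean equal to 3:

```latex
1 - 3 = -2,\qquad 2 - 3 = -1,\qquad 3 - 3 = 0,\qquad 4 - 3 = 1,\qquad 5 - 3 = 2
```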

If we sum up our deviations (-2 + -1 + 0 + 1 + 2) we will get 0. Remember, this will always be the case. If you do not believe us, try it! Choose 5 numbers, calculate the mean, and then find the deviation scores around the mean. Finally, sum up the deviations. If you do this correctly, you will get 0!

The fact that deviation scores around the mean always sum to 0 is a problem because this gives us no useful information about our variability. To get around this problem, we need to find a way to manipulate our deviation scores so that they do not sum to zero, but rather to something meaningful. There are a couple of ways we could do this, but the conventional way in statistics is to square the deviations around the mean and then sum them up. This sounds fancy, but it is not difficult in practice.

The equation below represents us "squaring the deviations around the mean and then summing them up." To make this easier to communicate, statisticians call this term the "Sum of Squares." Let's calculate the Sum of Squares for our example above and see what value we get to summarize the spread of the data. (You will see, this time, the value will no longer be 0!)
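In standard notation, the Sum of Squares is:

```latex
SS = \sum_{i=1}^{n} (x_i - \bar{x})^2
```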

If we sum up our squared deviations (4 + 1 + 0 + 1 + 4) we get 10. This is an improvement upon our attempt to calculate the deviation around the mean because we now have a number (the Sum of Squares) that is not 0. In fact, the bigger the Sum of Squares (SS), the more variation there is in the data. This makes sense because if we had many numbers far away from the mean, these numbers would have big deviation scores. Regardless of whether these big deviation scores are negative or positive, squaring them produces even bigger, positive values, which makes our SS greater. Conversely, squaring a bunch of small deviation scores (which happens when there is little variation in our data) and summing them up results in a small SS.
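The same squared-deviation arithmetic can be sketched in Python (using the numbers from the example above; this code is illustrative, not part of the original lesson):

```python
data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)  # 3.0

# Square each deviation from the mean, then sum:
# (-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2 = 10
sum_of_squares = sum((x - mean) ** 2 for x in data)
print(sum_of_squares)  # 10.0
```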

However, the SS is still not a perfect way to summarize variability. While it is better than the summation of deviation scores, the SS is affected by sample size. That is, the more data points you have, the more deviation scores you will have to square (to calculate the SS), and therefore the bigger your SS will become. We do not want our measure of variability to become larger just because we have more observations of data. To solve this problem, we will look at a new equation called the variance. The variance uses the SS but is not impacted by sample size. We will explain why below.

Variance

The problem with the SS as a measure of variability is that it does not take into account how many data observations were used to obtain the sum. This means the more data values that are used, the larger our variability measure will appear. This is not good. In order to get around this, we divide the SS by the number of data observations. This creates "an average SS" which we call variance. We can think about the variance as the "average squared deviation from the mean".

Population Variance

See the formula below which calculates what we call the population variance. To calculate the population variance, we divide the SS by the number of values in the population.
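In standard notation, with the Greek letter mu for the population mean and N for the population size:

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{SS}{N}
```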

Sample Variance (Unbiased Estimate of the Population Variance)

One might expect the sample variance to simply be the population variance with the population mean replaced by the sample mean. However, that formula gives a biased estimate: because the sample mean is calculated from the same data, the deviations around it tend to be slightly too small, so dividing the SS by n systematically underestimates the population variance. To counteract this, the sum of the squared deviations is divided by one less than the sample size. The details are a bit complicated, but for the purposes of this course, you should know that when we calculate a sample variance, we divide the SS by n - 1 rather than by n.
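In standard notation, with x-bar for the sample mean and n for the sample size:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{SS}{n - 1}
```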

You might think that surely by this point we have an appropriate measure of variability, but there is still a slight issue when we use variance to describe our data. Recall that we use the SS in the numerator of the variance. We needed to use the SS because plain deviation scores sum to zero. However, since we used the SS in the numerator, our variance measurement is in squared units, rather than the original units of measurement. This makes it a little more difficult to interpret exactly what a variance is telling us about our data. Therefore, to resolve this last issue, we get the units back to those of the original data values by taking the square root of the variance.

Standard Deviation

Once we take the square root of the variance, we have the most commonly used summary of variation in data: the standard deviation. Below are the equations for the population standard deviation and sample standard deviation. Note that in the sample standard deviation, we divide by n - 1, just like we did in the sample variance. Just like we thought of the variance as the "average squared deviation from the mean," we can think of the standard deviation as, roughly, "the average deviation from the mean."

Population Standard Deviation

See the formula below which calculates what we call the population standard deviation. To calculate the population standard deviation, we take the square root of the population variance.
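In standard notation:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
```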

Sample Standard Deviation

See the formula below which calculates what we call the sample standard deviation. To calculate the sample standard deviation, we take the square root of the sample variance.
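In standard notation:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```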

Most of our data will fall between -1 and +1 standard deviations from the mean, and almost all of it will fall between -2 and +2 standard deviations from the mean. You will learn more about this in future chapters. What is important now is that you can summarize not only the shape of your data and its central tendency (e.g., the mean), but also the spread, or variability, in your data by using the standard deviation.

For example, you may have data with a mean of 8 and a standard deviation of 2. This means that the average score in your data is 8 and that the average distance a score is away from the mean is 2 (that is, the standard deviation). If, referring to the same distribution, someone asked you "what would be the score that is two standard deviations above the mean?", you could calculate this value. The mean is 8 and the standard deviation is 2, so 8 + 2 + 2 = 12. Therefore, 12 is the score that is two standard deviations above the mean. Likewise, you can calculate the score that is 3 standard deviations below the mean. Again, the mean is 8 and the standard deviation is 2, so 8 - 2 - 2 - 2 = 2 (we subtracted three 2s because we want the score 3 standard deviations below the mean). The score 3 standard deviations below the mean is 2. Knowing how to do these calculations will be important as we proceed in the course.
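Putting the whole chain together, the path from deviations to the sample standard deviation, plus the "mean of 8, standard deviation of 2" arithmetic above, can be sketched in Python. The standard library's statistics.stdev is used only as a cross-check; the variable names are our own:

```python
import statistics

data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n                         # 3.0

ss = sum((x - mean) ** 2 for x in data)      # Sum of Squares = 10.0
sample_variance = ss / (n - 1)               # divide by n - 1 for a sample: 2.5
sample_sd = sample_variance ** 0.5           # about 1.58

# Cross-check against the standard library's sample standard deviation.
assert abs(sample_sd - statistics.stdev(data)) < 1e-12

# The "mean of 8, standard deviation of 2" example from the text:
mean_score, sd_score = 8, 2
print(mean_score + 2 * sd_score)  # two SDs above the mean: 12
print(mean_score - 3 * sd_score)  # three SDs below the mean: 2
```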