Stats text 02

Measures of Middle and Spread

2.1 Measures of central tendency: mode, median, mean, midrange

Mode

The mode is the value that occurs most frequently in the data. Spreadsheet programs can determine the mode with the function MODE.

=MODE(data)

In the Fall of 2000 the statistics class gathered data on the number of siblings for each member of the class. One student was an only child and had no siblings. One student had 13 brothers and sisters. The complete data set is as follows:

0,1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13

The mode is 2 because 2 occurs more often than any other value. Where there is a tie there is no mode.

For the ages of students in that class

18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26

...there is no mode: there is a tie between 21 and 22, hence there is no single most frequent value. Spreadsheets will, however, usually report a mode of 21 in this case. Spreadsheets often select the first mode in a multi-modal tie.

If all values appear only once, then there is no mode. Spreadsheets will display #N/A or #VALUE to indicate an error has occurred - there is no mode. Do not put #N/A for the mode. When you see #N/A the answer is "No mode."

No mode is NOT the same as a mode of zero. A mode of zero means that zero is the most frequent data value. Do not put the number 0 (zero) for "no mode." An example of a mode of zero might be the number of children for students in statistics class.
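As a cross-check outside the spreadsheet, the mode rules above can be sketched in Python. This is only an illustrative sketch: the helper name single_mode is hypothetical, and it returns None where this text says "no mode."

```python
from statistics import multimode

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
ages = [18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26]


def single_mode(data):
    """Return the one most frequent value, or None when there is no mode.

    Follows this text's convention: a tie for most frequent means no mode.
    """
    most_frequent = multimode(data)  # every value tied for the highest count
    if len(most_frequent) == 1:
        return most_frequent[0]
    return None


print(single_mode(siblings))   # 2
print(multimode(ages))         # [21, 22], a tie
print(single_mode(ages))       # None, meaning "no mode"
print(single_mode([1, 2, 3]))  # None, every value appears only once
print(single_mode([0, 0, 1]))  # 0, a mode of zero, which is not "no mode"
```

Unlike the spreadsheet MODE function, this sketch does not pick the first value in a tie.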

Median

The median is the central (or middle) value in a sorted data set. If a number sits at the middle of a sorted data set, then it is the median. If the middle is between two numbers, then the median is half way between the two middle numbers.

For the sibling data...

0, 1, 2, 2, 2, 2, 2, 3, 3, 4, |4, 4|, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13

...the median is 4. There are 22 values, so the middle falls between the 11th and 12th values, and both are 4.

Remember that the data must be in order (sorted) before you can find the median. For the data 2, 4, 6, 8 the median is 5: (4+6)/2.

The median function in spreadsheets is MEDIAN.

=MEDIAN(data)
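The same calculation can be checked in Python. This is a minimal sketch using the standard library's median function in place of the spreadsheet MEDIAN function; the function sorts the data itself.

```python
from statistics import median

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

# 22 values: the middle falls between the 11th and 12th sorted values.
print(median(siblings))      # 4.0

# Unsorted input is fine here because median() sorts before finding the middle.
print(median([8, 2, 6, 4]))  # 5.0, half way between 4 and 6
```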

Mean (average)

The mean, also called the arithmetic mean and also called the average, is calculated mathematically by adding the values and then dividing by the number of values (the sample size n).

If the mean is the mean of a population, then it is called the population mean μ. The letter μ is a Greek lower case "m" and is pronounced "mu."

population mean = sum of the population data ÷ the population size

If the mean is the mean of a sample, then it is the sample mean x̄. The symbol x̄ is pronounced "x bar."

sample mean = sum of the sample data ÷ the sample size n

The sum of the data ∑ x can be determined using the function =SUM(data). The sample size n can be determined using =COUNT(data). Thus =SUM(data)/COUNT(data) will calculate the mean. There is also a single function that calculates the mean. The function that directly calculates the mean is AVERAGE

=AVERAGE(data)

Because =AVERAGE(data) is shorter than =SUM(data)/COUNT(data), the AVERAGE function is normally used to calculate the mean.

Resistant measures: A resistant measure is one that is not influenced by extremely high or extremely low data values. The median tends to be more resistant than the mean.
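The two ways of calculating the mean, and the idea of resistance, can be illustrated with a short Python sketch. The extra value of 40 siblings is a made-up extreme value added only to show the effect; it is not part of the class data.

```python
from statistics import mean, median

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

by_hand = sum(siblings) / len(siblings)  # like =SUM(data)/COUNT(data)
direct = mean(siblings)                  # like =AVERAGE(data)
print(by_hand, direct)

# Hypothetical extreme value, added only to illustrate resistance.
with_extreme = siblings + [40]

mean_shift = mean(with_extreme) - mean(siblings)
median_shift = median(with_extreme) - median(siblings)
print(mean_shift)    # the mean is pulled upward by about 1.5
print(median_shift)  # the median does not move for this data
```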

Population mean and sample mean

If the mean is measured using the whole population then this would be the population mean. If the mean was calculated from a sample then the mean is the sample mean. Mathematically there is no difference in the way the population and sample mean are calculated.

Midrange

The midrange is the midway point between the minimum and the maximum in a set of data.

To calculate the minimum and maximum values, spreadsheets use the minimum value function MIN and maximum value function MAX.

=MIN(data)

=MAX(data)

The MIN and MAX functions can take a list of comma separated numbers or a range of cells in a spreadsheet. If the data is in cells A2 to A42, then the minimum and maximum can be found from:

=MIN(A2:A42)

=MAX(A2:A42)

The midrange can then be calculated from:

midrange = (maximum + minimum)/2

In a spreadsheet use the following formula:

=(MAX(data)+MIN(data))/2

Do not forget the parentheses!
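A short Python sketch shows the midrange for the sibling data and what goes wrong without the parentheses. The function name midrange is chosen here for illustration.

```python
siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]


def midrange(data):
    """Midway point between the minimum and the maximum."""
    return (max(data) + min(data)) / 2


print(midrange(siblings))  # 6.5

# Without parentheses the division happens first: max + (min / 2).
wrong = max(siblings) + min(siblings) / 2
print(wrong)               # 13.0, which is not the midrange
```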

2.2 Differences in the Distribution of Data

In addition to measures of the middle, measurements of the spread of data values away from the middle are important in statistical analyses. Spread away from the middle usually involves numeric data values. Perhaps the simplest measures of spread away from the middle involve the smallest value, the minimum, and the largest value, the maximum.

Range

The range is the maximum data value minus the minimum data value. The MIN function returns the smallest numeric value in a data set. The MAX function returns the largest numeric value in a data set. The difference between the two is the range.

=MAX(data)−MIN(data)

The range is a useful basic statistic that provides information on the distance between the most extreme values in the data set.

The range does not show if the data is evenly spread out across the range or crowded together in just one part of the range. The way in which the data is either spread out or crowded together in a range is referred to as the distribution of the data. One of the ways to understand the distribution of the data is to calculate the position of the quartiles and make a chart based on the results.
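To see why the range alone is not enough, consider the following Python sketch. Both data sets are hypothetical and were made up so that they share the same minimum and maximum.

```python
# Two made-up data sets with the same minimum (0) and maximum (12).
spread_out = [0, 2, 4, 6, 8, 10, 12]
crowded = [0, 0, 1, 1, 1, 2, 12]


def data_range(data):
    """Maximum minus minimum, like =MAX(data)-MIN(data)."""
    return max(data) - min(data)


print(data_range(spread_out), data_range(crowded))  # 12 12

# Count how many values sit in the lower half of the range (at or below 6).
lower_half_spread = sum(1 for x in spread_out if x <= 6)
lower_half_crowded = sum(1 for x in crowded if x <= 6)
print(lower_half_spread, lower_half_crowded)        # 4 6
```

The ranges match, yet the second data set has almost all of its values crowded near the minimum.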

Percentiles, Quartiles, Box and Whisker charts

The median is the value that is the middle value in a sorted list of values. At the median 50% of the data values are below and 50% are above. This is also called the 50th percentile for being 50% of the way "through" the data.

If one starts at the minimum, 25% of the way "through" the data, the point at which 25% of the values are smaller, is the 25th percentile. The value that is 25% of the way "through" the data is also called the first quartile.

Moving on "through" the data to the median, the median is also called the second quartile.

Moving past the median, 75% of the way "through" the data is the 75th percentile also known as the third quartile.

Note that the 0th quartile is the minimum and the fourth quartile is the maximum.

Spreadsheets can calculate the first, second, and third quartile for data using a function, the quartile function.

=QUARTILE(data,type)

Data is a range with data. Type represents the type of quartile: 0 = 0% or minimum (zeroth quartile), 1 = 25% or first quartile, 2 = 50% or second quartile (also the median), 3 = 75% or third quartile, and 4 = 100% or maximum (fourth quartile). Thus if data is in the cells A1:A20, the first quartile could be calculated using:

=QUARTILE(A1:A20,1)

There are some complex subtleties to calculating the quartile. For a full and thorough treatment of the subject refer to Eric Langford's Quartiles in Elementary Statistics, Journal of Statistics Education Volume 14, Number 3 (2006).

The minimum, first quartile, median, third quartile, and maximum provide a compact and informative five number summary of the distribution of a data set.
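The five number summary for the sibling data can be sketched in Python. This sketch assumes that the "inclusive" method of the standard library's quantiles function interpolates the same way as the spreadsheet QUARTILE function; other quartile methods can give slightly different values, as the Langford article discusses.

```python
from statistics import quantiles

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

# n=4 cuts the data into quarters and returns the three cut points.
# Assumption: method="inclusive" matches the spreadsheet QUARTILE function.
q1, q2, q3 = quantiles(siblings, n=4, method="inclusive")

five_number_summary = (min(siblings), q1, q2, q3, max(siblings))
print(five_number_summary)  # (0, 2.0, 4.0, 7.75, 13)
```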

InterQuartile Range

The InterQuartile Range (IQR) is the range between the first and third quartile:

=QUARTILE(Data,3) − QUARTILE(Data,1)

There are some subtleties to calculating the IQR for sets with even versus odd sample sizes, but this text leaves those details to the spreadsheet software functions.

Quartiles, Box and Whisker plots

The above quartile information as pure numbers is very abstract and hard to visualize. A box and whisker plot takes the above quartile information and plots a chart based on the quartiles. The chart below displays four different data sets. The first data set consists of a single value repeated, the second data set consists of values spread uniformly from the minimum to the maximum (uniform), the third data set has values concentrated near the middle of the range (peaked symmetric), and the last data set has most of the values at the minimum or maximum (bimodal).

Box plots display how the data is spread across the range based on the quartile information above.

A box and whisker plot is built around a box that runs from the value at the 25th percentile (first quartile) to the value at the 75th percentile (third quartile). The length of the box spans the distance from the value at the first quartile to the third quartile; this distance is called the Inter-Quartile Range (IQR). A line is drawn inside the box at the location of the 50th percentile. The 50th percentile is also known as the second quartile and is the median for the data. Half the scores are above the median, half are below the median. Note that the 50th percentile is the median, not the mean.

The basic box plot described above has lines that extend from the first quartile down to the minimum value and from the third quartile to the maximum value. These lines are called "whiskers" and end with a cross-line called a "fence".

Boxplot outliers

If, however, the minimum is more than 1.5 × IQR below the first quartile, then the lower fence is put at 1.5 × IQR below the first quartile and the values below the fence are marked with a round circle. These values are referred to as potential outliers - the data is unusually far from the median in relation to the other data in the set.

Likewise, if the maximum is more than 1.5 × IQR beyond the third quartile, then the upper fence is located at 1.5 × IQR above the 3rd quartile. The maximum is then plotted as a potential outlier along with any other data values beyond 1.5 × IQR above the 3rd quartile.

There are actually two types of outliers. Potential outliers lie between 1.5 × IQR and 3.0 × IQR beyond the first or third quartile. Extreme outliers lie more than 3.0 × IQR beyond the first or third quartile. In some statistical programs potential outliers are marked with a circle colored in with the color of the box. Extreme outliers are marked with an open circle - a circle with no color inside.
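The outlier rules can be sketched in Python. The function name classify_outliers and the second data set are hypothetical, and the quartiles again assume the "inclusive" method matches the spreadsheet QUARTILE function. Distances are measured from the first and third quartiles.

```python
from statistics import quantiles


def classify_outliers(data):
    """Return (potential, extreme) outliers using the 1.5 and 3.0 x IQR rules."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    potential = [x for x in data
                 if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in data if x < outer_low or x > outer_high]
    return potential, extreme


siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
made_up = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 50]  # hypothetical data

print(classify_outliers(siblings))  # ([], []), no outliers
print(classify_outliers(made_up))   # ([18], [50])
```

For the made-up data the quartiles are 3.5 and 8.5, so the IQR is 5. Values above 16 are potential outliers and values above 23.5 are extreme outliers.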

An example with hypothetical data sets is used to illustrate box plots. The data consists of two samples. Sample one (s1) is a uniform distribution and sample two (s2) is a highly skewed distribution.

Box and whisker plots, variants, with ability to show the mean

The online tool BoxPlotR generates box plots, including outliers. The first row of the data should be the data label, the variable to be plotted. Data can be copied and pasted into the second tab using the Paste data option. If copying and pasting multiple columns from a spreadsheet, preset the separator to Tab. For advanced users, notches for the 95% confidence interval for the median can be displayed. The plot can also display the mean and the 95% confidence interval for the mean. The tool is also able to generate violin and bean plots, and change whisker definitions from Tukey to Spear or Altman for advanced users. If the tool grays out, reload the page and recopy the data.

The box and whisker plot is a useful tool for exploring data and determining whether the data is symmetrically distributed, skewed, and whether the data has potential outliers - values far from the rest of the data as measured by the InterQuartile Range. The distribution of the data often impacts what types of analysis can be done on the data.

The distribution is also important to determining whether a measurement that was done is performing as intended. For example, in education a "good" test is usually one that generates a symmetric distribution of scores with few outliers. A highly skewed distribution of scores would suggest that the test was either too easy or too difficult. Outliers would suggest unusual performances on the test.

2.3 Standard Deviation

The range is a calculation of the "distance" from the minimum to the maximum and does not "capture" where the rest of the data is located between that minimum and maximum. Is the data scattered evenly from the minimum to the maximum? Or is the data concentrated around the mean? Or is the data concentrated at the minimum, maximum, or both (with no data in the middle)? The range does not reveal the structure of the data - how far the data is from the mean, for example.

To capture the spread of the data we use a measure related to the average distance of the data from the mean. We call this the standard deviation. If we have a population, we report this average distance as the population standard deviation. If we have a sample, then our average distance value may underestimate the actual population standard deviation. As a result the formula for sample standard deviation adjusts the result mathematically to be slightly larger. For our purposes these numbers are calculated using spreadsheet functions and in this course we always use the sample standard deviation function.

Sample standard deviation

In spreadsheets there is a single function that performs all of the above operations and calculates the sample standard deviation sx, the STDEV function. The STDEV function is the function that will be used in this course.

=STDEV(data)

In this text the symbol for the sample standard deviation is sx.

In this text the symbol for the population standard deviation is the Greek lower case "s": σ.

The symbol sx usually refers to the standard deviation of single variable x data. If there is y data, the standard deviation of the y data is sy. Other symbols that are used for standard deviation include s and σx. Some calculators use the unusual and confusing notations σxn−1 and σxn for sample and population standard deviations.

In this class we always use the sample standard deviation in our calculations. The sample standard deviation is calculated in a way such that the sample standard deviation is slightly larger than the result of the formula for the population standard deviation. This adjustment is needed because a population tends to have a slightly larger spread than a sample. There is a greater probability of outliers in the population data.
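The difference between the two standard deviations can be seen in a short Python sketch. The text above does not spell out the formulas, so the sketch relies on the standard definitions: the population formula divides the sum of squared distances from the mean by n, while the sample formula divides by n − 1. The data values are the small example set 2, 4, 6, 8.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [2, 4, 6, 8]
n = len(data)
m = mean(data)

# Sum of the squared distances of each value from the mean.
sum_sq = sum((x - m) ** 2 for x in data)

sample_sd = sqrt(sum_sq / (n - 1))  # sample formula, like =STDEV(data)
population_sd = sqrt(sum_sq / n)    # population formula

print(sample_sd, stdev(data))       # both about 2.58
print(population_sd, pstdev(data))  # both about 2.24
```

Dividing by the smaller number n − 1 is what makes the sample standard deviation slightly larger.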

Coefficient of variation CV

The Coefficient of Variation is calculated by dividing the standard deviation (usually the sample standard deviation) by the mean.

=STDEV(data)/AVERAGE(data)

Note that the CV can be expressed as a percentage: Group 2 has a CV of 52% while group 3 has a CV of 69%. A deviation of 3.46 is large for a mean of 5 (3.46/5 = 69%) but would be small if the mean were 50 (3.46/50 = 7%). So the CV can tell us how important the standard deviation is relative to the mean.
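The arithmetic above can be reproduced in Python. The function name coefficient_of_variation is chosen here for illustration, and the sibling data is used only as a convenient example.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return stdev(data) / mean(data)


# The same deviation of 3.46 against two different means.
print(f"{3.46 / 5:.0%}")   # 69%
print(f"{3.46 / 50:.0%}")  # 7%

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
cv_siblings = coefficient_of_variation(siblings)
print(f"{cv_siblings:.0%}")
```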

Rules of thumb regarding spread

As an approximation, the standard deviation for data that has a symmetrical, heap-like distribution is roughly one-quarter of the range. If given only minimum and maximum values for data, this rule of thumb can be used to estimate the standard deviation.

At least 75% of the data will be within two standard deviations of the mean, regardless of the shape of the distribution of the data.

At least 89% of the data will be within three standard deviations of the mean, regardless of the shape of the distribution of the data.

If the shape of the distribution of the data is a symmetrical heap, then about 95% of the data will be within two standard deviations of the mean.

Data beyond two standard deviations away from the mean is considered "unusual" data.
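These rules of thumb can be checked against the sibling data with a Python sketch. The function name fraction_within is hypothetical. The sibling data is not a symmetrical heap, so the one-quarter-of-the-range estimate is only rough here.

```python
from statistics import mean, stdev

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]


def fraction_within(data, k):
    """Fraction of the data within k sample standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    inside = [x for x in data if abs(x - m) <= k * s]
    return len(inside) / len(data)


within_two = fraction_within(siblings, 2)
within_three = fraction_within(siblings, 3)
print(within_two)    # 21 of 22 values, comfortably above 0.75
print(within_three)  # 1.0, all of the values

# Rough estimate of the standard deviation from the range alone.
estimate = (max(siblings) - min(siblings)) / 4
print(estimate, stdev(siblings))
```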

Levels of Measurement and their interactions with statistics of middle and spread

Nominal level of measurement

Has a sample size n
May have a mode

Ordinal level of measurement

Has a sample size n
May have a mode
If the rank order is a scale, then minimum, median, and maximum exist but might not be numeric values

Interval level of measurement

Has a sample size n
May have a mode
Has a median and the median is considered the optimal measure of middle
Has a minimum, maximum, range, and midrange
Depending on the data, may have a "meaningful" mean and standard deviation. Maybe.

Ratio level of measurement

Has a sample size n
May have a mode
Has a median
Has a mean and the mean is usually considered the optimal measure of the middle
Has a minimum, maximum, range, and midrange
Has a standard deviation sx and the standard deviation is usually considered the optimal measure of spread
Has a coefficient of variation

2.4 Variables

A variable is defined as any measurement that can take on different data values. Variables are named containers for data values. In statistics variables are often words such as marble color, leaflet length, or marble position.

In a spreadsheet, variable names are usually put in row one with the data in the rows below row one. A variable can also have units of measure.

Variables are said to be at the type and level of measurement of the data that the variable contains. Thus variables can be qualitative or quantitative, discrete or continuous. Variables can be at the nominal, ordinal, interval, or ratio level of measurement.

Discrete Variables

When there are a countable number of values that result from observations, we say the variable producing the results is discrete. The nominal and ordinal levels of measurement almost always measure a discrete variable.

The following examples are typical values for discrete variables:

true or false (2 values)
yes or no (2 values)
strongly agree | agree | neutral | disagree | strongly disagree (5 values)

The last example above is a typical result of a type of survey called a Likert survey, first developed by Rensis Likert in 1932.

When reporting the "middle value" for a discrete distribution at the ordinal level it is usually more appropriate to report the median.

Note that if the variable measures only the nominal level of measurement, then only the mode is likely to have any statistical "meaning." The nominal level of measurement has no "middle" per se.

There are instances in which looking at the mean value and standard deviation is useful for looking at comparative performance, but it is not a recommended practice to use the mean and standard deviation on a discrete distribution. That said, there are data sets where extracting relative meaning from interval level data requires using means and standard deviations.

For example, the number of people in a car commuting to work is always an integer value. For 248 cars that passed the college during morning commutes the minimum was one (the driver only) and the maximum was nine. Note that zero is not a possible value here on Pohnpei.

Continuous Variables

When there is an infinite (or uncountable) number of values that may result from observations, we say that the variable is continuous. Physical measurements such as height, weight, speed, and mass are considered continuous measurements. Bear in mind that our measurement device might be accurate to only a certain number of decimal places. The variable is continuous because better measuring devices should produce more accurate results.

The following examples are continuous variables:

distance
time
mass
length
height
depth
weight
speed
body fat

When reporting the "middle value" for a continuous distribution it is most often appropriate to report the mean and standard deviation.

2.5 Z-score: A Measure of Relative Standing

Z-scores are a useful way to compare or combine scores from data that has different means and standard deviations. Z-scores are an application of the above measures of middle and spread.

Remember that the mean is the result of adding all of the values in the data set and then dividing by the number of values in the data set. The words mean and average are used interchangeably in statistics.

Recall also that the sample standard deviation can be thought of as a mathematical calculation of the average distance of the data from the mean of the data. Note that although I use the words average and mean, the sentence could also be written "the mean distance of the data from the mean of the data."

Z-Scores

Z-scores simply indicate how many standard deviations away from the mean a particular data value is. This is termed "relative standing" as it is a measure of where in the data the particular data value is located relative to the mean as counted in units of standard deviations. The formula for calculating the z-score is:

=(a single data value - mean) ÷ standard deviation

Using the sample mean x̄ and sample standard deviation sx, the formula for a data value x is:

=(x-x̄)/sx

Note the parentheses! When typing in a spreadsheet do not forget the parentheses. Using spreadsheet functions the formula becomes:

=(value−AVERAGE(data))/STDEV(data)

Suppose that a data set has a mean of 50 and a standard deviation sx of 10. The z-score for a value of 65 would be =(65-50)/10 or +1.5. The z-score for a value of 30 would be =(30-50)/10 or -2.0. Again, do not forget the parentheses. The subtraction in the numerator must occur before the division. To force the subtraction to occur ahead of the division, parentheses must be used.
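The two calculations above, and the effect of leaving out the parentheses, can be sketched in Python. The function name z_score is chosen here for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


print(z_score(65, 50, 10))  # 1.5
print(z_score(30, 50, 10))  # -2.0

# Without parentheses the division happens first: 65 - (50 / 10).
wrong = 65 - 50 / 10
print(wrong)                # 60.0, which is not a z-score
```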

Data that is two standard deviations below the mean will have a z-score of −2, data that is two standard deviations above the mean will have a z-score of +2. Data beyond two standard deviations away from the mean will have z-scores below −2 or above 2. A data value that has a z-score below −2 or above +2 is considered an unusual value, an extraordinary data value. These values may also be outliers on a box plot depending on the distribution. Box plot outliers and extraordinary z-scores are two ways to characterize unusually extreme data values. There is no simple relationship between box plot outliers and extraordinary z-scores.

Why z-scores?

Suppose a test has a mean score of 10 and a standard deviation of 2 with a total possible of 20. Suppose a second test has the same mean of 10 and total possible of 20 but a standard deviation of 8.

On the first test a score of 18 would be rare, an unusual score. On the first test at least 89% of the students would have scored between 4 and 16 (three standard deviations below the mean and three standard deviations above the mean).

On the second test a score of 18 would only be one standard deviation above the mean. This would not be unusual, the second test had more spread.

Adding two scores of 18 and saying the student had a score of 36 out of 40 devalues what is a phenomenal performance on the first test.

Converting to z-scores, the relative strength of the performance on test one is valued more strongly. The z-score on test one would be (18-10)/2 = 4, while on test two the z-score would be (18-10)/8 = 1. The unusually outstanding performance on test one is now reflected in the sum of the z-scores where the first test contributes a sum of 4 and the second test contributes a sum of 1.

When values are converted to z-scores, the mean of the z-scores is zero. A student who scored a 10 on either of the tests above would have a z-score of 0. In the world of z-scores, a zero is average!
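That the mean of the z-scores is zero can be checked in Python. This sketch converts the sibling data from earlier in the chapter to z-scores using the sample mean and sample standard deviation; any data set with some spread would serve equally well.

```python
from statistics import mean, stdev

siblings = [0, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]

m, s = mean(siblings), stdev(siblings)
z_scores = [(x - m) / s for x in siblings]

z_mean = mean(z_scores)
z_sd = stdev(z_scores)
print(z_mean)  # zero, apart from tiny rounding error
print(z_sd)    # one, apart from tiny rounding error
```

The standard deviation of the z-scores comes out as one because each z-score is already measured in units of standard deviations.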

Z-scores also adjust for different means due to differing total possible points on different tests.

Consider again the first test that had a mean score of 10 and a standard deviation of 2 with a total possible of 20. Now consider a third test with a mean of 100 and standard deviation of 40 with a total possible of 200. On this third test a score of 140 would be high, but not unusually high.

Adding the scores and saying the student had a score of 158 out of 220 again devalues what is a phenomenal performance on test one. The score on test one is dwarfed by the total possible on test three. Put another way, the 18 points of test one are contributing only 11% of the 158 score. The other 89% is the test three score. We are giving an eight-fold greater weight to test three. The z-scores of 4 and 1 would add to five. This gives equal weight to each test and the resulting sum of the z-scores reflects the strong performance on test one with an equal weight to the ordinary performance on test three.
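The comparison of test one and test three can be laid out in a Python sketch. The variable names are chosen here for illustration; the scores, means, and standard deviations are the ones given above.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


# (score, mean, standard deviation) for each test, from the text above.
test_one = (18, 10, 2)
test_three = (140, 100, 40)

raw_total = test_one[0] + test_three[0]
share_of_test_one = test_one[0] / raw_total
print(raw_total)                   # 158
print(f"{share_of_test_one:.0%}")  # 11%

z_one = z_score(*test_one)
z_three = z_score(*test_three)
print(z_one, z_three)              # 4.0 1.0
print(z_one + z_three)             # 5.0
```

In the raw total the score on test one nearly disappears, while in the sum of the z-scores it supplies four of the five units.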

Z-scores only provide the relative standing. If a test is given again and all students who take the test do better the second time, then the mean rises and like a tide "lifts all the boats equally." Thus an individual student might do better, but because the mean rose, their z-score could remain the same. This is also the downside to using z-scores to compare performances between tests - changes in "sea level" are obscured. One would have to know the mean and standard deviation and whether they changed to properly interpret a z-score.