A box plot (also called a box-and-whisker plot) is another type of graph used to display data. A box plot divides a set of numerical data into quarters. It shows how the data are dispersed around the median, but does not show specific values in the data. It does not show a distribution in as much detail as a stem plot or a histogram does, but it clearly shows where the data are located. This type of graph is often used when the number of data values is large or when two or more data sets are being compared. The center and spread of the distribution are easy to read from the graph: you can see the range of the values as well as how these values are distributed around the middle value. The smaller the box, the more tightly the middle half of the data values cluster around the median. The shape of the box plot will give you a general idea of the shape of the distribution, but a histogram or stem plot will do this more accurately. Any outliers will show up as long whiskers. The box in the box plot contains the middle 50% of the data, and each 'whisker' contains 25% of the data.
In order to divide into fourths, it is necessary to find five numbers. This list of five values is called the five number summary. The numbers in the list are {minimum value, Quartile 1, Median, Quartile 3, maximum value}. We have already learned how to find the median of a set of numbers (put in order and find the middle value), and the minimum and maximum are the smallest and largest numbers. Now we will learn how to find the quartiles.
Quartiles
The first step is to list all of the numbers in order from least to greatest. The minimum and maximum are now on the ends of the list and we can count in to find the median--circle these three values. Finding the quartiles is just like finding the median. Quartile 1 is the 'median' of all of the values to the left of the median (do NOT include the median itself). Quartile 3 is the 'median' of all of the values to the right of the median (do not include the median).
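As a rough sketch of this procedure, the short Python function below sorts the values, finds the median, and then takes the 'median' of each half, leaving the median itself out when the number of values is odd. The function names and the sample values are made up for illustration and are not part of this lesson's data. The sketch assumes the list has at least two values.

```python
def median_of_sorted(sorted_values):
    """Return the middle value, or the mean of the two middle values."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the method described above."""
    data = sorted(values)              # list from least to greatest
    n = len(data)
    half = n // 2
    lower = data[:half]                # values to the left of the median
    if n % 2 == 1:
        upper = data[half + 1:]        # odd count: skip the median itself
    else:
        upper = data[half:]            # even count: median is not a data value
    return (data[0], median_of_sorted(lower), median_of_sorted(data),
            median_of_sorted(upper), data[-1])


# Hypothetical values, chosen only to illustrate the method.
print(five_number_summary([7, 1, 3, 12, 5, 9, 10]))
```

Note that calculators and software packages sometimes use slightly different quartile rules, so their Q1 and Q3 may not always match the values found with this method.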
Constructing a Box Plot
Now list the five number summary in order {min, Q1, Med, Q3, max}. The next step is to mark an axis that covers the entire range of the data. Mark the numbers along the axis before you make the box plot, so that the resulting plot shows the shape of the data. The last step is to place a dot above the axis for each of the 5 numbers from the five number summary, and then to make a 'box' from the second dot to the fourth dot, mark a line through the middle dot to show the median, and mark 'whiskers' from the box out to the first and fifth dots.
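Box plots are normally drawn by hand, on a calculator, or with graphing software. Purely as an illustration of how the five numbers map onto an axis, the sketch below draws a one-line text version. The function name, the drawing characters, and the five number summary used here are all assumptions made for this example. It requires the maximum to be larger than the minimum.

```python
def ascii_box_plot(summary, width=37):
    """Draw a one-line text box plot from (min, Q1, median, Q3, max)."""
    lo, q1, med, q3, hi = summary
    if hi <= lo:
        raise ValueError("maximum must be greater than minimum")
    scale = (width - 1) / (hi - lo)    # characters per data unit
    pos = [round((v - lo) * scale) for v in summary]

    line = [" "] * width
    for i in range(pos[0], pos[4] + 1):
        line[i] = "-"                  # whiskers span min to max
    for i in range(pos[1], pos[3] + 1):
        line[i] = "="                  # box spans Q1 to Q3
    line[pos[0]] = "|"                 # minimum
    line[pos[4]] = "|"                 # maximum
    line[pos[1]] = "["                 # Q1
    line[pos[3]] = "]"                 # Q3
    line[pos[2]] = "M"                 # median
    return "".join(line)


# Hypothetical five number summary, for illustration only.
print(ascii_box_plot((2, 5, 8, 12, 20)))
```

In the printed line, the longer right-hand whisker shows that the top 25% of these made-up values is more spread out than the bottom 25%.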
Example 1
You have a summer job working at Paddy’s Pond, a recreational fishing spot where children can go to catch salmon that have been raised in a nearby fish hatchery and then transferred into the pond. The cost of fishing depends upon the length of the fish caught ($0.75 per inch). Your job is to transfer 15 fish into the pond three times a day. Before the fish are transferred, you must measure the length of each one and record the results. Below are the lengths (in inches) of the first 15 fish you transferred to the pond. Calculate the five number summary, and construct a box plot for the lengths of these fish.
Range
We have already learned how to find the range of a set of data. The range represents the entire spread of all of the data.
The formula for calculating the range is:
max - min = range
Interquartile Range
The quartiles give us one more measure of spread called the interquartile range. The interquartile range (IQR) is the range between the lower and upper quartile. To find the IQR, subtract the quartile 1 value from the quartile 3 value (Q3 - Q1 = IQR). The IQR represents the spread, or range, of the middle 50% of the data. The IQR is a measure of spread that is used when the median is the measure of central tendency.
The formula for calculating the IQR is:
Q3 - Q1 = IQR
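Both measures of spread can be read directly from a five number summary. The sketch below shows the two subtractions. The function name and the summary values are hypothetical and are used only to illustrate the formulas.

```python
def spread_measures(summary):
    """Return the range and IQR from (min, Q1, median, Q3, max)."""
    minimum, q1, median, q3, maximum = summary
    return {
        "range": maximum - minimum,    # max - min = range
        "iqr": q3 - q1,                # Q3 - Q1 = IQR
    }


# Hypothetical five number summary, for illustration only.
print(spread_measures((2, 5, 8, 12, 20)))
```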
Standard Deviation
Another measure of spread that is used in statistics is called the standard deviation. The standard deviation measures the spread around the mean. This value is more difficult to calculate than range or IQR, but the formula used takes all of the data values in the distribution into account. Standard deviation is the appropriate measure of spread when the mean is the measure of center. However, the standard deviation is easily affected by outliers or skewness because every value is calculated in the formula. The symbol for standard deviation of a sample is s (on the graphing calculators it is Sx) and for a population it is σ (sigma).
The standard deviation can be any number zero or greater. It will only be equal to zero if there is no spread (i.e. all values are exactly the same). The more spread out the data is, the larger the standard deviation will be. The standard deviation is most appropriate when you have a very symmetrical, bell-shaped distribution called a normal distribution. We will study this type of distribution in unit 7.
Which Numerical Summary Should We Use?
We have learned several statistics that are measures of central tendency and several that are measures of spread. How do we know which ones to use? The mean and standard deviation go together, and the median goes with the IQR (or range). The most important thing to remember is that the mean and the standard deviation are both affected by outliers and by skewness in a distribution. So if either of these is present, then the mean and standard deviation are not appropriate. However, it is always an option, and often interesting, to calculate all of the statistics and compare them to one another. The general guidelines are:
Symmetric distribution with no outliers: use the mean and the standard deviation.
Skewed distribution or outliers present: use the median and the IQR.
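To see why outliers matter, the sketch below compares the same small data set with and without one extreme value, using Python's standard statistics module. The two data sets are invented for this illustration.

```python
import statistics

# Hypothetical data: identical except for the last value.
without_outlier = [10, 12, 13, 14, 16]
with_outlier = [10, 12, 13, 14, 61]

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)     # sample standard deviation
    median = statistics.median(data)
    print(label, "mean:", mean, "std dev:", round(std_dev, 2),
          "median:", median)
```

The single extreme value pulls the mean from 13 to 22 and makes the standard deviation many times larger, while the median stays at 13.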
How to Calculate the Standard Deviation With the Formula
In order to calculate the standard deviation you must have all of the values. Then you follow these steps:
Calculate the mean of the values.
Subtract the mean from each data value. These are the individual deviations.
Each of these deviations is squared.
All of the squared deviations are added up.
This total of the squared deviations is divided by one less than the number of deviations. This is the variance.
Take the square root of the variance. This is the standard deviation.
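Written as a single formula, the six steps above amount to the expression below. The notation is an assumption added here for illustration: the x_i are the individual data values, x-bar is their mean, n is the number of values, and s is the sample standard deviation.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}
```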
As you can probably tell, this formula is very time consuming when you have a large set of data. Also, it is easy to make a mistake in your calculations. We will show the process with a small set of data, but generally we will use our calculator to find the standard deviation. See the appendix for the calculator instructions on how to do this.
Example 2
There are five teenage girls on Buhl Street whom the Millers often have babysit their three rambunctious sons. Their ages are 12, 15, 14, 17, and 19 years old. Find the mean and standard deviation for the ages of the Millers' babysitters.
Solution
Calculate the mean of the values: (12 + 15 + 14 + 17 + 19) / 5 = 77 / 5 = 15.4
Subtract the mean from each data value. These are the individual deviations: -3.4, -0.4, -1.4, 1.6, 3.6
Each of these deviations is squared: 11.56, 0.16, 1.96, 2.56, 12.96
All of the squared deviations are added up: 11.56 + 0.16 + 1.96 + 2.56 + 12.96 = 29.2
This total of the squared deviations is divided by one less than the number of deviations. This is the variance: 29.2 / 4 = 7.3
Take the square root of the variance. This is the standard deviation: √7.3 ≈ 2.7019
The mean age of the Miller family's babysitters is 15.4 years old and the standard deviation is 2.7019 years.
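As a check on this result, the sketch below follows the same six steps in Python using the ages from this example. The function name is made up for illustration.

```python
import math


def sample_std_dev(values):
    """Follow the six steps and return (mean, variance, standard deviation)."""
    mean = sum(values) / len(values)               # step 1: the mean
    deviations = [x - mean for x in values]        # step 2: deviations
    squared = [d ** 2 for d in deviations]         # step 3: square each one
    total = sum(squared)                           # step 4: add them up
    variance = total / (len(values) - 1)           # step 5: divide by n - 1
    return mean, variance, math.sqrt(variance)     # step 6: square root


ages = [12, 15, 14, 17, 19]
mean, variance, std_dev = sample_std_dev(ages)
print(round(mean, 1), round(variance, 1), round(std_dev, 4))
```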
The standard deviation is tedious to calculate. For any problem where you are asked to calculate the standard deviation, you may use your calculator or a computer to find it.
Example 3
After one month of growing, the heights of 30 parsley seed plants were measured and recorded. The measurements (in inches) are shown in the table below.
a) Calculate the five number summary and construct a box plot to represent the data.
b) Describe the distribution.
c) Calculate the mean and standard deviation.
d) Calculate the median and IQR.
Solution
a) five number summary and box plot:
order the values-- The data organized from smallest to largest is shown in the table below. (You could use your calculator to quickly sort these values)
5# summary-- This time there is an even number of data values, so the median is the mean of the two middle values (the 15th and 16th). Because the median falls between two data values, every value belongs to one of the halves, and each half contains 15 values. The median of the lower half is the number in the 8th position, which is 17. The median of the upper half is the number in the 23rd position (the 8th from the top), which is 37. The smallest number is 6 and the largest number is 49.
5# summary = {6, 17, 26, 37, 49} (all are inches)
b) describe--don't forget your S.O.C.C.S! [Figure7]
The heights of these parsley plants ranged from 6 inches to 49 inches after one month. The distribution is very symmetrical and does not contain any outliers. The median height for these parsley plants was 26 inches tall. The middle 50% of the plants were all between 17 inches and 37 inches tall.
c) The mean and standard deviation were calculated using the TI-84+. Both results are in inches.
d) The median is part of the five number summary: median = 26 inches. The IQR = Q3 - Q1 = 37 - 17 = 20 inches.
Outliers
We have been noticing some values that appear to be outliers, but have not defined how far a value must be from the rest of the data to be considered an outlier. The common outlier test uses the IQR. This test, often called the 1.5*(IQR) Criterion, says that any value that is more than one and one-half times the width of the IQR box away from the box is an outlier. That is, a value is an outlier if it is less than Q1 - 1.5*(IQR) or greater than Q3 + 1.5*(IQR).
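The criterion can be sketched in a few lines of Python. The function names, quartiles, and data values below are hypothetical and serve only to show how the two cutoffs are computed and applied.

```python
def outlier_cutoffs(q1, q3):
    """Return the (low, high) cutoffs from the 1.5*(IQR) Criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall below the low cutoff or above the high one."""
    low, high = outlier_cutoffs(q1, q3)
    return [v for v in values if v < low or v > high]


# Hypothetical quartiles and data, for illustration only.
print(outlier_cutoffs(10, 20))
print(find_outliers([3, 12, 18, 40, -6], q1=10, q3=20))
```

With these made-up quartiles the IQR is 10, so the cutoffs are -5 and 35, and only the values -6 and 40 are flagged.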
Example 4
Test the sodium in the McDonald's® sandwiches for outliers. The data can be found in Section 5.5 Exercises, problem #1. Use the 1.5*(IQR) Criterion. Show your steps.
Solution
Calculate the five number summary for the Amount of Sodium (in mg)
First find the IQR: IQR = Q3 - Q1
Test for low outliers: Q1 - 1.5*(IQR) = 160 mg
Test for high outliers: Q3 + 1.5*(IQR) = 1960 mg
Check the data to see if we have any outliers:
We have no sandwiches with less than 160 mg sodium, so we have no low outliers.
We have one value that is greater than this cutoff of 1960 mg. The Angus Bacon & Cheese burger has 2070 mg of sodium, so we have one high outlier.