
Boxplot

What is it?

A boxplot is a concise way of visualizing a data distribution. It uses five numbers to summarize “most” of a distribution, then plots individually any outliers that those numbers do not cover. The five numbers are:

  • The median, showing the value of a typical observation, represented as a line in the interior of the box.

  • The 25th and 75th percentiles, represented as the lower and upper endpoints of the box. The 25th percentile is a number such that 25% of the data are less than that number. Likewise, 75% of the data are less than the 75th percentile (so 25% are above it). These two numbers are chosen so that the box represents the spread in the “middle half” of the data. The distance from the 25th to the 75th percentile is known as the “interquartile range”, abbreviated IQR.

  • The ends of the two arms (often called whiskers). The left arm extends to the smallest observation that lies no more than 1.5 IQRs below the 25th percentile. Thus the arm can be no more than 1.5 times the length of the box, but it is usually shorter because it ends at an observed value. Likewise, the right arm extends to the largest observation that lies no more than 1.5 IQRs above the 75th percentile. Any observation beyond the arms is plotted individually as an outlier.
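
The five numbers can be computed directly from the data. The following is a minimal sketch in Python, not a definitive implementation: the function name boxplot_summary and the sample data are made up for illustration, and the "inclusive" quartile rule is just one of several conventions (statistical packages differ slightly in how they interpolate percentiles, so the box edges may not match a given plotting tool exactly).

```python
import statistics


def boxplot_summary(data, k=1.5):
    """Return the numbers a boxplot draws, plus the points shown as outliers.

    k is the arm limit in IQRs (1.5 in the description above).
    """
    xs = sorted(data)
    # Assumption: "inclusive" quartiles; other conventions give slightly
    # different box edges for small samples.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - k * iqr    # farthest the left arm may reach
    high_limit = q3 + k * iqr   # farthest the right arm may reach
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Each arm ends at an observed value, not at the limit itself.
        "left_arm": inside[0],
        "right_arm": inside[-1],
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }


# Made-up data: eight ordinary values and one extreme value.
summary = boxplot_summary([2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

In this made-up sample the box runs from 4 to 8, so the IQR is 4 and the right arm may reach at most 8 + 1.5 × 4 = 14. It stops at 9, the largest observation inside that limit, and 30 is plotted on its own as an outlier.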


Why do I care?

Boxplots are a clever way of visualizing the distribution of a set of numerical data using only one dimension. They’re also a good way of identifying individual outlying points.

A boxplot can identify skewness in the data, as well as fat tails and individual outliers.

Using only one dimension gives you two benefits.

  1. First, you can use boxplots to compare distributions across a large number of groups by stacking boxplots on top of one another. It is much easier to look at 20 stacked box plots than 20 histograms.

  2. Second, because the width of the boxes (their extent across the data axis, not along it) does not mean anything, we’re free to make it mean something useful. In the stacked boxplot, the width of the boxes is proportional to the size of the category.
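
As a rough sketch of the second point, the snippet below computes a box for each group and a width that scales with the group's size. The function name group_boxes, the max_width parameter, and the data are all hypothetical, and the widths are scaled so that the largest group gets max_width. Some plotting tools use a different scaling (for example, proportional to the square root of the group size), so treat the proportional rule here as an assumption taken from the description above.

```python
import statistics


def group_boxes(groups, max_width=0.8):
    """Map each group name to its quartiles and a width proportional to its size."""
    largest = max(len(values) for values in groups.values())
    boxes = {}
    for name, values in groups.items():
        # Assumption: "inclusive" quartiles, one of several conventions.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        boxes[name] = {
            "n": len(values),
            "q1": q1,
            "median": median,
            "q3": q3,
            # The largest group gets max_width; smaller groups get thinner boxes.
            "width": max_width * len(values) / largest,
        }
    return boxes


# Made-up groups purely for illustration.
example_groups = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [4, 5, 6],
}
boxes = group_boxes(example_groups)
for name, box in boxes.items():
    print(name, box)
```

Group B has one third as many observations as group A, so its box is drawn one third as wide, which makes the difference in group sizes visible at a glance.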

Example

The boxplot below shows the distribution of log10 total compensation for the 800 most highly paid CEOs in 1994, by industry. You can see that Aerospace/Defense CEOs were generally highly paid, and Utility CEOs were generally not compensated at the same levels as other CEOs on the list. Finance CEOs did about as well as many others on the list, but there were many more of them.