
Distribution Plot

What is it

The distribution plot overlays several methods of visualizing the distribution of a numeric variable. These include a histogram, a boxplot, and a density plot. The following distribution plot was obtained from the base-10 logarithm (log10) of each CEO's total compensation in the CEO salary (forbes94) data.

By default, the distribution plot overlays a histogram, a boxplot, and a density plot.

Histogram

A histogram is a bar chart showing the frequency of observations in a connected sequence of intervals (bins). It is the most commonly used method of describing the “shape” of the distribution of numeric data. The picture you get from a histogram can depend on the number of bins it uses.
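
To see how the number of bins can change the picture, here is a minimal Python sketch. The function name `histogram_counts` and the data are made up for illustration; they are not part of the software or the CEO data set, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width intervals spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The largest value belongs in the last bin, not one past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up data with two clusters, one near 1 and one near 4.
data = [0.9, 1.0, 1.1, 1.2, 1.3, 3.8, 3.9, 4.0, 4.1, 4.2]

coarse = histogram_counts(data, 2)  # two wide bins
fine = histogram_counts(data, 6)    # six narrow bins

print("2 bins:", coarse)
print("6 bins:", fine)
```

With two bins the bars are equally tall and the data look evenly spread. With six bins the empty middle bins reveal the gap between the two clusters.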

Density plot

A density plot (or kernel density estimate) is a smoother version of a histogram. 

It is made by imagining a little “normal distribution” curve above each data point. To find the height of the density curve, add up the height of the normal curve associated with each data point.  The estimated density function you get from a kernel density estimate can depend on the width of the little normal curves placed around each data point (known as the “bandwidth” of the density estimate).
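
The construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software's actual implementation: the names `normal_pdf` and `kde` and the data are hypothetical, and the sum of curve heights is divided by the number of data points so that the area under the density curve is 1 (a normalization the description above does not spell out).

```python
import math


def normal_pdf(x, mean, sd):
    """Height at x of a normal curve with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def kde(x, data, bandwidth):
    """Kernel density estimate at x: one little normal curve per data point.

    Each curve is centered on a data point and has standard deviation
    equal to the bandwidth. The heights are summed, then divided by the
    number of points so the total area under the estimate is 1.
    """
    return sum(normal_pdf(x, d, bandwidth) for d in data) / len(data)


# Made-up data with a cluster near 1-2 and another near 4.
data = [1.0, 1.2, 1.9, 4.0, 4.3]

for bandwidth in (0.3, 2.0):
    heights = [round(kde(x, data, bandwidth), 3) for x in (1.1, 3.0, 4.1)]
    print("bandwidth", bandwidth, "-> heights at x = 1.1, 3.0, 4.1:", heights)
```

With the narrow bandwidth the estimate dips sharply between the two clusters. With the wide bandwidth the little curves overlap so much that the dip at x = 3 is smoothed away.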

Boxplot

A boxplot uses five numbers to summarize “most” of a distribution, and then plots any outliers that those numbers do not cover. The five numbers are:

  • The median, showing the value of a typical observation, represented as a line in the interior of the box.

  • The 25th and 75th percentiles, represented as the lower and upper endpoints of the box. The 25th percentile is a number such that 25% of the data are less than that number. Likewise, 75% of the data are less than the 75th percentile (so 25% are above it). These two numbers are chosen so the box represents the spread in the “middle half” of the data. The distance from the 25th to the 75th percentile is known as the “interquartile range,” abbreviated IQR.

  • The ends of the two arms (whiskers). The left arm extends to the smallest observation that is no more than 1.5 interquartile ranges below the 25th percentile. Thus the arm can be no more than 1.5 times the width of the box, but it is usually shorter because it ends at an observed value. Likewise, the right arm extends to the largest observation that is no more than 1.5 IQRs above the 75th percentile.
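
The five numbers and the outliers can be computed with a short Python sketch. The function name `boxplot_summary` and the data are hypothetical. Statistical packages use slightly different conventions for computing percentiles; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the software exactly.

```python
import statistics


def boxplot_summary(values):
    """Return the five boxplot numbers plus any outliers beyond the arms."""
    data = sorted(values)
    # Quartiles: 25th percentile, median, 75th percentile (assumed convention).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr   # an arm may not reach below this value
    high_limit = q3 + 1.5 * iqr  # an arm may not reach above this value
    inside = [v for v in data if low_limit <= v <= high_limit]
    outliers = [v for v in data if v < low_limit or v > high_limit]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "arm_low": inside[0],    # smallest observation within the limit
        "arm_high": inside[-1],  # largest observation within the limit
        "outliers": outliers,
    }


# Made-up data with one unusually large value.
summary = boxplot_summary([2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the box runs from 4 to 8, so the IQR is 4 and the arms may reach at most 6 units beyond the box. The right arm stops at 9, the largest observation within that limit, and 30 is plotted separately as an outlier.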

Why should I care?

Each element of a distribution plot gives you a way to visualize the distribution of your data. 

Numerical summaries of your data (like means and standard deviations) are useful, but they can’t tell you things like whether there are discrete clusters of “special” values, whether your data are skewed left or right, or how “fat” the tails of your distribution are. (Actually, there are numerical measurements of skewness and kurtosis (tail fatness), but interpreting them takes some skill.)
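
As a rough illustration, the Python sketch below takes two made-up samples, rescales each to have mean 0 and standard deviation 1, and then computes skewness and excess kurtosis. The function names are hypothetical, and the formulas are the simple moment-based versions (average cubed and fourth-power standardized values); some packages apply small-sample corrections that give slightly different numbers.

```python
import statistics


def standardize(values):
    """Shift and scale values to have mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


def skewness(values):
    """Average cubed standardized value: positive means a long right tail."""
    z = standardize(values)
    return sum(t ** 3 for t in z) / len(z)


def excess_kurtosis(values):
    """Average fourth-power standardized value minus 3 (the normal curve's value)."""
    z = standardize(values)
    return sum(t ** 4 for t in z) / len(z) - 3.0


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]

for name, sample in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    z = standardize(sample)
    print(name,
          "mean:", round(statistics.fmean(z), 6),
          "sd:", round(statistics.pstdev(z), 6),
          "skewness:", round(skewness(sample), 3),
          "excess kurtosis:", round(excess_kurtosis(sample), 3))
```

After rescaling, both samples have the same mean and standard deviation, yet their skewness values differ sharply. The numbers flag the difference, but a plot shows it at a glance.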

Examples

All four of the following data sets could be shifted and scaled to have the same mean and standard deviation, but they tell very different stories.

[Figure: four distribution plots, one per panel, described below]
  1. The top-left panel shows the distribution of CEO total compensation with the 20 most highly paid CEOs removed. (You can do this by sorting the CEO data by Total Comp and creating a new column in the spreadsheet. Call it NotTop20. Make the first 20 entries FALSE and the rest TRUE. Then add NotTop20 as a Filter.)
  2. The top-right panel shows the distribution of daily returns from the S&P 500. Whereas CEO compensation is skewed to the right (the long tail trails off to the right side), the distribution of stock market returns has "fat tails" in both directions.
  3. The bottom-left panel shows the distribution of CEOs' ages when they received their undergraduate degrees. The numbers are rounded to the nearest year, and most CEOs received their undergraduate degrees at 21 or 22. These are numeric data, but they're discrete. There are only about 20 unique values.
  4. The bottom-right panel shows the distribution of the impact of disease among different countries of the world. The unit is DALYs (disability-adjusted life years) lost to disease. According to the WHO, "One DALY can be thought of as one lost year of 'healthy' life." This distribution has an extra "hump" on the right side corresponding to countries with underdeveloped medical capabilities.
You can find numerical summaries that are attuned to each of these issues (skewness, fat tails (kurtosis), discreteness, and multi-modality), but none of these numbers is as effective as directly visualizing the data distribution.