
Normal Quantile Plot

What is it?

A normal quantile plot (also known as a quantile-quantile plot or QQ plot) is a graphical way of checking whether your data are normally distributed.

On one axis, you plot your data, sorted smallest to largest. On the other axis, you plot the values you would expect to see if your data were normally distributed. (You can get these by looking up the appropriate quantiles in the “normal table.”)
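To make the construction concrete, here is a minimal sketch in Python that uses only the standard library in place of a printed normal table. The function name `normal_qq_points`, the sample values, and the plotting positions (i - 0.5) / n are assumptions made for illustration. Other plotting-position conventions exist and give slightly different quantiles.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (theoretical, observed) pairs for a normal quantile plot."""
    observed = sorted(data)  # data sorted smallest to largest
    n = len(observed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed convention: the i-th smallest value is paired with the
    # standard normal quantile at probability (i - 0.5) / n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Made-up values, purely for illustration.
sample = [6.3, 4.1, 5.9, 7.4, 5.0, 6.8, 5.2]
points = normal_qq_points(sample)
for z, x in points:
    print(f"{z:6.2f}  {x:5.1f}")
```

Plotting the first column against the second gives the normal quantile plot. The theoretical quantiles are symmetric around zero, so the middle observation of an odd-sized sample is paired with 0.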

If the points lie along a straight line then what you see in your data is what you’d expect if the data were normal.  In other words, if your data are normally distributed you should see a nearly straight line.

Why should I care?

The normal quantile plot is not a regular plot like a histogram or a boxplot. It is a diagnostic plot. The job of a diagnostic plot is not really to help you visualize your data, but to help you check whether your data fit a particular model. In this case that model is a normal distribution.

If your data are normally distributed, you can use the normal distribution to say how likely it would be for a randomly chosen observation to land in any particular interval. Probably the most common use of the normal distribution is to justify the statement “There’s a 95% chance of being within 1.96 standard deviations of the mean.” If your data are normally distributed then that’s true. If not, then it might not be true.
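The 95% figure can be checked directly from the normal model. The sketch below uses Python's standard library. The helper name `normal_interval_probability` is hypothetical, introduced only for this illustration.

```python
from statistics import NormalDist

def normal_interval_probability(mu, sigma, low, high):
    """Probability that a Normal(mu, sigma) observation lands in [low, high]."""
    model = NormalDist(mu, sigma)
    return model.cdf(high) - model.cdf(low)

# Chance of landing within 1.96 standard deviations of the mean.
p_within = normal_interval_probability(0.0, 1.0, -1.96, 1.96)
print(round(p_within, 4))  # prints 0.95
```

The same function works for any interval and any mean and standard deviation, but the answer is only as trustworthy as the normal model itself, which is what the quantile plot helps you judge.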

The normal distribution is a model for your data, and no model is perfect. The nice thing about a normal quantile plot is that it does not make a yes/no decision about whether or not your data are normal. You look at the plot to get a sense of how reasonable the normal curve would be as an approximation to your data. If your application only uses the model as a rough rule of thumb then you can tolerate greater departures from normality than if you were using it for a precision engineering system, for example.

Examples

Our first example is the set of ages from the CEO compensation data set. The histogram of CEO ages looks roughly like a bell curve. The points on the QQ plot drift away from the line a little bit, but only at the ends and only by a year or two. If we were trying to model the age of the oldest or youngest CEO in a large population then the normal distribution probably wouldn’t be appropriate, but if we want to use the normal to model intervals within 2 standard deviations of the mean that would probably be okay (since the points closely track the line in that region). This is a good example of what “nearly but not exactly normal” looks like.

The next example is the age of the CEOs when they earned their undergraduate degrees.


We see two patterns here. One is discreteness. The median age upon receiving an undergraduate degree was 22 (as it is in the overall population). Some were 21, and a few were 20 or younger. The small number of potential age values shows up as clumpiness in the normal QQ plot. (A similar feature, but with many more clumps, appears on the CEO age plot above.) A second feature is that an appreciable fraction of CEOs were much older than 22 when they obtained their undergraduate degrees, and a few were in their 40s. This is older than one would expect based on the normal model. These data are said to have a “heavy right tail” or to be “right skewed.”
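Both patterns can be reproduced numerically. The sketch below uses made-up ages (not the actual CEO data) that imitate the description above: most values at 21 to 23 and a few much larger ones. It compares each sorted observation with the value a fitted normal model would predict at the same quantile. The plotting positions (i - 0.5) / n and all variable names are assumptions for illustration.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical degree ages, NOT the real CEO data.
ages = [20, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 24, 26, 30, 41]

observed = sorted(ages)
n = len(observed)
fitted = NormalDist(mean(observed), stdev(observed))  # normal model fitted to the sample

# Value the fitted normal model predicts for the i-th smallest observation.
expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Positive residual: the observation lies above the reference line.
residuals = [x - e for x, e in zip(observed, expected)]

distinct_values = len(set(observed))  # few distinct values means clumps on the plot
print(f"distinct values: {distinct_values} out of {n} observations")
print(f"largest observation:  {observed[-1]} vs expected {expected[-1]:.1f}")
print(f"smallest observation: {observed[0]} vs expected {expected[0]:.1f}")
print(f"median observation:   {observed[n // 2]} vs expected {expected[n // 2]:.1f}")
```

With these made-up values, the largest observation sits well above what the normal model predicts, and so does the smallest, while the middle of the data falls below the line. That curved shape, with both ends above the line, is what a heavy right tail looks like on a normal quantile plot.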

Even heavier skew can be seen in the distribution of CEO total compensation.


The CEO compensation numbers are dominated by a single large outlier (who made $202M in 1994).