Outliers

Outliers are data points that may not be part of the population. An outlier may come from the extreme end of the population’s tail (out there at 5, 6, 7 or more standard deviations), from lab error, or from the sample itself. For example, consider this scenario: a surface soil sample is collected that happens to contain lead shot from a rabbit hunter or a battery post (I saw this once). If this sample is from the background area, the background data set will have a very high lead result. Graphical presentation of the data is a good way to spot possible outliers.

Below is a box-and-whisker plot of rainfall pH data for Mammoth Cave National Park. Data points that fall above or below the whiskers are possible outliers. Note that this box plot was made in ProUCL, a free program from the EPA, which places each whisker at the data point closest to, but not beyond, a fence: the 75th percentile plus 1.5 times the interquartile range (IQR) for the upper whisker, and the 25th percentile minus 1.5 times the IQR for the lower whisker. There is also a time series plot of the data. In a time series plot, look for points that seem to be beyond the range of the rest of the data.

Suspected outliers can then be formally tested with the Dixon or Rosner test, but both assume the underlying data are from a normal distribution. Some regulators like you to use the Sprent Outlier Test, as described in Peter Sprent and Nigel C. Smeeton (2007), Applied Nonparametric Statistical Methods, Fourth Edition. In the example data set, the Rosner test found two outliers at the 5% significance level (6.13 and 5.93). Only 6.13 was an outlier at the 1% significance level. Box-and-whisker plots and the Dixon and Rosner tests can all be run in ProUCL. The Sprent Outlier Test isn’t hard to calculate; I set it up in an Excel sheet (attached below).
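For readers who would rather script these screens than use ProUCL or a spreadsheet, here is a minimal Python sketch of the 1.5 × IQR whisker rule and of the Sprent test. Several things in it are assumptions and not taken from ProUCL or from my Excel sheet: the function names `whisker_ends` and `sprent_flags` are made up for illustration, the quartiles use Python's "inclusive" method (ProUCL may compute percentiles a little differently), and the Sprent rule is written in its commonly cited form, which flags a value when its distance from the median is more than 5 times the median absolute deviation (MAD). The pH values are invented and are not the Mammoth Cave data.

```python
import statistics


def whisker_ends(values, k=1.5):
    """Return (lower_whisker, upper_whisker, flagged) for the k * IQR rule.

    Each whisker sits at the most extreme data point that is still inside
    its fence; points beyond the fences are returned as possible outliers.
    """
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other software may differ slightly.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    flagged = [x for x in data if x < lower_fence or x > upper_fence]
    return inside[0], inside[-1], flagged


def sprent_flags(values, threshold=5.0):
    """Flag values whose distance from the median exceeds threshold * MAD.

    Assumption: this is the commonly cited form of the Sprent rule.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; the rule cannot be applied")
    return [x for x in values if abs(x - med) / mad > threshold]


# Invented rainfall pH values, for illustration only.
ph = [4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.2, 6.5]

low, high, beyond_whiskers = whisker_ends(ph)
print("whiskers:", low, high)
print("beyond whiskers:", beyond_whiskers)
print("Sprent flags:", sprent_flags(ph))
```

Neither screen assumes a normal distribution, which is the appeal of the Sprent test over Dixon or Rosner. A flagged value is still only a candidate outlier, not proof of one.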

Outliers should only be removed from the data set if:

• The remaining data fits a model

• All parties involved in the data agree

So what does this mean in the real world? If you have “real outliers” in your data and you do not remove them, they will tend to push up the action limits and prevent you from detecting releases. If you remove data that you think are outliers but that are actually part of the population (like my 6’2” girlfriend), you will lower the action limits and increase false positives.
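A quick numerical sketch shows both effects. It assumes a simple action limit of the background mean plus 2 standard deviations, which is only a stand-in for whatever limit your permit or statistical plan actually specifies. The function name `action_limit` and all of the lead values are made up for illustration.

```python
import statistics


def action_limit(background, k=2.0):
    """Hypothetical action limit: mean + k * sample standard deviation."""
    return statistics.mean(background) + k * statistics.stdev(background)


# Invented background lead results (mg/kg), for illustration only.
clean = [12, 15, 14, 13, 16, 15, 14, 13]

# Case 1: a real outlier (say, a sample containing lead shot) is left in.
with_real_outlier = clean + [400]

# Case 2: a high but legitimate background value that someone might remove.
with_true_high = clean + [19]

limit_clean = action_limit(clean)
limit_contaminated = action_limit(with_real_outlier)
limit_true_high = action_limit(with_true_high)

release = 40  # hypothetical new result from a real release

print(f"limit, outlier left in:       {limit_contaminated:.1f}")
print(f"limit, outlier removed:       {limit_clean:.1f}")
print(f"release of {release} detected (outlier in)?  {release > limit_contaminated}")
print(f"release of {release} detected (outlier out)? {release > limit_clean}")
print(f"limit, true high value kept:    {limit_true_high:.1f}")
print(f"limit, true high value removed: {limit_clean:.1f}")
```

Leaving the 400 in inflates the limit so much that the release at 40 goes unnoticed. Removing the legitimate 19 pulls the limit down, so ordinary background results are more likely to trip it.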