Outliers
Outliers in a data set can strongly distort the results of most statistical analyses. What outliers actually are is not always a straightforward matter, however. A much-cited definition from Hawkins (1980) reads:
An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
To paraphrase, outliers may be considered to represent phenomena or statistical populations that are distinct from those under study (Figure 1). This page lists a few prominent approaches used to detect and contend with outliers. Note, however, that an entire body of research is dedicated to detecting outliers in different scenarios and the impacts of the various methods of correction.
Figure 1: Schematic of a Gaussian (normal) distribution with an obvious outlier. The chance that the "mechanism" which generated the bulk of the data following a Gaussian distribution also generated the outlier is very low. In fact, if the outlier was taken as the mean of an alternate mechanism with similar properties to the first, its associated distribution would only have a small region of overlap with the first.
Identifying outliers
In experimental or sampling contexts, outlier identification and removal is often based on knowledge of the procedures and the actual measurement events rather than the statistical effects and properties of a given outlier. For example, knowing that a particular sampling tube was contaminated or that a filter was clogged by a large mass of algae provides clear motivation to remove particular data points or entire samples from an analysis. Even so, outliers with no such known cause are still likely to occur, and several methods exist to identify them.
Graphical data screening
A standard step in an analytical workflow is the screening of data. This involves graphically inspecting the distributions of response and explanatory variables either individually (using e.g. boxplots or quantile-quantile plots) or in pairs (using e.g. scatter plots). While requiring more time, graphical inspection is one of the best techniques to detect and investigate potential outliers and should be preferred to automated techniques. In situations where this is not feasible (e.g. where hundreds or thousands of variables are involved), simple outlier detection rules may be applied to each variable (univariate) or set of variables (bi- or multivariate) to guide subsequent graphical inspection. Programmatic platforms such as R allow users to write custom automated routines.
Outlier labeling rule
The outlier labeling rule is an automated way to detect outliers in normally distributed data. It takes the difference between the third and first quartiles of the distribution (the interquartile range) and multiplies it by a tuning parameter, g. The resulting value is added to the third quartile and subtracted from the first quartile to define the boundaries of the "true" distribution. Any values outside these boundaries are considered outliers. The value of g was originally set at 1.5; however, subsequent work suggested that g = 2.2 is more accurate (Hoaglin et al., 1986; Hoaglin & Iglewicz, 1987).
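The rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name label_outliers, the sample values, and the choice of the "inclusive" quartile method are assumptions made for the example, and other quartile conventions will shift the boundaries slightly.

```python
from statistics import quantiles

def label_outliers(values, g=2.2):
    """Apply the outlier labeling rule; return (lower, upper, flagged)."""
    # Quartiles via the standard library; "inclusive" is one of several
    # quartile conventions and is an assumption of this sketch.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = g * (q3 - q1)          # g times the interquartile range
    lower, upper = q1 - spread, q3 + spread
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical measurements with one suspiciously large value.
sample = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.6]
lower, upper, flagged = label_outliers(sample)
print(lower, upper, flagged)
```

Passing g=1.5 reproduces the original, less conservative boundaries.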
Multivariate outliers
In contrast to the univariate case, multivariate outliers emerge when evaluating the joint distributions of multiple variables simultaneously. A multivariate outlier need not be an outlier when considering each variable independently. To illustrate, a microbial community may not have unusual abundances of a virus species or its host bacterial species relative to other communities when these abundances are evaluated independently; however, the two may co-occur in an unusual way.
Detecting multivariate outliers is a non-trivial task that becomes increasingly difficult as the number of dimensions grows, and it often involves the use of specific algorithms. Values which lie outside a multidimensional 'cloud' of data (i.e. your objects plotted in a multidimensional space) are often candidates for outlier status; however, defining the shape and limits of the multivariate distribution under study is often challenging. This is especially true if there are many outliers, which may form statistical subpopulations (Rocke and Woodruff, 1996). Further, clusters of outliers may in fact describe an alternative, but important, generating mechanism (sensu Hawkins, 1980) and should be treated with care.
Common approaches to detect multivariate outliers include:
the computation of Mahalanobis distances (Mahalanobis, 1936) between objects and the centre of an elliptical multivariate distribution. Objects whose distances exceed a critical value (i.e. are too far away from the main 'cloud' of data) are labelled as outliers. Critical values are determined by the number of degrees of freedom (i.e. the number of variables) in the data set and the chosen significance level. Note, however, that critical values contain a large degree of arbitrariness and may perform poorly if the data have complex distance structure. An alternative, useful for data that are approximately multivariate normal, is to compare the squared distances to a theoretical χ² (chi-squared) distribution with degrees of freedom equal to the number of variables in the analysis. A quantile-quantile plot may be used to identify objects that deviate strongly from a χ² distribution.
the use of Stahel-Donoho estimators (Yohai and Maronna, 2013; Zuo et al., 2004). These can tolerate high levels of outlier contamination in a data set (~50%) and function by penalising objects if they are found to be univariate outliers in a series of univariate projections of the original multivariate data.
methods based on robust principal components analysis, such as PCout (Filzmoser et al., 2008), which are optimised for large data sets with thousands of variables and are readily available (see Implementations).
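The first of these approaches can be illustrated for two variables using only the Python standard library. The sketch below is an assumption-laden example, not a recommended procedure: the function names and data are hypothetical, the centre and covariance are the classical (non-robust) estimates and are therefore themselves influenced by outliers, and the closed-form critical value is valid only for two degrees of freedom.

```python
import math
from statistics import mean

def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centre."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Classical sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form using the explicit inverse of the 2 x 2 matrix.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

def chi2_critical_2df(alpha):
    """Upper-tail chi-squared critical value; closed form valid for 2 df only."""
    return -2.0 * math.log(alpha)

# Hypothetical data: ten points near the line y = x plus one point, (3, 8),
# that is unremarkable in x and in y separately but unusual jointly.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.2),
          (6.0, 5.9), (7.0, 7.1), (8.0, 7.8), (9.0, 9.2), (10.0, 9.9),
          (3.0, 8.0)]
d2 = squared_mahalanobis_2d(points)
cutoff = chi2_critical_2df(0.025)
flagged = [p for p, d in zip(points, d2) if d > cutoff]
print(flagged)
```

With more than two variables, or when outliers may distort the centre and covariance, a dedicated routine with robust estimates is the safer choice.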
Dealing with outliers
Transformations
In some circumstances, the effect of outliers may be reduced or negated by applying transformations to the data set. This is one of the least 'invasive' ways of addressing outliers and should be thoroughly explored.
Trimming
Trimming is the simple removal of outliers from the bulk of the data. Simple entry errors, technical artifacts, clear sampling bias or errors, or impossible values of a given variable or variable set are straightforward justifications for deleting outliers. If no such justification is present, however, researchers may consider performing one set of analyses with the outliers included and another with them removed. Both sets of results should be included in the final report and their implications discussed. Because trimming removes objects from the data set, it also reduces statistical power.
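Running an analysis with and without a suspected outlier can be as simple as the following sketch. The data and the choice of which value to trim are hypothetical, and the mean and standard deviation stand in for whatever analysis is actually being performed.

```python
from statistics import mean, stdev

# Hypothetical measurements; 9.6 is the value suspected to be an outlier.
values = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.6]
suspected = {9.6}

trimmed = [v for v in values if v not in suspected]

# Report both versions side by side so readers can judge the influence
# of the trimmed value on the result.
with_mean, with_sd = mean(values), stdev(values)
without_mean, without_sd = mean(trimmed), stdev(trimmed)
print(f"with outlier:    mean={with_mean:.3f} sd={with_sd:.3f} n={len(values)}")
print(f"without outlier: mean={without_mean:.3f} sd={without_sd:.3f} n={len(trimmed)}")
```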
Winsorising
Winsorising is the act of changing the value of an outlier to the next highest or lowest value in the data set that is not an outlier. The main objective of Winsorising is to make the data set conform to some statistical procedure's assumptions. Typically, this should only be considered when a small number of outliers exist (at most ~ 1 - 2% of the data). The amount of data that has been Winsorised should be clearly reported, e.g. "A total of 3% of our data was Winsorised to allow parametric analysis". Arguably, Winsorising is also suitable to make a "true" (i.e. all values reflect the phenomenon under study) heavy-tailed distribution suitable for analysis; however, interpretation will be more challenging and conclusions must be carefully qualified. If more than ~ 5% of the data is Winsorised, parametric analyses and hypothesis testing approaches are not likely to yield reliable results. In that case, non-parametric approaches and resampling methods should be considered.
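A minimal sketch of Winsorising as described above follows. It assumes that the boundaries separating outliers from non-outliers have already been chosen by some other rule; the function name winsorise, the boundaries, and the data are hypothetical.

```python
def winsorise(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest non-outlying value.

    Returns the modified list and the fraction of values that were changed,
    so that the amount of Winsorised data can be reported.
    """
    kept = [v for v in values if lower <= v <= upper]
    if not kept:
        raise ValueError("no values fall inside the given boundaries")
    lowest, highest = min(kept), max(kept)
    # Clamp each value to the range spanned by the non-outlying data.
    result = [min(max(v, lowest), highest) for v in values]
    changed = sum(1 for old, new in zip(values, result) if old != new)
    return result, changed / len(values)

# Hypothetical data and boundaries (e.g. from an outlier labeling rule).
values = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.6]
result, fraction = winsorise(values, lower=1.18, upper=4.42)
print(result, f"{fraction:.1%} Winsorised")
```

Returning the changed fraction makes it easy to check the result against the ~ 1 - 2% and ~ 5% guidelines given above.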
Further warnings
Unlike the univariate case, multivariate normal data usually involves more than one generating mechanism of comparable importance. Thus, accumulations of apparent outliers may represent a (potentially important) process and should be treated cautiously.
One commonly applied, but highly dubious, rule to detect outliers in normally distributed data is to label any data point that lies more than two standard deviations above or below the mean. Approximately 5% of normally distributed data is expected to lie more than 2 standard deviations from the mean, so this rule will routinely flag values that belong to the distribution under study. Avoid this approach.
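The expected proportion can be checked directly from the standard normal distribution. The short sketch below uses Python's standard library; the sample size of 1000 is a hypothetical figure chosen only to make the consequence concrete.

```python
from statistics import NormalDist

# Probability that a normally distributed value lies more than two
# standard deviations from the mean, counting both tails.
tail_fraction = 2 * (1 - NormalDist().cdf(2))

# Expected number of perfectly legitimate points flagged in a
# hypothetical sample of 1000 observations.
expected_flagged = 1000 * tail_fraction
print(f"{tail_fraction:.4f} of the data, about {expected_flagged:.0f} points per 1000")
```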
If outliers constitute greater than 5% of a data set, the validity of the entire data set should be reevaluated. While ecological data can be very 'noisy', it is also possible that the experimental or sampling procedure employed has measured more than the phenomena under study.
Implementations
R
Objects generated by the function boxplot() contain information about potential outliers. Given a boxplot object, "mydata.boxplot", outlier information may be accessed by "mydata.boxplot$out". Note that outliers are defined here as points that lie further from the box than the interquartile range multiplied by a range parameter (default = 1.5). Changing the range parameter changes how far the whiskers of a boxplot can extend.
The package outliers has multiple routines for detecting a variety of outliers.
The package mvoutlier has routines for detecting multivariate outliers, based on Mahalanobis distances, adjusted quantiles, or robust principal components analysis.
The donostah() function in the robust package can be used to produce Stahel-Donoho estimates of multivariate location and scale.
The CovNASde() function in the rrcovNA package can be used to produce Stahel-Donoho estimates of multivariate location and scatter for incomplete data.
MASAME outlier detection app
References
Filzmoser P, Maronna R, Werner M (2008) Outlier identification in high dimensions. Comput Stat Data Anal. 52(3): 1694–1711.
Hawkins D (1980) Identification of Outliers. London: Chapman and Hall. ISBN 0 412 21900.
Hoaglin DC, Iglewicz B, Tukey JW (1986) Performance of some resistant rules for outlier labeling. J Am Stat Assoc. 81: 991-999.
Hoaglin DC, Iglewicz B (1987) Fine tuning some resistant rules for outlier labeling. J Am Stat Assoc. 82: 1147-1149.
Mahalanobis PC (1936) On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India. 2: 49-55.
Rocke DM, Woodruff DL (1996) Identification of Outliers in Multivariate Data. J Am Stat Assoc. 91(435): 1047–1061.
Yohai VJ, Maronna RA (2013) The Behavior of the Stahel-Donoho Robust Multivariate Estimator. J Am Stat Assoc. 90(429): 330–341.
Zuo Y, Cui H, He X (2004) On the Stahel-Donoho estimator and depth-weighted means of multivariate data. Ann Stat. 32(1): 167–188.