Outliers

Outliers in a data set can strongly distort the results of most statistical analyses. What outliers actually are is not always a straightforward matter, however. A much-cited definition from Hawkins (1980) reads:

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

To paraphrase, outliers may be considered to represent phenomena or statistical populations that are distinct from those under study (Figure 1). This page lists a few prominent approaches used to detect and contend with outliers. Note, however, that an entire body of research is dedicated to detecting outliers in different scenarios and the impacts of the various methods of correction.

Figure 1: Schematic of a Gaussian (normal) distribution with an obvious outlier. The chance that the "mechanism" which generated the bulk of the data, which follow a Gaussian distribution, also generated the outlier is very low. In fact, if the outlier were taken as the mean of an alternate mechanism with similar properties to the first, its associated distribution would have only a small region of overlap with the first.

Identifying outliers

In experimental or sampling contexts, outlier identification and removal is often based on knowledge of the procedures and the actual measurement events rather than the statistical effects and properties of a given outlier. For example, knowing that a particular sampling tube was contaminated or that a filter was clogged by a large mass of algae provides clear motivation to remove particular data points or entire samples from an analysis. Even with such knowledge, unexplained outliers are still likely to occur, and several methods exist to identify them.

Graphical data screening

A standard step in an analytical workflow is the screening of data. This involves graphically inspecting the distributions of response and explanatory variables either individually (using e.g. boxplots or quantile-quantile plots) or in pairs (using e.g. scatter plots). While requiring more time, graphical inspection is one of the best techniques to detect and investigate potential outliers and should be preferred to automated techniques. In situations where this is not feasible (e.g. where hundreds or thousands of variables are involved), simple outlier detection rules may be applied to each variable (univariate) or set of variables (bi- or multivariate) to guide subsequent graphical inspection. Programmatic platforms such as R allow users to write custom automated routines.

Outlier labeling rule

The outlier labeling rule is an automated way to detect outliers in normally distributed data. It takes the difference between the third and first quartiles of the distribution (the interquartile range) and multiplies it by a tuning parameter, g. The resulting value is added to the third quartile and subtracted from the first quartile to define the boundaries of the "true" distribution; any values outside these boundaries are considered outliers. The value of g was originally set at 1.5; however, subsequent work suggested that g = 2.2 is more accurate (Hoaglin et al., 1986; Hoaglin & Iglewicz, 1987).
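The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name label_outliers, the example measurements, and the choice of the "inclusive" quartile method are assumptions made here, and other quartile definitions will give slightly different boundaries.

```python
import statistics

def label_outliers(values, g=2.2):
    """Apply the outlier labeling rule to a list of numbers.

    Returns the lower boundary, the upper boundary, and the values
    that fall outside them. The quartile method is an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = g * (q3 - q1)              # g times the interquartile range
    lower, upper = q1 - spread, q3 + spread
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical measurements, invented for illustration only.
measurements = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 4.2, 9.6]

for g in (1.5, 2.2):
    lower, upper, flagged = label_outliers(measurements, g=g)
    print(f"g = {g}: boundaries ({lower:.2f}, {upper:.2f}), flagged {flagged}")
```

With these invented values, the smaller g flags more observations than g = 2.2, which shows how strongly the choice of g controls what is labeled an outlier.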

Multivariate outliers

In contrast to the univariate case, multivariate outliers emerge when evaluating the joint distributions of multiple variables simultaneously. A multivariate outlier need not be an outlier when considering each variable independently. To illustrate, a microbial community may not have unusual abundances of a virus species or its host bacterial species relative to other communities when these abundances are evaluated independently; it may nonetheless show an unusual pattern of co-occurrence.

Detecting multivariate outliers is a non-trivial task that becomes increasingly difficult as dimensionality grows, and it often involves the use of specific algorithms. Values which lie outside a multidimensional 'cloud' of data (i.e. your objects plotted in a multidimensional space) are often candidates for outlier status; however, defining the shape and limits of the multivariate distribution under study is often challenging. This is especially true if there are many outliers, which may form statistical subpopulations (Rocke and Woodruff, 1996). Further, clusters of outliers may in fact be describing an alternative, but important, generating mechanism (sensu Hawkins, 1980) and should be treated with care.
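As a rough illustration of an object lying outside the data 'cloud' without being extreme on any single variable, the sketch below computes the Mahalanobis distance of each object from the centre of a two-variable data set. The choice of Mahalanobis distance, the function name, and the host and virus abundances are assumptions made for this example. Note that the classical mean and covariance used here are themselves influenced by outliers, which is one reason robust estimators are often preferred in practice.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean,
    using the classical (non-robust) sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the 2x2 covariance
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse covariance written out explicitly.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical (host abundance, virus abundance) pairs for nine communities.
# The last community has unremarkable values on each axis, but an unusual
# combination of low host and high virus abundance.
communities = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
               (6, 6.0), (7, 6.8), (8, 8.1), (2, 7.5)]

distances = mahalanobis_2d(communities)
for point, d in zip(communities, distances):
    print(point, round(d, 2))
```

In this invented data set the last community receives the largest distance even though neither of its abundances is the minimum or maximum of its variable.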

Common approaches to detect multivariate outliers include:

Dealing with outliers

Transformations

In some circumstances, the effect of outliers may be reduced or negated by applying transformations to the data set. This is one of the least 'invasive' ways of addressing outliers and should be thoroughly explored.
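A minimal sketch of this idea, assuming a right-skewed variable and a log10 transformation (the text above does not prescribe a particular transformation): the code measures how far the largest value lies from the median, in units of the interquartile range, before and after transforming. The helper name iqr_distance and the data are invented for illustration.

```python
import math
import statistics

def iqr_distance(values, x):
    """Distance of x from the median, expressed in interquartile ranges."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return abs(x - median) / (q3 - q1)

# Hypothetical right-skewed counts; all values must be positive for log10.
raw = [1, 2, 2, 3, 4, 5, 8, 12, 150]
logged = [math.log10(v) for v in raw]

before = iqr_distance(raw, max(raw))
after = iqr_distance(logged, max(logged))
print(f"largest value: {before:.1f} IQRs from the median before, "
      f"{after:.1f} after log10 transformation")
```

The transformation does not remove the observation; it only changes the scale on which its distance from the rest of the data is judged, so results should be interpreted on the transformed scale.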

Trimming 

Trimming is the simple removal of outliers from the bulk of the data. Simple entry errors, technical artifacts, clear sampling bias or errors, or impossible values of a given variable or variable set are straightforward justifications for deleting outliers. If no such justification is present, however, researchers may consider performing one set of analyses with the outliers included and another with them removed. Both sets of results should be included in the final report and their implications discussed. Because trimming removes objects from the data set, it also reduces statistical power.
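The practice of reporting results with and without the outliers can be sketched as below. The boundaries passed to trim are hypothetical and assumed to come from whatever detection step was used beforehand; the summary statistics stand in for the actual analysis.

```python
import statistics

def trim(values, lower, upper):
    """Keep only the values inside the closed interval [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def summarise(values):
    """A stand-in for the real analysis: sample size, mean, and SD."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

# Hypothetical measurements and hypothetical outlier boundaries.
measurements = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 4.2, 9.6]
trimmed = trim(measurements, lower=1.18, upper=4.42)

full_summary = summarise(measurements)
trimmed_summary = summarise(trimmed)
print("with outliers:   ", full_summary)
print("without outliers:", trimmed_summary)
```

Printing the sample size alongside each summary makes the loss of objects, and hence of power, visible in the report.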

Winsorising

Winsorising is the act of changing the value of an outlier to the next highest or lowest value in the data set that is not an outlier. The main objective of Winsorising is to make the data set conform to some statistical procedure's assumptions. Typically, this should only be considered when a small number of outliers exist (at most ~1-2% of the data). The amount of data that has been Winsorised should be clearly reported, e.g. "A total of 3% of our data was Winsorised to allow parametric analysis". Arguably, Winsorising can also make a "true" heavy-tailed distribution (i.e. one in which all values reflect the phenomenon under study) suitable for analysis; however, interpretation will be more challenging and conclusions must be carefully qualified. If more than ~5% of the data is Winsorised, parametric analyses and hypothesis testing approaches are not likely to yield reliable results. In that case, non-parametric approaches and resampling methods should be considered.
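The replacement step described above can be sketched as follows. The function name winsorise, the data, and the boundaries are assumptions for illustration; the boundaries are taken to come from a prior outlier detection step. The function also returns the fraction of values changed so that it can be reported.

```python
def winsorise(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest value
    in the data set that lies inside those boundaries.

    Returns the Winsorised list and the fraction of values changed.
    """
    kept = [v for v in values if lower <= v <= upper]
    if not kept:
        raise ValueError("no values lie inside the given boundaries")
    lowest, highest = min(kept), max(kept)
    # Clamp each value to the range spanned by the non-outlying values.
    result = [min(max(v, lowest), highest) for v in values]
    changed = sum(1 for old, new in zip(values, result) if old != new)
    return result, changed / len(values)

# Hypothetical measurements and hypothetical outlier boundaries.
measurements = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 4.2, 9.6]
winsorised, fraction = winsorise(measurements, lower=1.18, upper=4.42)
print(winsorised)
print(f"{fraction:.1%} of the data was Winsorised")
```

In this small invented data set a single changed value already exceeds the ~5% threshold mentioned above, which illustrates why Winsorising is better suited to larger data sets with few outliers.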

Further warnings

Implementations

MASAME outlier detection app


Warnings