Data transformations

The main idea...

Occasionally, the variables in a "raw" data set violate an assumption of a statistical procedure (e.g. normally distributed values) or cannot be compared to other variables due to differences in scale or variability. For example, principal components analysis (PCA) performs poorly unless variables are linearly related to one another and on roughly the same scale. Rather than abandoning an analysis due to inappropriate data structure, it may be possible to transform the variables so they satisfy the conditions in question. A transformation involves the application of a mathematical procedure to every value of a given variable or set of variables to create a new set of values. The new values of the transformed variables should still represent the data, but will be more amenable to analysis or comparison.

The sections below first describe some basic transformations and then discuss transformations specifically geared towards comparing variables. A set of ecologically motivated transformations, intended to allow Euclidean representation of ecological dissimilarities by methods such as PCA and redundancy analysis (RDA), is also summarised.

Before you begin transforming your data, ensure there is a defined and well-supported reason to do so. Common rationales include linearising, normalising, or standardising data in order to respect a method's assumptions.

Figure 1: Schematics illustrating a linear and square root transformation. a) A linear transformation where the variable "y" is transformed into " y' " through a translation "b" and an expansion "m" transformation. This can be expressed by the linear equation y' = my + b. This transformation may be used to place two or more linearly related variables on the same scale. In this illustration, both "b" and "m" are positive leading to a translation to the right and an expansion, respectively. b) A square root transformation. Larger values of a variable "y" are affected more strongly than smaller values. This transformation is useful when positive data shows a positive skew and a more Gaussian distribution is desired. Hollow circles indicate former positions of values along an axis. 
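The two transformations in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the example values are hypothetical and chosen for this sketch, not taken from the figure.

```python
import math


def linear_transform(values, m, b):
    """Apply y' = m*y + b to every value (expansion m, translation b)."""
    return [m * y + b for y in values]


def sqrt_transform(values):
    """Apply y' = sqrt(y); only defined for non-negative values."""
    if any(y < 0 for y in values):
        raise ValueError("square root transformation needs non-negative values")
    return [math.sqrt(y) for y in values]


# Made-up, positively skewed example values
y = [1.0, 4.0, 9.0, 100.0]

y_linear = linear_transform(y, m=2.0, b=3.0)
y_sqrt = sqrt_transform(y)
```

Note that the linear transformation keeps the relative spacing of the values, while the square root pulls the largest value (100) much closer to the rest, which is why it can reduce positive skew.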

Basic transformations

A few basic but popular data transformations are described below. The main motivations for applying these transformations include placing variables on similar scales, simplifying calculations, meeting distributional assumptions (such as normality), and dealing with heteroscedasticity.

Equation 1: The power transformation expressed as a piecewise function. Resorting to a log transformation when λ = 0 allows the power transformation to remain continuous for all non-negative real numbers.
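A piecewise form consistent with this description, assuming "y" denotes an original value and " y' " its transformed value, can be written as follows. The log branch additionally assumes strictly positive values.

```latex
y' =
\begin{cases}
  y^{\lambda}, & \lambda \neq 0 \\
  \log(y),     & \lambda = 0
\end{cases}
```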

Equation 2: The Box-Cox transformation. This transformation is used in the Box-Cox procedure to estimate a value of λ which best transforms the variable to meet some criterion such as normality or linearity (see Figure 2 for illustration). The natural logarithm of original values is taken when λ = 0.
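As a rough sketch, the commonly used form of the Box-Cox transformation, y' = (y^λ - 1) / λ for λ ≠ 0 and y' = ln(y) for λ = 0, can be written as below. The function name `box_cox` and the example values are assumptions made for illustration; established statistical packages provide their own implementations.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam if lam != 0, else ln(y)."""
    if any(y <= 0 for y in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]


# Made-up example values
y = [1.0, 4.0, 9.0]

sqrt_like = box_cox(y, 0.5)  # a rescaled square root transformation
logged = box_cox(y, 0.0)     # natural logarithm of the original values
```

With λ = 0.5 the result is a shifted and rescaled square root, so the ordering and relative shape match the plain square root transformation.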

Figure 2: Box-Cox plots for determining optimal λ values for a) normalising and b) linearising transformations. a) λ is chosen such that it maximises the correlation of a Box-Cox-transformed variable, X, with a comparable normal distribution, N(μ,σ). In this illustration, a square root transformation (λ = 0.5) appears to be a good choice. b) λ is chosen such that it maximises the correlation between the variable being transformed, X, and another variable, Y. In the illustrated case, squaring the variable (λ = 2) appears to be a good linearising transformation. If the variables X and Y were negatively correlated, the λ corresponding to the minimum (i.e. most negative) correlation would be chosen.
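The normalising case in panel a) can be approximated with a simple grid search: transform the variable for each candidate λ, sort the result, and correlate it with standard normal quantiles. The sketch below makes several assumptions that are not part of the figure: the helper names, the plotting positions (i + 0.5) / n, the λ grid, and the synthetic data, which are constructed so that a square root is the ideal choice.

```python
import math
from statistics import NormalDist


def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam if lam != 0, else ln(y)."""
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def normal_quantiles(n):
    """Standard normal quantiles at positions (i + 0.5) / n (assumed convention)."""
    nd = NormalDist()
    return [nd.inv_cdf((i + 0.5) / n) for i in range(n)]


def best_lambda(values, grid):
    """Return the lambda whose transformed, sorted values correlate best
    with normal quantiles, plus the score for every candidate."""
    q = normal_quantiles(len(values))
    scores = {lam: pearson(sorted(box_cox(values, lam)), q) for lam in grid}
    return max(scores, key=scores.get), scores


# Synthetic data: squares of shifted normal quantiles, so sqrt recovers normality
x = [(q + 5.0) ** 2 for q in normal_quantiles(25)]
grid = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

lam_hat, scores = best_lambda(x, grid)
```

Plotting `scores` against `grid` would give a curve of the kind the caption describes. For the linearising case in panel b), the same search could correlate the transformed X with a second variable Y instead of normal quantiles.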

Transformations in aid of comparability

Transformation can also promote the comparability of variables that have different magnitudes, variability, or scale such as those that describe different quantities (e.g. pH and enzyme rates). The transformations described below, discussed in more detail by Legendre and Legendre (1998), are applied to two or more variables in order to place them on comparable scales. Which transformation is appropriate to your data will depend on whether you need to correct for differences in magnitude, variability, or both between the variables in question.

Equation 3: Z-scoring a variable "y".
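Z-scoring subtracts a variable's mean from each value and divides the result by the variable's standard deviation, so that every z-scored variable has a mean of 0 and a standard deviation of 1. A minimal sketch follows; the function name, the use of the sample standard deviation (n - 1 denominator), and the pH and enzyme rate values are assumptions made for illustration.

```python
from statistics import mean, stdev


def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)  # sample SD (n - 1 denominator); an assumption here
    if spread == 0:
        raise ValueError("cannot z-score a constant variable")
    return [(y - centre) / spread for y in values]


# Made-up measurements on very different scales
ph = [6.5, 7.0, 7.5, 8.0]
enzyme_rate = [120.0, 150.0, 180.0, 210.0]

z_ph = z_score(ph)
z_rate = z_score(enzyme_rate)
```

After z-scoring, both variables are unitless and directly comparable. In this contrived example the two sets of z-scores coincide, because both variables are evenly spaced.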

Ecologically motivated transformations 

Presented in Legendre and Gallagher (2001), the transformations listed below are closely related to several (dis)similarity and distance measures and have their collective basis in ecological theory. They may be applied prior to analyses such as principal components analysis (PCA) or redundancy analysis (RDA) of, for example, abundance data. These analyses rely on simple Euclidean distances in their ordinations, and Euclidean distances are often not appropriate for count data. Hence, these transformations may improve how effectively such analyses represent ecological relationships. Formulae, further explanation, and examples are available in Legendre and Gallagher (2001).
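As one illustration, the Hellinger transformation described in Legendre and Gallagher (2001) replaces each abundance with the square root of its relative abundance within its site. The sketch below assumes rows are sites and columns are species; the function names and counts are made up for this example.

```python
import math


def hellinger(abundances):
    """Square root of each value's share of its row (site) total."""
    transformed = []
    for row in abundances:
        total = sum(row)
        if total <= 0:
            raise ValueError("each site needs a positive total abundance")
        transformed.append([math.sqrt(value / total) for value in row])
    return transformed


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up counts: sites 0 and 1 share proportions but differ tenfold in totals
sites = [
    [10, 0, 5],
    [100, 0, 50],
    [0, 8, 2],
]

h = hellinger(sites)
d_same_profile = euclidean(h[0], h[1])  # identical proportions
d_diff_profile = euclidean(h[0], h[2])  # different species composition
```

On the raw counts, sites 0 and 1 would be far apart in Euclidean terms purely because of their totals. After the transformation their distance is zero, while the compositionally different site 2 remains distant.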

Warnings

Implementations

References