Data transformations
The main idea...
Occasionally, the variables in a "raw" data set have properties that violate an assumption of a statistical procedure (e.g. normally distributed values) or cannot be compared to other variables due to differences in scale or variability. For example, principal components analysis (PCA) requires that variables be linearly related to one another and on roughly the same scale; otherwise, it will perform poorly. Rather than abandoning an analysis due to inappropriate data structure, it may be possible to transform the variables so they satisfy the conditions in question. A transformation involves the application of a mathematical procedure to every value of a given variable or set of variables to create a new set of values. The new values of the transformed variables should still represent the data, but will be more amenable to analysis or comparison.
The sections below first describe some basic transformations and then discuss transformations specifically geared towards comparing variables. A set of ecologically-motivated transformations intended to allow Euclidean representation of ecological dissimilarities by methods such as PCA and redundancy analysis (RDA) is also summarised.
Before you begin transforming your data, ensure there is a defined and well-supported reason to do so. Common rationales include linearising, normalising, or standardising data in order to respect a method's assumptions.
Figure 1: Schematics illustrating a linear and a square root transformation. a) A linear transformation where the variable "y" is transformed into " y' " through a translation "b" and an expansion "m". This can be expressed by the linear equation y' = my + b. This transformation may be used to place two or more linearly related variables on the same scale. In this illustration, both "b" and "m" are positive, leading to a translation to the right and an expansion, respectively. b) A square root transformation. Larger values of a variable "y" are affected more strongly than smaller values. This transformation is useful when positive data shows a positive skew and a more Gaussian distribution is desired. Hollow circles indicate former positions of values along an axis.
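The two transformations in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the values of m and b, the function names, and the sample data are arbitrary assumptions and are not taken from the figure.

```python
import math

def linear_transform(values, m, b):
    """Apply y' = m*y + b to every value (expansion m, translation b)."""
    return [m * y + b for y in values]

def sqrt_transform(values):
    """Apply y' = sqrt(y); defined here only for non-negative values."""
    if any(y < 0 for y in values):
        raise ValueError("square root transformation needs non-negative values")
    return [math.sqrt(y) for y in values]

y = [1.0, 4.0, 9.0, 100.0]                    # illustrative, positively skewed data
linear = linear_transform(y, m=2.0, b=3.0)    # m and b chosen arbitrarily
rooted = sqrt_transform(y)

print(linear)
print(rooted)
```

The linear transformation preserves the relative spacing of the values, whereas the square root pulls the largest value in far more than the smallest, which is the behaviour panel b) describes.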
Basic transformations
A few basic but popular data transformations are described below. The main motivations for applying these transformations include placing variables on similar scales, simplifying calculations, meeting distributional assumptions (such as normality), and dealing with heteroscedasticity.
Equation 1: The power transformation expressed as a piecewise function. Resorting to a log transformation when λ = 0 allows the power transformation to remain continuous for all non-negative real numbers.
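The equation itself is not reproduced above. A piecewise form consistent with the caption might be written as follows, assuming y denotes an original (positive) value, y' its transformed value, and λ the exponent; the symbol choices are assumptions made for illustration.

```latex
y' =
\begin{cases}
  y^{\lambda} & \text{if } \lambda \neq 0 \\
  \log(y)     & \text{if } \lambda = 0
\end{cases}
```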
Equation 2: The Box-Cox transformation. This transformation is used in the Box-Cox procedure to estimate a value of λ which best transforms the variable to meet some criterion such as normality or linearity (see Figure 2 for illustration). The natural logarithm of original values is taken when λ = 0.
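As a rough sketch, the Box-Cox transformation in the form given by Box and Cox (1964), (y^λ - 1) / λ for λ ≠ 0 and ln(y) for λ = 0, could be written in Python as below. The function name and sample data are illustrative assumptions; the transformation is only defined for strictly positive values.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or ln(y) when lam == 0."""
    if any(y <= 0 for y in values):
        raise ValueError("Box-Cox transformation needs strictly positive values")
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]

y = [0.5, 1.0, 2.0, 8.0]              # illustrative positive data

shifted_only = box_cox(y, 1.0)        # lam = 1 merely subtracts 1 from each value
logged = box_cox(y, 0.0)              # natural logarithm
near_zero = box_cox(y, 1e-6)          # approaches the logarithm as lam -> 0

print(shifted_only)
print(logged)
print(near_zero)
```

Comparing the last two results shows why the logarithm is used at λ = 0: the general expression converges to ln(y) as λ approaches zero.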
Figure 2: Box-Cox plots for determining optimal λ values for a) normalising and b) linearising transformations. a) λ is chosen such that it maximises the correlation of a Box-Cox-transformed variable, X, with a comparable normal distribution, N(μ,σ). In this illustration, a square root transformation (λ = 0.5) appears to be a good choice. b) λ is chosen such that it maximises the correlation between the variable being transformed, X, and another variable, Y. In the illustrated case, squaring the variable (λ = 2) appears to be a good linearising transformation. If the variables X and Y were negatively correlated, the λ corresponding to the minimum (i.e. most negative) correlation would be chosen.
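One way to read panel a) is as a grid search: for each candidate λ, transform the sorted data and correlate them with standard normal quantiles, then keep the λ with the highest correlation. The sketch below follows that reading. The plotting positions (i - 0.5) / n, the candidate grid, the function names, and the synthetic data (constructed so that a square root is exactly normalising) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or ln(y) when lam == 0."""
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def normal_quantiles(n):
    """Standard normal quantiles at plotting positions (i - 0.5) / n."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def best_normalising_lambda(values, candidates):
    """Return the candidate lambda with the highest quantile correlation."""
    ordered = sorted(values)
    quantiles = normal_quantiles(len(ordered))
    scores = {lam: pearson(box_cox(ordered, lam), quantiles) for lam in candidates}
    return max(scores, key=scores.get), scores

# Synthetic positive data: squares of shifted normal quantiles,
# so the square root (lam = 0.5) recovers a normal shape.
x = [(z + 5.0) ** 2 for z in normal_quantiles(50)]
candidates = [i / 4 for i in range(-8, 9)]      # -2.0, -1.75, ..., 2.0

best_lam, scores = best_normalising_lambda(x, candidates)
print(best_lam)
```

For the linearising case in panel b), the same loop could correlate the transformed X with Y instead of with normal quantiles.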
Transformations in aid of comparability
Transformation can also promote the comparability of variables that have different magnitudes, variability, or scale such as those that describe different quantities (e.g. pH and enzyme rates). The transformations described below, discussed in more detail by Legendre and Legendre (1998), are applied to two or more variables in order to place them on comparable scales. Which transformation is appropriate to your data will depend on whether you need to correct for differences in magnitude, variability, or both between the variables in question.
Equation 3: Z-scoring a variable "y".
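Z-scoring subtracts a variable's mean from each value and divides the result by the variable's standard deviation, i.e. z = (y - mean) / sd. A minimal Python sketch follows; it assumes the sample standard deviation (n - 1 denominator), and the pH and enzyme rate values are hypothetical.

```python
from statistics import mean, stdev

def z_score(values):
    """Return (y - mean) / sd for each value, using the sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("cannot z-score a constant variable")
    return [(y - centre) / spread for y in values]

ph = [6.8, 7.1, 7.4, 8.0, 6.5]                      # hypothetical pH readings
enzyme_rate = [120.0, 340.0, 560.0, 910.0, 75.0]    # hypothetical enzyme rates

ph_z = z_score(ph)
rate_z = z_score(enzyme_rate)

print(ph_z)
print(rate_z)
```

After z-scoring, both variables have a mean of zero and a standard deviation of one, so they can be compared despite their very different original magnitudes.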
Ecologically motivated transformations
Presented in Legendre and Gallagher (2001), the transformations listed below are closely related to several (dis)similarity and distance measures and have their collective basis in ecological theory. These transformations may be applied prior to analyses such as principal components analysis (PCA) or redundancy analysis (RDA) of, for example, abundance data. These analyses base their ordinations on simple Euclidean distances, which are often not appropriate for count data. Hence, these transformations may improve the effectiveness of many analyses in representing ecological relationships. Formulae, further explanation, and examples are available in Legendre and Gallagher (2001).
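As an illustration, two of the transformations described by Legendre and Gallagher (2001), the chord and Hellinger transformations, can be sketched in Python as follows. The abundance table is hypothetical (rows are sites, columns are species), and the function names are chosen for this example only.

```python
import math

def chord_transform(table):
    """Divide each row by its Euclidean norm, giving rows of unit length."""
    out = []
    for row in table:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out

def hellinger_transform(table):
    """Square root of relative abundance: sqrt(y_ij / row total)."""
    out = []
    for row in table:
        total = sum(row)
        out.append([math.sqrt(v / total) if total > 0 else 0.0 for v in row])
    return out

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical abundance table: rows are sites, columns are species
abundance = [
    [10, 0, 5],
    [100, 0, 50],    # same proportions as the first site, ten times the counts
    [0, 8, 2],
]

chord = chord_transform(abundance)
hel = hellinger_transform(abundance)

raw_distance = euclidean(abundance[0], abundance[1])   # driven by total abundance
hel_distance = euclidean(hel[0], hel[1])               # compares composition only

print(raw_distance)
print(hel_distance)
```

On the raw counts, the first two sites appear far apart purely because of their totals; after the Hellinger transformation their Euclidean distance reflects only their (identical) species composition.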
Warnings
Choose transformations according to need, rather than as a matter of course. Applying transformations that are too "harsh" (i.e. stronger than needed to prepare data for a particular analysis) may distort results and harm interpretation.
If a numerical interpretation of the results is desired, back-transform values after conducting the analysis so that they are expressed on the original scale.
Ecological data that has been transformed using an ecologically motivated function can often be interpreted in a straightforward manner; however, transformations that simply aim to correct for some property in the data should be considered carefully during interpretation (Legendre and Legendre, 1998).
Some transformations, such as power transformations, require values to be positive. Adding a constant to achieve this is acceptable.
Treat negative values with caution. Ensure that your transformation adequately represents differences between negative values. If this is not possible, translating values into positive numbers by the addition of a constant scalar quantity may be advisable.
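The last three points can be combined in a short Python sketch: a constant is added so that all values become positive, a log transformation is applied, and a summary computed on the transformed scale is then back-transformed by undoing both steps in reverse order. The offset of 1.0, the function name, and the data are arbitrary assumptions; a different constant would generally give a different result after a non-linear transformation.

```python
import math
from statistics import mean

def shift_to_positive(values, offset=1.0):
    """Add one constant to every value so the smallest value equals `offset`."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    constant = offset - min(values)
    return [v + constant for v in values], constant

y = [-3.0, -1.0, 0.0, 4.0]                    # illustrative data with negative values

shifted, constant = shift_to_positive(y)      # offset=1.0 is an arbitrary choice
logged = [math.log(v) for v in shifted]

# Back-transform a summary computed on the log scale:
# undo the logarithm first, then remove the added constant.
mean_logged = mean(logged)
back_transformed = math.exp(mean_logged) - constant

print(constant, shifted)
print(back_transformed, mean(y))
```

The back-transformed mean differs from the arithmetic mean of the original values, which is one reason results obtained on a transformed scale need careful interpretation.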
Implementations
R
scale() in the base package allows translational and expansion-based scaling.
decostand() in the vegan package contains several transformation functions.
boxcox() in the MASS package generates a plot of values of λ against the log-likelihood (derived from a linear model).
References
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26(2):211-252.
Legendre P, Gallagher ED (2001) Ecologically meaningful transformations for ordination of species data. Oecologia 129(2):271-280.
Legendre P, Legendre L (1998) Numerical Ecology. 2nd ed. Elsevier, Amsterdam. ISBN 978-0444892508.