Principal Components Analysis

The main idea...

Principal components analysis (PCA) is a method to summarise, in a low-dimensional space, the variance in a multivariate scatter of points. In doing so, it provides an overview of linear relationships between your objects and variables. This can often act as a good starting point in multivariate data analysis by allowing you to note trends, groupings, key variables, and potential outliers. Further, if you have a data set with many variables and relatively few objects (i.e. a "large p, small n" table), PCA can help collapse these many variables into a few principal components (PCs), which can be used in further analyses. 

Consider a data table with 40 objects (rows) and 5 variables (x1...x5; columns). A 5-dimensional scatter plot (i.e. a plot with 5 orthogonal axes) with each object's coordinates in the form (x1, x2, x3, x4, x5) is impossible to visualise and interpret. Roughly speaking, PCA rotates this 5-dimensional space so that a lower-dimensional representation captures as much of its variability as possible. A new set of axes (known as principal components) forms the basis of this lower-dimensional representation. An illustration of this for a simpler, 3-dimensional space is shown in Figure 1.
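To make this concrete, the rotation can be sketched in a few lines of Python with NumPy. The 40 × 5 data table below is invented for illustration (two latent gradients generate the five observed variables), so most of its variability should be expressible along the first two principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 40 x 5 data table: objects (rows) x variables (columns).
# Two latent gradients drive the five observed variables, plus a little noise.
latent = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.1 * rng.normal(size=(40, 5))

# PCA via singular value decomposition of the column-centred table.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Proportion of total variance captured by each principal component.
explained = s**2 / np.sum(s**2)
# Object coordinates on the new, rotated axes (the PC scores).
scores = Xc @ Vt.T

print(explained.round(3))
```

Because the simulated variables share two underlying gradients, the first two entries of `explained` should dominate, i.e. a 2-dimensional representation preserves most of the 5-dimensional scatter.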

Figure 1: An intuitive sketch of PCA's aims. Panel a shows a 3-dimensional scatter plot in which the variability between the six points in box i is obscured. Panel b shows a rotation of the original axes to maximise the variability visible in a two-dimensional space. The first principal component would be constructed in the direction of maximum scatter (i.e. maximum variability; dashed line). Subsequent PCs would be constructed in the same manner; however, each must be orthogonal to (have no correlation with) all other PCs. The original variables would be rescaled as needed and may be represented in a biplot (Figure 2).

Linear combinations of the original variables are used to build the principal components (PCs). The first PC is placed through the scatter of points so as to maximise the amount of variation along it (Figure 1b). The same criterion applies to each subsequent PC calculated; however, each PC must be orthogonal to every other PC, that is, the covariance between any pair of PCs is strictly zero. If a few PCs capture most (70-90%) of the variance in the original scatter, the PCA has been very successful in representing the variability in your data; however, ecological data sets are rarely summarised so well. Smaller amounts of total variance captured (30-40%) can also be informative. There are several methods to estimate the number of informative PCs generated by a PCA, such as the "broken stick" model and the Kaiser-Guttman criterion. See Jackson's (1993) discussion for insight into their effectiveness.
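Both stopping rules mentioned above are simple to compute from the eigenvalues (the variances of the PCs). The sketch below is a minimal NumPy illustration; the helper names and the example eigenvalues are invented, and the rules themselves follow their standard formulations (broken stick: b_k = (1/p) * sum over i = k..p of 1/i; Kaiser-Guttman: retain PCs whose eigenvalue exceeds the mean eigenvalue):

```python
import numpy as np

def broken_stick(p):
    """Expected proportions of variance for p components under the
    broken-stick null model: b_k = (1/p) * sum_{i=k..p} 1/i."""
    return np.array([sum(1.0 / i for i in range(k, p + 1)) / p
                     for k in range(1, p + 1)])

def informative_pcs(eigenvalues):
    """Hypothetical helper applying two stopping rules.
    Kaiser-Guttman: keep PCs whose eigenvalue exceeds the mean eigenvalue.
    Broken stick: keep PCs whose variance proportion exceeds the model's."""
    ev = np.asarray(eigenvalues, dtype=float)
    kaiser = ev > ev.mean()
    bstick = ev / ev.sum() > broken_stick(ev.size)
    return kaiser, bstick

# Invented eigenvalues in which the first two PCs clearly dominate.
kaiser, bstick = informative_pcs([4.0, 2.5, 0.3, 0.15, 0.05])
print(kaiser, bstick)
```

For these eigenvalues, both criteria agree that only the first two PCs are informative; on real data the two rules often disagree, which is part of Jackson's (1993) point.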

PCA thus aims to reduce the number of variables in large data sets and thereby assist interpretation. This is most often an initial step that informs further analyses. PCs themselves can be extracted from a PCA result and used as new variables in subsequent analyses such as multiple regression. If this is done, the analyst must carefully consider what these PCs represent.
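One common use of extracted PCs is as predictors in a regression (often called principal components regression). The sketch below assumes NumPy and invents its own data: five collinear variables driven by two latent gradients, and a response y driven by the same gradients, so regressing y on just the first two PC scores should fit well:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: five observed variables driven by two latent
# gradients, and a response y that depends on those same gradients.
latent = rng.normal(size=(40, 2))
loadings = np.array([[1.0, 0.8, 0.0, -0.5, 0.3],
                     [0.0, 0.4, 1.0, 0.7, -0.6]])
X = latent @ loadings + 0.05 * rng.normal(size=(40, 5))
y = 1.5 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=40)

# Extract PC scores from the centred predictors ...
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# ... then regress y on the first two PCs instead of the five raw,
# collinear variables.
Z = np.column_stack([np.ones(40), scores[:, :2]])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ coef
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))
```

Note the caveat from the text: the fit is good here only because the PCs happen to align with the gradients that drive y. On real data, PCs maximise variance in X, not relevance to y, so the analyst must still ask what each retained PC represents.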

Pre-analysis

Results and interpretation

Implementations of PCA typically report the following results:


Figure 2: A PCA biplot. Points represent objects (rows). Red vectors represent the original variables (columns) used to build the PCs. The interpretation of the ordination depends on the type of scaling used. See text for description.

Reading a PCA biplot

The results of a PCA are typically visualised using a biplot (Figure 2). The interpretation of this biplot depends on the scaling chosen; properties of each scaling are presented below. In general, consider type I scaling if the distances between objects are of particular interest and type II scaling if the correlative relationships between variables are of more interest. Further interpretation is discussed below, and more detail is available in Legendre and Legendre (1998) and ter Braak (1994).
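The two scalings can be derived from the same singular value decomposition; only how the singular values are allotted to object and variable scores differs. The NumPy sketch below (with invented data) checks the two defining properties: under type I scaling, distances among object points reproduce the distances among the original centred rows; under type II scaling, the cosine of the angle between two variable vectors reproduces the correlation between those variables (exactly when all PCs are kept, approximately in a 2-D biplot):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Type I (distance) scaling: objects at U * s, variables at V.
obj_I = U * s
var_I = Vt.T

# Type II (correlation) scaling: objects at U, variables at V * s.
# Inner products among variable vectors then equal (n - 1) x covariance,
# so the cosines of the angles between them equal the correlations.
obj_II = U
var_II = Vt.T * s

# Property 1: type I inter-object distances match the original distances.
d_original = np.linalg.norm(Xc[0] - Xc[1])
d_typeI = np.linalg.norm(obj_I[0] - obj_I[1])

# Property 2: type II vector angles match the correlation of variables 2 and 3.
cos_angle = (var_II[1] @ var_II[2]) / (
    np.linalg.norm(var_II[1]) * np.linalg.norm(var_II[2]))
corr = np.corrcoef(Xc[:, 1], Xc[:, 2])[0, 1]

print(round(d_original - d_typeI, 10), round(cos_angle - corr, 10))
```

Both differences are zero to floating-point precision when all PCs are retained; the usual 2-D biplot shows only the first two columns of these score matrices, so the same properties hold only approximately there.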

Figure 3: Schematics highlighting a) the projection of ordinated objects onto a vector and b) the angles between vectors. The projection of an ordinated point onto a variable vector, as shown for point i in panel a, approximates that variable's value for the object. Hence, visual inspection suggests object i can be expected to have higher values of variable 1 relative to most other objects; object ii, in contrast, can be expected to have lower values of variable 1 relative to other objects. Note that the dashed line is not typically shown in a biplot and appears here for clarity. When using type II scaling, the cosines of the angles between vectors (panel b) approximate the correlations between the variables they represent. In this case, ∠a approaches 90°, suggesting that variables "1" and "2" show very little covariance (i.e. they are almost orthogonal, just as independent axes are). ∠b is less than 90°, suggesting positive covariance between variables "2" and "3", while ∠c approaches 180°, suggesting strong negative covariance between variables "2" and "4" (i.e. the directions of increase of variables "2" and "4" oppose one another). Variable 5 is non-quantitative and is represented by a centroid. A right-angled projection of this centroid onto variable 4's vector suggests the two are positively linked.
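The projection rule illustrated in panel a has an exact algebraic counterpart worth seeing once: in a distance biplot, the inner product of an object's point with a variable's vector recovers that object's centred value for the variable, exactly when all PCs are kept and approximately in the usual 2-D biplot. A small NumPy check, with invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Distance (type I) scaling: objects at U * s, variables at V.
obj = U * s
var = Vt.T

# Projecting every object onto every variable vector (an inner product)
# reconstructs the centred data table.
full = obj @ var.T                  # all 5 PCs: exact reconstruction
approx = obj[:, :2] @ var[:, :2].T  # first 2 PCs: the biplot approximation

print(np.allclose(full, Xc))
```

The 2-PC version is only an approximation, which is why reading values off a biplot (as for points i and ii in Figure 3a) should be treated as indicative rather than exact.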

Type I Scaling - Distance biplot

Type II Scaling - Covariance/Correlation biplot

Key assumptions

Warnings

Implementations

MASAME PCA app

    Click here to launch...

References