The main idea...
Path analysis, a form of structural equation modelling (Wright 1921, 1923, 1934), is a regression-based approach to determining how well a set of variables, and the hypothesised relationships between them, is compatible with (i.e. fits well with) a given data set. Legendre and Legendre (1998) note that this has the virtue of compelling researchers to state their hypotheses explicitly. A variety of regression-type methods may be used to detect the relationships between variables of interest; examples include multiple linear regression (MLR) and its multivariate analogue, redundancy analysis (RDA). This page describes an MLR approach as an example; using path analysis with multivariate approaches simply involves the use of matrices rather than single variables.
The approach...
Prior to conducting path analysis, a model should be defined. If the variance of a variable is expected to be explained (at least partially) by other variables in the model, that variable may be considered an endogenous variable. In contrast, if the variance of a variable is not expected to be explained by the other variables in the model, it should be considered an exogenous variable. While the direction of relationships between two endogenous variables, or between an endogenous and an exogenous variable, can be asserted, no such assertion can be made between exogenous variables. Extraneous variables, which are not part of the model, may also be included to represent unexplained variance. Models that are subject to path analysis should be as simple as possible: the hypothesised relationships between variables should be kept to a minimum relative to the number of measurable variables (see Identification below).
When using MLR, a sequential approach to path analysis is often employed. Each endogenous variable is treated as a response variable, with all variables expected to have a direct effect on it (i.e. an effect not mediated by other variables) included as explanatory variables. The resulting regression coefficients (β1 ... βn) are used as path coefficients (see Results and interpretation).
Figure 1: An illustration of a path analysis diagram with two exogenous and two endogenous variables. Variables A and B are exogenous, while variables C and D are endogenous. The residual variance (i.e. variance not explained by other variables in the model) of each endogenous variable is represented by a path to an extraneous variable (R_i). The associated regression equations for the endogenous variables, C and D, are:
C = p_CA A + p_CB B + R_C
D = p_DA A + p_DB B + p_DC C + R_D
Note that all higher-order causal variables (i.e. variables that influence a given endogenous variable) are included in the regression equation of a given endogenous variable. All values in this figure are for illustrative purposes only.
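To make the sequential approach concrete, the sketch below fits the Figure 1 model with ordinary least squares in R. The data are simulated here purely for illustration (the variable names and coefficient values are assumptions, not part of the original example); in practice the columns would hold measured variables.

# Sequential MLR for the Figure 1 model (illustrative sketch in R).
# A, B, C and D are simulated stand-ins for measured variables.
set.seed(1)
n <- 100
A <- rnorm(n)
B <- rnorm(n)
C <- 0.5 * A + 0.3 * B + rnorm(n, sd = 0.7)             # endogenous
D <- 0.4 * A + 0.2 * B + 0.6 * C + rnorm(n, sd = 0.5)   # endogenous
dat <- as.data.frame(scale(cbind(A, B, C, D)))          # centre and standardise

# One regression per endogenous variable, with all of its direct causes
# included as explanatory variables; the fitted coefficients serve as
# the path coefficients p_CA, p_CB, p_DA, p_DB and p_DC.
fit_C <- lm(C ~ A + B, data = dat)
fit_D <- lm(D ~ A + B + C, data = dat)
coef(fit_C)
coef(fit_D)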
Results and interpretation
A typical result of path analysis is a path diagram (Figure 1), which is similar to a directed graph. Variables are represented as nodes connected by arrows which represent the existence and strength of a relationship between them. These arrows are referred to as "paths" and point from the causal variable to the effect variable. Single-headed arrows are drawn to and from endogenous variables. Double-headed arrows are drawn when the direction of a relationship between two variables cannot be asserted (i.e. between exogenous variables). Each path has a path coefficient (see below).
Evaluating models
To evaluate a path analysis model, complete with theoretical path coefficients, one compares how well the variable correlations predicted by the model match the correlations determined empirically. Models that better predict the empirical correlations are generally preferred, although ecological knowledge should always inform model selection.
To arrive at the predicted correlation between a causal variable and an effect variable, one starts with the direct path coefficient between them. If the causal variable also influences the effect variable indirectly (i.e. through one or more mediating variables), the product of the path coefficients along each indirect path is added to the total. In Figure 1, the predicted correlation between variables A and D would be r_AD = p_DA + p_DC × p_CA.
The actual comparison between the predicted and calculated correlations is often done by squaring the difference between the predicted correlations and the calculated correlations, determining the mean of the results, and taking the square-root of that mean. This is known as the root mean square residual (RMSR). Other appropriate statistics, such as the chi-squared statistic, may also be used.
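Continuing the illustrative sketch above, the snippet below compares the correlation between A and D predicted by the Figure 1 model with the empirically calculated correlation, and computes an RMSR over that single pair; with a real model, all predicted/observed correlation pairs would be included.

# Predicted correlation between A and D: direct path plus the indirect
# path through the mediating variable C.
p_CA <- coef(fit_C)["A"]
p_DA <- coef(fit_D)["A"]
p_DC <- coef(fit_D)["C"]
r_AD_pred <- p_DA + p_DC * p_CA
r_AD_obs  <- cor(dat$A, dat$D)   # empirically calculated correlation

# Root mean square residual over vectors of predicted and observed correlations.
rmsr <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmsr(r_AD_pred, r_AD_obs)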
Identification
Following path analysis, parameters (e.g. path coefficients) and the model itself may have different levels of 'identification'. Simply put, the level of identification a parameter or model has depends on how well the available data (i.e. known information) can support the model's parameter estimates (unknown information). In general, it is desirable to have more known information than unknown information. In the MLR case, the known information consists primarily (but not necessarily exclusively) of the variances and covariances of the variables that have been empirically measured.
If multiple parameter estimates (unknowns) can reproduce the known relationships between variables, the information available is insufficient to identify a single, best estimate. Parameters that may have more than one solution are said to be under-identified. A model with under-identified parameters is itself under-identified and cannot be used to determine which of the possible parameter estimates fits the data better than the others.
In the case of an identified parameter (e.g. an identified path coefficient), a single optimal parameter estimate will fit the known information. Thus, identification also implies that there is sufficient data (and that the data meet all the necessary assumptions) to arrive at a given parameter estimate. If all the parameters in a model are identified, the model itself is said to be identified. If the known relationships between variables are just sufficient to allow each parameter in the model to be identified, the model is considered 'just identified'. Any less known information, or any more unknown information, would lead to under-identification (see above). Just-identified models can only fit the available data in a single way and are thus not particularly interesting for model-testing purposes.
Over-identified models contain more known information than unknown information. This means that there are degrees of freedom available after the analysis (i.e. not all the degrees of freedom have been consumed in parameter estimation) and that the best fit of the model to the data is more informative than is the case with just-identified models. In general, researchers should aim to perform path analysis with over-identified models.
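As an illustrative count for the Figure 1 model (an assumption made for this page's example, not a result stated in the original text), the known and unknown information happen to balance exactly, so the model as drawn is just identified.

# Known information: p * (p + 1) / 2 observed variances and covariances.
p <- 4
known <- p * (p + 1) / 2   # 10
# Unknowns in the Figure 1 model: 5 path coefficients, 2 residual variances,
# 2 exogenous variances and 1 exogenous covariance.
unknown <- 5 + 2 + 2 + 1   # 10
known - unknown            # 0 degrees of freedom: just identified

Dropping one or more hypothesised paths (fewer unknowns) would leave positive degrees of freedom and yield an over-identified model.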
Key assumptions
The assumptions of path analysis are similar to those of MLR (or, in the multivariate case, RDA). Additionally:
Each variable should be centred on its mean (i.e. be transformed to have a mean of zero). This simplifies the regression equations by eliminating intercepts (see the sketch after this list).
There are causal relationships between variables included in the model. Some correlative (i.e. not causal) relationships are permitted, but these should be kept to a minimum if possible.
There is a causal order between most variables. For example, the model specifies that A has an effect on C, but not the inverse. Some variables may remain unanalysed (see above).
The model is causally closed. That is, all residual variances are independent from one another. Thus, a given variable's residual variance (i.e. variance not explained by variables in the model) should be explained by extraneous variables that are not shared with other variables. In Figure 1, this would mean that the variable(s) represented by R_C must not affect variable D and that R_C and R_D have a covariance of 0. If variables are believed to share extraneous variables, they should be considered as exogenous variables rather than endogenous variables.
The amount of data for each variable must be comparable.
Variables are quantitative or have been recoded as, e.g., dummy variables.
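A minimal illustration of the centring assumption, reusing the simulated variables from the sketch above: once the variables are mean-centred, the fitted intercept is numerically zero and can be omitted from the path equations.

# Centre (but do not rescale) the illustrative variables A, B and C.
dat_c <- as.data.frame(scale(cbind(A, B, C), center = TRUE, scale = FALSE))
coef(lm(C ~ A + B, data = dat_c))["(Intercept)"]   # effectively zero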
Warnings
The number of variables should be kept small (< 5) as 1) with more variables, the number of possible models increases quickly and 2) overly complex models may prove very difficult to interpret.
The method used to detect relationships between variables must be appropriate to these variables' distributions and properties. Transformations may be used to prepare variables appropriately.
A "blind" approach to defining the initial model is not recommended. Evidence, previous knowledge, or sound reasoning should be the basis for such a model.
The results of path analysis cannot 'prove' the existence of causal relationships, but only provide an estimate of how well a data set supports a causal model.
Variables which are believed to have a common cause should be specified as exogenous variables. Any correlation between these variables may be due to this shared cause, resulting in misleadingly high path coefficients.
The results of path analysis are not the same as those of a partial regression analysis. Path analysis is concerned with identifying direct and indirect effects between variables, while partial regression (and related partial methods) are concerned with estimating the effect of one variable on another while controlling for (i.e. removing) the influence of other variables in a particular model.
Walkthroughs featuring path analysis
Implementations
R
References
Legendre P, Legendre L (1998) Numerical Ecology. 2nd ed. Elsevier, Amsterdam. ISBN 978-0444892508.
Wright S (1921) Correlation and causation. J Agr Res 20:557–585.
Wright S (1923) The theory of path coefficients: a reply to Niles’s criticism. Genetics 8:239–255.
Wright S (1934) The method of path coefficients. Ann Math Stat 5:161–215.
Shipley B (2000) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge University Press, Cambridge.