Canonical Correlation Analysis

Introduction

Canonical correlation analysis is a method for exploring the relationships between two multivariate sets of variables (vectors), all measured on the same individual.

Consider, as an example, variables related to exercise and health. On one hand you have variables associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can run, the amount of weight lifted on bench press, the number of push-ups per minute, etc. But you also might have health variables such as blood pressure, cholesterol levels, glucose levels, body mass index, etc. So two types of variables are measured and the relationships between the exercise variables and the health variables are to be studied.

As a second example consider variables measured on environmental health and environmental toxins. A number of environmental health variables such as frequencies of sensitive species, species diversity, total biomass, productivity of the environment, etc. may be measured on one hand; on the other a second set of variables such as environmental toxins which might include the concentrations of heavy metals, pesticides, dioxin, etc. are measured.

For a third example consider a group of sales representatives, on whom we have recorded several sales performance variables along with several measures of intellectual and creative aptitude. We may wish to explore the relationships between the sales performance variables and the aptitude variables.

One approach to studying relationships between the two sets of variables is to use canonical correlation analysis which describes the relationship between the first set of variables and the second set of variables. We do not necessarily think of one set of variables as independent and the other as dependent, though that may potentially be another approach.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

  • Carry out a canonical correlation analysis using SAS (Minitab does not have this functionality);
  • Assess how many canonical variate pairs should be considered;
  • Interpret canonical variate scores;
  • Describe the relationships between variables in the first set with variables in the second set.

1 - Setting the Stage for Canonical Correlation Analysis

What motivates canonical correlation analysis?

It is possible to create pairwise scatter plots with variables in the first set (e.g., exercise variables) and variables in the second set (e.g., health variables). But if the dimension of the first set is p and that of the second set is q, there will be pq such scatter plots, and it may be difficult, if not outright impossible, to look at all of these graphs together and interpret the results.

Similarly, you could compute all pq correlations between the variables in the first set (e.g., exercise variables) and the variables in the second set (e.g., health variables). But when pq is large, the same problem of interpretation arises.

Canonical correlation analysis allows us to summarize the relationships in a smaller number of statistics while preserving their main facets. In this way, the motivation for canonical correlation is very similar to that for principal component analysis: it is another dimension reduction technique.

Canonical Variates

Let's begin with the notation:

We have two sets of variables X and Y.

Suppose we have p variables in set 1: $X = (X_1, X_2, \dots, X_p)^T$

and suppose we have q variables in set 2: $Y = (Y_1, Y_2, \dots, Y_q)^T$

We label the two sets so that p ≤ q, that is, X is the set with fewer variables. This is done for computational convenience.

Just as in principal components analysis, we look at linear combinations of the data. We define a set of linear combinations named U and V. U will correspond to the linear combinations from the first set of variables, X, and V will correspond to the second set of variables, Y. Each member of U will be paired with a member of V. For example, $U_1$ below is a linear combination of the p X variables and $V_1$ is the corresponding linear combination of the q Y variables.

Similarly, $U_2$ is a linear combination of the p X variables, and $V_2$ is the corresponding linear combination of the q Y variables. And so on.

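$$
\begin{aligned}
U_1 &= a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p \\
U_2 &= a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p \\
&\;\;\vdots \\
U_p &= a_{p1}X_1 + a_{p2}X_2 + \dots + a_{pp}X_p \\
V_1 &= b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q \\
V_2 &= b_{21}Y_1 + b_{22}Y_2 + \dots + b_{2q}Y_q \\
&\;\;\vdots \\
V_p &= b_{p1}Y_1 + b_{p2}Y_2 + \dots + b_{pq}Y_q
\end{aligned}
$$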

Thus define

$(U_i, V_i)$

as the ith canonical variate pair. $(U_1, V_1)$ is the first canonical variate pair, $(U_2, V_2)$ is the second canonical variate pair, and so on. With p ≤ q there are p canonical variate pairs.

We are to find linear combinations that maximize the correlations between the members of each canonical variate pair.

We can compute the variance of each $U_i$ using the following expression:

$$\operatorname{var}(U_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} a_{ik}a_{il}\operatorname{cov}(X_k, X_l)$$

The coefficients $a_{i1}$ through $a_{ip}$ that appear in the double sum are the same coefficients that appear in the definition of $U_i$. The covariances between the kth and lth X-variables are multiplied by the corresponding coefficients $a_{ik}$ and $a_{il}$ for the variate $U_i$.

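Similar calculations can be made for the variance of $V_j$, where both sums now run over the q Y-variables:

$$\operatorname{var}(V_j) = \sum_{k=1}^{q}\sum_{l=1}^{q} b_{jk}b_{jl}\operatorname{cov}(Y_k, Y_l)$$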

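Then calculate the covariance between $U_i$ and $V_j$ as:

$$\operatorname{cov}(U_i, V_j) = \sum_{k=1}^{p}\sum_{l=1}^{q} a_{ik}b_{jl}\operatorname{cov}(X_k, Y_l)$$

The correlation between $U_i$ and $V_j$ is calculated using the usual formula. We take the covariance between those two variables and divide it by the square root of the product of the variances:

$$\frac{\operatorname{cov}(U_i, V_j)}{\sqrt{\operatorname{var}(U_i)\operatorname{var}(V_j)}}$$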

The canonical correlation is a specific type of correlation. The canonical correlation for the ith canonical variate pair is simply the correlation between $U_i$ and $V_i$:

$$\rho_i = \frac{\operatorname{cov}(U_i, V_i)}{\sqrt{\operatorname{var}(U_i)\operatorname{var}(V_i)}}$$

This quantity is to be maximized. We want to find linear combinations of the X's and linear combinations of the Y's that maximize the above correlation.
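As a quick numerical check of these formulas, the following Python sketch evaluates the double sums directly and compares them with the sample variance, covariance, and correlation of the linear combinations themselves. The data are synthetic, and the coefficient vectors `a` and `b` are arbitrary illustrative choices; they are not canonical coefficients and have nothing to do with the examples in this lesson.

```python
import numpy as np

# Synthetic data (an assumption for illustration): two sets sharing latent factors
rng = np.random.default_rng(0)
n, p, q = 200, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

# Arbitrary coefficients for one linear combination of each set
a = np.array([0.5, -1.0, 2.0])
b = np.array([1.0, 0.0, -0.5, 0.3])

# Sample covariance blocks: cov(X_k, X_l), cov(X_k, Y_l), cov(Y_k, Y_l)
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]

# The double sums from the text, written out term by term
var_u = sum(a[k] * a[l] * Sxx[k, l] for k in range(p) for l in range(p))
var_v = sum(b[k] * b[l] * Syy[k, l] for k in range(q) for l in range(q))
cov_uv = sum(a[k] * b[l] * Sxy[k, l] for k in range(p) for l in range(q))
corr_uv = cov_uv / np.sqrt(var_u * var_v)

# The same quantities computed from the linear combinations themselves
U = X @ a
V = Y @ b
print(var_u, U.var(ddof=1))
print(corr_uv, np.corrcoef(U, V)[0, 1])
```

Each printed pair should agree up to floating-point error. In matrix form the double sums are simply `a @ Sxx @ a`, `b @ Syy @ b`, and `a @ Sxy @ b`.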

Canonical Variates Defined

Let us look at each of the p canonical variate pairs one by one.

First canonical variate pair: $(U_1, V_1)$

The coefficients $a_{11}, a_{12}, \dots, a_{1p}$ and $b_{11}, b_{12}, \dots, b_{1q}$ are to be selected so as to maximize the canonical correlation $\rho_1$ of the first canonical variate pair. This is subject to the constraint that the variances of the two canonical variates in that pair are equal to one.

$$\operatorname{var}(U_1) = \operatorname{var}(V_1) = 1$$

This is required so that unique values for the coefficients are obtained.

Second canonical variate pair: $(U_2, V_2)$

Similarly we want to find the coefficients $a_{21}, a_{22}, \dots, a_{2p}$ and $b_{21}, b_{22}, \dots, b_{2q}$ that maximize the canonical correlation $\rho_2$ of the second canonical variate pair, $(U_2, V_2)$. Again, we will maximize this canonical correlation subject to the constraints that the variances of the individual canonical variates are both equal to one. Furthermore, we require the additional constraints that $(U_1, U_2)$ and $(V_1, V_2)$ have to be uncorrelated. In addition, the combinations $(U_1, V_2)$ and $(U_2, V_1)$ must be uncorrelated. In summary, our constraints are:

$$
\begin{aligned}
\operatorname{var}(U_2) &= \operatorname{var}(V_2) = 1, \\
\operatorname{cov}(U_1, U_2) &= \operatorname{cov}(V_1, V_2) = 0, \\
\operatorname{cov}(U_1, V_2) &= \operatorname{cov}(U_2, V_1) = 0.
\end{aligned}
$$

Basically we require that all of the remaining correlations equal zero.

This procedure is repeated for each pair of canonical variates. In general, ...

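ith canonical variate pair: $(U_i, V_i)$

We want to find the coefficients $a_{i1}, a_{i2}, \dots, a_{ip}$ and $b_{i1}, b_{i2}, \dots, b_{iq}$ that maximize the canonical correlation $\rho_i$ subject to the analogous constraints that

$$
\begin{aligned}
\operatorname{var}(U_i) &= \operatorname{var}(V_i) = 1, \\
\operatorname{cov}(U_1, U_i) &= \operatorname{cov}(V_1, V_i) = 0, \\
\operatorname{cov}(U_2, U_i) &= \operatorname{cov}(V_2, V_i) = 0, \\
&\;\;\vdots \\
\operatorname{cov}(U_{i-1}, U_i) &= \operatorname{cov}(V_{i-1}, V_i) = 0, \\
\operatorname{cov}(U_1, V_i) &= \operatorname{cov}(U_i, V_1) = 0, \\
\operatorname{cov}(U_2, V_i) &= \operatorname{cov}(U_i, V_2) = 0, \\
&\;\;\vdots \\
\operatorname{cov}(U_{i-1}, V_i) &= \operatorname{cov}(U_i, V_{i-1}) = 0.
\end{aligned}
$$

Again, we require all of the remaining correlations to equal zero.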
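One common way to carry out this constrained maximization numerically is to whiten each set of variables with the Cholesky factor of its covariance matrix and then take a singular value decomposition of the whitened cross-covariance. The sketch below does this in Python and checks that the resulting variates satisfy every constraint listed above. It is a minimal illustration, not necessarily the algorithm SAS uses; the function name `cca` and the synthetic data are assumptions.

```python
import numpy as np

def cca(X, Y):
    """Return (A, B, rho); row i of A and B holds the coefficients of U_i and V_i."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Lx = np.linalg.cholesky(Sxx)           # Sxx = Lx Lx^T
    Ly = np.linalg.cholesky(Syy)           # Syy = Ly Ly^T
    # Whitened cross-covariance K = Lx^{-1} Sxy Ly^{-T}
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    W, rho, Zt = np.linalg.svd(K, full_matrices=False)
    A = np.linalg.solve(Lx.T, W).T         # a_i = Lx^{-T} w_i, so var(U_i) = 1
    B = np.linalg.solve(Ly.T, Zt.T).T      # b_i = Ly^{-T} z_i, so var(V_i) = 1
    return A, B, rho

# Synthetic data (an assumption for illustration), with p = 3 and q = 4
rng = np.random.default_rng(0)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

A, B, rho = cca(X, Y)
U = (X - X.mean(axis=0)) @ A.T             # canonical variates U_1, ..., U_p
V = (Y - Y.mean(axis=0)) @ B.T             # canonical variates V_1, ..., V_p

# Joint covariance of (U, V): identity blocks on the diagonal, and the
# canonical correlations on the diagonal of the off-diagonal block
C = np.cov(np.hstack([U, V]), rowvar=False)
print(np.round(rho, 3))
print(np.round(C, 3))
```

The singular values come back in decreasing order, which matches the convention that the first pair has the largest canonical correlation. The signs of the coefficient vectors are arbitrary: flipping the sign of both members of a pair leaves every constraint and every canonical correlation unchanged.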

Next, let's see how this is carried out in SAS...

2 - Example: Sales Data

The data to be analyzed come from a firm that surveyed a random sample of n = 50 of its employees in an attempt to determine what factors influence sales performance. Two collections of variables were measured:

  • Sales Performance:
    • Sales Growth
    • Sales Profitability
    • New Account Sales
  • Test Scores as a Measure of Intelligence
    • Creativity
    • Mechanical Reasoning
    • Abstract Reasoning
    • Mathematics

There are p = 3 variables in the first group relating to Sales Performance and q = 4 variables in the second group relating to the Test Scores.

Canonical Correlation Analysis is carried out in SAS using a canonical correlation procedure that is abbreviated as cancorr. We will look at how this is carried out in the SAS Program sales.sas.

SAS Program

3 - Test for Relationship Between Canonical Variate Pairs

The very first thing to determine is if there is any relationship between the two sets of variables at all. Perhaps the two sets of variables are completely unrelated to one another and independent!

To test for independence between the Sales Performance and the Test Score variables, first consider a multivariate multiple regression model in which we predict the Sales Performance variables from the Test Score variables. In the general case, we have p multiple regressions, each predicting one of the variables in the first group (X variables) from the q variables in the second group (Y variables).

$$
\begin{aligned}
X_1 &= \beta_{10} + \beta_{11}Y_1 + \beta_{12}Y_2 + \dots + \beta_{1q}Y_q + \epsilon_1 \\
X_2 &= \beta_{20} + \beta_{21}Y_1 + \beta_{22}Y_2 + \dots + \beta_{2q}Y_q + \epsilon_2 \\
&\;\;\vdots \\
X_p &= \beta_{p0} + \beta_{p1}Y_1 + \beta_{p2}Y_2 + \dots + \beta_{pq}Y_q + \epsilon_p
\end{aligned}
$$

In our example, we have multiple regressions predicting the p = 3 sales variables from the q = 4 test score variables. We wish to test the null hypothesis that these regression coefficients (except for the intercepts) are all equal to zero. This would be equivalent to the null hypothesis that the first set of variables is independent from the second set of variables.

$$H_0: \beta_{ij} = 0; \quad i = 1, 2, \dots, p; \; j = 1, 2, \dots, q$$

This is carried out using Wilks' lambda. The results are found on page 1 of the output of the SAS program.

SAS Output

SAS reports Wilks' lambda Λ = 0.00215; F = 87.39; d.f. = 12, 114; p < 0.0001. Wilks' lambda is the ratio of the determinants of two variance-covariance matrices (the likelihood ratio is this quantity raised to a certain power), so small values of Λ count against the null hypothesis. SAS converts Λ to an approximate F statistic; if this F statistic is too large (small p-value), we reject the null hypothesis. Here we reject the null hypothesis that there is no relationship between the two sets of variables, and conclude that the two sets of variables are dependent. Note also that the above null hypothesis is equivalent to the null hypothesis that all p canonical variate pairs are uncorrelated, or

$$H_0: \rho_1 = \rho_2 = \dots = \rho_p = 0$$

Since Wilks' lambda is significant, and since the canonical correlations are ordered from largest to smallest, we can conclude that at least $\rho_1 \neq 0$.
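The equivalence between the regression hypothesis and the canonical correlation hypothesis can be illustrated numerically. The Python sketch below computes Wilks' lambda once as a ratio of determinants from the multivariate regression of X on Y (residual sums of squares and cross-products over total), and once as the product of $(1 - \rho_i^2)$ over the squared canonical correlations. The data are synthetic and the variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

# Route 1: multivariate regression of the X variables on the Y variables
D = np.column_stack([np.ones(n), Y])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(D, X, rcond=None)  # one column of coefficients per X_i
resid = X - D @ beta
Xc = X - X.mean(axis=0)
E = resid.T @ resid                           # residual SSCP matrix
T = Xc.T @ Xc                                 # total SSCP matrix
wilks_regression = np.linalg.det(E) / np.linalg.det(T)

# Route 2: squared canonical correlations are the eigenvalues of
# Sxx^{-1} Sxy Syy^{-1} Syx
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
wilks_cca = np.prod(1 - rho_sq)

print(wilks_regression, wilks_cca)
```

The two printed values should agree up to floating-point error, which is why a single Wilks' lambda tests both hypotheses at once.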

We may also wish to test whether the second or the third canonical variate pair is correlated. We can do this in successive tests. Next, test the null hypothesis that neither the second nor the third canonical variate pair is correlated:

$$H_0: \rho_2 = \rho_3 = 0$$

We can look again at the SAS output above, in the second row, for the likelihood ratio test statistic and find L' = 0.19524; F = 18.53; d.f. = 6, 88; p < 0.0001. From this test we can conclude that the second canonical variate pair is correlated, $\rho_2 \neq 0$.

Finally, we can test the significance of the third canonical variate pair.

$$H_0: \rho_3 = 0$$

Again, we look at the SAS output above, this time in the third row, for the likelihood ratio test statistic and find L' = 0.8528; F = 3.88; d.f. = 2, 45; p = 0.0278. This is also significant, so we can conclude that the third canonical variate pair is correlated.

All three canonical variate pairs are significantly correlated and dependent on one another. This suggests that we would want to go ahead and summarize for all three pairs. In practice, these tests would be carried out successively until you find a non-significant result. Once a non-significant result is found you would stop. If this happens with the first canonical variate pair it suggests that there is no evidence of any relationship between the two sets of variables and the analysis may be stopped.

If the first pair shows significance, then you move on to the second canonical variate pair. If this second pair is not significantly correlated then you would stop. If it was significant you would continue to the third pair, proceeding in this iterative manner through the pairs of canonical variates testing until you find non-significant results.
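The successive likelihood ratio statistics can be recomputed from the squared canonical correlations reported in the next section (98.9%, 77.11%, and 14.72%), since the statistic for testing pairs k+1 through p is the product of $(1 - \rho_i^2)$ over those pairs. The Python sketch below does this and converts each statistic to an F value with Rao's approximation. That this is the conversion behind the reported F values is an assumption, although it does reproduce the degrees of freedom quoted above; the function name `sequential_wilks` is hypothetical.

```python
import math

def sequential_wilks(rho_sq, n, p, q):
    """Rows of (lambda, F, df1, df2) for H0: rho_{k+1} = ... = rho_p = 0, k = 0, 1, ..."""
    rows = []
    w = n - 1 - (p + q + 1) / 2
    for k in range(len(rho_sq)):
        # Likelihood ratio statistic for the pairs that remain after the first k
        lam = math.prod(1.0 - r for r in rho_sq[k:])
        pk, qk = p - k, q - k                 # dimensions left after removing k pairs
        df1 = pk * qk
        denom = pk**2 + qk**2 - 5
        # Rao's approximation (assumed); t is taken as 1 when the ratio is 0/0
        t = math.sqrt((pk**2 * qk**2 - 4) / denom) if denom > 0 else 1.0
        df2 = w * t - df1 / 2 + 1
        root = lam ** (1 / t)
        F = (1 - root) / root * df2 / df1
        rows.append((lam, F, df1, df2))
    return rows

# Squared canonical correlations as reported in the lesson; n, p, q from the example
rows = sequential_wilks([0.989, 0.7711, 0.1472], n=50, p=3, q=4)
for lam, F, df1, df2 in rows:
    print(f"lambda = {lam:.5f}  F = {F:.2f}  d.f. = {df1}, {df2:.0f}")
```

Because the inputs are rounded percentages, the results should match the reported statistics only to within rounding.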

4 - Obtain Estimates of Canonical Correlation

Now that we have tested the hypotheses of independence and have rejected them, the next step is to obtain estimates of canonical correlation.

The estimated canonical correlations are found at the top of page 1 in the SAS output as shown below:

SAS Output

The squared canonical correlations, found in the last column, can be interpreted in much the same way as $r^2$ values are interpreted.

We see that 98.9% of the variation in $U_1$ is explained by the variation in $V_1$, and 77.11% of the variation in $U_2$ is explained by $V_2$, but only 14.72% of the variation in $U_3$ is explained by $V_3$. The first two canonical correlations are very high, which suggests that only the first two canonical variate pairs are important.

One can actually see this from the plots that the SAS program generated. Here is the scatter plot for the first canonical variate pair, in which the first canonical variate for sales is plotted against the first canonical variate for scores.

SAS Plot

The program has also drawn the regression line to show how well the data fit. The plot of the second canonical variate pair is a bit more scattered:

SAS Plot

It is still a reasonably good fit, however. A plot of the third pair would show little of the same kind of fit. One may decide at this point to refer only to the first two canonical variate pairs from here on, based on the observation that the third squared canonical correlation is so small.
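The $r^2$ reading of the squared canonical correlations, and the regression line drawn on the plots, can be checked with a short Python sketch: regressing the first canonical variate of one set on the first canonical variate of the other gives an $R^2$ equal to the squared first canonical correlation, and, because both variates have unit variance, a slope equal to the canonical correlation itself. The data here are synthetic (not the sales data), and the helper `cca` is an illustrative implementation, not the SAS procedure.

```python
import numpy as np

def cca(X, Y):
    """Return (A, B, rho); row i of A and B holds the coefficients of U_i and V_i."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
    W, rho, Zt = np.linalg.svd(K, full_matrices=False)
    return np.linalg.solve(Lx.T, W).T, np.linalg.solve(Ly.T, Zt.T).T, rho

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

A, B, rho = cca(X, Y)
u1 = (X - X.mean(axis=0)) @ A[0]     # first canonical variate of the X set
v1 = (Y - Y.mean(axis=0)) @ B[0]     # first canonical variate of the Y set

# Simple linear regression of u1 on v1 (the line drawn on the scatter plot)
slope, intercept = np.polyfit(v1, u1, 1)
resid = u1 - (slope * v1 + intercept)
r2 = 1 - (resid @ resid) / ((u1 - u1.mean()) @ (u1 - u1.mean()))

print(r2, rho[0] ** 2)
print(slope, rho[0])
```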

5 - Obtain the Canonical Coefficients

Page 2 of the SAS output provides the estimated canonical coefficients ($a_{ij}$) for the sales variables, which are given in the following table.

SAS Output

Thus, using the coefficient values in the first column, the first canonical variable for sales can be determined using the following formula:

$$U_1 = 0.0624\,X_{\text{growth}} + 0.0209\,X_{\text{profit}} + 0.0783\,X_{\text{new}}$$

Likewise, the estimated canonical coefficients ($b_{ij}$) for the test scores are located in the next table in the SAS output:

SAS Output

Thus, using the coefficient values in the first column, the first canonical variable for test scores can be determined using a similar formula:

$$V_1 = 0.0697\,Y_{\text{create}} + 0.0307\,Y_{\text{mech}} + 0.0896\,Y_{\text{abstract}} + 0.0628\,Y_{\text{math}}$$

In both cases, the magnitudes of the coefficients give the contributions of the individual variables to the corresponding canonical variable. However, just like in principal components analysis, these magnitudes also depend on the variances of the corresponding variables. Unlike principal components analysis however, standardizing the data has no impact on the canonical correlations.
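The effect of standardization can be seen in a small Python sketch: the canonical correlations are identical for raw and standardized data, while each raw coefficient is simply multiplied by the standard deviation of its variable. The data are synthetic, with deliberately different scales for the X variables, and the helper `cca` is an illustrative implementation rather than the SAS procedure.

```python
import numpy as np

def cca(X, Y):
    """Return (A, B, rho); row i of A and B holds the coefficients of U_i and V_i."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
    W, rho, Zt = np.linalg.svd(K, full_matrices=False)
    return np.linalg.solve(Lx.T, W).T, np.linalg.solve(Ly.T, Zt.T).T, rho

# Synthetic data (an assumption); the X variables get very different scales
rng = np.random.default_rng(3)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = (Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))) * np.array([1.0, 10.0, 100.0])
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

sx = X.std(axis=0, ddof=1)
sy = Y.std(axis=0, ddof=1)
Xs = (X - X.mean(axis=0)) / sx       # standardized copies of both sets
Ys = (Y - Y.mean(axis=0)) / sy

A_raw, B_raw, rho_raw = cca(X, Y)
A_std, B_std, rho_std = cca(Xs, Ys)

print(np.round(rho_raw, 4), np.round(rho_std, 4))   # same canonical correlations
print(np.round(np.abs(A_raw[0]), 4))                # raw coefficients depend on scale
print(np.round(np.abs(A_std[0]), 4))                # standardized coefficients
```

Absolute values are printed because the sign of each coefficient vector is arbitrary. The standardized coefficients are the ones to compare across variables when the variables are measured in different units.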

6 - Interpret Each Component

To interpret each component, we must compute the correlations between each variable and the corresponding canonical variate.

a. The correlations between the sales variables and the canonical variables for Sales Performance are found at the top of the fourth page of the SAS output in the following table:

SAS Output

Looking at the first canonical variable for sales, we see that all correlations are uniformly large. Therefore, you can think of this canonical variate as an overall measure of Sales Performance. For the second canonical variable for Sales Performance, none of the correlations is particularly large, and so this canonical variable yields little information about the data. Recall that we decided earlier not to look at the third canonical variate pair.

A similar interpretation can take place with the Test Scores.

b. The correlations between the test scores and the canonical variables for Test Scores are also found in the SAS output:

SAS Output

Since all correlations are large for the first canonical variable, this can be thought of as an overall measure of test performance as well. However, it is most strongly correlated with mathematics test scores. Most of the correlations with the second canonical variable are small. There is some suggestion that this variable may be negatively correlated with abstract reasoning.

c. Putting (a) and (b) together, we see that the best predictor of sales performance is mathematics test scores as this indicator stands out most.
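The correlations used for interpretation in (a) and (b) can be computed directly once the canonical variates are available. The Python sketch below does so for synthetic data (not the sales data) and confirms that the same numbers follow from the covariance matrix and the canonical coefficients. The helper names `cca` and `cross_corr` are assumptions made for illustration.

```python
import numpy as np

def cca(X, Y):
    """Return (A, B, rho); row i of A and B holds the coefficients of U_i and V_i."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
    W, rho, Zt = np.linalg.svd(K, full_matrices=False)
    return np.linalg.solve(Lx.T, W).T, np.linalg.solve(Ly.T, Zt.T).T, rho

def cross_corr(P, Q):
    """Correlation of every column of P with every column of Q."""
    k = P.shape[1]
    return np.corrcoef(P, Q, rowvar=False)[:k, k:]

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(4)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

A, B, rho = cca(X, Y)
U = (X - X.mean(axis=0)) @ A.T
V = (Y - Y.mean(axis=0)) @ B.T

# Row k, column i: correlation of the kth variable with the ith canonical variate
load_x = cross_corr(X, U)
load_y = cross_corr(Y, V)

# Same quantity from the covariance matrix: cov(X, U) = Sxx A^T and var(U_i) = 1
Sxx = np.cov(X, rowvar=False)
load_x_formula = (Sxx @ A.T) / np.sqrt(np.diag(Sxx))[:, None]

print(np.round(load_x, 3))
print(np.round(load_y, 3))
```

A column whose entries all have large magnitude with a common sign corresponds to an "overall" variate of the kind described above.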

7 - Reinforcing the Results

These results can be further reinforced by looking at the correlations between each set of variables and the opposite group of canonical variates.

a. The correlations between the sales variables and the first canonical variate for test scores are found on page 4 of the SAS output and have been inserted below:

SAS Output

We can see that all three of these correlations are strong and show a pattern similar to that with the canonical variate for sales. The reason for this is obvious: The first canonical correlation is very high.

b. The correlations between the test scores and the first canonical variate for sales have also been inserted here from the SAS output:

SAS Output

Note that these also show a pattern similar to that with the canonical variate for test scores. Again, this is because the first canonical correlation is very high.

c. These results confirm that sales performance is best predicted by mathematics test scores.
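The similarity of the patterns in (a) and (b) has a precise form: the correlation of a variable with the ith canonical variate of the opposite set equals its correlation with the ith canonical variate of its own set multiplied by $\rho_i$. When $\rho_i$ is close to one, the two sets of correlations are therefore nearly identical. The Python sketch below checks this identity on synthetic data; the helper names `cca` and `cross_corr` are assumptions for illustration.

```python
import numpy as np

def cca(X, Y):
    """Return (A, B, rho); row i of A and B holds the coefficients of U_i and V_i."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
    W, rho, Zt = np.linalg.svd(K, full_matrices=False)
    return np.linalg.solve(Lx.T, W).T, np.linalg.solve(Ly.T, Zt.T).T, rho

def cross_corr(P, Q):
    """Correlation of every column of P with every column of Q."""
    k = P.shape[1]
    return np.corrcoef(P, Q, rowvar=False)[:k, k:]

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(5)
n, p, q = 50, 3, 4
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

A, B, rho = cca(X, Y)
U = (X - X.mean(axis=0)) @ A.T
V = (Y - Y.mean(axis=0)) @ B.T

own_x = cross_corr(X, U)      # X variables with their own canonical variates
opp_x = cross_corr(X, V)      # X variables with the opposite set's variates
own_y = cross_corr(Y, V)
opp_y = cross_corr(Y, U)

# Column i of the opposite-set correlations is column i of the own-set
# correlations scaled by rho_i
print(np.round(opp_x, 3))
print(np.round(own_x * rho, 3))
```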

8 - Summary

In this lesson we learned about:

  • How to test for independence between two sets of variables;
  • How to determine the number of significant canonical variate pairs;
  • How to compute the canonical variates from the data;
  • How to interpret each member of a canonical variate pair using its correlations with the member variables;
  • How to use the results of canonical correlation analysis to describe the relationships between two sets of variables.

Next, complete the homework problems that will give you a chance to put what you have learned to use...
