EcoPlate Analysis (3)

5) Fundamentals of multivariate analysis 1

Here I will explain the basics of statistics for the multidimensional data often encountered in ecology, such as EcoPlate data (31-dimensional vectors) and the species composition of communities, i.e., "multivariate data". You can refer to Chapters 8 and 10 of the Tree diversity analysis manual available on the Tree Diversity Analysis website. I encourage you to read this manual carefully for a more systematic treatment.

In this session, I would like to convey the "image" of multivariate analysis using part of the EcoPlate data (osaka_summary_integ.csv) prepared in Ecoplate Analysis (2). There is a mathematical theory behind the analytical methods that simplify the data and draw clear diagrams. To understand it properly, at least some knowledge of linear algebra is necessary, but here I omit all mathematical details such as matrix calculations and eigenvalues.

Visualization of complex data

Let's start with osaka_summary_integ.csv; for example, open it in Excel (and, in the meantime, save it again in Excel format). The thirty-one values in each row (V1 to V31) indicate the metabolic capacity for each of the 31 carbon substrates in each sample. The problem is how to visualize the variation between samples while taking the whole set of 31 values per sample into consideration. That part is easy: you can draw a bar chart in Excel. If you draw a graph for all 32 samples from top to bottom, it looks as follows.
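If you prefer to draw the same kind of chart in R, here is a minimal sketch (not part of the original workflow; it assumes osaka_summary_integ.csv is in your working directory and that its first 31 columns are V1-V31, as produced in Ecoplate Analysis (2)):

#sketch: bar charts of the 31 substrate values, one panel per sample (first 4 shown)
osaka_summary_integ_ave <- read.csv("osaka_summary_integ.csv")
op <- par(mfrow=c(4,1), mar=c(2,4,1,1))   #four panels stacked vertically
for (i in 1:4) {
  barplot(unlist(osaka_summary_integ_ave[i,1:31]),   #row i as a named vector
          las=2, ylab=paste("sample", i))
}
par(op)   #restore the previous graphics settings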

This is quite a busy graph. By the way, according to metadata_osaka.csv, samples 1-18 are from higashihori and samples 19-32 are from honmachi. Is there a difference in carbon decomposition ability between higashihori and honmachi? The goal of this section is to learn how to answer this question. Specifically, we will explain two things: 1) how to visualize the data so that you can see the difference at a glance, and 2) how to test whether the difference is a random event or not.

Working with the 31-dimensional data straight away makes it hard to get an intuitive image, so let's think about visualization methods using only the first two columns, V1 and V2. First, I will make a bar chart similar to the figure above.

Next, let's draw a scatter plot using R. Basically, you can draw an x-y scatter plot with plot(x, y). The plot() function has various options that change the design of the drawing. Here we simply distinguish the color of each point by the value of metadata_osaka$place_sample ("higashihori" or "honmachi") and use the option cex to specify the size of each point.

#check the numeric coding of the two sites (use as.factor() first if place_sample was read as character)
as.numeric(metadata_osaka$place_sample)
plot(osaka_summary_integ_ave$V1, osaka_summary_integ_ave$V2,
     col=2*as.numeric(metadata_osaka$place_sample), pch=as.numeric(metadata_osaka$place_sample), cex=2)

The result of this script is as follows (I changed Width and Height for better appearance). Two things can be seen from this graph: 1) the values of V1 and V2 are distributed in a pattern extending from the lower left to the upper right rather than being scattered randomly over the two-dimensional plane, and 2) the higashihori samples (red circles) and the honmachi samples (blue triangles) are not mixed together. Both features are obvious from the graph. This is the power of the two-dimensional scatter plot. It is amazing!

However, recall that the dimension of the original data is 31. We cannot draw a 31-dimensional scatter plot, and you cannot picture one in your mind unless you are a genius mathematician or physicist. It would be nice if the 31-dimensional data could be reduced to a two-dimensional scatter plot. A method that satisfies this desire is called "(unsupervised) dimension reduction" in statistics and machine learning.

Since humans can understand two-dimensional scatter plots intuitively, there is no need to make them any simpler, but let me first explain the core idea of this method with the two-dimensional scatter plot above. Given the distribution trend from the lower left to the upper right described above, setting the following one-dimensional axis can reduce the two dimensions to one.

If you use your mind's eye, you can somehow see the axis (black bold arrow) like the one below. It is important to note that the direction of this axis does not match the direction of an ordinary regression line. When a perpendicular is dropped from each point onto the axis, as shown by the dotted lines in the figure, the direction of the axis is determined so that the sum of the squared distances between the axis and the points is minimized. I will omit the mathematical explanation based on linear algebra, since it is available in many statistics textbooks and websites.
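If you would rather compute this axis than rely on the mind's eye, here is a minimal sketch using R's built-in prcomp() (not part of the original script; it assumes osaka_summary_integ_ave is already loaded):

#sketch: overlay the first principal-component axis on the V1-V2 scatter plot
pca2 <- prcomp(osaka_summary_integ_ave[,1:2])
ctr  <- colMeans(osaka_summary_integ_ave[,1:2])   #centroid of the point cloud
dir1 <- pca2$rotation[,1]                         #direction vector of PC1
plot(osaka_summary_integ_ave$V1, osaka_summary_integ_ave$V2, cex=2)
abline(a=ctr[2]-(dir1[2]/dir1[1])*ctr[1], b=dir1[2]/dir1[1], lwd=3)   #PC1 axis through the centroid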

Projecting the points onto this black axis (i.e., taking a mapping), we get a one-dimensional figure like the one below. In this dimension reduction from two dimensions to one, the axis is called a "principal component", and such an analysis is principal component analysis (PCA). We will skip the details of the script here, since it is explained again later, but this graph can be drawn by executing the following script.

library(vegan)
osaka_test0.d<-vegdist(osaka_summary_integ_ave[,1:2], method="euclidean")   #distance matrix of V1-V2
PCoA_osaka0 <- cmdscale(osaka_test0.d, k = 1)   #reduce to one dimension
x<-PCoA_osaka0[,1]
y<-rep(0, length=32)   #dummy y coordinates so the points sit on a line
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka$place_sample), pch=as.numeric(metadata_osaka$place_sample))

The problem is that, while reducing the 2D diagram to 1D makes it easier to understand, information is lost. There were three red circles enclosed in light blue in the two-dimensional scatter plot above, but on the one-dimensional plot based on the principal component they appear as only two points close to the leftmost triangle sample. Points that are clearly distinguished in two dimensions overlap on this principal component, which means that information has been lost. To take this problem into account, PCA routinely calculates what percentage of the original information is retained when the dimensions are reduced, but we will not explain this further here.
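For the curious, here is a minimal sketch of how that percentage can be obtained (not part of the original script; it reuses the distance matrix osaka_test0.d computed above):

#sketch: proportion of the total variance retained on the first axis
PCoA_eig <- cmdscale(osaka_test0.d, k = 1, eig = TRUE)   #keep the eigenvalues
PCoA_eig$eig[1] / sum(PCoA_eig$eig)   #share of the original information kept on axis 1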

Now, it may be difficult to grasp the image from a reduction of two dimensions to one alone, so this time let's confirm the power of PCA using the three columns V1, V2, and V3 of osaka_summary_integ_ave. First, simply draw a 3D scatter plot with the following script. You can easily see that the samples are not distributed homogeneously throughout the three-dimensional space.

#3D scatter plot
library(scatterplot3d)
scatterplot3d(osaka_summary_integ_ave$V1, osaka_summary_integ_ave$V2, osaka_summary_integ_ave$V3,
              color=2*as.numeric(metadata_osaka$place_sample),   #red or blue by site
              pch=1, cex.symbols=2, angle=50)

If you use your mind's eye again, you will see that all of the samples lie approximately on a plane, shown in light blue in the figure below.

What you grasped with your mind's eye can be done systematically by PCA. Let's write the PCA script properly, reducing the three-dimensional data to two dimensions and plotting them (this script is explained later). The graph should look like the figure below.

osaka_test.d<-vegdist(osaka_summary_integ_ave[,1:3], method="euclidean")   #distance matrix of V1-V3
PCoA_osaka <- cmdscale(osaka_test.d, k = 2)   #reduce to two dimensions
x<-PCoA_osaka[,1]
y<-PCoA_osaka[,2]
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka$place_sample))

This is the core idea of PCA (principal component analysis).

Calculation of Ecological Dissimilarity: Quantifying "similar / dissimilar" and "close / far"

Well, why is a two-dimensional scatter plot easy to understand? We have the feeling that nearby points are similar and that more distant points are less similar. So what does it mean to be "similar / dissimilar" or "close / far"? We need to think about this more deeply when analyzing multivariate data in ecology, such as EcoPlate data. Let's consider a simplified example. Suppose there are four samples (A, B, C, D) with hypothetical values A (3.0, 3.5), B (0.0, 0.5), C (1.0, 0.0), and D (6.0, 6.0), representing the capacity of their microbial communities to decompose the substrates V1 and V2.

################Ecological Distance
A<-c(3.0,3.5)
B<-c(0.0,0.5)
C<-c(1.0,0.0)
D<-c(6.0,6.0)
test_sample<-as.data.frame(rbind(A,B,C,D))   #combine the four samples into a data frame
colnames(test_sample)<-c("V1","V2")
View(test_sample)

A simple script for drawing the 2D scatter plot is as follows.

#Simple 2d scatterplot
plot(test_sample, cex=0.5, xlim=c(0,7), ylim=c(0,7))
text(test_sample$V1, test_sample$V2, labels=rownames(test_sample), cex=1)   #label each point A-D

The graph is given below.

As you can see from the figure above, the distance between B and C is the shortest, which can be interpreted as their carbon-substrate decomposition capabilities being the "most similar". More specifically, to calculate the distances between samples we use the vegdist() function in the vegan package. What we call "distance" in everyday conversation is generally the "Euclidean distance", and we can calculate it with the option method="euclidean".

#examples of ecological dissimilarity
vegdist(test_sample, method="euclidean")  #Euclidean 

The result is as follows.

         A        B        C
B 4.242641                  
C 4.031129 1.118034         
D 3.905125 8.139410 7.810250

We can regard the calculation above as a process that evaluates the similarity of carbon-substrate decomposition capability between samples using the Euclidean distance: the (Euclidean) distance is larger when the similarity is smaller, that is, when the dissimilarity is larger. The problem is whether evaluating similarity by the Euclidean distance is appropriate from the viewpoint of our biological and/or chemical understanding. Examining the values A (3.0, 3.5), B (0.0, 0.5), C (1.0, 0.0), and D (6.0, 6.0) one by one, you can see that A and D can both decompose substrates V1 and V2, although with different strengths, whereas B can decompose only V2 and C only V1. A reasonable interpretation would therefore be that the similarity between A and D is high (dissimilarity is small) and the similarity between B and C is low (dissimilarity is large). However, since the Euclidean distance is calculated with the definition below, the absolute values of the elements (V1, V2) of A and D strongly influence the calculation, and the dissimilarity between them becomes large.
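Writing the substrate-use profiles of two samples as x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the Euclidean distance is

$$d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

For example, d(A, D) = sqrt((3.0 - 6.0)^2 + (3.5 - 6.0)^2) = sqrt(15.25) ≈ 3.905, matching the table above.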

A number of "non-Euclidean" distances have been defined (see Chapter 8 of the Tree diversity analysis manual for details). Many of them do not fulfill part of the metric conditions (see the Wikipedia article on "metric (mathematics)" for an explanation), but they can reflect the biological/chemical interpretation mentioned above more strongly.
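For reference, a distance d is called a metric if it satisfies all of the following conditions for any points x, y, z:

1) d(x, y) ≥ 0 (non-negativity)
2) d(x, y) = 0 if and only if x = y (identity of indiscernibles)
3) d(x, y) = d(y, x) (symmetry)
4) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)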

Here, let me first calculate the dissimilarity after converting the A-D values to binary (presence/absence) data. This is the Jaccard distance, and it can be calculated with the following script. An important technical precaution is that the Jaccard distance is also defined for continuous-valued data in general, so you need to specify the option binary=TRUE; otherwise unintended values will be returned.

#Non-Euclidean (& non-metric or semimetric) dissimilarity
vegdist(test_sample, method="jaccard", binary=TRUE) #Presence/absence

The result is given as follows.

    A   B   C
B 0.5        
C 0.5 1.0    
D 0.0 0.5 0.5

As can be seen, the Jaccard distance between A and D is the smallest (zero), while the distance between B and C is the largest (1.0), since there is no substrate that both B and C can decompose. As a mathematical precaution, the Jaccard distance computed this way does not satisfy part of the metric conditions, because the distance between A and D is zero even though they are different points (cf. the Euclidean distance is a metric because the distance between two points (samples) is zero only when the points are identical). For reference, the definition of the Jaccard distance in the binary case is shown below.
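$$d_{\mathrm{Jaccard}} = \frac{b + c}{a + b + c}$$

where a is the number of substrates decomposed by both samples, b the number decomposed only by the first sample, and c the number decomposed only by the second. For example, for B and C we have a = 0, b = 1, c = 1, so d = 2/2 = 1.0.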

Next, let's calculate the Bray-Curtis distance, which behaves similarly to the Jaccard distance but also reflects the continuous values of the data. As with the Jaccard distance, the distance between B and C is the maximum, while the distance between A and D is the minimum. A difference from the Jaccard distance is that, for example, the distance between A and B (0.857) now differs from that between D and B (0.920).

vegdist(test_sample, method="bray") #Bray-Curtis

The result is given as follows.

          A         B         C
B 0.8571429                    
C 0.7333333 1.0000000          
D 0.2972973 0.9200000 0.8461538

For reference, the definition of Bray-Curtis distance is shown below.
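$$d_{\mathrm{Bray\text{-}Curtis}}(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$$

For example, d(A, D) = (|3.0 - 6.0| + |3.5 - 6.0|) / (3.0 + 6.0 + 3.5 + 6.0) = 5.5/18.5 ≈ 0.297, matching the table above.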

Basics of Principal Coordinate Analysis (PCoA)

If you want to visualize the dissimilarity between samples based on one of the non-Euclidean distances introduced above, you cannot use principal component analysis (PCA). This is because PCA uses Pythagoras's theorem in the process of calculating the amount of information loss, and this theorem applies only in a world where the Euclidean distance is defined. An alternative to principal component analysis is a method called Principal Coordinate Analysis (PCoA). Although the details of the calculation will not be explained here, unlike PCA, PCoA does not start from the original data but from the information on the distances between samples (the distance matrix calculated by vegdist()). In PCoA, each sample is repositioned in a Euclidean space (where the Euclidean distance is defined) while maintaining, as far as possible, the distances (defined by the non-Euclidean measure) between samples.

Technical note

Since PCoA maps distances between samples defined by a non-Euclidean measure into the world of Euclidean distance, some distortion naturally arises. In other words, when very high-dimensional data are converted into a two-dimensional scatter plot, the Euclidean distances between samples on the plot are not equal to the original non-Euclidean distance values. However, when the samples are placed in a Euclidean space of sufficiently high dimension, the original distance values are in principle reproduced as Euclidean distances without distortion. In this sense, PCoA is a method that preserves distances, so it is also called metric multidimensional scaling (metric MDS). The method that breaks free of the spell of metricity and converts distances into ranks of dissimilarity is called non-metric multidimensional scaling (non-metric MDS, i.e., NMDS), but I will not explain it here. The following blog post explains NMDS very well.

https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/
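As a quick numerical check of the distance-preservation property mentioned above, here is a minimal sketch (not part of the original tutorial) using the test_sample data defined earlier:

#sketch: Euclidean distances among PCoA coordinates reproduce the original distances
library(vegan)
d_orig  <- vegdist(test_sample, method="euclidean")
coords  <- cmdscale(d_orig, k=2)   #two axes suffice for this two-dimensional example
d_embed <- dist(coords)            #Euclidean distances in the PCoA space
round(as.matrix(d_orig) - as.matrix(d_embed), 10)   #essentially all zeros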

Now, let's visualize the data A, B, C and D using PCoA.

The first plot is based on the Jaccard distance, which considers presence/absence only. The necessary steps are: (1) compute a distance matrix with the vegdist() function; (2) pass the distance matrix to cmdscale(), specifying k = 2 as the option to perform a two-dimensional PCoA; (3) obtain the coordinates (x, y) of each point in the two-dimensional Euclidean space from the PCoA result; and (4) draw a two-dimensional scatter plot by combining plot() and text(). The script is as follows.

#PCoA based on Jaccard distance
test_ja.d<-vegdist(test_sample, method="jaccard", binary=TRUE) #calculate Jaccard dissimilarity
PCoA_test_ja <- cmdscale(test_ja.d, k = 2)  #execute PCoA
x<-PCoA_test_ja[,1]   #extract the position of each sample (A,B,C,D)
y<-PCoA_test_ja[,2]
plot(x,y, cex=0.5,xlim=c(-0.5,0.5),ylim=c(0,0.1))  #2D scatter plot
text(x,y,labels=rownames(test_sample))   #add the sample labels

The result of the visualization is as follows. The differences in the Y-axis direction are quite small. Furthermore, since the distance between A and D is zero, the two points overlap completely.

Next is a plot using the Bray-Curtis distance. Just change the above script slightly, as follows.

#PCoA based on Bray-Curtis distance
test_bc.d<-vegdist(test_sample, method="bray") #calculate Bray-Curtis dissimilarity
PCoA_test_bc <- cmdscale(test_bc.d, k = 2)  #execute PCoA
x<-PCoA_test_bc[,1]   #extract the position of each sample (A,B,C,D)
y<-PCoA_test_bc[,2]
plot(x,y, cex=0.5,xlim=c(-0.7,0.4),ylim=c(-0.7,0.4))  #2D scatter plot
text(x,y,labels=rownames(test_sample))   #add the sample labels

The figure is as follows. It looks quite different from the result based on the Jaccard distance, but the overall ordering of the distances is basically similar.

Finally, here is the PCoA diagram based on the Euclidean distance. If you set the aspect ratio of the figure to 1:1, you will see that the distribution of points is almost the same as in the original two-dimensional scatter plot.

#PCoA based on Euclidean distance
test_eu.d<-vegdist(test_sample, method="euclidean") #calculate Euclidean dissimilarity
PCoA_test_eu <- cmdscale(test_eu.d, k = 2)  #execute PCoA
x<-PCoA_test_eu[,1]   #extract the position of each sample (A,B,C,D)
y<-PCoA_test_eu[,2]
plot(x,y, cex=0.5,xlim=c(-3,6),ylim=c(-3,6))  #2D scatter plot
text(x,y,labels=rownames(test_sample))   #add the sample labels

Although the result above is a PCoA scatter plot based on the Euclidean distance, it is in fact exactly the same as the result of a principal component analysis (PCA). To perform PCA with the rda() function in the vegan package, we usually prepare the following script. With the visualization option scaling=1, the distances between samples in the plot correspond to the Euclidean distances between samples.

##########Finally, show PCA with scale 1 (see Biodiversity manual Chapter 10)
PCA_test<-rda(PCA_test<-test_sample)   #rda() without constraints performs PCA
plot1<-ordiplot(PCA_test, scaling=1, type="text")   #scaling=1 preserves inter-sample distances

Although the direction of the x-axis is opposite to that of the PCoA above (the sign of an axis is arbitrary in these methods), you can see that the relative positions of the points are almost the same.

There are many more details, such as what the PC axes represent and how much explanatory power they have, but this section ends here without touching on them.