Here we use EcoPlate data for visualization by PCoA and for testing differences between treatments (here, sampling locations).
In EcoPlate analysis (3) we briefly explained ecological dissimilarity. A more systematic review, especially of the meanings of different dissimilarity measures, is provided by Anderson et al. (2011) Navigating the multiple meanings of beta diversity: a roadmap for the practicing ecologist. Ecology Letters 14: 19-28. In particular, check Figure 5 in that article.
First of all, as preparation for the later steps, load the names of the 31 carbon substrates in the EcoPlate into the script.
#Prepare the list of consistent names of the chemicals included in the ecoplate
chem_name<-c("Pyruvic-Acid-Methyl-Ester", "Tween-40", "Tween-80","alpha-Cyclodextrin", "Glycogen", "D-Cellobiose","alpha-D-Lactose", "beta-Methyl-D-Glucoside", "D-Xylose", "i-Erythritol", "D-Mannitol","N-Acetyl-D-Glucosamine", "D-Glucosaminic-Acid", "Glucose-1-Phosphate","alpha-Glycerol-Phosphate","D-Galactonic-Acid-gamma-Lactone", "D-Galacturonic-Acid", "2-Hydroxy-Benzoic-Acid", "4-Hydroxy-Benzoic-Acid", "gamma-Hydroxybutyric-Acid", "Itaconic-Acid", "alpha-Ketobutyric-Acid", "D-Malic-Acid", "L-Arginine", "L-Asparagine", "L-Phenylalanine", "L-Serine", "L-Threonine", "Glycyl-L-Glutamic-Acid", "Phenylethyl-amine", "Putrescine")
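After typing such a long vector, a quick check (optional, not part of the original workflow) confirms that all substrate names were entered:
#sanity check: there should be exactly 31 substrate names
length(chem_name)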
Then let's name the rows and columns, using osaka_summary_integ_max (obtained by temporal integration and taking the maximum value of triplicates) as an example.
rownames(osaka_summary_integ_max)<-metadata_osaka$sample_ID
colnames(osaka_summary_integ_max)<-chem_name
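You can quickly confirm that the labels were applied as intended (an optional check):
#check dimensions (samples x 31 substrates) and a labeled corner of the data
dim(osaka_summary_integ_max)
osaka_summary_integ_max[1:3, 1:3]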
Next, we convert negative values to zero as preprocessing. A negative value means that the well coloration was weaker than the control, so we convert it to zero and interpret it as a complete lack of decomposition ability for the corresponding substrate.
#first convert negative values into zero (negative values imply that the color development was weaker than control)
osaka_summary_integ_max[osaka_summary_integ_max<0]<-0
Next, calculate the Euclidean distance and the Bray-Curtis distance. The function vegdist() comes from the vegan package, so load the package first if it is not already loaded.
#calculate Euclidean distance
library(vegan) #also provides decostand() and capscale() used below
osaka_test_eu.d<-vegdist(osaka_summary_integ_max, method="euclidean")
#calculate Bray-Curtis distance
osaka_test_bc.d<-vegdist(osaka_summary_integ_max, method="bray")
When you run the script to calculate the Bray-Curtis distance, a warning message is displayed on the console. This is due to the presence of samples (rows) where all values are zero.
> osaka_test_bc.d<-vegdist(osaka_summary_integ_max, method="bray")
Warning message:
In vegdist(osaka_summary_integ_max, method = "bray") :
you have empty rows: their dissimilarities may be meaningless in method “bray”
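Rather than searching by eye, you can locate such rows programmatically (an optional check; here it should point to the 2nd row):
#identify samples (rows) in which all values are zero
which(rowSums(osaka_summary_integ_max) == 0)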
So delete that sample (the 2nd row) from both the data and the metadata.
#Exclude samples with all zero values from data and metadata
osaka_summary_integ_max<-osaka_summary_integ_max[-2,]
metadata_osaka<-metadata_osaka[-2,]
Then calculate the Bray-Curtis distance again. No warning should appear this time.
#calculate Bray-Curtis distance
osaka_test_bc.d<-vegdist(osaka_summary_integ_max, method="bray")
Next, you can try calculating yet another distance. As shown in the Excel graph at the beginning of EcoPlate analysis (3), the values from Sample 13 onward are large overall. To easily remove the influence of the absolute values of these samples, the raw data should be converted to proportion (percentage) data. The vegan package provides a function called decostand(), which standardizes the raw data to proportions when you specify method="total".
#Calculate proportion
osaka_summary_integ_max_prop<-decostand(osaka_summary_integ_max, method="total")
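As a quick check (optional), after this standardization the values in each row should sum to 1:
#optional check: each row of the proportion data should sum to 1
rowSums(osaka_summary_integ_max_prop)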
Then calculate Euclidean distance with this standardized data.
#calculate Euclidean distance on proportion
osaka_test_prop_eu.d<-vegdist(osaka_summary_integ_max_prop, method="euclidean")
To execute PCoA using these distances, you can write the following code and visualize each result as a graph, one by one.
#PCoA for each distance: first, Euclidean distance on the raw data
PCoA_osaka_eu <- cmdscale(osaka_test_eu.d, k = 2)
x<-PCoA_osaka_eu[,1]
y<-PCoA_osaka_eu[,2]
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka$place_sample),pch=as.numeric(metadata_osaka$place_sample))
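#second, Euclidean distance on the proportion data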
PCoA_osaka_prop_eu <- cmdscale(osaka_test_prop_eu.d, k = 2)
x<-PCoA_osaka_prop_eu[,1]
y<-PCoA_osaka_prop_eu[,2]
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka$place_sample),pch=as.numeric(metadata_osaka$place_sample))
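#third, Bray-Curtis distance on the raw data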
PCoA_osaka_bc <- cmdscale(osaka_test_bc.d, k = 2)
x<-PCoA_osaka_bc[,1]
y<-PCoA_osaka_bc[,2]
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka$place_sample),pch=as.numeric(metadata_osaka$place_sample))
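To tell the symbols apart, you can add a legend after each plot() call. This is a small optional addition; it assumes metadata_osaka$place_sample is a factor (which the plot() calls above already require), so that the two locations map to symbols 1 (circle) and 2 (triangle):
#add a legend matching the color/symbol coding used in plot() above
legend("topright", legend=levels(metadata_osaka$place_sample), col=2*(1:2), pch=1:2, pt.cex=2)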
The result is as follows.
These results clearly demonstrate that the distribution of samples varies depending on the definition of distance. There are two causes of the difference between the results using Euclidean distance and those using Bray-Curtis distance (see the paper introduced at the beginning). First, joint absences contribute to lower dissimilarity (higher similarity) under Euclidean distance, while such items are ignored by Bray-Curtis distance. Second, as confirmed in EcoPlate analysis (3), Euclidean distance strongly reflects the absolute values of the raw data, whereas Bray-Curtis distance reflects composition and relative abundance more strongly. The influence of the absolute values is weakened in Euclidean distance on proportions; this is why the Euclidean-on-proportion result is closer to the Bray-Curtis result on the raw data.
Next, I will talk about statistical tests. Looking at the PCoA graphs above, there seems to be some difference between the sampling locations: higashihori (circles) vs. honmachi (triangles). However, because there is variation among the samples within each sampling location, we cannot judge by eye whether such a difference arose by chance even though there is actually no difference between the places. In addition, it is not clear what percentage of the total variation among the data (= variation within higashihori + variation within honmachi + variation between higashihori and honmachi) is attributable to the variation between sampling places (or, more generally, between treatments). In other situations, you may want to know how much variation can be explained by a continuous explanatory variable such as temperature, rather than by a categorical variable like place.
If the variable you want to explain is not multivariate like the EcoPlate data, but univariate such as the number of species or primary production, you can use analysis of variance (ANOVA) or regression analysis. When the target is multivariate, we use redundancy analysis (RDA) or distance-based redundancy analysis (db-RDA), which combine principal component analysis (PCA) or principal coordinate analysis (PCoA) with regression analysis. To test differences between categorical variables, PERMANOVA (NPMANOVA) and PERMDISP are often used, but for simplicity I will not explain them here and will use distance-based RDA instead. Please refer to the web links for details; the FAQ of the vegan package is also very helpful.
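For reference only, a minimal PERMANOVA call with vegan's adonis2() would look like the sketch below, using the objects defined above; it is not used in the rest of this section:
#PERMANOVA sketch (for reference; not explained further here)
adonis2(osaka_test_bc.d~place_sample, data=metadata_osaka, permutations=1999)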
Since db-RDA is based on PCoA instead of PCA, it can be used with non-Euclidean distances. To run db-RDA, use the function capscale(). Inside the parentheses of the function, you write a statistical model formula. As an example, we use the data processed with Euclidean distance. Note that the goal here is to determine whether the differences in the carbon decomposition capabilities of microbial communities between samples (= the Euclidean distance: osaka_test_eu.d) are explained by sampling location (metadata_osaka$place_sample). The model formula is written in the form "response variable ~ explanatory variable" (or "dependent variable ~ independent variable") as follows.
osaka_test_eu.d~metadata_osaka$place_sample
To execute db-RDA and store the result in the new object osaka_test_eu_rda, execute the following script.
##distance-based RDA analysis
osaka_test_eu_rda<-capscale(osaka_test_eu.d~metadata_osaka$place_sample)
You can use parts of the result stored in osaka_test_eu_rda to calculate the proportion of variation explained by the explanatory variable(s) as follows. The calculated value is 0.4530333.
#The variance explained (constrained) by explanatory variable(s)
osaka_test_eu_rda$CCA$tot.chi/(osaka_test_eu_rda$CCA$tot.chi+osaka_test_eu_rda$CA$tot.chi)
You can also use the eigenvalues from the PCoA step within db-RDA to calculate the proportion explained, although the value is not exactly the same as above.
#Another way of calculation
summary(eigenvals(osaka_test_eu_rda))
This result is displayed as a list of eigenvalues on the console; the proportion explained by the explanatory variable (sampling location here) is the CAP1 value of Cumulative Proportion (i.e., 0.4530). In general, when there are n explanatory variables, n axes constrained by the explanatory variables are identified, so you only have to look at the Cumulative Proportion value just under CAPn.
> summary(eigenvals(osaka_test_eu_rda))
Importance of components:
CAP1 MDS1 MDS2
Eigenvalue 0.2696 0.2208 0.02764
Proportion Explained 0.4530 0.3710 0.04644
Cumulative Proportion 0.4530 0.8240 0.87049
Test by Permutation
In the above result, we found that 45% of the variation can be explained by the difference in sampling locations. But could a similar value also be obtained under the null model that "there is no difference in EcoPlate patterns between sampling locations"? If values such as 45% or 55% also arise from the null model of no difference, then that null model cannot be rejected on the basis of the observed 45%. In the analysis of variance (ANOVA) for univariate data, we take the ratio of the variation due to the explanatory variables per degree of freedom to the residual variation per degree of freedom, and call it the F value. A larger F value indicates a larger effect of the explanatory variables. In classical ANOVA, the probability distribution of the F value can be derived mathematically under the assumption of normally distributed data. Therefore, using the measured F value and this probability distribution, you can calculate the probability P that an F value larger than the measured one is generated under the null model. When this probability P is sufficiently low (for example P < 0.01), we reject the null model, interpreting the possibility that the data were generated from the null model as sufficiently low. This is the so-called P-value-based statistical test (frequentist inference). For our case as well, a pseudo-F value can be calculated from the result of the above db-RDA. According to the following script, the pseudo-F value is about 24.01.
#Check the pseudo-F value (cf. the F value in ANOVA)
df_re<-nrow(osaka_summary_integ_max) - 1 - 1 #residual degrees of freedom: number of samples - 1 (intercept) - 1 (explanatory term)
#the constrained term has 1 degree of freedom here, so its variation needs no further division
osaka_test_eu_rda$CCA$tot.chi/(osaka_test_eu_rda$CA$tot.chi/df_re)
The problem here is that the multivariate data we generally deal with in ecology do not follow a multivariate normal distribution. Therefore, the probability distribution of the pseudo-F value is not known. Under such circumstances, we generally regenerate data randomly from the data at hand and test by the method of "permutation". Let me explain briefly with a simple example, at some expense of accuracy. To generate data based on the null model "there is no difference between sampling locations", we shuffle the sampling-location labels (metadata_osaka$place_sample) relative to the data (osaka_summary_integ_max). The label shuffling can be done using the function sample() as follows.
####Example of permutation (shuffling)
set.seed(123) #fix the random seed
metadata_osaka_perm<-metadata_osaka
leng<-length(metadata_osaka[,1])
z<-sample(1:leng, leng, replace=FALSE) #re-sampling without replacement
z
for(i in 1:leng) metadata_osaka_perm$place_sample[i]<-metadata_osaka$place_sample[z[i]]
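Incidentally, the same label replacement can be written without a loop, using vectorized indexing (an equivalent one-liner):
#equivalent vectorized version of the label shuffling above
metadata_osaka_perm$place_sample<-metadata_osaka$place_sample[z]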
Label replacement is done with the for loop at the end of the script above. To execute PCoA and db-RDA using the Euclidean distance with these permuted labels and to calculate the F value, execute the following script.
#plot permuted data by PCoA
PCoA_osaka_eu <- cmdscale(osaka_test_eu.d, k = 2)
x<-PCoA_osaka_eu[,1]
y<-PCoA_osaka_eu[,2]
plot(x,y, cex=2, col=2*as.numeric(metadata_osaka_perm$place_sample),pch=as.numeric(metadata_osaka_perm$place_sample))
#db-RDA for permuted data
osaka_test_eu_perm_rda<-capscale(osaka_test_eu.d~metadata_osaka_perm$place_sample)
#Check the pseudo-F value for the permuted data
osaka_test_eu_perm_rda$CCA$tot.chi/(osaka_test_eu_perm_rda$CA$tot.chi/df_re)
After recording the F value displayed on the console, repeat the script from z<-sample(1:leng, leng, replace=FALSE) to shuffle the labels again and calculate a new F value. The results are summarized in the graph below: the labels are replaced, and the corresponding pseudo-F value is shown on each panel. You can see that the F values of the permuted data are much smaller than the observed value (original data).
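To make the idea concrete, here is a minimal sketch (under the same naive shuffling; the anova() call below does this properly) that repeats the permutation many times, collects the pseudo-F values, and computes the proportion that reach or exceed the observed one:
#manual permutation loop (illustration of the concept only)
set.seed(123)
n_perm<-999
F_perm<-numeric(n_perm)
for(j in 1:n_perm){
  z<-sample(1:leng, leng, replace=FALSE)
  place_perm<-metadata_osaka$place_sample[z] #shuffled labels
  rda_perm<-capscale(osaka_test_eu.d~place_perm)
  F_perm[j]<-rda_perm$CCA$tot.chi/(rda_perm$CA$tot.chi/df_re) #pseudo-F under the null model
}
F_obs<-osaka_test_eu_rda$CCA$tot.chi/(osaka_test_eu_rda$CA$tot.chi/df_re)
#empirical P value: proportion of permuted F values at least as large as the observed one
(sum(F_perm>=F_obs)+1)/(n_perm+1)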
Please note that the above explanation of permutation omits details for ease of understanding (i.e., the random shuffling above is too naive for an appropriate null model). Once you understand the concept, use the function anova() to repeat the permutation as many times as you like and to count how many times the F value from permutation exceeds the original F value. The script is very simple and looks as follows. The result is displayed on the console.
#permutation test
anova(osaka_test_eu_rda, permutations=1999, by='terms')
The following is the result shown on the console. You will see that the probability Pr(>F) that an F value from permutation reaches or exceeds the original F value is very low, 5e-04.
> anova(osaka_test_eu_rda, permutations=1999, by='terms')
Permutation test for capscale under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 1999
Model: capscale(formula = osaka_test_eu.d ~ metadata_osaka$place_sample)
Df SumOfSqs F Pr(>F)
metadata_osaka$place_sample 1 0.26962 24.02 5e-04 ***
Residual 29 0.32553
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Just as we did for the Euclidean distance on the raw data, you can also run db-RDA and the permutation test on the Euclidean distance on proportion data and on the Bray-Curtis distance, as follows.
#db-RDA analysis and permutation test: Euclidean distance on proportions, then Bray-Curtis distance
osaka_test_prop_eu_rda<-capscale(osaka_test_prop_eu.d~metadata_osaka$place_sample)
osaka_test_prop_eu_rda$CCA$tot.chi/(osaka_test_prop_eu_rda$CCA$tot.chi+osaka_test_prop_eu_rda$CA$tot.chi)
anova(osaka_test_prop_eu_rda, permutations=1999, by='terms')
osaka_test_bc_rda<-capscale(osaka_test_bc.d~metadata_osaka$place_sample)
osaka_test_bc_rda$CCA$tot.chi/(osaka_test_bc_rda$CCA$tot.chi+osaka_test_bc_rda$CA$tot.chi)
anova(osaka_test_bc_rda, permutations=1999, by='terms')
As a result, the proportion of variation explained by the difference between sampling locations is 12.6% when the Euclidean distance on proportions is used and 29.1% when the Bray-Curtis distance is used. This concludes the example of multivariate analysis using the EcoPlate as a model system. The same approach is of course applicable to any multivariate data in the ecological sciences.
This last section explains how to convert the multivariate data (the decomposition capabilities for the 31 carbon substrates in the EcoPlate) into a univariate index, "multifunctionality". There are several ways to calculate multifunctionality; please refer to our paper (T. Miki, T. Yokokawa, K. Matsui (2014) Proceedings of the Royal Society B 281: 20132498) and the references therein.
Here, we use one of the popular methods, quantile-based multifunctionality. We rank the values of each function (= each substrate) from smallest to largest, regard the lowest X% as having no effective decomposition capability, and binarize the data accordingly. The core function is quantile().
For example, after copying the data of osaka_summary_integ_max to a new data frame, you can calculate the 90% quantile from the bottom (smallest values) for, e.g., the fourth substrate, chem_name[4], with the following script.
#Convert data into binary values by quantile (quantile-based multifunctionality)
file_summary<-osaka_summary_integ_max
i<-4
chem_name[4]
thres<-0.9
quantile(file_summary[,i], thres)
If you apply this in a for loop, you can binarize the whole dataset.
mf<-list() #prepare the output list
thres<-0.9
for(i in 1:ncol(file_summary)) mf[[i]] <-(file_summary[,i] > quantile(file_summary[,i], thres)) #TRUE if the color value is greater than the quantile threshold
M_mf<-as.data.frame(mf) #converted into dataframe
row.names(M_mf)<-metadata_osaka$sample_ID
colnames(M_mf)<-chem_name
M_mf[M_mf==TRUE]<-1 #convert TRUE/FALSE to 1/0; M_mf is now a data frame representing the binary matrix (E_BC in Figure 3)
View(M_mf)
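As a quick plausibility check (optional), with thres=0.9 each substrate column should contain roughly 10% ones, i.e., about 3 of the 31 samples:
#optional check: number of samples above the 90% quantile for each substrate
colSums(M_mf)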
From this binarized data, you can calculate the multifunctionality of each sample as an integer between 0 and 31 by taking the row sums with the function apply().
#calculate multifunctionality
MF_osaka<-apply(M_mf,1,sum)
To visualize the differences between the sampling locations with a boxplot, first separate the data by location and then use boxplot().
#separate data for boxplot
higashihori_MF<-MF_osaka[metadata_osaka$place_sample=="higashihori"]
honmachi_MF<-MF_osaka[metadata_osaka$place_sample=="honmachi"]
boxplot(higashihori_MF, honmachi_MF, names=c("higashihori","honmachi"))
The result is as follows. Although the difference has not been statistically tested here, you can see that multifunctionality tends to be higher at honmachi. This is also one of the commonly used analyses of multivariate variables, especially of ecosystem-function parameters.
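If you do want a quick test of this univariate difference, a Wilcoxon rank-sum test is one natural option (a minimal sketch, not part of the original workflow; ties among the integer scores will trigger a harmless warning about exact P values):
#Wilcoxon rank-sum test for the difference in multifunctionality between locations
wilcox.test(higashihori_MF, honmachi_MF)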