Folder : projects/lmelissa_hostAdaptation/bayPass
The executable file(g_baypass) for the program is in this folder : ./baypass_2.1/sources/g_baypass
Folder allpop: 25 populations
allpop_genofile = allelecount input file
environfile = environment covariable input file
bfmeansWithScafPosition_all = final file for futher analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3
popplant.txt = populations and there associated plants used in this analysis
job...sh = example bash scripts to run baypass on cluster
Folder west: 17 populations in this set
westpop_genofile = allele count input file
west_environfile = environment covariable input file
bfmeansWithScafPosition_west = final file for further analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3
job...sh = example bash scripts to run baypass on cluster
Folder east: 8 populations in this set
eastpop_genofile = allele count input file
east_environfile = environment covariable input file
bfmeansWithScafPosition_east = final file for further analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3
The main bayPass folder contains the following files:
runBayPassFork.pl = this is to run multiple runs on the cluster at the same time and is an alternate when the walltime on the UofU cluster runs out. This is what I used for re running the analysis on March 15th 2017.
runBayPass.sh = bash script which runs the perl script above. I submitted this script to the cluster for running the analysis.
locuslist.txt = contains scaffold, positions for all the snps in the analysis
readmefile = just contains all the list of files in the folder
folders figures, jan17finalAnalysis, scripts can be removed. These have files from the old runs. I will delete all this eventually.
Here are the steps to run this analysis and then use this data to prepare final files:
STEP 1
First we need the genofile as the input genotype file for each population for this analysis. To prepare this file I used the Lmelissa_inputfile.txt which I prepared for the Treemix file. This file includes 27 populations (25 melissa + 2 anna). Therefore, I edited this file to have only 25 melissa populations which will be used for all further analysis. After editing the file with all 25 populations for bayPass analysis is saved as allPop_correct. As I had to rerun this analysis for eastern and western populations: I created two more input files called west_genofile (17 populations) and east_genofile (8 populations). I just selected the populations from the main file in R and created seperate files for these two population sets.
R code for this:
#all populations
main = read.table("Lmelissa_inputfile.txt", header=T)
main = main[,-8] #drop FCR
main = main[,-25] #drop YBG
write.table(main, file="allpop_correct", quote=FALSE, row.names=FALSE) #this file is now saved as allpop_genofile in the folder allpop
#western populations
west = main[,-c(2,9,15,17,19,20,22,24)] #dropped eastern populations
write.table(west, file="westpop_correct", quote=FALSE, row.names=FALSE) #this file is now saved as west_genofile in the folder west
#repeat to get a file for eastern populations
Finally baypass needs environment covariable files to calculate bayes factors. Precisely, for our analysis this file had information about the associated host-plant (lupine or alfalfa) for each population. Alfalfa is marked as 1 and lupine/other legumes as 0. These files are saved as environfile.txt, west_environfile.txt, east_environfile.txt for each set of analysis.
So finally I ran 4 chains of each baypass analysis for all populations, eastern and western populations. These were through bash scripts saved in the respective folders.
STEP 2
Once the jobs are run and the analysis is complete, I needed to combine the bayesfactors from each of the runs. Mainly, I took averages of the bayesfactors from the four runs for each population set. There will be the following files created for each run (just replace eastpop with all or westpop).
eastpopD_DIC.out
eastpopD_mat_omega.out
eastpopD_summary_betai_reg.out -- this is the file with bayesfactors
eastpopD_summary_beta_params.out
eastpopD_summary_lda_omega.out
eastpopD_summary_pij.out
eastpopD_summary_pi_xtx.out
Refer to the finalcode_filelog.R for the code for this. This is how I start the main analysis and prepare files for all further analysis.
I combine average bayesfactors with scaffold and positions to do further analysis. This is done for all, east and west and final files are saved as bfmeansWithScafPositions_all,west,east. This is the file which will be used for all further analysis.
R CODE to prepare final data files for analysis
###create locuslist which contains scaffolds and positions###
scafpos = read.table("p_ABC.txt", header=F)
main = scafpos[,1]
write.table(main, file="locuslist.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
############### TAKE BF MEANS AND PREPARE FILE FOR FURTHER ANALYSIS ###################################
HERE IS AN EXAMPLE FOR THE WEST POPULATIONS
#read in the input files for each run
w0 = read.table("west0_summary_betai_reg.out", header=T)
w1 = read.table("west1_summary_betai_reg.out", header=T)
w2 = read.table("west2_summary_betai_reg.out", header=T)
w3 = read.table("west3_summary_betai_reg.out", header=T)
#get correlations between bayesfactors from different runs
cor.test(w0$BF.dB., w1$BF.dB.) #0.68 #0.65 for all pop, 0.54 for east pop
cor.test(w0$BF.dB., w2$BF.dB.)
cor.test(w0$BF.dB., w3$BF.dB.)
cor.test(w0$BF.dB., w4$BF.dB.)
cor.test(w2$BF.dB., w3$BF.dB.)
cor.test(w1$BF.dB., w3$BF.dB.)
cor.test(w1$BF.dB., w2$BF.dB.)
w01 = (w0$BF.dB. + w1$BF.dB.) / 2
w23 = (w2$BF.dB. + w3$BF.dB.) / 2
cor.test(w01, w23) #west= 0.81, all = 0.79, all = 0.69
#get bayesfactor means from all 4 runs
bfmeans = (w0$BF.dB. + w1$BF.dB. + w2$BF.dB. + w3$BF.dB.) / 4
length(bfmeans) #206,028 SNPs
#read in the locus list containing scaffold and positions for the snps
llist = read.table("../locuslist.txt", header=TRUE)
#combine locus list and bayesfactor means to get the final file
w_final = cbind(llist, bfmeans)
head(w_final)
dim(w_final)
write.table(w_final, file = "bfmeansWithScafPosition_west", col.names=TRUE, row.names=FALSE, sep= "\t", quote=FALSE)
#repeat the same for east runs and all populations
STEP 3
#####correlations between east and west ###################
all = read.table("bfmeansWithScafPosition_all", header=TRUE)
west = read.table("bfmeansWithScafPosition_west", header=TRUE)
east = read.table("bfmeansWithScafPosition_east", header=TRUE)
########### doing quantiles to find top SNPs ###########################
ecomb = east$bfmeans
wcomb = west$bfmeans
cor.test(ecomb,wcomb) ##0.06 #p-value < 2.2e-16
bndsw = quantile(wcomb,probs=c(0.9,0.99,0.999,0.9999)) #quantiles for west populations (top 0.1, 0.001,0.0001 %)
bndse = quantile(ecomb,probs=c(0.9,0.99,0.999,0.9999)) #quantiles for east populations
sum(wcomb > bndsw[1]) #20603
sum(wcomb > bndsw[2]) #number of SNPs 2061
sum(wcomb > bndsw[3]) #207
sum(wcomb > bndsw[4]) #21
sum(ecomb > bndse[1]) #20603
sum(ecomb > bndse[2]) #2061
sum(ecomb > bndse[3]) #207
sum(ecomb > bndse[4]) #21
#finding out which SNPs are common between east and west in the top quantiles
length(which(wcomb > bndsw[1] & ecomb > bndse[1])) #3062
length(which(wcomb > bndsw[2] & ecomb > bndse[2])) #58
#number of SNPs shared in the top 0.1%
wtop = which(wcomb > quantile(wcomb, probs=0.9,na.rm=T))
etop = which(ecomb > quantile(ecomb, probs=0.9,na.rm=T))
for (i in 1:10000){
x<-round(0.1*206028,0)
wnull = sample(1:206028,x,replace=FALSE)
enull = sample(1:206028,x,replace=FALSE)
null[i] = sum(wnull %in% enull)
}
obs = sum(wtop %in% etop) #3062
mean(null) #2060.397
p = mean(null >= obs) #0
xfold = obs / mean(null) #1.486121
#number of SNPs shared in the top 0.01%
#finding out expected by chance vs observed number of SNPs
null <- rep(NA,10000)
for (i in 1:10000){
x<-round(0.01*206028,0)
wnull = sample(1:206028,x,replace=FALSE)
enull = sample(1:206028,x,replace=FALSE)
null[i] = sum(wnull %in% enull)
}
p = mean(null >= obs) #0
xfold = obs / mean(null) #2.824336
mean(null) #20.5602
##plot for evolution on June 18th 2017
bndsw = quantile(wcomb,probs=c(0.0009,0.009,0.09,0.9,0.99,0.999,0.9999))
bndse = quantile(ecomb,probs=c(0.0009,0.009,0.09,0.9,0.99,0.999,0.9999))
length(which(wcomb > bndsw[1] & ecomb > bndse[1])) #205656
length(which(wcomb > bndsw[2] & ecomb > bndse[2])) #202331
length(which(wcomb > bndsw[3] & ecomb > bndse[3])) #170591
length(which(wcomb > bndsw[4] & ecomb > bndse[4])) #3062
length(which(wcomb > bndsw[5] & ecomb > bndse[5])) #58
length(which(wcomb > bndsw[6] & ecomb > bndse[6])) #0
length(which(wcomb > bndsw[7] & ecomb > bndse[7])) #0
#plotting the null vs observed
setEPS() postscript("eastWest_hist.eps", width =9, height = 9) par(mar=c(6,8,8,6),mgp=c(4,1,0)) hist(null, col="slateblue1",xlab="Number of shared SNPs",ylab ="Density",cex.lab=2,cex.axis=1.5, main =" ", xlim =c(0, 60), ylim =c(0, 2000))
abline(v=58)
dev.off()
#pdf plot
pdf("eastWest_hist.pdf", width =9, height = 9)> par(mar=c(6,8,8,6),mgp=c(4,1,0))
hist(null, col="slateblue1",xlab="Number of shared SNPs",ylab ="Density",cex.lab=2,cex.axis=1.5, main =" ", xlim =c(0, 60), ylim =c(0, 2000))
abline(v=58) dev.off()
###### SCATTERPLOTS #################################################
### I need to modify this code and create final graphs...need to play with this
final_dat = bf_scaf #contains columns : scaffold, position, bfmeans, afmeans
plot(final_dat$scaffold,final_dat$bfmeans,col="lightblue",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main="Environmental correlations and local adaptation")
mns<-tapply(X=final_dat$bfmeans,INDEX=final_dat$scaffold,mean)
#lines(final_dat$scaffold,mns,lwd=2,col="orange")
#lines(final_dat$scaffold,log10(mns),lwd=2,col="orange")
plot(final_dat$scaffold,final_dat$bfmeans,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)
lines(as.numeric(names(mns)),mns,lwd=2,col="red")
title(main="Environmental correlations and local adaptation (all pop)",cex.main=1.6, adj=0.5)
dev.copy(png,"bfVsscaffold_all.png")
dev.off()
#plot for west
plot(bf_scaf_west$scaffold,bf_scaf_west$bayesfactors,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)
mns<-tapply(X=bf_scaf_west$bayesfactors,INDEX=bf_scaf_west$scaffold,mean)
lines(as.numeric(names(mns)),mns,lwd=2,col="red")
title(main="Environmental correlations and local adaptation (west pop)",cex.main=1.6, adj=0.5)
dev.copy(pdf,"bfVsscaffold_west.pdf")
dev.off()
dev.copy(png,"bfVsscaffold_west.png")
dev.off()
#plot for east
plot(bf_scaf_east$scaffold,bf_scaf_east$bayesfactors,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)
mns<-tapply(X=bf_scaf_east$bayesfactors,INDEX=bf_scaf_east$scaffold,mean)
lines(as.numeric(names(mns)),mns,lwd=2,col="red")
title(main="Environmental correlations and local adaptation (east pop)",cex.main=1.6, adj=0.5)
dev.copy(png,"bfVsscaffold_east.png")
dev.off()
dev.copy(png,"bfVsscaffold_east.pdf")
dev.off()
#plotting east vs west (This is the main plot for parallel evolution result)
plot(ecomb,wcomb)
hist(ecomb)
hist(wcomb)
hist(ecomb)