bayPass

Folder : projects/lmelissa_hostAdaptation/bayPass

The executable file(g_baypass) for the program is in this folder : ./baypass_2.1/sources/g_baypass

Folder allpop: 25 populations

allpop_genofile = allelecount input file
environfile = environment covariable input file
bfmeansWithScafPosition_all = final file for futher analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3
popplant.txt = populations and there associated plants used in this analysis
job...sh = example bash scripts to run baypass on cluster

Folder west: 17 populations in this set

westpop_genofile = allele count input file
west_environfile = environment covariable input file
bfmeansWithScafPosition_west = final file for further analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3
job...sh = example bash scripts to run baypass on cluster

Folder east: 8 populations in this set

eastpop_genofile = allele count input file
east_environfile = environment covariable input file
bfmeansWithScafPosition_east = final file for further analysis with scaffold, positions and bf means from all the runs
all the *.out files are the output files for baypass runs 0-3

The main bayPass folder contains the following files:

runBayPassFork.pl = this is to run multiple runs on the cluster at the same time and is an alternate when the walltime on the UofU cluster runs out. This is what I used for re running the analysis on March 15th 2017.
runBayPass.sh = bash script which runs the perl script above. I submitted this script to the cluster for running the analysis.
locuslist.txt = contains scaffold, positions for all the snps in the analysis
readmefile = just contains all the list of files in the folder
folders figures, jan17finalAnalysis, scripts can be removed. These have files from the old runs. I will delete all this eventually.

Here are the steps to run this analysis and then use this data to prepare final files:

STEP 1

First we need the genofile as the input genotype file for each population for this analysis. To prepare this file I used the Lmelissa_inputfile.txt which I prepared for the Treemix file. This file includes 27 populations (25 melissa + 2 anna). Therefore, I edited this file to have only 25 melissa populations which will be used for all further analysis. After editing the file with all 25 populations for bayPass analysis is saved as allPop_correct. As I had to rerun this analysis for eastern and western populations: I created two more input files called west_genofile (17 populations) and east_genofile (8 populations). I just selected the populations from the main file in R and created seperate files for these two population sets.

R code for this:

#all populations

main = read.table("Lmelissa_inputfile.txt", header=T)

main = main[,-8] #drop FCR

main = main[,-25] #drop YBG

write.table(main, file="allpop_correct", quote=FALSE, row.names=FALSE) #this file is now saved as allpop_genofile in the folder allpop

#western populations

west = main[,-c(2,9,15,17,19,20,22,24)] #dropped eastern populations

write.table(west, file="westpop_correct", quote=FALSE, row.names=FALSE) #this file is now saved as west_genofile in the folder west

#repeat to get a file for eastern populations

Finally baypass needs environment covariable files to calculate bayes factors. Precisely, for our analysis this file had information about the associated host-plant (lupine or alfalfa) for each population. Alfalfa is marked as 1 and lupine/other legumes as 0. These files are saved as environfile.txt, west_environfile.txt, east_environfile.txt for each set of analysis.

So finally I ran 4 chains of each baypass analysis for all populations, eastern and western populations. These were through bash scripts saved in the respective folders.

STEP 2

Once the jobs are run and the analysis is complete, I needed to combine the bayesfactors from each of the runs. Mainly, I took averages of the bayesfactors from the four runs for each population set. There will be the following files created for each run (just replace eastpop with all or westpop).

eastpopD_DIC.out

eastpopD_mat_omega.out

eastpopD_summary_betai_reg.out -- this is the file with bayesfactors

eastpopD_summary_beta_params.out

eastpopD_summary_lda_omega.out

eastpopD_summary_pij.out

eastpopD_summary_pi_xtx.out

Refer to the finalcode_filelog.R for the code for this. This is how I start the main analysis and prepare files for all further analysis.

I combine average bayesfactors with scaffold and positions to do further analysis. This is done for all, east and west and final files are saved as bfmeansWithScafPositions_all,west,east. This is the file which will be used for all further analysis.

R CODE to prepare final data files for analysis

###create locuslist which contains scaffolds and positions###

scafpos = read.table("p_ABC.txt", header=F)

main = scafpos[,1]

write.table(main, file="locuslist.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

############### TAKE BF MEANS AND PREPARE FILE FOR FURTHER ANALYSIS ###################################

HERE IS AN EXAMPLE FOR THE WEST POPULATIONS

#read in the input files for each run

w0 = read.table("west0_summary_betai_reg.out", header=T)

w1 = read.table("west1_summary_betai_reg.out", header=T)

w2 = read.table("west2_summary_betai_reg.out", header=T)

w3 = read.table("west3_summary_betai_reg.out", header=T)

#get correlations between bayesfactors from different runs

cor.test(w0$BF.dB., w1$BF.dB.) #0.68 #0.65 for all pop, 0.54 for east pop

cor.test(w0$BF.dB., w2$BF.dB.)

cor.test(w0$BF.dB., w3$BF.dB.)

cor.test(w0$BF.dB., w4$BF.dB.)

cor.test(w2$BF.dB., w3$BF.dB.)

cor.test(w1$BF.dB., w3$BF.dB.)

cor.test(w1$BF.dB., w2$BF.dB.)

w01 = (w0$BF.dB. + w1$BF.dB.) / 2

w23 = (w2$BF.dB. + w3$BF.dB.) / 2

cor.test(w01, w23) #west= 0.81, all = 0.79, all = 0.69

#get bayesfactor means from all 4 runs

bfmeans = (w0$BF.dB. + w1$BF.dB. + w2$BF.dB. + w3$BF.dB.) / 4

length(bfmeans) #206,028 SNPs

#read in the locus list containing scaffold and positions for the snps

llist = read.table("../locuslist.txt", header=TRUE)

#combine locus list and bayesfactor means to get the final file

w_final = cbind(llist, bfmeans)

head(w_final)

dim(w_final)

write.table(w_final, file = "bfmeansWithScafPosition_west", col.names=TRUE, row.names=FALSE, sep= "\t", quote=FALSE)

#repeat the same for east runs and all populations

STEP 3

#####correlations between east and west ###################

all = read.table("bfmeansWithScafPosition_all", header=TRUE)

west = read.table("bfmeansWithScafPosition_west", header=TRUE)

east = read.table("bfmeansWithScafPosition_east", header=TRUE)

########### doing quantiles to find top SNPs ###########################

ecomb = east$bfmeans

wcomb = west$bfmeans

cor.test(ecomb,wcomb) ##0.06 #p-value < 2.2e-16

bndsw = quantile(wcomb,probs=c(0.9,0.99,0.999,0.9999)) #quantiles for west populations (top 0.1, 0.001,0.0001 %)

bndse = quantile(ecomb,probs=c(0.9,0.99,0.999,0.9999)) #quantiles for east populations

sum(wcomb > bndsw[1]) #20603

sum(wcomb > bndsw[2]) #number of SNPs 2061

sum(wcomb > bndsw[3]) #207

sum(wcomb > bndsw[4]) #21

sum(ecomb > bndse[1]) #20603

sum(ecomb > bndse[2]) #2061

sum(ecomb > bndse[3]) #207

sum(ecomb > bndse[4]) #21

#finding out which SNPs are common between east and west in the top quantiles

length(which(wcomb > bndsw[1] & ecomb > bndse[1])) #3062

length(which(wcomb > bndsw[2] & ecomb > bndse[2])) #58

#number of SNPs shared in the top 0.1%

wtop = which(wcomb > quantile(wcomb, probs=0.9,na.rm=T))

etop = which(ecomb > quantile(ecomb, probs=0.9,na.rm=T))

for (i in 1:10000){

x<-round(0.1*206028,0)

wnull = sample(1:206028,x,replace=FALSE)

enull = sample(1:206028,x,replace=FALSE)

null[i] = sum(wnull %in% enull)

}

obs = sum(wtop %in% etop) #3062

mean(null) #2060.397

p = mean(null >= obs) #0

xfold = obs / mean(null) #1.486121

#number of SNPs shared in the top 0.01%

#finding out expected by chance vs observed number of SNPs

null <- rep(NA,10000)

for (i in 1:10000){

x<-round(0.01*206028,0)

wnull = sample(1:206028,x,replace=FALSE)

enull = sample(1:206028,x,replace=FALSE)

null[i] = sum(wnull %in% enull)

}

p = mean(null >= obs) #0

xfold = obs / mean(null) #2.824336

mean(null) #20.5602

##plot for evolution on June 18th 2017

bndsw = quantile(wcomb,probs=c(0.0009,0.009,0.09,0.9,0.99,0.999,0.9999))

bndse = quantile(ecomb,probs=c(0.0009,0.009,0.09,0.9,0.99,0.999,0.9999))

length(which(wcomb > bndsw[1] & ecomb > bndse[1])) #205656

length(which(wcomb > bndsw[2] & ecomb > bndse[2])) #202331

length(which(wcomb > bndsw[3] & ecomb > bndse[3])) #170591

length(which(wcomb > bndsw[4] & ecomb > bndse[4])) #3062

length(which(wcomb > bndsw[5] & ecomb > bndse[5])) #58

length(which(wcomb > bndsw[6] & ecomb > bndse[6])) #0

length(which(wcomb > bndsw[7] & ecomb > bndse[7])) #0

#plotting the null vs observed

setEPS() postscript("eastWest_hist.eps", width =9, height = 9) par(mar=c(6,8,8,6),mgp=c(4,1,0)) hist(null, col="slateblue1",xlab="Number of shared SNPs",ylab ="Density",cex.lab=2,cex.axis=1.5, main =" ", xlim =c(0, 60), ylim =c(0, 2000))

abline(v=58)

dev.off()

#pdf plot

pdf("eastWest_hist.pdf", width =9, height = 9)> par(mar=c(6,8,8,6),mgp=c(4,1,0))

hist(null, col="slateblue1",xlab="Number of shared SNPs",ylab ="Density",cex.lab=2,cex.axis=1.5, main =" ", xlim =c(0, 60), ylim =c(0, 2000))

abline(v=58) dev.off()

###### SCATTERPLOTS #################################################

### I need to modify this code and create final graphs...need to play with this

final_dat = bf_scaf #contains columns : scaffold, position, bfmeans, afmeans

plot(final_dat$scaffold,final_dat$bfmeans,col="lightblue",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main="Environmental correlations and local adaptation")

mns<-tapply(X=final_dat$bfmeans,INDEX=final_dat$scaffold,mean)

#lines(final_dat$scaffold,mns,lwd=2,col="orange")

#lines(final_dat$scaffold,log10(mns),lwd=2,col="orange")

plot(final_dat$scaffold,final_dat$bfmeans,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)

lines(as.numeric(names(mns)),mns,lwd=2,col="red")

title(main="Environmental correlations and local adaptation (all pop)",cex.main=1.6, adj=0.5)

dev.copy(png,"bfVsscaffold_all.png")

dev.off()

#plot for west

plot(bf_scaf_west$scaffold,bf_scaf_west$bayesfactors,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)

mns<-tapply(X=bf_scaf_west$bayesfactors,INDEX=bf_scaf_west$scaffold,mean)

lines(as.numeric(names(mns)),mns,lwd=2,col="red")

title(main="Environmental correlations and local adaptation (west pop)",cex.main=1.6, adj=0.5)

dev.copy(pdf,"bfVsscaffold_west.pdf")

dev.off()

dev.copy(png,"bfVsscaffold_west.png")

dev.off()

#plot for east

plot(bf_scaf_east$scaffold,bf_scaf_east$bayesfactors,col="darkgray",xlab="physical distance(bp)",ylab="log10(Bayesfactors)",main=" ",cex.lab=1.4,cex.axis=1)

mns<-tapply(X=bf_scaf_east$bayesfactors,INDEX=bf_scaf_east$scaffold,mean)

lines(as.numeric(names(mns)),mns,lwd=2,col="red")

title(main="Environmental correlations and local adaptation (east pop)",cex.main=1.6, adj=0.5)

dev.copy(png,"bfVsscaffold_east.png")

dev.off()

dev.copy(png,"bfVsscaffold_east.pdf")

dev.off()

#plotting east vs west (This is the main plot for parallel evolution result)

plot(ecomb,wcomb)

hist(ecomb)

hist(wcomb)

hist(ecomb)

Page updated

Google Sites

Report abuse