heaPreparing input files: Took a lot of time, bloody!!!
SNPSFILE: used the population allele frequency file which I prepared for Treemix.
The file contains allele frequencies for each SNP. The input file for this was the Lmelissa_inputfile. Then prepared the Lmelissa_snpsfile (containing all the snps) and sub_snpsfile (contains a subset of snps). Ran perl script on this to get input files in appropriate order for the bayen analysis. Perl script is saved as sub_snpsfile.pl( This is the file I spent 3 days writing).
Received out put files after running this script called snpsfileout and sub_snpsfileout.
Then input these two file sin Bayen but gave segmentation fault for the fixed snps. So we removed the fixed snps and for this we used the perl script sbsnps_bayen.pl.
After this we got the FINAL input file for the analysis called sub_snpsfile_bayen.txt and snpsfile_bayen.
Substitunante the space in the file with tabs. FILE SHOULD BE TAB DELIMITED
Ran the analysis for matrix estimation using the following conditions:
./bayenv2 -i snpsfile_bayen.txt -p 25 -k 100000 -r 63479 > snps_matrix.out
./bayenv2 -i sub_snpsfile_bayen.txt -p 25 -k 100000 -r 63479 > matrix_subsnps.out
SNPFILE: To create the snpfile ZG did the following in R for the matrix:
m<-read.table("onematrix.txt",header=F)
dim(m)
[1] 25 25
m<-as.matrix(m)
> image(m)
> corM<-cov2cor(m)
> image(corM)
(This was for the subsnps. This also gave us the matrix result ie the correlation matrix.)
<similarly for all the snps created matrix called onematrixallsnps.txt
Used these files individually to run analysis for Environmental correlations. ZG wrote the script to get each snpfile saved as wrap_qsub. Script had the following conditions for the analysis:
./bayenv2 -i SNPFILE -m Matrixfile -e environfile.txt -p 25 -k 100000 -n 5 -t -r 429
I created the environfile which has 1 for the alfalfa associated populations and 0 for non-alfalfa populations.
DATE: MONDAY, 05/03/2014
Bayes factor for subsets of SNPs saved in files:
bayesfactor1_2 # FILE CONTAINING BAYES FACTOR FROM sub_bf_environ.stndenvironfile.bf and sub_bf_environ.stndenvironfile2.bf
bayesfactor2_3 # FILE CONTAINING BAYES FACTOR FROM sub_bf_environ.stndenvironfile3.bf and sub_bf_environ.stndenvironfile2.bf
x3<-read.table("sub_bf_environ.stndenvironfile3.bf")
> order.snps3<-order(x3$V1)
> m2<-merge(x2,x3,by="V1")
> write.table(m1, file = "bayesfactors2_3")
> x4<-read.table("bf_environ.stndenvironfile.bf.bf")
> order.snps4<-order(x4$V1)
> x5<-read.table("bf_environ.stndenvironfile2.bf")
> order.snps5<-order(x5$V1)
> x6<-read.table("bf_environ.stndenvironfile3.bf")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 5388 did not have 2 elements
> x6<-read.table("bf_environ.stndenvironfile3.bf")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 7941 did not have 2 elements
> x6<-read.table("bf_environ.stndenvironfile3.bf")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 48986 did not have 2 elements
> x6<-read.table("bf_environ.stndenvironfile3.bf")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 49760 did not have 2 elements
> x6<-read.table("bf_environ.stndenvironfile3.bf")
> order.snps6<-order(x6$V1)
> m4<-merge(x4,x5,by="V1")
> m5<-merge(x4,x6,by="V1")
> m6<-merge(x5,x6,by="V1")
> write.table(m4, file = "bayesfactor_bf_1_2") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile.bf.bf and bf_environ.stndenvironfile2.bf
> write.table(m5, file = "bayesfactor_bf_1_3") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile.bf.bf and bf_environ.stndenvironfile3.bf
> write.table(m6, file = "bayesfactor_bf_2_3") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile2.bf and bf_environ.stndenvironfile3.bf
PLOTTING BAYES FACTORS FOR x1 (first run with subset of SNPs) and x4 (first run with all the SNPs)
bfdifferent<-merge(x1,x4,by= "V1")
> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )
> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)
bfdifferent<-merge(x1,x4,by= "V1")
> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )
> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)
bfdifferent<-merge(x1,x4,by= "V1")
> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )
> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)
plot(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])))
plot(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])))
abline(h=1)
abline(h=2,col="red")
sum(as.numeric(log10(bfdifferent[-x,2])))>2)
PREPARING FILE TO PLOT BAYES FACTORS AND SNPS ON THE CHROMOSOMES
-Variants file is called melissaVariants2.vcf and is located in the folder variants.
-Then we create a file called locuslist which includes
tail -n 1 melissaVariants2.vcf | cut -f 1,2
grep ^scaf melissaVariants2.vcf | cut -f 1,2 > locuslist.txt
locuslist.txt is the file with scaffold number and snp positions. I then copied this file to the bayen folder.
Final plot:
llist1<-read.table("locus_bf_list1",header=T)
> llist1[1:5,]
SNPnumber scaffold position bayesfactor
1 snp1 0 33977 0.38782
2 snp10 0 85293 0.31027quit
3 snp100 0 234297 0.24776
4 snp1000 1 469899 0.32274
5 snp10000 26 85104 0.27292
> plot(llist1$scaffold,log10(llist1$bayesfactor))
> plot(llist1$scaffold,log10(llist1$bayesfactor),col="darkgray")
> mns<-tapply(X=llist1$bayesfactor,INDEX=llist1$scaffold,mean)
> lines(llist1$scaffold,log10(mns),lwd=2)
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
> mns[1:20]
0 1 2 3 4 5
4.199893e+00 4.478412e+01 4.643954e+05 1.244099e+02 1.469420e+02 2.336378e+00
6 7 8 9 10 11
3.750209e+00 8.670420e+00 7.251987e-01 3.087236e+00 4.186749e+02 1.103631e+02
12 13 14 15 16 17
4.412503e+00 2.087918e+00 3.510642e+01 7.847626e-01 5.062094e-01 9.071697e+00
18 19
1.047043e+00 1.464359e+01
> colnames(mns)[1:4]
Error in dn[[2L]] : subscript out of bounds
> names(mns)[1:4]
[1] "0" "1" "2" "3"
> lines(as.numeric(names(mns)),log10(mns),lwd=2)
png("bayenplot.png", width = 1000, height = 600)> plot(llist1$scaffold,log10(llist1$bayesfactor),col="lightblue",xlab="physical distance(bp)", ylab="log10(BayesFactor)", main="Environmental correlations and Local adaptation")
> lines(as.numeric(names(mns)),log10(mns),lwd=2,col="violet")
> dev.off()
scposData<-rep(NA,dim(data)[1])
for(i in 1:length(scposData)){scposData[i]<-paste(data[i,2],data[i,3],sep=":")}
Work on 04 May 2015.
I ran the correlations for 3 runs which are stored in files bf_environ4,5,8 respectively.
The correlations for run 4 and 5 is 0.81
4 and 8 is 0.80
5 and 8 is 0.99
WORK DONE ON 30-05-2015
> data4<-read.table("bf_environ_4")
> data5<-read.table("bf_environ_5")
> data8<-read.table("bf_environ_8")
> data4[1:5]
Error in `[.data.frame`(data4, 1:5) : undefined columns selected
> data4[1:5,]
V1 V2
1 snp499000 0.71134
2 snp499001 0.44009
3 snp499002 0.39610
4 snp499003 0.87263
5 snp499004 0.31622
> merge(data4,data5,data8, by="V1")
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
> bf_data<-cbind(data4,data5,data8)
> bf_data[.1:10]
Error in `[.data.frame`(bf_data, 0.1:10) : undefined columns selected
> bf_data[1:10]
Error in `[.data.frame`(bf_data, 1:10) : undefined columns selected
> bf_data[,1:10]
Error in `[.data.frame`(bf_data, , 1:10) : undefined columns selected
> bf_data[1:10,]
V1 V2 V1 V2 V1 V2
1 snp499000 0.71134 snp599000 0.71170 snp899000 0.75889
2 snp499001 0.44009 snp599001 0.45758 snp899001 0.43053
3 snp499002 0.39610 snp599002 0.53041 snp899002 0.55069
4 snp499003 0.87263 snp599003 1.12150 snp899003 1.22160
5 snp499004 0.31622 snp599004 0.44122 snp899004 0.46102
6 snp499005 0.25561 snp599005 0.26503 snp899005 0.26008
7 snp499006 3.59260 snp599006 1.52400 snp899006 1.64470
8 snp499007 0.26476 snp599007 0.37762 snp899007 0.40011
9 snp499008 0.68469 snp599008 0.81425 snp899008 0.72734
10 snp499009 0.31873 snp599009 0.54864 snp899009 0.58655
> data4_o<-order(data4$V1)
> data5_o<-order(data5$V1)
> data8_o<-order(data8$V1)
> bf_data<-cbind(data4_o,data5_o,data8_o)
> bf_data[1:10,]
data4_o data5_o data8_o
[1,] 92214 92214 92214
[2,] 92223 92223 92223
[3,] 92308 92308 92308
[4,] 102798 102798 102798
[5,] 20349 20349 20349
[6,] 978 978 978
[7,] 979 979 979
[8,] 980 980 980
[9,] 981 981 981
[10,] 982 982 982
> head(bf_data,10)
data4_o data5_o data8_o
[1,] 92214 92214 92214
[2,] 92223 92223 92223
[3,] 92308 92308 92308
[4,] 102798 102798 102798
[5,] 20349 20349 20349
[6,] 978 978 978
[7,] 979 979 979
[8,] 980 980 980
[9,] 981 981 981
[10,] 982 982 982
> bf_data<-cbind(data4,data5,data8)
> head(bf_data,10)
V1 V2 V1 V2 V1 V2
1 snp499000 0.71134 snp599000 0.71170 snp899000 0.75889
2 snp499001 0.44009 snp599001 0.45758 snp899001 0.43053
3 snp499002 0.39610 snp599002 0.53041 snp899002 0.55069
4 snp499003 0.87263 snp599003 1.12150 snp899003 1.22160
5 snp499004 0.31622 snp599004 0.44122 snp899004 0.46102
6 snp499005 0.25561 snp599005 0.26503 snp899005 0.26008
7 snp499006 3.59260 snp599006 1.52400 snp899006 1.64470
8 snp499007 0.26476 snp599007 0.37762 snp899007 0.40011
9 snp499008 0.68469 snp599008 0.81425 snp899008 0.72734
10 snp499009 0.31873 snp599009 0.54864 snp899009 0.58655
> head(data4,10)
V1 V2
1 snp499000 0.71134
2 snp499001 0.44009
3 snp499002 0.39610
4 snp499003 0.87263
5 snp499004 0.31622
6 snp499005 0.25561
7 snp499006 3.59260
8 snp499007 0.26476
9 snp499008 0.68469
10 snp499009 0.31873
> cor.test(as.numeric(log10(bf_data[,2])), as.numeric(log10(bf_data[,4])))
Pearson's product-moment correlation
data: as.numeric(log10(bf_data[, 2])) and as.numeric(log10(bf_data[, 4]))
t = 587.7844, df = 178175, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8106667 0.8138265
sample estimates:
cor
0.8122525
> cor.test(as.numeric(log10(bf_data[,2])), as.numeric(log10(bf_data[,6])))
Pearson's product-moment correlation
data: as.numeric(log10(bf_data[, 2])) and as.numeric(log10(bf_data[, 6]))
t = 567.8787, df = 178175, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8009127 0.8042176
sample estimates:
cor
0.8025713
> cor.test(as.numeric(log10(bf_data[,4])), as.numeric(log10(bf_data[,6])))
Pearson's product-moment correlation
data: as.numeric(log10(bf_data[, 4])) and as.numeric(log10(bf_data[, 6]))
t = 8763.767, df = 178175, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9988313 0.9988528
sample estimates:
cor
0.9988421
> rowM<-rowMeans(bf_data[, -ncol(bf_data)])
Error in rowMeans(bf_data[, -ncol(bf_data)]) : 'x' must be numeric
> bf_data1<-c(bf_data$2, bf_data$4, bf_data$6)
Error: unexpected numeric constant in "bf_data1<-c(bf_data$2"
> bf_data1<-cbind(bf_data$2, bf_data$4, bf_data$6)
Error: unexpected numeric constant in "bf_data1<-cbind(bf_data$2"
> bf_data1<-bf_data$2, bf_data$4, bf_data$6
Error: unexpected numeric constant in "bf_data1<-bf_data$2"
> bf_data1<-cbind(bf_data[,2], bf_data[,4], bf_data[,6])
> head(bf_data1,10)
[,1] [,2] [,3]
[1,] 0.71134 0.71170 0.75889
[2,] 0.44009 0.45758 0.43053
[3,] 0.39610 0.53041 0.55069
[4,] 0.87263 1.12150 1.22160
[5,] 0.31622 0.44122 0.46102
[6,] 0.25561 0.26503 0.26008
[7,] 3.59260 1.52400 1.64470
[8,] 0.26476 0.37762 0.40011
[9,] 0.68469 0.81425 0.72734
[10,] 0.31873 0.54864 0.58655
> rowM<-rowMeans(bf_data1[, -ncol(bf_data)])
> head(rowM,10)
[1] 0.7273100 0.4427333 0.4924000 1.0719100 0.4061533 0.2602400 2.2537667
[8] 0.3474967 0.7420933 0.4846400
> loclist<-read.table("locuslist.txt")
> head(loclist,10)
V1 V2
1 scaffold_0 33945
2 scaffold_0 33977
3 scaffold_0 33997
4 scaffold_0 34052
5 scaffold_0 50620
6 scaffold_0 50681
7 scaffold_0 68856
8 scaffold_0 68873
9 scaffold_0 68917
10 scaffold_0 85281
> loc_bf_data<-cbind(loclist,rowM)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 206047, 178177
> llist1
llist1
> llist1
llist1
> dim(bf_data1)
[1] 178177 3
> loc_bf_data<-cbind(head(loclist,178177),rowM)
> head(loc_bf_data,10)
V1 V2 rowM
1 scaffold_0 33945 0.7273100
2 scaffold_0 33977 0.4427333
3 scaffold_0 33997 0.4924000
4 scaffold_0 34052 1.0719100
5 scaffold_0 50620 0.4061533
6 scaffold_0 50681 0.2602400
7 scaffold_0 68856 2.2537667
8 scaffold_0 68873 0.3474967
9 scaffold_0 68917 0.7420933
10 scaffold_0 85281 0.4846400
>mns<-tapply(X=loc_bf_data$rowM, INDEX=loc_bf_data$V1,mean)
> png("bayenplot_final.png", width = 1000, height = 600)
> plot(loc_bf_data$V2,as.numeric(log10(loc_bf_data$rowM)),col="darkblue",xlab="physical distance(bp)",ylab="Bayesfactors", main="Environmental correlations and local adaptation")
> lines(as.numeric(names(mns)),log10(mns),lwd=2,col="red")
> dev.off()
Final file contaning locus data and means of bayesfactors stored as final_locus.bf.data.
Final plot is stored as bayenplot_final.
WORK ON 26 Aug 2015
Copy of Zach's code for summarising bf correlations in R
for(j in 2:5){abcd[abcd[,j]==0,j]<-NA}
> summary(abcd[,2])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000e+00 0.000e+00 0.000e+00 2.967e+13 1.000e+00 5.272e+18 465
> cor.test(as.numeric(log10(abcd[,2])),as.numeric(log10(abcd[,5])))
Pearson's product-moment correlation
data: as.numeric(log10(abcd[, 2])) and as.numeric(log10(abcd[, 5]))
t = 828.1367, df = 177528, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8903140 0.8922271
sample estimates:
cor
0.8912745
> plot(as.numeric(log10(abcd[,2])),as.numeric(log10(abcd[,5])))
> for(i in 2:4){for(j in i:5){
+ out<-cor.test(as.numeric(log10(abcd[,i])),as.numeric(log10(abcd[,j])))
+ cat(i," ",j,"cor = ",out$estimate,")
+
> }}
Error: unexpected '}' in "}"
> for(i in 2:4){for(j in i:5){
+ out<-cor.test(as.numeric(log10(abcd[,i])),as.numeric(log10(abcd[,j])))
+ cat(i," ",j,"cor = ",out$estimate,"\n")
+ }}
2 2 cor = 1
2 3 cor = 0.8384145
2 4 cor = 0.8902656
2 5 cor = 0.8912745
3 3 cor = 1
3 4 cor = 0.7861543
3 5 cor = 0.7860265
4 4 cor = 1
4 5 cor = 0.8515075
> table(apply(is.na(abcd[,2:5])==FALSE,1,sum))
2 3 4
1 828 177349
> bf<-apply(abcd[,2:5],1,mean,na.rm=TRUE)
> plot(log10(bf))
> mainabcd<-for(j in 2:5){abcd[abcd[,j]==0,j]<-NA}
Error in `[<-.data.frame`(`*tmp*`, abcd[, j] == 0, j, value = NA) :
missing values are not allowed in subscripted assignments of data frames
> head(bf)
This is the code I used in R to get the means of all the columns (ie runs) into one file:
> main<-read.table("output_abcd_union")
> head(main)
V1 V2 V3 V4 V5
1 89370 0.22294 0.21967 0.22526 0.50086
2 89371 0.31939 0.29088 0.30221 0.27803
3 89372 0.20314 0.21959 0.22228 0.24279
4 89373 0.65891 0.78320 0.66023 20.73500
5 89374 0.36655 0.58751 1.05070 0.85967
6 89375 0.26230 0.26529 0.24904 0.23820
> abcd<-for(j in 2:5){main[main[,j]==0,j]<-NA}
> head(abcd)
NULL
> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,3])))
Pearson's product-moment correlation
data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 3]))
t = 648.4746, df = 177711, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8370279 0.8397902
sample estimates:
cor
0.8384145
> cor.test(as.numeric(log10(abcd[,2])), as.numeric(log10(abcd[,3])))
Error in log10(abcd[, 2]) : non-numeric argument to mathematical function
> cor.test(as.numeric(log10(abcd[,2])), as.numeric(log10(abcd[,3])))
Error in log10(abcd[, 2]) : non-numeric argument to mathematical function
> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,3])))
Pearson's product-moment correlation
data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 3]))
t = 648.4746, df = 177711, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8370279 0.8397902
sample estimates:
cor
0.8384145
> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,4])))
Pearson's product-moment correlation
data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 4]))
t = 823.6122, df = 177530, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8892967 0.8912266
sample estimates:
cor
0.8902656
> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,5])))
Pearson's product-moment correlation
data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 5]))
t = 828.1367, df = 177528, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8903140 0.8922271
sample estimates:
cor
0.8912745
> rowM<-rowMeans(main[,-ncol(main)])
> head(rowM)
[1] 22342.67 22342.98 22343.16 22343.78 22344.00 22343.94
> dim(rowM)
NULL
> DF<-data.frame(ID=main[,1], BFmeans=rowMeans(main[,-1]))
> head(DF)
ID BFmeans
1 89370 0.2921825
2 89371 0.2976275
3 89372 0.2219500
4 89373 5.7093350
5 89374 0.7161075
6 89375 0.2537075
> DF<-data.frame(SNPpositions=main[,1], BFmeans=rowMeans(main[,-1]))
> write.table(DF, file="bayesfactorsmeans", quotes=FALSE, sep=" ", colnames=TRUE, ro> write.table(DF, file="bayesfactorsmeans", quote=FALSE, sep=" ", col.names=TRUE, row.names=FALSE)
bayenv2 -i snpfiles_allruns/snp5184082 -e stndenvironfile.txt -m onematrix.txt -f -t -r 1050 -o outX -n 1 -p 25 -k 10000