bayesenv_coop

heaPreparing input files: Took a lot of time, bloody!!!

SNPSFILE: used the population allele frequency file which I prepared for Treemix.

The file contains allele frequencies for each SNP. The input file for this was the Lmelissa_inputfile. Then prepared the Lmelissa_snpsfile (containing all the snps) and sub_snpsfile (contains a subset of snps). Ran perl script on this to get input files in appropriate order for the bayen analysis. Perl script is saved as sub_snpsfile.pl( This is the file I spent 3 days writing).

Received out put files after running this script called snpsfileout and sub_snpsfileout.

Then input these two file sin Bayen but gave segmentation fault for the fixed snps. So we removed the fixed snps and for this we used the perl script sbsnps_bayen.pl.

After this we got the FINAL input file for the analysis called sub_snpsfile_bayen.txt and snpsfile_bayen.

Substitunante the space in the file with tabs. FILE SHOULD BE TAB DELIMITED

Ran the analysis for matrix estimation using the following conditions:

./bayenv2 -i snpsfile_bayen.txt -p 25 -k 100000 -r 63479 > snps_matrix.out

./bayenv2 -i sub_snpsfile_bayen.txt -p 25 -k 100000 -r 63479 > matrix_subsnps.out

SNPFILE: To create the snpfile ZG did the following in R for the matrix:

m<-read.table("onematrix.txt",header=F)

dim(m)

[1] 25 25

m<-as.matrix(m)

> image(m)

> corM<-cov2cor(m)

> image(corM)

(This was for the subsnps. This also gave us the matrix result ie the correlation matrix.)

<similarly for all the snps created matrix called onematrixallsnps.txt

Used these files individually to run analysis for Environmental correlations. ZG wrote the script to get each snpfile saved as wrap_qsub. Script had the following conditions for the analysis:

./bayenv2 -i SNPFILE -m Matrixfile -e environfile.txt -p 25 -k 100000 -n 5 -t -r 429

I created the environfile which has 1 for the alfalfa associated populations and 0 for non-alfalfa populations.

DATE: MONDAY, 05/03/2014

Bayes factor for subsets of SNPs saved in files:

bayesfactor1_2 # FILE CONTAINING BAYES FACTOR FROM sub_bf_environ.stndenvironfile.bf and sub_bf_environ.stndenvironfile2.bf

bayesfactor2_3 # FILE CONTAINING BAYES FACTOR FROM sub_bf_environ.stndenvironfile3.bf and sub_bf_environ.stndenvironfile2.bf

x3<-read.table("sub_bf_environ.stndenvironfile3.bf")

> order.snps3<-order(x3$V1)

> m2<-merge(x2,x3,by="V1")

> write.table(m1, file = "bayesfactors2_3")

> x4<-read.table("bf_environ.stndenvironfile.bf.bf")

> order.snps4<-order(x4$V1)

> x5<-read.table("bf_environ.stndenvironfile2.bf")

> order.snps5<-order(x5$V1)

> x6<-read.table("bf_environ.stndenvironfile3.bf")

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

line 5388 did not have 2 elements

> x6<-read.table("bf_environ.stndenvironfile3.bf")

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

line 7941 did not have 2 elements

> x6<-read.table("bf_environ.stndenvironfile3.bf")

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

line 48986 did not have 2 elements

> x6<-read.table("bf_environ.stndenvironfile3.bf")

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

line 49760 did not have 2 elements

> x6<-read.table("bf_environ.stndenvironfile3.bf")

> order.snps6<-order(x6$V1)

> m4<-merge(x4,x5,by="V1")

> m5<-merge(x4,x6,by="V1")

> m6<-merge(x5,x6,by="V1")

> write.table(m4, file = "bayesfactor_bf_1_2") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile.bf.bf and bf_environ.stndenvironfile2.bf

> write.table(m5, file = "bayesfactor_bf_1_3") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile.bf.bf and bf_environ.stndenvironfile3.bf

> write.table(m6, file = "bayesfactor_bf_2_3") #FILE CONTAINING BAYES FACTOR FROM bf_environ.stndenvironfile2.bf and bf_environ.stndenvironfile3.bf

PLOTTING BAYES FACTORS FOR x1 (first run with subset of SNPs) and x4 (first run with all the SNPs)

bfdifferent<-merge(x1,x4,by= "V1")

> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )

> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)

bfdifferent<-merge(x1,x4,by= "V1")

> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )

> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)

bfdifferent<-merge(x1,x4,by= "V1")

> x<-which(is.infinite(as.numeric(log10(bfdifferent[,2])))==TRUE |is.infinite(as.numeric(log10(bfdifferent[,3])))==TRUE )

> cor.test(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])),na.rm=TRUE)

plot(as.numeric(log10(bfdifferent[-x,2])),as.numeric(log10(bfdifferent[-x,3])))

abline(h=1)

abline(h=2,col="red")

sum(as.numeric(log10(bfdifferent[-x,2])))>2)

PREPARING FILE TO PLOT BAYES FACTORS AND SNPS ON THE CHROMOSOMES

-Variants file is called melissaVariants2.vcf and is located in the folder variants.

-Then we create a file called locuslist which includes

tail -n 1 melissaVariants2.vcf | cut -f 1,2

grep ^scaf melissaVariants2.vcf | cut -f 1,2 > locuslist.txt

locuslist.txt is the file with scaffold number and snp positions. I then copied this file to the bayen folder.

Final plot:

llist1<-read.table("locus_bf_list1",header=T)

> llist1[1:5,]

SNPnumber scaffold position bayesfactor

1 snp1 0 33977 0.38782

2 snp10 0 85293 0.31027quit

3 snp100 0 234297 0.24776

4 snp1000 1 469899 0.32274

5 snp10000 26 85104 0.27292

> plot(llist1$scaffold,log10(llist1$bayesfactor))

> plot(llist1$scaffold,log10(llist1$bayesfactor),col="darkgray")

> mns<-tapply(X=llist1$bayesfactor,INDEX=llist1$scaffold,mean)

> lines(llist1$scaffold,log10(mns),lwd=2)

Error in xy.coords(x, y) : 'x' and 'y' lengths differ

> mns[1:20]

0 1 2 3 4 5

4.199893e+00 4.478412e+01 4.643954e+05 1.244099e+02 1.469420e+02 2.336378e+00

6 7 8 9 10 11

3.750209e+00 8.670420e+00 7.251987e-01 3.087236e+00 4.186749e+02 1.103631e+02

12 13 14 15 16 17

4.412503e+00 2.087918e+00 3.510642e+01 7.847626e-01 5.062094e-01 9.071697e+00

18 19

1.047043e+00 1.464359e+01

> colnames(mns)[1:4]

Error in dn[[2L]] : subscript out of bounds

> names(mns)[1:4]

[1] "0" "1" "2" "3"

> lines(as.numeric(names(mns)),log10(mns),lwd=2)

png("bayenplot.png", width = 1000, height = 600)> plot(llist1$scaffold,log10(llist1$bayesfactor),col="lightblue",xlab="physical distance(bp)", ylab="log10(BayesFactor)", main="Environmental correlations and Local adaptation")

> lines(as.numeric(names(mns)),log10(mns),lwd=2,col="violet")

> dev.off()

scposData<-rep(NA,dim(data)[1])

for(i in 1:length(scposData)){scposData[i]<-paste(data[i,2],data[i,3],sep=":")}

Work on 04 May 2015.

I ran the correlations for 3 runs which are stored in files bf_environ4,5,8 respectively.

The correlations for run 4 and 5 is 0.81

4 and 8 is 0.80

5 and 8 is 0.99

WORK DONE ON 30-05-2015

> data4<-read.table("bf_environ_4")

> data5<-read.table("bf_environ_5")

> data8<-read.table("bf_environ_8")

> data4[1:5]

Error in `[.data.frame`(data4, 1:5) : undefined columns selected

> data4[1:5,]

V1 V2

1 snp499000 0.71134

2 snp499001 0.44009

3 snp499002 0.39610

4 snp499003 0.87263

5 snp499004 0.31622

> merge(data4,data5,data8, by="V1")

Error in fix.by(by.x, x) :

'by' must specify one or more columns as numbers, names or logical

> bf_data<-cbind(data4,data5,data8)

> bf_data[.1:10]

Error in `[.data.frame`(bf_data, 0.1:10) : undefined columns selected

> bf_data[1:10]

Error in `[.data.frame`(bf_data, 1:10) : undefined columns selected

> bf_data[,1:10]

Error in `[.data.frame`(bf_data, , 1:10) : undefined columns selected

> bf_data[1:10,]

V1 V2 V1 V2 V1 V2

1 snp499000 0.71134 snp599000 0.71170 snp899000 0.75889

2 snp499001 0.44009 snp599001 0.45758 snp899001 0.43053

3 snp499002 0.39610 snp599002 0.53041 snp899002 0.55069

4 snp499003 0.87263 snp599003 1.12150 snp899003 1.22160

5 snp499004 0.31622 snp599004 0.44122 snp899004 0.46102

6 snp499005 0.25561 snp599005 0.26503 snp899005 0.26008

7 snp499006 3.59260 snp599006 1.52400 snp899006 1.64470

8 snp499007 0.26476 snp599007 0.37762 snp899007 0.40011

9 snp499008 0.68469 snp599008 0.81425 snp899008 0.72734

10 snp499009 0.31873 snp599009 0.54864 snp899009 0.58655

> data4_o<-order(data4$V1)

> data5_o<-order(data5$V1)

> data8_o<-order(data8$V1)

> bf_data<-cbind(data4_o,data5_o,data8_o)

> bf_data[1:10,]

data4_o data5_o data8_o

[1,] 92214 92214 92214

[2,] 92223 92223 92223

[3,] 92308 92308 92308

[4,] 102798 102798 102798

[5,] 20349 20349 20349

[6,] 978 978 978

[7,] 979 979 979

[8,] 980 980 980

[9,] 981 981 981

[10,] 982 982 982

> head(bf_data,10)

data4_o data5_o data8_o

[1,] 92214 92214 92214

[2,] 92223 92223 92223

[3,] 92308 92308 92308

[4,] 102798 102798 102798

[5,] 20349 20349 20349

[6,] 978 978 978

[7,] 979 979 979

[8,] 980 980 980

[9,] 981 981 981

[10,] 982 982 982

> bf_data<-cbind(data4,data5,data8)

> head(bf_data,10)

V1 V2 V1 V2 V1 V2

1 snp499000 0.71134 snp599000 0.71170 snp899000 0.75889

2 snp499001 0.44009 snp599001 0.45758 snp899001 0.43053

3 snp499002 0.39610 snp599002 0.53041 snp899002 0.55069

4 snp499003 0.87263 snp599003 1.12150 snp899003 1.22160

5 snp499004 0.31622 snp599004 0.44122 snp899004 0.46102

6 snp499005 0.25561 snp599005 0.26503 snp899005 0.26008

7 snp499006 3.59260 snp599006 1.52400 snp899006 1.64470

8 snp499007 0.26476 snp599007 0.37762 snp899007 0.40011

9 snp499008 0.68469 snp599008 0.81425 snp899008 0.72734

10 snp499009 0.31873 snp599009 0.54864 snp899009 0.58655

> head(data4,10)

V1 V2

1 snp499000 0.71134

2 snp499001 0.44009

3 snp499002 0.39610

4 snp499003 0.87263

5 snp499004 0.31622

6 snp499005 0.25561

7 snp499006 3.59260

8 snp499007 0.26476

9 snp499008 0.68469

10 snp499009 0.31873

> cor.test(as.numeric(log10(bf_data[,2])), as.numeric(log10(bf_data[,4])))

Pearson's product-moment correlation

data: as.numeric(log10(bf_data[, 2])) and as.numeric(log10(bf_data[, 4]))

t = 587.7844, df = 178175, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8106667 0.8138265

sample estimates:

cor

0.8122525

> cor.test(as.numeric(log10(bf_data[,2])), as.numeric(log10(bf_data[,6])))

Pearson's product-moment correlation

data: as.numeric(log10(bf_data[, 2])) and as.numeric(log10(bf_data[, 6]))

t = 567.8787, df = 178175, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8009127 0.8042176

sample estimates:

cor

0.8025713

> cor.test(as.numeric(log10(bf_data[,4])), as.numeric(log10(bf_data[,6])))

Pearson's product-moment correlation

data: as.numeric(log10(bf_data[, 4])) and as.numeric(log10(bf_data[, 6]))

t = 8763.767, df = 178175, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.9988313 0.9988528

sample estimates:

cor

0.9988421

> rowM<-rowMeans(bf_data[, -ncol(bf_data)])

Error in rowMeans(bf_data[, -ncol(bf_data)]) : 'x' must be numeric

> bf_data1<-c(bf_data$2, bf_data$4, bf_data$6)

Error: unexpected numeric constant in "bf_data1<-c(bf_data$2"

> bf_data1<-cbind(bf_data$2, bf_data$4, bf_data$6)

Error: unexpected numeric constant in "bf_data1<-cbind(bf_data$2"

> bf_data1<-bf_data$2, bf_data$4, bf_data$6

Error: unexpected numeric constant in "bf_data1<-bf_data$2"

> bf_data1<-cbind(bf_data[,2], bf_data[,4], bf_data[,6])

> head(bf_data1,10)

[,1] [,2] [,3]

[1,] 0.71134 0.71170 0.75889

[2,] 0.44009 0.45758 0.43053

[3,] 0.39610 0.53041 0.55069

[4,] 0.87263 1.12150 1.22160

[5,] 0.31622 0.44122 0.46102

[6,] 0.25561 0.26503 0.26008

[7,] 3.59260 1.52400 1.64470

[8,] 0.26476 0.37762 0.40011

[9,] 0.68469 0.81425 0.72734

[10,] 0.31873 0.54864 0.58655

> rowM<-rowMeans(bf_data1[, -ncol(bf_data)])

> head(rowM,10)

[1] 0.7273100 0.4427333 0.4924000 1.0719100 0.4061533 0.2602400 2.2537667

[8] 0.3474967 0.7420933 0.4846400

> loclist<-read.table("locuslist.txt")

> head(loclist,10)

V1 V2

1 scaffold_0 33945

2 scaffold_0 33977

3 scaffold_0 33997

4 scaffold_0 34052

5 scaffold_0 50620

6 scaffold_0 50681

7 scaffold_0 68856

8 scaffold_0 68873

9 scaffold_0 68917

10 scaffold_0 85281

> loc_bf_data<-cbind(loclist,rowM)

Error in data.frame(..., check.names = FALSE) :

arguments imply differing number of rows: 206047, 178177

> llist1

llist1

> llist1

llist1

> dim(bf_data1)

[1] 178177 3

> loc_bf_data<-cbind(head(loclist,178177),rowM)

> head(loc_bf_data,10)

V1 V2 rowM

1 scaffold_0 33945 0.7273100

2 scaffold_0 33977 0.4427333

3 scaffold_0 33997 0.4924000

4 scaffold_0 34052 1.0719100

5 scaffold_0 50620 0.4061533

6 scaffold_0 50681 0.2602400

7 scaffold_0 68856 2.2537667

8 scaffold_0 68873 0.3474967

9 scaffold_0 68917 0.7420933

10 scaffold_0 85281 0.4846400

>mns<-tapply(X=loc_bf_data$rowM, INDEX=loc_bf_data$V1,mean)

> png("bayenplot_final.png", width = 1000, height = 600)

> plot(loc_bf_data$V2,as.numeric(log10(loc_bf_data$rowM)),col="darkblue",xlab="physical distance(bp)",ylab="Bayesfactors", main="Environmental correlations and local adaptation")

> lines(as.numeric(names(mns)),log10(mns),lwd=2,col="red")

> dev.off()

Final file contaning locus data and means of bayesfactors stored as final_locus.bf.data.

Final plot is stored as bayenplot_final.

WORK ON 26 Aug 2015

Copy of Zach's code for summarising bf correlations in R

for(j in 2:5){abcd[abcd[,j]==0,j]<-NA}

> summary(abcd[,2])

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

0.000e+00 0.000e+00 0.000e+00 2.967e+13 1.000e+00 5.272e+18 465

> cor.test(as.numeric(log10(abcd[,2])),as.numeric(log10(abcd[,5])))

Pearson's product-moment correlation

data: as.numeric(log10(abcd[, 2])) and as.numeric(log10(abcd[, 5]))

t = 828.1367, df = 177528, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8903140 0.8922271

sample estimates:

cor

0.8912745

> plot(as.numeric(log10(abcd[,2])),as.numeric(log10(abcd[,5])))

> for(i in 2:4){for(j in i:5){

+ out<-cor.test(as.numeric(log10(abcd[,i])),as.numeric(log10(abcd[,j])))

+ cat(i," ",j,"cor = ",out$estimate,")

> }}

Error: unexpected '}' in "}"

> for(i in 2:4){for(j in i:5){

+ out<-cor.test(as.numeric(log10(abcd[,i])),as.numeric(log10(abcd[,j])))

+ cat(i," ",j,"cor = ",out$estimate,"\n")

+ }}

2 2 cor = 1

2 3 cor = 0.8384145

2 4 cor = 0.8902656

2 5 cor = 0.8912745

3 3 cor = 1

3 4 cor = 0.7861543

3 5 cor = 0.7860265

4 4 cor = 1

4 5 cor = 0.8515075

> table(apply(is.na(abcd[,2:5])==FALSE,1,sum))

2 3 4

1 828 177349

> bf<-apply(abcd[,2:5],1,mean,na.rm=TRUE)

> plot(log10(bf))

> mainabcd<-for(j in 2:5){abcd[abcd[,j]==0,j]<-NA}

Error in `[<-.data.frame`(`*tmp*`, abcd[, j] == 0, j, value = NA) :

missing values are not allowed in subscripted assignments of data frames

> head(bf)

This is the code I used in R to get the means of all the columns (ie runs) into one file:

> main<-read.table("output_abcd_union")

> head(main)

V1 V2 V3 V4 V5

1 89370 0.22294 0.21967 0.22526 0.50086

2 89371 0.31939 0.29088 0.30221 0.27803

3 89372 0.20314 0.21959 0.22228 0.24279

4 89373 0.65891 0.78320 0.66023 20.73500

5 89374 0.36655 0.58751 1.05070 0.85967

6 89375 0.26230 0.26529 0.24904 0.23820

> abcd<-for(j in 2:5){main[main[,j]==0,j]<-NA}

> head(abcd)

NULL

> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,3])))

Pearson's product-moment correlation

data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 3]))

t = 648.4746, df = 177711, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8370279 0.8397902

sample estimates:

cor

0.8384145

> cor.test(as.numeric(log10(abcd[,2])), as.numeric(log10(abcd[,3])))

Error in log10(abcd[, 2]) : non-numeric argument to mathematical function

> cor.test(as.numeric(log10(abcd[,2])), as.numeric(log10(abcd[,3])))

Error in log10(abcd[, 2]) : non-numeric argument to mathematical function

> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,3])))

Pearson's product-moment correlation

data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 3]))

t = 648.4746, df = 177711, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8370279 0.8397902

sample estimates:

cor

0.8384145

> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,4])))

Pearson's product-moment correlation

data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 4]))

t = 823.6122, df = 177530, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8892967 0.8912266

sample estimates:

cor

0.8902656

> cor.test(as.numeric(log10(main[,2])), as.numeric(log10(main[,5])))

Pearson's product-moment correlation

data: as.numeric(log10(main[, 2])) and as.numeric(log10(main[, 5]))

t = 828.1367, df = 177528, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.8903140 0.8922271

sample estimates:

cor

0.8912745

> rowM<-rowMeans(main[,-ncol(main)])

> head(rowM)

[1] 22342.67 22342.98 22343.16 22343.78 22344.00 22343.94

> dim(rowM)

NULL

> DF<-data.frame(ID=main[,1], BFmeans=rowMeans(main[,-1]))

> head(DF)

ID BFmeans

1 89370 0.2921825

2 89371 0.2976275

3 89372 0.2219500

4 89373 5.7093350

5 89374 0.7161075

6 89375 0.2537075

> DF<-data.frame(SNPpositions=main[,1], BFmeans=rowMeans(main[,-1]))

> write.table(DF, file="bayesfactorsmeans", quotes=FALSE, sep=" ", colnames=TRUE, ro> write.table(DF, file="bayesfactorsmeans", quote=FALSE, sep=" ", col.names=TRUE, row.names=FALSE)

bayenv2 -i snpfiles_allruns/snp5184082 -e stndenvironfile.txt -m onematrix.txt -f -t -r 1050 -o outX -n 1 -p 25 -k 10000

Page updated

Google Sites

Report abuse