SNP-based analyses for Timema genome paper revision

Post date: Mar 17, 2014 7:7:40 PM

We were asked to test justify or consider SNP-based rather than region based analyes for parallelism and natural population by experiment comparisons for the revision. Here are summaries of two things I did to address these comments.

First we asked whether hmm region and single SNP parallel divergence in nature were related (they are). I counted the number of population pairs in which each SNP was (i) classified as a high divergence SNP (top 10% of the empirical Fst distribution), and (ii) occurred in a divergence region based on the HMM analysis. The number of times a SNP was highly divergent between populations and in a divergence region were positively correlated (r = 0.19, p < 10^{^-10}).

The table below reports the number of times a SNP was classified as high divergence between 0 to 4 population pairs (rows) and in a divergence region between 0 to 4 population pairs (columns).

And here are the expectations if these two things were independent

The table below reports the X enrichment of each category relative to the null expectations (i.e., observed/expected)

0 1 2 3 4

0 1.1173512 0.8101410 0.5841571 0.4354957 0.2989245

1 0.9449744 1.1084049 1.1142803 1.0410788 0.9169735

2 0.7888296 1.3064268 1.9097174 2.2868003 2.6489172

3 0.6586967 1.3915190 2.7920075 5.3041640 8.2144022

4 0.5102597 1.4272151 3.9973650 9.7036734 23.6279703

Notice the enrichment along the diagonal an din general for parallel divergence SNPs to coincide with SNPs in parallel divergence regions. The attached figure is a plot color-coded based on these enrichment values.

Finally, here are p-values for observed > expected from a randomization test with 1000 randomizations. Note that the 0's are really p < 0.001.

0 1 2 3 4

0 0 1 1 1 1.000

1 1 0 0 0 0.983

2 1 0 0 0 0.000

3 1 0 0 0 0.000

4 1 0 0 0 0.000

I think this all nicely demonstrates that parallel divergence SNPs are in parallel divergence regions, and that more generally divergence SNPs are in divergence regions.

Next we asked whether parallel divergence change SNPs (the top 99.5th quantile) from the experiment were more likely to be parallel high divergence SNPs (that is high genetic differentiation regions identified by the HMM in two or more populations, but here we must have the actual SNP in the data set) more often then expected by chance (i.e., under the null hypothesis of no association between these metrics). This limited us to 3654 SNPs with actual HMM states. We addressed this question with a randomization test (1000 permutations). The results are below.

99.5th quantile parallel divergent change and parallel high genetic differentiation

Thus, we have evidence that the top (99.5th quantile) parallel divergent change SNPs in the experiment occur significantly more often in parallel (2 or more pairs) high genetic differentiation regions than expected by chance. This is stronger evidence (lower p-value) than for an analysis with all SNPs that could be placed in HMM regions, but the signal is driven purely by SNPs with high divergence states in two populations (none occur in 3+). The relevant R code (projects/timema_wgwild/hmm/snpAnalyses.R) is below and the relevant R workspace is snp.rwsp:

## SNP-based analyses requested during the timema genome revisions

hmm<-read.table("lgFstHmm.txt.gz",header=TRUE)

fst<-read.table("parallelismInformation.txt",header=TRUE)

fqs<-matrix(scan("fstQuantiles.txt",n=4391556*4,sep=" "),nrow=4391556,ncol=4,byrow=TRUE)

## high divergence snps are 90th quantile, pdiv tells how many pops this is true for

pdiv<-apply(fqs >= 0.9,1,sum)

highdiv<-pdiv > 0

## snps in parallel divergence regions

hmmHigh<-apply(hmm[,9:12]==1,1,sum)

## overall correlation between number of times in divergence region and number of times high divergence SNP

cor.test(hmmHigh,pdiv)

##t = 406.6988, df = 4391554, p-value < 2.2e-16

##alternative hypothesis: true correlation is not equal to 0

##95 percent confidence interval:

## 0.1896162 0.1914189

##sample estimates:

## cor

##0.1905177

table(hmmHigh,pdiv)

pdiv

hmmHigh 0 1 2 3 4

0 1508297 461752 61859 4517 131

1 1082443 536086 100128 9163 341

2 284357 198846 54004 6334 310

3 34086 30404 11334 2109 138

4 1663 1964 1022 243 25

htab<-table(hmmHigh)

ptab<-table(pdiv)

nl<-sum(htab)

hfrq<-htab/nl

pfrq<-ptab/nl

exp<-matrix(NA,nrow=5,ncol=5)

for(i in 1:5){

for(j in 1:5){

exp[i,j]<- nl * hfrq[i] * pfrq[j]

}

library(RColorBrewer)

pdf("snpRegionEnrichment.pdf",width=6,height=7.5)

layout(matrix(1:2,nrow=2,ncol=1),widths=6,heights=c(6,1.5))

par(mar=c(5,5,1,1))

image((1:5)-1,(1:5)-1,obs/exp,col=brewer.pal(n=5,"Blues"),breaks=c(0,1,2,5,10,30),xlab="divergence region",ylab="divergence SNP",cex.lab=1.8,cex.axis=1.3)

#par(mar=c(2,2,0,0))

image(1:5,1,matrix(c(0.5,1.5,3.5,7.5,20),nrow=5,ncol=1),col=brewer.pal(n=5,"Blues"),breaks=c(0,1,2,5,10,30),xlab="",ylab="",axes=FALSE)

axis(1,at=seq(0.5,5.5,1),as.character(c(0,1,2,5,10,30)),cex.axis=1.2)

box()

dev.off()

## randomization test

nrep<-1000

ran<-matrix(0,nrow=5,ncol=5)

x<-hmmHigh

y<-pdiv

for(i in 893:nrep){

cat(i,"\n")

y<-sample(y,nl,replace=F)

out<-table(x,y)

ran<-ran + (out > obs)

}

###### SNP based analysis of experiment by hmm enrichment ######

nl<-82060

nlwg<-4391556

npp<-4

## allele frequency divergence

afDiv<-scan("../experiment/bgsrAlleleFreqDiv.txt",n=nl)

## read bgsr locus ids

bgsrIds<-matrix(scan("../experiment/bgsr_locids.txt",n=nl*2,sep=" "),nrow=nl,ncol=2,byrow=TRUE)

## read wgrs locus ids, lg snps only

wgrsIds<-matrix(scan("../experiment/wgrs_locids.txt",n=nlwg*4,sep=" "),nrow=nlwg,ncol=4,byrow=TRUE)

## read hmm states for loci, used perl script to get these

hmmHVAxHVC<-matrix(scan("bed bgsr locus ids

bgsrIds<-matrix(scan("../experiment/bgsr_locids.txt",n=nl*2,sep=" "),nrow=nl,ncol=2,byrow=TRUE)

## read wgrs locus ids, lg snps only

hmmHVAxHVC<-matrix(scan("../experiment/bgsrSnpStates_HVAxHVC.txt",n=nl*3,sep=" "),nrow=nl,ncol=3,byrow=TRUE)

hmmMRAxMRC<-matrix(scan("../experiment/bgsrSnpStates_MRAxMRC.txt",n=nl*3,sep=" "),nrow=nl,ncol=3,byrow=TRUE)

hmmR12AxR12C<-matrix(scan("../experiment/bgsrSnpStates_R12AxR12C.txt",n=nl*3,sep=" "),nrow=nl,ncol=3,byrow=TRUE)

hmmLAxPRC<-matrix(scan("../experiment/bgsrSnpStates_LAxPRC.txt",n=nl*3,sep=" "),nrow=nl,ncol=3,byrow=TRUE)

hmmStates<-as.matrix(cbind(hmmHVAxHVC[,3],hmmMRAxMRC[,3],hmmR12AxR12C[,3],hmmLAxPRC[,3]))

load("../experiment/exper.rwsp") ## this contains the experiment results, the rest of the objects come from code below

q<-quantile(abs(afDiv),probs=c(0.995)) ## this yields 35 loci with hmm states out of 3654 total loci with hmm states

X<-which(abs(afDiv) >= q)

bs<-numeric(npp)

for(pp in 1:npp){

obs[pp]<-sum(hmmStates[X,pp] == 1,na.rm=TRUE)

}

pply(hmmStates[X,]==1,1,sum,na.rm=TRUE)

obsCnts<-numeric(5)

for(j in 1:5){

obsCnts[j]<-sum(cnt==(j-1))

}

nsam<-1000

Nx<-length(X)

nullCnts<-matrix(NA,nrow=nsam,ncol=5)

for(i in 1:nsam){

Y<-sample(1:nl,Nx,replace=FALSE)

cnt<-apply(hmmStates[Y,] == 1,1,sum,na.rm=TRUE)

for(j in 1:5){

nullCnts[i,j]<-sum(cnt==(j-1))

}

obsGr1<-sum(obsCnts[3:5])

nullGr1<-apply(nullCnts[,3:5],1,sum)

mean(nullGr1 >= obsGr1)

## p = 0.012