Post date: Oct 24, 2013 6:44:0 PM
I have been playing around with different ways to think about and quantify parallelism. Here is a summary of what I have done and know so far:
As I mentioned earlier, we now have no SNPs with genome-wide significant parallel Fst (measured by the composite Fst quantile). Similarly, no SNPs are genome-widesignificant for parallel allele frequency differences (measured by a similar quantile approach). A few SNPs are significant with lower significance thresholds, but any specific alternative threshold is difficult to justify.
There is no overall correlation between these two measures of parallelism (Fst vs. allele frequency difference that accounts for direction). But this of course doesn't mean that individual loci do not have high Fst and allele frequency difference composite quantiles.
Given 2, I though it would be interesting to ask whether the top (I tried 10, 100, or 1000) composite Fst quantile SNPs had significantaly elevated composite allele frequency difference quantiles. They do not.
I then asked wether the top Fst SNPs from individual population pairs (10 * 4, 100 * 4, or 1000 * 4) had significantly elevataed composite allele frequency difference quantiles. Again they did not. And interestingly the top 100 Fst SNPs for each population pair was different (there were a few overlapping SNPs in the top 1000 Fst sets). So, again, no evidence that high Fst SNPs in one or across all popualtion pairs corresponde with SNPs that show high and consistent allele frequency differences.
Finally, I asked whether the top 100 higest Fst SNPs in each popualtion pair (four separate analyses here) showed high composite Fst quantiles based on the three other population pairs (it is important to exclude the test population pair for obvious reasons). Here we finally have significant results for three of the four popualtion pairs (not R12, the 'inversion' pair). For these three pairs, the composite Fst quantile (based on the three other pairs) for the top 100 Fst SNPs is greater than expected by chance. In other words, loci with high Fst in one pair (with the exception of R12), have higher Fst for other pairs than predicted by chance.
The most recent bits of R code are below, and all objects are in the fst.rwsp (this R code is in fstanalysis.R)
## direction of parallelism
afParSnps<-af[parSnps,]
colnames(afParSnps)<-plist
## always A - C
afDifParSnps<-matrix(NA,nrow=100,ncol=4)
afDifParSnps[,1]<-afParSnps[,1]-afParSnps[,2]
afDifParSnps[,2]<-afParSnps[,4]-afParSnps[,5]
afDifParSnps[,3]<-afParSnps[,7]-afParSnps[,8]
afDifParSnps[,4]<-afParSnps[,3]-afParSnps[,6]
dirAfDif<-afDifParSnps > 0
cntDirAfDif<-apply(dirAfDif,1,sum)
samDir<-cntDirAfDif
samDir[samDir < 2]<-4-samDir[samDir < 2] ## not different than expected by chance
## different measures of parallelism
afDif<-matrix(NA,nrow=nloc,ncol=4)
afDif[,1]<-af[,1]-af[,2]
afDif[,2]<-af[,4]-af[,5]
afDif[,3]<-af[,7]-af[,8]
afDif[,4]<-af[,3]-af[,6]
qafDif<-afDif
for(i in 1:4){
qafDif[,i]<-order(afDif[,i])
}
qafDif<-qafDif/nloc
qsumAfDif<-apply(qafDif,1,sum)
cor.test(qsum,qsumAfDif) ## NOTE these are not related at all!
## Pearson's product-moment correlation
##data: qsum and qsumAfDif
##t = -0.5271, df = 4391554, p-value = 0.5981
##alternative hypothesis: true correlation is not equal to 0
##95 percent confidence interval:
##-0.0011867898 0.0006837582
##sample estimates:
## cor
##-0.000251516
nrep<-1000
minmaxq<-matrix(NA,nrow=nrep,ncol=2)
q1e6q<-matrix(NA,nrow=nrep,ncol=2)
for(n in 1:nrep){
qrep<-qafDif
for(j in 1:4){
qrep[,j]<-qrep[sample(1:dim(qrep)[1],dim(qrep)[1],replace=FALSE),j]
}
qrepsum<-apply(qrep,1,sum)
minmaxq[n,1]<-min(qrepsum)
minmaxq[n,2]<-max(qrepsum)
q1e6q[n,]<-quantile(qrepsum,probs=c(1e-6,0.999999))
}
pvminmax<-matrix(NA,nrow=nloc,ncol=2)
pvq1e6q<-matrix(NA,nrow=nloc,ncol=2)
for(i in 1:nloc){
pvminmax[i,1]<-sum(minmaxq[,1] < qsumAfDif[i])
pvminmax[i,2]<-sum(minmaxq[,2] > qsumAfDif[i])
pvq1e6q[i,1]<-sum(q1e6q[,1] < qsumAfDif[i])
pvq1e6q[i,2]<-sum(q1e6q[,2] > qsumAfDif[i])
}
## tests for enrichment
parFstSnps<-rev(order(qsum))[1:100]
parAfSnps<-rev(order(qsumAfDif))[1:100]
topFstSnps<-cbind(rev(order(fstHVAxHVChu[,3],na.last=FALSE))[1:100],rev(order(fstMR1AxMR1Chu[,3],na.last=FALSE))[1:100],rev(order(fstR12AxR12Chu[,3],na.last=FALSE))[1:100],rev(order(fstLAxPRChu[,3],na.last=FALSE))[1:100])
simMns<-numeric(nrep)
for(i in 1:nrep){
A<-sample(1:nloc,4000,replace=F)
simMns[i]<-mean(qsumAfDif[A])
}
qsumByPop<-matrix(NA,nrow=nloc,ncol=4)
for(i in 1:4){
qsumByPop[,i]<-apply(quant[,-i],1,sum)
}
simMns<-matrix(NA,nrow=nrep,ncol=4)
for(i in 1:4){
for(j in 1:nrep){
A<-sample(1:nloc,100,replace=F)
simMns[j,i]<-mean(qsumByPop[A,i])
}
}
mean(qsumByPop[topFstSnps[,1],1])
## mean comp. q for HV 1.710218
mean(qsumByPop[topFstSnps[,2],2])
## MR 1.778255
mean(qsumByPop[topFstSnps[,3],3])
## R12 1.538551
mean(qsumByPop[topFstSnps[,4],4])
## allopatric 1.669827
## p-values
mean(simMns[,1] >= 1.710218)
#0
mean(simMns[,2] >= 1.778255)
# 0
mean(simMns[,3] >= 1.538551)
# 0.237
mean(simMns[,4] >= 1.669827)
# 0.001