Post date: Nov 10, 2015 4:43:4 AM
We generated a file, diptripAlleleFreqs.txt, with non-reference allele frequencies for diploids and triploids (taken from the MLEAF column in the vcf files, with separate estimates for dips and trips). We only retained variants with an average coverage of at least 2x in one or both groups. The file has four columns: scaffold, position, DIP MLEAF, and TRIP MLEAF (NA where the variant is missing from that group). Here are a few summaries from the file.
1. We have 332,102 SNPs, with 58.8% shared between the two groups, 75.1% present in diploids, and 83.7% present in triploids.
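These percentages can be recovered from the NA pattern in the two MLEAF columns. A minimal sketch in R, using a small made-up matrix (`toy` and its values are hypothetical, not taken from the real file):

```r
# Toy stand-in for the four-column matrix
# (scaffold, position, DIP MLEAF, TRIP MLEAF); values are invented.
toy <- matrix(c(1, 101, 0.10, 0.12,
                1, 250, 0.40,   NA,
                2,  17,   NA, 0.05,
                2,  88, 0.75, 0.70),
              ncol = 4, byrow = TRUE)

in_dip  <- !is.na(toy[, 3])   # variant has a diploid estimate
in_trip <- !is.na(toy[, 4])   # variant has a triploid estimate

pct_shared <- 100 * mean(in_dip & in_trip)  # present in both groups
pct_dip    <- 100 * mean(in_dip)            # present in diploids
pct_trip   <- 100 * mean(in_trip)           # present in triploids
```

If every retained SNP has an estimate in at least one group, then shared = diploid + triploid - 100, which is consistent with the reported values (75.1 + 83.7 - 58.8 = 100.0).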
2. Overall, allele frequencies were quite similar between diploids and triploids:
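n<-332102
## read the four columns (scaffold, position, DIP MLEAF, TRIP MLEAF) into a matrix
x<-matrix(scan("diptripAlleleFreqs.txt",n=n*4,sep=" "),nrow=n,ncol=4,byrow=TRUE)
## cor.test has no na.rm argument; it drops incomplete pairs on its own
cor.test(x[,3],x[,4])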
Pearson's product-moment correlation
data: x[, 3] and x[, 4]
t = 1632.433, df = 195419, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9649304 0.9655362
sample estimates:
cor
0.9652346
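As a sanity check on this output, cor.test uses only complete pairs, so its df should equal the number of shared SNPs minus 2. Here df = 195419 implies 195,421 shared SNPs, about 58.8% of 332,102, which agrees with item 1. A small sketch of the same check with made-up vectors (`dip` and `trip` are hypothetical names and values):

```r
# Invented frequencies with some missing values in each group.
dip  <- c(0.10, 0.40,   NA, 0.75, 0.20, 0.55, 0.90)
trip <- c(0.12,   NA, 0.05, 0.70, 0.25, 0.50, 0.85)

# Number of SNPs with an estimate in both groups.
n_complete <- sum(complete.cases(dip, trip))

# cor.test silently drops incomplete pairs before testing.
ct <- cor.test(dip, trip)

# Degrees of freedom reported by the test; should be n_complete - 2.
df_reported <- unname(ct$parameter)
```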
3. Interestingly, among SNPs where diploids and triploids differed most in allele frequency, the non-reference allele was more often at higher frequency in diploids:
mean(x[,3] > x[,4],na.rm=TRUE)
[1] 0.4935908
a<-which(abs(x[,3]-x[,4]) > 0.1)
mean(x[a,3] > x[a,4],na.rm=TRUE)
[1] 0.5674498
a<-which(abs(x[,3]-x[,4]) > 0.2)
mean(x[a,3] > x[a,4],na.rm=TRUE)
[1] 0.7068966
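The same comparison can be run over several cutoffs at once. A minimal sketch, assuming a helper we call `frac_dip_higher` (a hypothetical name) and made-up frequencies rather than the real data; the 0.05 cutoff is added only for illustration:

```r
# Invented frequencies; the last SNP is missing in diploids.
dip  <- c(0.10, 0.50, 0.90, 0.30, 0.65,   NA)
trip <- c(0.12, 0.35, 0.60, 0.55, 0.57, 0.20)

# Among SNPs whose frequencies differ by more than `cutoff`,
# return the fraction where the diploid frequency is the higher one.
frac_dip_higher <- function(dip, trip, cutoff) {
  d <- dip - trip
  keep <- !is.na(d) & abs(d) > cutoff   # drop missing SNPs and small differences
  mean(d[keep] > 0)
}

cutoffs <- c(0.05, 0.1, 0.2)
res <- sapply(cutoffs, function(k) frac_dip_higher(dip, trip, k))
names(res) <- cutoffs
res
```

On the real matrix, the equivalent call would be `frac_dip_higher(x[,3], x[,4], 0.1)`, and likewise for the 0.2 cutoff.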
4. We suspect this could reflect a history of hybridization between the northern aspen clade (the source of the reference genome, we think) and the southern clade that resulted in the production of triploids, thus making the triploids more similar to the reference than the diploids are. Alternatively, the northern clade (large Ne) and southern triploids (old individuals, thus fewer generations) might have drifted less from an ancestral population than the southern diploids.