Post date: Nov 06, 2013 4:35:44 PM
I started by classifying the top 0.1% of SNPs for each population as high Fst SNPs (about 4000 SNPs per population). With this cutoff a few SNPs are high in two populations, but none are high in three or four. In fact, to get any SNPs that are high in all four populations I had to include the top 10% of SNPs (about 400,000) for each population pair as high Fst SNPs. I also tried 1%, which gives some SNPs that are high in three populations but none that are high in all four.
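
The classification and tally described above can be sketched roughly as follows. This is only an illustration, not the code actually used for the analysis: the function names (`high_fst_flags`, `tally_shared_outliers`) are hypothetical, and the Fst values are simulated independent random numbers standing in for the real per-population estimates.

```python
import random
from collections import Counter

def high_fst_flags(fst_values, top_fraction):
    """Mark the SNPs that fall in the top `top_fraction` of Fst values."""
    n_high = max(1, int(round(len(fst_values) * top_fraction)))
    # Indices of SNPs ordered from highest to lowest Fst.
    order = sorted(range(len(fst_values)),
                   key=lambda i: fst_values[i], reverse=True)
    flags = [False] * len(fst_values)
    for i in order[:n_high]:
        flags[i] = True
    return flags

def tally_shared_outliers(fst_by_pop, top_fraction):
    """Count SNPs that are high Fst in exactly 0, 1, ..., k populations."""
    flags = [high_fst_flags(values, top_fraction) for values in fst_by_pop]
    n_snps = len(fst_by_pop[0])
    counts = Counter(sum(f[i] for f in flags) for i in range(n_snps))
    return [counts.get(k, 0) for k in range(len(fst_by_pop) + 1)]

# Simulated stand-in data (NOT the real Fst estimates): 10,000 SNPs,
# four populations, values drawn independently for each population.
random.seed(1)
fst_by_pop = [[random.random() for _ in range(10000)] for _ in range(4)]

tally = tally_shared_outliers(fst_by_pop, 0.10)
for k, n in enumerate(tally):
    print(f"high Fst in {k} populations: {n} SNPs")
```

Changing `top_fraction` to 0.001 or 0.01 mimics the stricter cutoffs; with independent simulated values and this few SNPs, the four-population class is usually empty or nearly so at those cutoffs.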
When defining the top 10% of loci in each population as high Fst loci, we get the following:
Thus, significantly more loci than expected are classified as high Fst loci in 2, 3, or 4 populations (and also in no populations), with the greatest excess in 4 populations. Conversely, there are fewer loci that are high Fst in only a single population (I report one-tailed p-values, but this deficit would also be significant with a two-tailed test, and the other p-values would remain significant). These results are consistent with the hypothesis of parallelism.
That said, minor allele frequency (maf) could affect these results too. For three of the four populations (all but R12), the average maf (across all 8 populations) is weakly but positively correlated with Fst quantiles (r between 0.013 and 0.020). For R12 the correlation is actually -0.033, presumably because the average maf is not indicative of the maf in these populations, which are farther from the others. This is probably not sufficient to explain the two-fold enrichment of loci that are high Fst in all four populations (and it is not necessarily wrong anyway, since similar maf across populations is part of what allows for parallel patterns of differentiation), but it is probably worth noting in the paper.
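
For a rough sense of scale when comparing observed and expected counts, one simple baseline is to assume that high Fst status is independent across the four populations, so the number of populations in which a locus is high follows a binomial distribution with n = 4 and p = 0.1. This is an assumption made here for illustration and may not be the null model behind the p-values above (a randomization test would be another option). The total of 4,000,000 SNPs is only an approximation inferred from "0.1% is about 4000", and `expected_shared_counts` is a hypothetical helper name.

```python
from math import comb

def expected_shared_counts(n_snps, n_pops, top_fraction):
    """Expected number of loci that are high Fst in exactly k populations,
    assuming independence across populations (binomial baseline)."""
    probs = [
        comb(n_pops, k) * top_fraction**k * (1 - top_fraction)**(n_pops - k)
        for k in range(n_pops + 1)
    ]
    return [n_snps * p for p in probs]

# Assumed inputs: ~4,000,000 SNPs (approximate), 4 populations, top 10% cutoff.
expected = expected_shared_counts(4_000_000, 4, 0.10)
for k, e in enumerate(expected):
    print(f"expected loci high in {k} populations: {e:.0f}")
```

Under these assumed inputs the baseline puts roughly 400 loci in the four-population class. Observed counts can be compared against this list class by class, keeping in mind that correlated maf across populations violates the independence assumption.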