Post date: Aug 22, 2013 9:1:5 PM
Recent studies of human genetic variation demonstrate that rare variants (MAF < 0.5%) are abundant, at least when many individuals are sequenced; that they better capture recent (vs. ancient) demography and population structure; that they are more geographically restricted; and that they might be more strongly associated with disease (i.e., they are more likely to be deleterious). Whereas the last point might be a consequence of the recent demographic expansion in humans, the other points should apply generally. Thus, we could learn more about population structure, admixture, and historical demography in Lycaeides by considering different classes of variants defined by minor allele frequency (MAF). Here are initial counts of variants in each of three classes for the Lycaeides admixture sequence data set, based on lycaeides_gbs_admix_d80.vcf. Each class has an associated file in projects/lycaeides_admixture/variant_calling/; note that maf_0.05_0.5_lycaeides_gbs_admix_d80.vcf and nolow_lycaeides_gbs_admix_d80.vcf are equivalent.
total number of variants = 801,218
common variants (MAF > 5%) = 28,701
low frequency variants (0.5% < MAF < 5%) = 189,043
rare variants (0.1% < MAF < 0.5%) = 349,497
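As a rough illustration of how the three classes above partition variants by MAF, here is a minimal Python sketch. The function names (minor_allele_frequency, classify_maf) are hypothetical and are not part of the actual pipeline. How variants falling exactly on a boundary (e.g., MAF = 5%) were treated is not stated above, so assigning them to the lower class is an assumption.

```python
def minor_allele_frequency(alt_freq):
    """Return the MAF given the frequency of one allele at a biallelic site."""
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return min(alt_freq, 1.0 - alt_freq)


def classify_maf(maf):
    """Assign a MAF (as a proportion, not a percentage) to a class.

    Thresholds follow the classes listed above:
      common: MAF > 5%
      low:    0.5% < MAF < 5%
      rare:   0.1% < MAF < 0.5%
    Assumption: a value exactly on a boundary goes to the lower class.
    Returns None for variants below the 0.1% cutoff.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must be between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf > 0.005:
        return "low"
    if maf > 0.001:
        return "rare"
    return None


if __name__ == "__main__":
    for freq in (0.80, 0.01, 0.002, 0.0005):
        maf = minor_allele_frequency(freq)
        print(freq, maf, classify_maf(maf))
```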
I generated genotype likelihood files from each of these vcf files. First, I grabbed a random subset of single nucleotide variants (SNVs) such that there is only one variant per GBS contig and that variant is not contiguous with another variant. This gave three vcf files:
33430 SNVs - sub_locus_ids_maf_0.001_0.005_lycaeides_gbs_admix_d80.vcf
32154 SNVs - sub_locus_ids_maf_0.005_0.05_lycaeides_gbs_admix_d80.vcf
15076 SNVs - sub_locus_ids_nolow_lycaeides_gbs_admix_d80.vcf
I then generated similarly named .gl files from these vcf files with the script subvcf2gl.pl.
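The subsetting step described above (one random SNV per GBS contig, excluding SNVs contiguous with another variant) can be sketched roughly as follows. This is not the script that was actually used. The function name pick_one_snv_per_contig and the (contig, position) input format are hypothetical, and "contiguous" is assumed here to mean that another variant sits at the immediately adjacent position on the same contig.

```python
import random
from collections import defaultdict


def pick_one_snv_per_contig(variants, seed=None):
    """Choose one random, isolated SNV per contig.

    variants: iterable of (contig_id, position) tuples.
    Returns a dict mapping contig_id -> chosen position. Contigs with no
    isolated SNV are omitted.
    """
    rng = random.Random(seed)

    # Group variant positions by contig.
    by_contig = defaultdict(set)
    for contig, pos in variants:
        by_contig[contig].add(pos)

    chosen = {}
    for contig, positions in by_contig.items():
        # Assumption: a variant is "contiguous" with another if a variant
        # exists at pos - 1 or pos + 1 on the same contig.
        isolated = sorted(
            p for p in positions
            if (p - 1) not in positions and (p + 1) not in positions
        )
        if isolated:
            chosen[contig] = rng.choice(isolated)
    return chosen


if __name__ == "__main__":
    example = [
        ("contig1", 10), ("contig1", 11), ("contig1", 40),
        ("contig2", 5), ("contig2", 6),
        ("contig3", 7),
    ]
    print(pick_one_snv_per_contig(example, seed=1))
```

In the example, contig1 keeps only position 40 (10 and 11 are adjacent to each other), contig2 is dropped because both of its variants are adjacent, and contig3 keeps its single variant.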