Post date: Sep 20, 2016 3:26:42 AM
I made a new directory for population genetic analyses of the sperm data, /uufs/chpc.utah.edu/common/home/gompert-group1/projects/timema_sperm/data, and linked the filtered sperm vcf file (with data for at least 27 individuals) to the directory (ilteredHQsprmMiss70_spermvariants.vcf). I wrote a script (extractGls.pl) to extract the scaffold number and genotype estimate (0 or 1) from the vcf file (this does not account for genotype uncertainty, but is probably fine as these are haploid cells). The results are in hqGens.txt. I then summarized the results based on quantiles of a binomial distribution with p = 0.5 (the allele frequency in the donor).
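For anyone who wants to reproduce the extraction step without the Perl script, here is a minimal Python sketch of the same idea. It is not extractGls.pl itself. The function names (`parse_genotype`, `extract_calls`) are hypothetical, and it assumes a standard tab-delimited vcf with a FORMAT column containing a GT entry, where a call of all 0 alleles maps to 0, all 1 alleles maps to 1, and anything else (missing or mixed) is treated as missing.

```python
def parse_genotype(sample_field, gt_index=0):
    """Return 0, 1, or None (missing) for one sample column of a vcf line.

    Assumption: haploid-style calls, so "0" or "0/0" -> 0 and "1" or "1/1" -> 1.
    Mixed or missing calls ("0/1", ".", "./.") are returned as None.
    """
    parts = sample_field.split(":")
    if gt_index >= len(parts):
        return None
    alleles = set(parts[gt_index].replace("|", "/").split("/"))
    if alleles == {"0"}:
        return 0
    if alleles == {"1"}:
        return 1
    return None


def extract_calls(vcf_lines):
    """Yield (scaffold, [call per sample]) for each variant line."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns on this line
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # this sketch only handles lines with a GT entry
        gt_index = fmt.index("GT")
        calls = [parse_genotype(s, gt_index) for s in fields[9:]]
        yield fields[0], calls
```

To use it on a real file, pass an open file handle, e.g. `for scaffold, calls in extract_calls(open(path)): ...`, where `path` points at the vcf.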
In the analysis I worked with 2,011 SNPs that were heterozygous in the donor, called as variable in the sperm, had data for a reasonable proportion of sperm (at least 27 sperm samples, mean = 36, i.e., still lots of missing data), and that passed a series of other quality metrics. In other words, these are the SNPs we are most confident in. These plots summarize the main points. The first plot shows the number of individuals with data (x-axis) against the quantile of the number of non-reference alleles in a binomial distribution with p = 0.5 (the expectation for a het.). Thus, you can think of values <0.025 or >0.975 as differing significantly from the expectation with no segregation distortion (~97% of SNPs fall in this category). The next plot shows the sorted binomial quantiles. Finally, the last plot shows a histogram of the non-reference allele frequency in the sperm. Two things are clear. First, most SNPs have whoppingly biased segregation patterns (almost all one allele or the other). The pattern is so extreme that I suspect it reflects the molecular methods, not reality (or reality is really damn wonky). Second, we see an excess of non-reference alleles. I think this simply reflects the fact that most SNPs heterozygous in the donor were fixed or nearly fixed for one or the other allele in the sperm, and those that were fixed for the non-reference allele were more likely to be called as SNPs than those fixed or nearly fixed for the reference allele.
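The binomial summary for a single SNP can be sketched as follows. This is a sketch under assumptions, not the code I ran. It takes "quantile" to mean the lower-tail probability P(X <= k) for k non-reference alleles among n sperm with data, and the function names (`binom_cdf`, `summarize_snp`) are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """Lower-tail binomial probability P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def summarize_snp(calls, p=0.5, lower=0.025, upper=0.975):
    """Summarize one SNP from a list of calls (0, 1, or None for missing).

    Returns (n, k, quantile, flagged): n sperm with data, k non-reference
    alleles, the binomial quantile of k, and whether it falls in either tail.
    """
    observed = [c for c in calls if c is not None]
    n = len(observed)
    k = sum(observed)
    q = binom_cdf(k, n, p)
    return n, k, q, (q < lower or q > upper)
```

For example, a SNP with 27 sperm that all carry the reference allele gives a quantile of 0.5^27, far below 0.025, so it would be flagged as distorted under these cutoffs.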