Post date: Sep 10, 2015 5:15:19 PM
I generated coverage and allele frequency stats for the sperm_bulkvariants.vcf variant file using the calcIndDepth.pl script, and wrote the output to indpth_sperm_bulkvariants.vcf. Here is what I found:
Number of SNPs with data from 1, 2, 3, or 4 bulk samples (note that most have data from only 1):
1 2 3 4
1030549 53296 17437 11856
Average coverage per bulk sample:
7.559987 3.830562 3.788411 3.851840
Proportion of SNPs with at least one read, per sample:
0.983 0.037 0.044 0.047
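For reference, here is a minimal Python sketch of how the three summaries above could be computed from a table of per-sample read depths. This is not the calcIndDepth.pl code; the function name `summarize_depths` and the input layout (one list of four depths per SNP) are hypothetical. It also assumes the average coverage is taken only over SNPs where the sample has at least one read, which the numbers above suggest but which is not stated.

```python
from collections import Counter


def summarize_depths(depths):
    """Summarize read depth per bulk sample.

    depths: list of per-SNP lists, one integer read depth per bulk sample
            (hypothetical layout, e.g. [[8, 0, 0, 0], [6, 4, 0, 0]]).
    Returns (snps_by_n_samples, mean_depth, prop_covered).
    """
    n_samples = len(depths[0])
    snps_by_n_samples = Counter()   # how many SNPs have data from k samples
    covered = [0] * n_samples       # SNPs with >= 1 read, per sample
    total_depth = [0] * n_samples   # summed depth over covered SNPs

    for row in depths:
        k = sum(1 for d in row if d > 0)
        snps_by_n_samples[k] += 1
        for i, d in enumerate(row):
            if d > 0:
                covered[i] += 1
                total_depth[i] += d

    # Assumption: mean depth is over covered SNPs only, not over all SNPs.
    mean_depth = [
        total_depth[i] / covered[i] if covered[i] else 0.0
        for i in range(n_samples)
    ]
    prop_covered = [c / len(depths) for c in covered]
    return dict(snps_by_n_samples), mean_depth, prop_covered
```

Averaging over all SNPs instead would pull the mean depth for the three sparse samples down toward zero, so the choice of denominator matters when comparing samples.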
Basically all of the coverage comes from one bulk sample, while each of the others has coverage at less than 5% of the SNPs. It turns out, though I hadn't noticed it before, that this is also what the Oxford folks found, and with the same sample (the first of the four bulk samples is their ID 189).
Given this, I guess I will subset the genotype files from all sperm to only include those variants with AF=0.5 in the bulk sample.
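A rough sketch of what that subsetting step could look like in Python follows. It assumes the bulk allele frequency is stored as an `AF` key in the INFO column of the bulk VCF, and that the sperm genotype files are tab-delimited with chromosome and position in the first two columns; both are assumptions, and the function names `sites_with_af` and `subset_by_sites` are made up for illustration.

```python
def sites_with_af(vcf_lines, target=0.5, tol=1e-9):
    """Return the set of (chrom, pos) whose INFO AF equals target.

    Assumes a single-valued AF key in the INFO column (column 8);
    records without AF or with multi-allelic AF lists are skipped.
    """
    keep = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        af = info.get("AF")
        if af is None or "," in af:
            continue
        try:
            value = float(af)
        except ValueError:
            continue
        if abs(value - target) <= tol:
            keep.add((fields[0], fields[1]))
    return keep


def subset_by_sites(lines, keep):
    """Keep header lines plus records whose (chrom, pos) is in keep.

    Assumes chromosome and position are the first two tab-separated columns.
    """
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) >= 2 and (fields[0], fields[1]) in keep:
            out.append(line)
    return out
```

With coverage of around 7.5 reads in the one usable bulk sample, an exact AF of 0.5 is a strict filter; the `tol` parameter is there in case a looser window around 0.5 turns out to be more practical.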