summary of variant calling with bulk samples

Post date: Sep 10, 2015 5:15:19 PM

I generated some coverage and allele frequency stats for the sperm_bulkvariants.vcf variant file using the calcIndDepth.pl script, and wrote the output to indpth_sperm_bulkvariants.vcf. Here is what I found:

Number of SNPs with data from 1, 2, 3, or 4 bulk samples (note that most have data from only 1):

1 2 3 4

1030549 53296 17437 11856
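A tally like the one above can be sketched as follows. This is only an illustration, not the calcIndDepth.pl logic: it assumes the per-sample read depths are already available as one tuple per SNP, and the function name and toy depths are made up for the example.

```python
from collections import Counter


def tally_samples_with_data(depth_rows):
    """Count SNPs by how many samples have at least one read.

    depth_rows: iterable of per-SNP tuples, one read depth per bulk sample
    (a hypothetical in-memory layout, not the real file format).
    """
    tally = Counter()
    for depths in depth_rows:
        n_with_data = sum(1 for d in depths if d > 0)
        tally[n_with_data] += 1
    return tally


# Toy depths for four bulk samples (made-up values, not from the real data).
rows = [
    (8, 0, 0, 0),
    (5, 0, 3, 0),
    (7, 4, 4, 2),
    (9, 0, 0, 0),
]
counts = tally_samples_with_data(rows)
for n_samples in sorted(counts):
    print(n_samples, counts[n_samples])
```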

Average coverage per bulk sample (bulk samples 1 to 4, in order):

7.559987 3.830562 3.788411 3.851840

Proportion of SNPs with at least one read, per bulk sample:

0.983 0.037 0.044 0.047
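As a quick sanity check, the SNP counts and the per-sample proportions should agree: the mean number of samples with data per SNP (from the counts) should equal the sum of the four proportions. The sketch below uses only the numbers reported in this post and assumes both summaries were computed over the same set of SNPs; the variable names are mine.

```python
# Number of SNPs with data from 1, 2, 3, or 4 bulk samples (from this post).
snps_by_n_samples = {1: 1030549, 2: 53296, 3: 17437, 4: 11856}

# Proportion of SNPs with at least one read, per bulk sample (from this post).
prop_with_reads = [0.983, 0.037, 0.044, 0.047]

total_snps = sum(snps_by_n_samples.values())

# Mean number of samples with data per SNP, weighted by the SNP counts.
mean_samples_per_snp = (
    sum(n * count for n, count in snps_by_n_samples.items()) / total_snps
)

# If both tables describe the same SNPs, this sum should match the mean above.
sum_of_proportions = sum(prop_with_reads)

print(total_snps)
print(round(mean_samples_per_snp, 3), round(sum_of_proportions, 3))
```

Both values come out at about 1.11, so the two summaries are consistent with each other.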

Basically all of the genome coverage comes from one bulk sample, while each of the others has reads at less than 5% of the SNPs. It turns out, though I hadn't noticed it before, that this is also what the Oxford folks found, and with the same sample (the first of the four bulk samples is their ID 189).

Given this, I guess I will subset the genotype files from all sperm to only include those variants with AF=0.5 in the bulk sample.
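A minimal sketch of that subsetting step, under some assumptions: the allele frequency is read from the AF key in the INFO column of the bulk VCF (the eighth tab-separated column), only records with a single AF value are kept, and the helper names and example records are hypothetical. If AF should instead be derived from the one bulk sample's genotype or read counts, the selection rule would need to change.

```python
def info_af(vcf_line):
    """Return the AF value(s) from the INFO column of a VCF data line, or []."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return []
    for entry in fields[7].split(";"):
        if entry.startswith("AF="):
            try:
                return [float(x) for x in entry[3:].split(",")]
            except ValueError:
                # Missing or malformed AF (for example "AF=.").
                return []
    return []


def keep_af_half(vcf_lines, target=0.5, tol=1e-6):
    """Yield header lines plus records with exactly one AF value equal to target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        afs = info_af(line)
        if len(afs) == 1 and abs(afs[0] - target) <= tol:
            yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\t.\tAC=1;AF=0.5;AN=2\n",
    "chr1\t200\t.\tC\tT\t60\t.\tAC=2;AF=1.0;AN=2\n",
    "chr1\t300\t.\tG\tA,C\t40\t.\tAC=1,1;AF=0.5,0.5;AN=2\n",
]
kept = list(keep_af_half(example))
positions = [line.split("\t")[1] for line in kept if not line.startswith("#")]
print(positions)
```

The positions kept here could then be used to pick the matching rows out of the per-sperm genotype files.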
