Post date: Sep 15, 2016 6:46:47 PM
My plan is to filer the variant set from the sperm donor. Most importantly, I only want to retain SNPs with an allele frequency of 0.5 in the donor (i.e., where the donor is a heterozygote).
I started with 12,081,785 variants of which 8,169,107 were heterozygous. Aside from removing non-het sites, I also dropped SNVs that failed to meet the following criteria:
20 reads total, 30% of reads supporting the less well supported allele, BQRS, MQRS and RPRS all < 2, QD (ratio of variant confidence to non-ref. read depth) > 2, minimum mapping quality of 40, bi-allelic SNP (no indels or mult-allelic SNPs). This was done with my own script:
perl vcfFilter.pl donor.vcf
After filtering 3,852,376 high quality variants were retained and are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Variants/filteredHQhet_donor.vcf.
I wrote a second script to filter the sperm data to retain only variants that were present in the donor and that were high quality (this is the data I will use for looking at segregation distortion and recombination rates). I tried this with different cutoffs in terms of missingness of data (number of individuals with no data, out of 96, note that some are composite samples) an total coverage, here are the summaries:
I am going to go with 70/250. The filtering script (vcfFilterSperm.pl) header is below:
#### stringency variables, edits as desired
my $miss = 70;
my $minCoverage = 250; # minimum number of sequences; DP
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 3; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 3; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 3; # minimum ratio of variant confidenct to non reference read depth; QD
my $mq = 30; # minimum mapping quality; MQ
##### this set is for whole genome shotgun data
This also drops multi-allelic SNPs and INDELs. And the results are in filteredHQsprmMiss70_spermvariants.vcf