Post date: Aug 19, 2014 11:39:0 PM
I am using a series of variant annotations to filter out lower-quality variants. These annotations are defined and described by the GATK folks here under '5. Understanding annotations'.
Some of the 'standards' (i.e. defaults that are commonly used but don't seem to have clear justifications) for discarding variants are: QD < 2.0, MQ < 40.0, FS > 60.0, HaplotypeScore > 13.0, MQRankSum < -12.5, ReadPosRankSum < -8.0. I looked at the distribution of some of these annotations in a subset of the data and tried to arrive at meaningful thresholds from first principles (i.e. all of the rank sum annotations are values from a standard normal distribution and thus values of +- 2 or 3 are somewhat appropriate). Thus, several of my choices differ from the standards.
I wrote a script for filtering variants (gatk can do this, but I would rather have complete control and it runs fast), which is projects/timema_wgexperiment/variants/vcfFilter.pl. Here are the requirements I used to retain variants (this is directly from the script, and all must be met):
my $minCoverage = 500; # minimum number of sequences; DP
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $dels = 0.01; # maximum prop. of reads spanning an indel; DEL
my $mq0 = 5; # maximum number of mapping quality 0 reads; MQ0
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confident to non reference read depth; QD
my $mq = 50; # minimum mapping quality; MQ
The filtered vcf files are being written to projects/timema_wgexperiment/variants/filtered_variants*.vcf.
Here are the numbers of retained variants (total = 8,539,653):