created by Sheila
on 2016-02-03
This document aims to provide insight into the logic of the generic hard-filtering recommendations that we provide as a substitute for VQSR. Hopefully it will also serve as a guide for adapting these recommendations or developing new filters that are appropriate for datasets that diverge significantly from what we usually work with.
Hard-filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants that have annotation values above or below the set thresholds. By annotations, we mean properties or statistics that describe for each variant e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation, and so on.
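To make this concrete, here is a minimal sketch in R of what hard-filtering amounts to, assuming the annotation values have been exported to a tab-delimited table (for example with the VariantsToTable tool); the file name is hypothetical and the thresholds are the generic SNP recommendations discussed later in this article:

```r
# Minimal hard-filtering sketch (not a replacement for VariantFiltration).
# Assumes annotations were exported to a table, e.g. with GATK VariantsToTable;
# "trio_snps.table" is a hypothetical file name.
snps <- read.table("trio_snps.table", header = TRUE)
fail <- with(snps, QD < 2.0 | FS > 60.0 | SOR > 3.0 | MQ < 40.0 |
                   MQRankSum < -12.5 | ReadPosRankSum < -8.0)
# Variants with a missing annotation end up as NA here; in practice you must
# decide whether missing values should count as passing or failing.
table(fail, useNA = "ifany")
```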
The problem with this approach is that it is very limiting: it forces you to look at each annotation dimension individually, so you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep the good ones.
In contrast, VQSR is more powerful because it uses machine-learning algorithms to learn from the data what are the annotation profiles of good variants (true positives) and of bad variants (false positives) in a particular dataset. This empowers you to pull out variants based on how they cluster together along different dimensions, and liberates you to a large extent from the linear tyranny of single-dimension thresholds.
Unfortunately this method requires a large number of variants and well-curated known variant resources. For those of you working with small gene panels or with non-model organisms, this is a deal-breaker, and you have to fall back on hard-filtering.
In this article, we illustrate how the generic hard-filtering recommendations we provide relate to the distribution of annotation values we typically see in callsets produced by our variant calling tools, and how this in turn relates to the underlying physical properties of the sequence data.
We also use results from VQSR filtering (which we take as ground truth in this context) to highlight the limitations of hard-filtering.
We do this in turn for each of six annotations that are highly informative among the recommended annotations: QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum. The same principles can be applied to most other annotations produced by GATK tools.
Origin of the dataset
We called variants on a whole-genome trio (samples NA12878, NA12891, NA12892, previously pre-processed) using HaplotypeCaller in GVCF mode, yielding a gVCF file for each sample. We then joint-genotyped the gVCFs using GenotypeGVCFs, yielding an unfiltered VCF callset for the trio. Finally, we ran VQSR on the trio VCF, yielding the filtered callset. We will be looking at the SNPs only.
Plotting methods and interpretation notes
All plots shown below are density plots generated using the ggplot2 library in R. The annotation values are on the x-axis and the density is on the y-axis. The total area under each curve is equal to 1, and the probability of observing an annotation value within a given range is the area under the curve over that range; for example, the probability of observing a value between 0 and 1 is the area under the curve between 0 and 1.
In plain English, this means that the plots show you, for a given set of variants, the distribution of their annotation values. The caveat is that when we compare two or more sets of variants on the same plot, we have to keep in mind that they may contain very different numbers of variants, so the number of variants in a given part of the distribution is not directly comparable; only their proportions are comparable.
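If you want to reproduce this kind of plot, here is a minimal ggplot2 sketch, assuming the annotation values and VQSR filter status have been exported to a table (the file and column names are hypothetical):

```r
library(ggplot2)
# Hypothetical input: one row per variant, with a QD column and a Filter
# column (e.g. "PASS" vs "FAIL"), exported with GATK VariantsToTable.
snps <- read.table("trio_snps.table", header = TRUE)
ggplot(snps, aes(x = QD, fill = Filter)) +
  geom_density(alpha = 0.5) +    # overlapping density curves per filter status
  labs(x = "QD", y = "density")
```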
QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of the non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused by deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
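As a back-of-the-envelope illustration of the definition (all numbers below are made up):

```r
# QD = QUAL / (sum of unfiltered depths of the non-hom-ref samples).
qual <- 1062       # variant confidence from the QUAL field (made-up value)
dp   <- c(35, 40)  # unfiltered depths of the two non-hom-ref samples (made up)
qual / sum(dp)     # QD of about 14.2
```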
The generic filtering recommendation for QD is to filter out variants with QD below 2. Why is that?
First, let’s look at the distribution of QD values for the unfiltered variants. Notice the values can be anywhere from 0 to 40. There are two peaks where the majority of variants sit (around QD = 12 and QD = 32). These two peaks correspond to variants that are mostly observed in heterozygous (het) versus mostly homozygous-variant (hom-var) states, respectively, in the called samples. This is because hom-var samples contribute twice as many reads supporting the variant as het samples do. We also see, to the left of the distribution, a "shoulder" of variants with QD hovering between 0 and 5.
We expect to see a similar distribution profile in callsets generated from most types of high-throughput sequencing data, although the values at which the peaks form may vary.
Now, let’s look at the plot of QD values for variants that passed VQSR and those that failed VQSR. Red indicates the variants that failed VQSR, and blue the variants that passed VQSR.
We see that the majority of variants filtered out correspond to that low-QD "shoulder" (remember that since this is a density plot, the y-axis indicates proportion, not number of variants); that is what we would filter out with the generic recommendation of the threshold value 2 for QD.
Notice however that VQSR has failed some variants that have a QD greater than 30! All those variants would have passed the hard filter threshold, but VQSR tells us that these variants looked artifactual in one or more other annotation dimensions. Conversely, although it is not obvious in the figure, we know that VQSR has passed some variants that have a QD less than 2, which hard filters would have eliminated from our callset.
FisherStrand (FS)

This is the Phred-scaled probability that there is strand bias at the site. Strand bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.
Note: SB, SOR and FS are related but not the same! They all measure strand bias (a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other) in different ways. SB gives the raw counts of reads supporting each allele on the forward and reverse strand. FS is the result of using those counts in a Fisher's Exact Test. SOR is a related annotation that applies a different statistical test (using the SB counts) that is better for high coverage data.
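To illustrate the relationship between SB and FS, here is the underlying Fisher's Exact Test computed in R on hypothetical SB-style counts:

```r
# Hypothetical counts of reads per allele and strand (the SB layout).
sb <- matrix(c(40, 38,   # ref allele: forward, reverse
                2, 21),  # alt allele: forward, reverse
             nrow = 2, byrow = TRUE,
             dimnames = list(allele = c("ref", "alt"),
                             strand = c("fwd", "rev")))
p <- fisher.test(sb)$p.value
-10 * log10(p)  # FS is the Phred-scaled p-value of this test
```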
Let’s look at the FS values for the unfiltered variants. The FS values have a very wide range; we made the x-axis log-scaled so the distribution is easier to see. Notice most variants have an FS value less than 10, and almost all variants have an FS value less than 100. However, there are indeed some variants with a value close to 400.
The plot below shows FS values for variants that passed VQSR and failed VQSR.
Notice most of the variants that fail have an FS value greater than 55. Our hard filtering recommendations tell us to fail variants with an FS value greater than 60. Notice that although we are able to remove many false positives by removing variants with FS greater than 60, we still keep many false positive variants. If we move the threshold to a lower value, we risk losing true positive variants.
StrandOddsRatio (SOR)

This is another way to estimate strand bias, using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons: positions there tend to be covered by reads in only one direction, and FS gives those variants a bad score. SOR takes into account the ratios of reads that cover both alleles.
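For intuition, here is the core symmetric odds ratio term computed in R on hypothetical counts. This is a sketch of the idea only, not the exact GATK implementation (which also incorporates allele-specific ratios and pseudocounts):

```r
# Hypothetical per-strand read counts for each allele.
refFw <- 41; refRv <- 39; altFw <- 3; altRv <- 22
r <- (refFw * altRv) / (refRv * altFw)
log(r + 1/r)  # symmetric odds ratio term; larger values indicate more bias
```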
Let’s look at the SOR values for the unfiltered variants. The SOR values range from 0 to greater than 9. Notice most variants have an SOR value less than 3, and almost all variants have an SOR value less than 9. However, there is a long tail of variants with a value greater than 9.
The plot below shows SOR values for variants that passed VQSR and failed VQSR.
Notice most of the variants that have an SOR value greater than 3 fail the VQSR filter. Although there is a non-negligible population of variants with an SOR value less than 3 that failed VQSR, our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test.
RMSMappingQuality (MQ)

This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squared mapping qualities at the site, which incorporates the standard deviation of the mapping qualities and therefore captures the variation in the data: a low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are spread far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
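For example, computing the root mean square of some made-up per-read mapping qualities in R:

```r
# Made-up mapping qualities for the reads covering one site.
mq <- c(60, 60, 60, 60, 60, 55)
sqrt(mean(mq^2))  # about 59.2; close to 60 when most reads map confidently
```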
Now let’s check out the graph of MQ values for the unfiltered variants. Notice the very large peak around MQ = 60. Our recommendation is to fail any variant with an MQ value less than 40.0. You may argue that hard filtering any variant with an MQ value less than 50 is fine as well. This brings up an excellent point that our hard filtering recommendations are meant to be very lenient. We prefer to keep all potentially decent variants rather than get rid of a few bad variants.
Let’s look at the VQSR pass vs fail variants. At first glance, it seems like VQSR has passed the variants in the high peak and failed any variants not in the peak.
It is hard to tell which variants passed and failed, so let’s zoom in and see what exactly is happening.
The plot above shows the x-axis from 59-61. Notice the variants in blue (the ones that passed) all have MQ around 60. However, some variants in red (the ones that failed) also have an MQ around 60.
MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Rank Sum Test for mapping qualities. It compares the mapping qualities of the reads supporting the reference allele and the alternate allele. A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value means the opposite. A value close to zero is best and indicates little difference between the mapping qualities.
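To see what the underlying test looks like, here is a rank-sum comparison in R on made-up mapping qualities; note that GATK reports a z-score approximation of this statistic rather than the raw test output:

```r
# Made-up mapping qualities of reads supporting each allele.
mq_ref <- c(60, 60, 57, 60, 60, 60)
mq_alt <- c(60, 42, 35, 29, 60, 31)
wilcox.test(mq_alt, mq_ref)  # rank-sum (Mann-Whitney) test
```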
Next, let’s look at the distribution of values for MQRankSum in the unfiltered variants. Notice the values range from approximately -10.5 to 6.5. Our hard filter threshold is -12.5. There are no variants in this dataset that have MQRankSum less than -10.5! In this case, hard filtering would not fail any variants based on MQRankSum. Remember, our hard filtering recommendations are meant to be very lenient. If you do plot your annotation values for your samples and find none of your variants have MQRankSum less than -12.5, you may want to refine your hard filters. Our recommendations are indeed recommendations that you the scientist will want to refine yourself.
Looking at the plot of pass VQSR vs fail VQSR variants, we see that variants with an MQRankSum value less than -2.5 fail VQSR. However, the region between -2.5 and 2.5 contains both pass and fail variants. Are you noticing a trend here? It is very difficult to pick a threshold for hard filtering. If we pick -2.5 as our hard filtering threshold, we still keep many variants that fail VQSR. If we try to get rid of those variants, we will lose some good variants as well. It is up to you to decide how many false positives you would like to remove from your dataset versus how many true positives you would like to keep, and to adjust your threshold based on that.
ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Rank Sum Test for site position within reads. It compares the positions of the reference and alternate alleles within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele; a positive value indicates the opposite. A value close to zero is best because it indicates there is little difference between the positions of the reference and alternate alleles in the reads.
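The same kind of test applies here. A sketch in R, assuming the position of each allele is summarized as its distance from the nearest read end (made-up values):

```r
# Made-up distances from the nearest read end for each supporting read.
dist_ref <- c(34, 50, 12, 61, 44, 27)
dist_alt <- c( 2,  5,  3,  1,  4,  6)  # alt allele seen only near read ends
wilcox.test(dist_alt, dist_ref)  # rank-sum (Mann-Whitney) test
```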
The last annotation we will look at is ReadPosRankSum. Notice the values fall mostly between -4 and 4. Our hard filtering threshold removes any variant with a ReadPosRankSum value less than -8.0. Again, there are no variants in this dataset that have a ReadPosRankSum value less than -8.0, but some datasets might. If you plot your variant annotations and find there are no variants that have a value less than or greater than one of our recommended cutoffs, you will have to refine them yourself based on your annotation plots.
Looking at the VQSR pass vs fail variants, we can see VQSR has failed variants with ReadPosRankSum values less than -1.0 and greater than 3.5. However, notice VQSR has also failed some variants whose values fall within the range that the hard filter would pass.
Updated on 2016-08-01
From jphekman on 2016-04-23
I’m having trouble with VariantFiltration (GATK 3.5) sometimes filtering as it should, sometimes (apparently) silently ignoring filters.
The following works as expected with no problems:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 60.0" -filterName QD -filter "QD < 2.0" --missingValuesInExpressionsShouldEvaluateAsFailing -o hpa-filtered2c.vcf
However the following silently does not filter as expected:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 -filterName SOR -filter "SOR > 4" -o hpa-filtered2d.vcf
I also tried this syntax with exactly the same behavior (silently does not filter):
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 --filterExpression "SOR > 4" --filterName SOR -o hpa-filtered2d.vcf
The output from the first command is here:
INFO 11:38:02,214 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:38:02,226 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 11:38:02,226 HelpFormatter - Copyright © 2010 The Broad Institute
INFO 11:38:02,226 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:38:02,229 HelpFormatter - Program Args: -T VariantFiltration -R /home/lab/dog/cf3/meta/ncbi/Sequence/WholeGenomeFasta/genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 -filterName SOR -filter SOR > 4 -o hpa-filtered2d.vcf
INFO 11:38:02,273 HelpFormatter - Executing as hekman2@bigdog.ansci.illinois.edu on Linux 4.4.6-301.fc23.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_77-b03.
INFO 11:38:02,274 HelpFormatter - Date/Time: 2016/04/23 11:38:02
INFO 11:38:02,274 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:38:02,275 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:38:02,444 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:38:02,637 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 11:38:03,045 GenomeAnalysisEngine - Preparing for traversal
INFO 11:38:03,049 GenomeAnalysisEngine - Done preparing for traversal
INFO 11:38:03,050 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 11:38:03,050 ProgressMeter -          | processed |    time |  per 1M |           |   total | remaining
INFO 11:38:03,050 ProgressMeter - Location |     sites | elapsed |   sites | completed | runtime |   runtime
INFO 11:38:33,054 ProgressMeter - 15:23491292   719835.0    30.0 s   41.0 s   15.6%   3.2 m     2.7 m
INFO 11:39:03,055 ProgressMeter - 1:91404072   1388693.0    60.0 s   43.0 s   31.3%   3.2 m     2.2 m
INFO 11:39:33,057 ProgressMeter - 25:9050344   2145604.0    90.0 s   41.0 s   44.7%   3.4 m   111.0 s
INFO 11:40:03,058 ProgressMeter - 30:39254841  2932767.0   120.0 s   40.0 s   59.1%   3.4 m    83.0 s
INFO 11:40:33,059 ProgressMeter - 4:41188029   3656216.0     2.5 m   41.0 s   76.2%   3.3 m    46.0 s
INFO 11:41:03,060 ProgressMeter - 8:26307663   4403189.0     3.0 m   40.0 s   90.0%   3.3 m    20.0 s
INFO 11:41:22,345 ProgressMeter - done         4900056.0     3.3 m   40.0 s  100.0%   3.3 m     0.0 s
INFO 11:41:22,346 ProgressMeter - Total runtime 199.30 secs, 3.32 min, 0.06 hours
INFO 11:41:24,764 GATKRunReport - Uploaded run statistics report to AWS S3
In the shell I grepped out just the passing calls:
egrep "#|PASS" hpa-filtered2c.vcf > hpa-filtered2c-passed.vcf
egrep "#|PASS" hpa-filtered2d.vcf > hpa-filtered2d-passed.vcf
Then in R I established both that calls exist which should be filtered, and that they are not filtered:
> vcf <- readVcf("hpa-filtered2c-passed.vcf", "cf3")
> metrics.pre <- data.frame(QUAL=qual(vcf), QD=info(vcf)$QD, FS=info(vcf)$FS, MQ=info(vcf)$MQ, MQRankSum=info(vcf)$MQRankSum, ReadPosRankSum=info(vcf)$ReadPosRankSum, SOR=info(vcf)$SOR)
> vcf <- readVcf("hpa-filtered2d-passed.vcf", "cf3")
> metrics.pre <- data.frame(QUAL=qual(vcf), QD=info(vcf)$QD, FS=info(vcf)$FS, MQ=info(vcf)$MQ, MQRankSum=info(vcf)$MQRankSum, ReadPosRankSum=info(vcf)$ReadPosRankSum, SOR=info(vcf)$SOR)
> metrics.pre <- read.table(file="filtered2c.txt", header=TRUE)
> notNaSOR.pre <- metrics.pre$SOR[!is.na(metrics.pre$SOR)]
> length(notNaSOR.pre)
[1] 3596470
> summary(notNaSOR.pre)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.303   3.912   4.830   6.666  16.790
> metrics.post <- read.table(file="filtered2d.txt", header=TRUE)
> notNaSOR.post <- metrics.post$SOR[!is.na(metrics.post$SOR)]
> length(notNaSOR.post)
[1] 3596470
> summary(notNaSOR.post)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.303   3.912   4.830   6.666  16.790
As you can see, though the filter specified that calls with SOR > 4 should be filtered, we still seem to have calls which passed despite SOR > 4.
Am I doing something wrong? It’s so perplexing that the same syntax works fine with QD and FS filters. Is it something about SOR? I am having the same problems with MQ, actually; I just picked SOR as a test case.
Thanks,
Jessica
From tkoyama on 2016-05-11
Hi Sheila,
Thank you for your valuable information!
Is this applicable to SNPs, INDELs, or both?
From Sheila on 2016-05-11
@jphekman
Hi Jessica,
Sorry for the late response. I think the issue with your SOR filter is that you need to make the value a float. So, instead of SOR > 4, it should be SOR > 4.0. I hope that works!
-Sheila
From Sheila on 2016-05-11
@tkoyama
Hi,
The document above is specifically for SNPs, but you can certainly plot the annotation values of indels to determine cutoffs.
-Sheila
P.S. [Here](https://www.broadinstitute.org/gatk/guide/article?id=3225) are our basic recommendations for filtering indels :smile:
From tkoyama on 2016-05-12
Hi, Sheila
Thank you for your quick response! It is quite interesting to me for validating VQSR results.
Can you share the codes used to make the figures shown here?
From jphekman on 2016-05-19
Hi Sheila — thanks for your answer; sorry it took me so long to reply (I had stopped checking). Yes, that worked great! Would it be possible for someone to edit the documentation (or better yet the code to provide an error message)? I had been quite flummoxed!
Jessica
From Sheila on 2016-05-20
@tkoyama
Hi,
Sorry we did not share the commands there. As it may take some time to add them, I recommend having a look at the tutorial [here](https://drive.google.com/a/broadinstitute.org/folderview?id=0BwTg3aXzGxEDVGtYYkFKTG84ZDA&usp=sharing&tid=0BwTg3aXzGxEDUUhUTU4xQ0oxRE0). In the Variant Filtering and Callset Evaluation worksheet, there are the basic commands you can run to produce the plots. Have a look at the entire worksheet for extra information. The data bundle is also available at the link above, so you can try running the commands on test data first :smile: The tutorials are from workshops we do, so I hope you find them useful.
-Sheila
From Sheila on 2016-05-20
@jphekman
Hi Jessica,
Ah yes. I have seen a few users ask the same question here. Let me check with the team on what to do. I suspect the main issue is that SOR is not part of our hard filtering recommendations, so we don’t have an example. But, if we did, no one would have any issues! :wink:
Also, you can always check the VCF header to confirm what type of number the annotation values take.
-Sheila
From Geraldine_VdAuwera on 2016-05-20
@Sheila, SOR is indeed part of our recommendations, see the article linked above. If you remember, for this article we only covered a subset of the recommended annotations.
@jphekman The article gives the SOR threshold in decimal form. We’ll check what the correct format should be and edit the document appropriately if needed.
From Sheila on 2016-05-20
@jphekman
@Geraldine_VdAuwera
Sorry for the confusion. Indeed the document [here](http://gatkforums.broadinstitute.org/gatk/discussion/3225/i-am-unable-to-use-vqsr-recalibration-to-filter-variants) shows the correct usage for SOR.
-Sheila
From jphekman on 2016-05-20
Thanks, Sheila and Geraldine! I do suggest that there be an error message of some sort rather than silent failure. But I am up and running again now so I’m all set, and I really appreciate your help!
Jessica
From Geraldine_VdAuwera on 2016-05-20
Glad you’re all set! I agree it would be better to emit an error message rather than silently fail to evaluate the expression. The difficulty here is that the evaluation is done by code from an external library, so adding up-front type validation would not be trivial. It’s not something that we can justify spending much time on, I’m afraid.
From sbourgeois on 2016-06-13
Hello,
so, I’m now done with variant calling on a small set of exomes (4 samples only) and am contemplating hard thresholds for variant filtering. I have a couple of questions, I’ll try to put them in order to make sure I’m clear:
1 – I guess I should produce all these plots from (a) the combined VCF, or (b) the set of BAM files used to produce the gVCF files?
2 – Is there a GATK tool that produces all these plots automatically? (sorry, I couldn’t find one) If not, is there a page with details on how to generate each of these plots?
Finally, just a wish really: it would be great to show these plots for a real-life small dataset before any filtering, explain how to choose the filters based on those plots, run the filtering, show the same plots post-filtering, then try a different filtering and again show the post-filtering plots, and finally make a decision on the final filtering values based on the “ideal plots one should aspire to” (looking like the ones on this page, I suppose).
Cheers,
Steph
From Sheila on 2016-06-14
@sbourgeois
Hi Steph,
1) You should generate the plots based on the values in your final VCF.
2) No, there are no GATK tools to produce the plots for you. However, I really hope you will find [this workshop tutorial](https://drive.google.com/a/broadinstitute.org/folderview?id=0BwTg3aXzGxEDVGtYYkFKTG84ZDA&usp=sharing&tid=0BwTg3aXzGxEDUUhUTU4xQ0oxRE0) helpful. If you look over the Callset Filtering Appendix and Callset Filtering Worksheet in the link, you will see how we produced the plots and some more of the reasoning behind why we do filtering. Also, if you would like some hands-on guided practice, you can download the data bundle there and follow along :smile:
-Sheila
From prepagam on 2016-07-27
Thanks for this article. It has a lot of useful information! One thing that remains unclear to me is how to refine the filters. I understand GATK can’t give a set of filters that works for everyone. I see that plotting the distribution of an annotation’s values will show whether a GATK-recommended hard filter value is definitely not appropriate (i.e. filtering out values less than x if all my values are > x), but what about finer tuning? If I don’t have SNP chip data for the same animal to compare to, how do I know whether a filter needs to be adjusted, or whether adjusting a filter is helping? Eyeballing a small set of sites in IGV seems too small-scale and error-prone.
From Geraldine_VdAuwera on 2016-08-01
@prepagam At this point we don’t have precise recommendations for the fine-tuning of the filters. Mainly it’s a lot of trial and error, trying different combinations and examining what effect they have on overall metrics (via e.g. VariantEval), plus sets of true positives and false positives that you’ve identified in your data and that you know should/shouldn’t be called. This part of the work takes skill and experience; there’s no real workaround for that. One thing you can do to learn how to do this is to practice on known datasets that have been well characterized — generate raw calls and try to reconstruct the published/curated dataset using hard filters.
From sbourgeois on 2016-08-01
@Sheila
Thanks a lot for your help; unfortunately the link to the tutorial doesn’t seem to work (“This folder doesn’t exist or you no longer have permission to access it.”). Is this a permission issue, or has it been moved?
Cheers,
Steph
From shlee on 2016-08-01
@sbourgeois, the folder directory structure was changed and the link is no longer valid. This link should take you to our latest workshop materials, within which you’ll find your tutorial of interest.
From helene on 2017-03-21
Hello,
I am trying to use hard filtering on one sample. However, when I plot MQ, I noticed that it is capped at 60. I briefly searched some other posts on this forum and it seems this is indeed the case for the more recent GATK releases. I just want to confirm this, and whether I should still go with MQ < 40 as the filter. Thank you.
From shlee on 2017-03-24
Hi @helene,
What do you mean by capped? Can you show us your plot?
From Begali on 2017-06-06
hi
I am new to tools such as samtools, bcftools, BWA and GATK, and I am working on RADseq data. I have reached the point of calling variants (both INDELs and SNPs), and now it is time to filter them. I could not use VQSR in GATK because it requires resource bundle files that are not available for my plant organisms (5 samples of Cardamine hirsuta and 197 samples of Arabidopsis thaliana), so I can only do hard-filtering. However, I need to plot my data to see the original distributions so I can determine the thresholds to filter on, starting from the most lenient thresholds. So I need to grep out the INFO column first, then select DP, VDB, AF1, MQ and so on, so I can obtain a file for each annotation, which will be easier to plot in R.
Thanks in advance
From Sheila on 2017-06-08
@Begali
Hi,
I think the hands on tutorial on hard filtering will help. You can find it in the presentations page [here](https://drive.google.com/drive/folders/0BwTg3aXzGxEDbDJPNmtPX3hYT0U).
-Sheila
From Begali on 2017-06-09
@Sheila
Hi,
Thanks so much for your information, I will try it now.
From mjtiv on 2017-09-14
Dear GATK,
We have a couple questions about understanding the topic of hard filtering.
https://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations
Question 1. In the QualByDepth (QD) figure you point out that the two peaks are caused by mostly heterozygous versus mostly homozygous variants. Which peak is specifically due to heterozygous and which peak is due to homozygous calls? As a follow-up, in our indel dataset the peak heights swap (left lower, right higher).
Question 2. At the following link (QualByDepth)
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_QualByDepth.php
You give a very detailed, excellent explanation of how the scores are calculated for SNPs, but how do the calculations change for indels (quality score etc.)?
Question 3. Do you have figures of VQSR Status (Fail/Pass) for Indel data for the various hard filter recommendations? The figures you posted for SNPs are extremely useful in helping understand the data better.
Thank You
Joe
From Sheila on 2017-09-19
@mjtiv
Hi Joe,
1) The left peak, with the lower QD values, corresponds to the heterozygous calls (because they have less evidence for the alt allele). The right peak corresponds to the homozygous calls (because they have more evidence for the alt allele). In your case of indels, that means you have more homozygous-variant indels than heterozygous indels.
2) I think [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/5112/qd-in-join-vcf-vs-its-components-in-gvcfs-does-not-match) will help with how QD for indels is calculated.
3) We don’t have the indel plots to share. We chose to focus on the SNPs only, because the indels tend to have different annotation distributions. They should be easy to plot however, if you have some VQSR data. You can follow the filtering tutorial [here](https://drive.google.com/drive/folders/0BwTg3aXzGxEDVHdyWFNWX0pnUTg) which gives some basic plotting commands :smiley:
-Sheila
From mjtiv on 2017-09-20
Sheila thank you for the feedback and suggestions!
From mjtiv on 2017-10-16
With reference to VQSR data, do you have, or know of a link to, an example of publicly available VQSR data? I did find one from a GATK tutorial (VQSR file), but I am looking more for a dataset that can be cited in the future.
Also, let’s say I use a well-documented publicly available VCF from an organism to extract SNPs and indels and see how they plot; would that work too, to better understand indel data behavior?
The main goal is to get plots that I can compare to raw VCF data (SNPs and indels) to validate that I am choosing the correct hard filters, especially for indels. Currently, the indels appear to form nice plots that match well with GATK’s hard filtering criteria, so I am also wondering if I am making more work for myself than is necessary.
From Sheila on 2017-10-19
@mjtiv
Hi,
ExAC used VQSR to determine the VQSLODs, but then the VQSLOD threshold (tranche) was determined and applied manually. Also, the 1000Genomes phase 3 release used VQSR. You can use any one of those for your purpose, I think.
>Also, let’s say I use a well-documented publicly available VCF from an organism to extract SNPs and indels and see how they plot; would that work too, to better understand indel data behavior?
Sure, that will work for understanding Indel annotation distributions :smile:
As for the hard filters we recommend, those were chosen based on human annotation distributions. They tend to be pretty “relaxed” and allow for high sensitivity.
-Sheila
From flow on 2018-01-17
I am working on a diploid, outcrossing non-model organism and I want to use hard-filtering to reduce false positive SNPs.
I created a density plot of QD and see three peaks: one at QD = 1-3, one at QD = 15-20, and one at QD = 30-34. The most troubling observation is that the low-QD peak consists of SNPs with a high Ts/Tv ratio (apparent when I estimate Ts/Tv in QD bins), so I’m not at all certain that these SNPs should be filtered.
I’m most interested in entertaining thoughts on this curious pattern, but in the meantime, I’m trying to fully understand QD.
It’s not clear to me why the two peaks in the QD density plot in the example above should correspond to SNPs with het and homozygous calls.
(1) Do you have a reference for the QUAL formula?
(2) I read the explanation and example calculations for QD here:
https://gatkforums.broadinstitute.org/gatk/discussion/comment/18866
and
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_annotator_QualByDepth.php
I assume QD is impacted by the number of reads supporting the ALT allele primarily through the effect it has on QUAL (the numerator in QD) and not through the denominator (the sum of all reads in 0/1 and 1/1 genotypes).
(3) However, if this is the case, wouldn’t we expect the density distribution of QD to look more like the frequency distribution of ALT alleles (which I assume is not typically bimodal)?
Any clarification and insight you might have to my curious observation would be helpful.
From Sheila on 2018-01-22
@flow
Hi,
>It’s not clear to me why the two peaks in the QD density plot in the example above should correspond to SNPs with het and homozygous calls.
The QUAL is affected by the number of reads that have the variant allele. The first peak corresponds to the het sites because they only have one alternate allele, so the number of reads supporting them will be ~half that of hom-var sites. The hom var sites have 2 alternate alleles, and have nearly double the number of reads supporting them compared to het sites, so the peak is at nearly double the value of the first peak. This is assuming you have pretty even and good quality coverage over your entire dataset.
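A back-of-the-envelope illustration of this, with all numbers made up and QUAL treated as roughly proportional to the number of ALT-supporting reads:

```r
# Same total depth at a het site and a hom-var site (made-up numbers).
dp <- 30
qual_het <- 10 * (dp / 2)  # ~half the reads carry the ALT allele
qual_hom <- 10 * dp        # ~all reads carry the ALT allele
c(het = qual_het / dp, hom_var = qual_hom / dp)  # QD roughly doubles
```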
1) If you are using version 3, [here](https://software.broadinstitute.org/gatk/documentation/article?id=7258) is an explanation of the original QUAL score calculation. In version 3.8, there is a [newQual](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--useNewAFCalculator) model that you can try out. We don’t have documentation for that yet, but it may be worth trying out and seeing if that changes your results.
2) That is correct. See the link above for more information.
3) I have not taken a look at the frequency distribution of ALT alleles so cannot comment properly on that. Have you plotted the frequency distribution of ALT alleles in your dataset?
-Sheila
From PrakkiRama on 2018-01-23
Hi,
I see some researchers using some hard-filtering on their vcf files such as:
1) using minimum cutoff of ‘X’ reads
2) using 75% of all reads supporting a variant call.
For the first one, I understand that I need to look at the DP field in the VCF file for filtering. But for the second one, could I please know if there is a parameter or tool in GATK to filter variant calls with at least 75% read support? Any useful pointers are very much appreciated.
-Prakki Rama
From Sheila on 2018-01-25
@PrakkiRama
Hi Prakki,
Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/7908/filtering-vcf-help). Note, we do not recommend filtering on depth because HaplotypeCaller takes into account other factors when deciding to emit a variant call. You can read more about the way it works in the Methods and Algorithms section.
-Sheila
From MehulS on 2019-01-21
“The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants”
Curious to know if there’s any way of incorporating multidimensionality into hard filtering that doesn’t involve sophisticated machine learning techniques. Theoretically, one could devise a methodology that filters out a variant that doesn’t meet the quality requirements of more than 3 parameters.
From MehulS on 2019-01-22
Also, what is the core quantitative logic behind the hard-filtering thresholds that are applied? From my understanding, the logic presented here seems very relativistic, i.e. one plots the distribution of variants for specific parameters and then filters out variants at the lower end of a parameter (say, QD). But surely there must be some established absolute statistics and/or standards which say that, on average, variants with, say, QD below 10 tend to be false positives? How do I know whether a specific threshold is good or bad?
To be fair, even the latter would be relativistic but it would be wrt established stats or biological facts rather than wrt its own callset.
Hope my question was clear.
Edit: I know that VQSR does it wrt truth and training datasets so does that mean hard-filtering is bound to be self-relativistic ?
From MehulS on 2019-01-22
And lastly, (last question, promise :p) how do I get the distribution plots for unfiltered variants and truth sets ? Even if I can get numerical estimates as opposed to graphs I can work with them. Is there any GATK tool to output distribution metrics from a VCF file ?
From masagis on 2019-02-11
Dear GATK,
I am running a trial for variant filtration on chr.20, before running for the whole genome.
VQSR is not an option for me, because I couldn’t find supporting files as input.
Thus, I used this script:
./gatk VariantFiltration -R Galaxy47[ARS-UCD1.2_Btau5.0.1Y.fa.gz].fasta --variant raw1.vcf.gz --filter-expression " QD < 2.0 || FS > 60 || MQ < 40 || MQRankSUM < -12.5 || ReadPosRankSum < -8.0 " --filter-name "someFilterName" -O FilteredVariantsBufCat20.vcf -L 20
Manually looking at the VCF input, I expect many variants to be retained after filtering.
However, the VCF output doesn’t contain any variants, only the header lines.
Could you help me figure out the problem?
From shlee on 2019-02-11
Hi @masagis,
Please see the up-to-date tutorial outlining steps for VQSR or hard-filtering at https://software.broadinstitute.org/gatk/documentation/article?id=23216.
From M_C on 2019-08-01
Dear GATK team,
I am very new to all this and, checking these annotations, I see that the MQRankSum value for some variants (SNPs) is… “?”. I am not sure how to explain this?
Thanks