created by Geraldine_VdAuwera
on 2013-09-11
Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not available for organisms other than humans, and it also requires a large number of variant sites to operate properly, so it is not suitable for some small-scale experiments such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study (especially for humans -- use 1000 Genomes or ExAC!), but again, for non-human organisms it is often not possible to do this.
So, if this is your case and you are sure that you cannot use VQSR, then you will need to use the VariantFiltration tool to hard-filter your variants. To do this, you will need to compose filter expressions using JEXL, as explained here, based on the generic filter recommendations detailed below. There is a tutorial that shows how to achieve this step by step. Be sure to also read the documentation explaining how to understand and improve upon the generic hard-filtering recommendations.
Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is highly dependent on the specific callset: how it was called, what the data was like, what organism it belongs to, etc.
HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.
In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.
In addition, please note that these recommendations are mainly designed for dealing with very small datasets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking of the filtering parameters.
Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Be sure to read the documentation explaining how to understand and improve upon these recommendations.
Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.
For SNPs:
QD < 2.0
MQ < 40.0
FS > 60.0
SOR > 3.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
If your callset was generated with UnifiedGenotyper for legacy reasons, you can add HaplotypeScore > 13.0.
For indels:
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
SOR > 10.0
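As a sketch, the SNP recommendations above could be applied with VariantFiltration roughly as follows (GATK 3 syntax; the reference and VCF file names are placeholders, and the indel filters would be applied analogously on a separately selected indel VCF, as shown in the hard-filtering tutorial):

```shell
# Sketch only: file names are placeholders; adjust paths and thresholds to your data.
# First select out the SNPs, then apply the generic hard filters to them.
java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R reference.fa -V raw_variants.vcf \
    -selectType SNP -o raw_snps.vcf

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fa -V raw_snps.vcf \
    --filterExpression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "generic_snp_filter" \
    -o filtered_snps.vcf
```

Sites matching the expression get the filter name written to their FILTER column; sites that pass get PASS.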
Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.
HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.
And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?
Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)
We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.
Updated on 2016-07-29
From Geraldine_VdAuwera on 2014-09-01
Questions and comments up to August 2014 have been moved to an archival thread here:
http://gatkforums.broadinstitute.org/discussion/4562/questions-about-filtering-variants-manually
From Gustav on 2014-09-05
Hello,
I am trying to soft filter my indel calls with the variant recalibrator, but have gotten very confused.
For my SNPs everything works fine, and even the model plots for the indels look pretty good. But there is no "PASS" or "FILTERED" or anything else written in the filter column of the VCF file. It does not matter which tranche level I pick; simply nothing happens. Nothing ever gets written in the filter column, no matter how I have tweaked the settings.
My data consists of 35 whole exomes, and there are around 25,000 indel calls.
I am using the latest version of GATK.
My two command lines are:
-T VariantRecalibrator -R /references/hs.build37.1.fa -input samples_poulsen.indel.vcf \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 /RESOURCES/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /dbsnp138.vcf.gz -mode INDEL \
    -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff --maxGaussians 4
and
-T ApplyRecalibration -R /references/hs.build37.1.fa \
-input samples_poulsen.indel.vcf --ts_filter_level 99.0 \
-recalFile recalibrate_INDEL.recal \
-tranchesFile recalibrate_INDEL.tranches \
-o recal_samples_poulsen.indel.vcf
I would be very thankful for any kind of help or explanation. Thank you.
From Geraldine_VdAuwera on 2014-09-05
Hi @Gustav,
You need to specify `-mode INDEL` for ApplyRecalibration too, otherwise it only recalibrates SNPs.
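To illustrate, the ApplyRecalibration command above with the fix added would look something like this (paths as in the original post):

```shell
# Gustav's command with -mode INDEL added; without it, ApplyRecalibration
# defaults to SNP mode and leaves the indel FILTER fields untouched.
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R /references/hs.build37.1.fa \
    -input samples_poulsen.indel.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recal_samples_poulsen.indel.vcf
```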
From Gustav on 2014-09-05
Ah ok. Thank you!
From vsvinti on 2015-06-30
Above, you say:
“for exomes, a straight DP filter shouldn’t be used“
but the hard filters say
“QD < 2.0“
Since they are related, do you mean 'Do not use the QD < 2.0 filter' for exomes?
From pdexheimer on 2015-06-30
No. QD != DP
Depth is a normalizing factor used in the QD calculation, but the two metrics are not related
From vsvinti on 2015-07-01
Ok. Just curious, as I noticed a 'peak' in variants with QD < 2 that pass the VQSR annotation. So I've been doing VQSR + QD > 2 for retaining variants... this is UG on GATK 2.7, though.
From spekkio on 2015-08-06
I have a question about padding a cohort with public exomes from 1000 Genomes phase one data.
We have 8 exome samples and want to bring the total up to 30, by using 22 phase 1 1000 Genomes exomes, in order to run VQSR to find variants. I've seen it mentioned that this is something the Broad Institute does when dealing with small sample sizes, and I would like to know if you do any kind of pre-processing on the public data before you slide it into your pipeline. I've read the README (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/README.phase1_alignment_data) on how the 1000 Genomes phase 1 .bam files were prepared, and from what I can tell they are already good to go, i.e. ready to run HaplotypeCaller on in -ERC GVCF mode. Am I correct in making this assumption, or should there be some steps to further prepare the .bam files before running HaplotypeCaller on them?
Quick edit: if there is some more processing to be done, is it different per sequencing platform? I.e. some .bams were from SOLiD and some were from Illumina.
Thanks
From Geraldine_VdAuwera on 2015-08-06
Hi @spekkio,
That is indeed our recommendation for dealing with small datasets. Ideally you’d want to apply exactly the same pre-processing to all samples (yours and the 1000G samples), including using the same versions of the tools, to avoid any batch effects. The pre-processing applied to the 1000G samples is essentially the one recommended in our Best Practices, but done with a much older version of the tools. So if you can spare the compute, it is better to revert and reprocess them all in the same way. When you use the padding samples for multiple projects it’s easy to justify as a one-time investment. Unfortunately we aren’t able to offer recently reprocessed versions of the 1000G samples at this time, although something like that is on the roadmap for our future cloud-based offering, which we’re working on with Google.
From rfraser on 2015-09-14
Hi GATK Team,
Just a quick note – in the “The solution: hard-filtering” paragraph above, the “VariantFiltration” link, along with the first two “here” links, seem to be broken.
Thanks,
Russ
From Geraldine_VdAuwera on 2015-09-14
Thanks for pointing this out, @rfraser. Fixed the two VariantFiltration links; that second “here” was a deprecated doc that should no longer be linked to so I’ve removed it.
From everestial007 on 2016-01-07
Hi @Geraldine_VdAuwera
I am still working on VQSR and have tried several settings but my results are still not that good. But, here are some questions I would like to ask to help myself doing a better job.
1) How does the use of known=true vs. false affect the output of the Ti/Tv? In another thread you have said that it would not affect the variant recalibration calculation (which is obvious from the PDF -- probability density function -- plots), but it surely is affecting the reported Ti/Tv ratio, the number of tranche-specific TPs and FPs, and also the number of novel variants. Could you please provide me a little explanation of this?
2) What information do you feel would help you make informed decisions about how to set parameters? -- If it's possible, I would like an example containing detailed PDF plots for several pairs of annotations. I have tweaked the annotation values and am getting different results (different PDF plots); if I had a framework (plots, models) that I could aim for, it would be of great help.
3) Are the meanings of the annotations not clear? -- It's quite clear, I should say.
4) Would knowing more about how they are computed help you understand how you can use them? Do you want more math? -- Yes, I would like more information on the math and computation behind the process. I did some reading on GMMs, but anything else that could help would be welcome.
5) Concrete examples? -- Yes, please.
Regarding the use of the annotation DP for filtering, I have a question:
You stated that the DP threshold should be set at 5-6 sigma from the mean. But should the DP threshold be different for different biological samples (as the coverage may vary), or should it be based on the average across all samples?
From everestial007 on 2016-01-07
After selecting my SNPs I applied stringent filter to the SNP call set:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R lyrata_genome.fa -V confHCVariantsSNPsMA605.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "confHCVariantsSNPsMA605" -o filteredHCVariantsSNPsMA605.vcf
I am able to get the filtered data, but not quite able to identify the culprit that filtered a particular variant.
See a part of my output
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MA622
scaffold_1111 301 . G A 1000.77 confHCVariantsSNPsMA622 AC=1;AF=0.500;AN=2;BaseQRankSum=-6.643;ClippingRankSum=1.212;DP=445;FS=4.863;MLEAC=1;MLEAF=0.500;MQ=27.78;MQRankSum=0.754;QD=2.25;ReadPosRankSum=0.619;SOR=0.362 GT:AD:DP:GQ:PL 0/1:365,76:441:99:1029,0,11104
scaffold_1111 340 . C T 910.77 confHCVariantsSNPsMA622 AC=1;AF=0.500;AN=2;BaseQRankSum=-6.699;ClippingRankSum=0.892;DP=223;FS=42.552;MLEAC=1;MLEAF=0.500;MQ=24.63;MQRankSum=1.587;QD=4.08;ReadPosRankSum=-0.589;SOR=2.769 GT:AD:DP:GQ:PL 0/1:156,64:220:99:939,0,4906
scaffold_1106 82 . G C 168.90 PASS AC=2;AF=1.00;AN=2;DP=6;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=55.79;QD=28.15;SOR=3.611 GT:AD:DP:GQ:PL 1/1:0,5:5:15:197,15,0
scaffold_1106 314 . T C 530.77 PASS AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=56.53;QD=27.94;SOR=0.793 GT:AD:DP:GQ:PL 1/1:0,19:19:57:559,57,0
The variants that were filtered only have the filter name. Is there a way to identify the specific culprit?
Also, I can set annotation parameters in a different way when filtering (using JEXL). Could you please let me know, based on your experience, whether you had better results using the "||" (OR) combination or the "&&" (AND) combination? E.g. "A" || "B" vs. "A" && "B".
Thank you !!!
From Geraldine_VdAuwera on 2016-01-08
Hi @everestial007, we are working on developing some documentation that will address much of this, but it will take a few more weeks before it is ready to share.
If you are struggling to find out why variants were filtered out, I would recommend splitting up your filter query into multiple queries with separate filter names. This will be more informative. Combination filters are great if you know how you want to filter and you’re not worried about diagnosing individual calls, but they are not so great if you do care about individual calls or if you are fairly new to this.
Also, to help you choose filtering cutoffs, try plotting the distribution of annotation values in your data.
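As a sketch of what that split looks like (reusing the file names from the command above), each expression gets its own filter name, so the FILTER column reports exactly which test(s) a variant failed:

```shell
# Each annotation gets its own filter name; a failing site's FILTER column
# then lists the specific culprits, e.g. "QD2;MQ40" instead of one opaque name.
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R lyrata_genome.fa -V confHCVariantsSNPsMA605.vcf \
    --filterExpression "QD < 2.0"              --filterName "QD2" \
    --filterExpression "FS > 60.0"             --filterName "FS60" \
    --filterExpression "MQ < 40.0"             --filterName "MQ40" \
    --filterExpression "MQRankSum < -12.5"     --filterName "MQRankSum-12.5" \
    --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum-8" \
    -o filteredHCVariantsSNPsMA605.vcf
```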
From everestial007 on 2016-01-14
Hi @Geraldine_VdAuwera
Thanks for the info.
I have come across SnpSift to extract the annotation values out of *.vcf files. While the extraction of QUAL values has been straightforward, the extraction of other annotation values like BaseQRankSum, ClippingRankSum, etc. has been not so successful, as there are multiple annotation values within the INFO field. I cannot find which tool in GATK is appropriate for extracting annotation values and plotting them.
Also, could you please direct me to some other good examples of VQSR that you know of that could be of help?
From Geraldine_VdAuwera on 2016-01-14
Have a look at VariantsToTable.
See also the tutorial materials from the last BroadE workshop that were posted on the GATK blog in November.
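For example, a VariantsToTable call to pull out the annotations used in the generic filters might look like this (file names are placeholders; adjust the -F fields to the annotations you care about):

```shell
# Extract INFO-field annotations into a tab-separated table that can be
# loaded into R (or any plotting tool) to plot their distributions.
java -jar GenomeAnalysisTK.jar -T VariantsToTable \
    -R reference.fa -V input.vcf \
    -F CHROM -F POS -F QUAL \
    -F QD -F FS -F MQ -F MQRankSum -F ReadPosRankSum -F SOR \
    -o annotations.table
```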
From YaoLan on 2016-05-26
hi!
I’m wondering if i can use VQSR for human mitogenome? or should i use the hard-filtering for this kind of data?
thanks!
From Sheila on 2016-05-31
@YaoLan
Hi,
Are you working strictly with the mitogenome, or do you have other parts of the genome as well as the mitogenome in your data? Currently, to use VQSR, we recommend having at least 30 whole-exome samples or 1 whole-genome sample.
-Sheila
From YaoLan on 2016-05-31
> Sheila said:
> @YaoLan
> Hi,
> Are you working strictly with the mitogenome or do you have other parts of the genome as well as the mitogenome in your data. Currently, to use VQSR, we recommend having at least 30 whole exome samples or 1 whole genome sample.
> -Sheila
Thank you for your reply!
There are just 20 strictly mitogenome samples, and when I try VQSR, there is always an error message: "No overlapping contigs found". I used rCRS as the reference and 1000G.b37.vcf as the resource dataset. Does that mean I cannot use rCRS as the reference for mitogenome-only data?
From Sheila on 2016-05-31
@YaoLan
Hi,
Okay. I think the error message is because the rCRS reference uses a different naming system for the MT contig than the b37 reference. However, even if that was not the case, you cannot use VQSR for your analysis. The 20 mitogenome samples are simply not enough data for VQSR to build a good model from.
You will have to try hard filtering. We have some basic recommendations [here](http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set). But, it is up to you to determine the best cutoffs for your dataset. You may find [this article](https://www.broadinstitute.org/gatk/guide/article?id=6925) helpful as well :smile:
-Sheila
From YaoLan on 2016-06-01
@Sheila
Thank you so much! I will try hard filtering~
From sbourgeois on 2016-06-13
Hello,
you say "the DP threshold should be set at 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts," which I understand, but how does one get the mean coverage across all samples and the sigma?
Cheers,
Steph
From Sheila on 2016-06-14
@sbourgeois
Hi Steph,
You can use [DepthOfCoverage](https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php) which will give you a number of output files. You can then use those to plot your coverage values and determine the standard deviation for your data.
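As a minimal sketch, assuming you have extracted a single column of depth values into a file (e.g. the Total_Depth column cut out of the DepthOfCoverage per-locus output), awk can compute the mean, the standard deviation, and a threshold at mean + 5 sigma:

```shell
# Minimal sketch: depths.txt is assumed to hold one coverage value per line.
# Computes mean, population standard deviation, and a DP cutoff at mean + 5 sigma.
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END {
       mean = sum / n
       sd = sqrt(sumsq / n - mean * mean)
       printf "mean=%.2f sd=%.2f DP_threshold=%.2f\n", mean, sd, mean + 5 * sd
     }' depths.txt
```

The resulting DP_threshold value is what you would plug into a `DP > X` filter expression.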
-Sheila
From benjaminpelissie on 2016-07-18
I work on a non-model species, with Illumina HiSeq paired-end WGS reads. I have 98 samples that I pre-processed through the GATK Best Practices pipeline. I called variants using HaplotypeCaller (GVCF mode), and I am currently joint genotyping my entire population. My final aim is to produce a variant database to do population genomics and selection analyses. Since I do not have any known-variants database, I could not do any BQSR, and I won't be able to do VQSR either. I decided to bootstrap a database of reliable variants, as advised in the Best Practices, but I have some doubts about the approach I should opt for.
First, I know that I should hard-filter my SNPs and indels separately. But then, should I provide both resulting SNP and indel VCF files as known-variants databases at the BaseRecalibrator step, or should I only use the SNP VCF?
Second, on the page about hard-filtering variant calls, it is stated that under 10x coverage "it is virtually impossible to use manual filtering to reliably separate true positives from false positives". I'm thinking about applying a last hard-filtering step on the final callset (obtained after completing the bootstrap procedure). Would this approach improve the situation significantly, or should I really consider going for a more elaborate (and mysterious) bootstrap approach to obtain high-confidence variants for VQSR?
Third, most of my samples (88 out of 98) are from the same species, but 10 of them are from closely related species. As written previously, I was thinking about joint genotyping all my samples, since I am interested in how the inter-specific genetic diversity is also visible at the intra-specific level (e.g. to study hybridization patterns). I am guessing that it should significantly increase the number of variants to bootstrap. Do you think it might cause analytical (or technical) problems, or should I be good to go?
Thanks!
Ben
From Sheila on 2016-07-19
@benjaminpelissie
Hi Ben,
It is best to pass in both the SNPs and indels to BaseRecalibrator. Have a look at [this article](https://www.broadinstitute.org/gatk/documentation/article?id=1247) for more details.
What coverage do you have? You may take a look at [this article](https://www.broadinstitute.org/gatk/documentation/article?id=6925) for a starting place on how to adapt the hard filters.
You should be good to go calling variants on all 98 samples together. We recommend calling variants on all samples you want to analyze together. https://www.broadinstitute.org/gatk/documentation/article?id=4150
As for increasing the number of variants for BQSR, I think you should be fine, as it is better to slightly overmask sites than undermask known variant sites.
-Sheila
From benjaminpelissie on 2016-07-19
Thank you @Sheila!
So for BaseRecalibrator, I can pass the -knownSites argument twice in the same run, i.e. once for (e.g.) known_indels.vcf and once for known_snps.vcf?
I have an average coverage of 6-7x for the 88 same-species samples and 20x for the 10 other-species samples. Do you think that would be enough to go for both a bootstrap approach for BQSR and a hard filtering of the final variants (i.e. from the post-bootstrap callset)?
Ben
From Sheila on 2016-07-20
@benjaminpelissie
Hi Ben,
Yes, you can use -knownSites more than once.
I think you should be fine to use hard filtering with that coverage and that many samples. Our basic hard filtering recommendations are [here](https://software.broadinstitute.org/gatk/documentation/article?id=2806). Also, have a look at the article I linked to above for ways to tweak the filters.
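For reference, a BaseRecalibrator call passing both bootstrapped call sets might look like this (a sketch; file names are placeholders):

```shell
# Pass both the bootstrapped, hard-filtered SNP and indel call sets as known
# sites; sites in either file are excluded from the empirical error model.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fa -I sample.bam \
    -knownSites bootstrap_filtered_snps.vcf \
    -knownSites bootstrap_filtered_indels.vcf \
    -o recal_data.table
```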
Good luck!
-Sheila
From Sumudu on 2016-08-07
Hi,
I have a similar problem. I want to verify my approach since I’m very new to this.
I'm working on a non-human WGS sample. No known SNP/indel resources are available. In order to do BQSR, I'm hoping to bootstrap a known set using HaplotypeCaller and VariantFiltration, and then to use that recalibrated BAM to call variants and again perform a hard-filtering step on that VCF to get a more reliable set of variants.
Another option I'm wondering about is to take the known sites I got from the BQSR step and use them directly in VQSR with VariantRecalibrator and ApplyRecalibration, since I have already done a hard filtering at the BQSR step.
Or, I could use samtools mpileup to call variants, select only the ones with the highest quality, and then do a hard filtering to get a more reliable set of final variants.
Or, as mentioned in one of your documents, I could try several variant callers and use the variants in agreement among all of them. Will such a set be compatible with running VariantFiltration in GATK?
I appreciate your advice, because it will help me clear up the confusion I'm having.
Thanks in advance.
Best Regards
Sumudu.
From Sheila on 2016-08-10
@Sumudu
Hi Sumudu.
I think I have answered your question [here](http://gatkforums.broadinstitute.org/gatk/discussion/comment/32192/#Comment_32192).
-Sheila
From tc13 on 2016-08-19
The examples of the distributions of QC values and how these relate to VQSR are really helpful ("Understanding and adapting the generic hard-filtering recommendations"). I would be interested to know more about the expected distributions of the QC scores and the false-positive/false-negative SNP calling rates when using hard filters on whole-genome population data (rather than just trios), compared with 'known' variants from VQSR. As I'm working on a non-model organism, these resources don't exist, so more information on how to fine-tune hard filtering would be beneficial. --Tom
From Geraldine_VdAuwera on 2016-08-19
Hi Tom @tc13, I'm glad you found the examples helpful. Unfortunately we don't currently have the resources to provide the next level of information that you're asking for. I realize that it would be helpful to a lot of people who are in a situation similar to yours, and I do hold out hope that we'll be able to tackle this question eventually -- but to be blunt, I don't foresee that happening for at least another year. We just have too much on our plate right now.
From gdumont on 2017-02-23
Hi,
I'm working on a non-model organism with WGS data and I'm currently trying to apply the recalibration steps. I bootstrapped my SNPs to get the set of known variants for BQSR. Then I tried to run VQSR with, as the "true sites training resource", this same set of known SNPs; as the "non-true sites training resource", the intersection of HaplotypeCaller and another SNP caller; and finally, I didn't specify any "known sites resource, not used in training" (is that absolutely mandatory?). I attached to this comment the plots I got. Do my method of running VQSR and the results I got seem to make sense? Or not at all, and I should go for hard filtering?
A second short question: in theory, after VQSR we are not supposed to apply any other filter (like on max/min coverage, MAF, number of non-genotyped individuals per SNP...)? Because here, after this recalibration attempt, I still clearly have outliers, in terms of max coverage at least.
Thanks (once again!) in advance for your answer!
Guillaume
recalibrate_SNP_contig88_plots_2.R.pdf
recalibrate_SNP_contig88.tranches.pdf
From Sheila on 2017-02-28
@gdumont
Hi Guillaume,
It can be tricky to use VQSR when you don’t have known variant resources. Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/comment/25693#Comment_25693) for more help. If you do decide to go the hard filtering route, you may be interested in [this article](https://software.broadinstitute.org/gatk/documentation/article?id=6925).
Yes, you do not need to do any extra filtering after VQSR. VQSR does all the filtering you need :smile:
-Sheila
From gdumont on 2017-03-01
Hi Sheila,
Thanks, indeed it seems tricky... Moreover, as explained in my previous comment, after my first test of VQSR I still clearly have outlier loci (= bad SNPs) when I plot the overall coverage or Fis, for example. I guess this is not a good sign for the success of my VQSR, right?
Guillaume
From Geraldine_VdAuwera on 2017-03-01
@gdumont Our filtering practices are all tested exclusively on human data with specific resources, so it is entirely possible that VQSR is not doing a complete job on your dataset given custom resources. The next steps are up to you since we cannot provide support beyond this point. Good luck!
From gdumont on 2017-03-02
Yes, yes, I know; I meant that it's not working properly in my case, and that's very probably because of the quality of my home-made variant resources... But I'm not sure I can really do better here, and I will probably do hard filtering.
Thanks for your answers !
Guillaume
From munzmatt on 2017-03-09
Hello,
I am working on genotyping-by-sequencing data of zebrafish, and it seems that there is no truth variant set available for using VQSR. I only found VCF files at dbSNP and Ensembl. Do you provide any further training sets which I can use as a truth variant set for this model organism? Thanks for your help.
From Sheila on 2017-03-12
@munzmatt
Hi,
No, we do not provide resource files for organisms other than humans. Hopefully some other users who work with Zebrafish will jump in here.
-Sheila
From Begali on 2017-06-01
Hello,
I work on RADSeq NGS data generated for 5 samples of Cardamine hirsuta and 197 samples of Arabidopsis thaliana. I am now at the filtering step, which in my case means hard filtering, as there are no resources available for the VQSR step (nor for BQSR). I have read a lot of comments related to the same issue, but how can I handle this for my plant dataset? Also, for the plot analysis, is there any command to apply after I obtain my VCF file from the hard-filtering step? I am a little confused -- or is there a way to apply the VQSR and BQSR steps in my case? If anyone has some help or hints, please drop them here.
Thanks in advance
From Sheila on 2017-06-05
@Begali
Hi,
Have a look at [this document](https://software.broadinstitute.org/gatk/documentation/article?id=6925) which should help you with plotting and determining cutoffs for your particular dataset. There is a link in that doc to our basic human hard filtering recommendations as well.
-Sheila
P.S. You can also check out our hands-on tutorial in the presentations section which has some other helpful tips.
From Begali on 2017-06-05
@Sheila
Thanks :smiley:
Yes, I have read them and will focus on applying them to my data.