created by Geraldine_VdAuwera
on 2013-06-17
Objective
Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.
Prerequisites
Caveats
This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis design.
Steps
a. Specify which call sets the program should use as resources to build the recalibration model
For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).
This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as the Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
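As a sanity check on the percentages quoted above, a Phred-scaled prior Q converts to a probability as p = 1 - 10^(-Q/10). A quick shell sketch using standard awk (no GATK required):

```shell
# Phred-scaled prior Q -> probability that a site is true: p = 1 - 10^(-Q/10)
for q in 15 12 10 2; do
  awk -v q="$q" 'BEGIN { printf "Q%d -> %.2f%%\n", q, (1 - 10^(-q/10)) * 100 }'
done
# prints:
# Q15 -> 96.84%
# Q12 -> 93.69%
# Q10 -> 90.00%
# Q2 -> 36.90%
```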
b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.
Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.
Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.
Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.
The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
Estimation of the overall mapping quality of reads supporting a variant call.
Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -rscriptFile recalibrate_SNP_plots.R
Expected Result
This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf
Expected Result
This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
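Note that choosing a different tranche does not require rebuilding the recalibration model; only ApplyRecalibration needs to be re-run with a new --ts_filter_level. A sketch reusing this tutorial's file names, for the more sensitive 99.9 tranche (the output file name here is illustrative, not from the original tutorial):

```shell
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.9 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_99.9_raw_indels.vcf
```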
a. Specify which call sets the program should use as resources to build the recalibration model
For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.
Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.
Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.
Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.
The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.
Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
d. Determine additional model parameters
--maxGaussians 4
This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \
    -an DP \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -rscriptFile recalibrate_INDEL_plots.R
Expected Result
This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.
For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recalibrated_variants.vcf
Expected Result
This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the original recalibrated_snps_raw_indels.vcf file, but now the indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.
Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills indel truth set. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth set sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
Updated on 2018-04-04
From KinMok on 2014-08-29
Hi @Geraldine_VdAuwera,
I would like to ask for a reconfirmation that DP (coverage) should not be used in VQSR.
This is explained in “discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project”. However, a comment on this page from Jan 8 said “same parameters as shown in the above section”.
Thank you for your help.
(pardon if I have sent this twice as I cannot see my post after the first upload)
Kin
From Geraldine_VdAuwera on 2014-09-01
Hi @KinMok,
As explained in the document about VQSR training sets and arguments, DP should be used for whole genomes but not for exomes.
Also, note the caveat at the start of this document:
> This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.
From yeinhorn on 2014-09-03
Hi @Geraldine_VdAuwera,
1. Can you explain why I need a large number of samples to use VQSR? Aren’t the samples independent?
2. Does it matter if the samples are dependent (same family, same rare disease) or independent (random samples from random sources)?
3. Is there a minimum number of samples with which I can use VQSR, for exome analysis and for whole-genome analysis?
Thanks,
Yaron
From pdexheimer on 2014-09-04
@yeinhorn – The key requirement is really number of variants, not number of samples. It’s just that it’s easier to add more samples than it is to increase variation…
My understanding (which might be incomplete/wrong) is that the major underlying requirement is that you have a sufficient number of variants that overlap your set of “true” variants, so that the Gaussian model can be effectively constructed. You also need enough variants to have a meaningful “bad” set, which is also used in cluster discrimination.
So yes, it does matter if your samples are from the same family, as they’ll have fewer total variants between them. The same rare disease shouldn’t impact anything, since they should only have a couple of variants in common. I don’t think the sample requirements are hugely different between exomes and whole genomes; I try really hard not to use fewer than 30 samples. You’ll see improved performance with more samples, but I’m not sure where diminishing returns kick in.
From Geraldine_VdAuwera on 2014-09-04
To add only a tiny bit to @pdexheimer ‘s excellent answer, you typically need somewhat fewer whole genomes compared to whole exomes because the former contain more variants than the latter. However, because the known datasets used as truth sets tend to mostly comprise variants that fall within exome regions, a lot of the extra variants in whole genomes aren’t actually useful for VQSR because they don’t overlap the truth set.
It’s hard to even say how many variants you need because it depends on how the annotation features are distributed among them. If you have nice clear clustering on all dimensions, you’ll need fewer variants than if they are very spread out. And you typically don’t know that until you’ve run the program…
From Gustav on 2014-09-05
.
From Geraldine_VdAuwera on 2014-09-05
Answered here: http://gatkforums.broadinstitute.org/discussion/comment/15888/#Comment_15888
In future, please don’t post the same question twice in different places. We see all the posts that go through and answer them regardless of where they are. When you post twice, you make our job harder and it adds noise to the forum.
From SharonCox on 2014-12-09
I used the VariantRecalibrator and ApplyRecalibration tools separately for indels and SNPs. At this point, what tool should I use to get a single VCF? CombineVariants, and with which options?
From Sheila on 2014-12-09
@SharonCox
Hi,
You can use CombineVariants with simply the required inputs. There is no need to use any optional parameters.
However, in the future it may be easier to follow this tutorial: https://www.broadinstitute.org/gatk/guide/article?id=2805 which shows how to recalibrate SNPs and indels separately, but still end up with a single VCF.
-Sheila
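As a sketch of the approach Sheila describes, a minimal CombineVariants invocation might look like the following. The file names are hypothetical, and --assumeIdenticalSamples applies here because both VCFs come from splitting the same callset, so they contain the same samples:

```shell
java -jar GenomeAnalysisTK.jar \
    -T CombineVariants \
    -R reference.fa \
    --variant filtered_snps.vcf \
    --variant filtered_indels.vcf \
    --assumeIdenticalSamples \
    -o filtered_variants.vcf
```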
From Rebecca on 2015-03-05
Hi,
if I have non-model organisms (bacterial populations) with no SNP databases, is it possible to use VQSR, and does it make sense at all to use it? Or would it be more recommended to use hard filters to determine the accuracy of a variant?
From tommycarstensen on 2015-03-05
@Rebecca if you have a training set, then you can use VQSR. If not, then you can’t.
From Rebecca on 2015-03-06
@tommycarstensen Do you mean by training set a dataset where I know what SNPs are reliable to build the calibration model? I guess then I cannot use it. Thanks for the help!
From tommycarstensen on 2015-03-06
@Rebecca Yup, you need a set of truth sites for any machine learning method. See the best practices for further information. Good luck. Have fun.
From Rebecca on 2015-03-06
@tommycarstensen Thanks!
From oussama_benhrif on 2015-04-29
Hi,
If I have one sample (paired-end), can I use VQSR?
Thanks
From Sheila on 2015-04-29
@oussama_benhrif
Hi,
If your sample is whole genome, then 1 sample is enough. However, for whole exomes we recommend using 30 or more samples for VQSR.
-Sheila
From oussama_benhrif on 2015-04-30
thank you Sheila ,
my sample is whole exome
From oussama_benhrif on 2015-04-30
Hi @Sheila
Should I use a hard filter instead of VQSR?
From Sheila on 2015-04-30
@oussama_benhrif
Hi,
Yes, you can use hard filtering. http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
-Sheila
From 55816815 on 2015-04-30
I noticed that after I select SNPs (SelectVariants = SNP) and indels (SelectVariants = INDEL) out to run VQSR separately, the totals do not add up to the original (prior to selection).
original:
$ wc -l ceu_trio.noCombine.04142015.vcf
6576494 ceu_trio.noCombine.04142015.vcf
SNP:
$ wc -l ceu_trio.noCombine.04142015.snp.vcf
5438262 ceu_trio.noCombine.04142015.snp.vcf
INDEL
$ wc -l ceu_trio.noCombine.04142015.indel.vcf
1125600 ceu_trio.noCombine.04142015.indel.vcf
5438262 + 1125600 = 6563862 which is less than 6576494 (even though I am counting the headers twice, which is negligible compared to the amount of variant).
Some variants are NOT selected and will be lost. Should I recover them, and what would be the recommended strategy?
The only reason I can think of is that the MNPs were lost because I did not specify MNP while selecting SNPs. I had the impression that MNPs by default go into INDEL (or, whatever is not a SNP will be selected as “INDEL”). Does this still hold?
Thanks,
Shuoguo
From Sheila on 2015-04-30
@55816815
Hi Shuoguo,
You are missing the mixed records. The mixed records have both SNPs and indels at the site. You can specify MIXED with -selectType. See https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--selectTypeToInclude
-Sheila
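As a sketch of the suggestion above (the input name is taken from the post; the output name is hypothetical), the mixed records can be extracted with:

```shell
java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R reference.fa \
    -V ceu_trio.noCombine.04142015.vcf \
    -selectType MIXED \
    -o ceu_trio.noCombine.04142015.mixed.vcf
```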
From 55816815 on 2015-04-30
@Sheila
Thanks! But these variants should not go through VQSR steps, right?
Shuoguo
From tommycarstensen on 2015-04-30
@55816815 I’m a bit confused. The best practices do not require that you run SelectVariants. Instead you can just run VariantRecalibrator with `-mode SNP` and `-mode INDEL` on a file containing all of your called variants. See the best practices procedure above, which suggests running VR and AR in sequence twice. I think this is what most people do.
To answer your latest question. Your MIXED variants should indeed go through VQSR. They will be evaluated, when you run VR with `-mode INDEL`.
From Geraldine_VdAuwera on 2015-04-30
It seems the best practice workflow is not entirely clear for everyone, but like @tommycarstensen says you don’t need to separate out the records by type. This is explained in detail in the latest workshop presentations, which you can find on the GATK blog (pending addition to the Presentations library).
From 55816815 on 2015-05-06
@tommycarstensen
Thanks! I think I am still doing VQSR the “older way”: I always select out SNPs and INDELs into separate VCFs and run VQSR separately, then merge them back once VQSR is done.
I think the “newer” way of VQSR is easier. @Geraldine_VdAuwera I have gone through the newest workshop videos, but mainly the variant calling part, to understand the new joint genotyping. I will finish all the videos to appreciate all the good things you guys have done!
From tommycarstensen on 2015-05-06
@55816815 Personally I do it in a third way. I run two VR jobs for SNPs and indels simultaneously and then I apply the recalibration using either the unix utilities paste, sort -m and awk or Python and the function [heapq.merge()](https://docs.python.org/3/library/heapq.html#heapq.merge) on the .recal and .vcf.gz files. It’s mostly because I’m impatient and want to avoid having to wait for two VR+AR sequential runs. Perhaps a future version of GATK will allow one to use multiple .recal files. That would be cool, except it would make my code redundant and make me look like an idiot for having written it :)
From knightjimr on 2015-06-30
I admit I’m confused by this part of the filtering. I am using the best practice VQSR process (with two VR+AR sequential runs), but still worry that the MIXED variants are not evaluated. Are they evaluated and filtered when -mode INDEL is set?
Is there a best practice for the MIXED variants with hard filtering? I just found out that when I run the two SelectVariants+FilterVariants (for SNP and for INDEL), those MIXED variants are lost in the “filtered” output, and I haven’t found any text describing the best practice filtering for MIXED variants.
I just would like to ensure that, for VQSR and hard filtering, all of the raw variants are being evaluated and filtered, without any being dropped on the floor because they are not simple SNP’s or simple INDEL’s.
From pdexheimer on 2015-06-30
@knightjimr
As you may have noticed, the variant types for [SelectVariants](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--selectTypeToInclude) (INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) are different than the modes for [ApplyRecalibration](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php#--mode) (SNP, INDEL, BOTH). The VR/AR modes are much less precise, encompassing several variant types. Specifically, the SNP mode in AR/VR includes both SNPs and MNPs, and the INDEL mode includes INDEL, MIXED, and SYMBOLIC variants.
From aneek on 2015-07-01
Hi, while trying to use VariantRecalibrator tool, I got the following error message:
“Bad input: Values for SQR annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.”
Any suggestion about what this message means and how to solve this problem. I have used the same command lines as mentioned in this tutorial above.
Thanks..
P.S.- Sorry for this post. I have found my fault. I had given a wrong input as -an SQR. It should be -an SOR.
From Sheila on 2015-07-01
@aneek
Hi,
I suspect you simply mistyped SOR as SQR.
-Sheila
From aneek on 2015-07-02
Hi, Sheila,
Thank you for the correction. Another thing I would like to ask: using the -an InbreedingCoeff option is giving me an error like ‘bad input…..’. From the FAQs (https://www.broadinstitute.org/gatk/guide/article?id=1259) I came to know that this option should be used only with at least 10 samples. Since I am analyzing a single sample, I think I should omit this option from the command line. Please correct me if I am wrong.
From aneek on 2015-07-02
Hi,
I have an another query regarding VQSR function.
For SNP recalibration when I have used the command mentioned in this tutorial it gave me an error message "No data found" with Error stack trace like
ERROR stack trace
java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:408)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:156)
        at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
Then I found some information in this link: http://gatkforums.broadinstitute.org/discussion/3067/variantrecalibrator-unable-to-retrieve-result and added the "--maxGaussians 4" option to my command line, since I am doing analysis for a single sample, and it gave me the result perfectly.
After that, when I started INDEL recalibration using the commands in this tutorial, it gave me the same error messages as above, even after adding the "--maxGaussians 4" option.
Then I reduced the value of this option to 1 (i.e. --maxGaussians 1) and retried the same command line. It gave me the output smoothly without any errors.
So in short, does that mean that since I am using a single sample for the analysis, I should always use --maxGaussians 1 instead of --maxGaussians 4 in both SNP and INDEL recalibration?
If so, then I suppose that as the number of samples increases, the value of the --maxGaussians parameter should also be increased (like "--maxGaussians 2" for 2 samples, "--maxGaussians 3" for 3 samples, etc.).
Another thing: should I also use the "--validation_strictness LENIENT" parameter for lower-sample variant files?
Please suggest.
Thanks.
From Sheila on 2015-07-02
@aneek
Hi,
Yes, inbreeding coefficient should not be used on a single sample.
Are your samples whole exome or whole genome? We recommend using at least 30 whole exome samples in VQSR to get good results. However, if you are using a whole genome sample, 1 sample is enough.
-Sheila
From aneek on 2015-07-03
Hi Sheila, I am currently using a single whole exome sample. Actually we receive exome sequencing data for a single sample at a time for genetic testing purposes, and hence we have to do GATK analysis on a single exome only. It is not possible for us to wait until 30 exome samples have accumulated; otherwise we will be late giving reports to the patients. Please suggest whether the VQSR analysis parameters I mentioned in my earlier query are ok for a single exome analysis, and whether my understanding of the "--maxGaussians" parameter is correct.
If not, then kindly recommend the parameters I should use for this analysis to get at least a fair result for a single whole exome sample.
Thanks..
From Sheila on 2015-07-03
@aneek
Hi,
We do not recommend using VQSR on less than 30 whole exome samples.
Do you have samples from previous tests that you can use? The GVCF workflow is designed for cases like yours. You simply create a GVCF for each sample as you get it, then combine the GVCFs with GenotypeGVCFs each time you add a new sample. https://www.broadinstitute.org/gatk/guide/article?id=4150
https://www.broadinstitute.org/gatk/guide/article?id=3893
You can also use data from the 1000Genomes project to add more data to your VQSR input. You can get bam files from samples related to yours and run the pipeline on them. Please see this thread for more information: http://gatkforums.broadinstitute.org/discussion/3612/1000genomes-exomes-and-vqsr
Lastly, if you do not have samples from previous tests or you do not want to use 1000Genomes data, you can use hard filtering for the single exome. http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
-Sheila
From aneek on 2015-07-06
Hi Sheila,
Thank you very much for the suggestions. If I use the hard filtering option you have mentioned here, it generates the filtered SNP and INDEL files separately. Now after this, if I want to combine both of these files into a single VCF file, which command should I use?
Also, can you elaborate on the usage of the "--maxGaussians" parameter a little bit? What does it do, and why should we use only --maxGaussians 4 or more and not --maxGaussians 1, 2 or 3 (although --maxGaussians 1, 2 or 3 works well on the command line)?
Thanks..
From Sheila on 2015-07-06
@aneek
Hi,
You can use CombineVariants to merge the two filtered VCFs.
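In GATK 3, that merge could look something like this (a sketch, with placeholder file names; --assumeIdenticalSamples applies because both VCFs were derived from the same sample):

```shell
java -jar GenomeAnalysisTK.jar \
    -T CombineVariants \
    -R reference.fasta \
    --variant filtered_snps.vcf \
    --variant filtered_indels.vcf \
    --assumeIdenticalSamples \
    -o filtered_all.vcf
```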
Have a look at this article under “Can I use the variant quality score recalibrator with my small sequencing experiment?” for more information on maxGaussians. https://www.broadinstitute.org/gatk/guide/article?id=39
I hope this helps.
-Sheila
From aneek on 2015-07-07
Hi Sheila,
Thanks a lot for the information..
From knightjimr on 2015-07-09
@pdexheimer Thanks for the response. I had seen the documentation on that, but that difference I think adds to the confusion.
What I really want to know, in English, is “Does each variant called by GenotypeGVCFs get assigned a PASS or non-PASS value by the filtering process, and ideally in such a way so that I can create a single VCF file containing those calls with their assigned value”.
For the Best Practices hard filtering process, the answer is no. I have single-sample exome callsets where some raw VCF lines have a “G A,AT … 1/2:…” call (with 1/2 as the genotype). Those lines are not included by either the SNP or the INDEL filtering in the Best Practices, so in effect they disappear. In my test, if I also run SelectVariants with type MIXED on my dataset, I get the remaining variants. But there are no Best Practices for filtering those, and I don’t know for sure that this accounts for all variants in all callsets (it worked for my dataset, but does it cover everything?). (If you would like me to move this question to a hard-filtering-specific forum page, I’d be happy to do that.)
For VQSR, my question is, does the resulting VCF file, after going through the best practices process, have every variant call that was produced by GenotypeGVCFs?
From pdexheimer on 2015-07-09
@knightjimr – Yes. The VQSR INDEL mode explicitly includes MIXED variants.
From knightjimr on 2015-07-10
Thanks. Given my surprise with hard filtering, I just wanted to double check.
From everestial007 on 2015-11-10
I have finished recalibrating my BAM files (whole genome, 6 samples). The all-sites VCF calling for one sample is taking more than 3 days. Since my model organism does not have truth, training, or known-sites variant resources, I wanted to ask whether it is best to 1) hard-filter SNPs and indels, and 2) hard-filter all samples and collect the common variants, which could then be used as the truth/training resource for the VQSR process.
If anyone has come across the same or a similar problem, please let me know. How much benefit can I gain from the extra VQSR step? I am asking this because at some point I have to call “time out” on my data analysis.
Thanks much in advance.
From everestial007 on 2015-11-11
Hi @Geraldine_VdAuwera: An additional question came up. I want to call variants on the recalibrated BAM files and apply a strict filter. I can call variants on different samples from within the same population using:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R lyrata_genome.fa \
    -I recalibSAMPLE01.bam -I recalibSAMPLE02.bam -I recalibSAMPLE03.bam \
    --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 50 \
    -o confHCVariants.vcf
(as I have more than 1 sample, actually 6 samples per population).
If I instead call the variants separately on each recalibrated BAM file and then merge the VCF files (from all samples) using:
java -jar GenomeAnalysisTK.jar \
    -T CombineVariants \
    -R reference.fasta \
    --variant recalibSAMPLE1.vcf \
    --variant recalibSAMPLE2.vcf \
    --variant recalibSAMPLE3.vcf \
    -o merged.vcf \
    -genotypeMergeOptions UNIQUIFY
would the confHCVariants.vcf and merged.vcf files be equivalent? If there is any difference, what would it be? I could run both analyses and compare them myself, but variant calling for just one sample is taking so long that I have to just go ahead and ask. If they are the same, I plan to go with the former option.
Thanks,
From Sheila on 2015-11-24
@everestial007
Hi,
We recommend calling variants on all your samples together instead of calling variants on each sample separately then merging the VCFs. Have a look at this article for more information: https://www.broadinstitute.org/gatk/guide/article?id=4150
-Sheila
From Geraldine_VdAuwera on 2015-11-26
@everestial007 No, it’s not equivalent. The correct way to do it is explained in the Best Practices documentation.
To clarify @Sheila’s comment, you need to apply a joint discovery method, which can be done in two ways:
1) provide all inputs together to HC, as in your first example;
2) run HC on each sample alone in -ERC GVCF mode to produce a .g.vcf file for each, then run GenotypeGVCFs on all .g.vcf files together.
These two ways are equivalent; the first way is how people used to do it, and scales badly with sample numbers, while the second is the new way to do it which scales much better.
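As a sketch of the second way in GATK 3 (file names are placeholders):

```shell
# Step 1: produce one GVCF per sample (repeat for each BAM)
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -ERC GVCF \
    -o sample1.g.vcf

# Step 2: joint genotyping across all per-sample GVCFs
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf -V sample2.g.vcf -V sample3.g.vcf \
    -o joint_genotyped.vcf
```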
From everestial007 on 2015-12-19
I have prepared a set of high-confidence variants by unifying the variants from several samples and then selecting the common variants (shared by at least 3 samples) using hard filters. I now plan to use these variants for VQSR. However, I am encountering an issue that is not discussed anywhere in the forum.
I used the following command:
java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R lyrata_genome.fa \
    -input genotypedGvcfVARiantsMA.vcf \
    -resource:strictSelection,known=false,training=true,truth=false,prior=10.0 VARiantsMAYODANsnp.vcf \
    -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile raw.SNPs.recal \
    -tranchesFile raw.SNPs.tranches \
    -rscriptFile recalSNPs.plots.R
where genotypedGvcfVARiantsMA.vcf is the VCF called from several *.g.vcf samples using GenotypeGVCFs, and VARiantsMAYODANsnp.vcf is the set of high-confidence variants for a new population of the model organism I am working with, prepared by unifying the variants from several samples and then selecting the common variants (shared by at least 3 samples) using hard filters.
I have followed the directions on the VQSR page to include the appropriate parameters/flags, but I am getting an error message. I am posting the part of the message that is most concerning:
ERROR MESSAGE: Invalid command line: No truth set found! Please provide sets of known polymorphic loci marked with the truth=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf
Isn’t the truth set the one I provided, i.e. VARiantsMAYODANsnp.vcf?
It looks like I need to add a ROD binding tag containing the information discussed under “For example”.
Thanks in advance,
From Geraldine_VdAuwera on 2015-12-19
You need at least one set that has the property truth=true.
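In the command above, that means marking the resource as a truth set in its binding tag, e.g.:

```shell
-resource:strictSelection,known=false,training=true,truth=true,prior=10.0 VARiantsMAYODANsnp.vcf
```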
From everestial007 on 2015-12-19
> @Geraldine_VdAuwera said:
> You need at least one set that has the property truth=true.
>
>
Great, it worked. It’s annoying and surprising at the same time that sometimes such a tiny change makes everything work. Thanks a lot @Geraldine_VdAuwera for all the help you have given. Wish you “Happy Holidays”!!! :smiley:
From Geraldine_VdAuwera on 2015-12-19
Ah, but that tiny change is what determines how the data is used by the statistical model… That’s pretty important ;)
Happy holidays to you too!
From yeinhorn on 2016-01-06
Hi @Geraldine_VdAuwera,
I usually get a single exome at a time, each from a different lab which uses a different sequencing machine and different coverage.
I want to use VQSR in order to get good filtration, but since 1 exome isn’t enough to build a reliable model, there’s an option to pad it with other exomes (for example using 1000G or ExAC data as suggested, or my own other exomes).
But if I use data from different sequencers/coverages/labs, wouldn’t that dramatically change the results of the model? For example, maybe for all the 1000G samples a FisherStrand score less than 6 indicates a false positive, while for my exome it’s only less than 12.
Is it still suggested to run the exome with other data in order to have many variants, or should I use hard filters in this case? Is there preferable data I should use (my data/1000G/ExAC)?
Or, as a third option, maybe I should still try to use VQSR on the one exome, perhaps with a smaller number of Gaussians?
Thanks!
From Geraldine_VdAuwera on 2016-01-08
Having so much technical heterogeneity is definitely going to complicate things by muddling the tool’s ability to build an appropriate model. It’s hard to say if it will be so bad as to make the results worse than hard-filtering. What you can try is to plot the distribution of annotation values in your data vs the 1kg samples, and see how much divergence there is. You could decide to exclude annotations where you see dramatic differences, but still run VQSR with annotations for which the distribution is fairly similar.
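As a minimal sketch of that comparison (not GATK code; the file paths in the comments are placeholders), one could pull a single annotation such as QD out of each VCF and then histogram the two sets of values:

```python
import gzip
import re

def annotation_values(vcf_path, tag):
    """Yield the float values of one INFO annotation (e.g. 'QD') from a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    # match the tag at the start of INFO or right after a ';' separator
    pattern = re.compile(r"(?:^|;)%s=([^;]+)" % re.escape(tag))
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            info = line.rstrip("\n").split("\t")[7]  # INFO is column 8
            m = pattern.search(info)
            if m:
                yield float(m.group(1))

# e.g. (placeholder paths):
# mine = list(annotation_values("my_exome.vcf", "QD"))
# kg   = list(annotation_values("1kg_exomes.vcf", "QD"))
# ...then plot both with matplotlib and eyeball the divergence.
```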
From drtamermansour on 2016-01-11
Hi @Geraldine_VdAuwera
I am working with many whole genome sequences of dogs (different breeds, so I know we have a lot of polymorphism). I want to use VQSR, but of course we lack the human resources of known true sites like HapMap and Omni. Can I develop my own resource by applying some VERY strict hard filters to define those true sites? We could also repeat VQSR a couple of times in an iterative way, updating the true variants with those variants having a high VQSR score.
If you think this suggestion is applicable (or at least worth trying), can you suggest some hard filter cutoffs to start with?
P.S. What should a good model plot look like?
Thank you
From Sheila on 2016-01-11
@drtamermansour
Hi,
Have a look at Geraldine’s response in this thread: http://gatkforums.broadinstitute.org/gatk/discussion/comment/25693
I am working on documents describing hard filtering, but it will be a little while until they are fully complete. In the meantime, you can start out with the [hard filtering recommendations for humans](http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set) and refine them to fit your data. As for the plots, you can plot each of the annotations you are filtering on and see what the distribution looks like. It will be different for different datasets.
-Sheila
From drtamermansour on 2016-01-16
@Sheila
Thanks for the response. What I understand from Geraldine’s response is that we can apply a set of stringent hard filters to identify a set of variants that we can use as a “truth” resource. For both SNPs and indels, I used these stringent filters: “QD < 6.0 || MQ < 60.0 || MQRankSum < -1.0 || FS > 20.0 || SOR > 2.0 || ReadPosRankSum < -1.0”.
My question is: when we use this set of variants as the truth to train the model, aren’t we forcing the model to set high thresholds?
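For reference, a filter expression like the one quoted above can be applied with GATK 3 VariantFiltration along these lines (a sketch; file names are placeholders):

```shell
java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fasta \
    -V raw_variants.vcf \
    --filterExpression "QD < 6.0 || MQ < 60.0 || MQRankSum < -1.0 || FS > 20.0 || SOR > 2.0 || ReadPosRankSum < -1.0" \
    --filterName "stringent_truth" \
    -o truth_candidates.vcf
```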
From Geraldine_VdAuwera on 2016-01-18
To some extent. It may limit the tool’s ability to identify valid “minority” profiles, and if you’re using the stringent subset as sensitivity target, then that affects the results you get for a particular level of sensitivity. But you can always lower the value and retrieve more variants. And it’s still better than flat-out hard filtering.
From ekanterakis on 2016-08-18
Hello,
Just wanted to confirm some diffs I see here compared to when I last looked at Best Practices 6+ months ago:
Thanks a lot!
From Sheila on 2016-08-22
@ekanterakis
Hi,
I just asked Geraldine to confirm. She will get back to you soon.
-Sheila
From Geraldine_VdAuwera on 2016-08-25
@ekanterakis If I recall correctly those were documentation fixes that were made to harmonize what we had across several documents (there were some mismatches).
From ekanterakis on 2016-08-26
So those changes reflect the current Best Practices? As a general comment, I’m finding it hard to get a full picture of what a Best Practices workflow looks like. I can find bits and pieces by clicking through various pages, but I’m never sure whether those pages are up to date because they’ve been around for a while. It would be much more helpful if there were a Best Practices protocol attached to each release. Is there such a resource? I’m looking for the latest parameters (GATK 3.6) for BQSR, HC and VQSR for whole genomes. Thank you!
From Geraldine_VdAuwera on 2016-08-26
We’re aware this is a need and we’re working on it. The first step to that is a versioned reference implementation script. We’ve posted one for the first part of the WGS pipeline in the WDL repository and will add links to it from the Best Practices pages in the near future. Over the next few months we’ll build out the collection of reference scripts.
From mscjulia on 2016-09-28
Hello,
I hope this question is not too silly, but is it possible to use RStudio to generate the PDF plots from the Rscript .R file alone (the one I got from VQSR)? I changed the output path in the .R file because I’m doing it on a different computer, and also installed all the required libraries, but it still gave me a lot of errors. Before I go debug the errors, I would like to know whether this is doable at all. Thanks a lot.
From Sheila on 2016-10-04
@mscjulia
Hi,
Sorry for the delay, but I am not sure I understand your question. Are you saying you have RStudio but not R installed? What kind of errors are you getting?
-Sheila
From WANGxiaoji on 2016-11-17
Hello,
I saw there is 1000G phase 3 VCF data in the bundle already. Is it better to replace the phase 1 VCF file with the phase 3 VCF file in VQSR? Can I just use the new version of the file with the old command?
From Geraldine_VdAuwera on 2016-12-01
@WANGxiaoji Hi there, sorry for the late response. We are still using the Phase 1 callset in our internal production pipeline. There is some work underway to update VQSR that may update resources as well, but I don’t know for sure what will be the outcome. You can certainly try both and evaluate whether the model performs better. If you do that, consider sharing your results with the community.
From ShujunO on 2017-03-29
I work on rice. Can I use the rice dbSNP from NCBI to perform VQSR? In particular, >3M out of 13M variant records have been validated (with the VLD tag; see https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=148).
I extracted all VLD variants from the rice dbSNP and treated them as the HapMap input, then used the whole rice dbSNP file as the dbSNP input for VQSR.
Please confirm whether this is OK, or correct me if I am wrong.
Thanks!
From Geraldine_VdAuwera on 2017-04-10
Hi @ShujunO, we don’t have any experience working with rice but it’s worth a try.
From alongalor on 2017-09-05
Under “Non-true sites training resource: 1000G”, you are missing a prior likelihood (%). Thanks, Alon
From Sheila on 2017-09-06
@alongalor
Hi Alon,
Fixed, thanks :smile:
-Sheila
From Ademir on 2017-10-11
Hi, I know you have explained this before, but could you please explain in detail why we should not use DP with exome sequences?
And could you please refer me to any relevant paper?
From Sheila on 2017-10-12
@Ademir
Hi,
Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/6184/dp-parameter-vqsr-for-targeted-exome-sequencing). I am not aware of any papers that detail this, but you can probably do a google search to find some.
-Sheila
From Ademir on 2017-10-13
Hi @Sheila,
I have read the conversations you mentioned in your reply before.
But I am wondering whether this recommendation is based solely on the GATK team’s previous experience, or whether there is any publication mentioning it?
Best wishes,
Ademir
From Sheila on 2017-10-17
@Ademir
Hi Ademir,
I think it is from the testing our team has done. Hopefully you can search a bit and find some papers to help you.
-Sheila
From bwubb on 2017-10-23
This may be a premature post on my part, but I am currently quite alarmed by my results. I have used GATK VQSR many times, but right now I am looking at a recalibrated indel variant whose FILTER value is “VQSRTrancheINDEL95.00to97.00”. My ts_filter_level was set to 90. So it is not receiving a PASS value and is subsequently excluded by SelectVariants when using the -ef option.
Am I misunderstanding something?
From bwubb on 2017-10-24
> @bwubb said:
> This may be a premature post on my part, but I am currently quite alarmed by my results. I have used GATK VQSR many times, but right now I am looking at a recalibrated indel variant whose FILTER value is “VQSRTrancheINDEL95.00to97.00”. My ts_filter_level was set to 90. So it is not receiving a PASS value and is subsequently excluded by SelectVariants when using the -ef option.
>
> Am I misunderstanding something?
Premature indeed. I believe I have a better understanding now.
From Sheila on 2017-10-25
@bwubb
Hi,
Glad you figured it out :smiley:
I suspect you were thinking the lower the ts_filter_level, the higher the sensitivity. It is actually the opposite. The lower the ts_filter_level, the more specific you are (and less sensitive).
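A toy illustration of the mechanics (not GATK code): the tranche level is the percentage of truth-set variants you ask to retain, so lowering it raises the effective VQSLOD cutoff and fewer calls pass:

```python
def vqslod_cutoff(truth_vqslods, ts_level):
    """VQSLOD threshold retaining ts_level percent of truth-set variants,
    counting from the highest-scoring down (toy model of tranche selection)."""
    scores = sorted(truth_vqslods, reverse=True)
    n_keep = max(1, int(round(len(scores) * ts_level / 100.0)))
    return scores[n_keep - 1]

# made-up truth-set VQSLOD scores, just for demonstration
truth = [10.0, 8.0, 6.0, 4.0, 2.0, 0.0, -2.0, -4.0, -6.0, -8.0]
assert vqslod_cutoff(truth, 90.0) == -6.0   # stricter: keeps 9 of 10 truth sites
assert vqslod_cutoff(truth, 99.0) == -8.0   # looser: keeps all 10
assert vqslod_cutoff(truth, 90.0) > vqslod_cutoff(truth, 99.0)
```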
-Sheila
From zhaoqiang90 on 2018-03-22
What rule is used to set the prior likelihood of each training set?
From shlee on 2018-03-22
Hi @zhaoqiang90, please see https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr.
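In short, the prior is on the Phred scale, so a prior of Q corresponds to P(true) = 1 - 10^(-Q/10); that is where the percentages quoted at the top of this document come from. A quick check in Python:

```python
def phred_prior_to_prob(q):
    """Convert a Phred-scaled prior Q into a probability of being a true site."""
    return 1.0 - 10.0 ** (-q / 10.0)

# the values used for the standard VQSR resources in this document:
assert round(phred_prior_to_prob(15) * 100, 2) == 96.84  # HapMap
assert round(phred_prior_to_prob(12) * 100, 2) == 93.69  # Omni
assert round(phred_prior_to_prob(10) * 100, 2) == 90.0   # 1000G
```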
From ComputerGeneGuy on 2018-06-08
Hi,
I am a Research Master’s Student at the Royal College of Surgeon’s Ireland (RCSI). I am currently working on calling variants on a cohort of breast cancer samples. I have gotten to the stage where I have unfiltered VCFs produced by gatk4 mutect 2 and now want to run Variant Quality Score Recalibration (VQSR). I’m trying to stick as closely as possible to the GATK best practices pipeline.
My question is, when inputting the resources for VQSR, should I just copy the resources used in this tutorial? My calls were made using Hg38, so I presumably need to use resources corresponding to this. Essentially I want to know :
- Do I need this exact set of resources (Hg 38 versions) or would it be ok to drop some of them?
-Should I leave the “prior” value the same for each resource as it is in the tutorial? I don’t really understand where the prior values for this step came from
Thanks for the help,
Peter O’Donovan
From Sheila on 2018-06-18
@ComputerGeneGuy
Hi Peter,
For Mutect2 VCFs, we don’t recommend using VQSR. Have a look at [this tutorial](https://software.broadinstitute.org/gatk/documentation/article?id=11136).
-Sheila
From aschoenr on 2018-10-16
> @Sheila said:
> @everestial007
> Hi,
> We recommend calling variants on all your samples together instead of calling variants on each sample separately then merging the VCFs. Have a look at this article for more information:
> -Sheila
Hi @shlee, @Sheila,
I’m working my way through implementing a pipeline to identify de novo mutations in humans. Since I have some computational resources available to me, I’m doing this in a parallel manner, processing entire families (4-8 individuals) at a time. To this end, I implemented the joint genotyping stage to process each chromosome individually. This means that I am left with 23 or 24 chrZZ.vcf files (one for each chromosome, including X and potentially Y if there are any sons). Each of these files contains the variants from all family members on that specific chromosome. Doing it this way allowed me to run more things in parallel. Anyway, I just wanted to confirm that I should combine all of these files into a single VCF before running this stage of the pipeline. Initially I thought I should combine all of the VCFs and then split them per person, but based on the comment above I guess I’m fine to just leave them all as one combined VCF. My other option would be to run this stage on the individual chromosome VCFs as they are. This would allow for more parallelism, but I’m not sure chromosome-specific modelling would be appropriate.
My impression is that it would be recommended to combine all of this data into a single VCF before running this stage. If you could confirm that this is correct, I would appreciate it. Thanks again for all of your help.
From shlee on 2018-10-16
Hi @aschoenr,
As you surmise, it would be best to provide VQSR as much data as possible towards modeling. So having the combined VCF across samples and contigs is the best approach.
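In GATK 3, concatenating per-chromosome VCFs that contain the same samples can be done with CatVariants, roughly like this (a sketch; file names are placeholders, with one -V per chromosome):

```shell
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
    -R reference.fasta \
    -V chr1.vcf -V chr2.vcf -V chr3.vcf \
    -out combined.vcf \
    -assumeSorted
```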
From wlai on 2019-03-23
Hi @Geraldine_VdAuwera
May I know how low a tranche value can be? Could it be 50 if the Ti/Tv ratio is 2 at that tranche, with more than 50,000 variants?