Which training sets arguments should I use for running VQSR

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by rpoplin

on 2012-08-02

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

Explanation of resource datasets
Important notes about annotations
Important notes about exome experiments
Argument recommendations for VariantRecalibrator
Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.

Resources for SNPs

*True sites training resource: HapMap *
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).
*True sites training resource: Omni *
This resource is a set of polymorphic SNP sites produced by the Omni geno- typing array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this re- source may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).
Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

True sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper that should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.
The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that includes many closely related samples (such as a family) please omit this annotation from the command line.

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.
You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.

Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R path/to/reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -nt 4 \ [SPECIFY TRUTH AND TRAINING SETS] \ [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \ [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \ -mode SNP \

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

--maxGaussians 4 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf\ -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \ -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx3g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -tranchesFile path/to/input.tranches \ -recalFile path/to/input.recal \ -o path/to/output.recalibrated.filtered.vcf \ [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \ [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \

#### SNP specific recommendations For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.5 \ -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.0 \ -mode INDEL \

Updated on 2018-04-04

From Geraldine_VdAuwera on 2014-05-22

Comments made before 22 May 2014 have been moved to http://gatkforums.broadinstitute.org/discussion/4204/questions-about-vqsr-arguments

From frankfeng on 2014-06-26

Thank you very much Geraldine and GATK team for making these good documentation!

In this version here I see argument “-tranche” is not used for command “VariantRecalibrator” anymore. In the previous version it was used as shown at http://www.broadinstitute.org/gatk/guide/article?id=2805

Could you please confirm that for me? Thanks!

From frankfeng on 2014-06-26

> @frankfeng said:

> Thank you very much Geraldine and GATK team for making these good documentation!

> In this version here I see argument “-tranche” is not used for command “VariantRecalibrator” anymore. In the previous version it was used as shown at http://www.broadinstitute.org/gatk/guide/article?id=2805

> Could you please confirm that for me? Thanks!

I just checked the documentation for that “-tranche” argument, and saw “The default values are 100.0, 99.9, 99.0, and 90.0 “.

Sorry for asking a question which actually was already answered by the documentation.

From lorenzotattini on 2014-09-08

Dear Geraldine,

I found two conflicting suggestions in this page concerning the resources to be exploited for VQSR:

TUTORIAL -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \

FAQ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \

Which one shall we use?

Thank you in advance for your help!

Best,

Lorenzo

From Geraldine_VdAuwera on 2014-09-08

The FAQ supersedes the tutorial (all tutorial should be taken as usage examples only). Sorry for the confusion, we’ll fix that ASAP.

From tommycarstensen on 2014-09-10

The current documentation on InbreedingCoeff is somewhat sparse:

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_InbreedingCoeff.html

If I have 2000 samples – among which there are a lot of highly related samples – is it then best practice not to feed the InbreedingCoeff annotation to VR? I have a lot more than 10 founder samples among the 2000 samples.

I am happy to be pointed in the right direction: “See the 1000 Genomes Phase I release for more information.” I read the supplementary material to the phase 1 paper, but it did not illuminate me any further.

This is the final question from me in a long time. I promise.

From Geraldine_VdAuwera on 2014-09-11

Sparse is a nice way to put it :)

I honestly do not know and will need to put this to our methods people. Can you give a little more detail about the composition of your cohort? I.e. X number of families of N members on average etc., which should allow them to give you some kind of recommendation.

From tommycarstensen on 2014-09-12

@Geraldine_VdAuwera, these are households with children, parents and grandparents living in the same household. The household size distribution for the 1986 samples is as follows (count, household size):

141 1 87 2 70 3 64 4 35 5 29 6 28 7 22 8 16 9 10 10 8 11 4 12 3 13 1 14 1 15 1 17 1 19

I would expect at least half of the samples to be related. I haven’t calculated the IBD value distribution yet to confirm this. Our current decision is to not use the InbreedingCoeff annotation for VR, but I might not have understood exactly how it works. Thank you.

From Geraldine_VdAuwera on 2014-09-15

@tommycarstensen The word from the methods devs is that your decision to not use InbreedingCoeff for VQSR is correct given the degree of relatedness of your subjects.

From amywilliams on 2014-09-23

The resource bundle contains two versions of dbSNP: one with only variants up through version 129, the other containing all variants through version 138. Is there a recommendation on which of these to use? I’m going to guess that the full 138 version is best given that this resource is given a low prior and seems to used in order to increase sensitivity (thus it should be highly inclusive, even of low quality variants).

From Geraldine_VdAuwera on 2014-09-23

Hi @amywilliams,

There’s not really any hard recommendation. The dbSNP resource doesn’t really have any impact on sensitivity. Its main use is for masking out true variation for BQSR, and a for providing known indels for realignment; for those steps there is enough in the 129 version to serve their purpose. For variant calling itself and for filtering (VQSR) dbSNP is not used at all for any calculations, it’s just used to annotate rsIDs, for informative purposes. It can be good to use the later version so you are aware of all that has been previously reported, but the downside is that this may cause you to overestimate your true positives, since the content of dbSNP is not curated very strictly and includes quite a few false positives.

From bwubb on 2014-12-01

Is there a minimum number of annotations needed for VQSR or maybe I should ask how dropping annotations will affect VQSR?

In our cancer/cell line studies of which they are smaller 3.MB captures we have dropped InbreedingCoeff and HaplotypeScore from the hard filter recommendations. We then in turn dropped them from VQSR in our exome filtering. Recently, ReadPosRankSumTest hard filters have been modified; variants at ~80% AF seemed to introduce some artificial bias to that test. Our cutoff was adjusted to -10 from -8 and this was a big deal in terms of what was called. I am now concerned with how ReadPosRankSumTest is affecting the VQSR in our tumor data, but I do not understand enough to determine if additionally removing it from VQSR is a terrible idea. All we would be left with would be `-an QD -an MQ -an MQRankSum -an FS -an DP`

From Geraldine_VdAuwera on 2014-12-01

@bwubb There’s no minimum number of annotations required for VQSR. The idea is that more annotations should give you finer resolution (although too many may destabilize the model) but our recommendations are not designed for cancer analysis so some of the recommended annotations may not be helpful, and indeed may be counterproductive on that data type. Ultimately the best way to know what works for your dataset is to test different configurations.

From jacobhsu on 2015-01-06

Could you please recommend the appropriate scenario for whole genome sequence data ? The coverage is about ~50 in average, the sample size less than 10. I have followed the best practice and finished the Haplotype caller and VQSR SNP specific recommendations without any error from this page https://www.broadinstitute.org/gatk/guide/article?id=1259.

However, the Ti/Tv ratio looks very strange to me. The Ti/Tv ratios from 100 ,99.9 ,99 ,95 ,90 are (1.345, 1.52, 1.69, 1.843, 1.81)

Is there anything I missing ?

From Sheila on 2015-01-06

@jacobhsu‌

Hi,

You are correct that the Ti/Tv ratios look low.

What version of GATK did you use? Can you please post the exact command lines you used? Also, have you looked at the data quality beyond just average coverage?

Thanks,

Sheila

From jacobhsu on 2015-01-07

Thanks Sheila, here is the command I tried. I’m using GTK3.2-2 version. This command can be run perfectly on my other exom sequencing projects. Could you please advice more about the data quality that I might have to check ?

java -Xmx20g -jar $gatk

-T VariantRecalibrator

-R $reference

-input $wkdir/my.vcf

-nt 4

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $reference_dir/hapmap_3.3.hg19.vcf

-resource:omni,known=false,training=true,truth=true,prior=12.0 $reference_dir/1000G_omni2.5.hg19.vcf

-resource:1000G,known=false,training=true,truth=false,prior=10.0 $reference_dir/1000G_phase1.snps.high_confidence.hg19.vcf

-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $reference_dir/dbsnp_138.hg19.vcf

-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS

-mode SNP

-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0

-recalFile recalibrate_SNP.recal

-tranchesFile recalibrate_SNP.tranches

-rscriptFile recalibrate_SNP_plots.R

From Geraldine_VdAuwera on 2015-01-12

@jacobhsu Those titv values suggest that there may be a lot of false positives in your data. There is no easy solution for this, and no easy answer as to what might be the cause. One thing you can do is check the sequencing QC to see if the sequence data is of reasonable quality. Please see the literature for QC guidelines as we currently don’t provide that as part of GATK support.

From jacobhsu on 2015-01-13

@Geraldine_VdAuwera‌

Do you think it’s worth to use dbsnp_138.hg19.excluding_sites_after_129.vcf in order to check some details by using VariantEval function ?

From Geraldine_VdAuwera on 2015-01-19

It’s probably worth a try, yes.

From 55816815 on 2015-04-17

I have some observed huge difference in terms of running vqsr using

dbsnp_138.b37.excluding_sites_after_129.vcf vs

dbsnp_138.b37.vcf

pipeline:

GATK 3.3, call variants to get gvcf first, then genotyped to get vcf.

data:

illunima platinum dataset (only the NA12878/91/92 trio), 50X PCR free data.

when using dbsnp_138.b37.vcf for VariantRecalibrator, too obvious bad observations:

1. the tstv ration is very low (~1.3).

2. the tranches plot is strange — it goes negative, the order of 99/99.9/100 messed up (e.g. 100 shows on top, then 99 tranche, then 99.9 tranche), and the TP/FP bars overlaps each others for some tranches as well.

the mystery that i couldn’t understand is that, by changing dbsnp_138.b37.vcf to dbsnp_138.b37.excluding_sites_after_129.vcf, the plots looks gorgeous, the tstv ratio goes back to the expected ~2.0.

What could be the explanation? Should I worry about the variant call quality?

thanks!!

From Geraldine_VdAuwera on 2015-04-17

55816815 Have a look at tommycarstensen’s comment here: http://gatkforums.broadinstitute.org/discussion/comment/20279/#Comment_20279

Basically what’s going on is that internally, the VQSR tools have specific expectations for novel Ti/Tv that are based on what was known at the time before the 1000 Genomes project results were added to dbsnp. If you use the corresponding “old” dbsnp version, the expectations are fulfilled and your data comes out looking shiny. If you use a more recent version, key assumptions break down and it screws up the VQSR’s QC routines (which produce the plots).

We were aware internally of the 1000G impact on dbsnp, but somehow we didn’t connect that to the issues people have been reporting, and that’s my bad. We’ll make sure to document that more clearly in future to avoid causing unnecessary concern.

On the bright side the VQSR results themselves are not affected at all by this problem, it’s only the post-analysis QC routines that end up being useless.

I hope that helps.

From 55816815 on 2015-04-27

Thanks much Geraldine! You always respond instantly with solutions or helpful comments.

Shuoguo

From tommycarstensen on 2015-06-04

Sheila Geraldine_VdAuwera

I’m calling variants in some low coverage data (I promise it’s the last!). I also have high coverage trios for some of these populations. Would it be a) really ingenious or b) really stupid or c) really indifferent or d) really really ingenious, if I add these as a resource to VR? For example like this:

-resource:high_trio,known=false,training=true,truth=true,prior=15.0 high_coverage_trio_sites.vcf.gz

What should I set the prior to?

As an aside: I’ve never understood why `prior=2.0` for dbSNP. Why isn’t it just zero? This confused me a lot, when I was new to VR.

I just noticed the percentage is missing in this sentence relating to 1000G:

The prior likelihood we assign to these variants is Q10 (%)

P.S. @TechnicalVault is jealous of me having 3 stars and him only having 2 stars. How can I get 4 stars to make him really jealous? :smiley:

From Geraldine_VdAuwera on 2015-06-05

@tommycarstensen Bucking for a promotion so soon after your most recent one? Tsk tsk. OK, well in recognition of your regular status and many contributions to the community, I just awarded you a trophy badge, which counts for 100 points toward the next rank (500). It’ll be a while before you get the fourth star, but at least you have a trophy!

But to be fair to @TechnicalVault, this reminded me he was overdue for a CodeMonkey badge, which bumps him up 50 points and earns him a third star.

Did this work out for you alright? :lol:

From Geraldine_VdAuwera on 2015-06-05

Oh, and to answer your question (which I suppose is what they pay me for) it may be fairly clever to use your high coverage callset as training data at least. I would be hesitant to use them as truth data because that would make it harder to compare sensitivity/specificity across projects, but it’s not out of the question. Would actually be interesting to evaluate the performance.

The prior should reflect your level of confidence in the quality of the callset, ie what proportion of calls you believe are most probably real vs being false positives. Trial and error may be your best friend here.

The prior for dbsnp is essentially meaningless because it is not used anywhere (since dbsnp is neither truth nor training). But there has to be a number there for argument parsing reasons, and I believe the original authors used 2 because in the sequencing vocabulary, Q2 is the universal symbol for “don’t trust this crap”. (Don’t get me wrong, dbsnp has value — just not for recalibration as such)

And, oops, typo I suppose.

From JaeYjung on 2015-10-22

As far as I remember, 1KG INDEL was included as resource in variantRecalibrator INDEL mode, like:

``` -resource:1000G,known=false,training=true,truth=true,prior=10.0 1000G_phase1.indels.b37.vcf

```

But now I see only mills and dbsnp above. Also in the how-to doc, there’s no dbsnp shown in the example:

[howto-recalibrate-variant-quality-scores-run-vqsr](http://gatkforums.broadinstitute.org/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr)

So can you please tell me which one is correct?

From Sheila on 2015-10-27

@JaeYjung

Hi,

I do see the 1000Genomes indels file in the bundle. It should be there.

Have a look at this article for more help: https://www.broadinstitute.org/gatk/guide/article?id=1259

This article is more representative of the arguments and inputs to use in VQSR.

-Sheila

From MUHAMMADSOHAILRAZA on 2015-12-22

@Geraldine_VdAuwera

I have read your paper entitled: “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline“

Im this paper, The VQSR “VariantRecalibrator” recommendations for SNPs and INDELs are slightly different than what are described above. i.e. (-an argument):

For SNPs case:

-an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum

For INDELs case:

-an DP -an FS -an MQRankSum -an ReadPosRankSum

You cam see this list of “annotations for calculation” is different than what is mentioned above. I know “-an InbreedingCoeff” is not used if we have less than 10 samples** but why not the other annotations were used?** and at what extent -an is important in VQSR calculation, and how we can decide which annotations can be used for our particular given sample, such as Exome sequencing and WGS analysis?

Thanks!

From Geraldine_VdAuwera on 2015-12-22

The paper is a bit out of date, so the current recommendations on the website are what you should follow.

You can change the annotations if you want; those are simply what we found to work best in our hands and on our data (for both WGS and exome).

From MUHAMMADSOHAILRAZA on 2015-12-23

@Geraldine_VdAuwera

I have another point in my mind, the parameters such as:

-percentBad 0.01

-minNumBad 1000

were found in VQSR past recommendations, but now they are omitted in recent discussions. so my questions is:

In the recent releases, doesn’t “VariantRecalibrator” require anymore such percentage/number of the worst scoring variants to use when building the model of bad variants? and secondly at what extent working with these parameters can affect bad variants modeling in WGS analysis (where we have a lot of variants)? .

Thank you very much!

From everestial007 on 2015-12-23

Hi,

I am kind of stuck at VQSR process and would appreciate little bit of guidance.

First of all, I want to clarify myself on the use of known=true vs. false while using VariantRecalibrator. In this link http://gatkforums.broadinstitute.org/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr

for the hapmap data shouldnot we use known=true because the variant data from Hapmap are verified to high level of confidence and hence are mostly true and for the data from dbsnp to be known=false. Otherwise, I am actually not finding a very reason of why a given variant should be true vs. false.

I however proceeded with some VQSR experimentation using both known=true and known=false using the variants I had generated. Acutally, there is no available variant database for my model organism so I had to generate the variants from the samples I had. I called variants using HC and UG (6 biological samples from 1 population) and filtered using stringent parameters. I then combined the variants by selecting the variants present in at least 3 samples and used this database for my VQSR. I was therefore expecting that my VQSR results would not be that great, but to start something new is always fun. I ran VQSR at different prior levels (10, 12 and 15). Below is the command for prior=12 (using known=false and know=true).

Using Q12, known=false

java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T VariantRecalibrator -R lyrata_genome.fa -input genotypedGvcfVARiantsMA.vcf -resource:strictSelection,known=false,training=true,truth=true,prior=12.0 VARiantsMAYODANsnp.vcf -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile raw.SNPsQ12.recal -tranchesFile raw.SNPsQ12.tranches -rscriptFile recalSNPsQ12.plots.R

Since the variants that I generated were the variants from the alignment (samples) itself, running VariantRecalibrator would find that the number of matching variants are exceptionally high and I think that using known=false would make the model overconfident (since there is a higher match between provided variant and the samples even when the provided variants are tagged known=false) which will lead to selecting higher number of novel variants.

Using Q12, known=true with all other parameters equal

Since the variant that I generated were the variants from the alignment itself, and would find that matching variants are exceptionally high I think that using known=true would make the model moderately confident (since the variants were already known, so known=true) which will lead to selecting moderate number of novel variants (less than the one using known=false).

I am not sure if my explanation above are right. I am just assuming something that may be going inside the GATK engine (black box to me). :neutral:

I also see that my Ti/Tv ratio are small, but I was expecting this to happen as I don’t have a large variant information to train the VQSR model. However, my variant file had 1209862 variant sites for the genome size of 207 mb (Arabidopsis lyrata). Could you please tell me if the issue is mainly the limitation in the amount of information required fro machine learning? I have looked at the QC of my reads before I started the GATK guide and did every measures to make the data appropriate.

My other plots (pairwise comparison of the annotations) also don’t look great.

I would appreciate if I could get a little illumination on how to take this further or any other choices I may have.

Thanks in advance,

raw.SNPsQ12.tranches.pdf

raw.SNPsQ12true.tranches.pdf

recalSNPsQ12.plots.R.pdf

recalSNPsQ12true.plots.R.pdf

From everestial007 on 2015-12-27

> @everestial007 said:

> Hi,

> I am kind of stuck at VQSR process and would appreciate little bit of guidance.

> First of all, I want to clarify myself on the use of known=true vs. false while using VariantRecalibrator. In this link http://gatkforums.broadinstitute.org/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr

> for the hapmap data shouldnot we use known=true because the variant data from Hapmap are verified to high level of confidence and hence are mostly true and for the data from dbsnp to be known=false. Otherwise, I am actually not finding a very reason of why a given variant should be true vs. false.

> I however proceeded with some VQSR experimentation using both known=true and known=false using the variants I had generated. Acutally, there is no available variant database for my model organism so I had to generate the variants from the samples I had. I called variants using HC and UG (6 biological samples from 1 population) and filtered using stringent parameters. I then combined the variants by selecting the variants present in at least 3 samples and used this database for my VQSR. I was therefore expecting that my VQSR results would not be that great, but to start something new is always fun. I ran VQSR at different prior levels (10, 12 and 15). Below is the command for prior=12 (using known=false and know=true).

> Using Q12, known=false

> java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T VariantRecalibrator -R lyrata_genome.fa -input genotypedGvcfVARiantsMA.vcf -resource:strictSelection,known=false,training=true,truth=true,prior=12.0 VARiantsMAYODANsnp.vcf -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile raw.SNPsQ12.recal -tranchesFile raw.SNPsQ12.tranches -rscriptFile recalSNPsQ12.plots.R

> Since the variants that I generated were the variants from the alignment (samples) itself, running VariantRecalibrator would find that the number of matching variants are exceptionally high and I think that using known=false would make the model overconfident (since there is a higher match between provided variant and the samples even when the provided variants are tagged known=false) which will lead to selecting higher number of novel variants.

> Using Q12, known=true with all other parameters equal

> Since the variant that I generated were the variants from the alignment itself, and would find that matching variants are exceptionally high I think that using known=true would make the model moderately confident (since the variants were already known, so known=true) which will lead to selecting moderate number of novel variants (less than the one using known=false).

> I am not sure if my explanation above are right. I am just assuming something that may be going inside the GATK engine (black box to me). :neutral:

> I also see that my Ti/Tv ratio are small, but I was expecting this to happen as I don’t have a large variant information to train the VQSR model. However, my variant file had 1209862 variant sites for the genome size of 207 mb (Arabidopsis lyrata). Could you please tell me if the issue is mainly the limitation in the amount of information required fro machine learning? I have looked at the QC of my reads before I started the GATK guide and did every measures to make the data appropriate.

> My other plots (pairwise comparison of the annotations) also don’t look great.

> I would appreciate if I could get a little illumination on how to take this further or any other choices I may have.

> Thanks in advance,

Geraldine_VdAuwera Sheila

I did a little bit of research and found out that the Ti/Tv for Arabidosis thaliana (a sister species of my model organism) is in the range of Ti/Tv ratio that I have found for my data (for Arabidosis lyrata). I have obtained the Ti/Tv ratio of 1.2 to 1.4 for different tranche and prior levels which is similar to that of A. thaliana, so I hope I should worry less about it. :smiley:

I will check how the used of —target_titv for my expected ranges changes the visual output o the data.

Regarding my other question, why I am seeing larger number of novel variants when using known=true vs. false. I think I have an explanation. Using the known=false actually adds the supplemented variants (used for building the VQSR model) to the identified novel variants, which is why I have higher number of novel variants when using known=false (see output file raw.SNPsQ12.tranches.pdf in previous post) vs. when using know=true (see output/attached file raw.SNPsQ12true.tranches.pdf).

Also, explanation on the use of —resource in the link https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php

states the use of ** Known -** The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel). They are not used in any calculations by the algorithm itself.

So, I am guessing Known=True vs. False is only helpful for illustration purpose on the plot file generated by R, recalSNPsQ12.plots.R.pdf.

I however still have some questions to clarify:

1) on the use of —ignore_all_filters: I am not very clear on the expression of this sentence Useful to rerun the VQSR from a filtered output file. – Does it mean that I can take the filtered output file to run a second round of recalibration?? If so, is it recommended?

2) I am not clear on the use of this: “Bad – A database of known bad variants can be used to supplement the set of worst ranked variants (compared to the Gaussian mixture model) that the program selects from the data to model “bad” variants.”

3)I tried using different —maxGaussians and it looks like —maxGaussians 4 was able to generate model PDF which distinct PDF distribution for several pair of annotations. My hunch is the model represented by recalSNPsQ12mG4.plots.R.pdf (using —maxGaussians) is better than the model represented by recalSNPsQ12m.plots.R.pdf (using —maxGaussians 8 i.e default). I would appreciate any suggestions on this particular query. :wink:

Thank you !!!

recalSNPsQ12mG4.plots.R.pdf

recalSNPsQ12.plots.R.pdf

From Geraldine_VdAuwera on 2016-01-08

Hi @everestial007,

I’m afraid your questions are a bit too extensive and we don’t have time to address them at the moment. If you have the opportunity to come to one of our workshops, we may be able to help you more that way. Also, be sure to watch the past workshop videos as they might prove helpful as well.

From buddej on 2016-02-15

[Many](https://www.broadinstitute.org/gatk/guide/presentations?id=5993#materials) [recent](https://www.broadinstitute.org/gatk/guide/presentations?id=6007#materials) [presentations](http://gatkforums.broadinstitute.org/gatk/discussion/6897/presentation-slides-for-gatk-workshops-in-australia-feb-1-4) on VQSR show parameters which do not match what is stated above. Should the presentations (all 3 more recent than this document) be used?

Specifically I’m wondering about parameters when recalibrating SNPs:

` -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf`

listed as truth=true above, but truth=false in the presentations

`—ts_filter_level 99.5`

listed as 99.5 above, but 99.0 in the presentations

Changing the filter to 99.0 makes sense, since the 99.5 tranche isn’t calculated, and GATK ends up using 99.9 when running ApplyRecalibration. I’m not sure what to make of using the omni2.5 SNPs as truth or not.

Thank you!

From Sheila on 2016-02-16

@buddej

Hi,

I’m not sure why the presentations do not include Omni as part of the truth set. We do consider Omni as truth, so you can use the documentation as reference. I will make a note to change the presentations.

As for the sensitivity level, it is up to you to experiment and determine which level best suits your data. Have a look at my answer from August 2015 in [this thread](http://gatkforums.broadinstitute.org/wdl/discussion/comment/23696) for more information.

-Sheila

From buddej on 2016-02-16

@Sheila

Thank you for the reply. I certainly consider the omni set to be a highly curated set of SNPs, so I was confused to see it included as truth=false in the presentation. Glad to hear it was just a typo.

As far as sensitivity, we do experiment to find a sensitivity level we are comfortable with. More than anything, I wanted to point out that recommending a `ts_filter_level` of 99.5 for SNPs in the Best Practices (this document) is mismatchy, since neither the default `-tranche` [option to VariantRecalibrator](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php#—TStranche), nor the [Tutorial recommendations](http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr) (1c) include this tranche level.

This results, as far as I can tell, in GATK setting FILTER=PASS to any SNP variant in the 99.0 tranche or below, even though the 99.5 level is recorded in the #GATKCommandLine record in the post-ApplyRecalibration .vcf.

Maybe this doesn’t make much of a difference in the end, but it might cause confusion for anyone trying to replicate a pipeline based on the log info in the .vcf

From Geraldine_VdAuwera on 2016-02-17

Unfortunately there are a few places in our VQSR documentation where there’s been some divergence — some docs got updated and some weren’t. When in doubt, trust the forum-based docs over the presentations. We’ll try to fix that in the next set of presentations to avoid sowing confusing. Thanks for reporting this!

From metheuse on 2016-06-15

I know that I need at least 30 exome-seq samples to run VQSR. What about samples from a small panel? There are 3000 target regions in this panel, totaling 1.4Mb.

Thanks.

From Sheila on 2016-06-15

@metheuse

Hi,

We don’t recommend using VQSR on small targeted data, as there is simply not enough data for the tool to build a proper model. You will need to use [hard filtering](http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set).

-Sheila

From metheuse on 2016-06-15

Sheila said: >

metheuse

> Hi,

> We don’t recommend using VQSR on small targeted data, as there is simply not enough data for the tool to build a proper model. You will need to use [hard filtering](http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set).

> -Sheila

Thanks.

What’s the threshold size of a panel for VQDR to be applicable?

From stkelley on 2016-08-29

I am seeing conflicting information on whether dbSNP should be included as a training set for Indel VQSR. Near the top of this page, under “Explanation of resource datasets” only Mills is listed for “Resources for Indels” but then under “Argument recommendations for VariantRecalibrator”, the example for indels includes dbSNP. In addition, in [this tutorial](https://software.broadinstitute.org/gatk/documentation/article?id=2805) linked to in the Best Practices only Mills is included for indels.

Since there are multiple places that suggest using only Mills and the prior comments on this article suggest that there was confusion about the same as far back as 2014 that was cleared up, I am inclined to not include dbSNP for indel training. I wanted to check what your recommendation actually is though and see if the discrepancies could be cleared up so others aren’t confused.

From Sheila on 2016-08-29

@stkelley

Hi,

Sorry for the confusion. We are in the process of cleaning up the documentation and making sure all the recommendations match. [This article](https://software.broadinstitute.org/gatk/guide/article?id=1259) has the most up-to-date Best practice recommendations. As you can see, we do recommend using dbSNP as an input resource file.

-Sheila

From stkelley on 2016-08-29

@Sheila Thanks! I will go forward with using dbSNP as a resource for indel VQSR.

Just so you know, the article you linked is the same one that I commented on and contains conflicting information. Under the first heading, Explanation of resource datasets, only Mills is listed for Resources for Indels.

From Sheila on 2016-08-30

@stkelley

Hi,

Thanks for the clarification. I just fixed the document. Sorry for the confusion!

-Sheila

From andkam on 2016-11-19

Can someone help me? (conflicting recommendation for VSQR)

When using the VariantRecalibrator for INDELs I cannot figure out which one of these two commands is the one that is recommended to use:

-resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \

-resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf \

In the “best practice page” – “(howto) Recalibrate variant quality scores = run VQSR” the command:

-resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ is recommended.

But in the documentation “Which training sets / arguments should I use for running VQSR?” which is suppose to be the most updated page, the recommended command is:

-resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf. HOWEVER – in the same page (“Which training sets / arguments should I use for running VQSR?”) is also says in the text:

“Resources for Indels

Known and true sites training resource: Mills” – This description makes one think that the parameter “known=true” should be used and not “known=false”.

Can someone help me clarify which command I am recommended to use?

Best, Anders

From Sheila on 2016-11-22

@andkam

Hi Anders,

[This document](https://software.broadinstitute.org/gatk/guide/article?id=1259) will always be the correct document. We will fix the misleading other documents.

Thanks,

Sheila

From chrarnold on 2016-11-23

@Sheila

I am also confused exactly by the same thing that @andkam posted, and your answer points to a document that actually is confusing. It not clear in this (always correct, as you say) document which one to use:

-resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \

-resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf \

because for the Mills dataset, it is written “Known and true sites training resource: Mills” in the title, suggesting “known=true” but then in the examples it says “known=false”.

Please clarify, thanks a lot!

Christian

From Sheila on 2016-11-28

@chrarnold

Hi Christian,

I see. Thanks for pointing it out. I just fixed the document. You should use `-resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf`

-Sheila

From stkelley on 2016-12-22

I am a bit unclear on the recommended arguments for VariantRecalibrator for Indels. Under Argument recommendations for VariantRecalibrator / Indel specific recommendations, the example command addition includes `—maxGaussians 4 \` but there is no mention of why this is included.

Is it correct to read that maxGaussians should be set to 4 for Indel recalibration regardless of the size of your study? The only recommendations I can find for maxGuassians indicate to use 4 when dealing with a small number of samples and/or targets.

From Geraldine_VdAuwera on 2016-12-22

Reducing the max number of Gaussian clusters that the model will consider helps produce a more stable model when you don’t have a lot of data to work from. This happens especially with indels in small cohorts, but can also affect larger cohorts if there is a lot of noise in the data. Using this argument may limit the resolution of the model, which is why you may want to try without it first, then add it if the run fails. It’s not a one-size-fits-all solution.

From artitandon on 2017-03-01

I have exome data for a single sample, what is the best way to apply variant re-calibration step? Is it best to apply hard coded values?

From Geraldine_VdAuwera on 2017-03-01

@artitandon It’s not possible to apply VQSR to a single exome; please see our hard-filtering recommendations instead.

From yanran on 2017-03-23

Can I ask three questions about VQSR:

1. I have two VCF files from two different sources, and each of them contains several samples. The sequencing processes are basically the same (library prep, sequence kit, etc.) for the two data sets. But the VSQ Lod cutoff for various tranches are different for the two sets (e.g. for file1 Tranche99.00to99.90 is VSQ Lod -100 <= x < -0.8, and -7 <= x < -1.5 for file2). I’m guessing it’s because they used different training data, but I’m not sure if my guessing was correct. i.e. If the VQSR steps were processed in the exactly same way (same training set and parameters), will two data sets get the same VSQ Lod cutoff for various tranches?

2. Does the testing data (i.e. the variant calls from certain experiment) have an impact on the VQSR results? For example, for one data set, if I do VQSR again on the filtered (only keeping the “PASS” variant loci) variants using the same training variants, will I get a different VQSR evaluation compared to the first time for the same variant loci?

3. I read that InDels and SNPs should be somehow seperately evaluated via VQSR, and I see that in some VCF files, they have TrancheINDEL99.00to99.90 and TrancheSNP99.00to99.90, etc.. However, there are also VCF files that only have Tranche99.00to99.90, etc. and no indication of InDels or SNP. Does this mean that InDel and SNPs were evaluated together? Is it a good thing?

Sorry for so many questions. I just learned about VQSR and have some confusions. Thanks in advance!

From Sheila on 2017-03-31

@yanran

Hi,

1) Because the program is building models based on annotation profiles in your input VCFs, differences in annotations will make a difference in the VQSLOD score. Take a look at the model plots to see some of those differences. You can read more about the plots [here](https://software.broadinstitute.org/gatk/guide/article?id=39). Note, differences in data quality will lead to differences in annotation distributions. You may try looking into quality control metrics for your data. Have a look at [this presentation](https://software.broadinstitute.org/gatk/events/slides/1504/GATKwr7-X-4-Quality_control.pdf) for more information.

2) In theory, the results should be the same if you are using the PASS variants. However issues may arise because you may not have enough sites to make a good negative model. You may penalize the good variant sites (because the model assumes a certain number of sites are bad) and not be able to build a good model of bad variants. This in turn could lead to different results from your previous run.

3) Can you tell us which version of GATK you are using? Can you also post the exact command that you ran? The command and version will be stated in the VCF header.

Thanks,

Sheila

From pbardou on 2017-04-12

Hi all,

We have a question on how to do the VQSR step and how to build a “True sites training resource”.

We work on a 1000G project and we plan to build the “True sites training resource” by using one animal by gene pool (~15) and select the variants common between GATK and freebayes and filtered by phred score.

Next we plan to use this set of “true sites” to do the “VQSR” step on each animal of the 1000G.

So, briefly our pipeline, by animal :

- Alignment (bwa mem) => bam

- Preprocess (realn, recal) => bam

- Calling haplotypecaller => gcvf

- VQSR using the previously building set “True sites training resource” => gvcf

And finaly joingenotyping to produce the final vcf with all our animals.

What do you think about that ?

Does it possible to do VQSR from gvcf ?

Hope it make sense

Thanks in advance for your feed back and your help

Philippe

From Sheila on 2017-04-13

@pbardou

Hi Philippe,

The process looks fine, but you must do joint genotyping on the GVCFs using GenotypeGVCFs before running VQSR. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=3893). You do not have to perform Indel Realignment if you are using HaplotypeCaller. Have a look at [this article](https://software.broadinstitute.org/gatk/blog?id=7847).

As for using ~15 genomes for the true sites resource, you can certainly try it out. You might try experimenting with the number of genomes you use to see which gives the best results. I think these threads discuss some similar options: http://gatkforums.broadinstitute.org/gatk/discussion/3470/bootstrapping-high-confidence-variants-for-vqsr

http://gatkforums.broadinstitute.org/gatk/discussion/4280/vqsr-in-non-human

http://gatkforums.broadinstitute.org/gatk/discussion/comment/25693

You will see it is difficult to create really good high confidence resource files, but it is well worth the effort. Good luck and let us know how things go!

-Sheila

From pbardou on 2017-04-14

Hi Sheila,

Thank you very much for your reply and all these links !!!

I have another question on the pipeline with a lot of samples … if I understood well the VQSR step is only done at the end, so we can not deliver results gradually ?

Thanks again

Regards

Philippe

From Geraldine_VdAuwera on 2017-04-19

@pbardou Right, you do VQSR once the entire cohort is ready. You could do it at earlier points if you want to see intermediate results, but you’ll still have to redo it later. That being said VQSR runs quickly so it’s not considered a rate-limiting step.

From alongalor on 2017-09-11

The GATK Resource Bundle does not have the files hapmap_3.3.b37.sites.vcf, 1000G_omni2.5.b37.sites.vcf, 1000G_phase1.snps.high_confidence.vcf or dbsnp.b37.vcf. It does have the files hapmap_3.3.b37.vcf.gz, 1000G_omni2.5.b37.vcf.gz, 1000G_phase1.snps.high_confidence.vcf.b37.gz and dbsnp_138.b37.vcf.gz. I found this confusing.

From Sheila on 2017-09-17

@alongalor

Hi,

Thanks, I just fixed those. I think the names got changed over time and we never fixed them in the docs.

-Sheila

From alongalor on 2017-09-20

Another possible typo: under the SNP and Indel specific recommendations for VariantRecalibrator, one mentions a dbsnp file called dbsnp_138.b37.vcf and the other dbsnp.b37.vcf I assume the second should be dbsnp_138.b37.vcf? Thanks!

From Sheila on 2017-09-25

@alongalor

Hi,

Fixed. Thank you for helping improve our documentation :smile:

-Sheila

From zng on 2018-06-19

Hello,

I want to run VQRS for INDELs. I will be using Mills as positive training set. I have created a file with internal controls of false positive INDELs. How do I use this file as the negative training set? Please provide an example of the arguments to specify a negative training set.

Thank you.

From Sheila on 2018-06-21

@zng

Hi,

There is no way to input a negative training set. The tool makes the positive and negative models internally. Have a look at [this answer](https://gatkforums.broadinstitute.org/gatk/discussion/comment/49681#Comment_49681).

-Sheila

From mr_dave on 2018-08-01

Out of curiosity, I am looking at adding the Axiom resource to my indel VQSR (WGS on hg19). Is the hg38 bundle version, `Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz` ([ftp](ftp://ftp.broadinstitute.org/bundle/hg38/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz )), a curated resource?

I was thinking about using the ExAC-hosted version ([ftp](ftp://ftp.broadinstitute.org/distribution/ExAC_release/release0.3/resources/Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz)) that is listed in the [ExAC supplement](https://media.nature.com/original/nature-assets/nature/journal/v536/n7616/extref/nature19057-s1.pdf). I think I just need to convert from b37 to hg19. The hg38 vcf is at least 97% identical to the ExAC b37 vcf (when comparing columns except POS, maintaining order, so that number is likely higher with a proper liftover).

From Sheila on 2018-08-06

@mr_dave

Hi,

I need to confirm with the team and get back to you.

-Sheila

From Sheila on 2018-08-10

@mr_dave

Hi,

The developer says “Axiom is great. That’s what we use in production. The hg38 version is just a liftover.”

-Sheila

From 55816815 on 2018-10-13

Thanks a lot Geraldine! You always respond instantly with solutions or helpful comments!

Shuoguo

> @Geraldine_VdAuwera said:

> 55816815 Have a look at tommycarstensen’s comment here: http://gatkforums.broadinstitute.org/discussion/comment/20279/#Comment_20279

> Basically what’s going on is that internally, the VQSR tools have specific expectations for novel Ti/Tv that are based on what was known at the time before the 1000 Genomes project results were added to dbsnp. If you use the corresponding “old” dbsnp version, the expectations are fulfilled and your data comes out looking shiny. If you use a more recent version, key assumptions break down and it screws up the VQSR’s QC routines (which produce the plots).

> We were aware internally of the 1000G impact on dbsnp, but somehow we didn’t connect that to the issues people have been reporting, and that’s my bad. We’ll make sure to document that more clearly in future to avoid causing unnecessary concern.

> On the bright side the VQSR results themselves are not affected at all by this problem, it’s only the post-analysis QC routines that end up being useless.

> I hope that helps.

From aschoenr on 2018-11-02

Hi,

I’m currently putting together the the GATK best practices workflow for identifyiong de novo mutations in humans (Data pre-processing pipeline followed by the Germline short variant discovery followed by the Genotype Refinement workflow for germline short variants (which will annotate de novo mutations)). I’ve been doing this using GATK 4 and the hg38 reference genome. This combines the newest version of each so the reference material in the forums usually does not refer to both of these versions (or possibly either. I have a couple of questions which I’m hoping have easy answers.

First, I will paste in the commands as I executed them since I had to made some slight changes to the suggested parameters due to changes in GATK4 and others may find this useful (please note that I have omitted the -an InbreedingCoeff parameter as my pipeline will be applied to families).

STEP 1

>gatk VariantRecalibrator \

> -R $ref_dir/Homo_sapiens_assembly38.fasta \

> -V $input_VCF \

> —resource hapmap,known=false,training=true,truth=true,prior=15.0 $ref_dir/hapmap_3.3.hg38.vcf.gz \

> —resource omni,known=false,training=true,truth=true,prior=12.0:$ref_dir/1000G_omni2.5.hg38.vcf.gz \

> —resource 1000G,known=false,training=true,truth=false,prior=10.0:$ref_dir/1000G_phase1.snps.high_confidence.hg38.vcf.gz \

> —resource dbsnp,known=true,training=false,truth=false,prior=2.0:$ref_dir/dbsnp_146.hg38.vcf.gz \

> -an DP \

> -an QD \

> -an FS \

> -an SOR \

> -an MQ \

> -an MQRankSum \

> -an ReadPosRankSum \

> -mode SNP \

> -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \

> -O input_VCF_recalibrate_SNP.recal \

> —tranches-file input_VCF_recalibrate_SNP.tranches \

> —rscript-file input_VCF_recalibrate_SNP_plots.R

Step 2

>gatk ApplyVQSR \

> -R $ref_dir/Homo_sapiens_assembly38.fasta \

> -V $input_VCF \

> -mode SNP \

> —ts-filter-level 99.5 \

> —recal-file input_VCF_recalibrate_SNP.recal \

> —tranches-file input_VCF_recalibrate_SNP.tranches \

> -O input_VCF_recalibrated_snps_raw_indels.vcf

Step 3

>gatk VariantRecalibrator \

> -R $ref_dir/Homo_sapiens_assembly38.fasta \

> -V $input_VCF_recalibrated_snps_raw_indels.vcf \

> —resource hapmap,known=false,training=true,truth=true,prior=15.0 $ref_dir/hapmap_3.3.hg38.vcf.gz \

> —resource mills,known=false,training=true,truth=true,prior=12.0:$ref_dir/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

> —resource dbsnp,known=true,training=false,truth=false,prior=2.0:$ref_dir/dbsnp_146.hg38.vcf.gz \

> -an DP \

> -an QD \

> -an FS \

> -an SOR \

> -an MQ \

> -an MQRankSum \

> -an ReadPosRankSum \

> -mode INDEL \

> -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \

> -O input_VCF_recalibrate_INDEL.recal \

> —tranches-file input_VCF_recalibrate_INDEL.tranches \

> —rscript-file input_VCF_recalibrate_INDEL_plots.R

Step 4

>gatk ApplyVQSR \

> -R $ref_dir/Homo_sapiens_assembly38.fasta \

> -V $input_VCF_recalibrated_snps_raw_indels.vcf \

> -mode INDEL \

> —ts-filter-level 99.0 \

> —recal-file input_VCF_recalibrate_INDEL.recal \

> —tranches-file input_VCF_recalibrate_INDEL.tranches \

> -O input_VCF_recalibrated_variants.vcf

This is what I have applied and, like I said, it is slightly modified from the documentation for the new formatting of some parameters in GATK4 and it is using the files from the hg38 reference. I have read elsewhere that the ‘prior’ parameter values used on older references should still be used here. I was just curious about the specific files I used. For the most part it is obvious the changes you should make when using hg38, but, for example, there are the following files (bold is what I used):

>hapmap_3.3_grch38_pop_stratified_af.vcf.gz

>hapmap_3.3.hg38.vcf.gz

and

> dbsnp_138.hg38.vcf.gz

> dbsnp_144.hg38.vcf.gz

> dbsnp_146.hg38.vcf.gz

Did I use the correct hapmap file? I have read elsewhere that some people continue to use dbsnp_138. Should I use that as well or was my choice of dbsnp_146 ok? Please also let me know if there is a problem with any of the other files I used to do the recalibration.

The reason I ask these questions is that the results of this filtering was not what I expected. In the documentation it is mentioned that the variant selection at earlier stages is quite lenient with this stage filtering out a lot of variants. I know it will be hard to comment on my specific case but the family I am currently working on consists of 5 individuals. The unfiltered VCF used as input to this stage has 7,452,752 identified variants. I was expecting some sizeable fraction of these to fail or at least be places in one of the tranches. This is how it broke down though:

>PASS: 6,943,634

>VQSRTrancheINDEL99.00to99.90: 121,377

>VQSRTrancheINDEL99.90to100.00: 17,387

>VQSRTrancheSNP99.90to100.00: 370,354

This means that 93% of the variants passed. Does this sound reasonable? I was expecting less than this, but I’m not exactly sure why. Also, I’m surprised that only 3 of the 6 defined tranches show up in these results

Lastly, are there any sample inputs (raw fastq or unaligned BAM files) with resulting analysis-ready VCFs that have been processed using the GATK pipelines? I have followed the best practices as best as I could but I’m worried I made a mistake somewhere. Having a small example data set that I can process and compare to would be great.

Sorry for the length of this post but I do think it would be useful for people using GATK4 and hg38.

From aschoenr on 2018-11-02

One other question: is the VQSR process (specifically the machine learning parts) deterministic or probabilistic? If I rerun this stage on the same input data should I expect the exact same output?

From aschoenr on 2018-11-06

Just thought I’d ping shlee and Sheila as you’ve recently been so helpful with this stuff.

Report abuse