created by rpoplin
on 2012-07-23
This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.
For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.
As a complement to this document, we encourage you to watch the workshop videos available in the Presentations section.
VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score, which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish; stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.
The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.
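To make that limitation concrete, here is a minimal sketch of hard filtering in Python. The annotation names and cutoffs are invented for illustration; they are not GATK's recommended values.

```python
# Hard filtering: each annotation is tested against a fixed cutoff,
# and a variant must pass every one-dimensional test to survive.
def hard_filter(variant, thresholds):
    """variant: dict of annotation -> value; thresholds: annotation -> (op, cutoff)."""
    for ann, (op, cutoff) in thresholds.items():
        value = variant[ann]
        if op == ">" and not value > cutoff:
            return False
        if op == "<" and not value < cutoff:
            return False
    return True

# Hypothetical cutoffs, for illustration only.
THRESHOLDS = {"QD": (">", 2.0), "FS": ("<", 60.0)}

good = {"QD": 25.0, "FS": 1.2}   # strong support, no strand bias
edge = {"QD": 1.8, "FS": 0.0}    # perfect FS, but one bad annotation vetoes it
print(hard_filter(good, THRESHOLDS))  # True
print(hard_filter(edge, THRESHOLDS))  # False
```

Note how `edge` is thrown out on QD alone, with no way for its excellent FS value to compensate; that is exactly the rigidity VQSR is meant to escape.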
The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”
Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.
How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (Omni, 1000 Genomes, HapMap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes, based on all the examined dimensions, how likely it is to be real. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in HapMap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.
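That last threshold-picking step can be sketched in a few lines of Python. This is illustrative only, with made-up VQSLOD scores; the real tool reports these cutoffs as tranches.

```python
# Pick the VQSLOD cutoff that retains a target fraction of truth-set variants.
def sensitivity_threshold(truth_vqslods, target=0.99):
    """Return the VQSLOD value above which `target` of truth variants remain."""
    ranked = sorted(truth_vqslods, reverse=True)   # best scores first
    keep = int(round(target * len(ranked)))        # how many truth sites to keep
    return ranked[keep - 1]                        # score of the last kept site

# Made-up scores for 10 truth-set variants found in the callset.
truth = [9.1, 8.7, 8.2, 7.5, 6.9, 6.0, 5.2, 4.1, 2.0, -3.5]
cutoff = sensitivity_threshold(truth, target=0.90)
print(cutoff)  # -> 2.0 : keeping 9 of 10 truth sites (90%)
```

Twisting the dial toward sensitivity (target closer to 100%) drives the cutoff down, letting in more of everything; twisting toward specificity drives it up.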
The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.
The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds of being a true variant versus being a false variant under the trained Gaussian mixture model.
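As a toy illustration of the log-odds idea, here is a sketch with a single annotation dimension and a single Gaussian per model. All parameters below are invented; the real model is a multivariate Gaussian mixture fit to the data.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def vqslod(x, good=(20.0, 5.0), bad=(2.0, 2.0)):
    """Log odds of the 'good variant' model vs the 'bad variant' model."""
    p_true = gaussian_pdf(x, *good)   # density under the positive (truth-trained) model
    p_false = gaussian_pdf(x, *bad)   # density under the negative model
    return math.log(p_true / p_false)

print(vqslod(22.0) > 0)  # near the good cluster: positive log odds
print(vqslod(1.0) < 0)   # near the bad cluster: negative log odds
```

Variants near the heart of the positive model's density get large positive scores; outliers that look more like the negative model get negative scores.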
The variant recalibrator contrastively evaluates variants in a two-step process, each step performed by a distinct tool: VariantRecalibrator builds the model, and ApplyRecalibration uses it to filter the callset.
Please see the VQSR tutorial for step-by-step instructions on running these tools.
The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program determines the VQSLOD value above which 99.9% of the variants in the training callset are included. It then uses that VQSLOD value as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
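The filtering logic can be sketched as follows. The tranche boundaries here are invented; only the VQSRTrancheSNP&lt;lo&gt;to&lt;hi&gt; naming pattern comes from the tool.

```python
# Tranche boundaries as (sensitivity, minimum VQSLOD) pairs, best tranche first.
# The target should be one of the listed tranche sensitivities.
TRANCHES = [(90.0, 6.0), (99.0, 3.0), (99.9, 0.5), (100.0, float("-inf"))]

def filter_field(vqslod, target=99.9):
    """Return the FILTER value for a variant, given a target tranche sensitivity."""
    for i, (sens, cutoff) in enumerate(TRANCHES):
        if vqslod >= cutoff:
            if sens <= target:
                return "PASS"           # within the chosen tranche
            lo = TRANCHES[i - 1][0]     # boundary of the tranche it fell into
            return "VQSRTrancheSNP%.2fto%.2f" % (lo, sens)
    # the final (100%, -inf) tranche catches everything, so we never fall through

print(filter_field(7.2))   # PASS
print(filter_field(-4.0))  # VQSRTrancheSNP99.90to100.00
```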
The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.
The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.
In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.
The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).
An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!
The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.
The first tranche (90), which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select more specific or more sensitive call sets in a principled way, or can use the recalibrated quality scores directly, weighting individual variant calls by their probability of being real instead of analyzing only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.
This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.
Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.
We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:
We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features, like being close to indels or actually being MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.
- Can I use the variant quality score recalibrator with my small sequencing experiment?
The tool expects thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole-exome call sets work well, but anything smaller than that scale might run into difficulties.
One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding `--maxGaussians 4` to your command line. `maxGaussians` is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.
- Why don't all the plots get generated for me?
The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command-line version of R that gets installed alongside it. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.
Updated on 2017-06-12
From Kurt on 2015-04-09
@mhernaez ,
I would suggest starting off with removing MQ from the model. That’s usually been my Achilles heel when dealing with targeted capture. I’ve had MQRankSum bite me a couple of times (for a different reason).
From mhernaez on 2015-04-09
@Kurt Thanks! I will try that.
mikel
From SteveL on 2015-04-20
I found this sentence VERY confusing, if not wrong (I am still not 100% sure):
> The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls.
Should it not be the first tranche from the TOP? i.e. in tranche-90 everything is considered TP (at least in this figure).
From tommycarstensen on 2015-04-20
@SteveL No, starting from “tranche-90” and going up your specificity will decrease and your sensitivity will increase. The sentence is correct.
From SteveL on 2015-04-20
Thanks @tommycarstensen – however the issue I have is that the image (immediately following paragraph with quoted sentence) on page one of this thread has the “tranche-90” at the top, so it isn’t clear, to novices, what the word “bottom” refers to – also the “lowest” values, could be interpreted as tranche or the ti/tv, which unfortunately go in opposing directions. I just think it could be stated more clearly, but maybe it’s just me.
From Geraldine_VdAuwera on 2015-04-20
@SteveL I see what you mean, it’s a fair point. We’ll try to clarify.
From shiroann on 2015-04-21
Is the model better informed by using a VCF containing all the chromosomes or does the model still work efficiently when the recalibration is done per chromosome?
From tommycarstensen on 2015-04-21
@shiroann It is better to maximize the number of data points by running simultaneously across all chromosomes. That being said I think VQSR still might give reasonable results for individual chromosomes.
From shiroann on 2015-04-21
Thanks @tommycarstensen. I’m assuming the per-chromosome VQSR will be greatly influenced by the number of variants in that particular chromosome. I wonder if that introduces some underlying variation in the VQSLOD score sensitivity across the genome/exome?
From tommycarstensen on 2015-04-21
@shiroann Yes, basically don’t run VQSR per chromosome, unless you need the results quickly, before the Death Star reaches Alderaan or something like that. And you would have to run ApplyRecalibration separately per chromosome as well, if you were to do it. JUST don’t DO IT (not a shoe company TM).
… Unless you are dealing with chromosomes, which you think might have annotations very different from your other chromosomes. I think I remember @Geraldine_VdAuwera telling me to run across all chromosomes irrespective of these differences. Here is the thread I was thinking of:
http://gatkforums.broadinstitute.org/discussion/2895/vqsr-and-sex-chromosomes
I posted a figure in that thread showing that the chromY variants more frequently fail on the DP annotation, for obvious reasons:
https://us.v-cdn.net/5019796/uploads/FileUpload/99/99c914e1b209e5615f9de0c07ca55f.png
From Geraldine_VdAuwera on 2015-04-21
We really don’t recommend running VQSR per chromosome, no. It’s going to cause what are essentially batch effects between chromosomes of different sizes, among other issues.
From WVNicholson on 2015-04-27
I posted a question about VQSR and issues with the VariantRecalibrator and VariantAnnotator about two weeks ago, but it still hasn’t appeared anywhere, so I’m concerned that it has been blocked outright. I’m not re-posting it here right now because it’s possible it was blocked earlier due to my inadvertently making multiple posts,
William
From Sheila on 2015-04-27
@WVNicholson
Hi William.
I just verified your account. I don’t see your question anywhere, so it is best to re-post again. I should be able to see it now.
-Sheila
From Kath on 2015-04-28
Hi,
This might have been asked before (sorry…) but if I have less than 30 exome samples and matching 1000 genomes samples is going to be difficult (because I am running my samples through my pipeline and samples will differ each time according to enrichment method, intervals, ethnicity etc.) is it better to leave out the VQSR step completely or carry on using it on <30 samples?
Thanks
Kath
From Geraldine_VdAuwera on 2015-04-28
@Kath I would recommend running some tests and evaluating your results. Generally speaking we think it’s worth the additional effort to put together a VQSR-worthy dataset.
From WVNicholson on 2015-04-28
As mentioned earlier, I attempted to post the message below before; so it could actually still be in the system somewhere. (In particular, I sent it to "Ask the GATK Team" which may be distinct from the forums.)
I'm trying to carry out VQSR on a dataset. When I run the VariantRecalibrator it fails to find annotations, even when added either by running the VariantAnnotator or by re-running GenotypeGVCFs with the arguments for the annotations to be generated. (The variants were originally called using the HaplotypeCaller.) My most recent command line for the VariantRecalibrator is:
java -jar /home/u1374090/software/gatk/gatkonly/GenomeAnalysisTK.jar \
    -T VariantRecalibrator -R ../ncbi/ncbisorghumcombogenome.fasta \
    -input Sallmoderncombojointan.vcf \
    -resource:ncbivariations,known=false,training=true,truth=true,prior=10.0 ../ncbi/variations/nstd63Zhengetal2011.Sorbi1.submitted.variantcall.germline.wvned.vcf \
    -an DP -an FS -an MQRankSum \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrateSNP.recal \
    -tranchesFile recalibrateSNP.tranches \
    -nt ${ntPara}
The shell variable ntPara was set to "10". I've tried several variations of the above command line, all without success. What happens is that the final .recal and .tranches file are empty and I get the following output including an error message (which I've edited because it is quite long):
INFO 17:36:29,136 HelpFormatter - Program Args: -T VariantRecalibrator -R ../ncbi/ncbisorghumcombogenome.fasta -input Sallmoderncombojointan.vcf -resource:ncbivariations,known=false,training=true,truth=true,prior=10.0 ../ncbi/variations/nstd63Zhengetal2011.Sorbi1.submitted.variantcall.germline.wvned.vcf -an DP -an FS -an MQRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrateSNP.recal -tranchesFile recalibrateSNP.tranches -nt 10
INFO 17:36:29,141 HelpFormatter - Executing as u1374090@hamilton on Linux 3.13.0-46-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.025-b17
. . .
INFO 17:36:29,715 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:36:29,832 GenomeAnalysisEngine - Downsampling Settings: Method: BYSAMPLE, Target Coverage: 1000
INFO 17:36:29,951 MicroScheduler - Running the GATK in parallel mode with 10 total threads, 1 CPU thread(s) for each of 10 data thread(s), of 64 processors available on this machine
INFO 17:36:30,027 GenomeAnalysisEngine - Preparing for traversal
INFO 17:36:30,030 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:36:30,031 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:36:30,031 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 17:36:30,031 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 17:36:30,035 TrainingSet - Found ncbivariations track: Known = false Training = true Truth = true Prior = Q10.0
INFO 17:36:48,155 VariantDataManager - DP: mean = NaN standard deviation = NaN
INFO 17:36:51,231 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
. . .
ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
From Sheila on 2015-04-28
@WVNicholson
Hi,
It looks like your training file may not have the DP annotation in it. Can you check and make sure it has it?
-Sheila
From WVNicholson on 2015-05-15
Indeed it looks like my training (VCF) file does not have the DP annotation, or apparently any annotations that would be useful for VQSR. I’m still looking, but at this stage it’s possible that none of the publicly available sets of known SNPs, at least as available in VCF format, actually have the required annotations. It appears I may be able to get data in SRA format (apparently convertible to BAM) for some of the SNPs in the VCF file from Ensembl Plants. So presumably I can make my own annotated VCF from that one and the SRA files (after converting to BAM) by using VariantAnnotator. This looks like it’s going to be an involved process, so I thought I should check if that’s a sensible course of action?
William
From Geraldine_VdAuwera on 2015-05-15
@WVNicholson There seems to have been a miscommunication in our team about what is required for VQSR. It is not necessary for the training resource file to contain any annotations; the program only uses the location of the sites from that file.
From WVNicholson on 2015-05-18
Okay, I’m not sure where to go from here then. Inspection of the input VCF file (from my data rather than the other VCF file used for training) by eye appears to suggest it does actually have the DP annotation. The training file does not have the DP annotation; but you just said the training file is not required to contain any annotations. The only thing I can think of is that somehow I managed to call a lot of variants in the new data and none of them agree with the known variants in the training set. That seems a bit unlikely, but would it give an error message like the one above?
William
From WVNicholson on 2015-05-18
On further investigation, my training VCF file may not be suitable. It looks like it has no SNPs – just indels and something described as “DUP” in the ALT field (CNVs, I think),
William
From WVNicholson on 2015-05-18
It turns out I have another VCF file with known SNPs and MNPs that should be suitable. However, when I attempt to use that I get a different error message:
ERROR MESSAGE: Input files /home/u1374090/sorghum/gatkwork/../ensembl/variations/sorghumbicolor.snpsmnpsonlywvned2.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
I'm confused by the error message, because the contigs in the various files refer to chromosomes and should be treated as non-overlapping. I can however re-order the data for the contigs in the training VCF to match those in my VCF with my variant calls; but the error message may be a clue that there are other problems,
William
From Geraldine_VdAuwera on 2015-05-18
@WVNicholson Sorry for the confusion. What the error message means by “overlapping contigs” is “contigs that are present in both files”. It doesn’t mean to imply that the sequences overlap. We’ll try to improve the phrasing to clarify this.
From blueskypy on 2015-06-04
`##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds ratio of being a true variant versus being false under the trained gaussian mixture model">`
This doc does not say how the VQSLOD is computed. Is VQSLOD the log odds ratio or log odds? My understanding is that the term “odds ratio” always associates with at least two factors.
From Geraldine_VdAuwera on 2015-06-04
It’s the log odds ratio of being a true positive vs. being a false positive. See this presentation (~slide 12): https://drive.google.com/open?id=0BwTg3aXzGxEDanNtNFgyamtaMXM&authuser=0
From Geraldine_VdAuwera on 2015-06-04
Whoops sorry wrong version, see this one instead, which has the new slides: https://drive.google.com/open?id=0BwTg3aXzGxEDNW5tWUhSM2hZRlk&authuser=0
From blueskypy on 2015-06-05
@Geraldine_VdAuwera
Thanks so much for the slides! If my understanding is correct, the VQSLOD is log(p/q), where p is the prob of being TP and q is the prob of being FP (although p and q come from different models, so p != 1-q). I think it should be called the log odds, since it is not the ratio of two odds; the log odds ratio would be (p/(1-p)) / (q/(1-q)). So it should be stated as the log odds of being true under the Positive model, instead of the log odds ratio of being true.
But just my two cents!
From Geraldine_VdAuwera on 2015-06-05
Oh, I may have been abusing the terminology on this one. I’ll consult with our resident mathketeers for a definitive answer :)
From Geraldine_VdAuwera on 2015-06-05
Yep, I can now confirm that I was wrong to call it the log odds ratio, it’s just log odds, period. For a bit of additional mathy detail:
> It does not matter that p + q != 1 since you can re-normalize: Let p’ = p/(p+q) and q’ = q/(p+q). Now p’ + q’ = 1 and p/q = p’/q’ so log(p/q) can rightly be called log odds. Blueskypy is right that this is different from the log odds ratio which would be log((p/(1-p)) / (q/(1-q))) but unless Geraldine used that term in her discussion [G: oops] it should be irrelevant.
And they remind me that VQSLOD = “VQS-Log-ODds”
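The renormalization argument above is easy to check numerically (p and q below are arbitrary made-up probabilities):

```python
import math

# p and q need not sum to 1: they come from two separately trained models.
p, q = 0.30, 0.05

# Renormalize so the two probabilities sum to 1.
p_norm = p / (p + q)
q_norm = q / (p + q)

# The odds, and hence the log odds, are unchanged by renormalization,
# so log(p/q) can rightly be called a log odds.
assert math.isclose(p / q, p_norm / q_norm)

# The log *odds ratio* is a genuinely different quantity:
log_odds = math.log(p / q)
log_odds_ratio = math.log((p / (1 - p)) / (q / (1 - q)))
```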
From blueskypy on 2015-06-08
@Geraldine_VdAuwera I’m glad that I helped a bit with the great work you guys have been doing! So probably the header of the VCF file needs to be changed a little:
`##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds ratio of being a true variant versus being false under the trained gaussian mixture model">`
From Geraldine_VdAuwera on 2015-06-09
@blueskypy I think that’s the same thing as what’s currently there… do you mean to say we should take “ratio” out? I would agree with that.
From blueskypy on 2015-06-10
@Geraldine_VdAuwera
Yes, that’s what I mean.
From Geraldine_VdAuwera on 2015-06-10
OK we’ll fix that in the next release. I put in a ticket so it doesn’t fall off the radar.
From nchuang on 2015-06-19
dumb question, but would it be erroneous to combine my cases and controls together for joint genotyping and even VQSR?
From Geraldine_VdAuwera on 2015-06-21
@nchuang Not erroneous at all, that is what you should do.
From zzy on 2015-07-29
Hi, I have been working with GATK recently and I just came across this situation,
VariantRecalibrator produced tranches.pdf as shown in the attachment.
I am computing four human exome samples and I used UnifiedGenotyper to call SNPs (four samples together).
Here is my code:
java -XX:+UseParallelGC -XX:ParallelGCThreads=15 -Xmx15g -jar ../GenomeAnalysisTK.jar \
    -T VariantRecalibrator -R hg19.fa -input $one.snp_indel.raw.vcf \
    -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf.gz \
    -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf.gz \
    -resource:dnsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf.gz \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ \
    -recalFile var_plots/$one.recal \
    -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90 \
    -tranchesFile ./var_plots/$one.tranches \
    -rscriptFile ./var_plots/$one.plots.R \
    -mode SNP
Thanks for any help!
Zeyu
From tommycarstensen on 2015-07-29
@zzy I think you forgot to ask your question.
From zzy on 2015-07-29
> @tommycarstensen said:
> zzy I think you forgot to ask your question.
Thank you and I am sorry.
My question is: why is my attached tranches.pdf full of false positives? Is this due to a mistake I made during the pipeline, or is it because my data are of lower quality or the sample size is insufficient?
Thanks again, @tommycarstensen
Zeyu
From Sheila on 2015-07-29
@zzy
Hi Zeyu,
We recommend using at least 30 exome samples in VQSR. If you don’t have 30 samples, you can either add data from 1000 Genomes project or use hard filtering.
-Sheila
From zzy on 2015-07-30
> @Sheila said:
> @zzy Hi Zeyu, We recommend using at least 30 exome samples in VQSR. If you don't have 30 samples, you can either add data from 1000 Genomes project or use hard filtering. -Sheila

Hi Sheila,
Thanks for your reply!
Zeyu
From Matteodigg on 2015-09-15
Hi there,
I have some questions for you. I’m trying to use VQSR on my data and I’m working on target sequencing extracted from a WES (using a .list file). To be clear, I have 3 datasets: the first one with 3 samples (more or less 20M total reads per sample, extracted from a TruSight panel), the second one with 48 samples (1.5M total reads per sample, target), and the last one with 6 samples (20M reads per sample, extracted from TruSight). Reading your forum and your previous answers, you suggest using VQSR on at least 1 WGS sample or 30 WES samples. Regarding targeted sequencing, I read this link:
https://www.broadinstitute.org/gatk/guide/article?id=39
you suggest working with the --maxGaussians option if I want to use VQSR, so my question is: can I use VQSR on all my datasets by adjusting the --maxGaussians parameter? If not, can I try to merge all the VCFs coming out of my 3 datasets (target and exome) and use them as a sort of database to use VQSR on the small target? I hope I was clear, and sorry for my dummy question.
Thank you,
Matteo
From vsvinti on 2015-09-17
What does ‘culprit=x’ mean for variants that get the PASS vqsr filter? I take it that for variants that fail, it indicates the annotation that caused it to not make the PASS filter? Is the PASS filter assigned based on a particular VQSLOD cutoff?
I have done two analyses, using UG and HC. The latter has ~400 fewer samples than the former (the rest are identical). I have some variants that pass the VQSR from the HC results but didn’t in the UG analysis (possibly also the other way around). I am trying to determine if I should trust these variants.
For example, one particular variant was given ReadPosRankSum=43.912 in the UG analysis, and failed (culprit=ReadPosRankSum), but ReadPosRankSum=1.16 in the HC analysis, and PASSed.
Is ReadPosRankSum calculated by ranking distance to the end of the read for all reads in all samples? If so, could the difference in score be caused by the 400-sample difference?
In such cases, do you recommend any other sanity checks to have confidence in these variants?
Thanks
Vicky
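On the VQSLOD cutoff question raised above: ApplyRecalibration picks the VQSLOD threshold that recovers the requested fraction of truth-set sites (the --ts_filter_level). A minimal Python sketch of that idea, purely illustrative (`vqslod_cutoff` is a hypothetical helper, not GATK code):

```python
# Sketch: how a truth-sensitivity tranche maps to a VQSLOD cutoff.
def vqslod_cutoff(truth_vqslods, sensitivity):
    """VQSLOD threshold that keeps `sensitivity` (0-1) of truth-set
    sites when retaining everything scored at or above the threshold."""
    ranked = sorted(truth_vqslods, reverse=True)      # best scores first
    keep = max(1, round(sensitivity * len(ranked)))   # truth sites to retain
    return ranked[keep - 1]

truth = [9.2, 7.5, 6.1, 4.8, 3.3, 1.0, -0.5, -2.2, -4.0, -7.1]
cutoff_90 = vqslod_cutoff(truth, 0.90)   # -4.0: a 90.0 tranche keeps 9/10
cutoff_50 = vqslod_cutoff(truth, 0.50)   # 3.3: stricter, more specific
# Call-set variants with VQSLOD >= cutoff are PASS at that tranche; the
# `culprit` field just names the annotation contributing most to the score.
```

A higher truth-sensitivity target pushes the cutoff down, letting in more borderline (and more false-positive) calls, which is exactly the sensitivity/specificity trade-off the tranches expose.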
From SteveL on 2015-09-17
Hi Vicky,
Yes, I understand that culprit means the same i.e. that the filter applied following your UG analysis was based on ReadPosRankSum – the value in the example you give shows there was a very strong difference within the reads of the variant calls versus the reference calls in the case of the UG analysis.
However it seems strange, assuming that you used the same BAMs as input, that the difference should be so great. Unless the ~400 extra samples were called using a different capture kit? (I assume this was WES.) You also don’t mention how many samples you have in the UG analysis.
Also, were all other steps in your workflow the same, alignment and joint-genotyping? I would not be surprised if different versions of various tools led to different findings due to batch effects of some sort.
Steve
From vsvinti on 2015-09-17
@SteveL
Thanks for your message! I have ~ 2200 samples in the UG analysis, ~ 1800 in the HC. These are exomes, and yes, a mixture of 2 target captures. The 400 excluded were all with one capture, but I have ~ 600 left with the same capture as the excluded ones (in the HC analysis).
The VQSR for UG was done with gatk2.8 – I remember I had done everything with 2.7 but the vqsr performed terribly on my data, and had to update to the 2.8 release when it came out.
I have done all the re-analysis, including the re-alignments, for all the samples, so I shouldn’t have any differences in workflow. UG was done by pooled calling across all the samples, HC via the new single-sample gVCF, in batches of 100 and then calling across these.
So I guess I have a different ratio of samples coming from the different capture platforms for UG and HC. Do you think that could cause the issue? Is there a way to deal with this?
Vicky
From SteveL on 2015-09-17
Hi Vicky,
I have no experience with UG myself, only HC, so I guess we’ll have to ask those who really know. @Geraldine_VdAuwera @Sheila?
It is not surprising that a few variants fail here and there, given that the batches were different, and there are always some extreme outliers in any NGS analysis comparisons in my experience.
Which percentage of your variants that were PASS in the smaller dataset are no longer PASS in the larger dataset?
Steve
From vsvinti on 2015-09-17
22% of variants from UG not in HC (~ 170,000)
4.5% of variants from HC not in UG (~ 30,000)
(comparing chr pos for PASS variants)
Thanks for your input :) I’ll see what the others think ..
From Sheila on 2015-09-17
@vsvinti
Hi,
Are you asking about results from version 2.8? Or, when you say you re-ran the analysis, do you mean you used the latest version? Unfortunately, it is really hard to comment on results from version 2.8. I was not even working here at the time! :smile:
In general, the differences could certainly be due to the way the annotations were calculated by Haplotype Caller and Unified Genotyper. Also remember, Haplotype Caller does a reassembly step which may shift the reads around. http://gatkforums.broadinstitute.org/discussion/4146/hc-step-2-local-re-assembly-and-haplotype-determination
-Sheila
From vsvinti on 2015-09-17
Thanks, @Sheila
I am asking about whether I should trust the new variants that pass VQSR in the HC analysis. I guess if I hadn’t done a previous analysis with UG, I might not have questioned the results, but I am getting some signals that look like they might be too good to be true with these new variants. Quoting one SNP as an example (mentioned above):
For example, one particular variant was given ReadPosRankSum=43.912 in the UG analysis, and failed (culprit=ReadPosRankSum), but ReadPosRankSum=1.16 in the HC analysis, and PASSed.
Would the difference in score be due to the 400 sample difference, or local reassembly, or a number of other factors? Would you expect to see such a big discrepancy between the calculated value between two runs?
2.8 was not that long ago; at the time, HC was not doable for the size of my data…
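For readers following along: ReadPosRankSum is a rank-sum (Mann-Whitney-style) z-score comparing the positions within reads of bases supporting the reference vs the alternate allele, so its value depends on the exact set of reads considered, which is why realignment or a different sample mix can move it substantially. A toy sketch of the statistic (simplified; GATK's actual implementation differs, e.g. in tie handling):

```python
import math

def rank_sum_z(ref_positions, alt_positions):
    # Pool both groups, rank them, and compute a Mann-Whitney z-score
    # for the alt group (ties ignored for simplicity).
    pooled = sorted([(v, 0) for v in ref_positions] + [(v, 1) for v in alt_positions])
    alt_rank_sum = sum(i + 1 for i, (_, grp) in enumerate(pooled) if grp == 1)
    n_alt, n_ref = len(alt_positions), len(ref_positions)
    u = alt_rank_sum - n_alt * (n_alt + 1) / 2
    mu = n_alt * n_ref / 2
    sigma = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12.0)
    return (u - mu) / sigma

# Alt-supporting bases clustered near the read ends give a strongly
# negative score, which is what a ReadPosRankSum filter penalizes.
z = rank_sum_z([30, 35, 40, 45, 50, 55], [2, 3, 4, 5])   # about -2.56
```

Change which reads go into the pool (different samples, or reads shifted by HC's local reassembly) and the ranks, and therefore the score, change with them.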
From Sheila on 2015-09-17
@vsvinti
Yes, the reassembly could explain the large difference in the ReadPosRankSum. You can take a look at the bamout file of the region and compare it to the original bam file. https://www.broadinstitute.org/gatk/guide/article?id=5484
Of course, the best thing to do is run again with Haplotype Caller in GVCF mode. https://www.broadinstitute.org/gatk/guide/article?id=3893
-Sheila
From Sheila on 2015-09-17
@Matteodigg
Hi Matteo,
I am not sure if this is the right assumption, but it sounds like you have 57 whole exome samples. You have subset the exome to the targets of interest, and you want to run VQSR on those targets.
If that is the case, it is best to simply run VQSR on the whole exomes themselves, then subset the filtered VCF to your targets of interest.
-Sheila
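In practice the subsetting step described here would be done with SelectVariants restricted to an intervals list; purely to illustrate the logic of "filter the whole exome, then subset", here is a stdlib-only sketch (`subset_vcf_to_targets` is a hypothetical helper, not a GATK tool):

```python
def subset_vcf_to_targets(vcf_lines, targets):
    """Keep header lines plus records whose POS falls in a target interval.
    `targets` maps chrom -> list of (start, end) BED-style half-open,
    0-based intervals; VCF POS is 1-based."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)          # always keep header lines
            continue
        chrom, pos = line.split("\t")[:2]
        pos = int(pos)
        if any(start < pos <= end for start, end in targets.get(chrom, [])):
            kept.append(line)
    return kept

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t5000\t.\tC\tT\t50\tPASS\t.",
]
subset = subset_vcf_to_targets(vcf, {"1": [(50, 200)]})   # drops the 5000 record
```

The key point is the ordering: VQSR sees the full exome's worth of sites (so the model has enough data), and the restriction to targets happens only afterwards.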
From vsvinti on 2015-09-18
Thanks @Sheila
I still need some degree of confidence in my results, and I am becoming a bit unsure about what comes out of VQSR as PASS (it seems to change a lot..).
Also, the proportion of FPs in my 90 tranche seems to be quite high -- what could be causing this? Your example shows only TPs in the 90s, so I’m quite far off that! I plotted this using the target_titv 3.2 (which is recommended for exomes)..
Thanks!
Vicky
ps. yes, I am dealing with exomes and two capture platforms, I pushed everything through the same workflow, and I am only analysing the intersection of captures + 50bp padding
Screen Shot 2015-09-18 at 15.56.28.png
From Geraldine_VdAuwera on 2015-09-18
Hi @vsvinti,
It’s not unexpected to see substantial differences in results between UG and HC. The variant modeling is more accurate in HC, and reads are realigned locally, so many annotations values will be different, and this in turn impacts the modeling in VQSR. In addition, having such a significant difference in cohort size, and a mix of capture platforms, makes it difficult to do a direct comparison. You also don’t mention whether you have done any evaluation against a common truth set, which would at least give you some objective indication of which callset is closer to the expected truth. Relative numbers of variants are just that, relative.
In brief, we can’t give you any measure of confidence (good or bad) based on this information, aside from saying that we have greater confidence in our newer tools and in our latest version. If you want greater confidence, you’ll need to do a more thorough evaluation.
From vsvinti on 2015-09-18
thanks @Geraldine_VdAuwera
From Matteodigg on 2015-09-25
Hi @Sheila ,
thank you for your answer; maybe I was not so clear. I have 9 whole exome samples (OK, I can use the whole exome, not only the target, for variant calling) but also 48 target-sequencing samples, and I’d like to use VQSR on all of these data. So, is it a good idea to use the 9 exomes + 48 target samples at the same time in VQSR?
I also have another question for you: if I have 2 VCF files with 10 samples each, can I use CombineGVCFs to merge them and then use VQSR? I’m quite sure that I cannot use more than one input file in VQSR, right?
Thank you very much for your patience!
From Geraldine_VdAuwera on 2015-09-25
@Matteodigg
No on both. The error modes are different depending on the experimental design; if you combine them you will confuse the model and obtain inferior results.
From TracyBallinger on 2015-11-10
> @SteveL said:
> I found this sentence VERY confusing, if not wrong (I am still not 100% sure):
>
> > The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls.
>
> Should it not be the first tranche from the TOP? i.e. in tranche-90 everything is considered TP (at least in this figure).
I also found this very confusing, even though I think it is now clear what the first tranche is :
“The first tranche (from the bottom, with the highest value of truth sensitivity but usually the lowest values of novel Ti/Tv) is exceedingly specific but less sensitive.“
Shouldn’t the tranche with the highest value of truth sensitivity be exceedingly sensitive and less specific?
From everestial007 on 2015-11-20
I am a little confused / in a dilemma about what I should or should not do for VQSR. In my situation I don’t have a database of highly confident variants (which is used for training the recalibration model) for the VQSR process.
Based on what we did in the BQSR process, I think a similar model can be applied for the VQSR process (but with the recalibration step only one time). Is it OK if I do the following:
1. Variant Calling in each biological sample using HC and UG using strict filtering options.
2. Merge the variants (from HC and UG) and call the common (intersecting) variants called by both the callers for this particular biological sample. https://www.broadinstitute.org/gatk/guide/article?id=53
3. Do the above two steps for all other biological samples (I have 6 biological sample from same population).
4. Now, merge the variants from the different biological samples and call the common (intersecting) variants. If the biological samples belong to the same family and/or population, my hunch is that we should now have a set of highly confident variants that belong to that family and/or population.
So, can I use these highly confident variants (at least as non-truth sites) in the VQSR process for training the recalibration model?
https://www.broadinstitute.org/gatk/guide/article?id=1259
Thanks in advance !
From Sheila on 2015-11-23
@TracyBallinger
Hi,
Sorry for the confusion! You are correct. I just tried to make the sentence a little clearer. Hopefully it makes sense now.
Thank you for the catch.
-Sheila
From Sheila on 2015-11-23
@everestial007
Hi,
Are you working with 6 whole exome samples or 6 whole genome samples? We recommend using VQSR with at least 30 whole exome samples or 1 whole genome sample. You might be better off using hard filtering. We have some basic recommendations here: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set However, you may need to play around with the thresholds to better suit your data.
-Sheila
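For reference, the hard-filtering approach recommended above amounts to failing a variant whenever any annotation crosses its documented threshold. A minimal sketch using the recommended SNP starting thresholds from the linked article (the filter names and `filter_status` helper here are illustrative, not GATK code):

```python
# Documented SNP starting thresholds: QD < 2.0, FS > 60.0, MQ < 40.0,
# MQRankSum < -12.5, ReadPosRankSum < -8.0. Tune these for your data.
SNP_FILTERS = {
    "QD2": lambda a: a.get("QD", 99.0) < 2.0,
    "FS60": lambda a: a.get("FS", 0.0) > 60.0,
    "MQ40": lambda a: a.get("MQ", 99.0) < 40.0,
    "MQRankSum-12.5": lambda a: a.get("MQRankSum", 0.0) < -12.5,
    "ReadPosRankSum-8": lambda a: a.get("ReadPosRankSum", 0.0) < -8.0,
}

def filter_status(annotations):
    """Return 'PASS' or a semicolon-joined list of failed filter names,
    mimicking how the FILTER column is populated by hard filtering."""
    failed = [name for name, fails in SNP_FILTERS.items() if fails(annotations)]
    return "PASS" if not failed else ";".join(failed)

filter_status({"QD": 25.0, "FS": 2.0, "MQ": 60.0})   # 'PASS'
filter_status({"QD": 1.3, "FS": 75.0, "MQ": 58.0})   # 'QD2;FS60'
```

This one-annotation-at-a-time logic is exactly the limitation described at the top of this document: a variant is discarded as soon as a single annotation looks bad, which is what VQSR's joint modeling avoids when there is enough data to train it.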
From aachueva on 2015-11-25
I have a VCF file that has several custom annotations (ABHet, blat score, etc.) that were calculated with in-house software and that I've added to the VCF INFO field using bcf-annotate. I would like to re-run VQSR recalibration on this VCF file using those custom annotations. So far I've had no luck getting it to run without an error when using custom annotations (the same command with GATK annotations like DP, QD, FS, etc. works just fine). Here is the error message I get when attempting to run VQSR with custom annotations:
##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file 'file_1.recal' could not be determined dynamically. Please add an explicit type tag: NAME listing the correct type from among the supported types
So, I'm wondering if it is even possible to build a Gaussian model using custom annotations, or is VQSR recalibration allowed only with annotations generated by GATK, e.g. QD, DP, FS, SOR, MQRankSum, etc.?
From Sheila on 2015-11-25
@aachueva
Hi,
Can you tell us which version of GATK you are using and the exact commands you tried to run?
Thanks,
Sheila
From aachueva on 2015-12-01
Hi Sheila, thanks for the response. I’m using GenomeAnalysisTK-3.4-46.
Here are the commands I’m using (NOTE: -an foo.ABHet and -an foo.uniq20 are my custom annotations that are in the VCF file):
```
PREFIX="file_1"
VERSION="v1"
java -jar $GATK_DIR/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R $REF_GENOME \
--input:VCF $PREFIX".vcf" \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $REF_DIR/hapmap_3.3.b37.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 $REF_DIR/1000G_omni2.5.b37.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 $REF_DIR/1000G_phase1.snps.high_confidence.b37.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $REF_DIR/dbsnp_138.b37.vcf.gz \
-mode SNP \
-an QD \
-an MQRankSum \
-an MQ \
-an ReadPosRankSum \
-an SOR \
-an FS \
-an BaseQRankSum \
-an GQ_MEAN \
-an foo.ABHet \
-an foo.uniq20 \
-mG 10 \
-mNG 3 \
-tranche 100.0 \
-tranche 99.9 \
-tranche 99.8 \
-tranche 99.7 \
-tranche 99.6 \
-tranche 99.5 \
-tranche 99.4 \
-tranche 99.3 \
-tranche 99.2 \
-tranche 99.1 \
-tranche 99.0 \
-tranche 98.0 \
-tranche 97.0 \
-tranche 96.0 \
-tranche 95.0 \
-tranche 90.0 \
-recalFile $PREFIX"_"$VERSION".recal" \
-tranchesFile $PREFIX"_"$VERSION".tranches" \
-rscriptFile $PREFIX"_"$VERSION".R"
java -jar $GATK_DIR/GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R $REF_GENOME \
--input:VCF $PREFIX".vcf" \
--ts_filter_level 99.5 \
-recalFile $PREFIX"_"$VERSION".recal" \
-tranchesFile $PREFIX"_"$VERSION".tranches" \
-mode SNP \
-o $PREFIX"_"$VERSION"_output.vcf.gz"
```
Thank you in advance
From Sheila on 2015-12-02
@aachueva
Hi,
It should be possible to use custom annotations; you just have to make sure the annotations you specify exactly match the annotation names in the VCF. Can you confirm that the .recal file is not empty?
-Sheila
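A quick way to pre-check the name matching Sheila mentions is to compare the -an arguments against the INFO IDs declared in the VCF header. A small illustrative helper (hypothetical, stdlib only):

```python
def missing_annotations(header_lines, requested):
    """Return requested -an names with no matching ##INFO declaration
    in the VCF header."""
    declared = set()
    for line in header_lines:
        if line.startswith("##INFO=<ID="):
            # Header lines look like: ##INFO=<ID=QD,Number=1,Type=Float,...>
            declared.add(line.split("ID=", 1)[1].split(",", 1)[0])
    return [name for name in requested if name not in declared]

header = [
    '##INFO=<ID=QD,Number=1,Type=Float,Description="Qual by depth">',
    '##INFO=<ID=FS,Number=1,Type=Float,Description="FisherStrand">',
]
missing_annotations(header, ["QD", "FS", "foo.ABHet"])   # ['foo.ABHet']
```

Any name returned by such a check would need to be added to the VCF (header and records) before VariantRecalibrator can train on it.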
From everestial007 on 2015-12-02
> Sheila said:
> @everestial007
> Hi,
> Are you working with 6 whole exome samples or 6 whole genome samples? We recommend using VQSR with at least 30 whole exome samples or 1 whole genome sample. You might be better off using hard filtering. We have some basic recommendations here: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set However, you may need to play around with the thresholds to better suit your data.
> -Sheila
Hi Sheila,
Sorry for the late reply, but I just had the chance to start working on my data again.
I am working with 6 whole genome samples per population, from 2 populations (so 12 whole genome samples in total).
I have done the filtering part (using stringent filtering parameters) and collected highly confident variants for each population. But I am also interested in rare variants within/between populations to work on EHH (extended haplotype homozygosity).
I have also called variants in g.vcf mode for each sample.
So, I am wondering if it is possible to take the highly confident variants identified from these samples to run VQSR (I am thinking of it as similar to the BQSR process) on the g.vcf file, which has the possibility of identifying rare variants. I am not sure if it’s a good question to ask, but I am not sure if I can approach VQSR this way. Any suggestions would be helpful.
Thank you,
From Sheila on 2015-12-08
@everestial007
Hi,
I think Geraldine’s answer [here](http://gatkforums.broadinstitute.org/discussion/comment/25693#Comment_25693) sums up the issue.
Good luck!
-Sheila
From everestial007 on 2015-12-10
Thanks @Sheila
The discussion on the link you posted is obviously going to be helpful.
Thank you !
From everestial007 on 2015-12-10
The Dropbox link mentioned at the beginning of the article isn’t working.
From Sheila on 2015-12-11
@everestial007
Hi,
Yes, that does not work anymore. I removed it from the document. The best thing to do is check out the slides from the talks (from the Events Webpage). The latest BroadE workshop slides and presentations will be uploaded soon. But, you can have a look at the BroadE workshop slides and talks from March [here](https://www.broadinstitute.org/gatk/guide/presentations?id=5992#materials).
-Sheila
From everestial007 on 2015-12-11
> Sheila said:
> @everestial007
> Hi,
> Yes, that does not work anymore. I removed it from the document. The best thing to do is check out the slides from the talks (from the Events Webpage). The latest BroadE workshop slides and presentations will be uploaded soon. But, you can have a look at the BroadE workshop slides and talks from March [here](https://www.broadinstitute.org/gatk/guide/presentations?id=5992#materials).
> -Sheila
It’s going to be helpful.
Thank you !
From jena on 2016-01-04
Hi,
I work with exome-capture data of several non-model fish species from one genus, and their hybrids. I called variants with HC in GVCF mode and then joint genotyped them using GenotypeGVCFs tool. This first analysis was done on 18 priority samples of 2 species and their hybrids (and one outgroup). Another about 60 samples are currently running through the HC pipeline and I will repeat the whole procedure with them included – I’m just learning this tool on smaller dataset I have at hand.
I tried this VQSR on our 18 samples. For training I used our own database of high-confidence SNPs (~30k) that are diagnostic between the two species (parental species are homozygous for either ref or alt, and hybrids are heterozygous). I’ve set tranches as described in the best practice document, i.e. `-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0` and also `--maxGaussians 4`. I also excluded the DP and InbreedingCoeff annotations from the call (we don’t have just one population so we don’t expect HWE).
However, the final recalibrated_snps.vcf contains only PASS or VQSRTrancheSNP99.90to100.00 filters; no other tranches seem to be present in my data. How is this possible? Could it be that my training dataset is too small?
I was also wondering if the VQSR is OK with data from more than one species (I’ve excluded InbreedingCoeff for this reason) and for that matter – what about the rest of GATK pipeline? As far as I understand the tools this shouldn’t be a problem for most of them, possibly only VQSR, am I right?
Thanks in advance for any insights.
Jan
From jena on 2016-01-06
Hi again,
just realised my previous question was quite dumb, because there is only one tranche in my data due to the ApplyRecalibration procedure. My bad :)
The question about multiple species (same genus) in VQSR still stands though :)
Thanks
Jan
From Geraldine_VdAuwera on 2016-01-08
Hi @jena, no worries.
All the GATK tools do require that you run your analysis relative to a single reference genome. If you can meet that condition then it should be fine. But to be honest we don’t have any direct experience with such a use case so we won’t be able to advise you much.
From chaozhongyinxiang on 2016-01-21
Hi, we have been trying to use VQSR to annotate/classify our callset. I think we understand the nature of VQSLoD score as well as the source of the “true positives” during training, but just wonder what would be the “false positives” that are fed to the GMM?
Thanks,
Chao
From Geraldine_VdAuwera on 2016-01-22
“False positives” or “bad variants” are derived from the training set too; they are the worst scoring variants from the training set.
From mscjulia on 2016-08-26
Hi, I’m just wondering if GATK has any built-in utilities to compare genotypes between samples of a cohort study (using g.vcf from HaplotypeCaller), once I’m done with VQSR. Thanks a lot.
From Sheila on 2016-08-29
@mscjulia
Hi,
Yes, there are a few different tools that can be useful. Have a look at Picard’s [GenotypeConcordance](https://broadinstitute.github.io/picard/command-line-overview.html#GenotypeConcordance) and [SelectVariants](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php). You may find [VariantEval](http://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset) helpful too.
-Sheila
From cmorgan1 on 2016-09-21
Hello,
I have a query on the best practices for the ‘VariantRecalibrator’ step in the GATK pipeline.
I have sequenced ~15 Mb of non-coding regions across 96 samples to a median depth of 50X. I am specifically interested in identifying rare variants (MAF<1%).
Using the GATK best practices pipeline, I trained ‘VariantRecalibrator’ on the following data using the following priors:
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $Variants/hapmap_3.3.hg19.sites.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 $Variants/1000G_omni2.5.hg19.sites.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 $Variants/1K.wgs.phase3.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $Variants/All_20160601.vcf.gz \
The resulting tranche plot (see attached) has extremely low Ti/Tv ratios (0.246-0.385), indicating lots of false positives, which go in the opposite direction to the WES example provided, with seemingly no variants marked as ‘cumulative TPs’ or ‘tranche-specific TPs’.
At first glance, my data look absolutely awful... however, I am wondering: because I am looking at non-coding regions, and because the per-base coverage in these regions is really high in my dataset (~50X) and quite low in the training datasets (~5X), is it possible that these training sets are not a suitable prior to effectively help in identifying true positives? Should I down-weight the priors?
I reduced --maxGaussians to 4, as advised in some of the other comments to avoid overfitting, but there was no significant improvement.
If someone has come across a situation like this before and could advise on a way to reduce the number of false positives in this dataset, it would be much appreciated.
Going back to ‘hard filtering’ is plan B, but I would sincerely like to optimise the parameters for ‘VariantRecalibrator’ for this particular dataset type.
Thank you in advance,
Claire
From Sheila on 2016-09-28
@cmorgan1
Hi Claire,
Sorry for the late response. Those tranche plots look terrible. Notice, you have no true positives in any of them.
How did you generate the input VCF? Please post the exact command you ran.
Thanks,
Sheila
From mscjulia on 2016-09-28
Hello,
I have a question about how VQSR considers diversity between samples. Let’s say I have applied VQSR to two different cohort studies, each of which has 4 samples. For Study A, chr 1 position 1, the genotypes of my 4 samples are all different (e.g. AA, TT, AT, GC), whereas for Study B, at the same position, all 4 samples have the same genotype. When VQSR evaluates these two cases in cohort mode, would it give a lower tranche score to Study A? Thanks a lot.
From Sheila on 2016-10-04
@mscjulia
Hi,
First, can you confirm that the samples are whole genome? We recommend running VQSR on at least 30 whole exome samples or 1 whole genome sample (more than one whole genome sample does much better).
VQSR does not actually take into account the genotypes of the samples; it only looks at the site-level annotation distributions. We do have some new allele-specific annotations that calculate annotations for each alternate allele rather than across all alternate alleles together. You can try those out using [`-AS`](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php#--useAlleleSpecificAnnotations) in VariantRecalibrator and [`-AS`](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php#--useAlleleSpecificAnnotations) in ApplyRecalibration.
-Sheila
From la_br2016 on 2016-12-02
Hi @Sheila, @Geraldine_VdAuwera,
For a WES (target ~50Mb plus +-100bp padding) project of about 2500 samples, would you advise changing --maxGaussians to 4, keeping it at the default 8, or something in between?
thank you,
-lili
- Can I use the variant quality score recalibrator with my small (?) sequencing experiment?
This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.
One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.
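To see why fewer Gaussians help on small callsets: VariantRecalibrator fits a Gaussian mixture model over annotation space, and with few variant sites there is not enough data to estimate many components reliably. A toy 1-D EM fit as illustration only (the real model is multi-dimensional and includes extra safeguards; `fit_gmm_1d` is a hypothetical sketch):

```python
import math

def fit_gmm_1d(data, k, iters=50):
    """Toy 1-D EM fit of a k-component Gaussian mixture."""
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]  # spread starting means
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [w / (s * math.sqrt(2 * math.pi)) * math.exp(-(x - m) ** 2 / (2 * s * s))
                    for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and (floored) std deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(1e-3, math.sqrt(var))
    return mus, sigmas, weights

# Two clear clusters are recovered with k=2; with too many components and
# too little data, components collapse onto a handful of points instead
# (overfitting), which is the failure mode --maxGaussians 4 guards against.
data = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
mus, sigmas, weights = fit_gmm_1d(data, k=2)
```

With only a targeted panel's worth of sites, each extra Gaussian has fewer points to estimate its mean and covariance from, so the model becomes unstable; lowering the component count trades flexibility for stability.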
From Sheila on 2016-12-06
@la_br2016
Hi Lili,
I think it will be best to try [hard filtering](https://software.broadinstitute.org/gatk/documentation/article?id=2806). Small targeted experiments simply do not have enough data for VQSR to make a good model.
-Sheila
From la_br2016 on 2016-12-07
Hi @Sheila, this is whole exome sequencing. I have samples (n=2500) captured with Agilent and IDT, so when I intersected the target regions from these 2 kits I was left with ~50Mb. I thought VQSR works well with whole exomes.
I am evaluating both hard filtering and different levels of sensitivity when using VQSR; that is why I wanted to clarify --maxGaussians.
From Geraldine_VdAuwera on 2016-12-09
Since you have very many samples there’s a good chance this will work, yes. My recommendation would be to try a range of settings for the number of Gaussians (e.g. 4, 6 and 8) and see what results in the most stable model. You can typically use more for the SNP recalibration than for indels (they don’t have to be the same).
From la_br2016 on 2016-12-09
thank you @Geraldine_VdAuwera
Also, I noticed three slightly different documents about hard filtering.
Regarding the SOR recommendation, is this correct: SOR > 3.0 for SNPs and SOR > 10.0 for INDELs?
> SOR > 3.0 for SNPs and SOR > 10.0 for INDELs
source: http://gatkforums.broadinstitute.org/gatk/discussion/3225/i-am-unable-to-use-vqsr-recalibration-to-filter-variants
no SOR recommendations here?
https://software.broadinstitute.org/gatk/documentation/article?id=2806
“our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test” I guess this document is only for SNPs?
source: https://software.broadinstitute.org/gatk/guide/article?id=6925
From Geraldine_VdAuwera on 2016-12-09
@la_br2016 We usually have different cutoffs for SNPs and indels because their profiles are different, so that’s normal. The tutorial doesn’t have SOR because it was written before SOR was created — we’ll probably need to update that at some point. And that last doc is indeed about SNPs, yep.
From la_br2016 on 2016-12-14
thank you @Geraldine_VdAuwera.
Do you know if the hard-filter cutoff MQ < 40 for SNPs (now RAW_MQ) is still the same? I see a very broad range for RAW_MQ:
```
summary(SNP$RAW_MQ)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     729    97200   169200   215615   270000 10312336
```
thanks,
-lili
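On the broad RAW_MQ range: as I understand the annotation, RAW_MQ accumulates the squared per-read mapping qualities, so its scale grows with depth, and the familiar root-mean-square MQ is recovered as sqrt(RAW_MQ / depth); the MQ < 40 cutoff applies to that converted value. A sketch under that assumption (check your GATK version's annotation docs):

```python
import math

def mq_from_raw(raw_mq, depth):
    # RAW_MQ sums MQ^2 over reads; dividing by depth and taking the
    # square root recovers the usual root-mean-square MQ.
    return math.sqrt(raw_mq / depth)

mq_from_raw(270000, 75)  # 60.0, e.g. 75 reads all mapped at MQ 60
mq_from_raw(97200, 75)   # 36.0, which would fail an MQ < 40 cutoff
```

This is why RAW_MQ values are only comparable between sites after normalizing by depth, and why their absolute range looks so wide in a summary.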
From Magda on 2016-12-16
Hello,
I have some questions about the expected TP/FP tranche plot and the Ti/Tv value.
I have a set of 72 expanded exomes; after performing all steps from the Best Practices recommendations, I ran VQSR.
The commands used for SNPs (afterwards for indels):
icav:mag$ java -Xmx6g -jar Programes/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T VariantRecalibrator
-R Programes/GenomeAnalysisTK-3.4-46/hg19.fa -input Samples_run2/11.GATK_HaploCaller_ip/Output_Joint_D.vcf
-recalFile Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrate_SNP.recal -tranchesFile Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrate_SNP.tranches -nt 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37_chr.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37_chr.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37_chr.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37_chr.vcf
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0
-rscriptFile Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrate_SNP_plots.R
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
icav:mag$ java -Xmx3g -jar Programes/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T ApplyRecalibration
-R Programes/GenomeAnalysisTK-3.4-46/hg19.fa -input 48_samples_run2/11.GATK_HaploCaller_ip/Output_Joint_D.vcf
-recalFile Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrate_SNP.recal
-tranchesFile Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrate_SNP.tranches
-o Samples_run2/12.v2.GATK_VarRecalibrator_ip/recalibrated_snps_raw_indels.vcf
--ts_filter_level 99.5
-mode SNP
Regarding the tranches, I’ve seen that the lowest tranche (90) contains all (or mainly) TPs (according to the plot you have here) and has the highest Ti/Tv, which decreases as the tranche increases.
a) In my case I have the opposite: the 90 tranche has lots of FPs too, and the Ti/Tv values in fact increase when I increase the tranche (the 100 tranche has the highest Ti/Tv value).
Is my pattern OK, or is there something wrong?
b) I applied the 99.5 filter level, but, since my Ti/Tv value is higher for the 100 tranche, should I use the 100 filter?
c) Even the highest Ti/Tv I get (at 100) is still very low (1.5). Does that mean I need to do post-filtering after VQSR?
My individuals are African, so I expect a high number of novel variants, but many variants should still be the same as the ones in the resource files, so why do I have such a big FP proportion when using tranche 90? I don’t understand it.
Can you give me some indication of what to do? or whether I can trust my SNP data?
Thank you,
Magda
From shlee on 2016-12-16
Hi @Magda,
Can you explain in detail what your expanded exome data represents? Also, was calling done on intervals or over the entire dataset? We are looking for violations of assumptions that VQSR uses in its modeling.
The trends your data show appear to be the converse of what is desired (I’ve attached an example of what is conventional). Take a look at the recommendations in [this document](https://software.broadinstitute.org/gatk/guide/article?id=1259) and see if you are adhering to them.
Screenshot 2016-12-16 15.59.33.png
From Magda on 2016-12-19
Hi Shlee,
Thank you for answering.
My expanded exome data is the result of using Illumina’s Nextera Expanded Exome capture enrichment kit, which includes the exome as well as untranslated regions (UTRs) and miRNA.
The calling was done on the intervals plus -ip 100 (I used the recommendations of the Best Practices).
Variant Calling:
java -Xmx6g -jar Programes/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller
-R Programes/GenomeAnalysisTK-3.4-46/hg19.fa
-I Samples_run2/10.GATK_BaseRecalibrator_ip/${ind}_Recal.bam
-o Samples_run2/11.GATK_HaploCaller_ip/${ind}_raw_g.vcf
--dbsnp dbsnp_138.b37_chr.vcf
--genotyping_mode DISCOVERY
-stand_emit_conf 10
-stand_call_conf 30
-L nexterarapidcapture_exome_targetedregions_v1.2.bed -ip 100
-ERC GVCF
-variant_index_type LINEAR
-variant_index_parameter 128000
java -Xmx6g -jar ../../Programes/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T GenotypeGVCFs
-R ../../Programes/GenomeAnalysisTK-3.4-46/hg19.fa
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
-V H01_raw.g.vcf -V H08_raw.g.vcf -V H10_raw.g.vcf -V H29_raw.g.vcf -V H91_raw.g.vcf -V H121_raw.g.vcf -V H138_raw.g.vcf -V H196_raw.g.vcf -V H202_raw.g.vcf …
… (72 inds in total)
H201_raw.g.vcf -V B353_raw.g.vcf -V H130_raw.g.vcf -V H71_raw.g.vcf -V B424_raw.g.vcf -V B872_raw.g.vcf
-D ../../dbsnp_138.b37_chr.vcf
-o Output_Joint_D.vcf
- VQSR
(as I put in the previous message)
- The only things I can think of that could affect the results are:
1) In some places the Omni resource appears with truth=true and in others with truth=false. I used truth=true.
2) In several parts of my pipeline there were incompatibilities between the reference, the input and the dbSNP files. For example I had an error due to incompatibilities between the reference and dbSNP contigs (ERROR omni contigs = [1, 2, 3… ERROR reference contigs = [chr1, chr2, chr3…), so I had to add “chr” to the resource files, or reorder.
Thank you
From shlee on 2016-12-20
Hi @Magda,
I will get to your question soon. We are currently attending a company retreat. In the meantime, I want to let you know that one other person using expanded exomes appears to observe similar low TiTv ratios ([this thread](http://gatkforums.broadinstitute.org/gatk/discussion/8269/low-novel-ti-tv-ratio-in-expanded-exome-data)). Also [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/comment/34509#Comment_34509) may be of interest to you.
From Geraldine_VdAuwera on 2016-12-21
Hi @Magda, I'm very concerned that your issue may be due to a reference mismatch. Between using `-U ALLOW_SEQ_DICT_INCOMPATIBILITY` and modifying contig names, you have made it possible for the program to make blind assumptions about the overlaps between your callset and the resource datasets you are using, without any guarantee that the sites match at all. This is completely unsupported and is probably why your filtering results look terrible. I can only recommend redoing everything from scratch, checking this time that you are using matching references. If you get reference mismatch errors, don't try to hack your way around them; you have to fix the source of the problem, which may involve using different resources, or sometimes realigning and recalling the data from raw reads.
From RaviKumarSindhu on 2017-01-04
Hi, I am working with Human sample WGS data (30X, PE reads, HiSeq platform) and facing problem at VQSR and variant annotation step. I ran following commands (modified here, path deleted) to get my files:
1. Converted each sample’s bam file into gVCF using GATK’s HaplotypeCaller (-ERC GVCF function) for joint genotyping and VQSR, command:
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I ETH003388_final.bam -D rs_snp.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --alleles rs_snp.vcf --max_alternate_alleles 3 -L rs_snp.vcf -ip 150 -o ETH003388.g.vcf -stand_call_conf 0 -stand_emit_conf 0
(where rs_snp.vcf is customized dbSNP.vcf with only variants having a SNP rsID).
2. Combined all my individual samples' g.vcf files to form a cohort.g.vcf, command:
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T CombineGVCFs -R ucsc.hg19.fasta -D rs_snp.vcf -o cohort.g.vcf --variant sample1.g.vcf --variant sample2.g.vcf --variant sample3.g.vcf …….. --variant sampleN.g.vcf
3. Converted cohort.g.vcf to genotype.g.vcf, command:
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D rs_snp.vcf -allSites -stand_call_conf 0 -stand_emit_conf 0 -maxAltAlleles 3 --variant cohort.g.vcf -o genotype.g.vcf
4. Running VariantRecalibrator command
/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input genotype.g.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 rs_snp.vcf -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
Question: I am getting all my variants in the "False Positive" category in 'recalibrate_SNP.tranches.pdf'. Any idea why all my variants are falling in the FP category? I also posted this question on the Biostars website but got no response other than a suggestion to contact the GATK discussion forum (https://www.biostars.org/p/229247/).
From Sheila on 2017-01-05
@RaviKumarSindhu
Hi,
There are 3 things to try:
1) Remove `-stand_call_conf 0 -stand_emit_conf 0` from the GenotypeGVCFs command and see what the plots look like.
2) Can you post some example records from the final VCF?
3) Please also post the model plots from VQSR.
Thanks,
Sheila
From vsvinti on 2017-01-09
Hi there, I have a set of variants for several hundred exomes for which VQSR performs very poorly (I had posted on this before but we didn't find a solution: I did have samples sequenced with different capture platforms, but it didn't make a difference to how the VQSR plots looked if I just called on the ~600 exomes that used the same capture). So I have been using hard filters instead up to now. However, I adjusted the thresholds based on internal features of the data: I estimated them from a set of variants that I know to be real (99% confidence) and a set of variants that I know to be false in my data. This does get rid of false positives, but it is very stringent and I am losing a lot of variants.
Is it possible to use the custom list of true or false variants in the VQSR training model, to see if it can perform better? I am talking about a couple of thousand variants.
If so, which option would allow me to specify these: would they be specified as truth or as training sets, what priors to use etc?
From:
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
I am assuming that my true variants will be similar to hapmap/omni, eg
-resource:custom_true,known=false,training=true,truth=true,prior=? custom_true.vcf
? And can I use my false set anywhere?
I would appreciate hearing your thoughts on this.
Thanks, Vicky
From Sheila on 2017-01-13
@vsvinti
Hi Vicky,
Can you tell us which version of GATK you are using and the exact command you ran?
Also, can you post the plots? Why do you say VQSR is performing badly?
Thanks,
Sheila
From vsvinti on 2017-01-17
Hi @Sheila
Thanks for your reply. We have gone through a few cycles of understanding/diagnosing problems with the VQSR plots in the forums. This data was analysed several times, starting with versions from GATK 2.5 up to the current one I am using, which is 3.5-0-g36282e4. The conclusion we came to was that, because I am dealing with samples sequenced in different centres, at different times, with different capture platforms, this would be the cause of the bad performance. I am not entirely convinced, but have moved on to using hard filters as I couldn't improve the plots. Commands are from the Best Practices for the matching version. I am attaching the plots from version 3.5 (I have also tried including different annotations, different subsets of samples, different depth ranges, etc.).
So instead of going backwards, perhaps you would be able to suggest whether it's possible to use custom variant sets when training the VQSR model? I have estimated true variants and false variants from my data based on other datasets we have in house and validations we have carried out, and I am curious to see if this would improve things. Also, I don't know how much difference a set of ~2k variants can actually make in the model?
Please let me know if this is possible.
Thanks
Vicky
From Sheila on 2017-01-19
@vsvinti
Hi Vicky,
Okay, so you are running on exome samples? How many do you have? Have you been running on the intersection of the intervals? Can you please post the exact command you ran for VQSR?
I don’t understand what you mean by “Also, I don’t know how much difference a set of ~2k variants can actually make in the model?”
Thanks,
Sheila
From vsvinti on 2017-01-19
@Sheila, sorry, I have gone through all this already with Geraldine a while ago. I have 2000 exomes, called on intersect, etc
At this point, I am just interested to know if I can use a custom set of true/false variants to train the VQSR model. If you can answer that question, that would be great. Otherwise, thanks for your help, but I am not going through it all over again.
From Sheila on 2017-01-19
@vsvinti
Hi Vicky,
Oh, I see. So, you are interested in using a custom set of resource VCFs? Are you only using the custom set of resources, or are you using the custom set along with the recommended human resource VCFs?
Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/comment/25693#Comment_25693) which may help.
-Sheila
From vsvinti on 2017-01-20
Hi @Sheila
I only have ~2k variants that I am pretty confident are real (based on genotyping of same samples on arrays), and a similar number that I am pretty confident are false (from trios). So I would use them alongside the human resources recommended in your best practices.
At the moment, I am using these variants to obtain custom cut offs for annotations that seem to characterize true variants (ie QD, etc), and use these thresholds to ‘hard filter’ my calls. So I could:
I was leaning towards the first point, but the link you refer me to seems to indicate that my true/false sets may be too small to make a difference. Regardless, I can try to see what I get with either.
So back to my question: if I want to use these custom sets in addition to recommended, how would that change my commands below:
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
Would I just add another line:
-resource:custom_true,known=false,training=true,truth=true,prior=? custom_true.vcf
? And can I use my false set anywhere?
Thanks
Vicky
From Sheila on 2017-01-23
@vsvinti
Hi Vicky,
Yes, you can add in the custom resource file like you have above. For the prior, you can try playing around with different values, but because you are very sure about those variants, you should be safe starting with a pretty high value.
I don’t think you can use your negative/false set anywhere in VQSR.
Good luck! Let us know how it goes.
-Sheila
From Magda on 2017-01-30
Hi @Geraldine_VdAuwera
As recommended I started everything from scratch using the genome ref. provided in the bundle (as well as the resource files), and after doing the whole pipeline again, the pattern I have is more or less the same.
To remind you: I've done Expanded Exome (Illumina) (64Mb) on 72 samples. The pipeline I used is below.
1. sickle-master/sickle pe -f Samples/${ind4}_${adaptors}_L004_R1.fastq.gz -r Samples/${ind4}_${adaptors}_L004_R2.fastq.gz -t sanger -o Samples/Sickle_30/${ind4}_filtered_L004_R1.fastq -p Samples/Sickle_30/${ind4}_filtered_L004_R2.fastq -s Samples/Sickle_30/${ind4}_filtered_L004_singles.fastq -q 30 > Samples/Sickle_30/${ind4}_L004_sickle_stats
2. bwa/bwa mem -M -t 16 -R "@RG\tID:${ind4}.Run2_L4\tLB:Lib2\tPL:illumina\tPU:Run2_L4\tSM:${ind4}" human_g1k_v37_decoy.fasta.gz Samples/2.Sickle_30/${ind4}_filtered_L004_R1.fastq.gz Samples/2.Sickle_30/${ind4}_filtered_L004_R2.fastq.gz > Samples/3.Alignment_bwa_V2/${ind4}_L004.sam
3. samtools-1.2/samtools view -b -S -o Samples/4.Samtools_bam_V2/${ind3}_L003.bam Samples/3.Alignment_bwa_V2/${ind3}_L003.sam.gz
4. samtools-1.2/samtools sort Samples/4.Samtools_bam_V2/${ind4}_L004.bam Samples/5.Bam_sorted_V2/${ind4}_L004.sorted
5. samtools-1.2/samtools index Samples/5.Bam_sorted_V2/${ind4}_L004.sorted.bam
6. samtools-1.2/samtools merge Samples/6.Bam_Merged_Samples_V2/${ind}_Merged.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L001.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L002.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L003.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L004.sorted.bam
7. java -Xmx4g -jar picard.jar MarkDuplicates I=../Samples/6.Bam_Merged_Samples_V2/${ind}_Merged.sorted.bam O=../Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam METRICS_FILE=../Samples/7.MarkDup_Picard_V2/${ind}_Mkdup_metrics.txt VALIDATION_STRINGENCY=LENIENT; done;
8. sed 's/^chr//g' nexterarapidcapture_expandedexome_targetedregions.bed > nexterarapidcapture_expandedexome_targetedregions_nochr.bed
java -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam --known 1000G_phase1.indels.b37.vcf --known Mills_and_1000G_gold_standard.indels.b37.vcf -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -o Samples/8.GATK_ReaTarInt_V2/${ind}_RealignerTarget.intervals
9. java -Xmx4g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals Samples/8.GATK_ReaTarInt_V2/${ind}_RealignerTarget.intervals -o Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam
10.1 java -Xmx6g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37_decoy.fasta -I Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam -knownSites 1000G_phase1.snps.high_confidence.b37.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -o Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recalibrated.table
10.3 java -Xmx4g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37_decoy.fasta -I Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam -BQSR Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recalibrated.table -o Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recal.bam
11.1 java -Xmx6g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recal.bam -o Samples/11.GATK_HaploCaller_V2/${ind}_raw_g.vcf --dbsnp dbsnp_138.b37.vcf --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000
11.2 java -Xmx6g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ../../human_g1k_v37_decoy.fasta -V H01_raw.g.vcf -V H08_raw.g.vcf -V H10_raw.g.vcf -V H29_raw.g.vcf -V H91_raw.g.vcf -V N424_raw.g.vcf -V N872_raw.g.vcf …… (in total 72 inds) -D ../../dbsnp_138.b37.vcf -o Output_Joint_D.vcf
12.1.1 java -Xmx6g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T VariantRecalibrator -R ../../human_g1k_v37_decoy.fasta -input ../11.GATK_HaploCaller_V2/Output_Joint_D.vcf -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -nt 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../../hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 ../../1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../../1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../../dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 -rscriptFile recalibrate_SNP_plots.R
12.1.2 java -Xmx3g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T ApplyRecalibration -R ../../human_g1k_v37_decoy.fasta -input ../11.GATK_HaploCaller_V2/Output_Joint_D.vcf -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf --ts_filter_level 99.5 -mode SNP
And I still have the same type of plot, that is, higher TiTv for the highest tranches, which should be the other way round. What I don't understand is why it says that the TiTv at 99.5 (the tranche I finally applied) is 1.584, when with bcftools I get Ts/Tv = 2.2 for SNPs.
In any case, I'm starting to do some filtering of the data, that is, filtering for a minimal VQSLOD score, sites with a maximum of missing data, etc.
We have also applied the same pipeline (with the newest GATK version) for another Expanded Exome dataset of another population, and we also have the same pattern in the tranches plot.
Well, I don’t know what can produce this. In any case, I’ll just combine the VQSR + filtering until I get a High quality SNP dataset.
Thank you.
Magda
From Geraldine_VdAuwera on 2017-02-08
Hi @Magda,
The TiTv reported in the VQSR tranches plot is the novel TiTv, i.e. the TiTv of SNPs in your callset that have not been previously reported (= not seen in the resource files you're using). If we looked at TiTv across everything, the number would be heavily biased by all the known SNPs, which are probably real. So we look at only the novel ones to evaluate whether their TiTv looks good overall.
As we’ve discussed elsewhere, our expectations for TiTv are based on what parts of the genome are being examined. Since we haven’t looked at the Expanded Exome specific regions ourselves, we have no basis for predicting what their value should be — but to be frank I would be surprised if it was that different from the rest of the exome.
I would recommend you evaluate your callset with other methods to determine whether you have a problem with excessive false positive rate, or something else. See the documentation on hard filters for some tips and tricks relating to looking at the distribution of annotation values.
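For anyone who wants to see the novel-versus-overall distinction concretely, the arithmetic can be sketched in a few lines of Python (illustrative only; the transition set is standard, but the toy callset is made up):

```python
# Illustrative Ti/Tv calculation over (REF, ALT) SNP pairs.
# Transitions are A<->G and C<->T; everything else is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv(snps):
    """Transition/transversion ratio for a list of (ref, alt) pairs."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# Toy callset: (ref, alt, seen_in_resource_files)
calls = [
    ("A", "G", True), ("C", "T", True), ("G", "A", True),
    ("A", "C", False), ("C", "T", False), ("T", "G", False),
]

overall = titv([(r, a) for r, a, _ in calls])
novel = titv([(r, a) for r, a, known in calls if not known])
print(f"overall Ti/Tv = {overall:.2f}, novel Ti/Tv = {novel:.2f}")
```

Note how the overall ratio is pulled up by the known (mostly real) sites, which is exactly why the tranches plot reports the novel number instead.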
From bassu on 2017-05-18
Hi,
I have 10+ RNA-seq and whole-genome (human) datasets. For RNA-seq I followed the GATK Best Practices method and for WGS I followed the usual bowtie2 approach. In both analyses I used BQSR, and the VCF used was "All_20160527.vcf" from NCBI. I filtered the VCF (RNA-seq) as below:
gatk -Xmx30048M -T HaplotypeCaller -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -I recal.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o bqsr.vcf
gatk -Xmx30048M -T VariantFiltration -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -V bqsr.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o filtered_bqsr.vcf
My questions are as follows
1) Does it make sense to do VQSR on the RNA-seq dataset? If so, are there any filters I need to look into?
2) Do I need to do VQSR on the merged VCF (I made one with bcftools) or on each sample individually?
3) For the "-resource" arguments, should I use the same "All_20160527.vcf", the HapMap data, or both?
4) Some of the samples here are related (siblings, cousins), so do I need to do any extra filtration?
5) Does Question 4 have any effect on "-an InbreedingCoeff"? I might use this data for association mapping (PLINK).
From mscjulia on 2017-05-18
Hello,
I have run VQSR on a WGS study with 35 samples. Now I need to compare some particular samples out of this pool of 35. For example, I would like to compare sample1 to sample2. I understand that I can use SelectVariants to easily extract these 2 samples to a different VCF, but my question is what should I do about the VQSR tranches? The "PASS" is for all of the 35 samples, correct?
If so, are there other filters I should use for such a comparison? Would DP combined with GQ be a good choice? Thanks a lot
From mscjulia on 2017-05-18
Just to add to the above: I worry that when I compare sample1 and sample2, if I only look at "PASS" SNPs, I will miss some SNP genotypes that are of good quality for sample1 and sample2, but not for sample9, and are thus marked as "VQSRTranchesSNP99.9to100".
From Sheila on 2017-05-22
@bassu
Hi,
I think [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/4280/vqsr-in-non-human) and [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/5431/inbreeding-coefficient-calculation-documentation-available) will help.
-Sheila
From Sheila on 2017-05-23
@mscjulia
Hi,
VQSR takes into account the site-level annotations which are calculated from all samples. It is unlikely that a site will fail because one sample has “bad” annotation values.
I am not sure what your end goal is, but you can do a comparison of PASS sites and all sites in SelectVariants. SelectVariants by default keeps all sites, but if you would like to compare only PASS sites, you can use [`--excludeFiltered`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--excludeFiltered).
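As a plain-Python illustration of what PASS-only selection amounts to (just a sketch, not what SelectVariants does internally):

```python
# Keep VCF header lines plus records whose FILTER column (7th field) is PASS.
def pass_only(vcf_lines):
    return [line for line in vcf_lines
            if line.startswith("#") or line.split("\t")[6] == "PASS"]

records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=5.10",
    "chr1\t200\t.\tC\tT\t30\tVQSRTrancheSNP99.90to100.00\tVQSLOD=-1.20",
]
print(len(pass_only(records)))  # header line + one PASS record = 2
```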
-Sheila
From mscjulia on 2017-05-23
@Sheila
Thank you for your reply. You mentioned that if one sample has bad annotations and all the others are good, the site won't fail; but what about when one sample has good annotations and all the others are bad?
The "PASS" is for each site, not for each sample at the site – is this understanding correct? (In other words, if I want to look at one sample at that site, the "PASS", which represents all 35 samples at the site, would not mean much.)
Thank you so much!
From SteveL on 2017-05-23
Dear @mscjulia ,
If I understand VQSR correctly, what it does is tell you how good the variant calls in your dataset are at a particular POSITION, but not how good specific calls are within particular samples at that position.
It does this by:
1) comparing positions in your dataset to "known" polymorphic positions – for all these known positions, it looks at the characteristics of the specific calls in your dataset for these positions, which should be your "good" positions.
2) comparing the calls in the rest of your positions to these “good” positions
3) if the call characteristics in 2 and 1 look more or less the same, the position will likely be given a “PASS”
4) as the call characteristics differ more and more between 1 and 2 above, then those in 2 will be put into different “tranches”, representing reduced confidence that we can make reliable calls at that position.
Thus the "PASS" is position-level information – it means that, generally, that position is reliably callable in your dataset. This does not mean that every call at that position will be correct – this is why you have call-specific DP and GQ etc. If one sample was particularly badly covered relative to the others at a particular position, then the position is still likely to be a PASS, but the specific call, say a heterozygote with an AD of 1,4, should be taken with a big pinch of salt. Likewise, at a position that is not "PASS", not every call will necessarily be unreliable, though sometimes it might be – e.g. if ALL your samples are heterozygous at a particular position, this likely indicates an underlying mappability issue.
In your example of one good, and all the others bad, then it is likely that the position will not be a PASS. As ever here there is a trade off between sensitivity and specificity and there is no perfect answer.
In your example of a PASS site with 35 samples, this indicates that most of the calls are likely to be reliable, but again, if one sample only has a DP of 5 and all the others had a DP ~30, then I would wonder what was going on there.
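To make that per-call caution concrete, here is a small sketch (illustrative only; the FORMAT keys follow the usual VCF conventions and the cutoffs are arbitrary examples) that flags individual genotypes worth double-checking even at a PASS site:

```python
# A site can be PASS while individual genotype calls at it remain shaky.
# Flag per-sample calls whose depth (DP) or genotype quality (GQ) is low.
def shaky_calls(format_keys, sample_fields, min_dp=10, min_gq=30):
    """Return indices of samples whose DP or GQ falls below the cutoffs."""
    keys = format_keys.split(":")
    flagged = []
    for i, sample in enumerate(sample_fields):
        values = dict(zip(keys, sample.split(":")))
        if int(values.get("DP", 0)) < min_dp or int(values.get("GQ", 0)) < min_gq:
            flagged.append(i)
    return flagged

# PASS site with three samples; the third has DP=5 and GQ=12.
fmt = "GT:AD:DP:GQ"
samples = ["0/1:15,14:29:99", "1/1:0,31:31:93", "0/1:1,4:5:12"]
print(shaky_calls(fmt, samples))  # -> [2]
```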
Hope this helps a little.
From mscjulia on 2017-05-24
@SteveL Thank you so much for the detailed explanation. It makes perfect sense – thank you!
From zhaoqiang90 on 2017-07-21
Hi,@Geraldine_VdAuwera
If I want to use GATK VQSR to get a good result, does that mean I must use the Broad bundle data to train on my variants? I have 12 exome samples, and I wonder whether I should use hard-filtering to get the final result or conduct VQSR on my VCF file (for which, per your recommendation, I should add another 18 exome samples to get a reliable result). Which is better, in your opinion?
From shlee on 2017-07-21
Hi @zhaoqiang90,
You can use any known sites resource that contains confident sites of variation for your organism. According to [Article#1259](https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr), for exome capture data, you will want a minimum of 30 samples. The document also suggests two alternative approaches for those with fewer samples: (i) add more samples or (ii) perform VQSR with `--maxGaussians 4`.
From zhaoqiang90 on 2017-07-25
Hi, @shlee
If I pad my original 12-exome dataset with 1000 Genomes samples up to 30, does the padded exome data need to match my samples in pathology or disease? I have exome data from 12 scurvy patients, but I added another 18 normal exomes. Will the unrelated dataset introduce errors (for instance, false positives) into my original variants? And if I want to look at specific SNPs or indels within my small dataset, should I conduct joint genotyping with these 12 samples together or separately?
From shlee on 2017-07-25
Hi @zhaoqiang90,
It’s news to me there is a genetic component to scurvy.
For the purposes of VQSR, it is okay that the additional exome data does not match the disease of your cohort. However, there are other considerations in padding, and Geraldine discusses them in part, e.g. in these two threads:
- https://gatkforums.broadinstitute.org/gatk/discussion/3225/i-am-unable-to-use-vqsr-recalibration-to-filter-variants
- https://gatkforums.broadinstitute.org/gatk/discussion/8406/minimising-batch-effects-in-variantrecalibration
Also, if it were me, I'd try multiple different approaches to filtering to see what gave the best results: hard-filtering, VQSR with just my 12 samples, and VQSR with padding. Do you have orthogonal validation/truth data for your samples, e.g. SNP array data? Comparing against these could help your analysis.
Relatedly, we have a hands-on tutorial that we give at workshops that outlines some considerations in hard-filtering and shows two approaches to assessing filtering results against a truthset. It is the Variant Filtering Tutorial and you can find a link to the worksheet [here](https://software.broadinstitute.org/gatk/blog?id=9044). I developed and wrote the bulk of the content of this version of the tutorial, so let me know if anything is unclear.
From robertb on 2017-08-13
Hi,
I ran a VQSR, genotype refinement, variant filtration, etc., workflow for my WGS data but found that I had too many false positives. My VQSR command and variant filtration commands are below. For those sites that actually passed filtration I also only took genotypes with GQ > 35 and DP > 10. But that still was not enough. I was wondering if I had done something wrong in the VQSR step (below). I made sure to run VQSR on all the chromosomes simultaneously. I think the problem is still at the site level in terms of QUAL and QD, but I didn't think I needed to hard filter on those if I was using VQSR. I mean, if VQSR says there's a variant there, then that should be good enough, right? So maybe I should have made the VQSR options more stringent? I'm not sure. If someone could recommend some good filtering options to complement VQSR, I'd appreciate it. Or if there are some problems with my VQSR command, I'd like to hear about it. Thanks. - Robert
VQSR
/usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R ucsc.hg19.fasta \
-input GATKVQSR.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /hapmap_3.3.hg19.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
-an DP \
-an QD \
-an MQ \
-an FS \
-an SOR \
-an MQRankSum \
-an ReadPosRankSum \
-an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile GATK_VQSR_recalibrate_SNP.recal \
-tranchesFile GATK_VQSR_recalibrate_SNP.tranches \
-rscriptFile GATK_VQSR_recalibrate_SNP_plots.R
FILTER GENOTYPES
#!/bin/bash
/usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ucsc.hg19.fasta \
-V GATK_recal_variants_geno.vcf \
-G_filter "GQ < 30.0" \
-G_filterName lowGQ \
-G_filter "DP < 10" \
-G_filterName lowDP \
-o GATK_ALL_FILT.vcf
From Sheila on 2017-08-21
@robertb
Hi,
How do you know there are a lot of false positives left after VQSR?
You should not have to do any extra filtering after VQSR. You can try setting a more stringent filter level if you want to remove more potential false positives. Have you tried different filter levels? Do you see fewer false positives with a more stringent filter level?
-Sheila
From Katie on 2018-01-09
Hi – Thank you for a fantastic resource! Earlier in the thread, it’s mentioned that the resource SNP sets are only used for SNP locations along the genome and annotations are not used. If that’s the case, then would it be possible to use a BED file or list of SNP positions as an input? I’m working with a non-model organism and am trying to develop an approach to use VQSR based on known SNP differences between 2 published genomes.
Thanks for your help!
All the best,
From Sheila on 2018-01-10
@Katie
Hi,
I don’t think you can do that. The tool requires a VCF or BCF as input for resource files. However, if you do some googling, you can find some ways to convert a bed file to VCF which may work :smile:
-Sheila
From Katie on 2018-01-11
Got it – thanks for the help! It looks like there is a bedr R package that could help.
From Katie on 2018-01-19
Hi,
I'm running VQSR and about 10% of my SNP calls get a NaN VQSLOD score. This makes it difficult to use the score as a threshold for retaining or discarding variants. Do all of the variant annotations have to have numeric values for VQSLOD to be calculated (i.e. some of the MQRankSum and ReadPosRankSum values are NaN)? If so, how do you recommend dealing with NaN values in some of the annotations of interest? Thank you!
My call is below:
gatk --java-options '-Xmx10g' VariantRecalibrator \
-R ${REF_DIR}${ref} \
--variant ${VARS_DIR}${vcf} \
-resource truePos,known=false,training=true,truth=true,prior=15.0:/ifs/labs/andrews/walter/varcal/results/sims/old/nucmer.vcf \
-an DP \
-an QD \
-an MQRankSum \
-an ReadPosRankSum \
-an FS \
-an SOR \
-an MQ \
-mode SNP \
-tranche 100.0 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
--output ${vcf%_*}".recal" \
--tranches-file ${vcf%_*}".tranches" \
--rscript-file ${vcf%_*}".plots.R" \
--max-gaussians 4 \
--mq-cap-for-logit-jitter-transform $mqcap
From Geraldine_VdAuwera on 2018-01-19
Getting NaNs for VQSLOD is not expected. You should have proper numeric values for all sites that have been put through recalibration. Did you run on both SNPs and indels, in series?
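If it helps with troubleshooting, the affected records can be listed with a quick sketch like this (plain Python, assuming VQSLOD appears as a key=value pair in the INFO column):

```python
import math

# List (CHROM, POS) for VCF records whose VQSLOD annotation is NaN.
def nan_vqslod_sites(vcf_lines):
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if "VQSLOD" in info and math.isnan(float(info["VQSLOD"])):
            hits.append((fields[0], fields[1]))
    return hits

records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=5.10;culprit=MQ",
    "chr1\t200\t.\tC\tT\t30\t.\tVQSLOD=NaN;culprit=MQRankSum",
]
print(nan_vqslod_sites(records))  # -> [('chr1', '200')]
```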
From Katie on 2018-01-19
Okay great, I didn’t think so. I am only running for SNPs as I have already used gatk SelectVariants to select for only SNPs.
I have SnpEff annotations on my VCF and am wondering whether that poses a problem? Or whether the problem is due to a small SNP set (~1000 SNPs along a 4.4 Mb genome).
I am trying to develop a pipeline to use VQSR for non-model organisms, so for my true positive VCF, I am using a VCF file of true positive variants developed when comparing 2 ref. genomes (i.e. it doesn’t include annotations, only SNP location information). I validated the true positives VCF with gatk ValidateVariants.
Any advice would be great, thank you!
From Katie on 2018-01-19
I am attaching my vcf file (bwa_samtools_all.vcf.gz) as well as my truth sites file (nucmer.vcf.gz) in case they will be helpful for diagnosing why VQSR is calculating NaNs for the VQSLOD score. Thank you so much for your help on this!
From Sheila on 2018-01-23
@Katie
Hi,
I think the issue may be related to a small dataset. We usually recommend running VQSR on at least 30 exomes or 1 whole genome. You may consider hard filtering. We have some helpful documents in the tutorials and methods and algorithms sections.
Keep in mind we have a new tool to replace VQSR coming out [soon](https://software.broadinstitute.org/gatk/blog?id=10996) which may not limit your work so much :smile:
-Sheila
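For anyone landing on this thread with a similarly small dataset: the hard-filtering route mentioned above can be sketched with GATK’s VariantFiltration tool. The cutoffs below follow the generic SNP thresholds commonly given in the GATK docs, and the file names are hypothetical; treat this as a starting point to tune, not a definitive recipe.

```shell
# Hard-filter SNPs with fixed annotation thresholds instead of VQSR
# (cutoffs follow the generic GATK SNP recommendations; adjust per dataset)
gatk VariantFiltration \
    -V snps.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O snps_hardfiltered.vcf.gz
```

Failing variants are marked in the FILTER column with the given filter names rather than removed, so they can still be inspected downstream.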
From Katie on 2018-01-24
@Sheila,
Thank you for letting me know. Looking forward to the new tool. I experimented a bit and can make this work using the max-gaussians argument. All the best,
From Katie on 2018-01-25
Sorry – I take that back: the problem was not due to too many Gaussians in the mixture model. Instead, I’d been using the mq-cap-for-logit-jitter-transform argument, which was somehow introducing issues.
Just wanted to update in case anyone else is using this for non-model organisms. Cheers!
From timh on 2018-03-01
Hi. Is there a way of telling how many Gaussians have been identified by VQSR?
From Geraldine_VdAuwera on 2018-03-02
@timh In recent versions there is an option to output the details of the model that was produced ([`--output-model`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_vqsr_VariantRecalibrator.php#--output-model)), which I believe should include that information.
From timh on 2018-03-02
@Geraldine_VdAuwera It does, thank you.
From Arynio on 2018-06-14
Hi,
Dear GATK team. I’m just learning about VQSR and its advantages over hard filters. I’m currently working with ~8x whole-genome data from a non-model species for which we have a reference genome, and I was wondering how to perform VQSR. Would it make sense to use called SNPs from a reduced dataset at ~30x as the highly validated “true” input? Or would that carry way too much noise?
Thanks in advance for your advice, and also for all your great documentation that has already helped me many times before.
From Sheila on 2018-06-20
@Arynio
Hi,
I think [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/25693) will help.
-Sheila
From Arynio on 2018-06-28
@Sheila
Thank you very much!
From liusiyu93 on 2018-07-11
Hi,
I am wondering what the GATK4 commands are to apply recalibration results to sequencing data? The `-BQSR` argument from GATK3 is not recognized in GATK4.
Also, I am having a really hard time finding a webpage with step-by-step command lines for GATK4 (germline SNPs/indels with exome-seq data). Any recommendations would be much appreciated.
Thanks!
Best,
Siyu
From Sheila on 2018-07-18
@liusiyu93
Hi Siyu,
Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/43986#Comment_43986) and [the Best Practices](https://software.broadinstitute.org/gatk/best-practices/).
You may also find the WDLs provided [here](https://github.com/gatk-workflows) helpful.
-Sheila
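For reference, GATK4 replaces GATK3’s `-BQSR` engine argument with an explicit two-step pattern: BaseRecalibrator to build the recalibration table, then ApplyBQSR to write a recalibrated BAM. A minimal sketch, with file names hypothetical:

```shell
# Step 1: build the recalibration table from known variant sites
gatk BaseRecalibrator \
    -I input.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O recal_data.table

# Step 2: apply the recalibration table to produce a new BAM
gatk ApplyBQSR \
    -I input.bam \
    -R reference.fasta \
    --bqsr-recal-file recal_data.table \
    -O recalibrated.bam
```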
From liuqi on 2018-07-19
Hello,
Dear GATK team, I used VQSR to filter variants, but the recalibrated_snps.tranches.pdf looks very strange; see the attachment.

Could you tell me why this occurs?
Thank you very much!
Best,
Qi Liu
From Sheila on 2018-07-27
@liuqi
Hi Qi,
Can you tell us which version of GATK you used and what kind of data you are working with?
Thanks,
Sheila
From liuqi on 2018-07-30
@Sheila
Hello, Sheila,
Thank you for your reply. Sorry for the late reply.
I used GATK 3.5; my data is human population data sequenced at >30X.
Because I used GATK 3.5 to call SNPs in my first dataset, I used the same version to call SNPs in my second dataset.
My first dataset’s plot looks normal, but the second one is very strange.


Do I need use new version to do VQSR or calling?
Thank you very much!
Best,
Qi Liu
From melisakman on 2018-07-31
Hi,
I am doing VQSR on a non-model organism (SNPs called by GATK and verified by samtools mpileup or freebayes). I have 70+ samples sequenced to 5-10X (WGS). I created a training set from sites concordant between mpileup, freebayes, and HaplotypeCaller, with stringent filtering parameters.

For VQSR, I used different max-gaussians settings (4, 6, and the default 8). The plots generated look very different from each other, and I was wondering if there is any resource that would help guide choosing this setting from the plots. My G8 plots show 2 clusters with high LOD scores, mostly separated by depth, and the tranches plot also looks a bit odd; G4 and G6 give a single cluster.

I am aiming for high specificity and leaning towards G4 (just because this was suggested here) with tranche 90 for further analyses. Do you think it would still be better to use G6 or G8? Thanks a lot for your insights!
Best,
Melis
G4_unfiltered_all.tranches.pdf
G8_unfiltered_all.tranches.pdf
G6_unfiltered_all.tranches.pdf
From Sheila on 2018-07-31
@liuqi
Hi Qi,
Yes, it may be best to try using GATK3 latest version. You can just test the VQSR step with it.
-Sheila
From Sheila on 2018-08-06
@melisakman
Hi Melis,
I will need to take some time and consult my team before getting back to you. If I don’t respond by the end of the week, please post again.
-Sheila
From liuqi on 2018-09-12
@Sheila
Hello Sheila,
I have tried using GATK 3.8-1 to test the VQSR step, but I get the same strange result.
How can I resolve this problem?
I used the same arguments as for my first dataset.
Best,
Qi Liu
> Sheila said:
> @liuqi
> Hi Qi,
>
> Yes, it may be best to try using GATK3 latest version. You can just test the VQSR step with it.
>
> -Sheila
From MehulS on 2019-01-18
Hi, I plan to generate subsetted VCFs containing variants in only about 90 genes of interest. Since I need to reduce runtime, I’m running HaplotypeCaller on intervals using a .list file.
What parameters are recommended for using VQSR on interval-delimited data? I can include more genes and expand the intervals, but I want to avoid calling on the whole genome because HaplotypeCaller takes 6 hours.
I have ~100 low-coverage Illumina paired-end WGS samples (5-10X), GATK 4.0.12.
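For reference, the interval-restricted calling described above uses HaplotypeCaller’s `-L` argument, which accepts an intervals file. A minimal sketch in GVCF mode, with file names hypothetical:

```shell
# Call variants only within the listed gene intervals
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L genes_of_interest.list \
    -ERC GVCF \
    -O sample.g.vcf.gz
```

Note that VQSR’s Gaussian mixture model is trained on whatever variants are in the input, so a small interval-restricted callset may not give it enough data to fit a stable model, as discussed earlier in this thread.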