created by Sheila
on 2014-11-24
There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.
The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample’s GT.
- QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
- GQ tells you how confident we are that the genotype we assigned to a particular sample is correct. In practice it is simply the second-lowest PL, because GQ is the difference between the second-lowest PL and the lowest PL, and the lowest PL is always 0.
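As a rough illustration of the GQ definition above, here is a minimal Python sketch. The function name `gq_from_pls`, the `cap` parameter, and the PL values are hypothetical and chosen only for illustration; the cap of 99 reflects the fact that GATK reports GQ values no higher than 99. This is not GATK's actual code.

```python
def gq_from_pls(pls, cap=99):
    """Return (index of the called genotype, GQ) for one sample's PLs.

    For a biallelic site the PL order is 0/0, 0/1, 1/1.
    GQ is the second-lowest PL minus the lowest PL (the lowest is normally 0).
    """
    ordered = sorted(pls)
    best_index = pls.index(ordered[0])   # most likely genotype has the lowest PL
    gq = ordered[1] - ordered[0]         # gap to the next most likely genotype
    return best_index, min(gq, cap)      # assumption: GQ is capped (99 in GATK output)


# Hypothetical PL triplets, not taken from a real callset.
print(gq_from_pls([30, 0, 45]))     # het call, modest confidence
print(gq_from_pls([180, 0, 255]))   # het call, GQ hits the cap
```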
QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you’re going to be looking exclusively at site-level annotations like QD, because at that point what you’re looking for is evidence of variation overall. That way you don’t rely too much on individual sample calls, which are less robust.
In fact, many cohort studies don’t even really care about individual genotype assignments, so they only use site annotations for their entire analysis.
Conversely, QUAL may seem redundant if you have only one sample. If that sample has a good GQ (and more importantly, well-separated PLs), then admittedly you don’t really need to look at the QUAL: you know what you have. If the GQ is not good, you can typically rely on the PLs to tell you whether you probably do have a variant and we’re just not sure if it’s het or hom-var. If hom-ref is also a possibility, the call may be a false positive.
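The single-sample reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interpret_pls`, the `min_gq` cutoff of 20, and the example PLs are hypothetical, and the sketch handles biallelic sites only.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # PL order for a biallelic site


def interpret_pls(pls, min_gq=20):
    """Describe what one sample's PLs say about a biallelic call.

    min_gq is an illustrative cutoff, not a recommended value.
    """
    ranked = sorted(range(3), key=lambda i: pls[i])
    best, runner_up = ranked[0], ranked[1]
    gq = pls[runner_up] - pls[best]
    if gq >= min_gq:
        return f"confident {GENOTYPES[best]}"
    if 0 in (best, runner_up):
        # hom-ref is one of the two competing genotypes
        return "hom-ref still plausible: potential false positive"
    return "variant likely, het vs hom-var unresolved"


print(interpret_pls([255, 0, 255]))  # well-separated PLs
print(interpret_pls([200, 0, 10]))   # variant, but zygosity unclear
print(interpret_pls([8, 0, 150]))    # hom-ref cannot be ruled out
```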
That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That’s the workflow we recommend, based on years of experience doing this at fairly large scales…
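To make the ordering concrete, here is a minimal Python sketch of "site-level first, genotype-level second". It operates on plain dictionaries rather than real VCF records, and the names `filter_callset`, `min_qd`, and `min_gq`, along with the threshold values and example data, are assumptions for illustration only. In a real analysis you would use dedicated tools and thresholds chosen for your own data.

```python
def filter_callset(sites, min_qd=2.0, min_gq=20):
    """Two-stage filter: drop weak sites, then mask weak genotypes.

    Each site is a dict with a site-level "QD" value and a "genotypes"
    mapping of sample name -> (GT, GQ). Thresholds are illustrative.
    """
    kept = []
    for site in sites:
        # Stage 1: site-level filter on the evidence for variation overall.
        if site["QD"] < min_qd:
            continue
        # Stage 2: genotype-level refinement; low-GQ calls become no-calls.
        genotypes = {
            sample: (gt if gq >= min_gq else "./.", gq)
            for sample, (gt, gq) in site["genotypes"].items()
        }
        kept.append({**site, "genotypes": genotypes})
    return kept


# Hypothetical two-site, two-sample callset.
callset = [
    {"pos": 100, "QD": 1.2, "genotypes": {"s1": ("0/1", 99), "s2": ("0/0", 40)}},
    {"pos": 200, "QD": 15.0, "genotypes": {"s1": ("0/1", 99), "s2": ("1/1", 6)}},
]
for site in filter_callset(callset):
    print(site["pos"], site["genotypes"])
```

Note that the first site is removed even though both of its genotypes look confident, which is the point of filtering on site-level evidence before looking at individual samples.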
Updated on 2014-11-25
From sdsmith on 2015-11-12
Are you able to help me understand a little better what QD is telling me? I understand it is the ratio of the QUAL to the AD, but what is that number saying? How can I use it to determine what my threshold should be for PASS/FAIL in my filter?
Thanks,
SS
From Sheila on 2015-11-24
@sdsmith
Hi SS,
We have some basic recommendations for hard filtering here: https://www.broadinstitute.org/gatk/guide/article?id=2806 However, it will be up to you to analyze your data and determine what cutoffs to use.
-Sheila
From Geraldine_VdAuwera on 2015-11-27
@sdsmith After some discussion we realized that it can be difficult to understand the meaning of the annotation threshold values used for filtering, so Sheila is going to start a project to document this in a lot more detail. This will happen over the next few weeks.
From nkobmoo on 2016-03-28
Hi,
I’m really interested in detailed documentation on SNP annotation thresholds for filtering. If such a document exists, could you please point us to it?
Thank you very much in advance.
From Sheila on 2016-03-29
@nkobmoo
Hi,
There is [this document](https://www.broadinstitute.org/gatk/guide/article?id=6925) that should help. I am working on adding some more explanations, but it should be a good place to start.
-Sheila
From SWATI on 2016-06-21
Hello Sheila,
As I understand, QUAL is a representation of accuracy of genotyping. But what does a ‘.’ represent under the QUAL column in a VCF file? I do not have any numeric value for Phred-scaled score for assertion of ALT allele in the entire column.
What does this mean for filtering low quality SNPs or genotypes?
Thanks
Swati
From SWATI on 2016-06-21
Dear Sheila,
This is an example of my filtered recode VCF file:
ETC6390828 41 S1612208905 G T . PASS .;DP=119 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0
ETC6410100 69 S1614033230 A G . PASS .;DP=2833 GT:AD:DP:GQ:PL 0/1:19,19:38:100:255,0,255 0/1:5,10:15:99:255,0,135
ETC6742648 84 S1647090026 C T . PASS .;DP=8447 GT:AD:DP:GQ:PL 0/1:48,9:57:99:152,0,255 0/1:28,4:32:99:48,0,255
From your previous discussion (http://gatkforums.broadinstitute.org/gatk/discussion/4688/qual-is-a-dot-and-filter-is-pass-in-vcf), I understood that:
1. The sites with ./. genotypes are no-call sites, [...]. A no-call site means there was not enough information to make a genotype call. You can tell a no-call site because there is no QUAL and no genotype (GT).
2. The term 'PASS' was added during a subsequent filtering step (the file is named filtered.recode.vcf) by the genomics facility provider. They used MAF (>0.01) and missing data per site (<90%) as filtering options. This is confirmed as I do not see any ##FILTER information in the VCF file header.
But I'm not sure what it means when I have a genotype and a 'dot' for QUAL.
Thank you.
From Sheila on 2016-06-21
@SWATI
Hi Swati,
Did you produce the VCF using GATK tools? If so, can you tell us the exact command you ran and what version of GATK you are using?
Thanks,
Sheila
From SWATI on 2016-06-22
Dear Sheila, Thank you for your reply. I got the VCF file from my GBS service provider who used their Tassel pipeline. The summary report says, "VCF is format for holding SNP information that retains information on depth of coverage for each allele, and can be output from the GBS pipeline by replacing the plugins ‘TagsToSNPByAlignmentPlugin’ and ‘MergeDuplicateSNPsPlugin’ with ‘tbt2vcfPlugin’ and ‘MergeDuplicateSNPvcfPlugin’. Genotype likelihood scores are calculated based on formula 3.8 of Etter et al 2013=1, and the most likely genotype is assigned. Genotype quality (GQ) score is calculated to the GATK version documented here: http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk ."
I think these could be the commands used to generate the VCF:
Memory Settings: -Xms512m -Xmx64G
Tassel Pipeline Arguments: -fork1 -MergeDuplicateSNPvcfPlugin -i /workdir/qisun/working/qs105/VCF/MERGETBT.c1 -o /workdir/qisun/working/qs105/VCF/1.vcf -ak 3 -endPlugin -runfork1
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Version: 3.0.165 Date: January 16, 2014
& The VCF header reads like:
P.S. I am a biologist and still trying to learn bioinformatics. I am afraid I may not be familiar with very technical terms and the command line. Hence, the long post.
Thanks
From Geraldine_VdAuwera on 2016-06-22
@SWATI, if these files were produced by a caller that is not part of GATK we can’t help you. You should ask the provider for help. Good luck.
From manasakg16 on 2017-05-25
Hi
Can you please provide a link that explains the Genotype Quality (GQ) score and the related commands?
Thanks in advance
From Sheila on 2017-05-26
@manasakg16
Hi,
[This article](https://software.broadinstitute.org/gatk/documentation/article?id=5913) should help.
-Sheila