created by Geraldine_VdAuwera
on 2014-10-17
This document describes the purpose and general principles of the Genotype Refinement workflow. For the mathematical details of the methods involved, please see the Genotype Refinement math documentation. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial.
The core GATK Best Practices workflow has historically focused on variant discovery -- that is, the existence of genomic variants in one or more samples in a cohort -- and consistently delivers high-quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter out genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes.
A few commonly asked questions are addressed below.
While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.
If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details).
After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below), GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before.
Begin with recalibrated variants from VQSR at the end of the best practices pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow.
Tool used: CalculateGenotypePosteriors
Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals.
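As an illustrative sketch (not GATK's exact implementation, which is described in the math documentation), the Bayesian update from PLs and genotype priors to posterior probabilities looks like this; the PL and prior values are hypothetical:

```python
def phred_to_prob(phred):
    """Convert a Phred-scaled value into a probability in [0, 1]."""
    return 10 ** (-phred / 10)

def posteriors_from_pls(pls, priors):
    """Combine Phred-scaled genotype likelihoods (PLs) with genotype
    priors via Bayes' rule, returning normalized posterior
    probabilities for (HomRef, Het, HomVar)."""
    unnorm = [phred_to_prob(pl) * prior for pl, prior in zip(pls, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Example: a marginal HomRef call (PL = 0,6,30) combined with
# hypothetical priors favoring Het is pulled toward a Het posterior.
post = posteriors_from_pls([0, 6, 30], [0.1, 0.8, 0.1])
```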
SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability that a sample being analyzed is also homozygous variant is very high.
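As a simplified illustration of how allele frequencies translate into genotype priors, Hardy-Weinberg proportions can be computed from an alternate allele frequency; this is a stand-in sketch, not the exact prior calculation (which is in the math documentation):

```python
def hwe_priors(alt_af):
    """Hardy-Weinberg genotype priors (HomRef, Het, HomVar) derived
    from an alternate allele frequency -- a simplified stand-in for
    priors built from 1000 Genomes allele counts."""
    p = alt_af
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

# At a site where 95% of reference-panel chromosomes carry the ALT
# allele, the prior strongly favors a HomVar genotype.
priors = hwe_priors(0.95)
```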
For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.
Tool used: VariantFiltration
After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ 20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging those low-quality genotypes indicates to researchers that these genotypes may not be suitable for downstream analysis. However, as with VQSR, a filter tag is applied but the data is not removed from the VCF.
Tool used: VariantAnnotator
Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion.
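The thresholds above can be sketched as a small classifier. Two points are assumptions of this sketch rather than the PossibleDeNovo source: "whichever is more stringent" is read conservatively as requiring both the AC and AF criteria, and a parent GQ > 0 requirement is assumed for the low-confidence tier (as described later in this thread):

```python
def denovo_confidence(child_gq, mother_gq, father_gq, ac, af):
    """Classify a candidate de novo call using the thresholds described
    above. Illustrative sketch only, not the PossibleDeNovo source."""
    # "AC < 4 or AF < 0.1%, whichever is more stringent": this sketch
    # conservatively requires both (an assumption).
    if not (ac < 4 and af < 0.001):
        return None
    # High confidence: all trio sample GQs >= 20
    if min(child_gq, mother_gq, father_gq) >= 20:
        return "hiConfDeNovo"
    # Low confidence: child GQ >= 10 (parent GQ > 0 assumed here)
    if child_gq >= 10 and min(mother_gq, father_gq) > 0:
        return "loConfDeNovo"
    return None
```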
Tool options: SnpEff or Oncotator (both are non-GATK tools)
Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. The GATK currently does not feature integration with any functional annotation tool, but SnpEff and Oncotator are useful utilities that can work with the GATK's VCF output.
The Genotype Refinement Pipeline adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed.
New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset.
New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better calibrated estimate of genotype probabilities than the PLs and are recommended for use in further analyses instead of the PLs.
Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs.
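Concretely, that calculation is the gap between the best (smallest) and second-best Phred-scaled values in the vector, conventionally capped at 99; here is a minimal sketch:

```python
def gq_from_phred(phred_vector):
    """Genotype quality from a Phred-scaled vector (PLs or PPs): the
    difference between the two smallest values, capped at 99 per
    VCF convention."""
    s = sorted(phred_vector)
    return min(s[1] - s[0], 99)

# The child's posterior PP vector in the worked example below is
# 49,0,287, which yields the reported GQ of 49.
gq = gq_from_phred([49, 0, 287])
```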
New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as:
$$ -10*\log ( 1-GL_{mother}[\text{Posterior mother GT}] * GL_{father}[\text{Posterior father GT}] * GL_{child}[\text{Posterior child GT}] ) $$
where the GLs are the genotype likelihoods in [0, 1] probability space.
New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as:
$$ -10*\log (1-GP_{mother}[\text{Posterior mother GT}] * GP_{father}[\text{Posterior father GT}] * GP_{child}[\text{Posterior child GT}] )$$
where the GPs are the genotype posteriors in [0, 1] probability space.
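Both JL and JP can be sketched with the same helper, passing PL vectors for JL and PP vectors for JP. Normalizing each Phred vector into probability space is a simplifying assumption of this sketch, so the results will not exactly reproduce GATK's output values:

```python
import math

def joint_phred(trio_vectors, trio_gts):
    """Phred-scaled probability that at least one of the trio's
    posterior genotypes is wrong:
    -10*log10(1 - product of the per-sample genotype probabilities).
    Pass PL vectors for a JL-style value, PP vectors for JP-style.
    Normalizing each vector to sum to 1 is an assumption here."""
    prod = 1.0
    for vec, gt in zip(trio_vectors, trio_gts):
        probs = [10 ** (-v / 10) for v in vec]
        prod *= probs[gt] / sum(probs)
    return -10 * math.log10(1 - prod)

# JL-style input: the "Before" PLs from the example below, evaluated
# at the posterior genotypes (Het child, HomRef mother, HomVar father).
jl = joint_phred([[0, 0, 249], [0, 24, 360], [889, 60, 0]], [1, 0, 2])
```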
New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses.
New INFO field annotations (hiConfDeNovo and loConfDeNovo) flag sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately.
Before:
1 1226231 rs13306638 G A 167563.16 PASS AC=2;AF=0.333;AN=6;… GT:AD:DP:GQ:PL 0/0:11,0:11:0:0,0,249 0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0
After:
1 1226231 rs13306638 G A 167563.16 PASS AC=3;AF=0.500;AN=6;…PG=0,8,22;… GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287 0/0:10,0:10:32:2:24:0,24,360:0,32,439 1/1:0,18:18:43:2:24:889,60,0:867,43,0
The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child.
The joint likelihood for this trio at this site is 2, indicating that the genotype for one of the samples may have been changed. Specifically, a low JL indicates that the posterior genotype for at least one of the samples was not the most likely genotype as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.)
The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio.
If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant.
Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors.
The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10^-6). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low-GQ HomRef child, and a GQ60 HomRef child plus one GQ60 HomRef parent can improve the GQ of the other parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case.
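The Mendelian part of the family prior can be sketched as follows; the exact per-configuration treatment in CalculateGenotypePosteriors is given in the math documentation, so this is only an illustration of how Mendelian-violation configurations get penalized by the de novo probability:

```python
def transmission_prob(parent_gt):
    """Probability a parent transmits the ALT allele, by genotype
    (0 = HomRef, 1 = Het, 2 = HomVar)."""
    return {0: 0.0, 1: 0.5, 2: 1.0}[parent_gt]

def trio_prior(mother_gt, father_gt, child_gt, mu=1e-6):
    """Prior on a trio configuration under Mendelian transmission.
    Configurations that violate Mendel's laws are penalized by the
    de novo mutation probability mu (simplified sketch)."""
    pm, pf = transmission_prob(mother_gt), transmission_prob(father_gt)
    # Probability the child carries child_gt copies of the ALT allele
    mendelian = {
        0: (1 - pm) * (1 - pf),
        1: pm * (1 - pf) + (1 - pm) * pf,
        2: pm * pf,
    }[child_gt]
    return mendelian if mendelian > 0 else mu
```

For instance, a HomRef x HomVar parent pair makes a Het child certain under Mendelian transmission, while a HomRef child from those parents is a violation and receives only the de novo penalty.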
Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.
20141018-genotypingWorkflowDiagram.jpg
Updated on 2015-07-22
From estif74 on 2014-11-21
Hi @Geraldine_VdAuwera, I’m trying out this new genotype refinement pipeline and looking at a trio. In the documentation above, you mentioned that high- and low-confidence de novo mutations should be annotated in the INFO field of the output VCF. But what is the annotation exactly? I’m trying to figure out how to filter my VCF file to look for these, but not exactly sure what to filter on. Thanks in advance for any help as always!
From Geraldine_VdAuwera on 2014-11-21
Hi @estif74,
The annotation module is [PossibleDeNovo](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_PossibleDeNovo.php) which annotates either “hiconfdenovo” or “loconfdenovo” to possible de novo mutations.
The formatting of the INFO fields in the doc got messed up (being interpreted as html code) which I need to fix; they’re supposed to show as:
loConfDeNovo=<"low confidence (GQ >= 10 for child, GQ > 0 for parents)", comma-delimited list of child samples>
hiConfDeNovo=<"high confidence (GQ >= 20 for all trio members)", comma-delimited list of child samples>
EDIT: capitalization
From estif74 on 2014-11-21
Ah never mind, I got it. For anyone else to whom this might be helpful, it's hiConfDeNovo= or loConfDeNovo=
From estif74 on 2014-11-21
Oh great, even better. thanks!
From EvoStevo on 2015-03-10
Is performing genotype refinement on samples called with the gVCF workflow likely to be productive or have these features been incorporated into GenotypeGVCFs?
From Geraldine_VdAuwera on 2015-03-10
Genotype refinement is an add-on that provides functionality not performed by GenotypeGVCFs.
From nchuang on 2015-06-28
If I wish to proceed with phasing and imputation, would you recommend going through genotype refinement first? It sounds obvious, but I'm curious why it's not part of the Best Practices and seems more optional?
From Sheila on 2015-07-01
@nchuang
Hi,
Yes, we do recommend genotype refinement for phasing and imputation. It is not technically part of our best practices workflow because the workflow has focused on variant discovery. The optional part is because we know it adds more work, and not everyone is interested in the exact genotype (they may simply be interested if a variant exists at a site). Have a look at this article for more information: http://gatkforums.broadinstitute.org/discussion/4723/genotype-refinement-workflow
-Sheila
From Kelly135 on 2016-02-01
Hi @Geraldine_VdAuwera, thanks for the detailed workflow!
I have a question regarding this article though: https://www.broadinstitute.org/gatk/guide/article?id=4726
In the calculation of family priors, under no MV, the probability is calculated as 1-10*u-2*u^2. I wonder where the number 10 and 2 come from.
Thanks for the answer in advance!
From Sheila on 2016-02-03
@Kelly135
Hi,
I think Geraldine has responded [here](http://gatkforums.broadinstitute.org/gatk/discussion/comment/27464#Comment_27464).
-Sheila
From Kelly135 on 2016-03-18
Thanks. One more question.
For this genotype refinement workflow, only step 1 and 2 are applicable to population data (unrelated individuals)?
And after step 2, how can I exclude genotypes annotated as “lowGQ”?
From Sheila on 2016-03-18
@Kelly135
Hi,
Do you want to remove the sites with LowGQ or simply set them to no-calls?
You should be able to use [SelectVariants](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php) to do either.
-Sheila
From sjin_lee09 on 2016-04-27
To whom it may concern,
I wanted to ask one question regarding calculating the population prior.
In the article https://www.broadinstitute.org/gatk/guide/article?id=4727, it is recommended to use the phase 3 dataset from thousand genomes for calculating the population prior as shown below.
java -jar GenomeAnalysisToolkit.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf
The samples that I am currently dealing with are exclusively East Asian. Would you recommend filtering the phase 3 VCF file for allele frequencies from the EAS population and using this filtered VCF file as input for Genotype Refinement?
Many Thanks,
SangJin Lee
From Geraldine_VdAuwera on 2016-04-27
@sjin_lee09 Yes you should definitely subset the frequency resource to contain only the AF information that is appropriate for your specific population.
From sjin_lee09 on 2016-04-28
Cheers Geraldine!
From Vasgene on 2016-05-25
Hello,
I have a family trio with exome-seq data for analysis and I would like to know if my approach below is correct:
I used the GATK best practice workflow for variant discovery (with the last step being VariantRecalibrator and ApplyRecalibration for SNPs and Indels separately). I ended up with a single VCF for the family trio (I didn’t filter out LowQual variants based on the FILTER column) and I then moved to the genotype refinement workflow, where I followed the three steps recommended (Derive posterior probabilities of genotypes, Filter low quality genotypes and Annotate possible de novo mutations). Would the next logical step be removing the LowQual variants based on the FILTER column?
Thanks
From Sheila on 2016-06-06
@Vasgene
Hi,
Sorry for the delay. To answer your question, it really depends on what you’re trying to do. You have to decide what you want to consider as possible variants, (rare variants vs shared variants), and base your filtering on that.
-Sheila
From Cathy on 2016-09-05
Hi everyone, I would like to know if it is correct to use the Genotype Refinement process in order to analyze a multi-species study of mouse?
I guess that I have to use the options --skipPopulationPriors and --skipFamilyPriors, because it is neither a population nor a family?
Thanks for this information.
Cathy
From Geraldine_VdAuwera on 2016-09-08
We have not tested this procedure on non-human data, and without more information on your experimental design, it's impossible to give you a definitive answer. My main worry is that I'm not sure how the multi-species population structure would affect results. Specifically, I would be concerned that if the group of samples do not constitute a population, nor families, then there is not really anything coherent for the tool to use as a basis for refining genotypes. If you can identify sub-populations of 10 samples or more that are coherent, you can run them through refinement separately. No need to provide either argument though -- family priors are automatically skipped if you don't provide a pedigree, and in this case I figure you do need the tool to look at population frequencies. But if you can't do that, then I would recommend just skipping this step entirely.
From CardiffBioinf on 2017-03-11
Is it acceptable to apply this step after hard-filtering instead of VQSR? Does it matter how big your input dataset is?
Thanks
Matt
From Sheila on 2017-03-12
@CardiffBioinf
Hi Matt,
Yes, you can apply this after hard-filtering instead of VQSR. It does not matter how big your dataset is as long as you have a good truth dataset or family information. However, if you do not have either of those it will be useful to have many samples in your dataset so those samples can help derive accurate priors.
-Sheila
From mglclinical on 2017-04-05
I do not have trio samples. All my samples are independent samples.
I have an exome bwa+HaplotypeCaller pipeline that follows your Best Practice Recommendations. Would this Genotype Refinement add additional benefit to my existing pipeline?
From mglclinical on 2017-04-07
Hi @Geraldine_VdAuwera ,
I will try to re-phrase my question in a more comprehensive way :
In our clinical lab, we have a bwa+HaplotypeCaller pipeline (Best Practice Recommendations) for exomes. The pipeline implements hard filtering and does not use VQSR, as our expected sample count is less than 30. We also do not do trios. I know that you recommend importing extra samples from the 1000 Genomes Project for padding the 30-sample count limit, and that this requires us to choose samples with the same platform (i.e. technology, capture, read length, depth) and similar ethnicity. We could select padding samples from 1000G with the same technology as we use in our lab, but we don't think we can select samples with similar ethnicity, as we can't predict the ethnicity of future patient samples that come to our lab. So we have taken the hard-filtering approach.
In the 1st step of this article, ["Run the genotype refinement workflow"](https://software.broadinstitute.org/gatk/documentation/article.php?id=4727), the workflow starts with an output of VQSR, 'recalibratedVariants.vcf'.
Since the ‘genotype refinement workflow’ requires the VQSR output as the first input to start with, does this mean we cannot use the ‘genotype refinement workflow’ ?
I will just re-phrase our situation :
1. Our pipeline processes only Human Samples
2. Our pipeline processes only Exomes
3. We process samples with a batch limit of less than 30
4. We do not process Trios
I am just trying to understand whether I can use this "Genotype Refinement workflow" in my pipeline.
Please advise.
Thanks,
mglclinical
From Geraldine_VdAuwera on 2017-04-07
@mglclinical For VQSR the technical profile of the samples is much more important than ethnicity, so you could put together a reference panel that covers a wide diversity of origins and have that work well, at least in principle better than hard-filtering.
That being said, if you choose to go ahead with hard filtering, you can absolutely run genotype refinement afterward. The input here is written as “recalibratedVariants.vcf” because this was excerpted from a more complete doc that covered our Best Practices pipeline, which applies VQSR for filtering; but it’s not a requirement. You could substitute a callset filtered with any other technique.
From mglclinical on 2017-04-10
Hi @Geraldine_VdAuwera,
Thank you for the clarification on the input file “recalibratedVariants.vcf”.
Follow-up questions. I apologize if my questions are dumb. I am reading your tutorials and trying to make sure that I understand what the Genotype Refinement workflow does and how it improves genotype calls:
1. We don’t do trios, and hence we don’t have family information, so I guess I can skip the Step 3 from the [“Run the genotype refinement workflow”](https://software.broadinstitute.org/gatk/documentation/article.php?id=4727) that is used for de novo mutations ?
2. I am sorry if I was living under a rock. Is “Genotype Refinement Workflow” currently part of the Best Practice recommendation, even for non-trio samples?
3. What exact sites/variants gets affected due to Genotype Refinement Workflow?
a. Does this only affect genotype calls of sites/variants with low confidence or low coverage ?
b. Does this not affect genotype calls with high confidence and high coverage, irrespective of what Population Priors (1000G_phase3_v4_20130502.sites.vcf.gz) say ?
c. For example: if there is a Het (heterozygous variant) at a specific genomic coordinate with high confidence and high coverage, and the same variant is found to be a homozygous variant in the population priors, does that original heterozygous variant get changed to a homozygous variant in the resulting VCF file (recalibratedVariants.postCGP.Gfiltered.vcf)?
4. You have recommended that @sjin_lee09 subset the Population Priors file (1000G_phase3_v4_20130502.sites.vcf.gz) to a specific ethnicity that matches the samples being processed. Does this mean that the Population Priors file should not be used as is, and that we should subset it to the relevant ethnic population matching the patient samples?
Thanks,
mglclinical
From Geraldine_VdAuwera on 2017-04-10
Hah, please don’t feel bad. Our docs around this are still a bit more barebones than I’d like.
1. Yep.
2. More like an optional “good to do if you can” but not considered a requirement. Let me put it this way: we currently don’t run it by default in our production pipeline.
3. Only affects low-confidence genotypes, where we have information that can correct/confirm them. There are a few examples in the slide deck on this step from the last workshop (see Presentations page for link to google drive where they live).
4. Right, the results will be better if you use an ethnically appropriate subset. The tools themselves don’t have any logic to recognize what cohort would be appropriate (though this could be added in future).
From mglclinical on 2017-04-12
@Geraldine_VdAuwera ,
Thank you for your answers to 1,3, and 4.
For 2, I am just curious about why ‘Genotype Refinement Workflow’ is not run by default in your production pipeline ?
thanks,
mglclinical
From Geraldine_VdAuwera on 2017-04-12
I’m not sure but I expect it’s because it’s more project dependent (the controlling logic is harder to boiler-plate) and therefore left up to the receiving researcher to run or not in whatever way is appropriate for their project.
From wangsheng on 2017-12-04
Hello everyone,
I'm wondering how the genotype refinement workflow will behave if there are multiple alternative alleles in my VCF files. Our data was combined from hundreds of trio families, with records like this:
19 49965227 . G T,A 4050.34 PASS
We are using VCFv4.2.
Thanks,
Sheng
From Sheila on 2017-12-07
@wangsheng
Hi Sheng,
Hmm. I am thinking the tools only deal with biallelic sites, but I could be wrong. Can you post a before and after record of a multiallelic site?
Thanks,
Sheila
From alphahmed on 2018-01-23
I'm trying to run the refinement pipeline using GATK4, but I couldn't find VariantAnnotator as part of the 4.0.0.0 release.
Is the PossibleDeNovo annotation part of another tool now?
The only other tool I could find is FindMendelianViolations; is this the only alternative?
As always, thank you very much for the help!
From Sheila on 2018-01-25
@alphahmed
Hi,
VariantAnnotator is being ported over for the next point release. It should be coming out soon. You can keep track of the issue [here](https://github.com/broadinstitute/gatk/issues/51).
-Sheila
From alphahmed on 2018-01-26
Thank you @Sheila !
From Tiffany_at_Broad on 2018-09-03
Hi @Sheila :) are these caveats still true:
Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.
From Sheila on 2018-09-07
@Tiffany_at_Broad
Hi Tiffany :smile:
Yes, I think those still hold true.
-Sheila
From cri_cri on 2019-07-27
Hello everyone
I am having some trouble calling de novo variants in my trio. After getting SNP variants from hard filtering, I have 3 VCF files (father, mother, child). So I apply the following steps:
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar \
  -T CombineVariants \
  -R /media/cri/TRIO/hg38ensbl/97release/hg38.fa \
  --variant 02filtered_snps_final.vcf \
  --variant 06filtered_snps_final.vcf \
  --variant 08filtered_snps_final.vcf \
  -o output268merged.vcf \
  -genotypeMergeOptions UNIQUIFY
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -R /hg38ensbl/97release/hg38.fa -T SelectVariants --variant output268merged.vcf -restrictAllelesTo BIALLELIC -o bioutput268merged.vcf
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -R /hg38ensbl/97release/hg38.fa -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf.gz -ped fam02.ped -V bioutput268merged.vcf -o recalibratedVariants.postCGP.vcf
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -T VariantFiltration -R /hg38ensbl/97release/hg38.fa -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -T VariantAnnotator -R /hg38ensbl/97release/hg38.fa -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped fam02.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf
But afterwards I don't get any annotated hiConfDeNovo or loConfDeNovo; is that normal?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample02.variant sample06.variant2 sample08.variant3
….
I hope I did everything right.
Thanks for the help
From qserenali on 2020-01-03
For the CalculateGenotypePosteriors step in the genotype refinement workflow, is there any advantage in providing a ped file with, say, 100 pedigrees vs. one pedigree at a time? The objective is de novo mutation detection. One pedigree at a time is easier for parallel processing and downstream result review. Does the principle "Better together" described in https://software.broadinstitute.org/gatk/documentation/article?id=11019 apply here? Thanks.