Genotype Refinement workflow for germline short variants

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-28

Contents

Overview
Summary of workflow steps
Output annotations
Example
More information about priors
Mathematical details

1. Overview

The core GATK Best Practices workflow has historically focused on variant discovery --that is, the existence of genomic variants in one or more samples in a cohorts-- and consistently delivers high quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes.

While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.

If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details).

After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below). Note that GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before.

2. Summary of workflow steps

Input

Begin with recalibrated variants from VQSR at the end of the germline short variants pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow.

Step 1: Derive posterior probabilities of genotypes

Tool used: CalculateGenotypePosteriors

Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals.

SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability of a sample being analyzed of also being homozygous variant is very high.

For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.

Step 2: Filter low quality genotypes

Tool used: VariantFiltration

After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging those low quality genotypes indicates to researchers that these genotypes may not be suitable for downstream analysis. However, as with the VQSR, a filter tag is applied, but the data is not removed from the VCF.

Step 3: Annotate possible de novo mutations

Tool used: VariantAnnotator

Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion.

Step 4: Functional annotation of possible biological effects

Tool options: Funcotator (experimental)

Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. Funcotator is a new tool that is currently still in development. If you would prefer to use a more mature tool, we recommend you look into SnpEff or Oncotator, but note that these are not GATK tools so we do not provide support for them.

3. Output annotations

The Genotype Refinement workflow adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed.

Population Priors

New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset.

Phred-Scaled Posterior Probability

New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better calibrated estimate of genotype probabilities than the PLs are recommended for use in further analyses instead of the PLs.

Genotype Quality

Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs.

Joint Trio Likelihood

New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as:

where the GLs are the genotype likelihoods in [0, 1] probability space.

Joint Trio Posterior

New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as:

where the GPs are the genotype posteriors in [0, 1] probability space.

Low Genotype Quality

New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses.

High and Low Confidence De Novo

New INFO field annotation for sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately.

4. Example

Before:

1 1226231 rs13306638 G A 167563.16 PASS AC=2;AF=0.333;AN=6;… GT:AD:DP:GQ:PL 0/0:11,0:11:0:0,0,249 0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0

After:

1 1226231 rs13306638 G A 167563.16 PASS AC=3;AF=0.500;AN=6;…PG=0,8,22;… GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287 0/0:10,0:10:32:2:24:0,24,360:0,32,439 1/1:0,18:18:43:2:24:889,60,0:867,43,0

The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child.

The joint likelihood for this trio at this site is two, indicating that the genotype for one of the samples may have been changed. Specifically a low JL indicates that posterior genotype for at least one of the samples was not the most likely as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.)

5. More information about priors

The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio.

Input-derived Population Priors

If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant.

Supporting Population Priors

Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors.

Family Priors

The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10-6). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low GQ HomRef child and a GQ60 HomRef child and parent can improve the GQ of the second parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case.

Caveats

Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.

6. Mathematical details

Note that family priors are calculated and applied before population priors. The opposite ordering would result in overly strong population priors because they are applied to the child and parents and then compounded when the trio likelihoods are multiplied together.

Review of Bayes’s Rule

HaplotypeCaller outputs the likelihoods of observing the read data given that the genotype is actually HomRef, Het, and HomVar. To convert these quantities to the probability of the genotype given the read data, we can use Bayes’s Rule. Bayes’s Rule dictates that the probability of a parameter given observed data is equal to the likelihood of the observations given the parameter multiplied by the prior probability that the parameter takes on the value of interest, normalized by the prior times likelihood for all parameter values:

$$ P(\theta|Obs) = \frac{P(Obs|\theta)P(\theta)}{\sum_{\theta} P(Obs|\theta)P(\theta)} $$

In the best practices pipeline, we interpret the genotype likelihoods as probabilities by implicitly converting the genotype likelihoods to genotype probabilities using non-informative or flat priors, for which each genotype has the same prior probability. However, in the Genotype Refinement Pipeline we use independent data such as the genotypes of the other samples in the dataset, the genotypes in a “gold standard” dataset, or the genotypes of the other samples in a family to construct more informative priors and derive better posterior probability estimates.

Calculation of Population Priors

Given a set of samples in addition to the sample of interest (ideally non-related, but from the same ethnic population), we can derive the prior probability of the genotype of the sample of interest by modeling the sample’s alleles as two independent draws from a pool consisting of the set of all the supplemental samples’ alleles. (This follows rather naturally from the Hardy-Weinberg assumptions.) Specifically, this prior probability will take the form of a multinomial Dirichlet distribution parameterized by the allele counts of each allele in the supplemental population. In the biallelic case the priors can be calculated as follows:

$$ P(GT = HomRef) = \dbinom{2}{0} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 2)}{\Gamma(nSamples + 2)\Gamma(RefCount)} $$

$$ P(GT = Het) = \dbinom{2}{1} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 1)\Gamma(AltCount + 1)}{\Gamma(nSamples + 2)\Gamma(RefCount)\Gamma(AltCount)} $$

$$ P(GT = HomVar) = \dbinom{2}{2} \ln \frac{\Gamma(nSamples)\Gamma(AltCount + 2)}{\Gamma(nSamples + 2)\Gamma(AltCount)} $$

where Γ is the Gamma function, an extension of the factorial function.

The prior genotype probabilities based on this distribution scale intuitively with number of samples. For example, a set of 10 samples, 9 of which are HomRef yield a prior probability of another sample being HomRef with about 90% probability whereas a set of 50 samples, 49 of which are HomRef yield a 97% probability of another sample being HomRef.

Calculation of Family Priors

Given a genotype configuration for a given mother, father, and child trio, we set the prior probability of that genotype configuration as follows:

$$ P(GM,GF,G_C) = P(\vec{G}) \cases{ 1-10\mu-2\mu^2 & no MV \cr \mu & 1 MV \cr \mu^2 & 2 MVs} $$

where the 10 configurations with a single Mendelian violation are penalized by the de novo mutation probability μ and the two configurations with two Mendelian violations by μ^2. The remaining configurations are considered valid and are assigned the remaining probability to sum to one.

This prior is applied to the joint genotype combination of the three samples in the trio. To find the posterior for any single sample, we marginalize over the remaining two samples as shown in the example below to find the posterior probability of the child having a HomRef genotype:

This quantity P(Gc|D) is calculated for each genotype, then the resulting vector is Phred-scaled and output as the Phred-scaled posterior probabilities (PPs).

Updated on 2019-01-15

From sergey_ko13 on 2018-07-02

Dear GATK_Team,

great and comprehensive howto. Thank You!

I am just stuck with VariantAnnotator part as I couldn’t find commandline guide for that tool in GATK4.

Without -ped file it writes: must provide a valid PED file (-ped) from the command line

With -ped: A USER ERROR has occurred: Pedigree argument “pedigree” or “founder-id” was specified without a pedigree annotation being requested, (eg: ))

What should I do? What kind of pedigree annotation am I missing?

Thank You!

Sergey

From miguelb on 2018-07-03

I too am struggling with this error! Example command I’m running:

```/gatk VariantAnnotator -R Homo_sapiens_assembly38.fasta -O recalibratedVariants.postCGP.Gfiltered.deNovos.vcf.gz -V recalibratedVariants.postCGP.Gfiltered.vcf.gz -A PossibleDeNovo —pedigree input.ped```

Any advixe would be greatly appreciated!

From shlee on 2018-07-05

Hi sergey_ko13 and miguelb,

You need to provide a PED format pedigree file if you will use family relations to refine genotypes. [Article#7696](https://gatkforums.broadinstitute.org/gatk/discussion/7696/pedigree-ped-files) describes the PED format. I just recently made such a pedigree file myself a week ago to update a GATK Workshop hands-on tutorial and it works fine with [CalculateGenotypePosteriors](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_variantutils_CalculateGenotypePosteriors.php).

It looks like VariantAnnotator is ported to GATK4 release 4.0.2.0+ as a BETA tool. It is not showing up in forum documentation but you can read the details for the tool directly from the GATK jar using `gatk VariantAnnotator —help`. This prints to the terminal:

```

—pedigree,-ped:File Pedigree file for determining the population “founders” Default value: null.

```

I’ll look into why the tool doc isn’t showing up on the website.

From sergey_ko13 on 2018-07-06

Hi @shlee ,

We have ped files. And it worked fine for CalculateGenotypePosteriors too. But not with VariantAnnotator.

So we downraded to GATK3 with same files on same parameters in order to finish that part – we hope calculations are same.

Thank You!

Sergey

From shlee on 2018-07-06

Thanks, @sergey_ko13 for the clarification and for reporting this. I’ve let the developers know about this seeming bug in VariantAnnotator at https://github.com/broadinstitute/gatk/issues/4987. It’s likely they just haven’t gotten to wiring this functionality as VariantAnnotator was recently ported to GATK4 and is still in beta. Again, we really appreciate researchers like yourself testing and reporting such inconsistencies.

From student-t on 2018-08-07

@shlee

A USER ERROR has occurred: Pedigree argument “pedigree” or “founder-id” was specified without a pedigree annotation being requested, (eg: ))

From Sheila on 2018-08-10

@student-t

Hi,

Can you post the exact command you ran?

Thanks,

Sheila

From blueskypy on 2018-09-23

In P(G|D)=P(G)P(D|G)/P(D), HaplotypeCaller use a flat prior for P(G); but here the refinement uses more informative prior, then why HC doesn’t use this approach? Seems to me the prior computed from cohort allele frequency is more accurate than the flat one.

From shlee on 2018-09-24

Hi @blueskypy,

You can provide HaplotypeCaller a callset to use towards calculating genotype priors. Specify this callset with `—population-callset`. The [tooldoc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php) describes this parameter thusly:

> Callset to use in calculating genotype priors

> Supporting external panel. Allele counts from this panel (taken from AC,AN or MLEAC,AN or raw genotypes) will be used to inform the frequency distribution underlying the genotype priors. These files must be VCF 4.2 spec or later. Note that unlike CalculateGenotypePosteriors, HaplotypeCaller only allows one supporting callset.

From blueskypy on 2018-09-24

Thanks @shlee ! In joint calling, shouldn’t HC automatically use the cohort allele frequencies to compute a prior instead of a flat one?

From shlee on 2018-09-24

Hey @blueskypy,

I’m just helping out our new forum support person today and I’m noticing that you’ve posted four similar questions across different threads for the last few days. Would it be possible for you to consolidate all of your questions into a [single new post](https://gatkforums.broadinstitute.org/gatk/post/question/gatk-support-forum)?

Here are your four threads:

- https://gatkforums.broadinstitute.org/gatk/discussion/comment/52389#Comment_52389

- https://gatkforums.broadinstitute.org/gatk/discussion/comment/52386#Comment_52386

- https://gatkforums.broadinstitute.org/gatk/discussion/comment/52368#Comment_52368

- https://gatkforums.broadinstitute.org/gatk/discussion/comment/52183#Comment_52183

Going forward, it would help us tremendously if you could not ask duplicate questions and to group your related questions into the same thread. Thank you!

From blueskypy on 2018-09-24

Thanks @shlee ! Seems I don’t have the permission to post questions.

From shlee on 2018-09-24

@blueskypy,

That’s no good! I just checked your account details and you should be able to post questions. Do you know how long you’ve been unable to post questions? Can you please try again using these links:

1. https://gatkforums.broadinstitute.org/gatk/post/discussion/gatk4-methods

2. https://gatkforums.broadinstitute.org/gatk/post/question/gatk-support-forum

In the meanwhile, I will let my colleague @KateN, who manages the Vanilla side of things, know. Thanks.

From blueskypy on 2018-09-24

@shlee

clicking of those links took me to a blank page saying “Permission problem”. Also, in page https://gatkforums.broadinstitute.org/gatk/discussions/mine, I don’t see any option to “Ask the team”. I haven’t posted any question since Oct. last year.

From shlee on 2018-09-26

Hi @blueskypy,

@KateN, as well as a Vanilla forum support specialist, has looked into your account and from your account settings allow you to post like anyone else. Kate is on vacation so I am following up. It’s not clear what is going on and perhaps for starters you can switch to a different browser. It may be a security setting issue for your browser for the website. Please let us know if the problem of being unable to post new threads persists.

From blueskypy on 2018-09-26

Thanks @shlee ! I posted two questions.

From aschoenr on 2018-09-28

Hi,

I am currently putting together a pipeline to identify de novo mutations within a collection of huan families. So far I have implemented the pre-processing pipeline and I just wanted to be sure that I am proceeding correctly. What I am planning to do next is implement “Germline short variant discovery” workflow (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145) and then the workflow described here as Step 3 identifies de novo mutations. My questions about this:

1) Would this be the correct way of going about this? Would this be considered the GATK best practice to identify de novo mutations in human families?

2) I noticed there are other GATK tools to identify de novo mutations (specifically phasebytransmission). The tool I mentioned (PBT) seems to have fewer steps involved than implementing these two workflows. Should I just stick with what is described in these workflows?

3) This may be a dumb question, but both of the workflows I have mentioned talk specifically of germline variant discovery. Obviously this is relevant when talking about de novo mutations but I’m just not sure why these workflows seem to be specific to germline mutations. I apologize, my background is not in biology so I may just have some terms confused here.

4) For the Germline short variant discovery workflow, I assume I should apply this workflow on families at a time, not all of the families at once? In other words, I want to perform joint genotyping on all of the members of a family as a group.

I think that’s all my questions for now. Just looking for some reassurance that I’m on the wrong track and that I’m not incorporating too many steps to identify the specific type of mutations I’m looking for.

Thanks!

From aschoenr on 2018-10-03

Hi shlee, Sheila,

Just wanted to tag you here as it seems you’ve been quite helpful with other questions.

Thanks!

From shlee on 2018-10-03

Hi @aschoenr,

Great questions.

Our pre-processing and germline variant discovery Best Practices Workflows should enable you to call variants in your germline samples. They have been the industry standard for germline short variant calling (SNPs and indels <=50bp) for many years now and are used by large population cohort studies [ExAc](http://exac.broadinstitute.org/) and [gnomAD](http://gnomad.broadinstitute.org/). As for family analyses, you can refer to the [1000 Genomes Project](http://www.internationalgenome.org/)’s methods to analyze trios (mother/father/child). It’s my impression that the focus in the field is shifting from tabulating shared-population germline variants to accurately detecting rare variants on a per-sample basis, towards solving rare diseases. Part of this shift is to not just call a variant site but to accurately detect alleles (`A` vs. `G`) and zygosity (heterozygous vs. homozygous-variant). Thus far, it appears our workflows, e.g. the HaplotypeCaller GVCF workflow, are tailored to and have been used extensively in large population analyses.

You ask whether such workflows are appropriate to discover sample-level de novo variants and how to group the families—per family or group everyone.

If you have the power of a family cohort, then variants that are shared between family members will have great resolution. As for de novo variants, given ample read support, GATK should enable calling these as well. Here are some points to consider.

[1] Do you have coverage depth and quality reads that allow resolving per-sample variants across the genomic loci of interest? Here you can test the sensitivity of your pipeline with a known rare variant in your sample. The sensitivity will be different for SNPs, deletions and insertions. Furthermore, the size of deletions and the size of insertions you can resolve will vary pending your read lengths and insert sizes. For one, if you pool your family members and call on all of their BAMs jointly with HaplotypeCaller, you can increase your sensitivity to resolve certain shared variants. It is possible to run HaplotypeCaller in various modes and it is up to you to choose the mode that best fits your analysis aims. Here’s how I like to summarize the modes:

![](https://us.v-cdn.net/5019796/uploads/editor/lh/akfx4bbk5fu8.png “”)

[2] If we look into [PhaseByTransmission](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_PhaseByTransmission.php) documentation, we see this note:

> DeNovoPrior: Prior probability of de novo mutations. The default value of 1e-8 is fairly stringent, so if you are interested in maximizing sensitivity at the expense of specificity (i.e. are ok with seeing some false positives as long as all true positives are detected) you will need to relax this value.

So it appears that to enable sensitive detection of de novo variants that may only present in single samples, you should relax this prior with the `—DeNovoPrior` parameter.

[3] GATK also offers a [Genotype Refinement workflow](https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow) that uses [CalculateGenotypePosteriors](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php) to refine genotype calls in a family callset.

[4] The [PossibleDeNovo](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_annotator_PossibleDeNovo.php) annotation should interest you and it appears it is part of the Genotype Refinement Workflow.

[5] HaplotypeCaller/GenotypeGVCFs has a new qual model that we recommend that is not yet the default. It will help increase sensitivity for your case.

[6] Finally, be sure to search the forum for related threads. For example, [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/10552/what-is-the-best-way-to-find-denovo-mutations-in-trios) seems pertinent.

I hope this is helpful.

From shlee on 2018-10-04

Just some additional comments @aschoenr, from having thought about your question overnight, which is more than the few minutes I had spent yesterday at the end of the day. I think the metric that is important to you is the depth per allele and the sensitivity of your caller to that depth. Also, we must consider the nature of the de novo germline variants that arise each generation and whether you are interested in all or a subset of these.

If your data presents low depth per allele for whatever reason, then the alternative to HaplotypeCaller/GenotypeGVCFs calling is (i) HaplotypeCaller per-sample GVCF calling (some folks in the field use this towards somatic calling) and (ii) Mutect2 calling.

And here is a question to consider. Do you need to resolve those variations that occur in low complexity regions, e.g. homopolymer runs, or repetitive sequence, e.g. alpha satellites? Because no conventional germline caller out there can really resolve these. You would have to use specialized tools such as [lobSTR](http://lobstr.teamerlich.org/) to profile short tandem repeats (STRs). Not only do these types of regions present challenges in sequencing, but by their nature, they are prone to polymerase slippage and mutation within the cell as well. Even if you are uninterested in such types of variation, these will show up in your callset as differences between samples and present as many false positive callse. To filter such types of variation, the [Mutect2 workflow](https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2) allows for the filtering of sites in a [Mutect2 panel of normals (PoN)](https://gatkforums.broadinstitute.org/gatk/discussion/11127/somatic-calling-is-not-simply-a-difference-between-two-callsets). Within this workflow, the PoN’s primary purpose is to collect systematic artifacts of sequencing, including those that arise from low complexity and repetitive sequence regions, that are common but do not necessarily manifest for every sample.

Here is another question. Some rare variants present as structural, e.g. copy number variation. I think you might be interested in our [germline CNV (gCNV) calling pipeline](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php), which is still in beta release. The GATK4 gCNV workflow will call both common and rare copy number variants, given a proper panel of normals. Since your aim is to find variants that are different between family members, you should be able to use the family samples for the cohort panel. There is no forum tutorial on this workflow yet. I have been working on the tutorial but the forum has been inundated with questions and, well, see [here](https://software.broadinstitute.org/gatk/blog?id=9063) about the time-pie constant.

Best,

Soo Hee

From aschoenr on 2018-10-05

Wow @shlee thanks so much for all of your input. Just for the record, I seem to have an average coverage of 30-35X, would you consider this to be high depth? I also plan on filtering calls in low complexity regions, specifically segmental duplications, simple repeats and those identified by repeatmasker.

With respect to the [thread](https://gatkforums.broadinstitute.org/gatk/discussion/10552/what-is-the-best-way-to-find-denovo-mutations-in-trios) you mentioned, this is basically where I got my initial plan from. The steps in the original post were:

1 -Alignment to reference genome

2 – marking duplicates

3 – base recalibration

4 – realigning indels

5 – Haplotype caller per sample with the -ERC GVCF option (this will call the ReadBackedPhasing, correct?)

6 – Joint genotyping

7 – Varinat recalibration

8 – Genotype refinement workflow, where pedegree information is used and de novos are annotated using VariantAnnotator.

Steps 1-3 are basically the GATK pre-processing pipeline, step 4 does not need to be done and steps 5-7 are basically the Germline short variant discovery workflow. At this point I would have analysis ready VCFs for all of my families. I was planning on doing step 5 (Haplotypecaller) on each sample individually and then doing joint genotyping on the families. After this discussion and a discussion with a colleague of mine I think I may swap out step 8 (Genotype refinement workflow) for PhaseByTransmission. I am interested strictly in de novo mutations so I think the Genotype refinement workflow may be overkill. The nice thing about all of these approaches is that my pipeline up until step 8 can be used for either the Genotype refinement workflow or PhaseByTransmission, so if PBT doesn’t work out I can always revert to the GATK workflow. Does this seem reasonable to you? The only reason I leaned toward the Genotype refinement workflow was because of @Sheila’s first response.

I also wanted to note that I can only guess how busy you are and I really appreciate your help!

From shlee on 2018-10-09

Hi @aschoenr,

First, I think you might find [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/52438#Comment_52438) and the poster linked therein of interest. I believe 30-35X high complexity sequences, as would result from PCR-free sample prep, is generally considered sufficient for calling short variants (SNP and indel) in high confidence regions.

I would follow @Sheila’s recommendation. She was our full-time front-line support specialist and therefore more aware of all the goings-ons. For example, it seems PhasByTransmission is only available in GATK3, as [porting it to GATK4 has been given low priority](https://github.com/broadinstitute/gatk/issues/40). Let us know if this tool is vital to your research and we can request it be bumped in priority.

Yes, the Communications team is busy these days gearing up for ASHG. You are welcome. It is our pleasure to enable science.

From artitandon on 2018-12-14

I had a question regarding CalculateGenotypePosteriors caller, is it possible to generate a vcf.gz (zipped VCF) file as output instead of the .vcf that it is generating? If so what option can one use? Thanks!

From shlee on 2018-12-14

Hi @artitandon,

All you have to do is specify a `.vcf.gz` extension in the output name and the tool will block compress the callset automatically.

From artitandon on 2018-12-14

I tried that, it didn’t work last time with GATK 4.0 , let me try again.

From artitandon on 2018-12-14

Tried again it does not work, it is not in zipped format

From shlee on 2018-12-17

Hi @artitandon,

Last I checked with v4.0.11.0, CalculateGenotypePosteriors does successfully output a block-compressed `.vcf.gz` file. Note this is a different type of compression than the `.zip` file type. The former allows indexed access. To better help you figure out what is going on, can you please provide the command you are using and explain how you have concluded the output is not in zipped format? For example, do downstream tools produce an error when given the output of CalculateGenotypePosteriors?

From artitandon on 2018-12-17

I know and specified .vcf.gz format, here is the command I am running first:

1) java -jar /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/gatk-package-4.0.1.2-local.jar CalculateGenotypePosteriors —variant /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_sorted.vcf.gz -O /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz -R /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/hg19.fa -ped /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/BCHTrio.ped

Next I run the following command:

2) java -jar /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/gatk-package-4.0.1.2-local.jar VariantFiltration -V /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz -O /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_filter_GT.vcf.gz —genotype-filter-expression GQ<20.0 —genotype-filter-name lowGQ

When I run this I get the following error:

Unable to parse header with error: Not in GZIP format, for input source: file:///mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz

You can see detailed log below. Any help in resolving this will be appreciated:

```

17:03:57.888 INFO FeatureManager – Using codec VCFCodec to read file file:///mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz

17:03:57.896 INFO VariantFiltration – Shutting down engine

[December 14, 2018 5:03:57 PM UTC] org.broadinstitute.hellbender.tools.walkers.filters.VariantFiltration done. Elapsed time: 0.01 minutes.

Runtime.totalMemory()=737673216

org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:346) at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:297) at org.broadinstitute.hellbender.engine.FeatureDataSource.(FeatureDataSource.java:244) at org.broadinstitute.hellbender.engine.VariantWalker.initializeDrivingVariants(VariantWalker.java:55) at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:47) at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:558) at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:43) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195) at org.broadinstitute.hellbender.Main.main(Main.java:277)

Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Not in GZIP format, for input source: file:///mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:262) at htsjdk.tribble.TribbleIndexedFeatureReader.(TribbleIndexedFeatureReader.java:101) at htsjdk.tribble.TribbleIndexedFeatureReader.(TribbleIndexedFeatureReader.java:126) at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:110) at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:342) … 12 more

Caused by: java.util.zip.ZipException: Not in GZIP format at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165) at java.util.zip.GZIPInputStream.(GZIPInputStream.java:79) at java.util.zip.GZIPInputStream.(GZIPInputStream.java:91) at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:256) … 16 more

overall launching time of tool run, (commands are [bash, ./VariantFiltration.sh, /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_trio_calcgp.vcf.gz, /mnt/task/01f18da8-592b-4348-8f1e-af3b2ff43524/Test_filter_GT.vcf.gz, lowGQ, GENOTYPE]): 3 sec.

```

From shlee on 2018-12-17

@artitandon, I vaguely recall some bug with gzipping VCFs in earlier releases of GATK where some tools could only output VCF format and not VCF.GZ. Can you try with the latest v4.0.11.0 and see if that solves the error?

From artitandon on 2018-12-17

An issue is when I run with just .vcf extension for CalculateGenotypePosteriors, the next step VariantFiltration gives me the following error:

java.lang.IllegalArgumentException: Features added out of order: previous (TabixFeature{referenceIndex=0, start=1243934, end=1243934, featureStartFilePosition=3191224091, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=0, start=1243932, end=1243934, featureStartFilePosition=3191224392, featureEndFilePosition=-1})

the commands I am running are:

1) /gatk CalculateGenotypePosteriors -V in.vcf -O out_calcgp.vcf -R ref.fa -ped trio.ped

2) /gatk VariantFiltration -V out_calcgp.vcf -O out_filter.vcf.gz —genotype-filter-expression “GQ<20.0” —genotype-filter-name “lowGQ”

I don’t know how to fix this features error, any help with this will be appreciated.

From shlee on 2018-12-18

@artitandon, sounds like an issue with sorting or with the index.

Can you check that the input and output VCFs have indices and validate with ValidateVariants? I would try to address any issues ValidateVariants brings up, try sorting with SortVcf, and finally, try regenerating indices by deleting existing ones then reindexing with IndexFeatureFile. If these do not work, as I suggested above, can you see if the latest version of GATK4, currently v4.0.11.0, recapitulates the error? If it does, we’ll need some of your data to figure out what may be going on.

As a side note, in GATK4 CalculateGenotypePosteriors doesn’t require the reference anymore.

From shlee on 2018-12-18

Correction, the latest release is v4.0.12.0!

From artitandon on 2018-12-18

Thanks Shlee, I used SortVCF and that works!

I had another question regarding running germline workflow for a proband only, what is the recommended steps post CalculateGenotypePosterior?

From shlee on 2018-12-19

Hi @artitandon,

Recommendations post genotype refinement are context-based annotation with [VariantAnnotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php) (link is to GATK3 tooldoc) and/or functional annotation with [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php). Assuming your sequencing tech, toolchain and data allow for detection of the genotype of interest, then the latter should enable filtering for functionally significant short SNP and indel variants.

Otherwise, you may be interested in GATK4’s [gCNV workflow](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php) for detecting both common and rare germline CNV events in your proband against a panel of normals (PoN). If you are only interested in rare germline CNV events, say de novo events not present in the rest of the family, e.g. parents, then be sure to include the parental samples in the cohort-mode analysis.

From artitandon on 2018-12-19

HI Shlee, Thanks for the response!

in absence of parental data can I use VariantAnnotator, as far as I can see it requires a ped file, and detects de novo variants.

Thanks,

Arti

From Negin on 2019-05-16

Dear GATK team,

I am working on bacteria dna sequences and no hapmap/omni data is available for my case. So, do you think I should calculate genotype posterior for my case? If not, then should I still filter low GQ or is not valid anymore in the absence of genotype posterior?

Thanks in advance,

Negin

From AlFatlawi on 2019-05-17

Hi, I have the same question. I didn’t apply genotype posterior. Should I still filer low GQ or filtering GQ is dependent on calculated genotype posterior?

From Negin on 2019-05-20

Any answer, please?

Report abuse