077. Version highlights for GATK version 37

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2016-12-12

Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!

Aside from the usual pile of bug fixes, the new features in this version are actually (almost) all features or improvements that were developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do to make GATK 4 complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha version that we can't backport (especially those related to speed/performance improvements) because they depend on the new framework. But what we could backport, we did.

The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless threshold -stand_emit_conf. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.

As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.

A certain je ne sais QUAL ...

There's been a recent infusion of new blood in the development team -- meaning new members, to be clear (GATK development does not actually involve any demonic rituals, despite any rumors you might have heard). With that came a renewal of ideas on how to calculate key variant metrics, including QUAL.

As we explain in loving detail in this document, the QUAL score is the Phred-scaled posterior probability that all samples in your callset are homozygous reference. In other words, it represents the probability that all the variant evidence you saw in your data is wrong. It's not a very reliable way of ranking variant quality as such, because it's vulnerable to artifacts like inflation at high read depth -- but it does allow us to rule out the majority of glaringly false calls at the low end of the scale.

The current model for calculating QUAL has some flaws that manifest, among other things, as a tendency to excessively penalize singletons and doubletons (variants observed only in one or two samples in a cohort), especially at large cohort sizes. It also uses different and needlessly complicated logic for dealing with haploid, diploid and polyploid cases, leading to "amusing" inconsistencies. No one likes that.

So then some magic happened and now we have a new model that is simpler and behaves better, according to our tests. We're using it in our own work already, but we're not switching it on by default in the public release because it is a pretty big change. Instead, it's available as an opt-in feature that you can enable by setting -newQual in any command line invoking a germline caller (HaplotypeCaller, GenotypeGVCFs or UnifiedGenotyper if you really must use it).

Assuming it continues to behave well in our hands and yours (those of you who switch it on), it will be the default model in GATK 4. And then it will get documented in loving detail too, of course.

One threshold for calling

One of the last steps in the germline short variant calling process is the calculation of the QUAL score for each candidate variant. Once that's done, a threshold is applied on the QUAL score and we discard any variants that scored lower than the given threshold value. When you're running HaplotypeCaller in "GVCF mode", i.e. with -ERC GVCF or -ERC BP_RESOLUTION, that threshold is set to zero and every record is written to the output file. In every other case, the threshold is set by the -stand_call_conf argument, which stands for "standard calling confidence".

That sounds perfectly reasonable, doesn't it? Well, in practice that's not exactly how it works -- or has worked so far, anyway. No, we looove to provide sooo many additional options, we just had to use two arguments to do the QUAL thresholding. One called -stand_call_conf, and a second one called -stand_emit_conf. The first one works as advertised; the second was meant to make it possible to include candidate variants with a lower QUAL score to be written to the output, BUT with a "LowQual" tag of shame in the FILTER field. It was supposed to provide a sort of filtering preview.

Frankly, for the past few years we've used the same value for both, which effectively cancels out the -stand_emit_conf functionality, and generally speaking the presence of that argument has been sowing confusion since the dawn of time. In my view it's at odds with our philosophy of "call everything that moves then filter things out properly with other annotations". So we're killing it. It's gone.

So to summarize, the -stand_call_conf is the last threshold left standing for variant calling. Oh, and while we were at it we lowered the default value to 10 instead of 30. This is more generous than it needs to be, but you can always filter whatever's in the output that you don't want. Whereas you can't easily go back and relax the threshold if it was higher than you wanted -- or if you hadn't even realized it was a thing you could do to increase sensitivity.

Meaningful nod to people who do variant caller comparisons...

Culling the genotype herd to avoid a stampede

In some regions where the sequence context is very repetitive, we tend to find many candidate alleles for the same position, even within a single sample. When that happens, and especially if the requested ploidy is high (e.g. in pooled experiments), the number of possible genotypes that we have to evaluate (i.e. calculate likelihoods for) becomes downright astronomical. Depending on conditions, the consequences can range from unacceptably long runtimes to complete crash.

We previously tried to solve this by providing an argument called -maxAltAlleles to limit the number of alleles the caller has to consider, but the way it was wired up only limited the alleles that were output, not those considered internally. So it only solved superficial problems, and it didn't account for ploidy directly.

Now we're trying a new approach that involves setting a limit on the number of genotypes that we're willing to consider, instead of a number of alleles. Under the hood, this ties into some logic that drops alt alleles that are really unlikely until we get to a number of possible genotypes that we deem acceptable. The default value is set to be comfortable enough that this only kicks in at complex sites when ploidy is high, but it can be modified with the argument -maxGenotypeCnt to be more or less generous.

Note that -maxAltAlleles is still applicable, but the current implementation is set to give precedence to -maxGenotypeCnt. So at sites where sample ploidy and -maxAltAlleles combine to give a genotype count higher than the value in -maxGenotypeCnt, the -maxAltAlleles limit will be ignored, and alternate alleles will be removed based on ploidy and -maxGenotypeCnt.

If you want to tweak these settings, keep in mind their interactions and the rule of precedence, otherwise you might run into surprises (and not necessarily of the surprise birthday cake at your team meeting kind) that we will NOT consider to be bugs. For example, let's say you provide -maxAltAlleles with a high value, leave the -maxGenotypeCnt as default, and works with a high ploidy sample. Due to the newly imposed maximum genotype count, alt alleles actually used in genotyping will be limited to far less than the maximum you requested. For example, with ploidy 18 and maximum genotype count set to 1024 (the current, arbitrary default value, but definitely reasonable in most cases), the maximum allele count is 3 (alt allele count 2), potentially much lower than the -maxAltAlleles you requested.

MuTect2 starting to see the light at the end of the beta tunnel

MuTect2 is a next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller. It was first made publicly available in GATK version 3.5 as a beta tool earmarked for experimental work only (no production or commercial work). In evaluations performed with our colleagues in the Broad's Cancer Genome Analysis group (CGA), we found that MuTect2 was doing a great job on indels, but its sensitivity on SNVs was slightly inferior to the original MuTect on which it was based.

Due to a shift in priorities we then had to put MuTect2 development on hiatus, which is why it stayed virtually unchanged in GATK 3.6. But I'm happy to report that MuTect2 is now back in active development! In this release, we have a small but appreciable crop of improvements to MuTect2, which will be the last ever made in a 3.x version.

The improvements made in this version mainly have to do with cleaning up the code and simplifying parts where hybridization with the HaplotypeCaller machinery got a bit Frankenstein-y. As part of that, we're now exposing a couple of downsampling-related arguments that were previously hardcoded, so -maxReadsInRegionPerSample and -minReadsPerAlignment can now be set from command line. We've also added back a few components that were in the original MuTect but weren't ported in the move to GATK, including the clustered read position and strand bias filters. Work is still ongoing to determine exactly what is the best way to leverage these components for best results.

To be clear, despite these improvements we're still keeping MuTect2 in beta status pending full satisfaction from our CGA friends -- so that does mean that the eventual fully-supported version of MuTect2 will be released in GATK 4 only. We'll post a roadmap/expected timeline of GATK4 and MuTect2 development in the coming weeks.

From wenbinmei on 2016-12-12

Hi, GATK development team. If I have already called individual sample using HaplotypeCaller and will do joint calling using GenotypeGVCFs. Do you think it make sense for me to use new -newQual in the GenotypeGVCFs or not use it? Thank you for help.

From Geraldine_VdAuwera on 2016-12-12

Hi @wenbinmei,

Yes, you can use the new QUAL model on GVCFs generated with an earlier (recent) version.

From wenbinmei on 2016-12-12

It is ok to use new QUAL model on GVCFs even I used the old QUAL model on HaplotypeCaller ? Does it will improve SNP calling ? Thanks!

From Geraldine_VdAuwera on 2016-12-13

Yes, because GenotypeGVCFs will ignore the QUAL originally calculated by HaplotypeCaller and will recalculate it based on the PLs of the genotypes recorded in the GVCFs. This may improve sensitivity to rare variants.

From MattB on 2016-12-13

Hi Dev team / Geraldine, I've been experimenting with -newQual using the HC in classic none GVCF mode, and you should note that by using -newQual on it's own, I'm seeing standard_min_confidence_threshold_for_calling=10.0 standard_min_confidence_threshold_for_emitting=30.0 in the VCF header under ##GATKCommandLine.HaplotypeCaller=<>. So it at least appears that the emitting threshold is still persists at the old value according to the VCF header (I' guessing it's actually ignored). Of course if I attempt to explicitly set -stand_emit_conf 10 I get the deprecation warning. I've not yet tested this behaviour in other tools but expect it arises from something common.

From Geraldine_VdAuwera on 2016-12-13

Oh, I think I know why that happens, it’s an oversight from the refactoring process. It shouldn’t have any functional consequence beyond the residual line in the VCF header. Thanks for pointing this out! We’ll make sure to remove it.

From wisonlee on 2017-03-13

Hi Dev team / Geraldine, I’d like to know how to use MuTect2’s `-maxReadsInRegionPerSample`. I tried several combinations with `-dt` and `-dcov` but to no avail. Could you please give me some hints on it?

Thanks,

Richard

From Sheila on 2017-03-15

@wisonlee

Hi Richard,

I am not sure I understand your question. Can you post the exact command you ran and the issue you are having? Perhaps the [documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php#—maxReadsInRegionPerSample) will help?

Thanks,

Sheila

From wisonlee on 2017-03-16

Hi @Sheila

I run into the following error if I combine -dt BY_SAMPLE, -dcov 1000 and --maxReadsInRegionPerSample 1000: ``` INFO 14:22:28,492 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:22:28,528 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 14:22:28,529 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 14:22:28,529 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 14:22:28,529 HelpFormatter - [Mon Mar 13 14:22:28 CST 2017] Executing on Linux 2.6.32-642.3.1.el6.x8664 amd64 INFO 14:22:28,530 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.091-b14 INFO 14:22:28,537 HelpFormatter - Program Args: -T MuTect2 -R hs37d5.fa -I:tumor incase.bam -I:normal inctrl.bam -L myroi.bed --dbsnp dbsnp138.b37.vcf.gz --cosmic b37cosmic.vcf.gz -contamination 0 --maxaltallelesinnormalcount 3 --maxaltallelesinnormalqscoresum 40 --maxaltalleleinnormalfraction 0.02 -dt BYSAMPLE -dcov 1000 --maxReadsInRegionPerSample 1000 -o out.vcf

ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://software.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Invalid command line: Cannot use -dcov or --downsampletocoverage for ActiveRegionWalkers, use another downsampling argument

ERROR ------------------------------------------------------------------------------------------

```

If I remove the -dt and -dcov parameters, it runs okay but with the dt set to default (Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 ) in the log. I don't know whether the --maxReadsInRegionPerSample 2000 parameter is working as intended. ``` INFO 08:43:59,994 HelpFormatter - --------------------------------------------------------------------------------- INFO 08:43:59,998 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 08:43:59,998 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 08:43:59,998 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 08:43:59,998 HelpFormatter - [Thu Mar 16 08:43:59 CST 2017] Executing on Linux 2.6.32-642.3.1.el6.x8664 amd64 INFO 08:43:59,999 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.091-b14 INFO 08:44:00,003 HelpFormatter - Program Args: -T MuTect2 -R hs37d5.fa -I:tumor incase.bam -I:normal inctrl.bam -L myroi.bed --dbsnp dbsnp138.b37.vcf.gz --cosmic b37cosmic.vcf.gz -contamination 0 --maxaltallelesinnormalcount 3 --maxaltallelesinnormalqscoresum 40 --maxaltalleleinnormalfraction 0.02 -dt BYSAMPLE -dcov 1000 --maxReadsInRegionPerSample 2000 -o out.vcf

#

INFO 08:44:00,086 GenomeAnalysisEngine - Strictness is SILENT INFO 08:44:00,265 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 08:44:00,275 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 08:44:00,349 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 INFO 08:44:01,021 IntervalUtils - Processing 153371 bp from intervals WARN 08:44:01,028 IndexDictionaryUtils - Track cosmic doesn't have a sequence dictionary built in, skipping dictionary validation WARN 08:44:01,028 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation INFO 08:44:01,132 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files INFO 08:44:01,890 GenomeAnalysisEngine - Done preparing for traversal INFO 08:44:01,891 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 08:44:01,892 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 08:44:01,892 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 08:44:02,021 MuTect2 - Using global mismapping rate of 45 => -4.5 in log10 likelihood units Using AVX accelerated implementation of PairHMM INFO 08:44:05,667 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file INFO 08:44:05,668 VectorLoglessPairHMM - Using vectorized implementation of PairHMM ``` Thanks, Richard

From Sheila on 2017-03-17

@wisonlee

Hi Richard,

Ah, I see. Yes, we do not recommend using -dt or -dcov in MuTect2 or HaplotypeCaller. Have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/3989/downsampling-to-coverage-and-the-3-x-haplotypecaller) and [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/4614/haplotypecaller-and-downsampling).

-Sheila