126. Version highlights for GATK version 35

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2015-11-25

The last GATK 3.x release of the year 2015 has arrived!

The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.

In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.

In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.

We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.

Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).

Introducing MuTect2 (beta): calling somatic SNPs and Indels natively in GATK

MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.

The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.

This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.

As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.

Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.

Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email softwarelicensing@broadinstitute.org). To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.

Annotate this: new and improved variant context annotations

Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).

So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!

We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.

These Three Awesome File Hacks Will Restore Your Faith In Humanity’s Ability To Free Up Some Disk Space

CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!

Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.

Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception to this is when processing PacBio data, it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.

Updated on 2015-11-25

From TechnicalVault on 2015-11-25

So how does “Static binning of base quality scores” interact with Illumina’s binning and CRAM’s binning?

From ebanks on 2015-11-25

Good question. The answer is that this is along the same lines of those binning strategies, but allows the user to have a lot more control over the parameters. Here’s how we’re (roughly) doing things in the Broad’s production pipeline going forward:

1. The original quality scores obviously use Illumina’s binning since it comes like that from the machines.

2. We then use our new static binning code to create statically binned recalibrated quality scores. We’ve ensured through a lot of internal analysis that downstream calling in the GATK isn’t affected by the binning; note that we aren’t saying that any binning is necessarily okay, but rather that the way we bin is okay in our hands.

3. While we will produce CRAM files at the end of the pipeline, we don’t use CRAM’s binning. Not all of our downstream users/tools can handle CRAMs (yet!) so it was important that binning be independent from CRAM.

Hope this helps.

From fleharty on 2015-11-25

@ebanks hit on the major points here. I would like to add that when we do bin the bam, the choice of bins can impact variant calls in a negative way. We have found that bins 10, 20, 30, 40 are reasonable choices for making variant calls using HaplotypeCaller. After binning that bam you can create a cram file that will be even smaller than what you would have achieved by simply cramming the file without binning (in lossless mode).

From inter on 2015-12-17

Dear professor, I wanna know how to call somatic snps only. I can not find any information about that. how to set ???Thank you !

From Sheila on 2015-12-17

@inter

Hi,

There is a tool called [MuTect](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php) that calls somatic variants. If you have the latest version of GATK (3.5), you can access the latest version (MuTect2) which calls both SNPs and indels. https://www.broadinstitute.org/gatk/blog?id=6495

-Sheila

From everestial007 on 2015-12-19

For the GATK enthusiasts its good to know that a new version of GATK is available. I had been doing all my analyses until now using GATK 3.4-46 version. I am now at the middle-end stages of GATK practices. I will now use HC, GenotypeGVCFs (again) and additionally FastaAlternateReferenceMaker. I don’t think taking the data analyzed by the previous version (mainly BQSR, HC, CombineVariants using GATK 3.4-46) will make so much difference if I switch to the new version for the latter part of the analyses. I have looked on the updates and I see that most of the improvements are on coding efficiency including addition of Mutec (for analyzing mutation on somatic genome, which is not any part of my analyses). So, I am guessing I am on a right track.

Thanks,

From rshahi on 2016-01-04

Hi Geraldine and Sheila,

Can you help me with normal-only calling for panel of normals with MuTect2.

The code below does not spit any error as well as variants in vcf, except vcf header.

Here is the code:

java -Xmx45g -XX:+AggressiveOpts -jar /nexus/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar \

-T MuTect2 \

-R /home/ngs/data/ucsc.hg19.fasta \

-I:tumor /nexus/normal.bam \

—dbsnp /home/ngs/data/dbsnp_137.hg19.vcf \

—cosmic /nexus/rshahi/data/CosmicCodingMuts.vcf5 \

—artifact_detection_mode \

—intervals /nexus/NimbV3_target.gatk.interval_list \

-o $HOME/jobout/normal_only.vcf

Thanx,

Rajendra

Brussels

From Geraldine_VdAuwera on 2016-01-06

Hi @rshahi,

We recently had a similar report from someone else, which makes me wonder if this might be a bug. Would you be able to share some data that reproduces the problem? We would need a snippet of data from a region where you see variants called with e.g. HaplotypeCaller. Instructions for submitting a bug report are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

From mglclinical on 2016-01-07

We are using GATKv3.4 , mainly for realignment, baserecalibration, unifiedgenotyper and haplotypecaller.

I am intending to upgrade to GATK.3.5 . After reading posts from @rshahi , I am thinking if I should I wait to make sure that everything about GATK3.5 is fine, before I upgrade ?

From Geraldine_VdAuwera on 2016-01-07

mglclinical The issue that rshahi and others have raised is strictly limited to MuTect2, which is currently in beta status and expected to have a few kinks that need sorting out. The core Best Practices tools which you are using are fully mature; HaplotypeCaller in particular has been improved in 3.5 to run faster due to some efficiency tuning. So I would encourage you to switch to 3.5 — but running a few tests first is always a good idea of course.

From mglclinical on 2016-01-07

@Geraldine_VdAuwera Great, Thanks for clarifying . I will test GATK v3.5

From rshahi on 2016-01-07

Geraldine_VdAuwera please find the snippet archived as rshahi_mutect2_error.tar.bz2(ftp://gsapubftpftp.broadinstitute.org/rshahi_mutect2_error.tar.bz2)

Note:I ran MuTect2(mt) as well as HaplotypeCaller(hc) for the same sample.No problem with hc, where as mt has problem as mentioned earlier in the thread. All the files required are included.

Regards,

From Geraldine_VdAuwera on 2016-01-12

Hi rshahi, cooperjam,

Thanks for the test data. We think this is a bug that was introduced just before the release and has since been fixed in development, so the latest nightly build (available on the Downloads page) should work fine. We still need to verify that ourselves and it’s in our test queue, but feel free to test it out yourself if you don’t want to wait.

From Sheila on 2016-01-19

@rshahi

Hi Rajendra,

I just submitted a bug report. Please find updates on the fix [here](http://gatkforums.broadinstitute.org/gatk/discussion/comment/26638).

Thanks,

Sheila

From YingLiu on 2016-03-09

Hi ,Professor, I wanna use GATK3.5 to do some clinical work ,but I find some information that GATK team do not recommend , So when GATK 4 will published ? And I have other choice to do if I wanna use GATK ?

Thank you very much!

From YingLiu on 2016-03-09

@Geraldine_VdAuwera

From Geraldine_VdAuwera on 2016-03-09

@YingLiu What kind of work do you want to do? Most of the current GATK tools are ok for clinical work. The only tool that is considered Beta (still in development) is MuTect2. If that’s what you mean (for caling somatic variants) you can use the original MuTect which is available from the download page — but it only calls SNPs, not indels.

From mglclinical on 2016-05-17

I believe —disable_indel_quals is the optional parameter in -T PrintReads tool, and not the actual -T BaseRecalibrator tool ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-05-18

Yes, that’s correct, as in:

> recommending the use of the —disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.

From mglclinical on 2016-05-18

thank you @Geraldine_VdAuwera

From daih on 2017-02-16

Hi @Geraldine_VdAuwera

I’m confused about “we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies”.

Do you mean filter by “AF” or minor allele frequency in the followed annotation file?

Could you give an example command line related to this filtering step?

Thanks

From shlee on 2017-02-16

Hi @daih,

Here’s an example result from MuTect2 calling.

Each of the tumor and normal samples have genotype calls in the format:

> GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1

Here, AF refers to the allele fraction for the non-ref alt allele.

To filter on AF, I believe you can use [JEXL](http://gatkforums.broadinstitute.org/gatk/discussion/1255/using-jexl-to-apply-hard-filters-or-select-variants-based-on-annotation-values), although I have not tried this. Let us know how it goes.

Report abuse