created by Geraldine_VdAuwera
on 2017-07-29
One more 3.x version, for the road! That's right, even as we're ramping up our efforts on GATK4 (we're three beta releases in at this point, and getting down to brass tacks writing the migration guide ahead of the 4.0 general release) we still found it worthwhile to cut one last release of GATK3.
Our main motivation here is to introduce the Intel Genomics Kernel Library, which comes bearing the gift of speed improvements for those of you who won't be able to migrate to GATK4 right away.
As a secondary benefit, this version includes a handful of bug fixes, some usability improvements including better error messages, documentation fixes and logging tweaks, and a few improvements to annotation calculations (especially in allele-specific mode), which you'll find described briefly in the release notes. No big changes though, except perhaps the new default behavior of VariantsToTable with regard to missing annotation values, discussed below. Finally, we've committed a copy of all the peripheral documentation (= the docs that live in the forum and complement the tool documentation) to the now-old GATK codebase.
And thus, the last-ever GATK3 version emerges covered in carbonite.
The Genomics Kernel Library or GKL is an open-source library developed by our collaborators at Intel that provides accelerated versions of algorithms, i.e. "kernels", used in genomics tools. These kernels are optimized to run on Intel Architecture under 64-bit Linux and Mac OSX. They're plugged into the GATK in such a way that they will be automatically used if your computing hardware supports them, but if it doesn't they will remain inactive and the "default" generic Java versions will be used instead.
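If you're curious whether a given Linux machine is likely to qualify, here's a minimal sketch in Python. It assumes, as one commenter suggests further down, that AVX support is the relevant hardware feature. It only reports what the operating system says about the CPU; it is not how GATK itself makes the decision, and the function names are made up for illustration.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text):
    """Return True if the 'avx' flag is reported (assumed to be the relevant one)."""
    return "avx" in cpu_flags(cpuinfo_text)


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only; macOS does not provide this file
    print("AVX reported by the kernel:", supports_avx(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo here; check the CPU features another way.")
```

On a machine where this prints False, you would expect the generic Java versions to be used.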
At the moment there are three main kernels included:
… -jdk_deflater and -jdk_inflater flags.
… -pairHMM LOGLESS_CACHING, for example if you need completely deterministic behavior across different machine types (at the expense, of course, of speed).
VariantsToTable is a tool we're quite fond of because it allows us to extract just the information we want from VCFs when we want to probe a callset interactively, typically for filtering purposes. Previously we had to tell it explicitly not to freak out if it came across any sites or genotypes where an annotation we requested was missing; but realistically, there are always some sites for which we can't calculate some annotations (like ranksum annotations at sites where we don't have any heterozygous samples), so that was annoying. Now we've flipped the behavior so that by default the tool keeps going and just outputs "NA" anywhere it encounters such sites or genotypes, unless you specify that it should freak out by using the --errorIfMissingData flag.
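To give a rough idea of what this means downstream, here's a minimal sketch of reading such a table in Python and treating "NA" as a missing value. It assumes the usual tab-delimited output with a header row; the column names and values below are made up for illustration, and `read_table` is a hypothetical helper, not part of GATK.

```python
import csv
import io

# Made-up example of a VariantsToTable output: tab-delimited with a header row.
# The columns and values are illustrative only.
EXAMPLE_TABLE = (
    "CHROM\tPOS\tQD\tReadPosRankSum\n"
    "20\t10001\t25.3\t0.52\n"
    "20\t10050\t31.0\tNA\n"
    "20\t10100\t12.8\t-1.10\n"
)


def read_table(handle, numeric_columns):
    """Read a tab-delimited table, turning "NA" in numeric columns into None."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for name in numeric_columns:
            value = record[name]
            record[name] = None if value == "NA" else float(value)
        rows.append(record)
    return rows


rows = read_table(io.StringIO(EXAMPLE_TABLE), ["QD", "ReadPosRankSum"])

# Keep only the sites where the ranksum annotation could be calculated.
with_ranksum = [r for r in rows if r["ReadPosRankSum"] is not None]
print(len(rows), "sites read,", len(with_ranksum), "with a ranksum value")
```

Mapping "NA" to None rather than to zero keeps missing annotations from being mistaken for real values when you filter on them.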
In preparation for the general release of GATK4 (in the form of a 4.0 version), we made a copy of all the peripheral (forum-based) documentation in its current state and archived it in the codebase itself here. This is intended to be a permanent archive for documentation that we are phasing out in favor of GATK4-focused documentation.
Our ultimate goal is to provide some degree of continuity and support for users who cannot migrate to GATK4 right away and must continue to use older versions, without leaving too much clutter around that might confuse everyone else.
In the immediate future we will delete three sets of documents from the forum (and therefore from the website):
Within the other documentation sections, articles may get updated in place or moved to the Archive for future removal. Versioned tool documentation going back to 3.5-0 will remain available on the website for the foreseeable future. For older versions, the documentation can be built from source. Finally, the Best Practices section of the website will be updated to reflect the new world order once GATK 4.0 is released and becomes the officially supported version of GATK. Going forward we'll have versioned Best Practices accompanied by a publicly available WDL script for each major use case. We'll post more details of what this will look like in the coming weeks.
Updated on 2017-07-29
From EADG on 2017-08-10
Hi,
Is there a list of which Intel CPUs will support GKL?
I always need a reason for my boss to buy new hardware ;)
Greetings EADG
From Geraldine_VdAuwera on 2017-08-10
Hah, no kidding :D
I’m not aware of any such list but if you’re interested I can put you in touch with the folks at Intel who can best tell you that.
From SkyWarrior on 2017-08-10
Any AVX-compatible Intel CPU (Sandybridge, Sandybridge EP, Core i3/i5/i7, Xeon E3/E5… and above) should give a decent acceleration, I think. I have seen a nice boost after 3.8 even though I don’t use multithreading in most of my workflow. (I don’t use multithreading other than in BWA and BQSR because I need the bamout in HC, and I noticed (with my humble testing, of course, YMMV) that concurrent sample workflows are faster than multithreading a single sample with all you have. [4 WES samples are completed with annotation and all QC extras in 5 hours on average.])
From EADG on 2017-08-11
Hi @SkyWarrior, @Geraldine_VdAuwera
4 WES samples in 5 hours, that sounds fast. Can you give me a rough description of the system you are using (CPU/memory)?
@Geraldine_VdAuwera
That would be nice; it would also be interesting to know which Intel CPUs support FPGA right now.
Another question is whether Mutect2 also profits from the faster PairHMM calculation.
Greetings EADG
From SkyWarrior on 2017-08-11
Hi @EADG
I am using a Skylake-X i9 7900X with 128GB RAM. My genome and reference VCF files are on an M.2 NVMe SSD and my scratch disk is an 8TB, 256MB cache, 7200RPM spinner. Ubuntu 17.04 and all the regular stuff is loaded.
I am running a maximum of 4 threads per workflow and I run 4 workflows concurrently. This setup finishes 4 samples per 5 hours for 50-60X WES samples and 4 samples per 6-7 hours for 100X WES samples. I can shorten this by about an hour and a half, but that time is usually used to collect data per sample for more advanced analysis stuff like CNV etc…
From Geraldine_VdAuwera on 2017-08-11
Yes, MuTect2 does benefit from the acceleration of PairHMM.
From Faten on 2017-10-13
Hi @Geraldine_VdAuwera,
I ran HaplotypeCaller in GATK version 3.8. May I know if stand_emit_conf is no longer available?
I tried running this command: java -jar GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R hg38.fa -I sortedRGdedupmark2.bam -stand_emit_conf 10 -stand_call_conf 30 -o variants.vcf
The error shown: ##### ERROR MESSAGE: Invalid command line: The parameter standard_min_confidence_threshold_for_emitting is deprecated. This argument is no longer used in GATK versions 3.7 and newer. Please see the online documentation for the latest usage recommendations.
Thank you.
From Sheila on 2017-10-17
@Faten
Hi,
The stand_emit_conf is indeed no longer an option in 3.8. You can only use stand_call_conf. Have a look at [this post](https://gatkforums.broadinstitute.org/gatk/discussion/8692/version-highlights-for-gatk-version-3-7) for more information.
-Sheila
From MattB on 2017-11-06
Hi @Geraldine_VdAuwera and @Sheila, re the doc_archive in GitHub, do you think you could archive the tool documentation? E.g. pages like [this](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php). Just thinking that perhaps some of those arguments and defaults will be changing with the move over to 4.x and it would be good to have the old 3.x ones archived in their final state as of 3.8 along with the other docs.
From MattB on 2017-11-06
Ah, I’ve just seen the dropdown [here](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/), which I’d yet to see because I always google the name of the tool. It might be useful to push that orange dropdown to the actual tool documentation pages, so people are aware the documentation is versioned when landing directly on pages.
From shlee on 2017-11-07
Hi @MattB,
Thanks for voicing that you’ll need access to old 3.x documentation. Rest assured, we also think it is prudent to keep this documentation around, so we are planning to differentiate 3.x and 4.x documentation via different subforums, much like we do for WDL and FireCloud.
We will also keep the orange dropdown to track changes to minor versions for provenance.
From Geraldine_VdAuwera on 2017-11-08
To be clear, the command-line tool docs will continue to be presented as they are today, though we can certainly improve the visibility of the versioning information.
Regarding the “peripheral” documentation, to elaborate a bit on @shlee’s comment, we aim to provide clear distinction, during a forthcoming transition period, between documents that we update for use with GATK 4 (and/or remain equally applicable across versions) vs. documents that only apply to GATK 3 and older versions, which will eventually be archived and deprecated. Some details remain to be determined, but our goal here is to minimize confusion and friction, as much as humanly possible.
From datakid on 2018-10-17
What is the expected EoL for 3.8?
From Geraldine_VdAuwera on 2018-10-19
There will be no further 3.8-x releases (no more code changes, bug fixes etc) and we aim to discontinue support for any new work by Dec 31 2019 — so starting Jan 1 2020 we expect all new work to be done with a 4.x version. However we’ll still answer questions about results that were previously obtained with a 3.x version (we’re not monsters).
From rosesophos on 2018-11-23
Thank you
From rosesophos on 2018-11-23
Very nice article