created by Geraldine_VdAuwera
on 2019-09-30
It’s a beautiful early autumn day in New England, with small patches of vibrant reds and yellows in the foliage just hinting at the fiery displays to come. Perfect weather for me to de-lurk and bring you some news! (I promise it’s not GATK5)
The long and short of it (but mostly the short) is that we’ve started collaborating with the DRAGEN team at Illumina, led by Rami Mehio, to improve GATK tools and pipelines. There’s a [press release](https://www.broadinstitute.org/node/618156) if you want the official announcement, or you can read on to get the long version from the GATK team’s perspective.
——
If you’re not familiar with DRAGEN, the name stands for Dynamic Read Analysis for GENomics and refers to a secondary analysis platform originally created by a company called Edico Genome, which was acquired by Illumina last year. The DRAGEN team became widely known for making genomic data processing insanely fast on special hardware, but they’re not just a speed shop. They have top-notch computational biology expertise: when they reimplemented GATK tools like HaplotypeCaller in DRAGEN, they made some clever tweaks that improved the scientific accuracy of the results. They’ve done this for other tools as well, and they’ve also developed their own novel algorithms for other use cases.
That alone is already a big motivation for us to team up with them: they have great ideas for improving our tools and pipelines, and they’re willing to share them. Works for us! Then there’s the bigger picture of what this means for the kind of research we are working to enable. Both of our teams feel pretty strongly that as the amount of genomic data generation snowballs, particularly in the biomedical field, it’s really important to ensure that the results of different studies can be cross-analyzed. For that to be possible, we need to standardize secondary analysis as much as possible to minimize batch effects. We believe that by working together to consolidate our methods and pipeline development efforts, we can remove a major source of heterogeneity in the ecosystem.
So what does that mean in practice?
Rest assured GATK itself is still going to be GATK, developed by our team at the Broad and released under the same BSD-3 open-source license you know and love. Any improvements that the DRAGEN team contributes to GATK tools will be integrated into the GATK codebase under the same BSD-3 license.
Beyond code improvements to GATK itself, there will also be some changes to the composition of the Best Practices pipelines. For example, we’re going to replace BWA with the DRAGEN aligner, which is quite a bit faster, in our DNA pre-processing pipelines (full details and benchmarking results to follow). To reflect the collaborative nature of the work, any pipelines we co-develop with the DRAGEN team will be named DRAGEN-GATK Best Practices.
All the software involved in the DRAGEN-GATK pipelines will be fully open source and available on GitHub, including a new open source version of the DRAGEN aligner, and we’ll continue to publish WDL workflows for every pipeline on GitHub and in Terra workspaces. Importantly, it will all still be runnable on normal hardware, whether you’re doing your work on a local server, on-premises HPC or in the cloud. We’ll also continue to provide free support for all GATK tools and pipelines, and as part of that we’re going to work with the DRAGEN team to make sure we can provide the same level of high-quality support for the tools that they provide.

The DRAGEN team also plans to produce a hardware-accelerated version of any DRAGEN-GATK Best Practices pipeline that we co-develop, which Illumina will offer on the commercial DRAGEN system. We won’t touch that work at all (it’s not our jam), but we will run comparative evaluations to validate that the hardware-accelerated version of any given pipeline produces results that are functionally equivalent to the “universal” open source software version. To be clear, it won’t be just a rubber-stamp approval; we’re highly motivated to make sure that the pipeline implementations are functionally equivalent because our colleagues in the Broad’s Genomics Platform are planning to switch some of the Broad’s production pipelines to the DRAGEN hardware version for projects where speed is a critical factor.
On that note, what I personally find the most exciting about this partnership is that going forward, everyone in the research community will be able to take advantage of the best ideas from both our teams regardless of whether they want the “regular” software or a hardware-accelerated version. You could even switch between the two within the course of a project and still be able to cross-analyze the outputs. Over the years, I’ve had to tell a lot of people “sorry, you’re going to have to reprocess everything with the same pipeline” so this feels like a huge step in the right direction.
Okay, this sounds great — so when will the improved tools and pipelines be available?
We’re already actively working on porting over improvements from the DRAGEN team, so if you follow the GATK repository on GitHub you should start seeing relevant commits and pull requests any day now. Barring any unforeseen complications, the tool improvements should roll out into regular GATK releases over the next couple of months, and we expect to release the first full DRAGEN-GATK pipeline (for germline short variants) in the first quarter of 2020. We’ll post updates here on the blog about how it’s going and what you can expect to see as the code rolls in and the release calendar firms up.
In the meantime, don’t hesitate to reach out to us if you have any questions that aren’t addressed here or in the press release. Note that if you’re going to be at the ASHG meeting in Houston later this month, Angel Pizarro and I will be talking about this collaboration at the [Illumina Informatics Summit](http://eventregistration.illumina.com/ASHGInformatics2019) that precedes the conference on Tuesday Oct 15, and I will be available at the Broad Genomics booth in the exhibit hall at ASHG itself on Wednesday Oct 16 if you’d like to discuss this in person. I hope to see a lot of you there!
Updated on 2019-10-12
From SkyWarrior on 2019-09-30
I cannot wait to test the DRAGEN-aligner!
From matdmset on 2019-10-01
Can we get notified when either part is released? I’m very curious about the performance benchmarks between DRAGEN and BWA.
From Geraldine_VdAuwera on 2019-10-01
Hi @matdmset, yes, we’ll put out some updates on the blog (and Twitter) as new information and data roll out, including some benchmarking results. We’re working on a plan to do that with the DRAGEN team.
From raonyguimaraes on 2019-10-04
Hi Geraldine, great news!
From nans on 2019-10-10
Exciting times!! @Geraldine_VdAuwera Will the variant caller also change to Dragen’s caller, or does HC still hold good?
From Geraldine_VdAuwera on 2019-10-10
Hi @nans, for germline short variants it’ll still be HaplotypeCaller, with a few tweaks contributed by the DRAGEN team that improve accuracy.
From Geraldine_VdAuwera on 2019-10-16
Hi all, we’ve put together a short five-question survey to assess how you would prefer to receive updates about DRAGEN-GATK, which includes an option to sign up for a mailing list and/or newsletter. Please fill it out and pass it along to any colleagues who might be interested so that we can tailor our communications plan accordingly. Thanks!
https://www.surveymonkey.com/r/GK9YZ2B
From nans_bn on 2019-11-22
Hi @Geraldine_VdAuwera, just adding to my comment above… would like to see the impact on how MNVs are called when using Dragen. Currently we use GATKv4 for clinical diagnostics and MNVs are called as separate variants (as most tools)… it would be interesting to see if this improves with Dragen
From Geraldine_VdAuwera on 2019-12-08
Hi @nans_bn, sorry for the late response. Have you tried the (relatively new) [MNV merging option (--max-mnp-distance)](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--max-mnp-distance) in HaplotypeCaller? That was added to address the need you describe. I’m not sure what range of options Dragen offers on that front but we can find out if needed.
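For anyone who wants to try the option mentioned above, here is a minimal sketch of how the flag might be passed to HaplotypeCaller. It only builds and prints the command line; it does not run GATK. The file names and the helper function `haplotypecaller_command` are hypothetical placeholders, and the distance value of 1 is just an illustrative choice, not a recommendation.

```python
import shlex


def haplotypecaller_command(reference, bam, output_vcf, max_mnp_distance=1):
    """Build (but do not run) a GATK4 HaplotypeCaller command line.

    All paths are hypothetical placeholders supplied by the caller.
    """
    if max_mnp_distance < 0:
        raise ValueError("max_mnp_distance must be zero or positive")
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,      # reference FASTA
        "-I", bam,            # input aligned reads
        "-O", output_vcf,     # output VCF
        # Flag named in the comment above; 0 leaves MNP merging off.
        "--max-mnp-distance", str(max_mnp_distance),
    ]


cmd = haplotypecaller_command("reference.fasta", "sample.bam", "sample.vcf.gz")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Comparing the output VCF against a run without the flag (or with a value of 0) should show whether nearby phased substitutions are now reported as single MNV records in your data.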