082. We're officially BFFs with Intel now

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2016-11-17

Here's the scoop. We've been working with Intel engineers for some time now, and we've all been enjoying it so much, we decided to commit to the relationship big time.

As announced in this Broad press release, we are taking our collaboration with Intel to the next level. Specifically, we have joined forces to create the "Intel-Broad Center for Genomic Data Engineering", with an initial five-year mission to build out life sciences tools and infrastructure, and boldly grow the genomics community's ability to collaborate across diverse datasets and analysis platforms in ways that no one has done before.

Ahem. In practice this is going to enable us to bring you some key improvements on three fronts: hardware recommendations, genomics software tools, and cross-infrastructure collaboration.

Let's start with genomics software tools.

You may have heard of this little toolkit we develop called GATK. We like to think it's really quite good at solving genome analysis problems, in the sense that it gives us best-in-class scientific results. But sometimes it's a tad *cough* significantly *cough* slower than it could be, because -- surprise! -- most of us aren't actually software engineers, and our primary goal is to answer scientific questions correctly, not optimize for speed. So although we do try to make the code as efficient as possible (if only because an elegant algorithm is like a ray of sunshine on a dreary post-election November morning) we're not in the business of developing exotic hardware-specific optimizations, FPGAs, GPUs, SSDs and all that alphabet soup. We like to do the smart thing, and here the smart thing is to partner up with seasoned professionals who can do this stuff in their sleep. Enter the Intel team and their Genomics Kernel Library (GKL), an open-source collection of optimized components that act as drop-in replacements for key components of the GATK tools, like PairHMM and IntelDeflater, producing significant speedups given compatible hardware.

But it's not just GATK. The Intel team has been actively contributing to the development of Cromwell, the open-source execution engine that our fellow nerds in Data Science and Data Engineering (DSDE) are building to run analysis pipelines written in WDL. The Cromwell/WDL pairing is a one-two punch that makes it easy to write sophisticated genomic analysis pipelines and execute them across different platforms, on-premises or in the cloud.

Our Intel collaborators have also developed a new kind of database called GenomicsDB that provides a number of benefits, including enabling large-scale production use cases (think ExAC and gnomAD). Right now it's not something that's really trivial to deploy, but we're working on making it easier for anyone to take advantage of the GenomicsDB capabilities within GATK pipelines.

Okay, now let's talk hardware.

But only in the sense of not talking about hardware, because that's something I know very little about -- and I'm perfectly happy to keep it that way. For the most part, GATK development exists in a blissful bubble of not having to worry about hardware all that much. We have coworkers and collaborators who take care of all that for us, and it's awesome. (Yes, we're a bit spoiled. No, we're not sorry.)

Hey, the good news is that, again, the Intel team is here to help. They've been systematically evaluating our Best Practices pipelines from the point of view of resources and requirements, determining where the performance bottlenecks are, which steps can be parallelized most efficiently and how, and generally figuring out the optimal hardware configurations for given use cases. We worked with them to vet the pipeline implementations and help them understand how their observations matched what the tools do under the hood. This led to a first set of recommendations published as a white paper a few months ago, with a follow-up paper slated for imminent release -- which I've been authorized to make available as a sneak preview here. Going forward, the scope of this "reference architecture" effort is going to expand to include more use cases and more types of platforms. We're hoping this will help take the guesswork out of setting up new infrastructure and allow the community to focus on what's important, i.e. the science.

And to conclude, the more complicated part: collaboration across systems and organizations.

Fun fact: the regulations on human data privacy are sufficiently different between the USA and the EU, to mention only those two (yes, we're aware the rest of the world exists), that right now there are valuable scientific studies that cannot happen because the relevant data cannot cross the Pond -- in either direction. Similarly frustrating problems arise across organizational and national borders in a variety of other settings.

One of the most ambitious aspects of Broad's collaboration with Intel, which goes far beyond our little GATK team and our immediate coworkers in DSDE, is the development of “secure multiparty computation” to enable data sharing across different infrastructures, i.e. systems that will enable secure and seamless analysis across datasets that reside in otherwise unconnectable silos. It's a super exciting concept and I look forward to telling you more about it when I get around to understanding more than 10% of how it's supposed to work.

Let's face it, I can't do justice to the vision that underlies this collaboration in a mere two pages -- but I look forward to delivering news of more cool things that will make your work and life better as they become available!

Updated on 2017-01-05

From Geraldine_VdAuwera on 2016-11-18

Oh, and there’s a [YouTube video](https://www.youtube.com/watch?v=RBlRW3xVlO8&sns=tw) of our group leader Anthony Philippakis announcing the Intel-Broad Center.