created by Geraldine_VdAuwera
on 2019-05-03
Part 3 of a series on the theme of synthetic data | Start with part 1: Fake your data -- for Science! and part 2: An exercise in reproducibility (and frustration)
Ever since our original foray into synthetic data creation, I've been looking for an opportunity to follow up on that project. It is absolutely obvious to me that there is a huge unmet need for researcher-friendly synthetic sequence data resources, i.e. generic synthetic datasets you can use off the shelf, plus user-friendly tooling to generate customized datasets on demand. I'm also fairly confident that it wouldn't actually be very hard, technically speaking, to start addressing that need. The catch, though, is that this sort of resource generation work is not part of my remit, beyond what is immediately useful for education, frontline support and outreach purposes. Contrary to a surprisingly common misconception, I don't run the GATK development team! (I just blog about it a lot.)
So when the irrepressible Ben Busby reached out to ask if we wanted to participate in the FAIR Data Hackathon at BioIT World, I recruited a few colleagues and signed us up to do a hackathon project called Bringing the Power of Synthetic Data Generation to the Masses. The overall mission: turn the synthetic data tooling we had developed for the ASHG workshop into a proper community resource.
If you're not familiar with FAIR, it's a set of principles and protocols for making research resources more Findable, Accessible, Interoperable and Reusable. Look it up, it's important.
At this point, experienced hackathoners are laughing their heads off. The truth is, a two-day hackathon doesn't usually afford you enough time to produce a Real, Functional, Substantial piece of work. Even if you're super prepared (which we were, thanks to the efforts of key team members -- shoutout to my colleagues Adelaide Rhodes, Allie Hajian and Anton Kovalsky), the goal is usually to build a prototype as a proof of concept, not a finished product. So it's with that outlook that we defined four buckets of work for the project: 1) scoping out the community's needs, 2) adding functionality to our existing tooling, 3) optimizing the implementation for cost and runtime efficiency, and 4) developing quality control approaches. You can read more about how we defined and split up the work in the project's README on GitHub.
For the computational parts of the work, which involved both batch workflows (aka pipelines) and Jupyter notebooks, we used Terra, the Broad's cloud-based analysis platform, with a supporting grant from Google EDU (in the form of Google Cloud credits) to cover compute and storage costs for the project (which are billed directly by Google). You can check out the public workspace we put together for the project here. It contains the cohort of 100 synthetic exomes that we had previously created, the workflows used to generate them, as well as a workflow and a prototype notebook for collecting and analyzing sequence quality control metrics. All the code is also in the GitHub repository, but the nice thing about the Terra workspace is that you can see how the code gets applied to the data, and you have the option to clone it and run/modify as much of it as you like. Terra itself is completely free and open to all, and every new account comes with $300 in Google credits, so you can try it out and really kick the tires of the project.
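To give a flavor of what the QC notebook side of that looks like, here is a minimal Python (pandas) sketch, assuming the metrics-collection workflow has written a per-sample table of quality metrics. The file name, column names and thresholds below are purely illustrative assumptions, not the actual schema used in the workspace.

```python
import pandas as pd

# Hypothetical per-sample QC metrics table exported by the metrics-collection
# workflow; file name and columns are illustrative, not the workspace's schema.
metrics = pd.read_csv("synthetic_exome_qc_metrics.tsv", sep="\t")

# Summarize coverage and callability across the synthetic exome cohort.
print(metrics[["mean_target_coverage", "pct_target_bases_20x"]].describe())

# Flag samples outside simple, arbitrary thresholds so they can be reviewed
# (or regenerated) before the cohort is used for testing or teaching.
flagged = metrics[
    (metrics["mean_target_coverage"] < 30)
    | (metrics["pct_target_bases_20x"] < 0.9)
]
print(flagged[["sample_id", "mean_target_coverage", "pct_target_bases_20x"]])
```

The real prototype in the workspace goes further (plots, comparisons against the original real-data cohort), but the basic idea is the same: pull the workflow's metric outputs into a dataframe and sanity-check the synthetic samples before anyone builds on them.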
So how did it go, you ask? We ended up with a team of 12 hackathoners from various backgrounds including publishing, data science and software engineering, and that was really a great mix given our objectives. We had plenty of work for both coding and non-coding types! Since we had a robust outline of what we wanted to do, we were able to get started fairly quickly; yet our plans were flexible enough to incorporate ideas and suggestions from the non-Broadies who joined our team with their own perspectives. That really enriched the experience and made the end results better.
Speaking of which, our team ultimately made progress on all four of the fronts that we had planned to tackle, as you can see in the summary report presentation. Given the level of interest that has bubbled up around this project, we're planning to write it up in more detail in a white paper in the immediate future.
I was really pleased by the overall positive response to the synthetic data generation approach we presented. It's something we gloss over in the workshops where we use the dataset we originally created, so I had some trepidation about how it would be received by an audience that is perhaps predisposed to examine this sort of thing more thoroughly. There are definitely some big outstanding questions about where we go from here in terms of generating larger cohorts, which will require generating "fake people" (as opposed to the "real people, fake data" approach we used as a convenient hack), and about how far we can push the realism of the synthetic data (e.g. whether we can model quality fluctuations in the low-confidence intervals from Genome in a Bottle). At the risk of sounding like a bandwagon-jumper, I suspect the answers to both questions lie in machine learning approaches. I would love to see if we can get to the point where we can feed a database of human variation like gnomAD to an ML algorithm that spits out novel, realistic synthetic VCFs on demand, with population-appropriate profiles. Probably more a question of when than if, in fact. Similarly, I would be surprised (mildly shocked, even?) if there were not already work being done to use ML techniques to improve the realism of read data simulation software, particularly with regard to different sequencing technologies, but also in relation to regions of the genome that can be more or less problematic.
Going forward, my hope is that we can nucleate a community-driven effort to pursue this work at a larger scale, i.e. move beyond the prototype. I'm confident that together we can build valuable resources that enable developers, researchers and educators to leverage synthetic sequence data for testing, collaboration and teaching. If you're interested in contributing to this effort, please leave a comment on this post or email me at geraldine@broadinstitute.org.
My heartfelt thanks to all the Broadies who contributed to this project as well as our hackathon friends Ernesto Andrianantoandro, Dan Rozelle, Jay Moore, Rory Davidson, Roma Kurilov and Vrinda Pareek!
Updated on 2019-05-07