created by Geraldine_VdAuwera
on 2019-04-27
Part 1 of a series on the theme of synthetic data
Don't get me wrong, I'm not suddenly advocating for fraudulent research. What I'm talking about is creating synthetic sequence data for testing pipelines, sharing tools and generally increasing the computational reproducibility of published studies, so that we can all more easily build on each other's work.
The majority of the effort around computational reproducibility has so far focused on better ways to share and run code, as far as I can tell. With great results -- it's been transformative to see the community adopt tooling like version control, containers and Jupyter notebooks. Yet you can give me all the containers and notebooks in the world; if I don't have appropriate data to run that code on, none of it helps me.
Most of the genomic data that gets generated for human biomedical research is subject to very strict access restrictions. These protections exist for good reason, but on the downside, they make it much harder to train researchers in key methodologies until after they have been granted access to specific datasets -- if they can get access at all. There are certainly open datasets like 1000 Genomes and ENCODE that can be used beyond their original research purposes for some types of training and testing. However, they don't cover the full range of what is needed in the field in terms of technical characteristics (e.g. exome vs. WGS, depth of coverage, number of samples for scale testing, etc.); not by a long way.
That's where fake data comes in -- we can create synthetic datasets to use as proxies for the real data. This is not a new idea, of course; people have been using synthetic data for some time, as in the ICGC-TCGA DREAM Mutation challenges, and there is already a rather impressive range of command-line software packages available for generating synthetic genomic data. It's even possible to introduce (or "spike in") variants into sequencing data, real or fake, on demand. So that's all pretty cool. But in practice these packages tend to be used mostly by savvy tool developers for small-scale testing and benchmarking purposes, and rarely (if ever? send me links!) by biomedical researchers for providing reproducible research supplements.
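To make the generation side concrete, here's a toy sketch of the core idea behind a read simulator -- a minimal Python illustration with made-up names throughout, not the API of any of the real packages (which model platform-specific error profiles, realistic quality scores, paired ends and much more):

```python
# Toy read simulator: a minimal sketch of what tools like wgsim or ART do
# at their core. For illustration only -- all names here are made up.
import random

def simulate_reads(reference, n_reads, read_length=100, error_rate=0.001, seed=42):
    """Sample fixed-length reads uniformly from a reference sequence,
    introducing random substitution errors at the given rate."""
    rng = random.Random(seed)
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for j in range(read_length):
            if rng.random() < error_rate:
                read[j] = rng.choice([b for b in "ACGT" if b != read[j]])
        # Emit a FASTQ record with a constant (fake) base quality of 'I' (Q40).
        yield f"@sim_read_{i}\n{''.join(read)}\n+\n{'I' * read_length}\n"

# Example: simulate a handful of reads from a random 10 kb "chromosome".
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(10_000))
with open("simulated.fastq", "w") as fq:
    fq.writelines(simulate_reads(reference, n_reads=5))
```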
And frankly, it's no surprise. It's actually kinda hard.
Next up: An exercise in reproducibility (and frustration)
Updated on 2019-04-29
From SkyWarrior on 2019-04-27
I feel exactly the same way in my research currently. Generating your own simulated data is good. Publishing that is also good. But when you run a similar simulation and cannot reproduce the results of a previous publication, that's brain damage (almost permanent)! I wrote a small R script to generate fake VCFs with designated runs of homozygosity for my research, and I can share it if others need it.
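Something along these lines -- sketched below in Python purely for illustration (my actual script is in R, but the logic is the same):

```python
# Minimal sketch: write a fake single-sample VCF in which genotypes inside
# a designated window are forced homozygous, producing a run of homozygosity.
# All names and parameters here are illustrative.
import random

def write_fake_vcf(path, chrom="chr1", n_sites=200, spacing=5000,
                   roh_start=50, roh_end=150, seed=1):
    """Emit evenly spaced biallelic SNPs; sites with index in
    [roh_start, roh_end) get only homozygous genotypes (0/0 or 1/1),
    all other sites may also be heterozygous."""
    rng = random.Random(seed)
    with open(path, "w") as vcf:
        vcf.write("##fileformat=VCFv4.2\n")
        vcf.write(f"##contig=<ID={chrom}>\n")
        vcf.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n")
        for i in range(n_sites):
            pos = (i + 1) * spacing
            ref, alt = rng.sample("ACGT", 2)
            if roh_start <= i < roh_end:
                gt = rng.choice(["0/0", "1/1"])   # inside the ROH: homozygous only
            else:
                gt = rng.choice(["0/0", "0/1", "1/1"])
            vcf.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t50\tPASS\t.\tGT\t{gt}\n")

write_fake_vcf("fake_roh.vcf")
```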
From Geraldine_VdAuwera on 2019-04-29
Hah yes indeed — see part 2 for the project that gave me brain damage :dizzy: