created by GATK_Team
on 2017-12-24
To date we have published three peer-reviewed papers on GATK, plus a preprint on bioRxiv (citation details below). We provide a brief description of each paper below; you're welcome to choose whichever paper best represents the aspect of GATK you relied on in your work. Keep in mind, however, that the tools have evolved significantly over time, so the older papers are of limited value at this point beyond acknowledging the authorship of GATK. For reproducibility purposes, the Poplin et al. 2017 preprint is the most appropriate reference for germline short variant work done with recent versions (3.x to 4.x). Unfortunately, we don’t currently have good references for the other use cases (such as somatic analysis). When describing the analysis you did in your own papers, there is ultimately no substitute for sharing your pipelines and any code you used.
The fourth paper is technically just a manuscript deposited in bioRxiv, but it counts! This is a good citation to include in a Materials and Methods section, or in a Discussion if you're talking about the joint calling process.
Scaling accurate genetic variant discovery to tens of thousands of samples. Ryan Poplin, Valentin Ruano-Rubio, Mark A. DePristo, Tim J. Fennell, Mauricio O. Carneiro, Geraldine A. Van der Auwera, David E. Kling, Laura D. Gauthier, Ami Levy-Moonshine, David Roazen, Khalid Shakir, Joel Thibault, Sheila Chandran, Chris Whelan, Monkol Lek, Stacey Gabriel, Mark J. Daly, Benjamin Neale, Daniel G. MacArthur, Eric Banks, 2017 bioRxiv
The third GATK paper describes the Best Practices for Variant Discovery (version 2.x). It was intended mainly as a learning resource for first-time users and as a protocol reference. This was a good citation to include in a Materials and Methods section for older versions (2.x), but it is now very out of date and is no longer appropriate as the sole reference for work done with later versions.
From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33
Remember that as our work continues and our Best Practices recommendations evolve, specific command lines, argument values and even tool choices described in the paper become obsolete. Be sure to always refer to our Best Practices documentation for the most up-to-date and version-appropriate recommendations.
The second GATK paper describes in more detail some of the key tools commonly used in the GATK for high-throughput sequencing data processing and variant discovery. The paper covers base quality score recalibration, indel realignment, SNP calling with UnifiedGenotyper, variant quality score recalibration and their application to deep whole genome, whole exome, and low-pass multi-sample calling. This was a good citation for an overview of germline short variant discovery with versions 1.x through 2.x but is now severely out of date and no longer appropriate for work done with later versions.
A framework for variation discovery and genotyping using next-generation DNA sequencing data. DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M, 2011 NATURE GENETICS 43:491-498
Note that the workflow described in this paper corresponds to the version 1.x to 2.x best practices. Some key steps for variant discovery have been significantly modified in later versions (3.x onwards). This paper should not be used as a definitive guide to variant discovery with GATK. For that, please see our online documentation guide.
The first GATK paper covers the computational philosophy underlying the GATK and is a good citation for the GATK in general.
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303
For example, a Methods paragraph citing these papers might read as follows: We sequenced 10 samples on 10 lanes on an Illumina HiSeq 2000, aligned the resulting reads to the hg19 reference genome with BWA (Li & Durbin), applied GATK (McKenna et al., 2010) base quality score recalibration, indel realignment, and duplicate removal, and performed SNP and indel discovery and genotyping across all 10 samples simultaneously, using standard hard filtering parameters or variant quality score recalibration according to GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera et al., 2013).
Updated on 2019-11-07