Introduction to the GATK Best Practices workflows

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2014-04-16

This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

The “GATK Best Practices” are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the [GATK development team](http://www.broadinstitute.org/gatk/about/who-we-are) over years of analysis work on many of the Broad Institute’s sequencing projects, and are applied in the Broad’s production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately, we are not able to provide official recommendations on how to deal with very different experimental designs or divergent data types (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware that some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Updated on 2016-01-27

From michaelchao on 2014-10-01

Dear GATK,

We have targeted sequencing for 8 unique samples, and for each sample we have 3 technical replicates (total N = 24). Each of the 24 samples was processed as an independent sample/individual. We would like to pool the 3 technical replicates together, so we can have maximum depth of coverage. We hope this will give us better indel calling.

Our questions are:

Do we merge the 3 technical replicates at the FASTQ file stage (pre-alignment)? Or, do we merge the 3 technical replicates at the BAM file stage (post-alignment)?

Thank you for your help.

From Geraldine_VdAuwera on 2014-10-02

Hi @michaelchao,

I’m assuming that by technical replicates you mean separate preps/libraries coming from the same DNA samples, so correct me if that’s not what you meant.

It mostly depends on whether these replicates were sequenced on the same lane or in different ones. If they were on different lanes of the machine, they may be subject to different mechanical biases, so you would want to recalibrate them separately. Generally speaking, to be safe I would recommend treating them as different read groups anyway, so that if down the road you find there was a systematic problem with one, you can easily exclude or discount it. So you would map them separately, then give them different read group IDs, but the same sample name (SM tag). Then you can pass the resulting bam files together (per sample) to the indel realignment tools, which will produce a single per-sample bam file. From then on you can easily continue processing each sample bam separately up to the joint analysis part. Does that make sense?
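To make the read group layout above concrete, here is a minimal Python sketch, not an official GATK recipe. It builds one SAM `@RG` header line per technical replicate, with a unique ID but a shared SM tag, and then groups the per-replicate BAM files by sample so they can be passed together to the realignment step. The sample names, file names, the `LB` naming scheme and the helper `read_group_line` are all hypothetical, and the sketch assumes each replicate is a separate library.

```python
from collections import defaultdict


def read_group_line(sample, replicate, platform="ILLUMINA"):
    """Build a SAM @RG header line for one technical replicate.

    Hypothetical helper: the ID and LB naming schemes are illustrative only.
    """
    fields = [
        "@RG",
        f"ID:{sample}.rep{replicate}",   # unique per replicate
        f"SM:{sample}",                  # shared by all replicates of a sample
        f"LB:{sample}.lib{replicate}",   # assumes one library per replicate
        f"PL:{platform}",
    ]
    return "\t".join(fields)


# 8 samples x 3 technical replicates = 24 read groups (names are made up).
samples = [f"sample{i}" for i in range(1, 9)]
replicates = [1, 2, 3]

read_groups = {
    (sample, rep): read_group_line(sample, rep)
    for sample in samples
    for rep in replicates
}

# Group the per-replicate BAM files by sample; each list would be passed
# together to the per-sample realignment step (file names are hypothetical).
bams_by_sample = defaultdict(list)
for sample, rep in read_groups:
    bams_by_sample[sample].append(f"{sample}.rep{rep}.bam")

for sample in samples:
    print(sample, bams_by_sample[sample])
```

Keeping the replicates as separate read groups means a problematic replicate can later be excluded by its ID, while the shared SM tag ensures the reads are still treated as one sample downstream.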

From michaelchao on 2014-10-02

Thank you very much Geraldine, yes, it makes sense. I’ll give it a try.

From KKND on 2019-05-30

did it deprecate the preprocessing bamfile?