Best Practices for Variant Discovery in DNAseq

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2013-09-13

This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

The workflow is divided in three main sections that are meant to be performed sequentially:

- Pre-processing: from raw DNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)

- Variant discovery: from reads (BAM files) to variants (VCF files)

- Refinement and evaluation: genotype refinement, functional annotation and callset QC

Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Local Realignment Around Indels; and Base Quality Score Recalibration (BQSR); performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample’s data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.

Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.

Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project’s requirements.

Important note on GATK versions

The [Best Practices](http://www.broadinstitute.org/gatk/guide/best-practices) have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section) to ensure that you are using the appropriate recommendations for your version.

Updated on 2015-05-16

From albertoap on 2014-05-01

Hi,

I don’t know if this is the right place to post this comment, but when using Firefox (v 29.0 Linux 64 bits) the “DNAseq” and “RNAseq” buttons don’t work properly and I can’t get additional information for the different steps.

It works properly using Chromium.

Alberto

From Geraldine_VdAuwera on 2014-05-01

Thanks for reporting this bug —I’ll try to get that fixed asap.

From crojo on 2014-05-25

Just fyi: the bug reported by @albertoap still seems to exist. In case it helps….

From Geraldine_VdAuwera on 2014-05-27

@crojo, can you tell me what is your browser/platform version?

From crojo on 2014-06-01

@Geraldine_VdAuwera‌ sure. I’m using Firefox v 29.0.1, on Mac OS X 10.7.5.

From JorgeAmigo on 2014-06-06

buttons don’t work neither on Firefox v30 beta on Windows 8.1 64 bits, although tabs do work perfectly and all information seems to be accessible through them.

From Geraldine_VdAuwera on 2014-06-06

OK, I have a fix for this but can’t apply it yet due to other pending modifications. Should be able to get this resolved next week. In the meantime, the content of these pages can be accessed directly through the forum in the GATK Documentation / Best Practices subcategories, for anyone who is completely unable to view these pages. Sorry for the inconvenience and thanks for your patience!

From mmterpstra on 2014-06-27

> Should be able to get this resolved next week

@Geraldine_VdAuwera: Are the buttons still broken? Or should i update my firefox v20? Maybe both :)

From Geraldine_VdAuwera on 2014-06-27

@mmterpstra‌ Not updated yet, sorry — ran into a couple of issues that I need to fix before pushing changes to public. Maybe consider using Chrome in the meantime :)

From geoffroyGATK on 2014-10-06

It is still broken with Firefox 32.0.3 on Windows 7.

Can you push changes to public?

From Geraldine_VdAuwera on 2014-10-07

@geoffroyGATK Sorry, this got pushed to the backburner. I’ll try to deal with it this week as we have a couple of other fixes pending.

From weberATillinoisedu on 2014-11-11

DNAseq and RNAseq buttons are still broken – at least in Firefox 33.0 under mac OS 10.8. Since the permanent fix may be a ways off yet, here are the links to the two sections for others who prefer to work in Firefox: [DNAseq](https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq) and [RNAseq](https://www.broadinstitute.org/gatk/guide/best-practices?bpm=RNAseq). Hope this helps others while this is being fixed!

Also, Firefox displays the “broken image” symbol on some of the subsections of the RNAseq pages. These don’t show up when using Safari (which may simply ignore broken images).

From abrahamdsl on 2015-05-20

I can confirm that it it still unresolved as of now. Better to just use Chrome. :/

From Geraldine_VdAuwera on 2015-05-20

Sorry folks. Turned out to be a little trickier than I thought, and in the meantime we got some resources allocated to do a complete redesign of the best practices page. It’s going to be much easier to navigate.

From mjtiv on 2018-04-11

Question? On the RNA-Seq pipeline it says “realignment around indels” is “optional”. Is this statement applicable for the DNA-Seq pipeline too (run if you have time/computational)? Or is there a specific reason it is needed to be run in DNA versus RNA?

From Sheila on 2018-04-17

@mjtiv

Hi,

It is now optional for both. Have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=7847).

-Sheila

From valdezkm on 2018-05-02

Greetings!

Where can I find the command line scripts for this workflow, similar to the RNA-seq workflow found here https://software.broadinstitute.org/gatk/documentation/article.php?id=3891?

Thanks,

Kristin

From Sheila on 2018-05-08

@valdezkm

Greetings Kristin :smile:

The workflows in [this repo](https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels) should help.

-Sheila

From Werefurser on 2018-06-29

Wowo, I like it Man !!!