Using depth of coverage metrics for variant evaluation

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2014-10-17

Overview

This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.

The metrics involved are the following:

    • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
    • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

For an overview of the tools and concepts involved in sequence coverage analysis, which aims to answer the common question "(Where) do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.

Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

    • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
    • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and, in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
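To make the relationship between these annotations concrete, here is a minimal sketch that reads the AD, DP and PL values from the FORMAT and sample columns of a single VCF record. The record values are invented for illustration, the helper names are hypothetical, and the site is assumed to be biallelic and diploid so that the PL values are ordered 0/0, 0/1, 1/1.

```python
# Hypothetical FORMAT and sample columns from one VCF record (values invented).
FORMAT = "GT:AD:DP:GQ:PL"
SAMPLE = "0/1:12,9:19:99:255,0,340"


def parse_sample(format_col, sample_col):
    """Pair each FORMAT key with the sample's value for that key."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def to_ints(value):
    """Split a comma-separated VCF value into integers."""
    return [int(v) for v in value.split(",")]


fields = parse_sample(FORMAT, SAMPLE)
ad = to_ints(fields["AD"])   # per-allele depths, REF first, then ALT
dp = int(fields["DP"])       # sample-level depth
pl = to_ints(fields["PL"])   # phred-scaled likelihoods for 0/0, 0/1, 1/1

# Assumption: biallelic diploid site, so PL order is 0/0, 0/1, 1/1.
# The most likely genotype is the one with the smallest PL.
genotypes = ["0/0", "0/1", "1/1"]
best_genotype = genotypes[pl.index(min(pl))]

print("AD:", ad, "sum:", sum(ad))
print("DP:", dp, "difference (sum(AD) - DP):", sum(ad) - dp)
print("Genotype with lowest PL:", best_genotype)
```

In this made-up record the sum of AD and the sample-level DP differ, which is expected when the two annotations count different sets of reads. The genotype is read off the PLs, not off the AD values.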

TO BE CONTINUED...

Updated on 2015-07-06

From Richard_Pearson on 2015-11-25

I would like to make a feature request for a filtered depth of coverage of each allele per sample. I work with Plasmodium samples, which are typically a mixture of an unknown number of haploid strains in unknown proportions. I think I prefer to call these in haploid mode (ploidy 1), so the GT is then a reflection of the likely “majority call”. However, I would also like to estimate the fractional proportions of each allele in each sample at each variant site. At present I am using the unfiltered allele depths contained in AD to do this. However, I’m thinking this would perhaps be more accurate if using the filtered depths, using the same filtering as applied when creating the sample-level DP. Would it be possible to include a new sample-level annotation (perhaps FAD?) that would give this filtered depth of coverage for each allele in each sample? For each sample the sum of FAD would be equal to the DP for that sample.
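Editorial illustration: the allele-fraction estimate described in this post could be sketched as below. This is only an assumption about the calculation, using the unfiltered AD values as the post describes; the function name and the example depths are hypothetical.

```python
def allele_fractions(ad):
    """Return each allele's share of the total allele depth.

    `ad` is the list of per-allele depths from the AD annotation
    (REF first, then each ALT). If no reads were counted, the
    fractions are undefined and None is returned for every allele.
    """
    total = sum(ad)
    if total == 0:
        return [None] * len(ad)
    return [depth / total for depth in ad]


# Hypothetical AD values for one sample at one site.
example_ad = [30, 10]
fractions = allele_fractions(example_ad)
print(fractions)  # [0.75, 0.25]
```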

From Sheila on 2015-11-30

@Richard_Pearson

Hi,

We do have an annotation called StrandAlleleCountsBySample that does what you are asking. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandAlleleCountsBySample.php You will have to do the extra step of adding the counts of reads that support the alleles on the forward and reverse strands.

-Sheila

From Richard_Pearson on 2016-01-13

Thanks Sheila, and apologies for my late response, I’ve only just seen your reply. I did see StrandAlleleCountsBySample, but assumed this was the same as DepthPerAlleleBySample in that it would return unfiltered counts, partly based on the fact that in the example given the values for SAC (1,0,3,15,4,8) add up to give the values in AD (1,18,12). Please could you confirm that StrandAlleleCountsBySample does indeed give filtered counts whereas DepthPerAlleleBySample gives unfiltered counts? If this is the case, it might be good to update the documentation for StrandAlleleCountsBySample to make this explicit.

Many thanks! Richard
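Editorial illustration: the arithmetic mentioned above can be checked with a short sketch. It assumes that SAC lists a forward-strand count followed by a reverse-strand count for each allele in turn, which is consistent with the example values quoted in the post; the function name is hypothetical.

```python
def sac_to_allele_depths(sac):
    """Sum forward and reverse strand counts for each allele.

    Assumes `sac` is laid out as consecutive (forward, reverse)
    pairs, one pair per allele.
    """
    if len(sac) % 2 != 0:
        raise ValueError("SAC must contain an even number of values")
    return [sac[i] + sac[i + 1] for i in range(0, len(sac), 2)]


sac = [1, 0, 3, 15, 4, 8]           # SAC values quoted in the post
summed = sac_to_allele_depths(sac)
print(summed)                        # [1, 18, 12], matching the quoted AD
```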

From Geraldine_VdAuwera on 2016-01-15

Hi Richard, recommending SAC was my idea but on second thought I may be wrong. I need to check a few things and get back to you.

From Sheila on 2016-01-25

@Richard_Pearson

Hi Richard,

No, SAC does not give filtered counts. It is unfiltered like AD. Let me see if I can put this in as a feature request. Unfortunately, our developers are quite busy right now and won’t be able to get to this very soon. However, we are very happy to look at a patch you submit :smile:

http://gatkforums.broadinstitute.org/gatk/discussion/1267/how-can-i-submit-a-patch-to-the-gatk-codebase

-Sheila

From Sheila on 2016-01-26

@Richard_Pearson

Hi again Richard,

It seems there may be some hope for this. One of the developers is working on something similar to what you are asking. I’m not sure when it will be available, however.

-Sheila

From Richard_Pearson on 2016-01-28

Sounds positive, thanks both!

From mbxat1 on 2016-06-10

Hi,

Should I be worried about read length when calculating depth of coverage using GATK? I have samples of the same species, some sequenced with a read length of 100bp and others with 150bp. I am using GATK version 3.4, and the mean depth calculated is lower than I expected for the samples sequenced with a read length of 150bp.

Thank you in anticipation of your response

From Sheila on 2016-06-13

@mbxat1

Hi,

Can you give us some more details about the differences? How much of a difference is there? The major issue I can think of is that VQSR runs on the assumption that the sample annotations are all distributed in the same way. So, if your depths are different between the samples, that can cause some issues.

-Sheila

From Geraldine_VdAuwera on 2016-06-18

@mbxat1, the length of reads is not taken into account by DepthOfCoverage. The tool simply looks at how many bases cover each position.

If you’re getting surprising results for the mean depth, you need to look at the distribution of coverage. Unevenness of coverage could affect your ability to call variants confidently. The tool produces a histogram file that can be useful in interpreting this.
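Editorial illustration: the idea of counting how many bases cover each position, and then looking at the distribution and not only the mean, can be sketched as follows. This is not the DepthOfCoverage implementation; the read intervals are invented, coordinates are 0-based half-open, and the function name is hypothetical.

```python
from collections import Counter


def per_base_depth(reads, region_start, region_end):
    """Count how many reads cover each position of a region.

    `reads` is a list of (start, end) alignment intervals, 0-based
    and half-open. Read length only matters through the positions
    each read happens to cover.
    """
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


# Hypothetical reads of different lengths over a 10-base region.
reads = [(0, 4), (2, 6), (2, 10)]
depth = per_base_depth(reads, 0, 10)
mean_depth = sum(depth) / len(depth)
histogram = Counter(depth)  # depth value -> number of positions

print("per-base depth:", depth)
print("mean depth:", mean_depth)
print("histogram:", sorted(histogram.items()))
```

In this toy example the mean depth is 1.6, yet most positions are covered only once and two positions are covered three times, which is the kind of unevenness a histogram makes visible.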

From isaac_joseph on 2016-08-03

Greetings. I am wondering why AD might be missing from the FORMAT field for a very small minority of variants (1 out of ≈ 15,000) after using HaplotypeCaller. Any insight? Thanks!

From Sheila on 2016-08-04

@isaac_joseph

Hi,

This is a known issue. You can keep track of it [here](https://software.broadinstitute.org/gatk/documentation/issue-tracker).

-Sheila

From tytolin on 2018-10-09

Hello, GATK

I’m interested in filtering by allele depth in a VCF file containing multiple samples.

Theoretically, if a sample is heterozygous at a SNP site with a depth of 10, I would expect to see an allele depth of 5,5. However, in my VCF containing multiple samples, some samples are called heterozygous but have an allele depth of 2,10. Should I change those samples to homozygous alternate in the VCF file?