howto Evaluate a callset with CollectVariantCallingMetrics

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by KateN

on 2015-09-23

Context

This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many additional metrics to those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits to each tool, as well as the basis for variant evaluation, are discussed here.

Example Use

Command

java -jar picard.jar CollectVariantCallingMetrics \ INPUT=CEUtrio.vcf \ OUTPUT=CEUtrioMetrics \ DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf

INPUT The CEU trio (NA12892, NA12891, and 12878) from the 1000 Genomes Project is the input chosen for this example. It is the callset that we wish to examine the metrics on, and thus this is the field where you would specify the .vcf file containing your sample(s)'s variant calls.
OUTPUT The output for this command will be written to two files named CEUtrioMetrics.variantcallingsummarymetrics and CEUtrioMetrics.variantcallingdetailmetrics, hereafter referred to as "summary" and "detail", respectively. The specification given in this field is applied as the name of the out files; the file extensions are provided by the tool itself.
DBSNP The last required input to run this tool is a dbSNP file. The one used here is available in the current GATK bundle. CollectVariantCallingMetrics utilizes this dbSNP file as a base of comparison against the sample(s) present in your vcf.

Getting Results

After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as a text file if needed, or they can be read in as a table using your preferred spreadsheet viewer (e.g. Excel) or scripting language of your choice (e.g. python, R, etc.) The files contain headers and are tab-delimited; the commands for reading in the tables in RStudio are found below. (Note: Replace "~/path/to/" with the path to your output files as needed.)

summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t") detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")

- Summary The summary metrics file will contain a single row of data for each metric, taking into account all samples present in your INPUT file. - Detail The detail metrics file gives a breakdown of each statistic by sample. In addition to all metrics covered in the summary table, the detail table also contains entries for SAMPLE_ALIAS and HET_HOMVAR_RATIO. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.

Analyzing Results

^{*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.}

Number of Indels & SNPs This tool collects the number of SNPs (single nucleotide polymorphisms) and indels (insertions and deletions) as found in the variants file. It counts only biallelic sites and filters out multiallelic sites. Many factors affect these counts, including cohort size, relatedness between samples, strictness of filtering, ethnicity of samples, and even algorithm improvement due to updated software. While this metric alone is insufficient to evaluate your variants, it does provide a good baseline. It is reassuring to see that across the three related samples, we saw very similar numbers of SNPs and indels. It could be cause for concern if a particular sample had significantly more or fewer variants than the rest.
Indel Ratio The indel ratio is determined to be the total number of insertions divided by the total number of deletions; this tool does not include filtered variants in this calculation. Usually, the indel ratio is around 1, as insertions occur typically as frequently as deletions. However, in rare variant studies, indel ratio should be around 0.2-0.5. Our samples have an indel ratio of ~0.95, indicating that these variants are not likely to have a bias affecting their insertion/deletion ratio.
TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) mutations. For whole genome sequencing data, TiTv should be ~2.0-2.1, whereas whole exome sequencing data will have a TiTv ratio of ~3.0-3.3¹. In the case of the CEU trio of samples, the TiTv of ~2.08 and ~1.91 are within reason, and this variant callset is unlikely to have a bias affecting its transition/transversion ratio.

Attachments:

CollectVariantCallingMetrics_all.png

Updated on 2015-12-01

From Jutta on 2017-06-17

Hi,

I followed the Best-Practice guidelines for RNA-seq data and called variants using Haplotype caller between different maize genotypes and the reference genome. According to your nice tutorial, I should do some quality assessment of my called variants. For this, I would like to use picard’s CollecvariantCallingMetrics. However, I am not quite sure what to use as the input file for DBSNP?

Thanks,

Jutta

From shlee on 2017-06-19

Hi @Jutta,

The DBSNP file refers to the [database of short genetic variations](https://www.ncbi.nlm.nih.gov/projects/SNP/), for which there appears to be a resource for corn available. The Zea mays resource is currently listed at the very end of [this page](https://www.ncbi.nlm.nih.gov/dbvar/content/org_summary/).

CollectVariantCallingMetrics uses the known variants from the resource file to stratify the experimental variants you provide it, e.g. gives metrics on the novel variants. So despite the name of the DBSNP tool argument, really you could provide the tool any VCF of variants from which you would like the comparison to be drawn.

From prasundutta87 on 2017-06-20

Hi,

Can you let me know some more details about SNP_REFERENCE_BIAS value? I mean how much should I expect it to be? I guess any value equal to approx 0.5 is ok? What if it is less than that or more?

From shlee on 2017-06-20

Hi @prasundutta87,

You can look up all of the Picard metrics at . For SNP_REFERENCE_BIAS, we see

> The rate at which reference bases are observed at ref/alt heterozygous SNP sites.

Let me consult with a developer on the range of acceptable values.

From shlee on 2017-06-20

@prasundutta87,

Our developer agrees that

> close to 0.5 is a good value. If it’s more than 0.5 you might have some bias towards the reference.

From prasundutta87 on 2017-06-20

Thanks a lot. In one of my vcf file, I have 0.3..i guess it is fine then?

From shlee on 2017-06-20

That’s low. Our developer says a more acceptable range is 0.45–0.55. Also,

> Depending on your assay, 0.3 might be as good as it can get. Expect to have an elevated rate of genotyping errors and false negatives at that rate, especially on low coverage regions.

From EliseG on 2018-07-06

Hi shlee!

You mentionned : the Picard metrics at https://broadinstitute.github.io/picard/picard-metric-definitions.html.

I think there is a mistake in this page : It seems that there is an inversion between CollectVariantCallingMetrics.VariantCallingSummaryMetrics fields and CollectVariantCallingMetrics.VariantCallingDetailMetrics fields.

Kind regards,

Elise

From shlee on 2018-07-06

Hi @EliseG,

CollectVariantCallingMetrics summary metrics give the following for variant site in the VCF:

TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV NOVEL_TITV TOTAL_INDELS NOVEL_INDELS FILTERED_INDELS PCT_DBSNP_INDELS NUM_IN_DB_SNP_INDELS DBSNP_INS_DEL_RATIO NOVEL_INS_DEL_RATIO TOTAL_MULTIALLELIC_SNPS NUM_IN_DB_SNP_MULTIALLELIC TOTAL_COMPLEX_INDELS NUM_IN_DB_SNP_COMPLEX_INDELS SNP_REFERENCE_BIAS NUM_SINGLETONS

CollectVariantCallingMetrics detail metrics give the following for each sample in the VCF:

SAMPLE_ALIAS HET_HOMVAR_RATIO PCT_GQ0_VARIANTS TOTAL_GQ0_VARIANTS TOTAL_HET_DEPTH

plus:

which are identical metrics as the summary metrics. It looks like the metrics page is being succinct. Apologies for the confusion. At some point in the future, these metrics docs will be moved to the GATK website and at that point, this point will be clarified.

Report abuse

howto Evaluate a callset with CollectVariantCallingMetrics

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

Related Documents

Context

Example Use

Command

Getting Results

Analyzing Results

Attachments: