created by KateN
on 2015-09-23
This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many additional metrics to those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits to each tool, as well as the basis for variant evaluation, are discussed here.
java -jar picard.jar CollectVariantCallingMetrics \ INPUT=CEUtrio.vcf \ OUTPUT=CEUtrioMetrics \ DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf
After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as a text file if needed, or they can be read in as a table using your preferred spreadsheet viewer (e.g. Excel) or scripting language of your choice (e.g. python, R, etc.) The files contain headers and are tab-delimited; the commands for reading in the tables in RStudio are found below. (Note: Replace "~/path/to/" with the path to your output files as needed.)
summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t") detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")
- Summary The summary metrics file will contain a single row of data for each metric, taking into account all samples present in your INPUT
file. - Detail The detail metrics file gives a breakdown of each statistic by sample. In addition to all metrics covered in the summary table, the detail table also contains entries for SAMPLE_ALIAS
and HET_HOMVAR_RATIO
. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.
*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.
CollectVariantCallingMetrics_all.png
Updated on 2015-12-01
From Jutta on 2017-06-17
Hi,
I followed the Best-Practice guidelines for RNA-seq data and called variants using Haplotype caller between different maize genotypes and the reference genome. According to your nice tutorial, I should do some quality assessment of my called variants. For this, I would like to use picard’s CollecvariantCallingMetrics. However, I am not quite sure what to use as the input file for DBSNP?
Thanks,
Jutta
From shlee on 2017-06-19
Hi @Jutta,
The DBSNP file refers to the [database of short genetic variations](https://www.ncbi.nlm.nih.gov/projects/SNP/), for which there appears to be a resource for corn available. The Zea mays resource is currently listed at the very end of [this page](https://www.ncbi.nlm.nih.gov/dbvar/content/org_summary/).
CollectVariantCallingMetrics uses the known variants from the resource file to stratify the experimental variants you provide it, e.g. gives metrics on the novel variants. So despite the name of the DBSNP tool argument, really you could provide the tool any VCF of variants from which you would like the comparison to be drawn.
From prasundutta87 on 2017-06-20
Hi,
Can you let me know some more details about SNP_REFERENCE_BIAS value? I mean how much should I expect it to be? I guess any value equal to approx 0.5 is ok? What if it is less than that or more?
From shlee on 2017-06-20
Hi @prasundutta87,
You can look up all of the Picard metrics at . For SNP_REFERENCE_BIAS, we see
> The rate at which reference bases are observed at ref/alt heterozygous SNP sites.
Let me consult with a developer on the range of acceptable values.
From shlee on 2017-06-20
@prasundutta87,
Our developer agrees that
> close to 0.5 is a good value. If it’s more than 0.5 you might have some bias towards the reference.
From prasundutta87 on 2017-06-20
Thanks a lot. In one of my vcf file, I have 0.3..i guess it is fine then?
From shlee on 2017-06-20
That’s low. Our developer says a more acceptable range is 0.45–0.55. Also,
> Depending on your assay, 0.3 might be as good as it can get. Expect to have an elevated rate of genotyping errors and false negatives at that rate, especially on low coverage regions.
From EliseG on 2018-07-06
Hi shlee!
You mentionned : the Picard metrics at https://broadinstitute.github.io/picard/picard-metric-definitions.html.
I think there is a mistake in this page : It seems that there is an inversion between CollectVariantCallingMetrics.VariantCallingSummaryMetrics fields and CollectVariantCallingMetrics.VariantCallingDetailMetrics fields.
Kind regards,
Elise
From shlee on 2018-07-06
Hi @EliseG,
CollectVariantCallingMetrics summary metrics give the following for variant site in the VCF:
TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV NOVEL_TITV TOTAL_INDELS NOVEL_INDELS FILTERED_INDELS PCT_DBSNP_INDELS NUM_IN_DB_SNP_INDELS DBSNP_INS_DEL_RATIO NOVEL_INS_DEL_RATIO TOTAL_MULTIALLELIC_SNPS NUM_IN_DB_SNP_MULTIALLELIC TOTAL_COMPLEX_INDELS NUM_IN_DB_SNP_COMPLEX_INDELS SNP_REFERENCE_BIAS NUM_SINGLETONS
CollectVariantCallingMetrics detail metrics give the following for each sample in the VCF:
SAMPLE_ALIAS HET_HOMVAR_RATIO PCT_GQ0_VARIANTS TOTAL_GQ0_VARIANTS TOTAL_HET_DEPTH
plus:
TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV NOVEL_TITV TOTAL_INDELS NOVEL_INDELS FILTERED_INDELS PCT_DBSNP_INDELS NUM_IN_DB_SNP_INDELS DBSNP_INS_DEL_RATIO NOVEL_INS_DEL_RATIO TOTAL_MULTIALLELIC_SNPS NUM_IN_DB_SNP_MULTIALLELIC TOTAL_COMPLEX_INDELS NUM_IN_DB_SNP_COMPLEX_INDELS SNP_REFERENCE_BIAS NUM_SINGLETONS
which are identical metrics as the summary metrics. It looks like the metrics page is being succinct. Apologies for the confusion. At some point in the future, these metrics docs will be moved to the GATK website and at that point, this point will be clarified.