created by KateN
on 2015-09-29
This document will walk you through use of GATK's VariantEval tool. VariantEval allows for a lot of customizability, enabling an enhanced analysis of your callset through stratification, use of additional evaluation modules, and the ability to pass in multiple truth sets. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits to each tool, as well as the basis for variant evaluation, are discussed here.
java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R reference.b37.fasta \ -eval SampleVariants.vcf \ -D dbsnp_138.b37.excluding_sites_after_129.vcf \ -noEV -EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator -EV CountVariants -EV MultiallelicSummary \ -o SampleVariants_Evaluation.eval.grp
This command specifies the tool (VariantEval), input files, evaluation modules to be used, and an output file to write the results to. The output will be in the form of a GATKReport.
-eval
: a .vcf file containing your sample(s)' variant data you wish to evaluate. The example shown here uses a whole-genome sequenced rare variant association study performed on >1500 samples. You can specify multiple files to evaluate with additional -eval
lines.-D
: a dbSNP .vcf to provide the tool a reference of known variants, which can be found in the GATK bundle-R
: a reference sequence .fastaFor our example command, we will simplify our analysis and examine results using the following minimum standard modules: CompOverlap, IndelSummary, TiTvVariantEvaluator, CountVariants, & MultiallelicSummary. These modules will provide a reasonable assessment of variant qualities while reducing the computational burden in comparison to running the default modules. In the data we ran here, >1500 whole-genome-sequenced samples, this improved the run time by 5 hours and 20 minutes compared to using the default modules, which equates to a 12% time reduction. In order to do this, all default modules are removed with -noEV
, then the minimum standard modules are added back in. This tool uses only at variants that have passed all filtration steps to calculate metrics.
Here we see an example of the table generated by the CompOverlap evaluation module. The field concordantRate
is highlighted as it is one of the metrics we examine for quality checks. Each table generated by the sample call is in the attached files list at the end of this document, which you are free to browse at your leisure.
It is important to note the stratification by novelty, seen in this and all other tables for this example. The row for "novel" includes all variants that are found in SampleVariants.vcf
but not found in the known variants file. By default, your known variants are in dbSNP. However, if you would like to specify a different known set of variants, you can pass in a -comp
file, and call -knownName
on it. (See the VariantEval tool documentation for more information) The "known" row includes all variants found in SampleVariants.vcf
and the known variants file. "All" totals the "known" and "novel" rows. This novelty stratification is done by default, but many other stratifications can be specified; see tool documentation for more information.
Compiled in the below table are all of the metrics taken from various tables that we will use to ascertain the quality of the analysis.
-comp
argument.* In the case used here, we expect (and observe) a majority of overlap between eval
and dbSNP. The latter contains a multitude of variants and is not highly regulated, so matching a high number of eval
variants to it is quite likely. * Please note: As dbSNP is our default truth set (for comparison), and our default known (for novelty determination), you will see 0 in the novel concordantRate column. If you are interested in knowing the novel concordantRate, you must specify a truth set different from the set specified as known.insertion_to_deletion_ratio
under IndelSummary, and under CountVariants as insertionDeletionRatio
. Each table gives a different ratio, due to the differences in calculating indels as discussed in the previous section. In our particular sample data set, filters were run to favor detection of more rare variants. Thus the indel ratios of the loci-based table (IndelSummary; 0.77 & 0.69) are closer to the rare ratio than the expected normal.The purpose of running the analysis with the minimum standard evaluation modules is to minimize the time spent running VariantEval. Reducing the number of evaluation modules has some effects on the total runtime; depending on the additional specifications given (stratifications, multiple -comp
files, etc.), running with the minimum standard evaluation modules can reduce the runtime by 10-30% in comparison to running the default evaluation modules. Further reducing the runtime can be achieved through multithreading, using the -nt
argument.
Updated on 2015-12-01