What do the VariantEval modules do

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

Created by Geraldine_VdAuwera on 2013-03-18

VariantEval accepts two types of modules: stratification and evaluation modules.

    • Stratification modules will stratify (group) the variants based on certain properties.
    • Evaluation modules will compute certain metrics for the variants.

CpG

CpG is a three-state stratification:

    • The locus is a CpG site ("CpG")
    • The locus is not a CpG site ("non_CpG")
    • The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
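The definition above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not GATK's code: the function names and the 0-based indexing are assumptions, and only the forward strand is inspected because the text does not say how the opposite strand is handled.

```python
from typing import List


def is_cpg(reference: str, pos: int) -> bool:
    """Return True if the 0-based position `pos` in `reference` is a CpG site.

    Follows the definition in the text: the reference base is a C and the
    adjacent reference base in the 3' direction is a G.
    """
    if pos < 0 or pos + 1 >= len(reference):
        return False  # no 3' neighbour to inspect
    return reference[pos].upper() == "C" and reference[pos + 1].upper() == "G"


def cpg_states(reference: str, pos: int) -> List[str]:
    """List the CpG stratification states that a locus falls into."""
    specific = "CpG" if is_cpg(reference, pos) else "non_CpG"
    return [specific, "all"]  # every locus also belongs to "all"


print(cpg_states("ACGT", 1))  # the C at index 1 is followed by a G
print(cpg_states("ACGT", 2))  # the G itself is not a CpG site under this rule
```

Note that each locus is reported in two states: its specific state and the catch-all "all" state.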

EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

Filter

Filter is a three-state stratification:

    • The locus passes QC filters ("called")
    • The locus fails QC filters ("filtered")
    • The locus either passes or fails QC filters ("raw")
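As a rough sketch of how a three-state stratification like this assigns a site, the hypothetical helper below maps a VCF FILTER value to the states listed above. It assumes that both "PASS" and "." (no filter applied) count as passing; that treatment of "." is an assumption, not something this page states.

```python
from typing import List


def filter_states(filter_field: str) -> List[str]:
    """Map a VCF FILTER value to the Filter stratification states."""
    # Assumption: "PASS" and "." (unfiltered) are both treated as passing.
    passes = filter_field in ("PASS", ".")
    specific = "called" if passes else "filtered"
    return [specific, "raw"]  # "raw" collects every site regardless of filter


print(filter_states("PASS"))
print(filter_states("LowQual"))
```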

FunctionalClass

FunctionalClass is a four-state stratification:

    • The locus is a synonymous site ("silent")
    • The locus is a missense site ("missense")
    • The locus is a nonsense site ("nonsense")
    • The locus is of any functional class ("any")

CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.

Degeneracy

Degeneracy is a six-state stratification:

    • The underlying base position in the codon is 1-fold degenerate ("1-fold")
    • The underlying base position in the codon is 2-fold degenerate ("2-fold")
    • The underlying base position in the codon is 3-fold degenerate ("3-fold")
    • The underlying base position in the codon is 4-fold degenerate ("4-fold")
    • The underlying base position in the codon is 6-fold degenerate ("6-fold")
    • The underlying base position in the codon is degenerate at any level ("all")

See the Wikipedia page on degeneracy (http://en.wikipedia.org/wiki/Genetic_code#Degeneracy) for more information.

JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See the document "Using JEXL expressions".

Novelty

Novelty is a three-state stratification:

    • The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
    • The locus does not overlap the knowns comp track ("novel")
    • The locus either overlaps or does not overlap the knowns comp track ("all")

CountVariants

CountVariants is an evaluation module that computes the following metrics:

| Metric | Definition |
|:-------|:-----------|
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per locus rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of SNP loci |
| nInsertions | Number of insertions |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no-call loci |
| nHets | Number of het loci |
| nHomRef | Number of hom ref loci |
| nHomVar | Number of hom var loci |
| nSingletons | Number of singletons |
| heterozygosity | heterozygosity per locus rate |
| heterozygosityPerBp | heterozygosity per base pair |
| hetHomRatio | heterozygosity to homozygosity ratio |
| indelRate | indel rate (insertion count + deletion count) |
| indelRatePerBp | indel rate per base pair |
| deletionInsertionRatio | deletion to insertion ratio |
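The rate and ratio columns can be reproduced from the count columns. The sketch below uses formulas inferred from the CountVariants report posted in the comments further down this page (the "InPDX_P0-FilteredInPDX_PT / all" row), not from the GATK source, so treat every formula as an assumption. In particular, the posted numbers only match if complex loci are counted together with insertions and deletions in the indel rate, the "PerBp" columns behave as truncated inverse rates, and that report names its last column insertionDeletionRatio. The function name is hypothetical.

```python
def count_variants_ratios(n_processed_loci, n_variant_loci, n_insertions,
                          n_deletions, n_complex, n_hets, n_hom_var):
    """Derive CountVariants-style rates from raw counts (inferred formulas)."""

    def ratio(num, denom):
        return num / denom if denom else 0.0

    def per_bp(count):
        # Inverse rate: processed loci per event, truncated to an integer.
        return n_processed_loci // count if count else 0

    # Assumption: complex loci are included in the indel count.
    n_indels = n_insertions + n_deletions + n_complex
    return {
        "variantRate": ratio(n_variant_loci, n_processed_loci),
        "variantRatePerBp": per_bp(n_variant_loci),
        "heterozygosity": ratio(n_hets, n_processed_loci),
        "heterozygosityPerBp": per_bp(n_hets),
        "hetHomRatio": ratio(n_hets, n_hom_var),
        "indelRate": ratio(n_indels, n_processed_loci),
        "indelRatePerBp": per_bp(n_indels),
        "insertionDeletionRatio": ratio(n_insertions, n_deletions),
    }


# Counts taken from the "InPDX_P0-FilteredInPDX_PT / all" row of the report.
row = count_variants_ratios(
    n_processed_loci=3137161264, n_variant_loci=301, n_insertions=12,
    n_deletions=57, n_complex=13, n_hets=530, n_hom_var=72,
)
for name, value in row.items():
    print(name, value)
```

Comparing the printed values with the corresponding row of a real report is a quick way to check whether these inferred formulas hold for a given GATK version.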

CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

| Metric | Definition |
|:-------|:-----------|
| nEvalSNPs | number of eval SNP sites |
| nCompSNPs | number of comp SNP sites |
| novelSites | number of eval sites outside of comp sites |
| nVariantsAtComp | number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | percentage of eval sites at comp sites |
| nConcordant | number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | the concordance rate |

Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

    $ grep -v '##' dbsnp.vcf
    #CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO
    1       10327  rs112750067  T    C    .     .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132

Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

    $ grep -v '##' eval_correct_allele.vcf
    #CHROM  POS    ID  REF  ALT  QUAL     FILTER  INFO  FORMAT          001-6
    1       10327  .   T    C    5168.52  PASS    ...   GT:AD:DP:GQ:PL  0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

    $ grep -v '##' eval_incorrect_allele.vcf
    #CHROM  POS    ID  REF  ALT  QUAL     FILTER  INFO  FORMAT          001-6
    1       10327  .   T    A    5168.52  PASS    ...   GT:AD:DP:GQ:PL  0/1:357,238:373:99:3959,0,4059

Running VariantEval with just the CompOverlap module:

    $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
        -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
        -L 1:10327 \
        -B:dbsnp,VCF dbsnp.vcf \
        -B:eval_correct_allele,VCF eval_correct_allele.vcf \
        -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
        -noEV \
        -EV CompOverlap \
        -o eval.table

We find that the eval.table file contains the following:

    $ grep -v '##' eval.table | column -t
    CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
    CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
    CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
    CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
    CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
    CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
    CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
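The bookkeeping in this example can be mimicked with a small Python sketch. It is not GATK's implementation: the function name, the dictionary layout, the rule for multi-allelic sites (every eval ALT allele must be present in the comp record) and the percentage formulas are all assumptions, chosen because they reproduce the values in the tables on this page.

```python
from typing import Dict, Set, Tuple

Locus = Tuple[str, int]        # (chromosome, 1-based position)
Sites = Dict[Locus, Set[str]]  # locus -> set of ALT alleles seen there


def comp_overlap(eval_sites: Sites, comp_sites: Sites) -> dict:
    """Summarise how eval sites overlap a comp track (illustrative only)."""
    n_eval = len(eval_sites)
    # Sites sharing a locus with the comp track, whatever the allele.
    at_comp = [locus for locus in eval_sites if locus in comp_sites]
    # Assumption: concordant means every eval ALT allele is also in comp.
    concordant = [l for l in at_comp if eval_sites[l] <= comp_sites[l]]

    def percent(num: int, denom: int) -> float:
        return 100.0 * num / denom if denom else 0.0

    return {
        "nEvalVariants": n_eval,
        "novelSites": n_eval - len(at_comp),
        "nVariantsAtComp": len(at_comp),
        "compRate": percent(len(at_comp), n_eval),
        "nConcordant": len(concordant),
        "concordantRate": percent(len(concordant), len(at_comp)),
    }


# The three single-record files from the example above.
dbsnp = {("1", 10327): {"C"}}
eval_correct_allele = {("1", 10327): {"C"}}
eval_incorrect_allele = {("1", 10327): {"A"}}

print(comp_overlap(eval_correct_allele, dbsnp))
print(comp_overlap(eval_incorrect_allele, dbsnp))
```

Both eval sets overlap the dbSNP locus, so both count toward nVariantsAtComp; only the set whose ALT allele matches contributes to nConcordant.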

TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

| Metric | Definition |
|:-------|:-----------|
| nTi | number of transition loci |
| nTv | number of transversion loci |
| tiTvRatio | the transition to transversion ratio |
| nTiInComp | number of comp transition sites |
| nTvInComp | number of comp transversion sites |
| TiTvRatioStandard | the transition to transversion ratio for comp sites |
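For readers unfamiliar with the terms, a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and any other single-base substitution is a transversion. The sketch below counts both for a list of biallelic SNPs; the function names are hypothetical and the handling of a zero transversion count (returning 0.0) is an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for a transition, False for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(snps):
    """Return (nTi, nTv, tiTvRatio) for an iterable of (ref, alt) pairs."""
    snps = list(snps)
    n_ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    n_tv = len(snps) - n_ti
    ratio = n_ti / n_tv if n_tv else 0.0  # assumption: 0.0 when nTv is 0
    return n_ti, n_tv, ratio


# T>C (as in eval_correct_allele.vcf) is a transition,
# T>A (as in eval_incorrect_allele.vcf) is a transversion.
print(is_transition("T", "C"), is_transition("T", "A"))
print(ti_tv([("A", "G"), ("C", "T"), ("A", "C")]))
```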

Tags:

official, tooltips, varianteval, analyst, intermediate

Updated on 2013-03-18

From Laurent on 2013-03-19

This is such a great and helpful page, thanks a lot! A small question regarding the FunctionalClass stratification: what annotation will it read?

From Geraldine_VdAuwera on 2013-03-19

I’m glad you find it useful. FunctionalClass reads annotations such as those imported from SnpEff; see the SnpEff annotation documentation for more details. There’s also a presentation on this topic here (see “Functional annotation” toward the end of the page): http://www.broadinstitute.org/gatk/guide/events?id=2038

From myoglu on 2013-10-07

Silly question maybe, but how did you make the nice plots and tables? I have the report as “.txt”, but that does not look at all so nice.

Thanks!

From Geraldine_VdAuwera on 2013-10-07

We have some custom Rscripts to plot the report data. We currently don’t make them available to the public though, sorry!

From SCR on 2014-07-24

Hi,

I am using VariantEval to compare variant calls between two vcfs, and I noticed that in the CountVariants table, the values for nCalledLoci and nNoCalls are the same within the rows displaying calls unique to each set. For example, for set 1, nCalledLoci=551 and nNoCalls=551. Logically this seems incorrect – any explanations as to why this is happening?

Thanks!

From Geraldine_VdAuwera on 2014-07-24

Hmm. Can you please post the full table?

From SCR on 2014-07-25

Hi @Geraldine_VdAuwera,

Thanks for getting back to me. The full table is quite unwieldy just as text, but I will post it below. Here is a link to a more readable version in dropbox: https://www.dropbox.com/s/y3r4rc5uqlka22q/GATKReport_nCalledLoci_nNoCalls_troubleshooting.xlsx

    #:GATKTable:30:21:%s:%s:%s:%s:%s:%d:%d:%d:%d:%.8f:%.8f:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%.2e:%.2f:%.2f:%.2e:%.2f:%.2f:;
    #:GATKTable:CountVariants:Counts different classes of variants in the sample
    CountVariants CompRod EvalRod JexlExpression Novelty nProcessedLoci nCalledLoci nRefLoci nVariantLoci variantRate variantRatePerBp nSNPs nMNPs nInsertions nDeletions nComplex nSymbolic nMixed nNoCalls nHets nHomRef nHomVar nSingletons nHomDerived heterozygosity heterozygosityPerBp hetHomRatio indelRate indelRatePerBp insertionDeletionRatio
    CountVariants dbsnp eval FilteredInAll all 3137161264 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00E+00 0 0 0.00E+00 0 0
    CountVariants dbsnp eval FilteredInAll known 3137161264 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00E+00 0 0 0.00E+00 0 0
    CountVariants dbsnp eval FilteredInAll novel 3137161264 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00E+00 0 0 0.00E+00 0 0
    CountVariants dbsnp eval InPDX_P0-FilteredInPDX_PT all 3137161264 301 0 301 0.0000001 10422462 219 0 12 57 13 0 0 0 530 0 72 0 0 1.69E-07 5919172 7.36 2.61E-08 38258064 0.21
    CountVariants dbsnp eval InPDX_P0-FilteredInPDX_PT known 3137161264 250 0 250 0.00000008 12548645 186 0 6 47 11 0 0 0 432 0 68 0 0 1.38E-07 7261947 6.35 2.04E-08 49018144 0.13
    CountVariants dbsnp eval InPDX_P0-FilteredInPDX_PT novel 3137161264 51 0 51 0.00000002 61512965 33 0 6 10 2 0 0 0 98 0 4 0 0 3.12E-08 32011849 24.5 5.74E-09 174286736 0.6
    CountVariants dbsnp eval InPDX_PT-FilteredInPDX_P0 all 3137161264 494 0 494 0.00000016 6350528 400 0 44 37 13 0 0 0 846 0 142 0 0 2.70E-07 3708228 5.96 3.00E-08 33374056 1.19
    CountVariants dbsnp eval InPDX_PT-FilteredInPDX_P0 known 3137161264 436 0 436 0.00000014 7195324 355 0 40 29 12 0 0 0 733 0 139 0 0 2.34E-07 4279892 5.27 2.58E-08 38730385 1.38
    CountVariants dbsnp eval InPDX_PT-FilteredInPDX_P0 novel 3137161264 58 0 58 0.00000002 54088987 45 0 4 8 1 0 0 0 113 0 3 0 0 3.60E-08 27762489 37.67 4.14E-09 241320097 0.5
    CountVariants dbsnp eval Intersection all 3137161264 43389 0 43389 0.00001383 72303 40019 0 1614 1655 101 0 0 0 46622 0 40156 0 0 1.49E-05 67289 1.16 1.07E-06 930908 0.98
    CountVariants dbsnp eval Intersection known 3137161264 42396 0 42396 0.00001351 73996 39307 0 1492 1496 101 0 0 0 44896 0 39896 0 0 1.43E-05 69876 1.13 9.85E-07 1015591 1
    CountVariants dbsnp eval Intersection novel 3137161264 993 0 993 0.00000032 3159276 712 0 122 159 0 0 0 0 1726 0 260 0 0 5.50E-07 1817590 6.64 8.96E-08 11164274 0.77
    CountVariants dbsnp eval PDX_P0 all 3137161264 551 0 551 0.00000018 5693577 450 0 44 56 1 0 0 551 377 0 174 311 0 1.20E-07 8321382 2.17 3.22E-08 31061002 0.79
    CountVariants dbsnp eval PDX_P0 known 3137161264 355 0 355 0.00000011 8837073 297 0 18 39 1 0 0 355 192 0 163 161 0 6.12E-08 16339381 1.18 1.85E-08 54088987 0.46
    CountVariants dbsnp eval PDX_P0 novel 3137161264 196 0 196 0.00000006 16005924 153 0 26 17 0 0 0 196 185 0 11 150 0 5.90E-08 16957628 16.82 1.37E-08 72957238 1.53
    CountVariants dbsnp eval PDX_PT all 3137161264 1523 0 1523 0.00000049 2059856 1262 0 131 125 5 0 0 1523 1224 0 299 1025 0 3.90E-07 2563040 4.09 8.32E-08 12019774 1.05
    CountVariants dbsnp eval PDX_PT known 3137161264 1292 0 1292 0.00000041 2428143 1100 0 87 100 5 0 0 1292 1011 0 281 870 0 3.22E-07 3103027 3.6 6.12E-08 16339381 0.87
    CountVariants dbsnp eval PDX_PT novel 3137161264 231 0 231 0.00000007 13580784 162 0 44 25 0 0 0 231 213 0 18 155 0 6.79E-08 14728456 11.83 2.20E-08 45466105 1.76
    CountVariants dbsnp eval none all 3137161264 46258 0 46258 0.00001475 67818 42350 0 1845 1930 133 0 0 2074 49599 0 40843 1336 0 1.58E-05 63250 1.21 1.25E-06 802753 0.96
    CountVariants dbsnp eval none known 3137161264 44729 0 44729 0.00001426 70137 41245 0 1643 1711 130 0 0 1647 47264 0 40547 1031 0 1.51E-05 66375 1.17 1.11E-06 900448 0.96
    CountVariants dbsnp eval none novel 3137161264 1529 0 1529 0.00000049 2051773 1105 0 202 219 3 0 0 427 2335 0 296 305 0 7.44E-07 1343538 7.89 1.35E-07 7398965 0.92

From Geraldine_VdAuwera on 2014-07-28

Hi @SCR,

Thanks, this is fine. I just wanted to check that the table looks sane, which it does if you have multiple samples in your callset. The first set of fields, such as nCalledLoci, are properties that are evaluated per variant site. The next set of fields, including nNoCalls, nHets etc., are evaluated per sample, since they are genotype properties. So you can have 551 variant calls (nCalledLoci) with 551 no-call genotypes (nNoCalls) over one or more samples. Since it is a bit odd that you’d have exactly the same number, I’m wondering if one of your samples has all no-calls at the sites you’re looking at. You can stratify this table by sample to find out.
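To make the per-site versus per-sample distinction concrete, here is a minimal Python sketch with a hypothetical two-sample callset in which the second sample is a no-call at every site. The function name, the input layout and the rule that a genotype with any missing allele counts as a no-call are assumptions for illustration, not GATK's implementation.

```python
def count_sites_and_genotypes(records):
    """Tally per-site and per-genotype counts.

    `records` holds one list per variant site, with one VCF-style genotype
    string per sample (for example "0/1", "1|1" or "./.").
    """
    counts = {"nCalledLoci": 0, "nNoCalls": 0,
              "nHets": 0, "nHomRef": 0, "nHomVar": 0}
    for genotypes in records:
        counts["nCalledLoci"] += 1           # counted once per site
        for gt in genotypes:                 # counted once per sample
            alleles = gt.replace("|", "/").split("/")
            if "." in alleles:
                counts["nNoCalls"] += 1      # assumption: any missing allele
            elif len(set(alleles)) > 1:
                counts["nHets"] += 1
            elif alleles[0] == "0":
                counts["nHomRef"] += 1
            else:
                counts["nHomVar"] += 1
    return counts


# Sample 1 is genotyped everywhere; sample 2 is a no-call at every site.
records = [["0/1", "./."], ["1/1", "./."], ["0/1", "./."]]
print(count_sites_and_genotypes(records))
```

With this input the number of sites equals the number of no-call genotypes, and nHets plus nHomVar also adds up to the number of sites. The PDX_P0 "all" row of the posted report shows the same pattern (551 called loci, 551 no-calls, 377 + 174 = 551 called genotypes).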

From tinu on 2014-08-07

Hi Geraldine,

I used the following command

    java -Xmx6G -jar /GenomeAnalysisTK.jar -R /hs37d5.fa -T VariantEval -eval INPUT.vcf -o INPUT.gatkreport --dbsnp dbsnp_137.b37.vcf

    GATKTable:CompOverlap:The overlap between eval and comp sites
    CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  novelSites  nVariantsAtComp  compRate  nConcordant  concordantRate
    CompOverlap  dbsnp    eval     none            all      64970          1680        63290            97.41     63201        99.86
    CompOverlap  dbsnp    eval     none            known    63290          0           63290            100       63201        99.86
    CompOverlap  dbsnp    eval     none            novel    1680           1680        0                0         0            0

My VCF has 66932 variants, with 63621 SNPs, 3047 INDELs and 264 multiallelic variants. My question is: why is VariantEval reporting only 64970 variants in total?

Thanks,

Tinu

From Geraldine_VdAuwera on 2014-08-11

@tinu

How did you count the number of variants in your file?