Evaluating the quality of a variant callset

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by KateN

on 2015-10-21

Introduction

Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.

Underlying assumptions and truthiness^*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results. ^{*With apologies to Stephen Colbert.}

Validation

So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.

Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.

Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

Variant-level concordance and genotype concordance

The relationship between variant-level concordance and genotype concordance is illustrated in this figure.

Variant-level concordance (aka % Concordance) gives the percentage of variants in your samples that match (are concordant with) variants in your truth set. It essentially serves as a check of how well your analysis pipeline identified variants contained in the truth set. Depending on what you are evaluating and comparing, the interpretation of percent concordance can vary quite significantly. Comparing your sample(s) against genotyping chip results matched per sample allows you to evaluate whether you missed any real variants within the scope of what is represented on the chip. Based on that concordance result, you can extrapolate what proportion you may have missed out of the real variants not represented on the chip. If you don't have a sample-matched truth set and you're comparing your sample against a truth set derived from a population, your interpretation of percent concordance will be more limited. You have to account for the fact that some variants that are real in your sample will not be present in the population and that conversely, many variants that are in the population will not be present in your sample. In both cases, "how many" depends on how big the population is and how representative it is of your sample's background. Keep in mind that for most tools that calculate this metric, all unmatched variants (present in your sample but not in the truth set) are considered to be false positives. Depending on your trust in the truth set and whether or not you expect to see true, novel variants, these unmatched variants could warrant further investigation -- or they could be artifacts that you should ignore.
Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within a set of variant calls that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.

Number of Indels & SNPs and TiTv Ratio

These metrics are widely applicable. The table below summarizes their expected value ranges for Human Germline Data:

| Sequencing Type | # of Variants* | TiTv Ratio | | ----- | ----- | ----- | | WGS | ~4.4M | 2.0-2.1 | | WES | ~41k | 3.0-3.3 |

^{*for a single sample}

Number of Indels & SNPs The number of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvement due to a software update. For reference, Nature's recently published 2015 paper in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.
TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5. This is simply due to the fact that there are twice as many possible transversion mutations than there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it has been shown to increase the expected random ratio from 0.5 to ~2.0¹. Furthermore, CpG islands, usually found in primer regions, have higher concentrations of methylcytosines. By including these regions, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv Ratio is too low, your callset likely has more false positives.
It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value based upon the length of flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the number of transition and transversion mutants lean away from the expected ratio. The amount of "lean" depends on how long the flanking sequence is.

Ratio of Insertions to Deletions (Indel Ratio)

This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.

| Filtering for | Indel Ratio | | --- | --- | | common | ~1 | | rare | 0.2-0.5 |

A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.

Tools for performing variant evaluation

VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the example analysis for more details on how to use this tool and interpret its output.

GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.

Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the example analysis of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, so at that occasion we plan to consolidate these two pairs of homologous tools to eliminate redundancy.

Which tool should I use?

We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics, however the latter runs faster and is scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

So when should I use CollectVariantCallingMetrics?

If you have a very large callset
If you want to look at the metrics discussed here and not much else
If you want your analysis back quickly

When should I use VariantEval?

When you require a more detailed analysis of your callset
If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
If you need to compare to multiple truth sets at the same time

Updated on 2016-05-29

From mglclinical on 2016-06-03

Hi @Geraldine_VdAuwera

My TiTv Ratio is 2.67 for exome run on NA12878 cell line. GATK Best Practices recommendations and mostly default parameters are followed in my pipeline(BWA + HaplotypeCaller)

Should I be worried that I have too many false positives ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-06-03

That is lower than what we’d expect. Have you compared your results to one of the benchmarking datasets?

From mglclinical on 2016-06-03

Yes, I just finished comparing to NIST_2.19 with variantEval and here are the results

my sensitivity is at 99%

#:GATKTable:24:3:%s:%s:%s:%s:%s:%d:%d:%d:%d:%d:%.2f:%.2f:%.2f:%.2f:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:; #:GATKTable:ValidationReport:Assess site accuracy and sensitivity of callset against follow-up validation assay

ValidationReport CompRod EvalRod JexlExpression Novelty nComp TP FP FN TN sensitivity specificity PPV FDR CompMonoEvalNoCall CompMonoEvalFiltered CompMonoEvalMono CompMonoEvalPoly CompPolyEvalNoCall CompPolyEvalFiltered CompPolyEvalMono CompPolyEvalPoly CompFiltered nDifferentAlleleSites

ValidationReport comp eval none all 24165 23936 0 229 0 99.05 100.00 100.00 0.00 0 0 0 0 229 0 0 23936 0 0

ValidationReport comp eval none known 0 0 0 0 0 NaN 100.00 NaN NaN 0 0 0 0 0 0 0 0 0 0

ValidationReport comp eval none novel 24165 23936 0 229 0 99.05 100.00 100.00 0.00 0 0 0 0 229 0 0 23936 0 0

From mglclinical on 2016-06-03

and, here is the command that I ran for variantEval

[sgajja@genlabcs bwa_hc]$ java -jar ~/algorithms/gatk/20160106/GenomeAnalysisTK_3.5.jar -T VariantEval -R ~/refData/gatkBundle28/b37/human_g1k_v37.fasta -eval variants_Final_2.vcf.gz -comp NISTv2.19_TruthPositive.vcf -o variantEvalOutput.txt -EV ValidationReport

INFO 16:16:43,947 HelpFormatter – ————————————————————————————————————————

INFO 16:16:43,949 HelpFormatter – The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56

INFO 16:16:43,949 HelpFormatter – For support and documentation go to http://www.broadinstitute.org/gatk

INFO 16:16:43,952 HelpFormatter – Program Args: -T VariantEval -R /home/sgajja/refData/gatkBundle28/b37/human_g1k_v37.fasta -eval variants_Final_2.vcf.gz -comp NISTv2.19_TruthPositive.vcf -o variantEvalOutput.txt -EV ValidationReport

INFO 16:16:43,956 HelpFormatter – Executing as sgajja@genlabcs on Linux 3.10.0-229.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_99-mockbuild_2016_03_23_23_52-b00.

INFO 16:16:43,956 HelpFormatter – Date/Time: 2016/06/03 16:16:43

INFO 16:16:43,956 HelpFormatter – ————————————————————————————————————————

INFO 16:16:44,015 GenomeAnalysisEngine – Strictness is SILENT

INFO 16:16:44,112 GenomeAnalysisEngine – Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

WARN 16:16:44,260 IndexDictionaryUtils – Track eval doesn’t have a sequence dictionary built in, skipping dictionary validation

INFO 16:16:44,313 GenomeAnalysisEngine – Preparing for traversal

INFO 16:16:44,331 GenomeAnalysisEngine – Done preparing for traversal

INFO 16:16:44,331 ProgressMeter – [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 16:16:44,347 VariantEval – Creating 3 combinatorial stratification states

INFO 16:16:53,959 VariantEval – Finalizing variant report

INFO 16:16:54,068 ProgressMeter – done 49692.0 9.0 s 3.3 m 99.8% 9.0 s 0.0 s

INFO 16:16:54,069 ProgressMeter – Total runtime 9.74 secs, 0.16 min, 0.00 hours

[sgajja@genlabcs bwa_hc]$

From mglclinical on 2016-06-16

@Geraldine,

We have used Nextera Rapid Capture Exome kit which covers 45 million coordinates, and I guess you use custom-designed exome capture targets (or) Agilent Capture kit ?

Does the difference in capture kit affects the Ti/Tv ratio ?

In this video at 9:10

(https://www.youtube.com/watch?v=XfJpSx6CEAQ)

you mention that Ti/Tv ratio for exomes should be between 2.5 and 3.0, but in this post it is in between 3.0 and 3.3 ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-06-17

Hi @mglclinical,

Yes the titv can vary depending on the capture kit. Generally speaking though it would be closer to 3 — we’ve seen callsets up to 3.3 I think. I may have misstated the expected values in that video, it was a while ago.

From mglclinical on 2016-06-20

thank you @Geraldine_VdAuwera .

This post (http://gatkforums.broadinstitute.org/gatk/discussion/2443/ti-tv-variant-evaluator-results-from-varianteval#latest) has helped me in figuring out on how to get closer to the 3.0-3.3 Ti/Tv ratio

My original vcf file had 35,000 variants. My target bed file is around 45 million bases.

I have restricted my vcf file to ucsc refseq exon regions using gatk SelectVariants, then the newly produced vcf file had only ~20,000 variants in it.

Later I ran my ~20,000 Refseq fileted vcf file through the Picard CollectVariantCallingMetrics, and now my Ti/Tv ratio is 2.991566

Please let me know if I could take any additional measures to get the Ti/Tv ratio into the widely accepted/observed 3.0-3.3 range ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-06-20

@mglclinical The result you’re getting after restricting target intervals sounds very close to the expected target. Depending on what are your project goals (eg if sensitivity is more important than specificity) that may be good enough as is. If you require more specificity then you could try to filter further, but that’s up to you.

From mglclinical on 2016-06-21

ok, thank you for the suggestion @Geraldine_VdAuwera

From buddej on 2016-07-12

The --eval option does not seem to allow carets (^) in the set name for a ROD to be evaluated. Is this intentional? We use carets in our sample names (as a delimiter between different types of IDs that all apply to a sample, since different sites in our study already use - _ and . as within-sample-id delimiters. Apparently they couldn't agree on just 1 to use...). If this is in the docs somewhere, I must have missed it.

I can leave the set name off, but then the report produced by VariantEval just lists "eval" on every line, instead of the EvalRod / sample name. I can workaround this easily, but it would be convenient if ^ was supported as a valid character. The error occurs in all recent GATKs (3.4-46, 3.5, 3.6). The message is below:

##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.6-0-g89b7209): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions https://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Invalid argument value 'unk^01AD1234^Project1_paddedexome.raw.snps.indels.g.vcf' at position 6. ##### ERROR ------------------------------------------------------------------------------------------

From Geraldine_VdAuwera on 2016-07-12

Hmm, carets as delimiters, that’s pretty exotic… never seen that before. No judgment, mind you, just commenting :)

I can’t think of a good reason for not supporting carets as regular characters, so I’m willing to put this in as an enhancement request — but fair warning, this is going to be super low priority and might not get done for a while.

Can you post your full command line for the bug report?

From buddej on 2016-07-13

It is certainly unorthodox, but when your collaborators embed all the common delimiters as part of their within-site IDs, you have to resort to other measures. I could just create an entirely new ID, but that has its own issues. Caret is at least a valid filename character on most modern OSs

This fails:

/usr/java/jre1.8.0_91/bin/java \ -Xms2g -Xmx8g -XX:ParallelGCThreads=1 -Djava.io.tmpdir=/data/tmp \ -jar /usr/local/genome/GATK-3.6-0/GenomeAnalysisTK.jar \ -T VariantEval \ -R /data/GATK_pipeline_files/ucsc.hg19.fasta \ -nt 1 \ --eval:unk^01AD1234^Project1 \ unk^01AD1234^Project1_paddedexome.raw.snps.indels.g.vcf \ -o unk^01AD1234^Project1_paddedexome_varianteval.gatkreport

Change the ^ to a – in the —eval line (carets in the filename are fine, even though there is an error message related to the .g.vcf when running the command above), and this works:

/usr/java/jre1.8.0_91/bin/java \ -Xms2g -Xmx8g -XX:ParallelGCThreads=1 -Djava.io.tmpdir=/data/tmp \ -jar /usr/local/genome/GATK-3.6-0/GenomeAnalysisTK.jar \ -T VariantEval \ -R /data/GATK_pipeline_files/ucsc.hg19.fasta \ -nt 1 \ --eval:unk-01AD1234-Project1 \ unk^01AD1234^Project1_paddedexome.raw.snps.indels.g.vcf \ -o unk^01AD1234^Project1_paddedexome_varianteval.gatkreport

From Geraldine_VdAuwera on 2016-07-13

OK, thanks — I can confirm this reproduces when tagging inputs to any tool that supports tags, btw.

From robertb on 2017-04-15

Hi,

I’ve got some samples which seem to have way too many variants for some class of variants compared to my other samples as well as expectation based on the literature. What i mean by a variant class would be, for example, high confidant, private (not in any population database), loss of function variants in pLI>0.9 genes as defined by ExAC resource. Most of my samples have 0 or 1 or 2 but some have 5-13 and a couple are around 33. And that’s just SNVs. I’ve got a lot of insertion/deletions as well though some of these at least are common. But some of my samples have far more variants than others. I used Picard CollectVariantQualityMterics and checked for outliers but there were none. So the samples do not seem to have a total excess of SNPs but for some classes they definitely are outliers. I mean ExAC individuals have on average less than 1 private high confidant loss of function variant in a pLI>0.9 gene. Case groups, like for ASD, have been shown to have more of these, but is it biologically implausible even for an ASD cohort that some individuals would have so many of these (like around 35)? It doesn’t seem specific to loss of function either. Some of the same samples are outliers for synonymous variants as well and missense.

Can someone suggest what might be the cause of this and how to test for it? As i said, none of the metrics I’ve seen so far indicates a problem. I’ve included the metrics for one of the outlier samples below (1444-21622) as well as a normal control.

SAMPLE_ALIAS GroupID HET_HOMVAR_RATIO TOTAL_SNPS NUM_IN_DB_SNP NOVEL_SNPS FILTERED_SNPS PCT_DBSNP DBSNP_TITV NOVEL_TITV TOTAL_INDELS NOVEL_INDELS FILTERED_INDELS PCT_DBSNP_INDELS NUM_IN_DB_SNP_INDELS DBSNP_INS_DEL_RATIO NOVEL_INS_DEL_RATIO TOTAL_MULTIALLELIC_SNPS NUM_IN_DB_SNP_MULTIALLELIC TOTAL_COMPLEX_INDELS NUM_IN_DB_SNP_COMPLEX_INDELS SNP_REFERENCE_BIAS NUM_SINGLETONS

1444-21622 2 1.665311 3791853 3624765 167088 205611 0.955935 2.08925 1.674307 380264 19054 81506 0.949893 361210 0.875295 0.62813 113658 91337 251022 236187 0.534575 69597

03C16241 0 1.770779 3799716 3650742 148974 224859 0.960793 2.08458 1.378409 393907 22896 96576 0.941875 371011 0.863487 0.573068 114163 91413 308262 286610 0.527945 69330

I’ve noticed a significant difference between samples in terms of the number of singletons but it seems to be driven largely by ethnic differences ( i.e., subjects from non-European pops tend to have a much larger number of singletons) but the sample 1444-21622 is of European ancestry as is the control above.

- thnaks

From Sheila on 2017-04-18

@robertb

Hi,

Can you tell us how you produced the final VCF? Did you follow the Best Practices? Can you post IGV screenshots of some of the questionable sites?

Thanks,

Sheila

From robertb on 2017-04-18

Yes I followed best practices. I inherited the BAMs but they were created using a BWA, Picard, GATK work flow. As a check, I ran Picard CollectAlignmentSummaryMetrics and everything looked o.k. The couple of samples that make me doubt whether all the calls are veridical look o.k across all the metrics they are not outliers. Similarily for the VCF files I ran Picard VariantQualityMetrics and they all looked o.k. Any other tests you can think of?

Someone suggested looking at the ratio of reads for the two strands. The ratio should be close to 1, right? If the ratio is significantly biased towards one of the strands it’s likely to be a false positive. Is that correct?

I filtered the genotypes at sites that passed filtering using GQ >=20 and DP >= 10 which is normal. And I only looked at canonical transcripts. Most also had CCDS support. If I take the individual with 33 private, high confidant LOF in PLI > 0.9 genes the GQs and DPs are also very high for each of the variants.

As a reality check I’m also looking at missense and synonymous variants to see if the same samples have a similar excess for these.

Still I’m not convinced these problems would explain what I’m seeing because they would presumably affect all my samples rather than just one or two. Also the genes implicated by these variants do not seem random but are significantly enriched for GO biological processes like RNA splicing and mRNA processing. For example, the individual with 32 of these variants has the following genes supposedly disrupted:

PTPRU, ABCD3, ATP1A1, DCAF6, HNRNPU, MDH1, XRCC5, CLCN3, HMGCS1, HNRNPH1, HDAC2

HNRNPA2B1, OGDH, SND1, SVIL, HNRNPH3, EIF3M, HYOU1, DCTN2, TM9SF2, CNOT1, BCAR1, NPEPPS, CAMSAP3, HNRNPUL1, RPS19, MAPRE1

Note the six Heterogeneous Nuclear Ribonucleoprotein genes. Not what you’d expect from some kind of technical artifact. And the individual has multiple LOF for some of these genes which is why they’re are 27 genes and 33 variants.

Not sure of IGV I have never used it before but I understand that you can use the samtools pileup to get the reads in the region of a variant and look at their distribution using the UCSC genome browser or IGV. I’ll give that a shot and get back to you. thanks for the help – Robert

From Sheila on 2017-04-21

@robertb

Hi Robert,

Yes, the ratio of reads on each strand supporting the alleles should be close to one. We have a few annotations for strand bias. Have a look at [FisherStrand](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php) and [StrandOddsRatio](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php). Also, you can look at all of the [annotations](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/) that GATK outputs to see if they are reasonable. Have a look at [this document](https://software.broadinstitute.org/gatk/documentation/article?id=6925) for ways to plot the annotations and check if the samples have similar distributions.

Have you looked at known vs novel variants?

I asked a developer for some tips on troubleshooting too many variants in a VCF (I plan to make a document with more detail), but for now, you can check for:

1) different populations (it looks like you found some differences there which is expected)

2) contamination

3) bad sample extraction that leads to chimerism

4) bad lab processing that lead to too many artifacts.

To diagnose, you can:

1) collect QC data (coverage, contamination, chimerism, sequencing artifact metrics)

2) look at the data (metrics, sequence data and variants)

3) look at the code/pipeline

4) look at the data again

Good luck!

-Sheila

From robertb on 2017-04-28

Sorry I should have been clear I have a sufficient number of samples that I ran VQSR and didn’t rely on hard filtering. Specifically I had 140 samples WGS 30X coverage. I did the calling for all samples simultaneously chromosome by chromosome. So if your relying on VQSR the advice is do not hard filter, right? Unless there is some metric that I can use in the genotype field to filter individual calls for strand bias (and there isn’t, at least not in my VCFs) then why should I bother hard filtering variants? Aren’t we supposed to have faith in VQSR?

Anyways, I ran VariantEval walker and think I found something but am still trying to figure out how interpret everything. The tool documentation doesn’t really explain what the metrics mean or how they are calculated. Does anyone know where to find that information? Thanks. The output is not like Picard where it gives to a set of metrics for each sample in the VCF. As a result I had to specify each sample separately (140 jobs).

From robertb on 2017-04-29

nevermind I’ve figured it out. My only question is how to proceed. I’ve looked at ti/tv ratio, number snps, number of insertions, number of deletions, het/hom ratio, and insertion/deletion ratio, across all my European samples. I’ve only considered the ALL row ( as in both found and not found in dbSNP ) and have found nothing. My insertion/deletion ratios are all at 0.86 to 0.89 which is consistent with what was stated in this post ( common should be around 1 and rare 0.2-0.5 ) and my ti/tv ratios are all at 1.97 and 1.98. The full breakdown is below.

Question: should you consider removing samples which are outliers for novel metrics as well?

Also, there’s a lot of metrics provided by GATK so what ones ( in addition to the ones used here ) would be worth using?

I’m trying to figure out why some samples have too many variants (novel indels or novel missense ). For example, sample 2063-24398 has 209 novel LoF insertions/deletions in pLI >=0.9 genes. By novel I mean supposedly absent from ExAC and gnomAD though I think some may turn out to in there. But most of my samples have between 0-6 of these and there’s a few with 52, 60, 92, or 94. So how can that be? I only have about 375 of these variants which means that this one sample is represented in 55% of my variants for this class. I have a similar problem for one of my samples regarding missense variants where a subject has a much larger number of novel missense variants (about 4000 of them ). I guess the question is whether this is sufficient grounds to remove the samples? The problem is that the data look o.k when using the assessment metrics so how do I justify it? Thanks – Robert

SAMPLE ti/tv snps insertions deletions het/hom ratio insertion/deletion ratio

1034-19683 1.97 3866337 237677 269299 1.62 0.88

1291-21555 1.97 3879954 233638 264904 1.65 0.88

1399-21439 1.97 3853485 234810 263968 1.62 0.89

1444-21622 1.98 3924152 235619 268612 1.67 0.88

1454-21674 1.98 3870827 232692 265712 1.65 0.88

1457-21708 1.98 3924441 238021 270461 1.67 0.88

1510-21940 1.98 3862814 230972 264075 1.61 0.87

15164-25527 1.98 3912175 220198 251503 1.75 0.88

1548-22146 1.97 3873642 233616 267609 1.62 0.87

15492-26109 1.98 3830651 230787 263270 1.62 0.88

1550-22157 1.97 3867588 234560 268148 1.67 0.87

15542-26229 1.97 3931245 237144 272966 1.74 0.87

15554-26282 1.97 3889547 236044 269568 1.72 0.88

1569-22238 1.98 3844416 232295 263165 1.63 0.88

1588-22326 1.97 3881270 236482 270469 1.7 0.87

16260-27149 1.97 3866626 237518 271275 1.66 0.88

1711-22976 1.97 3860816 238232 269497 1.66 0.88

1769-23158 1.97 3876414 241267 274601 1.64 0.88

1792-23269 1.98 3891430 234205 266953 1.71 0.88

18436-29827 1.97 3837918 236255 270751 1.6 0.87

1855-23570 1.97 3866459 233773 268021 1.64 0.87

187-13717 1.97 3877352 233194 269999 1.71 0.86

1883-23662 1.97 3908870 240800 275763 1.73 0.87

1885-23669 1.97 3893353 238257 269318 1.71 0.88

1886-23674 1.97 3843213 225784 257940 1.61 0.88

1923-23847 1.98 3848056 221547 255096 1.65 0.87

1924-23853 1.98 3899750 236984 268720 1.72 0.88

1925-23858 1.98 3823200 230999 260713 1.6 0.89

19275-30820 1.98 3847933 225349 254503 1.61 0.89

19283-30834 1.97 3879395 234834 270007 1.65 0.87

19303-30889 1.97 3871397 232175 264929 1.65 0.88

1937-23920 1.98 3862835 231266 265104 1.65 0.87

1946-23968 1.97 3863393 232647 267933 1.64 0.87

1972-24038 1.98 3870966 231492 264548 1.63 0.88

1974-24044 1.97 3879875 236558 271901 1.67 0.87

1987-24091 1.97 3881906 233779 268816 1.7 0.87

19991-31601 1.97 3884824 232934 266734 1.66 0.87

19993-31610 1.97 3913346 238049 273378 1.72 0.87

20014-31660 1.98 3837525 225604 256799 1.6 0.88

2007-24164 1.97 3866459 232439 267176 1.63 0.87

2031-24261 1.97 3927803 237692 274861 1.74 0.86

2047-24322 1.98 3897118 232939 266210 1.66 0.88

20572-32334 1.97 3890066 230337 264617 1.73 0.87

20574-32342 1.98 3883932 231799 266375 1.68 0.87

20618-32410 1.98 3878279 231477 266866 1.71 0.87

2063-24398 1.98 3882688 223320 256350 1.7 0.87

20705-32533 1.98 3879455 231354 267413 1.69 0.87

2076-24438 1.97 3858141 222950 259140 1.62 0.86

20984-32938 1.97 3836127 228268 262280 1.6 0.87

21165-33301 1.97 3851553 231182 264445 1.63 0.87

21183-33343 1.98 3898958 233181 269354 1.69 0.87

21189-33353 1.97 3860542 232012 271519 1.67 0.85

21236-33440 1.98 3892753 231922 268637 1.72 0.86

21267-33506 1.98 3854440 230078 264977 1.61 0.87

21304-33563 1.97 3853400 229305 264321 1.64 0.87

21505-33825 1.97 3848607 233506 266630 1.65 0.88

21507-33839 1.98 3860031 223537 258956 1.72 0.86

21597-33936 1.97 3845290 228715 263992 1.65 0.87

21684-34154 1.97 3863653 231934 264891 1.66 0.88

21699-34238 1.98 3881060 231345 263083 1.69 0.88

21705-34281 1.98 3830467 219710 249421 1.6 0.88

21730-34469 1.97 3862819 233464 268189 1.63 0.87

21739-34571 1.98 3882583 234250 270258 1.68 0.87

21740-34579 1.97 3852762 235590 269893 1.63 0.87

21742-34594 1.97 3870426 230942 263390 1.63 0.88

21758-34760 1.97 3879541 231701 265410 1.69 0.87

21762-34808 1.97 3856786 232563 266528 1.65 0.87

21773-34925 1.98 3854533 231052 264436 1.63 0.87

21801-35317 1.98 3849638 230568 264540 1.63 0.87

21852-36011 1.97 3843176 233867 269094 1.64 0.87

21928-37301 1.98 3864011 230225 263159 1.66 0.87

21929-37306 1.98 3856194 228506 261769 1.64 0.87

21969-37814 1.98 3857447 228178 260905 1.61 0.87

21970-37819 1.97 3890212 235764 271405 1.73 0.87

22124-39156 1.97 3850019 231251 266194 1.63 0.87

466-14420 1.97 3841169 230605 265233 1.64 0.87

48-14470 1.97 3860078 233790 270255 1.63 0.87

532-14596 1.97 3848610 233887 270724 1.6 0.86

533-15609 1.98 3894765 237369 272333 1.67 0.87

613-14848 1.97 3865059 234991 272804 1.63 0.86

635-18522 1.97 3897875 236661 272032 1.73 0.87

768-18527 1.97 3847028 234264 268389 1.61 0.87

779-18300 1.97 3853334 234262 269104 1.62 0.87

871-19574 1.97 3833553 237336 272978 1.61 0.87

From Sheila on 2017-05-04

@robertb

Hi Robert,

I have not forgotten about your question. I need to check on a few things, and I will get back to you.

-Sheila

From Sheila on 2017-05-08

@robertb

Hi Robert,

Have a look [here](http://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset) for some tips. Also, the presentation on evaluating a callset [here](https://drive.google.com/drive/folders/0B7akc6CTmxIHbk1vUFlVSVRmN0k).

-Sheila

From kimianajafi on 2018-05-07

I was wondering what does PCT_CHIMERAS mean and also what does high PCT_CHIMERA suggests.

CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP

FIRST_OF_PAIR 58691137 58691137 1 0 58606594 0.99856 5665425977 53661171 5212591873 5093099935 0 0.004403 0.00381 0.000253 100.907548 58369435 0.995953 0 0.511244 0.338296 0

SECOND_OF_PAIR 58691137 58691137 1 0 58422666 0.995426 5419169605 52708088 4925051414 4809325127 0 0.005926 0.005137 0.000309 100.907548 58369435 0.999089 0 0.487461 0.370182 0

PAIR 117382274 117382274 1 0 117029260 0.996993 11084595582 106369259 10137643287 9902425062 0 0.005147 0.004455 0.00028 100.907548 116738870 0.997519 0 0.499371 0.354217 0

Kimia

From shlee on 2018-05-07

Hi @kimianajafi. Please refer to the [Picard metrics definitions](https://broadinstitute.github.io/picard/picard-metric-definitions.html).

From Rosmaninho on 2018-06-08

Hello,

My callset consists of 30 WES samples but includes 21 1000Genomes .bam files that I reconverted to fastq, aligned to hg19 and performed Haplotype calling against the SeqCap_EZ_Exome_v3_hg19_primary_targets.bed .

I used collectVariantCallingMetrics and GenotypeConcordance (Picard) to evaluate my callset against the NA12878 as a truth set.

With CollectVariantCallingMetrics I obtained some results that seem to be ok:

- TiTv ratio of approximately 2.6 for all samples.

- Total SNPs are around 50k for all samples .

- Total Indels are around 4k and Ins/Del ratio around 1.

However, the Genotype concordance of my samples is hovering around 0.4 for SNPs and 0.25 for Indels. What might be a possible reason for this? I’m not understanding the GenotypeConcordance parameters very well.

From Sheila on 2018-06-18

@Rosmaninho

Hi,

Can you post the exact command you ran for GenotypeConcordance?

Can you check the concordance of the 21 1000Genomes samples against NA12878?

Here are some steps you can try:

1) collect QC data (coverage, contamination, chimerism, sequencing artifact metrics)

2) look at the data (metrics, sequence data and variants)

3) look at the code/pipeline.

4) look at the data again.

-Sheila

From Rosmaninho on 2018-06-27

Sorry for the delay in replying Sheila… We had some issues in our computational clusters that put it offline and I couldn’t access my scripts.

Here’s the command I used:

srun shifter —image=broadinstitute/gatk:latest gatk —java-options ‘-Xmx8G’ GenotypeConcordance -CV=$vcf_files/cohort_output_snp_indels.vcf.gz -CS=EX181_sm -TV=$genome/NA12878.knowledgebase.snapshot.20131119.hg19.vcf -O=$workdir/EX181_cohort_output_snp_indels_concordance —TMP_DIR=$tmpdir

From Sheila on 2018-06-29

@Rosmaninho

Hi,

Is NA12878 one of your 1000Genomes samples? Can you try running your pipeline on an NA12878 test sample and seeing if that gives you good results?

-Sheila

From Rosmaninho on 2018-08-28

Hello,

No, NA12878 is not one of my 1000Genomes samples. I picked samples of an iberian background since they’re similar to my samples. However, I wouldn’t expect to have such a drastic difference compared to the gold Standard NA12878…

From Rosmaninho on 2018-08-30

Ok, so in the GATKwr23-G5-Callset_Evaluation presentation it mentions that the best truth set is a genotyping chip for the same sample.

Does this mean I should get a vcf file of the SeqCap_EZ_Exome_v3_hg19_UCSCBrowser to use as truth set?

From Rosmaninho on 2018-08-30

So I should run GC against one of my truth samples or include NA12878 in my set?

From Rosmaninho on 2018-09-05

Sheila said: >

Rosmaninho

> Hi,

> Is NA12878 one of your 1000Genomes samples? Can you try running your pipeline on an NA12878 test sample and seeing if that gives you good results?

> -Sheila

I download the NA12878 mapped .bam files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA12878/exome_alignment/ and reconverted it to fq.gz files to run through my pipeline together with my samples.

However, the genotype concordance is still very low. Should I provide an interval list with the primary targets from SeqCap_EZ_Exome_v3_hg19_primary_targets.bed?

I haven’t done that, but it’s the only thing that I can think of that might be affecting this…

From Sheila on 2018-09-06

@Rosmaninho

Hi,

>Should I provide an interval list with the primary targets from SeqCap_EZ_Exome_v3_hg19_primary_targets.bed?

Yes, can you check if that helps? I think it should.

-Sheila

From Rosmaninho on 2018-09-07

It didn’t help unfortunately. I converted the SeqCap_EZ_Exome_v3_hg19_primary_targets.bed into SeqCap_EZ_Exome_v3_hg19_primary_targets.interval_list using the ucsc.hg19.dict with PT BedToIntervalList.

From Rosmaninho on 2018-09-07

Could it be that after downloading the mapped .bam files from the 1000genomes ftp I am reconverting them to fq to re-run them through the pipeline together with my samples?

From Rosmaninho on 2018-09-11

Ok, I have a pair of twins and their gentype concordance in SNPs is 0.95, and in indels it is 0.8. I’ll assume these can serve as controls for my run and that it worked out ok.

From danb on 2019-05-10

The first sentence has a broken link.

Variant discovery -> https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2 -> 404

Report abuse

Evaluating the quality of a variant callset

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

Introduction

Methods for variant evaluation

Underlying assumptions and truthiness*: a note of caution

Validation

Matching populations

Variant evaluation metrics

Variant-level concordance and genotype concordance

Number of Indels & SNPs and TiTv Ratio

Ratio of Insertions to Deletions (Indel Ratio)

Tools for performing variant evaluation

VariantEval

GenotypeConcordance

Picard tools

Which tool should I use?

Underlying assumptions and truthiness^*: a note of caution