What is a VCF and how should I interpret it

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2012-08-06

This document describes "regular" VCF files produced for GERMLINE calls. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document. For information specific to SOMATIC calls, see the MuTect documentation.

Contents

What is VCF?
Basic structure of a VCF file
Interpreting the VCF file header information
Structure of variant call records
How the genotype and other sample-level information is represented
How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.
NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)
Don't write home-brewed VCF parsing scripts. It never ends well.

2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES] #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26 1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.

3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1 ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4 . . . ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##contig=<ID=chr1,length=249250621,assembly=b37> ##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in its FILTER field (see how records are structured further below).

FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

GATKCommandLine

The GATKCommandLine lines contain all the parameters that went used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]

4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.
ID: An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.
REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.
QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

INFO: Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by colons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.

5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
AD and DP : Allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.
PL : "Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities. We set the most likely genotype PL to 0 for easy reading purpose.The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.
GQ : Quality of the assigned genotype. The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but be may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Attachments:

basic-vcf-format.png

Updated on 2017-05-30

From Geraldine_VdAuwera on 2014-05-22

Comments made before 22 May 2014 have been moved to http://gatkforums.broadinstitute.org/discussion/4206/questions-about-interpreting-vcf-files/p1

From Heidihuang on 2014-07-03

Hello, Is there any information about non-reference counts? In my vcf file, the genotype field has 9 subfields including GT:AB AD:DOS:DP:GQ:NALT:NALTT:PL. I am confused by NALT and NALTT. Is NALT will equal to the total numbers of alt alleles in AD field? Or are there any other ideas to get the non-reference counts for every site of every sample? Thanks a lot!

The description of FORMAT field in my vcf file is as follows.

GT:AB:AD:DOS:DP:GQ:NALT:NALTT:PL

FORMAT=

From Heidihuang on 2014-07-03

@Geraldine_VdAuwera‌

Sorry for asking silly questions. In FORMAT field, AD corresponds to non-filtered counts, and DP is filtered. I was wondering which one is corresponding non-reference counts for DP in FORMAT field? If I want one corresponding non-reference counts and depth, no matter it is filtered or not, how can I extract them? Thanks a lot!

From Geraldine_VdAuwera on 2014-07-03

@Heidihuang‌ If you want to get the count of all reads that do not have the reference allele, you should take the sum of the AD elements minus the first AD element (which is the reference depth).

From Hasani on 2014-08-06

Over 70% of the links in this page are broken..could you please re-point to the right page?

Broken links are:

- AD (DepthPerAlleleBySample) and DP (Coverage).

- AC,AF,AN -> Chromosome Counts.

- DP -> Coverage.

- Dels -> SpanningDeletions.

- MQ and MQ0 -> RMS Mapping Quality and Mapping Quality Zero.

- BaseQualityRankSumTest -> Base Quality Rank Sum Test.

- MappingQualityRankSumTest -> Mapping Quality Rank Sum Test.

- ReadPosRankSumTest -> Read Position Rank Sum Test.

- HRun -> Homopolymer Run.

- HaplotypeScore -> Haplotype Score.

- QD -> Qual By Depth.

- FS -> Fisher Strand

Thank you in advance!

From Geraldine_VdAuwera on 2014-08-06

Sure, will do. In the meantime you can find those doc pages by going to Guide > Tool Documentation > Annotations

From Jack on 2014-09-19

Hi Geraldine! Recently I used GATK HaplotypeCaller to call SNPs, but I didn’t quite understand the output. I used HaplotypeCaller to generate g.vcf files and then genotype these g.vcf files together. Below is a few examples:

The first is :

`chr1 231884 rs316190603 C T 513.90 PASS AC=2;AF=1.00;AN=2;DB;DP=37;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.62 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0 ./.:.:5 ./.:.:20`

In this example, I wonder why GATK gives `./.:.:5` and `./.:.:20`, assigning no genotypes and allele depth (AD) for these two samples.I examined the read mapping, and found that both the second and the third sample have reads mapped at this site (see the attatchment).

The second example is :

`chr3 98144609 . A G 318.39 PASS AC=4;AF=0.667;AN=6;DP=0;FS=0.000;MLEAC=4;MLEAF=0.667;MQ=0.00;MQ0=0 GT:GQ:PL 1/1:3:133,3,0 0/1:23:38,0,23 0/1:3:175,0,3`

In this example, why the filed “DP” is 0, and GATK still assign genotypes for the samples?

Thanks!

read mapping.png

From Geraldine_VdAuwera on 2014-09-22

Hi @Jack,

For your first example, can you post the lines from the individual sample GVCFs? you should use the -bamout argument to see what the region looks like after reassembly. It’s possible that the reassembly process has moved the reads somewhere else and that the site is no longer covered.

For your second example, the DP annotation is the filtered depth, so it’s possible that there are reads there, but they fail the HC’s quality filters. When that happens, the HC will still try to genotype the samples, but you can see from the PL values that there is no real confidence about which genotype is correct.

From naymikm on 2014-11-19

Hi I am trying to make sense of this low quality snp:

chr1 1671866 rs7364986 A T,C 0.01 LowQual DB;DP=2;MQ0=0;NS=1 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 1/1:28.5,NaN:0,2,0,0,0,0:.:.:.

There are two ALT alleles but here the 1/1 indicates that only the T is being referred to, from my understanding. My question is why is the C ALT allele even listed? Is this just because it is such low quality the software is just noting that a C is just as likely here?

Thanks,

Marcus

From Sheila on 2014-11-19

@naymikm‌

Hi Marcus,

Can you tell me which tool you are using and which version of GATK you are using? Also, please post your exact command line.

Thanks,

Sheila

From thaleko on 2014-11-24

Hi,

Hope this isn’t too stupid a question – I am new to variant calling and have now gone through your RNA-seq VC workflow using my paired end illumina reads and b37 fasta (from your ftp server) as reference. The reads were mapped with the STAR 2-pass approach. I have now reached the variant calling/filtering step. My problem is: The VCF files that are produced only seem to give me variants from chromosome 20. (ie all entries have 20 in the #CHROM field).

Is there a problem with the reference used? Or am I just interpreting the vcf file all wrong? (If so, where can I find the variants from the remaining 22 chromosomes….?)

(Edit: I think I figured it out, the culprit is probably the -L 20 argument in the BQSR step… Sorry!)

From Sheila on 2014-11-24

@thaleko‌

Hi,

Can you please post the exact commands you used? Did you use -L in your commands?

If you used the -L because that is what we have in out tutorials, please note we only include the -L 20 in order to restrict analysis to a chromosome interval for demo purposes. For more details on using -L argument, you can refer here: https://www.broadinstitute.org/gatk/guide/article?id=4133

-Sheila

From jullfly on 2015-01-12

Hi,

I have not been able to find this information anywhere on this website or any of the websites for VCF file documentation, but I apologize if this information is given somewhere. I am wondering what order the ALT alleles (from VCF files made from HaplotypeCaller) are listed in: is ALT1 always the most frequent ALT allele and the other ALT alleles are listed in order of decreasing frequency?

Clarification would be greatly appreciated.

Thanks

From Geraldine_VdAuwera on 2015-01-12

@jullfly It’s a good question, I don’t think this is discussed explicitly anywhere. The VCF spec does not impose any logic to the ordering of alternate alleles. I can’t speak for other programs, but GATK itself does not incorporate any informative logic to the allele ordering — I forget if it is alphabetical or some random process, but it does not imply anything meaningful about allele frequency.

From jullfly on 2015-01-12

@Geraldine_VdAuwera Thanks a lot for your prompt reply! It would be easier if ALT alleles were listed in order of decreasing frequency for me to do some hard filtering with JEXL expressions, but I should still be able to work around it.

From Geraldine_VdAuwera on 2015-01-12

I see what you mean, but from a technical standpoint I can imagine a lot of things going wrong with that, so I don’t think it’s something we would want to build in. Maybe a separate annotation would do the trick. Now we just need someone to volunteer to write one ;)

From geoffroyGATK on 2015-02-09

Hi,

This is just for information. In the “What is VCF?” section, the link for VCF specification should be updated.

The VCF specification is no longer maintained by the 1000 Genomes Project. The group leading the management and expansion of the format is the Global Alliance for Genomics and Health Data Working group file format team, http://ga4gh.org/#/fileformats-team

From Geraldine_VdAuwera on 2015-02-09

Fair point (we are well acquainted with GA4GH); will update those docs accordingly.

From slowsmile on 2015-02-17

This page is very helpful but I have a followup question. What is the order in AD and PL if the variant is multi-allelic? in my case, I have

ref=G

alt=GA,GAA,GAAA,A

in one of the samples I have

GT=1/2,

AD=7 0 6 27 19 0 0 0 0 0

PL= 1111 216 215 351 0 516 704 248 419 709 704 248 419 709 709

What is the order of AD or PL values corresponding to each of the allele case?

From pdexheimer on 2015-02-17

@slowsmile – AD is pretty straightforward, I think it even says in the header. It’s ref allele first, followed by the alt alleles in the same order as the ALT column. This doesn’t match what you have in your example, though – you only have five alleles, but you show 10 values for AD. Not sure what’s going on there.

PL is far more complex. From the [VCF spec](http://samtools.github.io/hts-specs/VCFv4.2.pdf): “If A is the allele in REF and B,C,… are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc.”

So in your example, the ordering will be 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, 0/3, 1/3, 2/3, 3/3, 0/4, 1/4, 2/4, 3/4, 4/4

From Sheila on 2015-02-18

@slowsmile

Hi,

Sorry, I just deleted my previous post. I made a mistake, but @pdexheimer answered you correctly above.

-Sheila

From slowsmile on 2015-02-18

Sheila and pdexheimer , Thanks both of you. Your answer is very prompt and accurate. It helps me a lot. Best

From yoyorolls on 2015-02-26

Hi,

I’m new to GATK and I just tried out your best practice for getting variants from RNA-seq data for one human sample. Maybe I’m not understanding the vcf format correctly, but all my variants have either AC= 2, or 1; AF= 1, or 0.5, and all AN = 2. But the sample DP clearly has variants with DP >2. I don’t understand why AC and AF info doesn’t match up with sample DP. I thought Allelic Count + ref count = sample DP. Also, I don’t believe all the varants have allelic frequency of either 1 or 0.5, seeing that it’s RNA-seq.

I followed through the protocol, and did the base recalibration as well. The only difference I can think of is that for base recalibration, I used bed file for snp instead of vcf, and because the program didn’t like the way bed formats its position (eg. insertions would have the same start and end sites, entries with start site lower than the end site of previous entry etc), I changed it so that the end site is always start +1 bp. I also didn’t do the optional indel realignment. I did notice, that after marking duplicates, it said >80% of reads were filtered… Could that be the problem?

(RNA-seq was done with ion torrent, and ~30M reads. When mapping using STAR, I used snp-masked genome. In GATK, I used regular unmasked genome)

Thanks!

From pdexheimer on 2015-02-27

@yoyorolls -

“Allele Count” (AC) refers to the number of called alleles, not to the number of reads supporting an allele. With a single diploid sample, you’re only calling two alleles, and either one or both of them will be variant.

The number of reads supporting each allele are output in the FORMAT-level AD annotation

From yoyorolls on 2015-02-27

Ok, thanks!

Also, from previous discussions, it seems that AD is unfiltered while sample-DP is filtered. Is there a way to get the filtered number of reads supporting each allele? Or is it common practice to use the AD numbers to calculate allele frequency?

From Geraldine_VdAuwera on 2015-02-27

We don’t currently have official recommendations for calculating allele frequency in RNAseq, but basically, in the current state you would indeed use AD.

From yoyorolls on 2015-03-02

Alright~ thanks!

From sols on 2015-04-07

I have called variants for our RNA-Seq data following the best practice as mentioned. I am unable to interpret some of the entries in the vcf files. I am pasting the examples below.

chr1 1288459 . TG T 283.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.064;ClippingRankSum=-0.268;DP=50;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=35.71;MQ0=0;MQRankSum=0.268;QD=5.67;ReadPosRankSum=-0.755 GT:AD:DP:GQ:PL 0/1:25,23:48:99:321,0,344

Here, it states that the variant is heterozygous with the DP 48 and ref count AD 25 and alt count AD 23. Quality looks good and the confidence GQ is also very high.

Why are two nucleotides mentioned in REF column? How is it heterozygous if its T in REF and T in ALT columns?

I also find the similar entries with more than two nucleotides in the REF column.

chr1 160209736 . AAGG A 591.52 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.804;ClippingRankSum=-0.291;DP=44;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=36.53;MQ0=0;MQRankSum=-0.222;QD=13.44;ReadPosRankSum=3.505 GT:AD:DP:GQ:PL 0/1:6,38:44:7:628,0,7

Lastly, the same behavior is also observed the other way around.

chr1 1153898 . C CT 66.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.787;ClippingRankSum=-0.208;DP=77;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=45.50;MQ0=0;MQRankSum=1.633;QD=0.87;ReadPosRankSum=-0.659 GT:AD:DP:GQ:PL 0/1:50,23:73:99:104,0,547

Can you please help me interpret it.

Thanks.

From Geraldine_VdAuwera on 2015-04-07

Hi @sols, those are simply insertions and deletions (indels). The VCF format specification has some helpful explanations for interpreting these records.

From Sheila on 2015-04-07

@sols

Hi,

The first two examples you posted above are of deletions at the site. The third example you posted is of an insertion at the site.

In your first example, the reference has TG at position 1288459, but the sample has an alternate allele of T, meaning the G has been deleted.

In the last example, the reference has a C, but the sample has an alternate allele of CT, meaning there is a T insertion.

I hope this helps.

-Sheila

From sols on 2015-04-07

Thank you. I got it

From Maguelonne on 2015-05-07

Hi,

To have full information, I added the “StrandBiasBySample” option for the calling. Since the “StrandBiasBySample” documentation specifies that it gives the “numbers of forward and reverse reads that support REF and ALT alleles”, I’m wondering how to deal with a multi-allelic variant. To be sure, if a sample’s genotype is 1/2, do the “DP1,DP2,DP3,DP4” given by this option correspond to the numbers of forward and reverse reads that support alternative alleles “1” and “2”?

Thanks

From Sheila on 2015-05-07

@Maguelonne

Hi,

Unfortunately, StrandBiasBySample only gives the ref-forward, ref-reverse, first alt-forward, first alt-reverse. So, it does not give the two alt alleles.

The better annotation to use is StrandAlleleCountsBySample. It gives you the forward and reverse counts of all alleles (ref and all alternate alleles) in order.

-Sheila

From SV_487 on 2015-05-18

Hi,

VCF can be especially unwieldy and inefficient when dealing, for example, with rare variants, or when conducting a variation discovery (i.e. non-genotyping) study. Consider a study with thousands of samples, in which most variants are either rare (seen in only one or two samples) or are defined by just one observation because they are being discovered for the first time. VCF files for these data consist mostly of thousands upon thousands of null genotypes, drastically expanding storage requirements while providing no added informational benefit.

Instead of using the genotype column format / approach, it would seem to make sense in this case to use INFO tags to list the one (or perhaps two) samples in which a variant was seen, and dispense with the genotype columns. But I don’t see provisions for this in the VCF spec.

What approach would you use if you had to deal primarily with this type of data in VCF? Is there an alternative format or tool of which I am not aware?

Thank you

From Geraldine_VdAuwera on 2015-05-18

@SV_487 Currently we’re just grinning and bearing it, e.g. for the ExAC dataset the VCF file is just ridiculous, but we don’t have a better solution yet. Working on developing something better but it will be a while before that’s ready to even discuss (it’s exploratory work).

From freyes on 2015-05-21

Hi, regards from Mexico,

About VCF text in ’2. Basic structure of a VCF file’ you explain that first two variants show ‘very high confidence’ and third variant show ‘relative low confidence’. Please tell me are there specific numeric values that constitute the boundaries between ‘low confidence’ ‘relative low confidence’, ‘confidence’, ‘very high confidence’, etc.?, how could I define these boundaries as objectively as possible? or do these labels are totally subjective?

Thank you very much in advance for your kind answer.

From Geraldine_VdAuwera on 2015-05-22

@freyes These are subjective judgments based on experience. It’s very difficult to define objective boundaries because it depends a lot on the dataset and experiment, and some semi-random technical aspects. That’s why we use some filtering tools (see VQSR, variant recalibration in the Best Practices documentation) that use machine learning for filtering variants in a way that balances the tradeoff between sensitivity and specificity.

From freyes on 2015-05-25

Thank you very much for your answer Geraldine!, you are very kind!

From gideonite on 2015-07-27

The links are still broken on this page.

From Sheila on 2015-07-27

@gideonite

Hi,

Sorry about that. I will try to get them fixed soon. In the meantime, you can go here: https://www.broadinstitute.org/gatk/guide/tooldocs/ and look under Variant Annotations for the annotation explanations.

-Sheila

From Geraldine_VdAuwera on 2015-07-27

Tool docs links have been fixed.

From Geraldine_VdAuwera on 2015-07-29

Also, doc has been updated.

From vsvinti on 2015-08-27

Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don’t like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

eg.

1 789256 rs3131939 T C,* 17024.46 PASS AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum

From Zaag on 2015-08-28

> @vsvinti said:

> Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don’t like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

> eg.

> 1 789256 rs3131939 T C,* 17024.46 PASS AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum >

I believe it means the SNP lives underneath a deletion from another sample in the vcf.

From EADG on 2015-09-28

@Geraldine_VdAuwera

The link to variantsToTable seems to be broken. (Firefox 40.0.3)

Greetings EADG

From Geraldine_VdAuwera on 2015-09-28

Thanks for reporting it, @EADG — I’ll try to get this fixed asap.

From Geraldine_VdAuwera on 2015-09-28

File was corrupted — fixed now. Thanks for reporting.

From Santy_8128 on 2015-12-17

> Keep in mind, if you’re not familiar with the statistical lingo, that when we say PL is the “likelihood of the genotype”, we mean it is “the probability that the genotype is not correct”. That’s why the smaller the value, the better it is.

I think that is incorrect. PL values are the NEGATIVE of 10 times the log (base 10) of the genotype likelihood. Since its negative, small values imply higher probability. For e.g. PL of 0 means genotype likelihood of 1, while PL of 255 (minimum possible value) means almost 0. However, one must note that genotype likelihood GL (of say 0/1) does not mean probability of the genotype 0/1 (coz that would imply that the GLs should add up to 1, which they don’t) but in fact it means probability of reads given that genotype. Since, the true genotype is something we never know, we treat it like an unknown parameter and look at probabilities of reads given genotypes and not the other way round.

From Sheila on 2015-12-17

@Santy_8128

Hi,

The PL is Phred-scaled. If you look at [this article](http://gatkforums.broadinstitute.org/discussion/4260/how-should-i-interpret-phred-scaled-quality-scores#latest) you can see the Phred-scaled score takes into account the error probability. So, usually if a Phred-scaled score is high, we are more likely to believe the outcome is correct. In the case of the PL, the lower the Phred-scaled score, the more likely we are to believe the genotype is correct. The reason we say that the PL reflects “the probability that the genotype is not correct” is because a Phred-scaled of 0 indicates an error probability of 1.

The PL of 0 for the most likely genotype makes it easy to find in the PL field, so that is why it was chosen. It is easier than having to check for the highest PL value.

-Sheila

From MUHAMMADSOHAILRAZA on 2015-12-23

Hi @Sheila

Under:

5. How the genotype and other sample-level information is represented >> GQ : Quality of the assigned genotype

I quote: “The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). “

here how 10^(-2.6) or 0.0025 is calculated? from where -2.6 come from? i am sorry i don’t know about GQ, PL value calculations

From Geraldine_VdAuwera on 2015-12-23

This shows how to work backward through the calculation to the value obtained from the genotyping engine. Say the genotyping found a probability of 0.0025, the phred scaling transformation produces the value of 26 as explained in the FAQ on Phred scaling.

From everestial007 on 2016-03-18

It may be a silly question, but I am little confused with the concept of AC (allele count) and AN (allele number).

AC – is the number of allele called in a sample, and AN – is the total number of alleles in the called genotype. **

So, for the portion of vcf file shown below;** for 1st position, shouldn’t the AC=1 (number of allele called for this sample at this locus) since its genotype is 1/1 and AN = 2 (since its diploid there are 2 alleles in total at this locus).

and for 3rd position AC = 2 (since two alleles were called) and AN = 2 (there are 2 alleles in total) for genotype 0/1.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

211000022279114 1152 . A C 22.76 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=31.62;QD=11.38;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:50,6,0

211000022279114 1617 . T C 45.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.12;QD=22.87;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:73,6,0

211000022280187 11226 . C T 112.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.922;ClippingRankSum=0.000;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=50.94;MQRankSum=0.000;QD=14.10;ReadPosRankSum=-0.421;SOR=1.981 GT:AD:DP:GQ:PL 0/1:4,4:8:99:141,0,15

What am I not understanding here ?)

- Thanks,

From everestial007 on 2016-03-18

Another question:

When calling raw variants using HaplotypeCaller-GATK 3.5 version I got this warning message:

_WARN 21:15:29,572 ExactAFCalculator – this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 2L:7942137 has 8 alternate alleles so only the top alleles will be used; see the —max_alternate_alleles argument. This warning message is output just once per run and further warnings will be suppressed unless the DEBUG logging level is used. _

I think this has to do with the site being repetitive?? since I called variant from one diploid sample any sites shouldn’t have more than 2 types of alleles?? And I would like to eliminate any variants called at this sites – I was planning to use the —filterExpression AN > 2 for the purpose.

But, before filtering I checked the raw variants at this position (2L:7942137) and find that there are only two alleles reported – In the warning message it clearly says it called 8 alternate alleles and used 6 top alternate alleles.

2L 7942137 . A ATTT 276.73 . AC=2;AF=1.00;AN=2;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=24.48;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,7:7:24:285,24,0

Thanks in advance !

From Sheila on 2016-03-21

@everestial007

Hi,

The AC and AN are a little confusing, I agree. The AC just gives you the number of times the alternate allele shows up in the genotype. For example, let’s say I have this site:

`20 10008950 . AACACACACACAC A,AACACACACAC 4319.86 . AC=4,2;AF=0.667,0.333;AN=6;BaseQRankSum=-1.317;DP=83;FS=3.768;MLEAC=4,2;MLEAF=0.667,0.333;MQ=59.55;MQ0=0;MQRankSum=1.178;QD=18.02;RPA=23,17,22;RU=AC;ReadPosRankSum=1.455;SOR=0.829;STR GT:AD:DP:GQ:PL 1/2:0,6,11:28:99:1202,422,1392,711,0,661 1/2:1,5,12:27:99:1062,423,1376,574,0,540 1/1:0,15,0:28:45:2091,45,0,1837,45,1792`

Notice the two alternate alleles. The first alternate allele shows up 4 times in the genotypes. The second alternate allele shows up 2 times. So, the AC is 4,2 to represent that.

The AN basically gives you the number of alleles in the called genotypes. For example, let’s say I have this site:

`1 30923 . G T 64.17 . AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.08;SOR=0.693 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0`

One of the samples is no-call, and the other sample has two alleles called. The AN shows 2. This can be useful for filtering when you want to fail sites that have more than half the samples with no-calls.

I hope this makes sense.

-Sheila

From Sheila on 2016-03-21

@everestial007

Hi again,

For your second question about the WARN for alternate alleles, that is just letting you know there were more than the default alternate alleles found at that site. You can change the default with [-maxAltAlleles](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—max_alternate_alleles).

HaplotypeCaller evaluates evidence for all possible alternate alleles, but in your case the only two alleles output were the ones that had enough evidence to be reported in the GVCF. (The other alleles probably did not have enough evidence to be output).

-Sheila

From everestial007 on 2016-03-21

@Sheila

Thanks much for being so much helpful again. This clarification made a lots of difference to me.

Regarding the sites with warning I will check visually if variants from this sites should be ignored.

- K

From BobHarris on 2016-03-29

(This may not be the right page to ask this, but I didn’t find a better one)

Is there any fast way to test whether a .vcf file and/or its .idx file are corrupt?

I’ve been running many variant calling jobs on a SLURM cluster, and (sadly) it has been difficult to determine which jobs succeeded and which did not. I’ve been trying to look at the end of the index files with hex dump and what I’m seeing may indicate that some of them are truncated. This would suggest the corresponding job did not finish successfully, and I would throw out the vcf and idx and try again.

But I don’t really know what the idx file is supposed to look like so it’s hard to make a decision based on its content.

I’ve tried to find a formal spec for the .vcf.idx file, but as yet have not found one. Does it exist?

Or, does GATK have a VerifyVcf tool?

Bob H

From Sheila on 2016-03-30

@BobHarris

Hi Bob H,

You can use [ValidateVariants](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php) to validate your VCF. I’m not aware of any vcf.idx validation tools. However, if your VCF validates, you can simply delete the .idx file and GATK will regenerate one for you :smiley:

Sheila

From BobHarris on 2016-03-30

Thanks @Sheila . But if a vcf file is truncated, isn’t there a chance it will pass the validation? There doesn’t seem to be anything in the file spec that indicates a proper ending, nor anything in the header that indicates the number of records to follow. So if a variant calling job failed, how can the validator know?

From Geraldine_VdAuwera on 2016-03-30

If a file is truncated (eg because transfer or writing was interrupted externally) it will lack an appropriate EOF (end of file) encoding, so the validator should be able to pick it up.

From BobHarris on 2016-03-30

Thanks @Geraldine_VdAuwera . But I don’t see any EOF encoding indicated in the vcf file spec, and I don’t see an ascii EOF character (hex 04) at the end of any of my vcf files. The spec I’m looking at is here:

http://samtools.github.io/hts-specs/VCFv4.2.pdf (my files indicate VCFv4.2 in the header).

I think Sheila interpreted my question as a concern about .idx files being corrupt. What I’m trying to do is find a way to tell if the job that created the .idx (and .vcf) completed successfully or not, so that I can know whether I need to rerun it.

Unfortunately the cluster I am using is is not reliable for reporting job success/failure.

The “writing was interrupted externally” is the case I am worried about, where the interrupt happened because the job died (e.g. because it needed more memory than was available) or was killed (apparently on SLURM if too much non-resident memory is in use the job managers starts killing jobs).

Thanks for your help,

Bob H

From Geraldine_VdAuwera on 2016-03-30

VCF spec itself doesn’t specify an EOF marker but I’d be surprised if there wasn’t one at all. Have you tried testing this? You can start a job that writes a VCF and kill the process mid-run. Then run the file (which should be invalid) through the validator and see if it complains. Or, have your pipelining system look for exit codes. If the run completed successfully GATK will output exit code 0; if something failed internally it will output exit code 1, and if anything happened to shut it down from outside it should return some other non-0 exit code.

From BobHarris on 2016-03-30

@Geraldine_VdAuwera I can try some tests related to EOF. I think programs that write an actual EOF character are pretty rare in *nix realm though. All the text-based bioinformatics formats I’ve looked at (with one exception) don’t have any file terminator specified. Some of the binary formats do (e.g. BAM), which is why I was hoping to find a spec for the .vcf.idx file. It’s increasingly apparent that’s a dead end, but surely a spec has to exist somewhere(!).

Regarding the exit codes, I can build something like that into my job scripts in the future. At present I have 687 vcf files (and .idx for each), about 10% of my project (for one of three species), and I’d like to avoid rerunning most of these if I can assure myself they are correct.

Thanks,

Bob H

From Geraldine_VdAuwera on 2016-03-30

Hmm that’s fair, my expectation of EOFs may be a naive one — I’ve never actually looked for them.

In case you’re open to migrating to a different pipelining system — we’re about to announce a new pipelining solution which you can read more about here: https://software.broadinstitute.org/wdl/. It includes checking for proper job completion. We’re in the process of migrating our production pipelines to this system and we will be providing support to anyone else who wants to use it.

From mbk0asis on 2016-04-16

Some data in my vcf files have 1/2 at genotype field. What does that mean? Thank you.

From Sheila on 2016-04-17

@mbk0asis

Hi,

If you look at the reported alleles at the site, you will see more than one alternate allele. The 1/2 genotype means the genotype is heterozygous for the first alternate allele and second alternate allele. Have a look at [the VCF spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf) under “1.4.2 Genotype fields” for more information.

-Sheila

From Marrospi on 2016-07-01

Hi,

I have 160.000 variants and all of them have “QUAL” = 0

How I can correct?

Thanks.

From Geraldine_VdAuwera on 2016-07-01

@Marrospi It’s not possible to fix that, you need to redo the analysis. How was this vcf generated?

From Marrospi on 2016-07-01

for every .sam for each sample:

1-Samtools import to get the .bam file

2-AddOrReplaceReadGroups

3-PICARD SortSam

after, for each sample:

1-PICARD MergeSamFiles

2-GATK BaseRecalibrator

3-GATK PrintReads

finally,

java -jar GATK -T MuTect2 -R fast.fast -I:tumor inputFile1.bam -I:normal inputFile2.bam —dbsnp dbsnp.vcf -o outputFileMuTec.vcf

From Geraldine_VdAuwera on 2016-07-01

Oh, that’s fine then (I assumed you were calling germline variants). MuTect doesn’t emit QUAL scores. The lod annotations are what you should look at.

From lcscs on 2016-07-06

Hi @Geraldine_VdAuwera, what is the difference between 0|1 and 1|0 in GT field?

From Sheila on 2016-07-06

@lcscs

Hi,

I think you will find [this article](http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing) helpful. Especially, have a look under “More detailed aspects of semantics of phasing in the VCF format”

-Sheila

From Marrospi on 2016-07-11

Thanks for your answer Geraldine, It has helped me a lot.

How can I get a value similar to a quality? My VCF file has the next information:

chr1 240188328 rs4659948 T C . germline_risk DB;ECNT=1;HCNT=2;MAX_ED=.;MIN_ED=.;NLOD=4.51;TLOD=17.13 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:15,7:0.318:4:3:0.571:477,200:5:10 0/0:15,0:0.00:0:0:.:485,0:11:4

chr1 240188711 rs6429186 A G . clustered_events DB;ECNT=2;HCNT=2;MAX_ED=59;MIN_ED=59;NLOD=8.13;TLOD=38.88 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:16,13:0.448:7:6:0.462:491,413:8:8 0/0:27,0:0.00:0:0:.:858,0:11:16

chr1 240188770 rs7530912 A G . clustered_events;t_lod_fstar DB;ECNT=2;HCNT=2;MAX_ED=59;MIN_ED=59;NLOD=5.72;TLOD=6.28 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:13,3:0.188:0:3:1.00:412,82:7:6 0/0:19,0:0.00:0:0:.:597,0:9:10

From Geraldine_VdAuwera on 2016-07-11

@Marrospi This article shows the typical content of VCFs produced for germline calls; I just added a note to clarify that at the top of the page. Your VCF was produced by MuTect for somatic calls, which are a bit different in some respects. One major difference is that they don’t have a QUAL score. Instead, you can look at the NLOD and TLOD values. See the MuTect documentation for more information.

From flytrap on 2016-08-17

been hinted at in a question above, but just to be explicit:

is DP really “the number of filtered reads that support each of the reported alleles”, or maybe more “the number of filtered reads across all the reported alleles”?

they seem to be single values.

thanks.

From Sheila on 2016-08-17

@flytrap

Hi,

There are two different DP fields in the VCF.

Have a look [here](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php) and [here](https://software.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_DepthPerAlleleBySample.php) for more information.

-Sheila

From 570932004 on 2016-09-22

Just so you know… Two links from Best-Practices page has been messed up.

https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS&p=3

Frequently Asked Questions

“What is a VCF and how should I interpret it?” Actually points to “Which tools use pedigree information?” , and vice versa.

From Sheila on 2016-09-28

@570932004

Hi,

Thank you for pointing it out :smiley:

We will fix it soon.

Thanks,

Sheila

From robertb on 2017-03-02

Hi,

If someone could help clarify this issue I’d really appreciate it. Thanks.

I had a multisample VCF that I split into samples with selectvar and then sent to tables VAriantstoTables using the splitmultiallelic option. I noticed that for some samples I will have variants with more than 3 PL values such 6 values.

Firstly, I annotated with the VEP which is where the ALLELE comes from. Now of course I noticed that for the first sample below the individual is two different alternate alleles ( A and T ) so I just assumed that the PLs were for those two variants.But the values don’t have the right format. You can see one set of triplets has a zero value ( the most likely genotype ) but the other does not. And the VEP calls all ALTs A. Furthermore the second individual has one ALT and 6 values.

Why is the VEP calling things A when we have ALT is * or ALT is T? ( o.k you aren’t expected to know that one but even a guess would be helpful). It matters because when you count the ALT alleles at this site for the first individual, for example, do you count one ALT or two ALT? There a number of sites likes this I need to know what’s going on here to proceed.

SAMPLE_ID SEX CHROM POS ID GQ PL REF ALT AC HET HOM ALLELE

1256-20693 XX chr1 17215982 rs3738617 99 2196,527,381,1668,0,1620 C A 1 1 0 A

1256-20693 XX chr1 17215982 rs3738617 99 2196,527,381,1668,0,1620 C T 1 1 0 A

1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A

From robertb on 2017-03-02

I should add that prior to splitting my multi sample VCF into sample level VCFs I used SelectVar to remove all nonvariant and low quality sites. That step was repeated again with the sample level VCFs. I’ve also taken a closer look and seen some strange PLs like all zeros 0,0,0 or two zeros 0,0,258.

From Sheila on 2017-03-03

@robertb

Hi,

Can you please post the corresponding VCF records for those sites?

Thanks,

Sheila

From robertb on 2017-03-12

Problem solved, maybe. Here’s line from the format genotype field for a multiallelic site in the VCF:

GT:AD:DP:FT:GQ:PL:PP

0/2:6,0,2:8:PASS:38:53,64,108,0,44,38:53,64,108,0,44,38

Here’s a similar site from the table:

CHROM POS ID REF ALT HET HOM AC

chr1 26211207 rs159529 A AAG 73 39 131

sample pp ref alt genotypes

1034-19683 337,8,0,343,33,368 A AAG AAG/AAG

1291-21555 154,0,49,160,65,224 A AAG A/AAG

1442-22354 753,66,0,753,66,754 A AAG AAG/AAG

1454-21674 734,58,0,736,66,744 A AAG AAG/AAG

1510-21940 452,39,0,452,39,452 A AAG AAG/AAG

chr1 26211207 rs159529 A G 73 39 20

1986-24087 53,63,133,0,70,64 A G A/G

1987-24091 208,0,264,238,290,529 A G A/AAG

1988-24094 122,0,242,146,257,403 A G A/AAG

1989-24099 107,0,200,128,212,340 A G A/AAG

19991-31601 127,0,268,154,284,438 A G A/AAG

20010-31644 151,177,381,0,205,187 A G A/G

I’m filtering genotypes and if a sample does not pass filter ( gq > 20 and dp > 10 ) then I need to adjust the AC and HET/HOM counts accordingly. Some variants even have three alternates and the PP field has 10 values (below). I’m not sure how interpret these. I mean the first three pp values correspond to 0/0, 0/1, 1/1 genotypes but what the next three or seven or more correspond to I’m not sure at this point. Apparently the fourth value is het-alt for the second alternate and the fifth value is hom-alt for the second alternate but what about the sixth? Not sure? And the tenth? I’ve been searching the web for answers but finding nothing. I someone could clarify this please I’d appreciate it. thanks.

CHROM POS ID REF ALT HET HOM AC chr1 161035658 rs376275626 CTTAT CTTATTTAT 16 7 15 REF ALT GT 1034-19683 121,127,226,0,100,90,127,226,100,226 CTTATCTTATTTAT CTTAT/C 1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT CTTATTTAT CTTAT/C 15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT CTTATTTATCTTAT/CTTATTTAT 1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT CTTATTTAT CTTATTTAT/C 1876-23644 79,85,185,0,100,93,85,185,100,185 CTTATCTTATTTAT CTTAT/C 19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT CTTATTTAT CTTATTTAT/CTTATTTAT 2007-24164 158,161,198,0,37,24,161,198,37,198 CTTATCTTATTTAT CTTAT/C 21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTAT CTTAT/CTTATTTAT 21762-34808 179,12,0,180,12,180,180,12,180,180 CTTATCTTATTTAT CTTATTTAT/CTTATTTAT 21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTAT CTTAT/CTTATTTAT 466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT CTTATTTAT CTTAT/C 779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT CTTATTTAT CTTATTTAT/CTTATTTAT

chr1 161035658 rs376275626 CTTAT C 16 7 13

1034-19683 121,127,226,0,100,90,127,226,100,226 CTTAT C CTTAT/C

1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT C CTTAT/C

15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT C CTTAT/CTTATTTAT

1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT C CTTATTTAT/C

1876-23644 79,85,185,0,100,93,85,185,100,185 CTTAT C CTTAT/C

19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT C CTTATTTAT/CTTATTTAT

2007-24164 158,161,198,0,37,24,161,198,37,198 CTTAT C CTTAT/C

21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT C CTTAT/CTTATTTAT

21762-34808 179,12,0,180,12,180,180,12,180,180 CTTAT C CTTATTTAT/CTTATTTAT

21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT C CTTAT/CTTATTTAT

466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT C CTTAT/C

779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT C CTTATTTAT/CTTATTTAT

chr1 161035658 rs376275626 CTTAT CTTATTTATTTAT 16 7 3

1034-19683 121,127,226,0,100,90,127,226,100,226 CTTAT CTTATTTATTTAT CTTAT/C

1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT CTTATTTATTTAT CTTAT/C

15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT

1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT CTTATTTATTTAT CTTATTTAT/C

1876-23644 79,85,185,0,100,93,85,185,100,185 CTTAT CTTATTTATTTAT CTTAT/C

19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT

2007-24164 158,161,198,0,37,24,161,198,37,198 CTTAT CTTATTTATTTAT CTTAT/C

21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT

21762-34808 179,12,0,180,12,180,180,12,180,180 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT

21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT

466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT CTTATTTATTTAT CTTAT/C

779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT

From robertb on 2017-03-14

here’s what’s going on and in case anyone reads this the answer is that for a split multiallelic site the PP or PL values will correspond to: 0/0,0/1,1/1,0/2,1/2,2/2,0/3,1/3,2/3,3/3 for a site with three alternates and hence to ten values I was asking about.

From Sheila on 2017-03-15

@robertb

Hi,

Thanks for posting your solution. I was a little confused what you were asking at first, but now I see. The [VCF spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf) gives this definition which may help others as well:

`GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible

genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same

ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in

REF and B,C,… are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by:

F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the

ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)`

-Sheila

From mjtiv on 2017-10-26

Let’s say “someone”, don’t want to incriminate anyone is writing a custom VCF parsing script to specifically look for allele specific expression (ASE) in variants, we have already applied GATK filters (FS>30, QD<20, DP (filterExpression) <100). All the variants that failed this criteria were removed and some other filters were performed like to remove Indels and SNPs close to Indels etc.

Now we are specifically looking at the GT calls per sample (0/1 etc.), how reliable are those calls if we select them to identify heterozygous calls (0/1, 1/2 etc)? One of the filters of the program assumes these calls are correct and filters homozygous samples.

Building on this question if we then use AD (unfiltered allele depth) counts to examine for ASE (example 0=20 counts and 1=50 counts) and test using a binomial test (scipy binom test in python) to get a P-value, is this o.k.? We do filter samples if one allele count is <1% of total read counts. The slight concern is AD is unfiltered allele depth, so it could be later misleading if the reads are not good.

Should we add another layer of filtering based on DP, GQ or PL columns? Or is this method the best we can do at the time with the data type we have?

From Sheila on 2017-10-31

@mjtiv

Hi,

Have you looked into [ASEReadCounter](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_rnaseq_ASEReadCounter.php)?

>how reliable are those calls if we select them to identify heterozygous calls (0/1, 1/2 etc)?

If you look at the GQ field, that will tell you how much confidence is in the genotype. You may be interested in the [Genotype Refinement workflow](https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow) if you need the genotypes to be as accurate as possible. [This article](https://gatkforums.broadinstitute.org/gatk/discussion/4860/what-is-the-difference-between-qual-and-gq-annotations) should help as well.

>if we then use AD (unfiltered allele depth) counts to examine for ASE (example 0=20 counts and 1=50 counts) and test using a binomial test (scipy binom test in python) to get a P-value, is this o.k.?

I think ASEReadCounter will help you :smile:

-Sheila

From mjtiv on 2017-11-02

@Sheila thanks for the links, earlier in the year (January 2017) we did try to use ASEReadCounter and then Mamba, but Mamba was very problematic in installing, so we gave up (emailed the group too with no success). But, I am going to dig into all the links and see if I can modify our code and adapt to what other programs are doing.

From Jaspreet on 2017-12-06

Hi,

I have a query regarding start and end positions to be used for indels while annotating using oncotator. Acc to my understanding of your statement “for deletions the position given is actually the base preceding the event” I am using the following coordinates in maf file for indel shown in vcf:

VFC FILE:

POS REF ALT

12345 T TAGT

34567 TACGG T

MAF FILE:

Start End Ref Alt

12345 12346 T TAGT

34568 34571 TACGG T

I am not sure that coordinates I am taking for MAF file are correct. It will be a great help if you can walk me out with this confusion.

Thanks, looking forward to your response.

From Sheila on 2017-12-12

@Jaspreet

Hi,

I will have out Oncotator expert get back to you.

-Sheila

From LeeTL1220 on 2017-12-12

@Jaspreet Actually, if it is “T” —> “TCGAT” in a VCF, it should be a “-” —> “CGAT” in MAF.

Your coordinates look correct to me.

From sheridanlgrant on 2018-06-25

“In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.”

Presumably this is a typo and the second combinations should be A/G?

From Sheila on 2018-06-26

@sheridanlgrant

Hi,

No, the G/G genotype is correct. Notice the GT field is 1/1. The 0 allele is the reference allele and the 1 allele is the alt allele.

I hope this helps.

-Sheila

From picard_gatk_mj on 2018-06-27

GT:AD:DP:GQ:PL 0/1:2,16:18:14:671,0,14 is this result good?

From wjy on 2019-03-08

Hi, I want to extract the GT field, then I use this commond

java -jar GenomeAnalysisTK.jar \

-R trinity_genes.fasta \

-T VariantsToTable \

-V filtered_output.vcf \

-F CHROM -F POS -F REF -F ALT -F FILTER -F AC -F DP -F FS -F QD -F GT \

-o results.table

But in the results.table, the GT column were NA. How to fix this problem?

Thanks.

From heskett on 2019-03-27

AD vs DP definitions are very vague here.

From what I can distill: AD = Informative reads supporting each allele including those that don’t pass filters. PROBLEM: why would reads not passing filters be used to support an allele?

DP = Number of filtered reads supporting each allele including uninformative reads. PROBLEM: It doesn’t make logical sense to assign an uninformative read to an allele, if by definition the read is uninformative and can’t be used for this exact purpose.

Seems like what we would want is FILTERED reads that are INFORMATIVE.

From ngerald on 2019-06-11

The format column of my VCF file (Generated by PostprocessGermlineCNV) contains GT:CN:NP:QA:QS:QSE:QSS.

GT – Genotype

CN – Copy Number

What do the rest of the acronyms mean? I’ve tried looking at the documentation and other forums, but couldn’t find a clear description of these. Could someone help me out with this?

From Andrius on 2019-11-05

sorry, please delete this

From dgewing on 2019-11-19

if the GT is 1/1 and the AD=14,4: what does the 14,4 indicate?

Report abuse