gVCF Conventions

This page summarizes the set of conventions used in genome VCF (gVCF), which is a specialization of the VCF 4.1 format.  

Version

20120906a

1 Introduction:

Any VCF file following the gVCF convention combines information on variant calls (SNVs and small indels) with genotype and read depth information for all non-variant positions in the reference. Integrating this information into a single file makes it straightforward to distinguish variant, reference and no-call states for any site of interest.

1.1 Interpretation:

The gVCF file can be interpreted in multiple ways:

(1)    Fast interpretation: As a discrete classification of the genome into ‘variant’, ‘reference’ and ‘no-call’ loci.

This is the simplest way to use the gVCF. The Filter fields for the gVCF file have already been set to mark uncertain calls as filtered for both variant and non-variant positions, so a simple analysis can be performed to look for all loci with a filter value of “PASS”, and treat these as called.

(2)    Research interpretation: As a ‘statistical genome’.

Additional fields, such as genotype quality, are provided for both variant and reference positions. These allow the threshold between called and uncalled sites to be varied, or more stringent criteria to be applied to a set of loci from an initial screen.
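The fast interpretation in (1) can be sketched in a few lines of Python. This is a minimal illustration, not part of gvcftools: the function name classify_record is hypothetical, and it assumes that an ALT value of "." marks a non-variant site or block.

```python
def classify_record(line):
    """Return 'variant', 'reference' or 'no-call' for one gVCF data line."""
    fields = line.rstrip("\n").split("\t")
    alt, filters = fields[4], fields[6]
    if filters != "PASS":
        # Any filtered record is treated as uncalled.
        return "no-call"
    # Assumption: ALT of "." marks a non-variant site or block.
    return "reference" if alt == "." else "variant"
```

The research interpretation in (2) would replace the fixed FILTER test with a user-chosen threshold on a quality field.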

1.2 Use with external tools:

gVCF is written to the VCF 4.1 spec, so any tool compatible with this spec, such as IGV and tabix, can use the file. Note, however, that certain tools may (1) apply algorithms to VCF files which only make sense for variant calls (as opposed to the variant and non-variant regions in the full gVCF) or (2) be computationally feasible only for variant calls. For these cases it is recommended to extract the variant calls from the full gVCF file.
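As a rough sketch of what extracting the variant calls could look like, the hypothetical helper below keeps header lines and drops every record whose ALT column is ".". It assumes that non-variant sites and blocks are exactly the records with ALT equal to "."; it is an illustration only, not the gvcftools implementation.

```python
def extract_variant_lines(lines):
    """Yield header lines and records whose ALT column is not '.'."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[4] != ".":
            yield line
```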


2 Conventions:

Note that the gVCF conventions are written with the assumption that only one sample per file is being represented. 

2.1 Representation of non-variant segments:

2.1.1 Block representation using END key:

Continuous non-variant segments of the genome can be represented as single records in gVCF. These records use the standard "END" INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field to reduce file size. A simplified example of a non-variant block record is as follows:

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA19238
chr1	51845	.	A	.	.	PASS	END=51862

The example record spans positions [51845,51862].
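The span of a record can be recovered from POS and the END key as in the following sketch. The function name block_span is hypothetical; for records without an END key it falls back to the length of the REF allele, which is the usual VCF reading.

```python
def block_span(line):
    """Return the inclusive (start, end) positions covered by one record."""
    fields = line.rstrip("\n").split("\t")
    pos, ref, info = int(fields[1]), fields[3], fields[7]
    end = pos + len(ref) - 1  # default when no END key is present
    for entry in info.split(";"):
        if entry.startswith("END="):
            end = int(entry[len("END="):])
    return pos, end
```

For the example record above this gives (51845, 51862), a block of 18 positions.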

2.1.2 Joining non-variant sites into a single block record:

Two issues must be addressed in joining adjacent non-variant sites into block records:

  1. Criteria which allow adjacent sites to be joined into a single block record
  2. Method to summarize the distribution of SAMPLE or INFO values from each site in the block record.

At any gVCF compression level, a set of sites can be joined into a block if:

  1. Each site is non-variant with the same genotype call. Expected non-variant genotype calls are { "0/0", "0", "./.", "." }
  2. Each site has the same coverage state, where 'coverage state' refers to whether at least one read maps to the site, i.e. sites with zero coverage can't be joined into the same block with covered sites.
  3. Each site has the same set of FILTER tags.
  4. Sites must have less than a threshold fraction of non-reference allele observations compared to all observed alleles (based on AD and DP field information). This is used to keep sites with high ratios of non-reference alleles from being compressed into non-variant blocks. In the gvcftools reference implementation the maximum non-reference fraction is 0.2.
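The four joining criteria can be sketched as a single predicate. Everything here is illustrative: sites are plain dicts with hypothetical keys GT, DP, AD_REF (the reference allele count from AD) and FILTER (a frozenset of tags), and the non-reference fraction is assumed to be (DP - AD_REF) / DP, since the exact formula is not spelled out above. The 0.2 cutoff is treated as exclusive.

```python
NONVARIANT_GENOTYPES = {"0/0", "0", "./.", "."}
MAX_NONREF_FRACTION = 0.2  # gvcftools reference implementation default


def nonref_fraction(site):
    """Assumed definition: share of observations not supporting the reference."""
    if site["DP"] == 0:
        return 0.0
    return (site["DP"] - site["AD_REF"]) / site["DP"]


def can_join(a, b):
    """Return True if two adjacent sites satisfy criteria 1-4."""
    return (
        a["GT"] in NONVARIANT_GENOTYPES
        and a["GT"] == b["GT"]                  # 1. same non-variant genotype
        and (a["DP"] > 0) == (b["DP"] > 0)      # 2. same coverage state
        and a["FILTER"] == b["FILTER"]          # 3. same FILTER tags
        and nonref_fraction(a) < MAX_NONREF_FRACTION  # 4. low non-ref fraction
        and nonref_fraction(b) < MAX_NONREF_FRACTION
    )
```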

2.1.3 Block Sample Values:

Any field provided for a block of sites, such as read depth (using the DP key), shows the minimum value observed among all sites encompassed by the block.
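A minimal sketch of this summary rule follows, assuming each site is a dict of numeric sample values that all share the same keys (the function name is hypothetical):

```python
def block_sample_values(sites):
    """Summarize a block by the per-key minimum over all of its sites."""
    keys = sites[0].keys()
    return {key: min(site[key] for site in sites) for key in keys}
```

For example, two sites with DP values 30 and 27 and GQX values 90 and 99 would be reported in the block record with DP 27 and GQX 90.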

2.1.4 Non-variant block implementations:

Note that files conforming to the gVCF conventions above could use different criteria for creation of block records depending on the desired trade-off between compression and non-variant site detail. The gatk_to_gvcf tool in gvcftools provides the following blocking scheme:
  • Non-variant block compression scheme 'min30p3a':
    • Each sample value shown for the block, such as the depth (using the DP key), is restricted to have a range where the maximum value is within 30% or 3 of the minimum, i.e. for sample value range [x,y], y <= x+max(3,x*0.3). This range restriction applies to all sample values printed out in the final block record.
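The 'min30p3a' range restriction can be checked directly from the inequality above. The sketch below is illustrative only (the function name is hypothetical) and tests one sample value, such as DP, across the candidate sites of a block:

```python
def within_min30p3a(values):
    """Check y <= x + max(3, x * 0.3) for the range [x, y] of the values."""
    x, y = min(values), max(values)
    return y <= x + max(3, x * 0.3)
```

With a minimum depth of 10 the block may include sites up to depth 13, while with a minimum depth of 100 it may include sites up to depth 130, so the absolute allowance of 3 matters only at low values.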

2.2 Special handling for indel conflicts:

Note that sites which are "filled in" inside of deletions have additional treatment:

  • For heterozygous deletions:
    • Sites inside of heterozygous deletions have haploid genotype entries (i.e. "0" instead of "0/0", "1" instead of "1/1").
    • Heterozygous SNVs are marked with the "SiteConflict" filter and given a genotype of "."
    • Sites inside of heterozygous deletions cannot have a genotype quality score higher than that of the enclosing deletion.
  • For homozygous deletions:
    • Sites inside of homozygous deletions have genotype "."
    • Site and genotype quality are set to "."
  • For all deletions:
    • Sites inside of a deletion are marked with the deletion's filters (more filters may be added pertaining to the site itself).

The above modifications reflect the notion that the site confidence is bounded by the enclosing indel confidence.
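The rules above can be sketched as one adjustment function. This is a simplified illustration, not the gvcftools code: records are dicts with hypothetical keys GT, GQ, QUAL and FILTER, a passing record is represented by an empty filter set, "." values are represented by None, and a deletion is taken to be heterozygous when its two genotype alleles differ.

```python
def adjust_site_in_deletion(site, deletion):
    """Return a copy of `site` adjusted for its enclosing deletion."""
    out = dict(site)
    # All deletions: the site inherits the deletion's filters.
    out["FILTER"] = set(site["FILTER"]) | set(deletion["FILTER"])
    site_alleles = site["GT"].split("/")
    if len(set(deletion["GT"].split("/"))) > 1:  # heterozygous deletion
        if len(set(site_alleles)) > 1:
            # Heterozygous SNV inside a heterozygous deletion.
            out["GT"] = "."
            out["FILTER"].add("SiteConflict")
        else:
            out["GT"] = site_alleles[0]  # haploid: "0/0" -> "0", "1/1" -> "1"
        if out["GQ"] is not None and deletion["GQ"] is not None:
            out["GQ"] = min(out["GQ"], deletion["GQ"])  # bounded by the deletion
    else:  # homozygous deletion
        out["GT"] = "."
        out["QUAL"] = None
        out["GQ"] = None
    return out
```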

2.2.1 Indel conflict Filters:

ID             Type        Description
IndelConflict  site/indel  Locus is in region with conflicting indel calls
SiteConflict   site        Site genotype conflicts with proximal indel call. This conflict is typically a heterozygous genotype found inside of a heterozygous deletion.


2.3 Genotype Quality for Variant and Non-variant Sites:

The gVCF file uses an adapted version of genotype quality for variant and non-variant site filtration. This value is associated with the key GQX. The GQX value is intended to represent the minimum of {Phred genotype quality assuming the site is variant, Phred genotype quality assuming the site is non-variant}. This allows a single value to be used as the primary quality filter for both variant and non-variant sites. Filtering on this value corresponds to a conservative assumption appropriate for applications where reference genotype calls must be determined at the same stringency as variant genotypes, i.e.:

  1. An assertion that a site is homozygous reference at GQX >= 20 is made assuming the site is variant.
  2. An assertion that a site is a non-reference genotype at GQX >= 20 is made assuming the site is non-variant. 

Few SNP callers enumerate genotype qualities using both priors (starling is one example). However, the concept is easily and effectively approximated from typical VCF output using GQX ~= min(GQ,QUAL).
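The approximation GQX ~= min(GQ,QUAL) can be computed from a single-sample VCF record as in the sketch below. The function name is hypothetical, and the sketch assumes that the record has a numeric QUAL column and a GQ entry in its FORMAT field.

```python
def approximate_gqx(line):
    """Approximate GQX as min(GQ, QUAL) for one single-sample VCF record."""
    fields = line.rstrip("\n").split("\t")
    qual = float(fields[5])
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gq = float(sample_values[format_keys.index("GQ")])
    return min(gq, qual)
```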


3 gVCF FILTER Criteria:

3.1 Summary

The gVCF FILTER description below is divided into two sections: (1) filtration based on genotype quality and (2) all other filters. Note that these are the default filter values used in the current gvcftools implementation; no particular set of filters or filtration levels is required for a file to conform to the gVCF conventions.

3.2 Genotype quality filtration:

The genotype quality is the primary determinant of filtration status for all sites in the genome. In particular, note that traditional discovery-based site quality values that convey confidence that the site is "anything besides the homozygous reference genotype" (such as SNP quality) are not used: a site or locus is filtered based on the confidence in the reported genotype for the current sample.

The genotype quality used in gVCF is a phred-scaled probability that the given genotype is correct. It is indicated with the FORMAT field tag “GQX”. Any locus with genotype quality below the cutoff threshold is filtered with the tag “LowGQX”. If a locus is filtered for genotype quality, it may also be marked by additional filters (described below).

3.3 Default Filters:

The full set of filters used by default in the gvcftools reference implementation is:

ID                      Type        Description
LowGQX                  site/indel  GQX is less than 20
MaxDepth                site/indel  Depth is greater than 3x the mean chromosome depth
LowQD                   site/indel  QD is less than 3.73
LowMQ                   site        Site MQ is less than 20
HighFS                  site        Site FS is greater than 60
HighHaplotypeScore      site        Site HaplotypeScore is greater than 13
LowMQRankSum            site        Site MQRankSum is less than -12.5
LowReadPosRankSum       site        Site ReadPosRankSum is less than -2.386
HighIndelFS             indel       Indel FS is greater than 200
LowIndelReadPosRankSum  indel       Indel ReadPosRankSum is less than -20
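The site-level defaults listed above can be expressed as a small table-driven check. This sketch is an illustration under stated assumptions rather than the gvcftools implementation: annotation values are passed in a dict, the depth used for MaxDepth is assumed to be DP, and any annotation missing from the dict is simply skipped. The two indel-specific filters would follow the same pattern with their own cutoffs.

```python
import operator

# (filter ID, annotation key, comparison, cutoff), taken from the defaults above
SITE_FILTERS = [
    ("LowGQX", "GQX", operator.lt, 20),
    ("LowQD", "QD", operator.lt, 3.73),
    ("LowMQ", "MQ", operator.lt, 20),
    ("HighFS", "FS", operator.gt, 60),
    ("HighHaplotypeScore", "HaplotypeScore", operator.gt, 13),
    ("LowMQRankSum", "MQRankSum", operator.lt, -12.5),
    ("LowReadPosRankSum", "ReadPosRankSum", operator.lt, -2.386),
]


def site_filters(values, mean_chrom_depth):
    """Return the list of FILTER tags for one site, or ['PASS'] if none apply."""
    tags = [
        name
        for name, key, compare, cutoff in SITE_FILTERS
        if key in values and compare(values[key], cutoff)
    ]
    # Assumption: "Depth" in the MaxDepth description refers to DP.
    if "DP" in values and values["DP"] > 3 * mean_chrom_depth:
        tags.append("MaxDepth")
    return tags or ["PASS"]
```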