
About gVCF

Summary:

Human clinical applications require sequencing information for both variant and non-variant positions, yet there is currently no common exchange format for such data. Genome VCF (gVCF) was developed to address this issue. gVCF is a set of conventions applied to the standard variant call format (VCF). These conventions allow representation of genotype, annotation and other information across all sites in the genome in a reasonably compact format: typical human whole genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. gVCF is equally suitable for representing and compressing targeted sequencing results. Compression is achieved by joining contiguous non-variant regions with similar properties into single ‘block’ VCF records. To maximize the utility of gVCF, especially for high stringency applications, the properties of the compressed blocks are conservative: block properties such as depth and genotype quality reflect the minimum value over all sites in the block. A gVCF file is also a valid VCF v4.1 file, so it can be indexed and used with existing VCF tools such as tabix and IGV, making it convenient both for direct interpretation and as a starting point for tertiary analysis.
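The block compression described above can be illustrated with a minimal Python sketch. This is not the gvcftools implementation: the names `Site`, `Block`, `within_range` and `compress` are hypothetical, and the similarity rule (values in a block may differ by at most 3 or 30%, whichever is greater) is an assumption loosely based on the BLOCKAVG_min30p3a label used in the example records. The sketch only shows the two ideas stated in the summary: contiguous non-variant sites with similar properties are merged, and each block reports the minimum depth and genotype quality of its sites.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Site:
    pos: int    # 1-based position of a non-variant site
    filt: str   # FILTER value, e.g. "PASS"
    dp: int     # read depth
    gqx: int    # genotype quality


@dataclass
class Block:
    start: int
    end: int
    filt: str
    dp: int     # minimum depth over all sites in the block
    gqx: int    # minimum genotype quality over all sites in the block


def within_range(lo: int, hi: int) -> bool:
    # Assumed similarity rule (hypothetical): the largest value may exceed
    # the smallest by at most 3 or 30%, whichever is greater.
    return hi <= max(lo + 3, lo * 1.3)


def compress(sites: Iterable[Site]) -> List[Block]:
    """Merge contiguous, similar non-variant sites into blocks."""
    blocks: List[Block] = []
    cur: Optional[Block] = None
    dp_max = gqx_max = 0  # largest values seen in the open block
    for s in sites:
        if cur is not None:
            dp_lo, dp_hi = min(cur.dp, s.dp), max(dp_max, s.dp)
            gqx_lo, gqx_hi = min(cur.gqx, s.gqx), max(gqx_max, s.gqx)
            if (s.pos == cur.end + 1 and s.filt == cur.filt
                    and within_range(dp_lo, dp_hi)
                    and within_range(gqx_lo, gqx_hi)):
                # Extend the open block, keeping the conservative minimums.
                cur.end, cur.dp, cur.gqx = s.pos, dp_lo, gqx_lo
                dp_max, gqx_max = dp_hi, gqx_hi
                continue
        # Otherwise start a new block at this site.
        cur = Block(s.pos, s.pos, s.filt, s.dp, s.gqx)
        blocks.append(cur)
        dp_max, gqx_max = s.dp, s.gqx
    return blocks
```

Because each block stores the minimum rather than the mean, any site inside a block is guaranteed to have at least the reported depth and quality, which is the property that matters for high stringency use.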

gVCF Conventions:

Please see the gVCF Conventions page for the latest gVCF file conventions and details of the gvcftools reference implementation.

Implementation:

Two options are available for creating a gVCF file from a BAM file:

  1. The Isaac Variant Caller workflow available on github: https://github.com/sequencing/isaac_variant_caller
  2. A custom version of the GATK UnifiedGenotyper together with the gatk_to_gvcf utility

Example:

The following segment of a VCF file uses the gVCF conventions for representing non-variant sites, and more specifically the gvcftools block compression and filtration levels.

gVCF example segment

chr20   287125  .       T       .       .       PASS    END=287136;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:40:78:40
chr20   287137  .       G       .       .       LowGQX  .       GT:DP:GQX:MQ    0/0:42:11:42
chr20   287138  .       C       .       .       PASS    END=287178;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:36:96:42
chr20   287179  .       C       T       310.01  PASS    BaseQRankSum=-0.721;DP=37;Dels=0.00;FS=14.994;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=52.29;MQ0=0;MQRankSum=-1.091;QD=8.38;ReadPosRankSum=-1.963;SB=-1.901e+01     GT:AD:DP:GQ:PL:MQ:GQX   0/1:24,13:37:99:340,0,810:52:99
chr20   287180  .       G       .       .       PASS    END=287245;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:32:78:49
chr20   287246  .       G       A       567.01  PASS    BaseQRankSum=-0.718;DP=33;Dels=0.00;FS=5.093;HaplotypeScore=3.2995;MLEAC=1;MLEAF=0.500;MQ=49.01;MQ0=0;MQRankSum=1.050;QD=17.18;ReadPosRankSum=0.129;SB=-2.920e+02       GT:AD:DP:GQ:PL:MQ:GQX   0/1:13,20:33:99:597,0,343:49:99
chr20   287247  .       C       .       .       PASS    END=287259;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:27:75:46
chr20   287260  .       C       .       .       PASS    END=287270;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:26:69:38
chr20   287271  .       T       G       778     PASS    DP=26;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=38.49;MQ0=0;QD=29.92;SB=-2.930e+02     GT:AD:DP:GQ:PL:MQ:GQX   1/1:0,26:26:72:811,72,0:38:72
chr20   287272  .       A       .       .       PASS    END=287285;BLOCKAVG_min30p3a    GT:DP:GQX:MQ    0/0:26:69:34

In the example segment, non-variant regions (shown in blue) have "." in the ALT column, and variants (shown in red) list an alternate allele. Note that the variant lines can be extracted from a gVCF file to produce a conventional variant VCF file.
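As a rough illustration of reading such records, the Python sketch below separates variant lines from non-variant lines and recovers the span covered by a block. It assumes tab-separated VCF data lines, treats a record as non-variant when its ALT column is ".", and falls back to POS plus the REF length minus one when no END key is present. The function names `is_variant`, `record_span` and `variant_lines` are hypothetical and are not part of gvcftools.

```python
from typing import Iterable, Iterator, Tuple


def is_variant(line: str) -> bool:
    """True for a data line whose ALT column lists an alternate allele."""
    if line.startswith("#"):
        return False
    alt = line.rstrip("\n").split("\t")[4]
    return alt != "."


def record_span(line: str) -> Tuple[str, int, int]:
    """Return (chrom, start, end) covered by one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    end = start + len(ref) - 1  # default span when INFO has no END key
    for item in info.split(";"):
        if item.startswith("END="):
            end = int(item[len("END="):])
    return chrom, start, end


def variant_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and variant records, dropping non-variant records."""
    for line in lines:
        if line.startswith("#") or is_variant(line):
            yield line


# Records adapted from the example segment (INFO of the variant shortened).
records = [
    "chr20\t287125\t.\tT\t.\t.\tPASS\tEND=287136;BLOCKAVG_min30p3a"
    "\tGT:DP:GQX:MQ\t0/0:40:78:40",
    "chr20\t287137\t.\tG\t.\t.\tLowGQX\t.\tGT:DP:GQX:MQ\t0/0:42:11:42",
    "chr20\t287271\t.\tT\tG\t778\tPASS\tDP=26"
    "\tGT:AD:DP:GQ:PL:MQ:GQX\t1/1:0,26:26:72:811,72,0:38:72",
]
spans = [record_span(r) for r in records]
variants_only = list(variant_lines(records))
```

Here the first record is a block covering positions 287125 to 287136, the second is a single filtered non-variant site, and only the third record survives the variant filter.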
