created by Geraldine_VdAuwera
on 2014-04-03
GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.
This document explains what that extra information is and how you can use it to empower your variants analyses.
What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.
The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.
Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.
As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.
This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.
Header:
As you can see in the first line, the basic file format is a valid version 4.1 VCF:
##fileformat=VCFv4.1 ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> ##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive) ##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive) ##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive) ##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive) ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ##contig=<ID=20,length=63025520,assembly=b37> ##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta
Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive) ##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive) ##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).
You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.
Records
The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.
The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 10000000 . T <NON_REF> . . END=10000116 GT:DP:GQ:MIN_DP:PL 0/0:44:99:38:0,89,1385 20 10000117 . C T,<NON_REF> 612.77 . BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235 GT:AD:DP:GQ:PL:SB 0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10 20 10000118 . T <NON_REF> . . END=10000210 GT:DP:GQ:MIN_DP:PL 0/0:42:99:38:0,80,1314 20 10000211 . C T,<NON_REF> 638.77 . BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549 GT:AD:DP:GQ:PL:SB 0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10 20 10000212 . A <NON_REF> . . END=10000438 GT:DP:GQ:MIN_DP:PL 0/0:52:99:42:0,99,1403 20 10000439 . T G,<NON_REF> 1737.77 . DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0 20 10000440 . T <NON_REF> . . END=10000597 GT:DP:GQ:MIN_DP:PL 0/0:56:99:49:0,120,1800 20 10000598 . T A,<NON_REF> 1754.77 . DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0 20 10000599 . T <NON_REF> . . END=10000693 GT:DP:GQ:MIN_DP:PL 0/0:51:99:47:0,120,1800 20 10000694 . G A,<NON_REF> 961.77 . BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB 0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22 20 10000695 . G <NON_REF> . . END=10000757 GT:DP:GQ:MIN_DP:PL 0/0:48:99:45:0,120,1800 20 10000758 . T A,<NON_REF> 1663.77 . DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0 20 10000759 . A <NON_REF> . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080 20 10001019 . T G,<NON_REF> 93.77 . BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB 0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3 20 10001020 . C <NON_REF> . . END=10001020 GT:DP:GQ:MIN_DP:PL 0/0:26:72:26:0,72,1080 20 10001021 . T <NON_REF> . . END=10001021 GT:DP:GQ:MIN_DP:PL 0/0:25:37:25:0,37,909 20 10001022 . C <NON_REF> . . END=10001297 GT:DP:GQ:MIN_DP:PL 0/0:30:87:25:0,72,831 20 10001298 . T A,<NON_REF> 1404.77 . DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0 20 10001299 . C <NON_REF> . . END=10001386 GT:DP:GQ:MIN_DP:PL 0/0:43:99:39:0,95,1226 20 10001387 . C <NON_REF> . . END=10001418 GT:DP:GQ:MIN_DP:PL 0/0:41:42:39:0,21,315 20 10001419 . T <NON_REF> . . END=10001425 GT:DP:GQ:MIN_DP:PL 0/0:45:12:42:0,9,135 20 10001426 . A <NON_REF> . . END=10001427 GT:DP:GQ:MIN_DP:PL 0/0:49:0:48:0,0,1282 20 10001428 . T <NON_REF> . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315 20 10001429 . G <NON_REF> . . END=10001429 GT:DP:GQ:MIN_DP:PL 0/0:47:18:47:0,18,270 20 10001430 . G <NON_REF> . . END=10001431 GT:DP:GQ:MIN_DP:PL 0/0:45:0:44:0,0,1121 20 10001432 . A <NON_REF> . . END=10001432 GT:DP:GQ:MIN_DP:PL 0/0:43:18:43:0,18,270 20 10001433 . T <NON_REF> . . END=10001433 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1201 20 10001434 . G <NON_REF> . . END=10001434 GT:DP:GQ:MIN_DP:PL 0/0:44:18:44:0,18,270 20 10001435 . A <NON_REF> . . END=10001435 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1130 20 10001436 . A AAGGCT,<NON_REF> 1845.73 . DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0 20 10001437 . A <NON_REF> . . END=10001437 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,0
Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
Updated on 2014-10-22
From lawremi on 2014-05-30
Hi, would you please clarify how the DP, GQ and PL fields are summarized for each run? Thanks!
From Giusy on 2014-09-10
Hello! I think I might have a problem in catching how the gVCF really works. WHat I mean is that I followed all the steps on hoe to call variants on cohorts of samples with HC so I obtained a final VCF containing the variants of something like 33 samples. But I noticed that the size is really small (around 122 Mb, which is about the size of a single gVCF that I obtained from HC). Then I performed the VSQR anf following annotation, but I think something went wrong since I end up with a file in which I can’t distinguish the differend individuals anymore, moreover the number of variants does seem to me quite small in respect to what I expected. I think I made a really huge mistake somewhere, but I really can’t figure out where! Could you help me?
Thank you!!
From Geraldine_VdAuwera on 2014-09-10
Hi @Giusy,
If you end up with a file in which you can’t distinguish the different individuals anymore, there is probably a problem with the read groups assigned to each sample. It’s a pretty common mistake, if you assigned read group information that is not distinct for the various samples (especially the SM tag) then the program merges the information.
From Giusy on 2014-09-10
Dear Geraldine, thank you for your input. I definitely check but I’m quite sure this is not the problem. I added the read groups (including the SM)myself to my sample and I’m pretty sure theone I downloaded from 1000genomes have it already. Moreover ig i look up the output from GenotypeGVCFs in igv I can see all the different samples. Just that when i run the VSQR and annovar as I said I end up with a text file in which I can’t distinguish them no longer and the variations are somenthing like 38000, which seems pretty low number to me considering the 34 samples. Is there an other step that I sould perform to extrapolate my samples before annotating it?
From Geraldine_VdAuwera on 2014-09-10
@Giusy
Ah, then something else is probably wrong, but I’ll need more information from you in order to help you diagnose it. Can you please post all the commands you ran, and clarify which file is the one in which you can longer distinguish the samples? If you can post a few records from that file as well that would be very helpful.
From Giusy on 2014-09-10
Sure! These are all the commands I used so far. For simplicity I just omitted all the folder paths
java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I my_sample.bam —emitRefConfidence GVCF —variant_index_type LINEAR —variant_index_parameter 128000 -R $REF/hs37d5.fa —dbsnp $REF/dbsnp_138.b37.vcf -L $REF/S04380110_Regionsb37.bed -o $SAMPLES/output.raw.snps.indels.g.vcf
java -Xmx2g -jar $GATK/GenomeAnalysisTK.jar -R $REF/hs37d5.fa -T GenotypeGVCFs —max_alternate_alleles 10 —variant output1.raw.snps.indels.g.vcf —variant output2.raw.snps.indels.g.vcf ….. -o global_output.vcf
java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_output.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $REF/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $REF/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 $REF/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $REF/dbsnp_138.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_output.vcf -mode SNP —ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o global_recalibrated_snps_raw_indels.vcf
java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 $REF/Mills_and_1000G_gold_standard.indels.b37.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 —maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R
java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -mode INDEL —ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o global_recalibrated_variants.vcf
perl $ANNOVAR/convert2annovar.pl -format vcf4 —includeinfo global_recalibrated_variants.vcf > snp.avinput
perl $ANNOVAR/annotate_variation.pl -geneanno $SAPLES/snp.avinput -buildver hg19 humandb/
perl $ANNOVAR/table_annovar.pl snp.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp138,ljb23_all -operation g,r,r,f,f,f,f -nastring .
And this is a capture of the record. I actually noticed that something was wrong looking at it in Excel
From Geraldine_VdAuwera on 2014-09-10
Ok, well it looks like you’ve done everything correctly for the GATK part; assuming the output of VQSR is ok, it looks like the problem is in the ANNOVAR output. I can’t comment on Annovar’s expected behavior since it’s not our tool, so you should check with that tool’s support at http://www.openbioinformatics.org/annovar/annovar_faq.html#bug
From Giusy on 2014-09-11
Ok, well anyway it’s good news that the GATK part is fine. I’ll try to look better into annovar… Thank you very much for your quick help!
From blueskypy on 2014-09-30
Hi, Geraldine,
I wonder if you can help me with the following?
In the following portion of a gVCF file,
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 1 1 . N . . END=10353 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Since the DP is 0, the GT should be No Call, i.e. ./.; why is it 0/0?
Also since each line may represent a block, does it mean the stats are the mean of all sites in the block?
A 2nd question, the doc of GenotypeGVCFs says it “produce correct genotype likelihoods, re-genotype the newly merged record”, does it mean it may change the GT of sites in a sample? For example, from A/T in the gVCF file to A/G after GenotypeGVCFs on a cohort containing the sample, or from No Call to a call?
Thanks,
From Geraldine_VdAuwera on 2014-09-30
Hi @blueskypy,
Funny thing, Craig came to me with the same question this morning…
Short answer is that you shouldn’t be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final. In this kind of record, this will end up being a no-call (since REF is N) once it goes through GenotypeGVCFs. The only reason there is a genotype call made here is a leftover bit of code that we are going to take out (hopefully in the next release). In fact, in the near future HC will emit no-call genotypes for every position in a GVCF, to remove this source of confusion. The behavior of regular (non-ERC) mode HC will not be affected.
To your second question, yes, genotype calls can be changed by GenotypeGVCFs.
Does that clarify things?
From blueskypy on 2014-09-30
Thanks so much, Geraldine! So to get the GT of all sites in a sample, I should use GenotypeGVCFs —includeNonVariantSites? if so, two more questions:
1. does it has an option similar to -ERC GVCF or BP_RESOLUTION for those non-variant sites?
2. Does it include both 0/0 and ./. sites? if not, does it mean that sites absent in the vcf output are all ./. sites? Is there a way to include ./. sites as well?
Finally, not related but just wondering, does —standard_min_confidence_threshold_for_calling in GenotypeGVCFs refer to the GQ in GVCF file?
Thanks a lot!
From blueskypy on 2014-09-30
A followup question, my understanding is that VQSR only uses variant sites, so whether —includeNonVariantSites or not in GenotypeGVCFs has no effect on VQSR, is it right?
From Geraldine_VdAuwera on 2014-09-30
@blueskypy
Whoo that’s a lot of questions. Yes, no, yes (I think). No, that arg refers to the variant confidence (QUAL). And yes, VQSR will ignore monomorphic hom-ref sites.
From naymikm on 2014-10-13
Hi, I’m curious if I made a gvcf using 2 samples (1 normal, 1 cancer) from the same individual, could I use this resulting vcf file directly with snpEff?
From Geraldine_VdAuwera on 2014-10-13
@naymikm, the GVCF is only an intermediate format. You must run GenotypeGVCFs on it before doing any further processing such as functional annotation with snpEFF.
From naymikm on 2014-10-13
Sorry I worded that poorly. Would the resulting vcf file after running GenotypeGVCFs be usable with snpEFF directly?
From Sheila on 2014-10-14
@naymikm
Hello,
Yes, your vcf will work with snpEFF.
-Sheila
From pwhiteusa on 2014-11-21
Please could you explain how GQ is calculated in the GVCF format for a given block. I noticed that in cases where the MIN_DP value is less than the DP value, GQ is always significantly higher than the PL for a 0/1 call. Why would that be the case?
From your example GVCF above:
20 10000759 . A . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080
Why is GQ not 65? Note that DP=40 and MIN_DP=28?
From Geraldine_VdAuwera on 2014-11-21
Hi @pwhiteusa,
I’m not sure what you mean. The PL values are normalized so that the PL of the called genotype is 0 (see [this FAQ](https://www.broadinstitute.org/gatk/guide/article?id=1268) for details). The PL 65 is for the het possibility, but the record you mention is hom-ref.
From pwhiteusa on 2014-11-22
Here are four examples where GQ is high:
20 10001299 . C . . END=10001386 GT:DP:GQ:MIN_DP:PL 0/0:43:99:39:0,95,1226 20 10000695 . G . . END=10000757 GT:DP:GQ:MIN_DP:PL 0/0:48:99:45:0,120,1800 20 10001419 . T . . END=10001425 GT:DP:GQ:MIN_DP:PL 0/0:45:12:42:0,9,135 20 10001387 . C . . END=10001418 GT:DP:GQ:MIN_DP:PL 0/0:41:42:39:0,21,315
And four examples where it is not:
20 10001430 . G . . END=10001431 GT:DP:GQ:MIN_DP:PL 0/0:45:0:44:0,0,1121 20 10001020 . C . . END=10001020 GT:DP:GQ:MIN_DP:PL 0/0:26:72:26:0,72,1080 20 10001021 . T . . END=10001021 GT:DP:GQ:MIN_DP:PL 0/0:25:37:25:0,37,909 20 10001428 . T . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315
What I have observed is that in GVCF format, a block of reference calls typically is given a GQ of the 0/1 PL, but only if DP=MIN_DP.
If Min_DP < DP, then the GQ value is always higher than the 0/1 PL. You see the effect more clearly if there is a lower coverage sample, but essentially the higher the delta between the MIN_DP and DP value then the greater the GQ.
Q1. Why is that the case?
Q2. How is GQ calculated in the GVCF format for a given block?
From Geraldine_VdAuwera on 2014-11-24
@pwhiteusa I don’t see the trends you are seeing. The first 4 records, which you say have high GQ, have GQs of 99, 99, 12, 42. The other 4 have GQs of 45, 26, 25, 49. So two of the first four are really in the wrong category. And for the second category, the DP=MIN_DP observation does not apply for the first element. I think you may be seeing coincidental similarities.
Generally speaking, it’s not surprising that you might see larger GQs when the MIN_DP and DP are farther apart, because that suggests that there are some samples with even more coverage than those already respectable depth values. With that amount of depth, you can get a good amount of confidence in the call.
IIRC, the GQ of a block is the lowest GQ of the sites that are all within the same GQ range (as defined in the vcf header).
From pwhiteusa on 2014-12-01
@Geraldine_VdAuwera here’s what I am observing in the examples.
When DP = Min_DP the GQ value is always the second highest PL value (or very close to it). This for the examples above in the second set of four:
GQ = 0 and PL = 0,**0**,1121
GQ = 72 and PL = 0,**72**,1080
GQ = 37 and PL = 0,**37**,909
GQ = 21 and PL = 0,**21**,315
Now, lets look at the first block where DP > Min_DP. This is no longer the case:
GQ = 99 and PL = 0,**95**,1226 Note: DP – Min_DP = 4
GQ = 99 and PL = 0,**120**,1800 Note: DP – Min_DP = 3
GQ = 12 and PL = 0,**9**,135 Note: DP – Min_DP = 3
GQ = 42 and PL = 0,**21**,315 Note: DP – Min_DP = 2
Do you see the trend? When Min_DP is less than DP, the GQ value is always higher than the second highest PL value – why would that be the case?
For example, in the 4th example, where DP and Min_DP are both 49, the GQ (21) is the same as the 0/1 PL (21). Now look at the last example (the 8th) – GQ (42) is double the 0/1 PL (21) and DP is actually less (39-41) in this case.
So I guess it ultimately comes down to the answer you gave to question 2 above:
How is GQ calculated in the GVCF format for a given block?: “the GQ of a block is the lowest GQ of the sites that are all within the same GQ range”. If that were the case, in situations where DP > Min_DP would you not expect the GQ value to go down, not up?
From pwhiteusa on 2014-12-01
Sorry @Geraldine_VdAuwera could you also explain your comment: “because that suggests that there are some samples with even more coverage than those already respectable depth values” – does MinDP refer to the depth from other samples – or did you mean bases?
From Geraldine_VdAuwera on 2014-12-01
@pwhiteusa Thanks for clarifying — I think I understand better now what you mean, but the “trend” is just the difference between single-site blocks (your first set) and multi-site blocks (your second set), which you can tell from the END tag in your earlier post.
For the first set, you can read the PL and GQ as if it was a regular (not GVCF) record: the GQ is equal to the second-likeliest PL cue to how the normalizing is done; that is expected. In the other set, there are other sites with GQs that may be different, although they should still be in the same GQ band (as defined above), so that’s why you can see a slight mismatch between the GQ and the second-likeliest PL. I’ll check with the developers whether it makes sense that the rule for integrating the values from multiple sites is not exactly the same for different fields, since that seems to be causing the confusion here.
To your other question, I mistyped my answer — I did mean sites, not samples, sorry. Note that when you have a multi-site block, DP should always be either equal to or greater than MIN_DP (since by definition MIN_DP is the lowest DP observed in the group, while DP is an average). Finally, I don’t understand what you mean by “expect the GQ value to go down, not up”. Up or down from what? If you mean relative to the single-site records, keep in mind that those are usually on their own because there is something there that is unlike the surrounding sites — some of them better, some of them worse. Given that is all that differentiates the two sets, I don’t think you can infer much from comparing just a handful of sites.
From pwhiteusa on 2014-12-01
@Geraldine_VdAuwera I totally didn’t notice the single site vs. multi-site, thank you. For a given block in the same GQ band, I would expect that the reported GQ would be equal to or less than the second-likeliest PL. If the GQ that is being reported is the lowest GQ for that GQ band, should the PL not also be the lowest PL in the band? I guess the question for the developers would be “How is PL calculated for a given band?” and “Why would a GQ be higher than the second-likeliest PL”.
From Geraldine_VdAuwera on 2014-12-01
@pwhiteusa Yep I need to check how the band PL is determined — we might have an inconsistency here.
From Geraldine_VdAuwera on 2014-12-03
@pwhiteusa After discussing this with the developer, we found that there is indeed an inconsistency — when summarizing sites in a reference block, the program takes the median GQ, but the mean of the PLs, which is why you are seeing different values. I’ve put in a request to have this made more consistent in the next version.
From chenyu600 on 2015-03-18
Hi,
I have followed the Best practice (first HaplotypeCaller with the -GVCF option, then GenotypeGVCFs) and version3.2 for my study. But I got a record like this:
chr1 802320 rs12565310 G A 6702.90 . BaseQRankSum=1.95;DB;DP=805;FS=3.582;MLEAC=2;MLEAF=1.00;MQ=43.69;MQ0=0;MQRankSum=-1.700e-02;QD=30.47;ReadPosRankSum=1.01;SOR=3.363 GT:AD:DP:GQ:PL ./.:298,0:298 1/1:2,218:.:99:6729,618,0 ./.:279,0:279
As far as I am concerned, both the first and third individual got fairly sequencing depth at this position (298 and 279, respectively), but the genotype was ./. . I thought that genotype would represent like this unless there were no reads cover the site.
I also check the gVCF file for each sample on this site, though @Geraldine_VdAuwera has suggested that we shouldn’t be looking at the genotype calls emitted by HC in GVCF mode. Anyway here are the records from the gvcf:
chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:298:0:298:0,0,0
chr1 802320 rs12565310 G A, 6699.77 . BaseQRankSum=1.947;DB;DP=228;MLEAC=2,0;MLEAF=1.00,0.00;MQ=43.69;MQ0=0;MQRankSum=-0.017;ReadPosRankSum=1.010 GT:AD:GQ:PL:SB 1/1:2,218,0:99:6729,618,0,6734,652,6768:1,1,164,54
chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:279:0:279:0,0,0
Can anyone give me a hint on this? Thanks in advance.
From Geraldine_VdAuwera on 2015-03-18
Hi @chenyu600 ,
You’re right that no-calls (./.) are usually reserved for sites with no read coverage. But there’s clearly something odd going on at these sites — you see those two samples have PLs of “0,0,0” which means that the program has no clue what is going on there. Have you looked at the site in a genome browser like IGV? That may shed some light on what’s happening here.
From chenyu600 on 2015-03-18
Thank you @Geraldine_VdAuwera !
I would check it out and come back to you as soon as possible!
From chenyu600 on 2015-03-19
Hi @Geraldine_VdAuwera,
I have check the BAM file for each sample in IGV. I just looked into the bam file after performing BWA and Picard MarkDuplicates. Since the the first (II5A) and third (III3A) individual did not called at this position, I think there is no need to look into the bam generated by —bamOutput.
Here is the depth for each sample (from IGV):
II5A
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )
N : 0
———————-
II7A
Total count: 654
A : 460 (70%, 240+, 220- )
C : 7 (1%, 0+, 7- )
G : 186 (28%, 70+, 116- )
T : 1 (0%, 0+, 1- )
N : 0
———————-
III3A
Total count: 666
A : 500 (75%, 273+, 227- )
C : 4 (1%, 0+, 4- )
G : 156 (23%, 67+, 89- )
T : 6 (1%, 0+, 6- )
N : 0
———————-
It seems that each sample got pretty good depth for this site (chr1:802320).
I also attach the bam files. These files are just alignment from chr1:801320-803320.
From Geraldine_VdAuwera on 2015-03-20
The files were not attached — this forum can only handle plain text and pictures. We prefer that you post a screenshot of your data viewed in IGV.
Please also check the mapping qualities of your reads and also the base qualities. Lots of depth is useless if it’s not good quality.
From chenyu600 on 2015-03-26
Hi @Geraldine_VdAuwera ,
Sorry for the delay response!
Here is one of the sample that did not call at chr1:802320, but with reads covered it.
Statistics from IGV is below:
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )
N : 0
while information grep from result generated by HaplotypeCaller with the -GVCF option is:
chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:298:0:298:0,0,0
ASAIK, “DP” value (298) in this record is based on filtered reads counts, which is much smaller than that from IGV (840).
[https://dropbox.com/sh/y3xxmaqygxg9s9g/AACGB848dEUkruJNm2TfA8Qla?dl=0](https://www.dropbox.com/sh/y3xxmaqygxg9s9g/AACGB848dEUkruJNm2TfA8Qla?dl=0)
From Geraldine_VdAuwera on 2015-03-26
@chenyu600 It looks like this region has a big tandem duplication in the sequence, so it’s very likely that after reassembly, the mapping looks different. If you can share a snippet of data we can try to figure out exactly what’s happening here. Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894
From chenyu600 on 2015-03-27
Hi @Geraldine_VdAuwera ,
I have followed the instruction and uploaded my dataset. The archive file is chenyu600_chr1_802320.tar.gz.
There are three samples, II5A, II7A and III3A. All the bam are snippets of the original chromosome 1 bam files. To be exactly, they are bam files (II5A_snippet.bam, for example) with 500 bp padding on either side of chr1:802320. Using these files can reproduce almost the same result except that the annotations of the record are a bit different.
The original result is :
chr1 802320 rs12565310 G A 6702.90 . BaseQRankSum=1.95;DB;DP=805;FS=3.582;MLEAC=2;MLEAF=1.00;MQ=43.69;MQ0=0;MQRankSum=-1.700e-02;QD=30.47;ReadPosRankSum=1.01;SOR=3.363 GT:AD:DP:GQ:PL ./.:298,0:298 1/1:2,218:.:99:6729,618,0 ./.:279,0:279
And the new result is :
chr1 802320 rs12565310 G A 6811.90 . BaseQRankSum=1.70;ClippingRankSum=1.20;DB;DP=807;FS=6.140;MLEAC=2;MLEAF=1.00;MQ=44.03;MQ0=0;MQRankSum=0.811;QD=30.68;ReadPosRankSum=-9.830e-01;SOR=34.852 GT:AD:DP:GQ:PL ./.:297,0:297 1/1:1,221:222:99:6838,654,0 ./.:281,0:281
In both results, II5A and III3A were not genotyped even though the site has been covered by almost 300 reads. But the annotation values are somehow different. Besides, I wonder why the values of DP presented in the new results are more than those presented in the original results (Should they equal to or less than the original ones since there are fewer data). And in the orignial result, DP of the second individual is “.”, I think, which means that all the reads covering this site were filtered out. However, in the snippet dataset, 222 reads are supposed to pass the filter. I am kind of lost. Would you giving me a hint on these too?
There is one more thing confused me. When running on the orignial chromosome 1 bam files, I set -Xmx 5G for each step and that ran pretty well. But when I ran the snippets, errors kept promting up saying “There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java”. And I have to set -Xmx up to 9G to make HaplotypeCaller run on. I thought memory may go up when feeding the tools with big data. But this time it required more memory when processing fewer data (as you can see the exact command lines in the run.sh, and run.sh.e451547 is the log file). Would you mind explaining that to me?
Thank you!
From Geraldine_VdAuwera on 2015-03-30
@chenyu600, thanks for the snippet. I’m not sure what’s going on with the depth; we’ll start by checking out the data you sent and let you know.
I can’t explain the higher memory requirement you encountered running on the smaller dataset, sorry. Would be interesting to know if this is reproducible or if it was perhaps a transient issue on your platform.
From chenyu600 on 2015-04-01
@Geraldine_VdAuwera, hope you can work it out soon ;)
Thank you!
From Sheila on 2015-04-08
@chenyu600
Hi,
This issue has been fixed in the latest release. You can download it here: https://www.broadinstitute.org/gatk/download/
-Sheila
From SteveL on 2015-04-09
Hi @Geraldine_VdAuwera. On this thread we have two answers as to how the GQ in a band of a gVCF is produced:
> November 2014: IIRC, the GQ of a block is the lowest GQ of the sites that are all within the same GQ range (as defined in the vcf header).
> December 2014: After discussing this with the developer, we found that there is indeed an inconsistency — when summarizing sites in a reference block, the program takes the median GQ, but the mean of the PLs, which is why you are seeing different values. I’ve put in a request to have this made more consistent in the next version.
I assume the second is correct, but does that mean that the gVCF is being represented wrongly – i.e. the median GQ is being used to determine bands rather than the more accurate mean PL? 1) How will these be made “more consistent” – will the GQ become equivalent to the second-lowest value in the PL, as for variant positions?
This is important because I notice that when going from 100 gVCFs (-GQB 20) to post-VQSR 100-sample VCF, the GQ in the latter always gets set to it’s PL equivalent, often resulting in a huge change in “GQ” e.g. the following, GQ of 99 in the homozygous REF bands, gets reduced to just 21 and 23 when they are treated as calls:
In the gVCF: 19 39062815 . G C, 3131.77 PASS BaseQRankSum=0.328;ClippingRankSum=1.256;DP=243; LikelihoodRankSum=1.530;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.358;ReadPosRankSum=-1.672; GT:AD:DP:GQ:PL:SB 0/1:126,116,0:242:99:3160,0,3374,3538,3722,7260:43,83,34,82 19 39062522 . A 0.0 . END=39062984 GT:DP:GQ:MIN_DP:PL 0/0:83:99:7:0,21,242 19 39062568 . G 0.0 . END=39062940 GT:DP:GQ:MIN_DP:PL 0/0:69:99:8:0,23,270
In the post-VQSR VCF:
19 39062815 . G C 3117.43 VQSRTrancheSNP90.00to99.00 AC=1;AF=4.545e-03;AN=220; BaseQRankSum=0.328;ClippingRankSum=1.26;DP=1331;FS=3.097;InbreedingCoeff=-0.0077;LikelihoodRankSum=1.53; MLEAC=1;MLEAF=4.545e-03;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=12.88;ReadPosRankSum=-1.672e+00;SOR=0.933;VQSLOD=14.63;culprit=MQ GT:AD:DP:GQ:PL 0/1:126,116:242:99:3160,0,3374 0/0:7,0:7:21:0,21,242 0/0:8,0:8:23:0,23,270
2) Is this expected behaviour?
The point is, as shown by the DP in the original, and having looked at the BAMs, that this position is very well supported – this is a real deNovo in this trio – at least it Sangered positive, but if I saw a read depth of 8 and a GQ of 23 in the homozygous REF calls, I would usually ignore it – also notice that it doesn’t get a “PASS”. Further, here the information in the gVCF is more reliable, but it is lost by the time VQSR has been performed.
I guess that adding more banding to my gVCFs might help to resolve this – certainly the DP would be expected to improve (since it is the min_DP that passed to the post-VQSR VCF). 3) Is the DP reported in the gVCF the mean, or the median, do you know?
Hi @Sheila – 4) can you explain how the issue was “fixed” and if this impacts on my questions above?
Sorry to ask so many questions, but at least they are clearly numbered. :-)
Thanks as always for the great support you are providing, Steve
From chenyu600 on 2015-04-10
Hi @Shelia , thank you very much for the work! Following your instruction, I have downloaded the GATK version3.3 and tested it on my dataset. It worked pretty well this time and I think the problem have been fixed. Thanks again!
But I am now a little confused by the coverage depth for this position. The depth (view from IGV) for this position from the origianl BAM file (attained from BWA + de-dup by Picard + recalibrated by GATK)for each sample were these:
sample 1(II5A):
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )
N : 0
———————-
sample 2(II7A):
Total count: 654
A : 460 (70%, 240+, 220- )
C : 7 (1%, 0+, 7- )
G : 186 (28%, 70+, 116- )
T : 1 (0%, 0+, 1- )
N : 0
———————-
sample 3(III3A):
Total count: 666
A : 500 (75%, 273+, 227- )
C : 4 (1%, 0+, 4- )
G : 156 (23%, 67+, 89- )
T : 6 (1%, 0+, 6- )
N : 0
———————-
But the depth (either DP or AD) in the final VCF file were much less than the original one. Here is the record grep from the final VCF produced by GATK version 3.3:
chr1 802320 rs12565310 G A 22078.90 . BaseQRankSum=2.38;ClippingRankSum=0.195;DB;DP=778;FS=6.999;MLEAC=6;MLEAF=1.00;MQ=42.26;MQ0=0;MQRankSum=0.466;QD=29.01;ReadPosRankSum=1.84;SOR=1.925 GT:AD:DP:GQ:PL 1/1:6,271:277:99:7657,690,0 1/1:2,218:220:99:6744,638,0 1/1:6,258:264:99:7704,655,0
You see sample 1 with depth of 277, sample 2 with 220 and sample 3 with 264. I know that HaplotypeCaller will Downsample the data to 250x, and that some reads with bad qualtiy will be discarded. I also make HaplotypeCaller output the BAM files to which assembled haplotypes were written. The depth (view from IGV) for this position this time were below:
sample 1 (II5A)
Total count: 280
A : 272 (97%, 193+, 79- )
C : 1 (0%, 0+, 1- )
G : 7 (3%, 3+, 4- )
T : 0
N : 0
———————-
sample 2 (II7A)
Total count: 224
A : 219 (98%, 162+, 57- )
C : 1 (0%, 0+, 1- )
G : 3 (1%, 0+, 3- )
T : 1 (0%, 0+, 1- )
N : 0
———————-
sample 3 (III3A)
Total count: 268
A : 259 (97%, 179+, 80- )
C : 0
G : 7 (3%, 3+, 4- )
T : 2 (1%, 0+, 2- )
N : 0
———————-
I have two questions:
1. Will the downsample step affect the result? I mean, is there ever a chance that most reads supporting the reference base are discarded(for example, in sample 1, it discarded 240 out of 252 reads supporting G and 250 out of 577 reads A, so that the genotype would turn out to be 1/1, since G is the reference base) ?
2.Why the depth attain from IGV is different from that from the final VCF? From the IGV, there are 7 reads supporting G and 272 reads supporting A in sample 1. However, the AD in this sample was 6,271. From the IGV, there are 3 reads supporting G and 219 reads supporting A in sample 2 while the AD is 2,218. In sample 3, 7 reads supporting G and 259 reads supporting A while the AD is 6,258. Interestingly, there is always one more read for both allel in IGV than that grep from fianl VCF. Would you please give me a hint on this?
Thank you!
From SteveL on 2015-04-10
Hi again @Geraldine_VdAuwera – I see I may have misunderstood what you were saying about the band above (November) – of course the boundaries of the band are determined by the lowest GQ, by definition. I have now run HC specifying 5 or 10 GQBands, and that has helped with the final DP and GQ values substantially, at least in this instance, and only increased HC run-time by ~10%.
I am still curious about my other questions.
Thanks, Steve
From Geraldine_VdAuwera on 2015-04-10
@SteveL
1)-2) You’re doing the right thing — we’ve actually increased the granularity of the default GQ bands in the upcoming version (3.4, to be released soon-ish).
3) I think it’s the mean DP but I could be wrong. Will have to check.
4) I believe Sheila was talking to the other user who has been active in this thread — that’s why for more complex questions that are likely to take some back and forth, it’s usually better to create a new discussion than to post a comment in a document thread.
From Geraldine_VdAuwera on 2015-04-10
@chenyu600 The downsampling can have marginal effects on results, but nothing dramatic. If you’re seeing 240 out of 252 reads being discarded, that’s not due to downsampling. It’s more likely that it’s due to quality issues in your reads and they’re being filtered out.
I don’t know why the count of AD is off by one, can you show a screenshot of this?
From SteveL on 2015-04-10
> @Geraldine_VdAuwera said:
> SteveL
>
> 1)-2) You’re doing the right thing — we’ve actually increased the granularity of the default GQ bands in the upcoming version (3.4, to be released soon-ish).
>
> 3) I think it’s the mean DP but I could be wrong. Will have to check.
>
> 4) I believe Sheila was talking to the other user who has been active in this thread — that’s why for more complex questions that are likely to take some back and forth, it’s usually better to create a new discussion than to post a comment in a document thread.
Thanks Geraldine. I realised Sheila was talking to @chenyu600 , but I wasn’t sure exactly how the issue was fixed, and whether it would mean I should switch to downloading a nightly version, rather than using the original 3.3?
Sorry I didn’t start a new thread ;-( – thought my original queries were relevant to the tool, but in future I’ll post a new thread.
From Geraldine_VdAuwera on 2015-04-10
Oh ok, I guess I was the one confused then :)
That’s right, anytime we say something is fixed, we mean in the latest nightly… unless we recently released a new version (which is going to happen sooooon).
You don’t absolutely have to create a new thread everytime, but a good rule of thumb is if you find yourself writing more than one paragraph, it should probably be a new question. Document threads are meant more for clarification of specific points of the doc rather than reporting problem. But sometimes things are not obviously one or the other, so don’t worry too much about it. We’ll still answer you either way :)
From Sheila on 2015-04-11
SteveL Actually, in this case, version 3.3 has the fix. chenyu600 was using version 3.2. Sorry for the confusion. I guess I need to be more specific about that in the future.
@chenyu600 I will have a look at your files by Monday and get back to you with a proper response.
-Sheila
From chenyu600 on 2015-04-14
Hi @Geraldine_VdAuwera ,
I am very sorry for the late response, but I wonder what screenshot you would like to see? Do you mean the screenshot of IGV result? If it is what you mean, the screenshot was below. Hope this help.
@Sheila , thank you for your work and I hope the screenshot would help. And if you want any other dataset, just mention that;)
https://www.dropbox.com/s/vm93stuhxtvfqio/igv_depth.png?dl=0
From Sheila on 2015-04-15
@chenyu600
Hi,
I just looked at your files (which I ran with the latest nightly build), and the values match the bamout file. I think you may be looking at the original bam file and not the bamout file. Haplotype Caller does a reassembly step, and reads may get shifted. You can use the -bamout argument to see what the reads look like after reassembly. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—bamOutput
-Sheila
From jwagner on 2015-09-16
Hi, are there any publicly available examples of gVCF format with a whole genome or whole exome sequencing of a group of real life human beings? Just to get an idea of what is looks like with real data?
From Sheila on 2015-09-17
@jwagner
Hi,
No we do not provide any public GVCFs. The snippets we show in our examples are real examples. The entire GVCF will be the same format but much bigger.
-Sheila
From u0597274 on 2016-03-11
Are there any publicly available tools to easily convert between the two GVCF subtypes?
From Geraldine_VdAuwera on 2016-03-11
@u0597274 I’m not sure what you mean by that, do you mean between BP_RESOLUTION and the banded GVCF? If so no, we don’t have any utilities to do that. I’m not aware of anyone else having built that either.
From jfb on 2016-04-18
I’ve read through most of this thread and I’m confused about how best to use the haplotype caller with a single WGS sample that I am trying to genotype. Is it still recommended to use the GVCF format? If I want to use the genotype calls do I still need to convert my GVCF to a VCF via genotypeGVCF even if there is one sample?
My confusion stems from Geraldine’s comment in September 2014 that “you shouldn’t be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final.”
From Sheila on 2016-04-18
@jfb
Hi,
The GVCF workflow has two major benefits over the normal HaplotypeCaller workflow.
1) It allows you to add samples to your analysis easily.
2) It allows you to jointly genotype all of your samples together without using too much compute.
If you only have one sample and are absolutely sure you will never have any more samples, you can go ahead and run HaplotypeCaller in normal mode to get the final VCF in one step. However, if you do use the GVCF workflow, you will indeed need to run GenotypeGVCFs on your one GVCF to get the final VCF.
I hope this helps.
-Sheila
From jfb on 2016-04-18
Great. thanks for the clarification.
From magicDGS on 2016-05-27
Hello. I wonder if there is a possibility to make gVCFs in the GATK framework with a different band pattern. For instance, I’m interested in combine non-variant blocks using depth (DP) in some ranges instead of GQ.
On the other hand, I would like to know how it is possible to index this kind of files with tabix. If the end of a non-reference block is in the INFO field and for variation it is the same as the position + allele length, how can tabix query overlaps that falls between the start and the END tag?
Thank you very much in advance.
From Sheila on 2016-06-06
@magicDGS
Hi,
Sorry for the delay.
1) No, there is no possibility of having GVCF blocks based on anything other than GQ. Of course, you are welcome to implement it yourself and submit a patch, preferably in GATK4 :smile:
2) We don’t recommend working with gzipped GVCFs, as we have observed some issues. However, GVCFs are valid VCFs and follow the specification. But, we don’t develop tabix, so you should ask the developers for help.
-Sheila
From fdnj on 2016-08-10
Hello.
I would like to know if it is possible to create a vcf file (i.e.: post GenotypeGVCFs step) in a gvcf format (as produced by -ERC: GVCF option).
Briefly, I would like to query a fast increasing number of vcf files without rerunning GenotypeGVCFs on the overall cohort every-time a new sample is sequenced. Therefore I would like to have fully informative files (containing information on homozygous reference or no-calls) so that samples could be queried together even if they come from different GenotypeGVCFs steps.
So I was wondering if using “post GenotypeGVCFs gvcf files” could be a good solution to combine information for all-sites for all samples while keeping file size as low as possible.
Thank you in advance.
From SteveL on 2016-08-10
Hi @fdnj,
Did you not run combineGVCFs prior to doing genotype gVCF?
I think this is probably the functionality you are looking for. e.g. First cohort is 100 samples – make a combine gVCF, and then run genotypeGVCF on the combined file to get your 100-sample VCF (keeping the combinedGVCF for late). Second cohort is a batch of 10 samples. Run genotypeGVCF giving these 11 files as input – i.e. your initial combineGVCF + 10 new gVCFs to give a 110 sample VCF. Or am I misundertanding something? This will be faster than trying to genotypeGVCF 110 samples.
From Sheila on 2016-08-10
@fdnj
Hi,
I’m a little confused about what you are asking. To answer your first question, no there is no way to output a GVCF-like final VCF. However, there is the [-allSites](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php#—includeNonVariantSites) option that outputs every single site. However, I don’t think that is what you are asking for.
Why are you running GenotypeGVCFs not on all your samples? That is not our Best Practices recommendation. GenotypeGVCFs is meant to be very fast, so you can quickly add more sample to you cohort. Have a look [here](https://software.broadinstitute.org/gatk/guide/article?id=3893) and [here](https://software.broadinstitute.org/gatk/documentation/article?id=4150) for more reasons to run GenotypeGVCFs on all your GVCFs.
@SteveL is correct that you use CombineGVCFs to combine GVCFs before feeding them into GenotypeGVCFs. This will also help save time.
-Sheila
From fdnj on 2016-08-11
Dear Steve and Sheila,
I understand that CombineGVCFs plus genotypeGVCF is the best option in my case.
Thank you for your answer
From mscjulia on 2017-07-06
Hello GATK Team,
I am wondering if there is an easy way to convert GVCF to VCF. I need to look at some samples of my cohort study separately for research purpose, and the numbers are not enough to use the Best Practise, but I already have the GVCF previously generated, so it would be the easiest if I just convert them to VCF.
Thank you very much.
From SteveL on 2017-07-06
You just run genotypeGVCF directly on your gVCFs @mscjulia
From mscjulia on 2017-07-06
Thank you so much @SteveL
From waldirmbf on 2017-12-13
Hello,
I am using BIS-SNPs to generate SNV callings, and the output is a vcf. I would like to use GenotypeGVCs. to join genotypes from two different individuals and to pursue to the filtering options. However, GenotypeGVCs requires gVCFs files. Has anyone passed through this situation? I was wondering if there is any option of converting VCF files into gVCF files without using Haplotype Caller? Or should I find other options to filter my data?
From Sheila on 2017-12-18
@waldirmbf
Hi,
There is no way to convert a VCF to a GVCF. GVCFs can only be made by HaplotypeCaller.
Are you looking to combine your two VCFs into one VCF? You may find [CombineVariants](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php) helpful.
-Sheila
From mjtiv on 2018-04-18
Question? I have noticed with some vcf files a homozygous call can be 0/0:20: whereas the same call could be 0/0:20,0: , why does this inconsistency occur? Does this type of difference occur with a gvcf merged with GenotypeGVCF (0/0:20,0:) file versus a vcf merged with CombinedVariants (0/0:20:). Trying to understand why vcf files can be so different/inconsistent at times.
From Sheila on 2018-04-25
@mjtiv
Hi,
Are you using GATK4? This was an issue in GATK3 that should be fixed in GATK4. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/6938/a-snip-and-deletion-without-dp-in-g-vcf-file/p1).
-Sheila
From clomoreau on 2018-08-21
Hi, I’m having a similar problem:
I have WES from UK10K that I jointed called with my own WES and at one position (maybe more, I did not look at all positions…), I have no call for all heterozygotes in the UK10K dataset, but I do have heterozygote calls for samples in my own data. This position is around 1% in the population.
Here is what I have in the gvcf and in the final vcf (joint calling) for 2 of the UK10K samples, one (who should be) heterozygote and one homozygote:
gvcf:
FORMAT GT:DP:GQ:MIN_DP:PL
EGAN00001068854 ./.:67:0:67:0,0,0
EGAN00001068855 ./.:57:99:21:0,63,767
vcf:
FORMAT GT:AD:DP:GQ:PGT:PID:PL
EGAN00001068854 ./.:67,0:67:.:.:.:0,0,0
EGAN00001068855 0/0:21,0:21:63:.:.:0,63,767
And here is what was in the vcf directly generated by the UK10K project (that looks exactly like what I see in an IGV screen shot of the bams that I used to joint call):
FORMAT GT:DP:DV:GQ:PL:QP:SP
EGAN00001068854 0/1:67:35:99:255,0,255:42:1
EGAN00001068855 0/0:57:0:99:0,172,255:44:0
So it looks like the HC missed the heterozygote calls and putted the wrong values in the DP, GQ and PL fields…is that related to the same issue? Should I try GATK version 4?
Thanks a lot.
From helene on 2018-08-27
Hi GATK team,
I am working on some WGS data, and hope to learn how many reasonably covered “sites” (sites at which GATK can make a call, regardless of quality) have GATK looked over when doing the variant calling in HaplotypeCaller.
I realized that GVCF has blocks, so I cannot simply count lines. Do you know if there is a better way to do it? Thank you very much.
From SteveL on 2018-08-27
Have you looked for lines with “END=” in them @helene – you most likely have non-variant blocks. See the description at the top of this page to understand better what a gVCF is.
From helene on 2018-08-28
@SteveL Thanks so much for your reply. You might replied to my post before I edited it… You are right, I completely forgot there were blocks in GVCF. I am so sorry. Now with this in mind, is there an easy way to have a count about how many “sites” GATK tried to work on please? (All sites minors sites with zero coverage). Thanks so much.
From Sheila on 2018-08-28
@clomoreau
Hi,
Can you post IGV screenshots of the original BAM file and [bamout file](https://gatkforums.broadinstitute.org/gatk/discussion/5484/howto-generate-a-bamout-file-showing-how-haplotypecaller-has-remapped-sequence-reads) here? Please include ~300 bases before and after the sites of interest.
Thanks,
Sheila
From clomoreau on 2018-08-29


From clomoreau on 2018-08-29

From clomoreau on 2018-08-29
Thanks for your answer @Sheila
I had to force active regions to get something in the bamout.
For the homozygote sample (EGAN00001068855), there is nothing even if I force active regions…is that suppose to or I missed something?
From clomoreau on 2018-08-29
Sorry…the first screenshot is the heterozygote EGAN00001068854 bam, the second is the homozygote EGAN00001068855 and the third is the heterozygote EGAN00001068854 bamout…
From SteveL on 2018-08-30
Hi @clomoreau,
I am not sure exactly what you mean when you say “your own WES”, but looking at the IGV shots, this looks like it would be a tricky region to map to, and the mapping quality given to the reads by whichever aligner you have used may affect whether the variants are called or not. Note that many of the reads with the g-variant start at the same position and look like they could be shifted 19nt to the left and they would align perfectly (of course we can’t see what is on the right of the read). Also you seem to have some very short reads being mapped there, which looks strange – some are only 11nt in length in the 3rd screenshot – I don’t know if this is a side-effect of forcing the call.
To me it looks like HC decided it didn’t want to make a call in that position because it felt the data was inconclusive – hence the PL 0,0,0, and the ./. GT call – however, I wouldn’t be sure I am correct. Let’s see what the experts think.
From clomoreau on 2018-08-30
Thanks for your answer @SteveL
Do you know if there is way to identify those tricky regions? Because there should be other regions like this in my data and I would like to remove these from further analyses.
My own WES are whole exomes sequenced in our lab and UK10K are controls from another public resource. I know it is somehow tricky to merge such datasets coming from different sources (I spent a lot of time cleaning these and trying to remove all batch effect biaises), but this position (and probably others…) is very strange as all WES from my lab have been called (homozygotes and heterozygotes), homozygotes from the UK10K were called as well, but all heterozygotes from the UK10K were not called for some reason, they were set to missing. As you can imagine this position is one of the top hits in my association study…but it is an artefact due to the batch effect. That’s why I would like to understand this.
From Sheila on 2018-09-05
@helene
Hi,
> Now with this in mind, is there an easy way to have a count about how many “sites” GATK tried to work on please?
Are you asking how many sites HaplotypeCaller considered as potentially variant? Or, how many sites HaplotypeCaller looked at to make sure there were no variants (or were variants) there?
-Sheila
From Sheila on 2018-09-06
@clomoreau
Hi,
The screenshots seem to be lost. Can you post them again?
-Sheila
EDIT: I was thinking they were attachments, but I see you posted them in the actual post. Can you color the reads by sample in the bamout? Please also zoom out so I can see ~300 bases before and after the site of interest. Thanks.
From clomoreau on 2018-09-10
@Sheila I do not understand what means colouring by sample. I have only 1 sample per screen shot. So I send you the one which I think should be heterozygote, but is called missing in the vcf. It is the bamout zoomed out for this sample.
From clomoreau on 2018-09-10

From aush on 2018-09-13
Hello,
if possible could you please explain how those “potential” (non-variant) sites are detected when we build GVCF? For example, I run `HaplotypeCaller` for each sample with the following arguments:
`-I split.bam -O out.g.vcf -R GRCm38.fa -ERC GVCF —dont-use-soft-clipped-bases -stand-call-conf 20.0` – so I provide only one bam file here. In the resulting file, I see the actual variants from the given bam file, and also non-variant sites – how the latter are build?
Thanks in advance!
From clomoreau on 2018-09-13
Hi, thanks for your comment I realize that the title of the graphs are not there so it is mixing. Here is the exact command that I ran to produce the last 2 graphs above:
java -Xmx64g -Djava.io.tmpdir=tmp -jar ${GATKPATH}gatk-package-4.0.1.2-local.jar HaplotypeCaller \ —emit-ref-confidence GVCF \ -R /exec5/GROUP/pacoss/COMMUN/GENOME_REF/Homo_sapiens.GRCh37.fa \ -I /exec5/GROUP/pacoss/COMMUN/data/exome/EGAN00001068854.sorted.dup.recal.bam \ -L 1:152975510-152976110 \ —dont-trim-active-regions true \ —disable-optimizations true \ -bamout EGAN00001068854.1:152975510-152976110.forced.bamout.bam \ -O EGAN00001068854.chr1:152975510-152976110.hc.g.vcf.gz
And here is what I have at the position which is highlighted in the bamout graph in EGAN00001068854.chr1:152975510-152976110.hc.g.vcf.gz:
1 152975810 . G . . END=152975810 GT:DP:GQ:MIN_DP:PL 0/0:67:0:67:0,0,0
So my problem is that this position looks like heterozygote in the bamout graph, but it is called homozygote in the gvcf. Then when I joint called (CombineGVCFs, GenotypeGVCFs, VariantRecalibrator, ApplyRecalibration, VariantFiltration) all samples in the final vcf this position was set as missing for this sample:
#CHROM POS EGAN00001068854
1 152975810 ./.:67,0:67:PASS:.:.:.:0,0,0
Thanks!
Claudia
From helene on 2018-09-18
Hello GATK team – I wonder how would gatk behave for a site that is not covered by any read? Would it be included into the block or excluded from the gvcf output?
(Doe it have anything to do with “FORMAT=” in the header comments? E.g. in this case, sites with coverage less than 1 would not be included in gvcf – is my understanding correct please?)
Thank you very much
Helen
From SteveL on 2018-09-19
Hi @helene , you would get a block that looks like the middle block in the following 3 lines from a real gVCF . The MIN_DP in the middle line shows that there was 0 reads covering these positions. (hence all other value are “0” as well. So while the GT shows a 0/0, it is equivalent to a ./. as there is no evidence for the GT and when we joint-genotype, if such a position is shown in the final VCF it will be represented as “./.” in this sample.
Note “MIN_DP” is not used as a parameter in Haplotype Caller – it is just produced as an annotation.
1 668406 . G . . END=668423 GT:DP:GQ:MIN_DP:PL 0/0:11:21:8:0,21,281
1 668424 . T . . END=668553 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
1 668554 . A . . END=668563 GT:DP:GQ:MIN_DP:PL 0/0:7:21:7:0,21,217
From helene on 2018-09-19
@SteveL Thanks so much! This is of great help. I really appreciate it.