How to Consolidate GVCFs for joint calling with GenotypeGVCFs

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2018-04-16

In GATK4, the GenotypeGVCFs tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).

UsingGenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsdb-workspace-path my_database \ --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like

Buffer resized from 178298bytes to 262033

-- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -newQual \ -O test_output.vcf

And that's all there is to it.

Important limitations and Common “Gotchas”:

You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.
At least one interval must be provided when using GenomicsDBImport.
Input GVCFs cannot contain multiple entries for a single genomic position
GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.
The annotation counts specified in the header MUST BE VALID! If not, you may see an error like
1. A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV
with mention of a core dump (which may or may not be output depending on your system configuration.) You can check you annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]
GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.
If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDBDISABLEFILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]
HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.
There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+genomicsdb] to see if a fix is in in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.

Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

Updated on 2019-01-23

From Rosmaninho on 2018-04-20

If we want to run GenomicsDBImport in all chromosomes how do we specify it?

Is it possible to specificy samples using the following command? -V $path/*.g.vcf

From Sheila on 2018-04-22

@Rosmaninho

Hi,

Have a look at Geraldine’s last answer in [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/10061/using-genomicsdbimport-to-consolidate-gvcfs-for-input-to-genotypegvcfs-in-gatk4).

>Is it possible to specificy samples using the following command?

-V $path/*.g.vcf

That is the way to do it :smile:

-Sheila

From EliseG on 2018-05-02

Hi,

So if I understand correctly, the best way is to use GenomicsDBImport when it’s possible (as a BestPractice).

But is-it possible to use ~~V ${sep = “~~ V” GVCFs} like in the following tutorial: https://software.broadinstitute.org/wdl/documentation/article?id=7614 ?

Elise

From Sheila on 2018-05-08

@EliseG

Hi Elise,

No, GenotypeGVCFs in GATK4 only accepts 1 input GVCF. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11813).

-Sheila

From EliseG on 2018-05-16

Thanks Sheila!

Elise

From biojiangke on 2018-06-22

Hi,

I have a question about the behavior of the interval option in CombineGVCF: I understand it could take standard samtools format chr:start-end, and BED format, but it also could take the format of chr:pos, as I tried. I would think GATK processes one genomic position in this situation, but instead, I’m getting results up to 5bp from this specified position. Would anyone provide more information about this behavior?

The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

Thanks!

From Sheila on 2018-06-25

@biojiangke

Hi Ke,

>I’m getting results up to 5bp from this specified position.

I am not sure what you mean. Can you post some example records?

Thanks,

Sheila

From biojiangke on 2018-06-26

For example, I supply a position to the GenotypGVCF operation in this format: 2:42889589 with the -L option. I would expect genotypes from this location, but instead, I got variant information at 2:42889592, which is 3bp away from the expected location.

From biojiangke on 2018-06-26

Sometimes I can see this is an InDel issue, but sometimes it is a SNP, but the position is still off.

From Sheila on 2018-06-27

@biojiangke

Hi Ke,

I see. Can you post the exact command you ran? Are you only using one position in the interval?

Thanks,

Sheila

From biojiangke on 2018-06-27

sites.list (GTAK format but only 1bp)

2:121169112

2:35031950

2:42889592

2:4365346

Combine VCF step:

gatk CombineGVCFs -R ref.fa -L sites.list -O Combined.vcf -V gVCF.list

Joint call step:

gatk GenotypeGVCFs -O GTed.vcf -R ref.fa -V Combined.vcf

The positions in GTed.vcf are:

2 4365345

2 35031949

2 42889589

2 121169110

After inspecting the results, I seem to know the reason. As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

From Sheila on 2018-06-29

@biojiangke

Hi,

>As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

Yes, that is correct. Thanks for posting.

-Sheila

From QuinnC on 2018-08-03

I am currently using GATK 4.0.5.1 and looking to analyse SNPs from a target capture dataset done on a non-model organism without linkage or chromosome information. There are 2900 targeted regions/contigs each across 45 individuals. I have run HaplotypeCaller using -ERC mode and now have 45 g.vcf files (about 400-700mb each) that I will like to combine for GenotypeGVCFs.

Does the GenomicsDBImport for this version or newer versions support multiple intervals? I will prefer not to have 2900 separate output files after this step.

I have tried combineGVCFs across the 45 g.vcf files without using the -L option but always ran into “GC threads out of memory error” even with -Xmx32G.

Previously we used GATK 3.7 without any problem as GenotypeGVCFs accepted multiple g.vcf files as input.

I’d appreciate your suggestion on how I can move forward with my analyses.

Thanks in advance,

Quinn

From gmauro72 on 2018-08-03

@QuinnC

Have a look at https://gatkforums.broadinstitute.org/gatk/discussion/11231/are-there-any-plans-to-add-multi-interval-support-to-genomicsdbimport#latest

Mauro

From QuinnC on 2018-08-03

Thank you, Mauro! Will upgrade to 4.0.6.0 and give it a try!

From MehulS on 2018-12-20

Do I need to install a separate GenomicsDB tool for this command to work ? The[ GitHub](http://https://github.com/Intel-HLS/GenomicsDB/wiki/Querying-GenomicsDB#producing-combined-gvcf “ GitHub”) link seems to suggest so.

From hdetering on 2018-12-26

Using `GenotypeGVCFs` from GATK 4.0.10.0, the option `-newQual` gave me an error:

```

A USER ERROR has occurred: n is not a recognized option

```

Trying `—newQual` was unsuccessful, too:

```

A USER ERROR has occurred: newQual is not a recognized option

```

Am I right in assuming that this was an upward-compatibility option in GATK3?

From leshwill on 2019-01-28

Hello, I have 520 WES samples. I’m trying to run GenomicsDBImport per chromosome.

I used the —intervals chr1 as per the example but got “Query interval “chr1” is not valid for this input”.

Is it possible to run GenomincsDBImport per chromosome?

My plan is to create joint called VCFs for each chromosome then combine the chromosome VCFs to get a final VCF file. I’ve been running CombineGVCFs for about 12days and so far the progress meter indicates it’s on chromosome 5. I need help. How can I make this process go faster?

From danangcrysnanto on 2019-02-07

I parallelize genomicsdbimport tools to 2000 interval across genome with sample size ~400 and run about 100 jobs simultaneously. I got complain from my cluster admin that my jobs overloaded the Lustre filesystems by having lots I/O (I already set —batch-size 50). Alternatively, I could use the SSD attached to the computing nodes but I need to move the output files whenever the jobs finished as it will be immediately deleted.

Is it now possible to move the genomic database output after creation? Do you recommend to use CombineGVCFs instead (and 400 samples are realistics to be put in a single big gvcf files)?

Thank you

From lauren_white on 2019-03-04

Hello,

I am trying to combine 350 exome-capture GVCFs using GenomicDBImport. I have started with chr21 (-L chr21) as a test run, but the program seems to be stuck. The last output showed the headings for the progress meter, but it has not moved on in nearly an hour.

I have only recently switched to gatk4. When using gatk3 GenotypeGVCFs (without first having to combine the gvcfs) on chr21 alone, the process usually took less than an hour. I have tried switching back to this version, but the single gvcfs now seem to contain header information that gatk3 can not deal with (Unable to parse header with error: For input string: “R”).

I appreciate any advice or help you can provide.

Thankyou

-Lauren

From EvanKristiansen on 2019-03-13

I’m curious if this is still the most common thread for GenomicsDBImport? I’ve yet to see anyone answer the question of whether it’s possible to run GenomicsDBImport with all contigs. I’m running gatk/4.0.7.0 on an SCC.

Basically, I have 20 individuals aligned and called with a not so great reference genome; so I have ~300 scaffolds. Is it possible to run GenomicsDBImport without having to manually input a list of 300 scaffolds?

Thanks!

Evan

From jmedarom on 2019-03-14

Hi,

I have a question about this Geraldine’s comment :

“Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don’t combine workspaces; that is not possible. However there are a few tweaks you can apply.”

In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3…), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.

Thanks,

Jenn

From vaa2011 on 2019-03-19

Hi,

I would like to combine a cohort of samples from WGS. For this I followed the indications in this post (on a small subset of my cohort) and was trying to run GenomicsDBImport on all of the chromosomes at the same time. As the command forces you to specify at least one interval, I listed all the chromosomes and I get this error:

A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval “chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX” is not valid for this input

Has anyone been able to successfully run this command for the whole genome? I have no issues running CombineGVCFs except for it being slow.

From Yifang on 2019-03-22

Hello,

I hit the same wall recently.

The format for options -L and —intervals are still unclear to me. I tried to include all the chromosomes at once, but not successful. The interval seems taking entries separated by comma as I did not get error without -L option: —intervals N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19

as “ERROR -L option is needed ……“

I then added -L option but got error with “-L options unrecognized N2 etc…” (sorry, did not have screenshot record!)

…………….

The two options seem mutual exclusive to each other as I could not get them run at the same time even I only provide a single chromosome. (-L N1 and —interval N1). Bang bang bang!!!

It seems only one interval is allowed with the -L option.

Spent hours testing this and finally got one successful case. Here is my script:

gDB=/path/to/genomeDBImport #Must exist, and will be created by the program

gatk4_jar=/storage/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar

for chr in N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13 N14 N15 N16 N17 N18 N19; do java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \ -jar ${gatk4_jar} GenomicsDBImport \ —genomicsdb-workspace-path ${gDB}_${chr} \ —batch-size 16 \ -L ${chr} \ —sample-name-map /storage/yifang/20190225/data1/cohort_samples.map \ —tmp-dir=/storage/ppl/yifang/20190225/data1/tmp \ —reader-threads 8 & done

I put here for a reference. Hope it can become a good starter for expert to give better example with deeper insight.

Gook luck to all!

Yifang

From danat_yermakovich on 2019-04-09

Yifang

vaa2011

hi guys

I’m a little bit annoyed of this examples of commands on the top cause nor one of them are not working.

I tried my old 4.0.6.0 and the latest 4.1.1.0 gatk

The first one fall down on this line:

gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ —genomicsdb-workspace-path my_database \

> —intervals chr20,chr21

after spending some time, i thought up syntax that allowed – you need put -L flag to each chr

./gatk-4.1.1.0/gatk —java-options “-Xmx8g -Xms6g” GenomicsDBImport \ —genomicsdb-workspace-path $Path/genomic_db \ —sample-name-map cohort.sample_map \ —reader-threads 2 \

> -L chr20 -L chr21 -L chr22 -L chrX -L chr2

weird not fixed shi__t

@hdetering

and the second one:

gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \

> -newQual \ -O test_output.vcf

$ gatk-4.1.1.0/gatk GenotypeGVCFs

after this command you have a help message, and there is no flag —newQual

it really was in gatk3 GenomeAnalysisTK.jar

but u can find the same by sense:

_—use-new-qual-calculator,-new-qual:Boolean Use the new AF model instead of the so-called exact model Default value: true.

Possible values: {true, false}_

as you can see, the default value is true – u don’t need to clearly indicates this option

i’m really disappointed – it’s not the best practise to mix up options from old and new versions

p.s. sorry for not excellent English, but i hope it was understandable

From Alva on 2019-07-23

Hello Geraldine_VdAuwera Sheila ,

I have a question regarding using `GenomicsDBImport` over `CombineGVCFs`, I have been using `CombineGVCFs` for joining 320 gvf files using following memory options,

• —java-options ‘-Xmx350G -XX:+UseParallelGC -XX:ParallelGCThreads=40’ [ERROR] • —java-options -Xms454m -Xmx3181m -XX:+UseSerialGC [Error] • —java-options ‘-Xmx650G -XX:+UseParallelGC -XX:ParallelGCThreads=70’ [ERROR] —java-options ‘-Xmx900G -XX:+UseParallelGC -XX:ParallelGCThreads=80’ [ERROR]

In all above cases I got the following error, # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0×00002ad5d784802e, pid=31158, tid=0×00002ad5d6327280 # # JRE version: Java™ SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11) # Java VM: Java HotSpot™ 64-Bit Server VM (25.172-b11 mixed mode linux-amd64 ) # Problematic frame: # V [libjvm.so+0×92c02e] SR_handler(int, siginfo*, ucontext*)+0×3e # # Core dump written. Default location: /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/core or core.31158 # # An error report file with more information is saved as: # /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/hs_err_pid31158.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp

Therefore, I thought of using `GenomicsDBImport ` tool over the `Combinegvcfs`,. However, in my case I have 1824 contigs from within humnan and viral sequences for each`gvcf` files. I would like to whether the `intervals` parameters can be used efficentevly in my case?

From wbsimey on 2019-07-31

Hello,

We are using the latest GATK4 conda install on Ubuntu 16 with 1TB RAM.

We are trying to use the GenomicsDBImport tool to combine dozens of gVCFs from WGS individuals. Our reference genome has thousands of scaffolds (40,232), 25 of which we have assigned as primary chromosomes.

We have tried many different interval options with no luck. The example provided above with commas (chr20,chr21) does not work. We tried multiple -L commands for each scaffold, this also does not work.

We have tried to use a list, but I think we must be formatting the list incorrectly (I tried to find an example .list, but could not. Sorry in advance if there is one).

What are we doing wrong?

I am not posting the error messages, because they are the same as others have posted here.

The following are a couple examples of what we have tried:

Example 1:

gatk —java-options “-Xmx200G” GenomicsDBImport \ -V TP30046_sorted_dedup_rg.gVCF \ -V TP29957_sorted_dedup_rg.gVCF \

… -V TP29969_sorted_dedup_rg.gVCF \ -V AHP1168_sorted_dedup_rg.gVCF \ —genomicsdb-workspace-path ../Genomics_DB-all \ —intervals scaffold_1,scaffold_2,scaffold_3,…,scaffold_25

Example 2:

gatk —java-options “-Xmx200G” GenomicsDBImport \ -V TP30046_sorted_dedup_rg.gVCF \ -V TP29957_sorted_dedup_rg.gVCF \

… -V TP29969_sorted_dedup_rg.gVCF \ -V AHP1168_sorted_dedup_rg.gVCF \ —genomicsdb-workspace-path ../Genomics_DB-all \ -L scaffold_1 -L scaffold_2 -L scaffold_3

… -L scaffold_25

From jmedarom on 2019-08-26

Hi @Sheila ,

I have been waiting for this answer so long. Could someone from gatk team give an answer to this post please?.

Thank you for your help.

jmedarom said: > Hi, > > I have a question about this Geraldine's comment : > > "Hi

sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don’t combine workspaces; that is not possible. However there are a few tweaks you can apply.“

> In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3…), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.

> Thanks,

> Jenn

From lidia on 2019-08-28

Hi @wbsimey ,

I am having the same problem that you have and I can´t seem to find an answer in the documentation, nor in the forums. Were you able to find a solution?

If so, I would very much appreciate your help.

Lidia

From wbsimey on 2019-08-28

Hi @lidia, we were never able to get GenomicsDBImport to accept more than one interval. But our workflow includes 43 individuals (2.3 Gbp each) and 29 chromosomes/scaffolds/intervals, which, by our estimation would require about 2-3 months for HaplotypeCaller to finish. So, I broke up HaplotypeCaller analyses into individual chromosomes and ran many instances simultaneously (I have access to large multi threaded Xeon servers). My plan was to then use CombineGVCFs to merge the chromosomes of each individual into a single gVCF file, then use GenomicsDBImport to create a single database for all samples and chromosomes. But, as the GATK documentation points out, CombineGVCFs is very slow; it required more than 48 hours for one of our smallest alignments. So, I am planning to use GenomicsDBImport to create a unique database for each chromosome (29 databases).

From fengtao on 2019-11-06

lidia wbsimey

I use gatk4.0.8, and it works with multiple `—intervals` arguments:

gatk —java-options “-Xmx200G” GenomicsDBImport \ —sample-name-map cohort.sample_map \ —genomicsdb-workspace-path output_DB\ —intervals Chr1 \ —intervals Chr2 \ —intervals Chr3 \ —intervals Chr4 \ —intervals Chr5

From wbsimey on 2019-11-08

Thanks @fengtao ! This did indeed work for me on 4.1.4 conda install.

Report abuse