created by Geraldine_VdAuwera
on 2018-04-16
In GATK4, the GenotypeGVCFs
tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs
. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF
or -ERC BP_RESOLUTION
.
Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport
and CombineGVCFs
.
GenomicsDBImport
is the preferred tool (see detailed instructions below); CombineGVCFs
is provided only as a backup solution for people who cannot use GenomicsDBImport
. We know CombineGVCFs
is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport
first and only fall back on CombineGVCFs
if you experience issues that we are unable to help you solve (ask us for help in the forum!).
The GenomicsDBImport
tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs
can then read from the created GenomicsDB directly and output the final multi-sample VCF.
So if you have a trio of GVCFs your GenomicsDBImport
command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):
gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsdb-workspace-path my_database \ --intervals chr20,chr21
That generates a directory called my_database
containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like
Buffer resized from 178298bytes to 262033
-- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.
Then you run joint genotyping; note the gendb://
prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.
gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -newQual \ -O test_output.vcf
And that's all there is to it.
GenomicsDBImport
.GenomicsDBImport
cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.Could not open array genomicsdb_array at workspace:[...]
PS
) tag in the FORMAT field.If you can't use GenomicsDBImport
for whatever reason, fall back to CombineGVCFs
instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants
as follows:
gatk SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf
GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.
GenomicsDB can now import directly from a Google cloud path (i.e. gs://
) using NIO.
Updated on 2019-01-23
From Rosmaninho on 2018-04-20
If we want to run GenomicsDBImport in all chromosomes how do we specify it?
Is it possible to specificy samples using the following command? -V $path/*.g.vcf
From Sheila on 2018-04-22
@Rosmaninho
Hi,
Have a look at Geraldine’s last answer in [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/10061/using-genomicsdbimport-to-consolidate-gvcfs-for-input-to-genotypegvcfs-in-gatk4).
>Is it possible to specificy samples using the following command?
-V $path/*.g.vcf
That is the way to do it :smile:
-Sheila
From EliseG on 2018-05-02
Hi,
So if I understand correctly, the best way is to use GenomicsDBImport when it’s possible (as a BestPractice).
But is-it possible to use V ${sep = “ V” GVCFs} like in the following tutorial: https://software.broadinstitute.org/wdl/documentation/article?id=7614 ?
Elise
From Sheila on 2018-05-08
@EliseG
Hi Elise,
No, GenotypeGVCFs in GATK4 only accepts 1 input GVCF. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11813).
-Sheila
From EliseG on 2018-05-16
Thanks Sheila!
Elise
From biojiangke on 2018-06-22
Hi,
I have a question about the behavior of the interval option in CombineGVCF: I understand it could take standard samtools format chr:start-end, and BED format, but it also could take the format of chr:pos, as I tried. I would think GATK processes one genomic position in this situation, but instead, I’m getting results up to 5bp from this specified position. Would anyone provide more information about this behavior?
The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.
Thanks!
Ke
From Sheila on 2018-06-25
@biojiangke
Hi Ke,
>I’m getting results up to 5bp from this specified position.
I am not sure what you mean. Can you post some example records?
Thanks,
Sheila
From biojiangke on 2018-06-26
For example, I supply a position to the GenotypGVCF operation in this format: 2:42889589 with the -L option. I would expect genotypes from this location, but instead, I got variant information at 2:42889592, which is 3bp away from the expected location.
From biojiangke on 2018-06-26
Sometimes I can see this is an InDel issue, but sometimes it is a SNP, but the position is still off.
From Sheila on 2018-06-27
@biojiangke
Hi Ke,
I see. Can you post the exact command you ran? Are you only using one position in the interval?
Thanks,
Sheila
From biojiangke on 2018-06-27
sites.list (GTAK format but only 1bp)
2:121169112
2:35031950
2:42889592
2:4365346
Combine VCF step:
gatk CombineGVCFs -R ref.fa -L sites.list -O Combined.vcf -V gVCF.list
Joint call step:
gatk GenotypeGVCFs -O GTed.vcf -R ref.fa -V Combined.vcf
The positions in GTed.vcf are:
2 4365345
2 35031949
2 42889589
2 121169110
After inspecting the results, I seem to know the reason. As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.
From Sheila on 2018-06-29
@biojiangke
Hi,
>As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.
Yes, that is correct. Thanks for posting.
-Sheila
From QuinnC on 2018-08-03
I am currently using GATK 4.0.5.1 and looking to analyse SNPs from a target capture dataset done on a non-model organism without linkage or chromosome information. There are 2900 targeted regions/contigs each across 45 individuals. I have run HaplotypeCaller using -ERC mode and now have 45 g.vcf files (about 400-700mb each) that I will like to combine for GenotypeGVCFs.
Does the GenomicsDBImport for this version or newer versions support multiple intervals? I will prefer not to have 2900 separate output files after this step.
I have tried combineGVCFs across the 45 g.vcf files without using the -L option but always ran into “GC threads out of memory error” even with -Xmx32G.
Previously we used GATK 3.7 without any problem as GenotypeGVCFs accepted multiple g.vcf files as input.
I’d appreciate your suggestion on how I can move forward with my analyses.
Thanks in advance,
Quinn
From gmauro72 on 2018-08-03
@QuinnC
Have a look at https://gatkforums.broadinstitute.org/gatk/discussion/11231/are-there-any-plans-to-add-multi-interval-support-to-genomicsdbimport#latest
Mauro
From QuinnC on 2018-08-03
Thank you, Mauro! Will upgrade to 4.0.6.0 and give it a try!
From MehulS on 2018-12-20
Do I need to install a separate GenomicsDB tool for this command to work ? The[ GitHub](http://https://github.com/Intel-HLS/GenomicsDB/wiki/Querying-GenomicsDB#producing-combined-gvcf “ GitHub”) link seems to suggest so.
From hdetering on 2018-12-26
Using `GenotypeGVCFs` from GATK 4.0.10.0, the option `-newQual` gave me an error:
```
A USER ERROR has occurred: n is not a recognized option
```
Trying `—newQual` was unsuccessful, too:
```
A USER ERROR has occurred: newQual is not a recognized option
```
Am I right in assuming that this was an upward-compatibility option in GATK3?
From leshwill on 2019-01-28
Hello, I have 520 WES samples. I’m trying to run GenomicsDBImport per chromosome.
I used the —intervals chr1 as per the example but got “Query interval “chr1” is not valid for this input”.
Is it possible to run GenomincsDBImport per chromosome?
My plan is to create joint called VCFs for each chromosome then combine the chromosome VCFs to get a final VCF file. I’ve been running CombineGVCFs for about 12days and so far the progress meter indicates it’s on chromosome 5. I need help. How can I make this process go faster?
From danangcrysnanto on 2019-02-07
Hi
I parallelize genomicsdbimport tools to 2000 interval across genome with sample size ~400 and run about 100 jobs simultaneously. I got complain from my cluster admin that my jobs overloaded the Lustre filesystems by having lots I/O (I already set —batch-size 50). Alternatively, I could use the SSD attached to the computing nodes but I need to move the output files whenever the jobs finished as it will be immediately deleted.
Is it now possible to move the genomic database output after creation? Do you recommend to use CombineGVCFs instead (and 400 samples are realistics to be put in a single big gvcf files)?
Thank you
From lauren_white on 2019-03-04
Hello,
I am trying to combine 350 exome-capture GVCFs using GenomicDBImport. I have started with chr21 (-L chr21) as a test run, but the program seems to be stuck. The last output showed the headings for the progress meter, but it has not moved on in nearly an hour.
I have only recently switched to gatk4. When using gatk3 GenotypeGVCFs (without first having to combine the gvcfs) on chr21 alone, the process usually took less than an hour. I have tried switching back to this version, but the single gvcfs now seem to contain header information that gatk3 can not deal with (Unable to parse header with error: For input string: “R”).
I appreciate any advice or help you can provide.
Thankyou
-Lauren
From EvanKristiansen on 2019-03-13
I’m curious if this is still the most common thread for GenomicsDBImport? I’ve yet to see anyone answer the question of whether it’s possible to run GenomicsDBImport with all contigs. I’m running gatk/4.0.7.0 on an SCC.
Basically, I have 20 individuals aligned and called with a not so great reference genome; so I have ~300 scaffolds. Is it possible to run GenomicsDBImport without having to manually input a list of 300 scaffolds?
Thanks!
Evan
From jmedarom on 2019-03-14
Hi,
I have a question about this Geraldine’s comment :
“Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don’t combine workspaces; that is not possible. However there are a few tweaks you can apply.”
In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3…), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.
Thanks,
Jenn
From vaa2011 on 2019-03-19
Hi,
I would like to combine a cohort of samples from WGS. For this I followed the indications in this post (on a small subset of my cohort) and was trying to run GenomicsDBImport on all of the chromosomes at the same time. As the command forces you to specify at least one interval, I listed all the chromosomes and I get this error:
A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval “chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX” is not valid for this input
Has anyone been able to successfully run this command for the whole genome? I have no issues running CombineGVCFs except for it being slow.
From Yifang on 2019-03-22
Hello,
I hit the same wall recently.
The format for options -L and —intervals are still unclear to me. I tried to include all the chromosomes at once, but not successful. The interval seems taking entries separated by comma as I did not get error without -L option: —intervals N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19
as “ERROR -L option is needed ……“
I then added -L option but got error with “-L options unrecognized N2 etc…” (sorry, did not have screenshot record!)
…………….
The two options seem mutual exclusive to each other as I could not get them run at the same time even I only provide a single chromosome. (-L N1 and —interval N1). Bang bang bang!!!
It seems only one interval is allowed with the -L option.
Spent hours testing this and finally got one successful case. Here is my script:
gDB=/path/to/genomeDBImport #Must exist, and will be created by the program
gatk4_jar=/storage/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar
for chr in N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13 N14 N15 N16 N17 N18 N19; do java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \ -jar ${gatk4_jar} GenomicsDBImport \ —genomicsdb-workspace-path ${gDB}_${chr} \ —batch-size 16 \ -L ${chr} \ —sample-name-map /storage/yifang/20190225/data1/cohort_samples.map \ —tmp-dir=/storage/ppl/yifang/20190225/data1/tmp \ —reader-threads 8 & done
I put here for a reference. Hope it can become a good starter for expert to give better example with deeper insight.
Gook luck to all!
Yifang
From danat_yermakovich on 2019-04-09
Yifang
vaa2011
hi guys
I’m a little bit annoyed of this examples of commands on the top cause nor one of them are not working.
I tried my old 4.0.6.0 and the latest 4.1.1.0 gatk
The first one fall down on this line:
gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ —genomicsdb-workspace-path my_database \
> —intervals chr20,chr21
after spending some time, i thought up syntax that allowed – you need put -L flag to each chr
like
./gatk-4.1.1.0/gatk —java-options “-Xmx8g -Xms6g” GenomicsDBImport \ —genomicsdb-workspace-path $Path/genomic_db \ —sample-name-map cohort.sample_map \ —reader-threads 2 \
> -L chr20 -L chr21 -L chr22 -L chrX -L chr2
weird not fixed shi__t
@hdetering
and the second one:
gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \
> -newQual \ -O test_output.vcf
so
$ gatk-4.1.1.0/gatk GenotypeGVCFs
after this command you have a help message, and there is no flag —newQual
it really was in gatk3 GenomeAnalysisTK.jar
but u can find the same by sense:
_—use-new-qual-calculator,-new-qual:Boolean Use the new AF model instead of the so-called exact model Default value: true.
Possible values: {true, false}_
as you can see, the default value is true – u don’t need to clearly indicates this option
i’m really disappointed – it’s not the best practise to mix up options from old and new versions
p.s. sorry for not excellent English, but i hope it was understandable
From Alva on 2019-07-23
Hello Geraldine_VdAuwera
Sheila ,
I have a question regarding using `GenomicsDBImport` over `CombineGVCFs`, I have been using `CombineGVCFs` for joining 320 gvf files using following memory options,
• —java-options ‘-Xmx350G -XX:+UseParallelGC -XX:ParallelGCThreads=40’ [ERROR] • —java-options -Xms454m -Xmx3181m -XX:+UseSerialGC [Error] • —java-options ‘-Xmx650G -XX:+UseParallelGC -XX:ParallelGCThreads=70’ [ERROR] —java-options ‘-Xmx900G -XX:+UseParallelGC -XX:ParallelGCThreads=80’ [ERROR]
In all above cases I got the following error, # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0×00002ad5d784802e, pid=31158, tid=0×00002ad5d6327280 # # JRE version: Java™ SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11) # Java VM: Java HotSpot™ 64-Bit Server VM (25.172-b11 mixed mode linux-amd64 ) # Problematic frame: # V [libjvm.so+0×92c02e] SR_handler(int, siginfo*, ucontext*)+0×3e # # Core dump written. Default location: /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/core or core.31158 # # An error report file with more information is saved as: # /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/hs_err_pid31158.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp
Therefore, I thought of using `GenomicsDBImport ` tool over the `Combinegvcfs`,. However, in my case I have 1824 contigs from within humnan and viral sequences for each`gvcf` files. I would like to whether the `intervals` parameters can be used efficentevly in my case?
From wbsimey on 2019-07-31
Hello,
We are using the latest GATK4 conda install on Ubuntu 16 with 1TB RAM.
We are trying to use the GenomicsDBImport tool to combine dozens of gVCFs from WGS individuals. Our reference genome has thousands of scaffolds (40,232), 25 of which we have assigned as primary chromosomes.
We have tried many different interval options with no luck. The example provided above with commas (chr20,chr21) does not work. We tried multiple -L commands for each scaffold, this also does not work.
We have tried to use a list, but I think we must be formatting the list incorrectly (I tried to find an example .list, but could not. Sorry in advance if there is one).
What are we doing wrong?
I am not posting the error messages, because they are the same as others have posted here.
The following are a couple examples of what we have tried:
Example 1:
gatk —java-options “-Xmx200G” GenomicsDBImport \ -V TP30046_sorted_dedup_rg.gVCF \ -V TP29957_sorted_dedup_rg.gVCF \
… -V TP29969_sorted_dedup_rg.gVCF \ -V AHP1168_sorted_dedup_rg.gVCF \ —genomicsdb-workspace-path ../Genomics_DB-all \ —intervals scaffold_1,scaffold_2,scaffold_3,…,scaffold_25
Example 2:
gatk —java-options “-Xmx200G” GenomicsDBImport \ -V TP30046_sorted_dedup_rg.gVCF \ -V TP29957_sorted_dedup_rg.gVCF \
… -V TP29969_sorted_dedup_rg.gVCF \ -V AHP1168_sorted_dedup_rg.gVCF \ —genomicsdb-workspace-path ../Genomics_DB-all \ -L scaffold_1 -L scaffold_2 -L scaffold_3
… -L scaffold_25
From jmedarom on 2019-08-26
Hi @Sheila ,
I have been waiting for this answer so long. Could someone from gatk team give an answer to this post please?.
Thank you for your help.
>
jmedarom said: > Hi, > > I have a question about this Geraldine's comment : > > "Hi
sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don’t combine workspaces; that is not possible. However there are a few tweaks you can apply.“
>
> In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3…), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.
>
> Thanks,
>
> Jenn
From lidia on 2019-08-28
Hi @wbsimey ,
I am having the same problem that you have and I can´t seem to find an answer in the documentation, nor in the forums. Were you able to find a solution?
If so, I would very much appreciate your help.
Lidia
From wbsimey on 2019-08-28
Hi @lidia, we were never able to get GenomicsDBImport to accept more than one interval. But our workflow includes 43 individuals (2.3 Gbp each) and 29 chromosomes/scaffolds/intervals, which, by our estimation would require about 2-3 months for HaplotypeCaller to finish. So, I broke up HaplotypeCaller analyses into individual chromosomes and ran many instances simultaneously (I have access to large multi threaded Xeon servers). My plan was to then use CombineGVCFs to merge the chromosomes of each individual into a single gVCF file, then use GenomicsDBImport to create a single database for all samples and chromosomes. But, as the GATK documentation points out, CombineGVCFs is very slow; it required more than 48 hours for one of our smallest alignments. So, I am planning to use GenomicsDBImport to create a unique database for each chromosome (29 databases).
From fengtao on 2019-11-06
lidia
wbsimey
I use gatk4.0.8, and it works with multiple `—intervals` arguments:
gatk —java-options “-Xmx200G” GenomicsDBImport \ —sample-name-map cohort.sample_map \ —genomicsdb-workspace-path output_DB\ —intervals Chr1 \ —intervals Chr2 \ —intervals Chr3 \ —intervals Chr4 \ —intervals Chr5
From wbsimey on 2019-11-08
Thanks @fengtao ! This did indeed work for me on 4.1.4 conda install.