created by Geraldine_VdAuwera
on 2017-07-29
This article is out of date and has been replaced by https://software.broadinstitute.org/gatk/documentation/article?id=11813
In GATK4, the GenotypeGVCFs
tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs
. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case there is only one valid way to do it: with GenomicsDBImport
.
The GenomicsDBImport
tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs
can then read from the created GenomicsDB directly and output a VCF. Note that GenomicsDBImport does not take two or more same sample GVCFs. You will need to create one GVCF per-sample before running the tool.
Here are example commands to use it:
gatk-launch GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsDBWorkspace my_database \ --intervals 20
That generates a directory called my_database
containing the combined gvcf data.
Then you run joint genotyping; note the gendb://
prefix to the database input directory path.
gatk-launch GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -G StandardAnnotation -newQual \ -O test_output.vcf
And that's all there is to it.
There are three caveats:
GenomicsDBImport
on a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.If you can't use GenomicsDB for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
Addendum: extracting data from the GenomicsDB
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:
gatk-launch SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf
Caveat: cannot move database after creation
Currently the GenomicsDB internal code uses the absolute path of the location of the database as part of the data encoding. As a consequence, you cannot move the database to a different location before running GenotypeGVCFs on it. If you do, it will no longer work. This is obviously not desirable, and the development team is looking at options to remediate this.
Updated on 2018-04-16
From chapmanb on 2017-08-10
Geraldine;
Thanks much for this tutorial. I’m working on implementing joint calling in bcbio with GATK4 beta3 and running into issues getting data in and out of GenomicsDB. I appear to lose genotypes when importing and as a result GenotypeGVCFs outputs are empty. I put together a self contained reproducible test case here:
https://s3.amazonaws.com/chapmanb/testcases/gatk4_genotypegvcfs.tar.gz
which consists of running an import, and then a GenotypeGVCFs and SelectVariants to test export:
```
gatk-launch GenomicsDBImport —genomicsDBWorkspace test_genomicsdb -L chrM:1-1000 —variant Test1.vcf.gz —variant Test2.vcf.gz
gatk-launch GenotypeGVCFs —variant gendb://test_genomicsdb -R hg19.fa —output joint.vcf.gz -L chrM:1-1000
gatk-launch SelectVariants -R hg19.fa -V gendb://test_genomicsdb -O extract.vcf.gz
```
The input VCFs look have calls within the chrM region I’m testing:
```
Test1.vcf.gz:chrM 150 . T C, 49406.77 . DP=1084;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=3902400.00 GT:AD:DP:GQ:MCL:MMQ:PGT:PID:PL:SB 1/1:0,1081,0:1081:99:0,0,0:0,60,0:0|1:150_T_C:49435,3362,0,49435,3362,49435:0,0,483,598
```
but are empty in the `GenotypeGVCFs` output. Running `SelectVariants` shows the sample but with empty genotype calls:
```
extract.vcf.gz:chrM 150 . T C, . . DP=2168;ExcessHet=3.01;MQ0=0;RAW_MQ=7804800.00 GT:AD:DP:GQ:MCL ./.:0,1081,0:1081:99:0.00,0.00 ./.:0,1081,0:1081:99:0.00,0.00
```
I’m stuck on what I’m doing wrong. Could anyone offer any pointers on ways to fix/tweak this to get correct genotyping? Thanks much.
From Geraldine_VdAuwera on 2017-08-10
Hi Brad, when you say the outputs are empty in GenotypeGVCFs, do you mean the files are empty, or some positions are missing?
From chapmanb on 2017-08-10
Geraldine;
Thanks for getting back with me so quickly. GenotypeGVCFs produces an VCF file with just the header and no variant positions called. This is a small test dataset we use a lot so expect a number of calls like at the chrM 150 position I show above. Thanks again for looking at this.
From Geraldine_VdAuwera on 2017-08-10
Hmm, someone else reported something like this recently. We’ll look into it.
From Geraldine_VdAuwera on 2017-08-10
Issue created [here](https://github.com/broadinstitute/gatk/issues/3429) in case you want to track progress
From chapmanb on 2017-08-10
Brilliant, thanks Geraldine. I subscribed and happy to help there with any other information needed.
From jjmmii on 2017-08-11
Hi Geraldine,
May you please show me how to combine per-chromosome
I am unable to pass the intervals file I used in the previous steps of the Best Practices workflow, namely the exome intervals, with the error “More than one interval specified. The tool takes only one”. But when I don’t pass any intervals, the error becomes “Argument intervals was missing: Argument ‘intervals’ is required.” May you please show me the correct way to specify the exome intervals? Thank you very much.
EDIT: I just read that I can only specify one interval at a time. If I specify per-chromosome intervals, then how do I merge them together to form a whole-genome VCF file? Thank you.
From Sheila on 2017-08-21
@jjmmii
Hi,
Sorry for the delay. Let me check with the team and get back to you.
-Sheila
EDIT: I think you can use MergeVcfs.
From Sheila on 2017-08-23
@jjmmii
Hi,
I just heard back from the developers. If you want to merge disjoint shards after a split operation you should use GatherVCFs; if you want to to merge VCFs that can overlap each other you should use MergeVariants.
-Sheila
From bbimber on 2017-09-11
Hello – this is tangential to this thread, but I have a question on GenomicsDB in general. If I understand this, in the example you’re using GenomicsDB to aggregate files to run computation that in turn produces a VCF output. I can see the utility of that. In your example, the GenomicsDB workspace is essentially just a transient tool to support the generation of the actual VCF (though I suppose technically it could stick around).
GenomicsDB seems like it would also make sense as a long term data store for production genotype calls. I see that GenomicsDB’s import will work for any VCF and there are hooks available to issue queries. From the outside, it appear this functionality might be a little less developed than your joint genotyping use case? Are there any public projects or examples you know about using GenomicsDB for data storage? Thanks.
From Sheila on 2017-09-19
@bbimber
Hi,
Keep in mind, the tools in GATK4 are still in beta.
That said, right now, I am not aware of any public projects that use it. I think that is one of the goals of the tool, however. I think the team would like it to be a persistent place for vcfs to live and to make quick callsets.
I hope that helps. We should have more documentation and information in the next few months :smile:
-Sheila
From berkel on 2017-11-15
Hi Geraldine,
I found the same problem with missing variant positions in GATK4.b6 and followed the bug report you linked to. The issue was marked as closed a month ago but the issue was not fixed in GATK4.b6. Is it known when will the fix be available?
From Sheila on 2017-11-20
@berkel
Hi,
Are you referring to the empty VCF when using GenomicsDBImport? That should be fixed in beta 6. Can you post the exact commands you ran?
Thanks,
Sheila
From eykim909 on 2018-01-18
Hi, GenomicsDBImport in GATK4, I used -L my_exome_bed_file and -ip 100 for interval, and got the error “More than one interval specified. The tool takes only one”. Without -ip 100 option, I still get the same error. The -L with my_exome_bed has been used for all steps. I have ~1000 g.vcf files, so I would like to use GenomicsDBImport than CombineGVCFs per 1000 single g.vcf files. How could I proceed this?
Thanks!
From jfarrell on 2018-01-18
Is there a way to specify a file containing a list of the gVCFs to merge? Repeating -V for each vcf makes for a long command-line. I would just like to pass it a file with a list of gVCFs.
Previously with CombineGVCF, the -V parameter would take a filename (-V mygvcf.list). GenomicsDBImport does not use this convention for the -V arguement.
From pallevillesen on 2018-01-19
> @jfarrell said:
> Is there a way to specify a file containing a list of the gVCFs to merge? Repeating -V for each vcf makes for a long command-line. I would just like to pass it a file with a list of gVCFs.
>
> Previously with CombineGVCF, the -V parameter would take a filename (-V mygvcf.list). GenomicsDBImport does not use this convention for the -V arguement.
—sample-name-map FILE.list
must be sample TAB vcf_file_name
From jfarrell on 2018-01-19
Thanks! Just what I was looking for.
From alphahmed on 2018-01-20
Hi,
Congratulations for the release of this great version of GATK.
It is still not clear to me on how to call all GVCFs without specifying a specific interval. I need to run GenotypeGVCFs which requires the GenomicsDBImport output, but when I enter the intervals for the whole genome sequencing (WGS_intervals), I get a message that only one interval is accepted.
Thanks!
From Sheila on 2018-01-22
@alphahmed
Hi,
Yes, GenomicsDBImport does not accept more than one interval right now. I think there are plans to change that, but I need to check with the developers on the timeline. Keep an eye on [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11231/are-there-any-plans-to-add-multi-interval-support-to-genomicsdbimport) for updates.
-Sheila
From alphahmed on 2018-01-23
Thank you Sheila…
For the time being… Can you suggest any other way of Genotyping multiple WGS gVCFs using GATK4?
From Sheila on 2018-01-23
@eykim909
Hi,
Right now, GenomicsDBImport only accepts 1 interval at a time. There are plans to change that, and you can keep track of the issue [here](https://github.com/broadinstitute/gatk/issues/3269).
-Sheila
From Sheila on 2018-01-24
@alphahmed
Hi,
Let me confirm with the team and get back to you.
-Sheila
From wxhyih on 2018-02-09
>
Sheila said: >
jjmmii
> Hi,
>
> I just heard back from the developers. If you want to merge disjoint shards after a split operation you should use GatherVCFs; if you want to to merge VCFs that can overlap each other you should use MergeVariants.
>
> -Sheila
Hi,sheila I am a bit confused. I process the data for one sample according to Germline-SNPs-Indels-GATK4-hg38 (https://portal.firecloud.org/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38/monitor ); In the WDL, the HaplotypeCaller step is to process sampletest.gh38.bam with 50 intervals files generated by the ScatterIntervalsByNs tool, and then get a sampletest.hg38.g.vcf.gz using MergeVcfs.
But then (also JointGenotyping step), I can not do a GenomicsDBImport on sampletest.hg38.g.vcf.gz with 50 intervals files generated by the ScatterIntervalsByNs tool, just as @jjmmii does.Can you tell me what to do next?
BTW, in the Germline-SNPs-Indels-GATK4-hg38, JointGenotyping step use a diff sample from HaplotypeCallerGvcf_GATK4, and the —intervals is diff too. Thanks in advance. ;) ;)
From Sheila on 2018-02-14
@wxhyih
Hi,
I am a little confused about what you are asking. Are you asking why you cannot run GenomicsDBImport on more than 1 interval?
Thanks,
Sheila
From wxhyih on 2018-02-14
>
Sheila said: >
wxhyih
> Hi,
>
> I am a little confused about what you are asking. Are you asking why you cannot run GenomicsDBImport on more than 1 interval?
>
> Thanks,
> Sheila
Hi
I am sorry for the unclear description.
In JointGenotyping step(https://console.cloud.google.com/storage/browser/fc-347d972a-6a14-4199-8061-56f3940ee0da/23a01750-76de-4df2-9a66-ef6c881945ff/JointGenotyping/13864639-6cb9-4e9b-ad00-e4abb870ae8e/?pli=1), the command GenomicsDBImport step for one sample is like this :
_/gatk/gatk-launch —javaOptions “-Xmx4g -Xms4g” \
GenomicsDBImport \
—genomicsDBWorkspace genomicsdb \
—batchSize 50 \
-L chr19:1-58617616 \
—sampleNameMap inputs.list \
—readerThreads 5 \
-ip 500_
but, in HaplotypeCaller step (https://accounts.google.com/AccountChooser?continue=https://console.cloud.google.com/storage/browser/fc-347d972a-6a14-4199-8061-56f3940ee0da/237359f1-0a97-4325-9b20-230908313f9c/HaplotypeCallerGvcf_GATK4/817d8cb8-368c-425d-93f7-4141406e322d/), the command HaplotypeCaller step for one sample is like this :
_cd /cromwell_root
/gatk/gatk-launch —javaOptions -Xms8000m \ HaplotypeCaller \ -R /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \ -I /cromwell_root/gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam \ -O NA12878_24RG_small.hg38.g.vcf.gz \ -L /cromwell_root/genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/temp_0019_of_50/scattered.interval_list \ -ip 100 \ -contamination 0 \ —max_alternate_alleles 3 \ -ERC GVCF _
As you can see, the values of -L parameters are set differently.In HaplotypeCaller, the scattered.interval_list have more than 1 interval and it can not used by GenomicsDBImport for now.So, I would like to ask you how to do the GenomicsDBImport step after I finished the HaplotypeCaller step with the scattered.interval_list.That is, how to combine the haplotypecaller-gvcf-gatk4 and joint-discovery-gatk4 steps (like this: https://portal.firecloud.org/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38/monitor) together?
Thanks && Happy Chinese New Year :)
wxhyih.
From bwray on 2018-02-16
Hey GATK team,
I’m running GenomicsDBImport in GATK 4.0.1.0 and I’m experiencing the same error presented at the top of the forums here, namely that all of the resulting vcf file contains a header and nothing else. I’ve looked in my post-haplotypecaller vcf files and confirmed that there are indeed variants being called within the region I’m specifying. Here is the command I’m running:
`gatk GenomicsDBImport
-V sample1.g.vcf.gz
-V sample2.g.vcf.gz
—genomicsdb-workspace-path $gatkProjectDir/genomicsdb_workspace/interval1
—L chr1:1735000-1832000`
I have tried running this with just a single sample and end up with the same result.
I should also note that I then tried running CombineGVCFs within GATK 4.0.1.0 by using the following command:
`gatk —java-options “-Xmx80G” CombineGVCFs -R bundle/hg19.fa —variant sample1.g.vcf.gz —variant sample2.g.vcf.gz -O samples1_2_combined.g.vcf.gz`
and I end up with the following error:
> htsjdk.samtools.util.RuntimeIOException: sample1.g.vcf.gz has invalid uncompressedLength: -1290556380
My VCF files are huge, like 60gb after uncompressing. I don’t get this error from GenomicsDBImport, for what that’s worth.
Thanks!
From Sheila on 2018-02-20
@wxhyih
Hi wxhyih,
Ah, I see. Thanks for the clarification. So, HaplotypeCaller can run on multiple intervals which is what is specified in that -L. GenomicsDBImport can only take one interval at a time, so the -L has a small interval (probably the first in the interval list provided to HaplotypeCaller). Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/10039/running-genotypegvcfs-with-4000-human-exome-data-stuck-on-progressmeter-starting). The team recommends running on ~1 interval per sample in your cohort so GenomicsDBImport runs faster.
I hope this helps.
You may also be interested in [the gatk-workflows](https://github.com/gatk-workflows/gatk4-germline-snps-indels) github repository.
-Sheila
From Sheila on 2018-02-21
@bwray
Hi,
That is odd. Can you try with the very latest version and let us know if that helps? I may need you to submit a bug report.
-Sheila
From bwray on 2018-02-22
@Sheila
Thanks for responding. I’ll give the very latest GATK version a try.
I ran ValidateVariants on my original huge post-HaplotypeCaller VCF files, and I had one VCF file fail (with the same “invalid uncompressedLength” error), but the other one was successfully validated. I’m not sure what the point of running a single VCF file through GenomicsDBImport would be, but I tried it with the one valid VCF file and again generated a set of VCF files with only headers.
For now, what I’ve done is rerun HaplotypeCaller, this time passing my list of target intervals (which I hadn’t passed the first time I ran HapltypeCaller) to create much smaller GVCF files to pass downstream. I tried running GenomicsDBImport / GenotypeGVCFs on these new smaller GVCF files but ended up with the same set of header-only vcf files.
I then tried GATK 4.0.1.0’s CombineGVCFs on the smaller post-HaplotypeCaller VCF files, and managed to get a combined VCF file populated with variants that was then successfully processed by GenotypeGVCF.
I’m now wading through VQSR docs trying to compose my next commands. I’ll try the very latest GATK once I finish processing these samples.
Thanks again!
edit – ValidateVariants, not ValidateVCF
From Sheila on 2018-02-22
@bwray
Hi,
To confirm, CombineGVCFs worked on both of your sample GVCFs, but GenomicsDBImport did not?
If you can confirm this GenomicsDBImport issue still happens in the latest version, I will need you to submit a bug report.
In the meantime, I am happy CombineGVCFs worked for you and you are able to move on :smile:
-Sheila
From Geraldine_VdAuwera on 2018-02-24
@alphahmed See the workflow script we use for joint calling in https://github.com/gatk-workflows/gatk4-germline-snps-indels
Alternatively, you can use CombineGVCFs as the combine step before GenotypeGVCFs, which follows the same paradigm as what we did in GATK3.
From sam0per on 2018-04-16
Hi,
After variant calling using HaplotypeCaller in GVCF mode, I am consolidating the .g.vcf.gz outputs with the tool GenomicsDBImport. I have a list of samples in the file “samples.txt” and a list of contigs to pass individually. This is the command:
`while read i; do java -Xmx34g -jar /proj/data9/samuel/modules/gatk-4.0.2.0/gatk-package-4.0.2.0-local.jar GenomicsDBImport —sample-name-map samples.txt —genomicsdb-workspace-path gVCF_database -L $i; done
Howerver, a user error occurs because I am not supposed to save each output for a single contig interval under the same workspace.
`A USER ERROR has occurred: The workspace you’re trying to create already exists. Writing into an existing workspace can cause data corruption. Please choose an output path that doesn’t already exist.`
As far as I understand, each interval requires a separate workspace and I was wondering whether I could merge the workspaces of each contig interval before joint genotyping. For example, I would like to combine gVCF_database_contig0 with gVCF_database_contig4 (I have thousands of contigs).
I assume that I could iterate joint genotyping over the thousands of workspaces but I am still quite unsure how to merge the final output.
Can somebody suggest me at which step it is more convenient to combine the multiple workspaces for each contig interval? before or after joint genotyping?
From Geraldine_VdAuwera on 2018-04-16
Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don’t combine workspaces; that is not possible. However there are a few tweaks you can apply.
1. We don’t have as many contigs as you do but we run on WGS interval lists with a lot of intervals, and to keep things reasonable we merge intervals to produce a 2.5:1 ratio of intervals to samples, which you can find implemented here: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl (which unfortunately has some additional complexity due to the inclusion of filtering steps + some cloud-specific optimizations).
2. Because we’re running on cloud, each separate job is run on a different machine, so for us it doesn’t matter that the workspace name is the same in all cases. In your case however I would recommend making the workspace name/path a variable that includes eg an interval identifier, to avoid collisions.
Note that I’m going to close this thread because the article is out of date and has been replaced by https://gatkforums.broadinstitute.org/gatk/discussion/11813/how-to-consolidate-gvcfs-for-joint-calling-with-genotypegvcfs; if you still have questions please feel free to post them on that article or just create a new discussion thread.