created by Geraldine_VdAuwera
on 2015-05-22
Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.
None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.
Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (--variant_index_type LINEAR --variant_index_parameter 128000
) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension .g.vcf
so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of -dcov
with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).
Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument --breakBandsAtMultiplesOf N
will ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use -L
intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.
The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic <*:DEL>
allele representing the deletion.
The new annotation is called RGQ
for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the -allSites
argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.
We’ve added two new documentation resources to the Guide.
One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.
The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!
Updated on 2015-07-09
From KlausNZ on 2015-05-22
Hi GATK team,
Congratulations on the new version! Very interesting to read why and how you’re making certain decisions about the tools (no disagreement).
We wonder what kind of universal complaint, groundswell, bribe, begging, or promise to bake a nice cake would be required to generate an option to output HC’s full realignments or haplotypes (i.e write a new genome-wide bam file)? The main reason given against this option was effect on runtime – but that was in era of joint calling when HC runtimes were much longer than now. Also, if we users select it in full awareness of the consequences, we can’t really complain……
This would be a great improvement to our workflow (we really want to inspect variants in the alignment!). We wouldn’t ask for it if we could find a workaround (considered scripting separate re-run for each variant with the —bamout option but that won’t do it because it doesn’t write the full reads).
Meanwhile, let’s enjoy 3.4 – and many thanks once more for a great set of tools!
From Geraldine_VdAuwera on 2015-05-22
Um, you mean like -outBam perhaps? That’s been around for a while :)
From Sheila on 2015-05-22
@KlausNZ
Hi,
I think you are asking for the entire region to be output in the bamout file, instead of the active regions only? You can do this by specifying the -bamout argument along with -disableOptimizations, -forceActive, and -dontTrimeActiveRegions. I hope this helps!
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—disableOptimizations
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—dontTrimActiveRegions
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—forceActive
If you provide the entire region you are interested in to -forceActive, you will get the entire realigned bam file.
-Sheila
From KlausNZ on 2015-05-23
Hi,
Thanks for the quick reply, Geraldine and Sheila. I’ve been using -bamout for a while (and love it!). I did indeed suggest an option to write a new bam for the entire genome (or exome, or any large set of intervals).
Sheila, your last statement suggests that this can be achieved by supplying an intervals list (or even an entire genome?) to forceActive (in combination with -disableOptimizations, and -dontTrimActiveRegions)?
I have done some limited experimentation with a WES alignment for single chromosome, and the combination of `-bamout disableOptimizations -dontTrimActiveRegions -forceActive` does indeed provide a much more useful output than `-bamout` by itself. Runtimes were around 16 times longer but no worries about that (I had been warned ;)
But oddly, I had no luck passing any value to `-forceActive`. I have tried passing the same .list file that I passed to -L, `-forceActive 2:1-50000000`, and `-forceActive 2`. but always get error ‘invalid argument value’. Does the argument to `-forceActive` need to be in unusual format, or is maybe incompatible if intervals are applied with `-L`?
(For clarity, no error message if I don’t provide a value to -forceActive). All tested with version 3.3-0.
Thanks for all your help!
K
From Geraldine_VdAuwera on 2015-05-23
Oh I see, my mistake. -forceActive is a flag so it won’t take a value; I believe the intervals should jut be those passed with -L. But Sheila knows those args better than I do and may jump in with more details.
From frankfeng on 2015-05-25
Hi GATK team,
With HaplotypeCaller, can I name the output file with extension `.g.vcf.gz` to create compressed file? Can CombineGVCFs and GenotypeGVCFs handle `.g.vcf.gz` files appropriately?
Thank you very much!
Frank
From TechnicalVault on 2015-05-26
Hi,
The release notes did promise some details in the Highlights on CRAM support… I understand it’s 2.1 (2.0 -> 2.1 is basically EOF marker support) without index support. It would be helpful to know if any options have been added to control the details of CRAM output (binning off or on, embedded references for bacterial genomes etc).
Martin
P.S. Might be good to propose the .g.vcf to the GA4GH file formats group. Also I hope the .gz bit is supported as we’ve been leaning rather heavily on our users to compress intermediate files to address disk space issues.
From Sheila on 2015-05-26
@KlausNZ
Hi K,
Yes, the -forceActive is a flag, but when you pass in the intervals to -L, those regions will be tagged as active.
-Sheila
From Geraldine_VdAuwera on 2015-05-26
@TechnicalVault You’re right, it seems I forgot to add a note on CRAM, sorry about that — will do. In a nutshell, your understanding is correct — it’s basic read/write support without any indexing. No options have been added beyond that. I believe the engineers are aiming to provide “real” CRAM support in the next major version of GATK (4.0) because up to 3.x there are many limitations in the GATK engine that complicate adoption of CRAM, whereas in the GATK 4.x engine (which is basically a complete rewrite), those limitations are all gone. I can’t give you a timeline for GATK 4 but my guess is it will be at least 3 to 6 months before we can talk about a release. It’s a major undertaking.
I’ll discuss the *.g.vcf question with our GA4GH delegates. And yes, the .gz bit remains supported.
From Geraldine_VdAuwera on 2015-05-26
@frankfeng Yes you can use `.g.vcf.gz` or just `.vcf.gz` to compress the files automatically. When using `.gz` compression, a different sort of index is used (`.tbi` instead of `.idx`) so the variant indexing is done differently anyway.
From vifehe on 2015-05-27
Hi,
I am reporting an issue I have detected with the new "symbolic DEL allele" After running GenotypeGVCFs on my set of samples I wanted to Combine my dataset with another dataset also created with the new GATK-v3.4. Here is my command:
java -Xmx10g -jar $GATK -T CombineVariants -R $REF --variant joint1-VCF.vcf --variant joint2-VCF.vcf -o Combined-VCF.vcf -genotypeMergeOptions UNIQUIFY
And here's the ERROR message I receive:
ERROR MESSAGE: Allele in genotype <:DEL> not in the variant context [CA, AA, C]
So, is there a way to make CombineVariants understand this "allele" (or presumably any other tool that will have to use genotypedVCFs created with GATK-3.4) ?
Thanks in advance
Victoria
From Geraldine_VdAuwera on 2015-05-27
Hi @vihefe, it seems that this is bug, sorry. Would you be able to share a file snippet that we can use for debugging? We’d like to issue a patch for this asap. Instructions for submitting a snippet are here: https://www.broadinstitute.org/gatk/guide/article?id=1894
From vifehe on 2015-05-28
Hi @Geraldine_VdAuwera
so I’ve uploaded a compressed folder named: vifehe-GATK-ERROR-report which includes a text file with the command used and the output received and a subset of one of the files used. The other file has been processed exactly equal but has different individuals.
let me know if you need a subset of the other file too
Thanks
From Geraldine_VdAuwera on 2015-05-29
Thanks @vihefe, we’ll try to fix this asap.
From Fer on 2015-06-01
Dear GATK team, I would like to report that that the parameters --variant_index_type LINEAR --variant_index_parameter 128000
seem to still be required when the output extension is *.g.vcf.gz
, while when it is simply *.g.vcf
everything is fine.
Version: 3.4 Command: java -Xmx28g -jar GenomeAnalysisTK.jar -R $ref -T HaplotypeCaller -ERC GVCF -I $acc.sort.mkdup.realigned.recal.bam -o $acc.TEST.g.vcf.gz -nct 8
##### ERROR MESSAGE: GVCF output requires a specific indexing strategy. Please re-run including the arguments -variant_index_type LINEAR -variant_index_parameter 128000.
From Geraldine_VdAuwera on 2015-06-01
Hey @vifehe, we just realized you didn’t provide any input files in your bug report upload. Can you please upload an input file that reproduces the issue? Without that we can’t fix the issue. Thanks.
From Geraldine_VdAuwera on 2015-06-01
@Fer Sorry about that, it seems we omitted to explicitly test that case. We’ll fix it in the upcoming patch.
From vifehe on 2015-06-03
HI @Geraldine_VdAuwera
> we just realized you didn’t provide any input files in your bug report upload>
this is weird. In my local copy I have a vcf file .. .this is not the original but a subset of it, and I only provided you one becasue the other presents the same error but has different samples.
I have uploaded it again, please let me know whether you still dont find it
Thxs
From Sheila on 2015-06-03
@vifehe
Hi,
I just tried testing on the vcf in your uploaded file called vifehe-20150603_ERROR-report.zip. Again, this zip file contained 1 vcf and 1 command.txt file. Is the vcf the output of GenotypeGVCFs? When I run CombineVariants on it, I do not get an error. If you only submitted one vcf from GenotypeGVCFs, and you were running CombineVariants on two vcfs from GenotypeGVCFs, I will need the second vcf to reproduce the error.
Thanks,
Sheila
From vifehe on 2015-06-04
@Sheila
Hi, so this is what I did. I generated two different VCFs (containing different samples) with GenotypegVCF. These genotype gVCFs were coded with the new :*DEL allele codification. Then, I proceed to Combien these two VCFs with CombineVariants and it is when I got the error.
In parallel, I have been running few tests on the subset that I provided you (just one chromosome and few samples) per each dataset and now CombineVariants works fine (I’ve tried with chr 1 and chr22). But if I run it with the whole dataset (about 1200 samples on one dataset and 60 samples on the other dataset) is when it reports the error.
Given the new results … I am no longer sure what dataset to upload for you to test the problem, since my whole dataset is about 32G …
From Geraldine_VdAuwera on 2015-06-04
@vifehe I think what we need is a slice of both of the two VCFs you are trying to combine, with all samples, but only a few positions around the position that has the `*DEL` allele. That should amount to a reasonable size. But it’s definitely important that you check before uploading that the data you send us can be used to reproduce the problem. Otherwise it is not useful to us.
Thank you.
From vifehe on 2015-06-04
@Geraldine_VdAuwera
so I’ve managed to reproduce the ERROR with whole chr1 subset … I wasn’t able to point at the actual position with DEL allele that was causing the ERROR because the log file doesn’t specify it, and there are SEVERAL positions with DEL alleles.
So I’ve uploaded a new compressed file (20150604-vifehe-CombineGVCFs-ERROR) with the two datasets, and the incomplete output plus a text file for the commands and with the outputs.
I hope you can reproduce it too and we can figure out the source of the ERROR.
Thanks
From Sheila on 2015-06-04
@vifehe
Hi,
Thanks, I just submitted a bug report. I will keep you updated on the status.
-Sheila
From nvteja on 2015-06-18
Hi @Sheila
Is the bug reported by @vifehe patched? If I understand correctly , that bug affects the cases where we try to CombineVariants from two different vcf files with deletions right?
Thanks,
Teja.
From Geraldine_VdAuwera on 2015-06-18
The fix is in the latest nightly build. Note that we’ve changed the representation to comply with the VCF specification. This will all be detailed in the version notes for the upcoming 3.4-1 patch release.
From nvteja on 2015-06-23
Hello @Geraldine_VdAuwera,
Do we have to regenerate the “background” GVCF files that we use for jointGenotyping again with the 3.4 version of GATK, or it is okay to use the GVCF files generated from 3.3 for use with GVCFs generated with 3.4? I did test some samples combining the 3.3 generated “background” GVCF files along with a 3.4 generated GVCF file for jointGenotyping and it worked fine, but I just want to make sure that it is okay.
Thanks,
Teja.
From Geraldine_VdAuwera on 2015-06-26
Hi Teja,
Generally we don’t recommend changing versions within the same project. Even if at all works, you risk getting batch effects. The official recommendation is that should either reprocess existing samples with the new version, or stick with the previous version.
From frankfeng on 2015-07-05
Hi GATK team,
Do you have any idea about when approximately 3.4-1 patch will be released? Thanks!
From Geraldine_VdAuwera on 2015-07-06
@frankfeng Definitely this week. I’m the squeaky wheel — was in South Africa for workshop + vacation the past two weeks. Once I’m caught up with the absurd amount of email in my inbox I’ll make the release happen. Sorry for the delay.
From bkelly on 2015-08-19
Can you explain the version numbering scheme? Does 3.4-46 include the bug fixes promised for 3.4-1? It seems like it went from 3.4 to 3.4-46 with nothing in between (I’ll admit I don’t check the download page daily). Are there any patch release notes? Thanks.
From Geraldine_VdAuwera on 2015-08-19
Yes, 3.4-46 is what we originally promised as 3.4-1. If I recall correctly the blog post announcing 3.4-46 explained the numbering — in a nutshell, the number after the dashes refers to the number of git commit (code changes) since the last release.
From bkelly on 2015-08-19
Perfect. Thanks, Geraldine.
From SteveL on 2015-10-21
> @Geraldine_VdAuwera said:
> Hi Teja,
>
> Generally we don’t recommend changing versions within the same project. Even if at all works, you risk getting batch effects. The official recommendation is that should either reprocess existing samples with the new version, or stick with the previous version.
Hi @Geraldine_VdAuwera – I just want to confirm what you express here – do you mean that if I have 1000 gVCF generated using 3.3x, I should not use Combine/GenotypeGVCF from 3.4x? Since I understand you have solved the issue with variants overlapping deletions in different samples in the latest version, I was hoping I could use it.
I can understand not mixing gVCF generated by 3.3 and 3.4 together in joint-genotyping, but I would have thought if all gVCF were generated with the same tool version, then the joint-gentoyping should function OK?
Thanks, Steve
From Geraldine_VdAuwera on 2015-10-21
Hi @SteveL,
Based on what has (and hasn’t) changed between 3.3 and 3.4 I think that should work fine.
From SteveL on 2015-10-21
Thanks for confirming.