created by ebanks
on 2014-07-15
GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.
theme instead of opt functions to work with recent versions of the ggplot2 R library.
Updated on 2014-10-23
From miked on 2014-07-15
Can I process a gVCF generated by HC v3.1 downstream with CombineGVCFs and GenotypeGVCFs v3.2?
Does this change cause a backwards incompatibility:
“Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.”
I’m interested in using v3.2 CombineGVCFs and CatVariants because a bug has been fixed, allowing them to support gzipped VCFs as input and output, as previously reported here: http://gatkforums.broadinstitute.org/discussion/3904/incremental-joint-variant-discovery-and-number-of-samples
Thanks for the help.
From Geraldine_VdAuwera on 2014-07-15
The two versions of HaplotypeCaller are technically compatible, so running 3.1 output gVCFs through 3.2 should work, but it comes with a big caveat: if you do this for a dataset generated with 3.1, then add new samples called using 3.2 to your cohort, you may end up with batch effects. While the difference between 3.0 and 3.1 was minimal, there is substantially more difference between 3.1 and 3.2. Results coming out of 3.2 will be better and have qualitatively different information (e.g. the post-reassembly AD and DP values as you mention), which is undesirable for project consistency. So we do recommend sticking with one version for a given project. But if you have your entire cohort and just want to run it as a “one and done” analysis, that should be okay. Just don’t mix and match GVCFs from different versions.
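For anyone who wants to check a cohort for mixed versions before combining, here is a minimal Python sketch. It assumes each gVCF header carries a line of the form `##GATKCommandLine=<ID=HaplotypeCaller,Version=...>`; that header shape, the version strings in the comments, and the function names are assumptions for illustration, so check them against your own files.

```python
import re
from collections import defaultdict

# Assumed header shape (verify against your own gVCFs):
#   ##GATKCommandLine=<ID=HaplotypeCaller,Version=3.2-2-gec30cee,...>
_HC_LINE = re.compile(r"^##GATKCommandLine=<ID=HaplotypeCaller,Version=([^,>]+)")


def hc_version(header_lines):
    """Return the HaplotypeCaller version string found in a gVCF header, or None."""
    for line in header_lines:
        if not line.startswith("##"):
            break  # reached the #CHROM line or data; metadata is over
        match = _HC_LINE.match(line)
        if match:
            return match.group(1)
    return None


def major_minor(version):
    """Reduce a full version such as '3.2-2-gec30cee' to its release, '3.2'."""
    return ".".join(version.split("-")[0].split(".")[:2])


def group_by_release(headers):
    """Group sample names by HaplotypeCaller release.

    `headers` maps a sample name to its list of header lines.
    Samples with no recognizable version line go under 'unknown'.
    """
    groups = defaultdict(list)
    for name, lines in headers.items():
        version = hc_version(lines)
        groups[major_minor(version) if version else "unknown"].append(name)
    return {release: sorted(names) for release, names in groups.items()}
```

If `group_by_release` returns more than one key, the cohort mixes releases and may be exposed to the batch effects described above.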
From erikt on 2014-07-18
Hi! First comment, so I want to thank all of you at Broad for all your work on this incredible tool, for sharing it with the greater community, and for supporting it here! My question is also about compatibility, but going back a step. I just finished setting up and running the 3.1 pipeline on some WES data. As the v3.2 HC is said to have significant improvements I would like to rerun with this version, but I wonder if it is necessary/advantageous to rerun the pipe starting from the realignment step, or can I start from my final merged bams?
Thanks,
Erik
From KinMok on 2014-07-21
Yes, I have a similar question to Erik’s. Do I need to re-run indel realignment and base quality recalibration for the samples that were processed with v3.1? Or can I simply rerun with version 3.2, starting from HaplotypeCaller emitting gVCFs?
Thanks
Kin
From Geraldine_VdAuwera on 2014-07-21
@erikt & @KinMok,
Glad you like the tools, Erik :)
No, it is not necessary to redo the data processing (realignment & BQSR) on data that was previously processed using versions 2.8 or later. You can just rerun from the HaplotypeCaller step.
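As a rough illustration of what rerunning from the HaplotypeCaller step could look like, the Python sketch below only assembles a GATK 3.x command line in gVCF mode; it does not run anything. The jar name, file paths, and the helper name are placeholders, and the options shown should be checked against the documentation for your exact version.

```python
def haplotypecaller_gvcf_cmd(bam, out_gvcf, reference="ref.fasta",
                             jar="GenomeAnalysisTK.jar", nct=None):
    """Build (but do not run) a HaplotypeCaller command that emits a gVCF.

    All paths and the jar name are placeholders for illustration.
    """
    cmd = [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,                      # already realigned and recalibrated BAM
        "--emitRefConfidence", "GVCF",  # per-sample gVCF output
        "--variant_index_type", "LINEAR",
        "--variant_index_parameter", "128000",
        "-o", out_gvcf,
    ]
    if nct is not None:
        cmd += ["-nct", str(nct)]       # optional multithreading
    return cmd


# Example: print the command for one previously processed sample.
print(" ".join(haplotypecaller_gvcf_cmd("sample1.bam", "sample1.g.vcf")))
```

The resulting list can be passed to `subprocess.run` or written into a job script, one command per sample.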
From chenyu600 on 2014-07-23
Hi @Geraldine_VdAuwera, should I use the corresponding bundle dataset when I update GATK? I also wonder where to get the v3.2 bundle dataset; I can’t find a 3.2 bundle on the FTP site: ftp.broadinstitute.org/gsapubftp-anonymous/bundle
From Geraldine_VdAuwera on 2014-07-23
Hi @chenyu600,
We don’t issue a new bundle for every version. Since nothing has changed in the resource files since 2.8, you can use that version of the bundle for working with GATK 3.2.
From blueskypy on 2014-08-05
My project includes 700 WES samples that are divided into different sequencing batches. Since those batches arrived at different times over the past year and a half, they were processed using different versions of BWA and GATK. Now I want to re-run the whole cohort of 700 samples from HC using GATK v3.2, and I have two questions:
1. What would be the memory and time estimate for running CombineGVCFs on all 700 samples in one step? Would it be better to run it in two steps, e.g. 350+350?
2. Will the bam files from different versions of BWA and GATK produce batch effects?
From Geraldine_VdAuwera on 2014-08-06
@blueskypy,
I can’t give you an estimate, sorry. Considering you’ve done quite a bit of testing on CombineGVCFs you probably know more than me about that at this point! I would expect that it’s more efficient to produce several smaller combined subsets than one huge one.
For batch effects, see my earlier response to a similar question. We do recommend using the same version for everything, but bam files processed with 2.8 don’t need to be redone.
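To make the "several smaller subsets" idea concrete, here is a small Python sketch that splits a list of per-sample gVCFs into batches and assembles one CombineGVCFs command per batch. It only builds the command lines; the batch size, output names, jar name, and reference path are placeholders chosen for illustration, not recommendations.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine_gvcfs_cmd(gvcfs, out_path, reference="ref.fasta",
                      jar="GenomeAnalysisTK.jar"):
    """Build (but do not run) a GATK 3.x CombineGVCFs command for one batch."""
    cmd = ["java", "-jar", jar, "-T", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]  # one --variant argument per input gVCF
    cmd += ["-o", out_path]
    return cmd


def plan_batches(gvcfs, batch_size):
    """Return one CombineGVCFs command per batch, with numbered outputs."""
    return [
        combine_gvcfs_cmd(batch, f"combined_{i:03d}.g.vcf")
        for i, batch in enumerate(chunk(gvcfs, batch_size), start=1)
    ]


# Example: 700 hypothetical samples in batches of 100 gives 7 commands.
samples = [f"sample_{n:03d}.g.vcf" for n in range(1, 701)]
commands = plan_batches(samples, 100)
```

The combined outputs can then be passed together to GenotypeGVCFs in place of the 700 individual files.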
From knho on 2014-08-07
Hi Geraldine,
According to the GATK documentation, Queue is recommended over multithreading for parallelizing HaplotypeCaller in GATK version 3.2 because of reported issues. Did those issues cause any problems in the output results? If I use multithreading (-nct) to parallelize HaplotypeCaller, can the output results be wrong?
From Geraldine_VdAuwera on 2014-08-07
Hi @knho,
No. In cases of multithreading-related issues, the program may fail to complete the run, but we are not aware of any incorrect results being output when a run completes successfully.
From knho on 2014-08-07
Thanks, Geraldine. In my experience with a dataset of ~800 WGS samples, several HaplotypeCaller jobs failed with the error message “error=‘Cannot allocate memory’”. Is this one of the multithreading-related issues?
From Geraldine_VdAuwera on 2014-08-07
@knho That sounds like a generic Java memory error. It can happen in relation to multithreading.
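One way to reason about this kind of failure is to make the heap request explicit with the JVM's `-Xmx` option and check that all jobs sharing a node fit in its memory. The Python sketch below is only a back-of-the-envelope budget under stated assumptions: the per-job overhead figure, the jar name, and the helper names are hypothetical, and real memory use depends on the data and the thread count.

```python
def java_prefix(heap_gb, jar="GenomeAnalysisTK.jar"):
    """Start of a java command line with an explicit maximum heap size."""
    return ["java", f"-Xmx{heap_gb}g", "-jar", jar]


def max_concurrent_jobs(node_mem_gb, heap_gb, overhead_gb=1):
    """Estimate how many JVMs fit on one node.

    Each job is assumed to need its heap plus `overhead_gb` of non-heap
    memory; the default of 1 GB is an assumption, not a measured value.
    """
    per_job = heap_gb + overhead_gb
    if per_job <= 0:
        raise ValueError("heap_gb + overhead_gb must be positive")
    return node_mem_gb // per_job


# Example: a hypothetical 64 GB node with 7 GB heaps and 1 GB overhead each.
jobs = max_concurrent_jobs(64, 7)
```

If the scheduler places more jobs on a node than this estimate allows, lowering the heap per job or the number of jobs per node is the first thing to try.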