180. Release notes for GATK version 3.2

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by ebanks

on 2014-07-15

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

Haplotype Caller

Various improvements were made to the assembly engine and likelihood calculation, which lead to more accurate genotype likelihoods (and hence better genotypes).
Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
Fixed small bug where indel discovery was inconsistent based on the active-region size.
Removed scary warning messages for "VectorPairHMM".
Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
When we subset PLs because alleles are removed during genotyping, we now also subset the AD.
Fixed bug where reference sample depth was dropped in the DP annotation.
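
The "log of exponentials" function mentioned above is not shown in these notes. As a rough illustration of the general technique only (not GATK's actual code), the sketch below adds up probabilities that are stored as log10 values by factoring out the largest term first, so that no exponent underflows. The function name log10_sum_log10 and the choice of base 10 are assumptions made for this example.

```python
import math


def log10_sum_log10(log10_values):
    """Return log10(sum(10**v for v in log10_values)) without underflow.

    Hypothetical helper for illustration; not taken from the GATK codebase.
    """
    if not log10_values:
        return float("-inf")  # log10 of an empty sum (zero)
    max_value = max(log10_values)
    if max_value == float("-inf"):
        return float("-inf")  # every term is zero in probability space
    # Factor out the largest term so that every remaining exponent is <= 0.
    total = sum(10.0 ** (v - max_value) for v in log10_values)
    return max_value + math.log10(total)
```

Evaluating 10 ** -1000 directly underflows to zero in double precision, whereas log10_sum_log10([-1000.0, -1000.0]) stays finite because only differences from the maximum are exponentiated.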

Variant Recalibrator

The -mode argument is now required.
The plotting script now uses theme() instead of the deprecated opts() function, so that it works with recent versions of the ggplot2 R library.

AnalyzeCovariates

The plotting script now uses theme() instead of the deprecated opts() function, so that it works with recent versions of the ggplot2 R library.

Variant Annotator

SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

Added missing arguments so that it now more closely mirrors what's available in the Haplotype Caller.
Fixed recurring error about missing PLs.
No longer pulls the headers from all input rods (including dbSNP); headers now come only from the input variants.
--includeNonVariantSites should now be working.

Select Variants

The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

Now throws a user error when it encounters reads with I operators greater than the number of read bases.
Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.
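
To make the two CIGAR cases above concrete, here is a minimal sketch of the kind of check involved. It is not the Indel Realigner's code; the helper names (parse_cigar, check_insertions, is_all_insertion) are hypothetical and exist only for this example.

```python
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '10M5I35M' into (length, operator) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def check_insertions(cigar, read_length):
    """Return the number of inserted bases, or raise if it exceeds the read length."""
    inserted = sum(length for length, op in parse_cigar(cigar) if op == "I")
    if inserted > read_length:
        raise ValueError(
            f"CIGAR {cigar!r} inserts {inserted} bases but the read has {read_length}"
        )
    return inserted


def is_all_insertion(cigar):
    """True when every element of the CIGAR is an insertion, e.g. '50I'."""
    return {op for _, op in parse_cigar(cigar)} == {"I"}
```

A read with CIGAR 50I and 50 bases passes check_insertions and is flagged by is_all_insertion, while 60I on a 50-base read raises an error.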

CalculateGenotypePosteriors

Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
Now computes genotype posteriors using likelihoods from all members of the trio.
Added annotations for calling potential de novo mutations.
Now uses PP tag instead of GP tag because posteriors are Phred-scaled.
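
For readers unfamiliar with Phred scaling, the sketch below converts a list of genotype posterior probabilities into Phred-scaled integers. It is an illustration only: the function name is hypothetical, and shifting the values so the most likely genotype scores 0 (as is done for PLs) is an assumption here, not something stated in these notes.

```python
import math


def phred_scaled_posteriors(posteriors):
    """Convert posterior probabilities to Phred-scaled integers.

    Hypothetical helper; the PL-style shift so the best genotype is 0
    is an assumption made for this example.
    """
    if any(p <= 0.0 for p in posteriors):
        raise ValueError("posteriors must be strictly positive")
    total = sum(posteriors)
    # Phred scale: Q = -10 * log10(probability).
    raw = [-10.0 * math.log10(p / total) for p in posteriors]
    best = min(raw)
    # Shift so the most likely genotype scores 0, then round to integers.
    return [round(q - best) for q in raw]
```

In this scheme a larger value means a less likely genotype, which is the opposite of reading a raw probability.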

Cat Variants

Can now process .list files with -V.
Can now handle BCF and Block-Compressed VCF files.

Validate Variants

Now works with gVCF files.
By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

Use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

Miscellaneous

Refactored maven directories and java packages replacing "sting" with "gatk".
Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
Added a new read transformer that refactors NDN cigar elements to one N element.
Now a Tabix index is created for block-compressed output formats.
Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
We now check that output files are specified in a writeable location.
We now allow blank lines in a (non-BAM) list file.
Added legibility improvements to the Progress Meter.
Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
Made IntervalSharder respect the IntervalMergingRule specified on the command line.
Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
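
As an illustration of the NDN refactoring listed above, the sketch below collapses each N-D-N run in a CIGAR string into a single N element whose length is the sum of the three. It is not the read transformer's actual code, and the function names are hypothetical. Both N and D consume reference bases but no read bases, so the reference span of the alignment is unchanged.

```python
import re

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def merge_ndn(cigar):
    """Replace every N-D-N run with one N element of the combined length."""
    elements = parse_cigar(cigar)
    i = 0
    while i + 2 < len(elements):
        (n1, op1), (n2, op2), (n3, op3) = elements[i:i + 3]
        if (op1, op2, op3) == ("N", "D", "N"):
            # Merge in place and re-check from the same position,
            # so chains such as N-D-N-D-N collapse fully.
            elements[i:i + 3] = [(n1 + n2 + n3, "N")]
        else:
            i += 1
    return "".join(f"{length}{op}" for length, op in elements)
```

Re-checking from the same index after a merge is a design choice for this sketch; it lets a chain like 1N1D1N1D1N reduce to a single 5N in one pass.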

Updated on 2014-10-23

From miked on 2014-07-15

Can I process a gVCF generated by HC v3.1 downstream with CombineGVCFs and GenotypeGVCFs v3.2?

Does this change introduce a backwards incompatibility:

“Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.”

I’m interested in using v3.2 CombineGVCFs and CatVariants because a bug has been fixed, so they now support gzipped VCFs as input and output, as previously reported here: http://gatkforums.broadinstitute.org/discussion/3904/incremental-joint-variant-discovery-and-number-of-samples

Thanks for the help.

From Geraldine_VdAuwera on 2014-07-15

The two versions of HaplotypeCaller are technically compatible, so running 3.1 output gVCFs through 3.2 should work, but it comes with a big caveat: if you do this for a dataset generated with 3.1, then add new samples called using 3.2 to your cohort, you may end up with batch effects. While the difference between 3.0 and 3.1 was minimal, there is substantially more difference between 3.1 and 3.2. Results coming out of 3.2 will be better and have qualitatively different information (e.g. the post-reassembly AD and DP values as you mention), which is undesirable for project consistency. So we do recommend sticking with one version for a given project. But if you have your entire cohort and just want to run it as a “one and done” analysis, that should be okay. Just don’t mix and match GVCFs from different versions.

From erikt on 2014-07-18

Hi! First comment, so I want to thank all of you at Broad for all your work on this incredible tool, for sharing it with the greater community, and for supporting it here! My question is also about compatibility, but going back a step. I just finished setting up and running the 3.1 pipeline on some WES data. As the v3.2 HC is said to have significant improvements I would like to rerun with this version, but I wonder if it is necessary/advantageous to rerun the pipe starting from the realignment step, or can I start from my final merged bams?

Thanks,

Erik

From KinMok on 2014-07-21

Yes, I have a similar question to Erik’s. Do I need to re-run indel realignment and base quality recalibration for the samples that were processed with v3.1? Or can I simply rerun with version 3.2, starting from HaplotypeCaller emitting gVCFs?

Thanks

Kin

From Geraldine_VdAuwera on 2014-07-21

erikt & KinMok

Glad you like the tools, Erik :)

No, it is not necessary to redo the data processing (realignment & BQSR) on data that was previously processed using versions 2.8 or later. You can just rerun from the HaplotypeCaller step.

From chenyu600 on 2014-07-23

Hi @Geraldine_VdAuwera, should I use the corresponding bundle dataset when I update GATK? And where can I get the bundle dataset for v3.2? I can’t find bundle 3.2 on the FTP site: ftp.broadinstitute.org/gsapubftp-anonymous/bundle

From Geraldine_VdAuwera on 2014-07-23

Hi @chenyu600,

We don’t issue a new bundle for every version. Since nothing has changed in the resources files since 2.8, you can use that version for working with GATK 3.2.

From blueskypy on 2014-08-05

My project includes 700 WES samples that are divided into different sequencing batches. Since those batches arrived at different times over the past year and a half, they were processed using different versions of BWA and GATK. Now I want to re-run the whole cohort of 700 samples from HC using GATK v3.2, and I have two questions:

1. What would be the memory and time estimate for running CombineGVCFs on all 700 samples in one step? Would it be better to run it in two steps, e.g. 350+350?

2. Will the BAM files from different versions of BWA and GATK produce a batch effect?

From Geraldine_VdAuwera on 2014-08-06

@blueskypy,

I can’t give you an estimate, sorry. Considering you’ve done quite a bit of testing on CombineGVCFs you probably know more than me about that at this point! I would expect that it’s more efficient to produce several smaller combined subsets than one huge one.

For batch effects, see my earlier response to a similar question. We do recommend using the same version for everything, but bam files processed with 2.8 don’t need to be redone.

From knho on 2014-08-07

Hi Geraldine,

When using HaplotypeCaller in GATK version 3.2, Queue is recommended for parallel computing instead of multithreading because of reported issues, according to the GATK documentation. Did those issues cause any problems with the output? If I use multithreading (-nct) to parallelize HaplotypeCaller, can the output be wrong?

From Geraldine_VdAuwera on 2014-08-07

Hi @knho,

No, in cases of multithreading related issues, the program may fail to complete the run, but we are not aware of any incorrect results being output when a run completes successfully.

From knho on 2014-08-07

Thanks, Geraldine. In my experience with a dataset of ~800 WGS samples, several HaplotypeCaller jobs failed with the error message “error=‘Cannot allocate memory’”. Is this one of the multithreading-related issues?

From Geraldine_VdAuwera on 2014-08-07

@knho That sounds like a generic java memory error. It can happen in relation to multithreading.
