IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For the latest documentation and forum, click here.
Created by Geraldine_VdAuwera on 2015-11-25
GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.
- MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
- ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
- GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.
Variant Context Annotations
- Added allele-specific versions of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.
- Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.
- Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.
- Added FractionInformativeReads annotation. Reports the fraction of reads that were considered informative by HaplotypeCaller (over all samples).
- Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.
- Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.
- Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.
- Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.
- VariantAnnotator can now annotate FILTER field from an external resource. Usage:
--resource:foo resource.vcf --expression foo.FILTER
- VariantAnnotator can now check allele concordance when annotating with an external resource. Usage:
--resourceAlleleConcordance
- Bug fix: The annotation framework was improved to allow for the collection of sufficient statistics during GVCF creation which are then used to compute the final annotation during the genotyping. This avoids the use of median as the representative annotation from the collection of values (one from each sample). TL;DR annotations will be more accurate when using the GVCF workflow for joint discovery.
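To illustrate the last point, here is a minimal sketch of why pooling sufficient statistics differs from taking the median of per-sample values, using RMS mapping quality as the example. It assumes each sample contributes a sum of squared mapping qualities plus a read count; the function names and read values are hypothetical, and this is not GATK's actual code.

```python
import math
from statistics import median


def per_sample_stats(mapping_qualities):
    """Return (sum of squared MQs, read count) for one sample."""
    return sum(q * q for q in mapping_qualities), len(mapping_qualities)


def cohort_rms_mq(stats):
    """Combine per-sample sufficient statistics into one cohort-level RMS MQ."""
    total_sq = sum(s for s, _ in stats)
    total_n = sum(n for _, n in stats)
    return math.sqrt(total_sq / total_n) if total_n else float("nan")


def rms(values):
    """Plain root mean square of a list of values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical mapping qualities for three samples at one site.
samples = [[60, 60, 60, 60], [60, 60], [20, 60, 60, 60, 60, 60]]

stats = [per_sample_stats(s) for s in samples]
pooled = cohort_rms_mq(stats)                        # uses every read once
median_of_samples = median(rms(s) for s in samples)  # the older style of summary
all_reads = [q for s in samples for q in s]

print(f"pooled RMS MQ:        {pooled:.3f}")
print(f"median of sample RMS: {median_of_samples:.3f}")
```

In this toy case the pooled value matches the RMS computed over all reads together, while the median of the per-sample values hides the single low-quality read entirely.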
Variant manipulation tools
- Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage:
--reference_window_stop N
where N is the desired cutoff.
- Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.
- Fixed ability to mask out SNPs with --snpmask in FastaAlternateReferenceMaker.
- Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals, and made the tool produce more informative contig names.
- Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.
- Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.
- Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.
- Added the ability to SelectVariants to enforce 4.2 version output of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown, so writing VCFs is now fast by default, compliant by choice.
- Improved VCF sequence dictionary validation. Note that as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.
- Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.
- GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.
- Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.
- Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.
- Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on --heterozygosity and --indel_heterozygosity priors.
- Fixed bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.
- Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.
- Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.
- Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.
- Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.
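The CatVariants fix above comes down to the choice of sort key. The sketch below contrasts ordering input files by first-record position alone with ordering by contig and then position. The file names, contig order, and helper function are hypothetical and only illustrate the idea; this is not the tool's implementation.

```python
def sort_key(first_record, contig_order):
    """Order by contig index in the sequence dictionary, then by position."""
    contig, pos = first_record
    return (contig_order[contig], pos)


# Hypothetical sequence dictionary order and first record of each input file.
contig_order = {"chr1": 0, "chr2": 1, "chr10": 2}
files = {
    "a.vcf": ("chr1", 5000),
    "b.vcf": ("chr2", 100),
    "c.vcf": ("chr1", 200),
    "d.vcf": ("chr10", 50),
}

# Old behavior as described: position only, chromosome ignored.
by_position_only = sorted(files, key=lambda f: files[f][1])

# Contig-aware ordering: files from the same chromosome stay together.
by_contig_then_position = sorted(files, key=lambda f: sort_key(files[f], contig_order))

print(by_position_only)
print(by_contig_then_position)
```

With position-only ordering the chr10 and chr2 files land ahead of both chr1 files, which is the kind of interleaving the bug report describes.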
Engine-level behaviors and capabilities
- Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.
- Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.
- Added static binning capability for base recalibration (BQSR).
- Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.
- Updated extension for Picard CalculateHsMetrics to include PER_TARGET_COVERAGE argument and added extension for Picard CollectWgsMetrics.
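As a rough illustration of the extension-dependent interval output described above, the following sketch picks between a Picard-style interval list and a basic GATK-style list based on the output file name. The function name, the header version string, and the strand and name placeholders are assumptions made for the example; the exact header GATK writes may differ.

```python
def format_intervals(intervals, out_name, sequence_dict):
    """Render intervals as text lines, choosing the format from the file extension.

    intervals: list of (contig, start, end) tuples, 1-based inclusive.
    sequence_dict: list of (contig, length) tuples, used only for the header.
    """
    if out_name.endswith(".interval_list"):
        # Picard-style: SAM-like header carrying the sequence dictionary,
        # then tab-separated contig, start, end, strand, name.
        lines = ["@HD\tVN:1.5"]  # assumed header line, simplified
        lines += [f"@SQ\tSN:{name}\tLN:{length}" for name, length in sequence_dict]
        lines += [f"{contig}\t{start}\t{end}\t+\t." for contig, start, end in intervals]
    else:
        # Basic GATK-style: one contig:start-end string per line, no header.
        lines = [f"{contig}:{start}-{end}" for contig, start, end in intervals]
    return lines


seq_dict = [("chr1", 248956422)]  # hypothetical single-contig dictionary
targets = [("chr1", 100, 200), ("chr1", 500, 650)]

picard_lines = format_intervals(targets, "targets.interval_list", seq_dict)
gatk_lines = format_intervals(targets, "targets.intervals", seq_dict)

print("\n".join(picard_lines))
print("\n".join(gatk_lines))
```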
Removed:
- BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
- ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
- BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
- LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl. The Picard liftover tool LiftoverVcf works better and is easier to operate.
- sortByRef.pl. Use Picard SortVcf instead.
- ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.
- Made various documentation improvements.
- Updated date and street address in license text.
- Updated htsjdk and Picard to version 1.141.
Updated on 2016-02-17
From tommycarstensen on 2015-11-25
Congratulations on giving birth to 3.5!
> @Geraldine_VdAuwera said:
> – Added BaseCountsBySample annotation.
I’ll make sure to add this as a default option to my wrapper script.
> @Geraldine_VdAuwera said:
> – Added new MQ jittering functionality to improve how VQSR handles MQ.
Is this documented and evaluated in detail somewhere? Should I just check the diff/history on GitHub to find out more?
> @Geraldine_VdAuwera said:
> – LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl. The Picard liftover tool LiftoverVCF works better and is easier to operate.
Thanks for suggesting an [alternative tool](https://broadinstitute.github.io/picard/command-line-overview.html#LiftoverVcf), which I didn’t know about for this task.
> – ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.
I would probably just look this up online myself, although I remember always having difficulty in the past finding it at the bottom of the top grey Categories box on the left side of the screen. And it’s impossible to bookmark a page with all annotations.
From Sheila on 2015-11-25
@tommycarstensen
Hi Tommy,
Yes, it will be best to check Github for more information. We are working on documenting the MQ jittering functionality.
-Sheila
From Geraldine_VdAuwera on 2015-11-26
Thanks Tommy :)
> bookmark a page with all annotations
I’ll look into options for doing this (and the other categories as well).
From jmm1 on 2015-12-16
Hi Guys,
Congratulations on the new build. I was wondering if you could provide a little more detail on the “Fixed ability to mask out snps with --snpmask in FastaAlternateReferenceMaker”. Does this address the issue of being able to annotate no-calls as Ns?
Thanks!
From Geraldine_VdAuwera on 2015-12-16
Thanks @jmm1. Yes, you should be able to use the snpmask with a VCF of the sites you want to mask out now.
From Geraldine_VdAuwera on 2016-01-04
Edited the release notes to add a note regarding RAW_MQ vs MQ.
From Geraldine_VdAuwera on 2016-02-17
Added note on improved VCF sequence dictionary validation and the unintentional side effect that causes sorting-related validation errors (+workaround pending a fix).