created by GATK_Team
on 2017-12-28
This document details the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.
This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.
Note that some of the command line argument names in this article may not be up to date. If you encounter any problems, please let us know in the comments so we can fix them.
Contents
To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.
Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.
In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.
If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger
, and the site is overlapped by an allele in the VCF file provided through the -alleles
argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.
This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L
argument).
The final profile is calculated by smoothing this initial raw profile following three steps. The first two steps consist in spreading individual position raw profile values to contiguous bases. As a result each position will have more than one raw profile value that are added up in the third and last step to obtain a final unique and smoothed value per position.
-bandPassSigma
argument (current default is 17 bp). The larger the sigma, the broader the spread will be.The resulting profile line is cut in regions where it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that those regions that are to be considered active, with a profile running over that threshold, fall within the minimum (fixed to 50bp) and maximum region size (customizable using -activeRegionMaxSize
).
Of the resulting regions, those with a profile that runs over this threshold are considered active regions and progress to variant discovery and or calling whereas regions whose profile runs under the threshold are considered inactive regions and are discarded except if we are running HC in reference confidence mode.
There is a final post-processing step to clean up and trim the ActiveRegion:
-mbq
, 10 by default).-dontUseSoftClippedBases
), if the read and its mate map to the same chromosome, and if they are in the correct standard orientation (i.e. LR and RL).-dt
, -dfrac
, -dcov
etc.) and cannot be overriden from the command line.Updated on 2019-07-14
From davidguttermans on 2018-10-12
This is interesting I would like to read more about this.