created by Geraldine_VdAuwera
on 2014-05-08
This document describes the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.
To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.
Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.
In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.
If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger
, and the site is overlapped by an allele in the VCF file provided through the -alleles
argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.
This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L
argument).
The final profile is calculated by smoothing this initial raw profile following three steps. The first two steps consist in spreading individual position raw profile values to contiguous bases. As a result each position will have more than one raw profile value that are added up in the third and last step to obtain a final unique and smoothed value per position.
-bandPassSigma
argument (current default is 17 bp). The larger the sigma, the broader the spread will be.The resulting profile line is cut in regions where it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that those regions that are to be considered active, with a profile running over that threshold, fall within the minimum (fixed to 50bp) and maximum region size (customizable using -activeRegionMaxSize
).
Of the resulting regions, those with a profile that runs over this threshold are considered active regions and progress to variant discovery and or calling whereas regions whose profile runs under the threshold are considered inactive regions and are discarded except if we are running HC in reference confidence mode.
There is a final post-processing step to clean up and trim the ActiveRegion:
-mbq
, 10 by default).-dontUseSoftClippedBases
), if the read and its mate map to the same chromosome, and if they are in the correct standard orientation (i.e. LR and RL).-dt
, -dfrac
, -dcov
etc.) and cannot be overriden from the command line.Updated on 2014-12-08
From gputzel on 2014-12-17
Hi Geraldine,
How is data entropy (as mentioned in the article title) involved in the process of defining ActiveRegions? I don’t see it mentioned in the article itself.
Thanks!
-Greg
From Geraldine_VdAuwera on 2014-12-18
Ah, this is just a minor inconsistency in the terminology we use. What we call entropy in the title is what we measure with the activity score in the article. Sorry for the confusion, I’ll tag this for revision.
From mscjulia on 2017-05-21
Hello GATK team,
I have a quick question about the smoothing function and defining active regions. If my data has evidence for a single SNV in a region that is otherwise variant free, will this position be included in an active region? Or do active regions generally require multiple variants to be called? Thank you very much!
From Geraldine_VdAuwera on 2017-05-21
Hi @mscjulia, a single variant is sufficient to trigger an active region.
From mscjulia on 2017-05-22
@Geraldine_VdAuwera Awesome Thanks a lot!