created by gauthier
on 2019-02-20
Authors of the blog: Laura Gauthier and David Benjamin
I like to think of Mutect2 and HaplotypeCaller as two sisters. One is a little older, but they live in the same house and share a fair amount of genetic material. One is gregarious and surrounds herself with friends. The other prefers the intimate company of a few special people. But they both have a formidable intellect, a determined work ethic, and a selfless instinct to help.
Mutect2 and HaplotypeCaller both follow the same basic recipe: determine the active region, assemble the haplotypes, evaluate per-read allele likelihoods, and calculate variant likelihoods. Because their code is fundamentally alike for the first three steps, many changes in one tool are likely to affect the other. Recently, Mutect2 development has been much more active than HaplotypeCaller development, but both tools benefit from the shared code. In his blog post (see https://software.broadinstitute.org/gatk/blog?id=23400), David Benjamin outlined the new Mutect2 features in more detail. Below you’ll see how many of them have consequences for HaplotypeCaller as well.
Previous versions did not count bases that were hard-clipped during local assembly because they fell outside the active region. As a result, a variant could appear to sit closer to the end of a read than it really did, and variants near read ends were overfiltered. That made Mutect2 less sensitive and penalized a number of germline variants as well. We’ve improved the active region logic by clarifying how the position of a variant within a read is measured, an update that benefits Mutect2 via the ReadPosition annotation and HaplotypeCaller via ReadPosRankSum. Mutect2 has already demonstrated improved sensitivity, and the current implementation should improve filtering accuracy in the affected germline cases as well.
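To make the effect concrete, here is a minimal sketch of how ignoring hard-clipped bases can make a variant look closer to a read end than it is. This is an illustration only, not the GATK code: the function name, its parameters, and the example numbers are all hypothetical.

```python
def distance_from_nearest_end(offset_in_kept_bases, kept_length,
                              left_hardclip=0, right_hardclip=0,
                              count_hardclips=True):
    """Distance in bases from a variant to the nearest end of its read.

    Hypothetical helper for illustration; not a GATK function.
    offset_in_kept_bases: 0-based offset of the variant among the bases
        that survived clipping to the active region.
    kept_length: number of bases that survived clipping.
    left_hardclip / right_hardclip: bases hard-clipped off each end.
    """
    left = offset_in_kept_bases
    right = kept_length - 1 - offset_in_kept_bases
    if count_hardclips:
        # Measure against the original read, including clipped bases.
        left += left_hardclip
        right += right_hardclip
    return min(left, right)


# A 100-base read with its first 40 bases hard-clipped (outside the active
# region); the variant is the 4th base of the 60 bases that remain.
naive = distance_from_nearest_end(3, 60, left_hardclip=40,
                                  count_hardclips=False)
corrected = distance_from_nearest_end(3, 60, left_hardclip=40,
                                      count_hardclips=True)
print(naive, corrected)
```

In this made-up case the naive measurement places the variant almost at the read end, while the measurement that counts the clipped bases places it well inside the read, so a read-position filter would treat the two very differently.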
Inspired by Mutect2 work on mitochondrial data, David reduced the tendency to assemble low-quality haplotypes by introducing “adaptive pruning” into the assembly of candidate haplotypes (see https://github.com/broadinstitute/gatk/docs/local_assembly.pdf for more information about the local assembly used). Mitochondrial data typically have very high depth (thousands of reads per base), so many positions will contain at least three matching sequencing errors by chance. Because the standard threshold for removing errors from the graph is two reads, those errors survive and the graph ends up with many more “variants”. The new solution uses local coverage to weigh the evidence for “pruning” likely erroneous paths from the graph. This improved the sensitivity for low allele fraction variants in mitochondrial genome data and has proven critical for mitochondrial exome analysis. This feature is off by default for HaplotypeCaller, but can be turned on in high-depth regions with dense variants. Samples with ploidy higher than two, and pooled experiments, should benefit greatly as well.
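The difference between a fixed threshold and a coverage-aware one can be sketched with a toy model. The code below is a simplified illustration under stated assumptions, not the GATK implementation: the function names, the assumed error rate of 0.001, and the log-odds cutoff are all hypothetical, and the score is a plain binomial likelihood ratio chosen for clarity.

```python
import math


def keep_fixed(support, min_support=2):
    """Fixed rule: keep a graph path if at least min_support reads back it."""
    return support >= min_support


def log_odds_vs_error(support, coverage, error_rate=0.001):
    """Toy log-likelihood ratio that a path is real rather than an error.

    Compares a binomial model at the observed fraction (support / coverage)
    against one at the assumed error rate. The binomial coefficient cancels.
    Returns 0.0 when the observed fraction does not exceed the error rate.
    """
    if coverage <= 0 or support <= 0:
        return 0.0
    fraction = support / coverage
    if fraction <= error_rate:
        return 0.0
    score = support * math.log(fraction / error_rate)
    if support < coverage:
        score += (coverage - support) * math.log((1 - fraction) / (1 - error_rate))
    return score


def keep_adaptive(support, coverage, error_rate=0.001, min_log_odds=2.3):
    """Coverage-aware rule: keep a path only if its evidence beats the cutoff."""
    return log_odds_vs_error(support, coverage, error_rate) >= min_log_odds


# Three supporting reads mean very different things at different depths.
for coverage in (30, 5000):
    print(coverage, keep_fixed(3), keep_adaptive(3, coverage))
```

Under these assumptions, three supporting reads pass the fixed rule at any depth, but the coverage-aware rule keeps them only where local depth is low enough for three reads to be a meaningful fraction of the data.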
Read visualization - based on per-read likelihood calculations - was improved in both tools by David’