Genotype Refinement workflow mathematical details

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2014-10-17

Overview

This document describes the mathematical details of the methods involved in the Genotype Refinement workflow. For an explanation of the purpose and general principles involved in this workflow, please see the main Genotype Refinement workflow article. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial.

1. Review of Bayes’s Rule

HaplotypeCaller outputs the likelihoods of observing the read data given that the genotype is actually HomRef, Het, and HomVar. To convert these quantities to the probability of the genotype given the read data, we can use Bayes’s Rule. Bayes’s Rule dictates that the probability of a parameter given observed data is equal to the likelihood of the observations given the parameter multiplied by the prior probability that the parameter takes on the value of interest, normalized by the prior times likelihood for all parameter values:

$$ P(\theta|Obs) = \frac{P(Obs|\theta)P(\theta)}{\sum_{\theta} P(Obs|\theta)P(\theta)} $$

In the best practices pipeline, we interpret the genotype likelihoods as probabilities by implicitly converting the genotype likelihoods to genotype probabilities using non-informative or flat priors, for which each genotype has the same prior probability. However, in the Genotype Refinement Pipeline we use independent data such as the genotypes of the other samples in the dataset, the genotypes in a “gold standard” dataset, or the genotypes of the other samples in a family to construct more informative priors and derive better posterior probability estimates.

2. Calculation of Population Priors

Given a set of samples in addition to the sample of interest (ideally non-related, but from the same ethnic population), we can derive the prior probability of the genotype of the sample of interest by modeling the sample’s alleles as two independent draws from a pool consisting of the set of all the supplemental samples’ alleles. (This follows rather naturally from the Hardy-Weinberg assumptions.) Specifically, this prior probability will take the form of a multinomial Dirichlet distribution parameterized by the allele counts of each allele in the supplemental population. In the biallelic case the priors can be calculated as follows:

$$ P(GT = HomRef) = \dbinom{2}{0} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 2)}{\Gamma(nSamples + 2)\Gamma(RefCount)} $$

$$ P(GT = Het) = \dbinom{2}{1} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 1)\Gamma(AltCount + 1)}{\Gamma(nSamples + 2)\Gamma(RefCount)\Gamma(AltCount)} $$

$$ P(GT = HomVar) = \dbinom{2}{2} \ln \frac{\Gamma(nSamples)\Gamma(AltCount + 2)}{\Gamma(nSamples + 2)\Gamma(AltCount)} $$

where Γ is the Gamma function, an extension of the factorial function.

The prior genotype probabilities based on this distribution scale intuitively with number of samples. For example, a set of 10 samples, 9 of which are HomRef yield a prior probability of another sample being HomRef with about 90% probability whereas a set of 50 samples, 49 of which are HomRef yield a 97% probability of another sample being HomRef.

3. Calculation of Family Priors

Given a genotype configuration for a given mother, father, and child trio, we set the prior probability of that genotype configuration as follows:

$$ P(GM,GF,G_C) = P(\vec{G}) \cases{ 1-10\mu-2\mu^2 & no MV \cr \mu & 1 MV \cr \mu^2 & 2 MVs} $$

where the 10 configurations with a single Mendelian violation are penalized by the de novo mutation probability μ and the two configurations with two Mendelian violations by μ^2. The remaining configurations are considered valid and are assigned the remaining probability to sum to one.

This prior is applied to the joint genotype combination of the three samples in the trio. To find the posterior for any single sample, we marginalize over the remaining two samples as shown in the example below to find the posterior probability of the child having a HomRef genotype:

$$ P(GC = HomRef | \vec{D}) = \frac{LC(GC = HomRef) \sum{GF,GM} LF(GF)LM(GM)P(\vec{G})}{\sum_{\vec{H}}P(\vec{D}|\vec{H})P(\vec{H})} $$

This quantity P(Gc|D) is calculated for each genotype, then the resulting vector is Phred-scaled and output as the Phred-scaled posterior probabilities (PPs).

4. Order of the workflow

Family priors are calculated and applied before population priors. The opposite ordering results in overly strong population priors because they are applied to the child and parents and then compounded when the trio likelihoods are multiplied together.

Updated on 2014-10-22

From nblarson on 2015-08-27

I believe section 2 is a bit confusing and contains errors. Firstly, it should probably be clarified that nSamples means total number of alleles (i.e., 2 times number of samples under ploidy = 2). Secondly, there is no natural log in the formulation of the Dirichlet-multinomial distribution, and it’s unclear why this is in the equations for the priors. To verify, the code below replicates the first example given in R:

#GATK Example 1 nSamples<-10 #Assuming 10th sample is het RefCount<-19 #As Written gatk.pr<-choose(2,0)*log(gamma(nSamples)*gamma(RefCount+2)/(gamma(nSamples+2)*gamma(RefCount))) gatk.pr [1] 1.239691 #Make nSamples 20 instead of 10 (nSamples really means number of alleles) nSamples<-20 gatk.pr2<-choose(2,0)*log(gamma(nSamples)*gamma(RefCount+2)/(gamma(nSamples+2)*gamma(RefCount))) gatk.pr2 [1] -0.1000835 #Get rid of log (this shouldn’t be here) gatk.pr3<-choose(2,0)*(gamma(nSamples)*gamma(RefCount+2)/(gamma(nSamples+2)*gamma(RefCount))) gatk.pr3 [1] 0.9047619

From Sheila on 2015-08-28

@nblarson

Hi,

Thanks for checking on the math! Usually equations scare people away! :smile:

I will make a ticket to check on this.

-Sheila