created by Geraldine_VdAuwera
on 2015-07-29
PL is a sample-level annotation calculated by GATK variant callers such as HaplotypeCaller, recorded in the FORMAT/sample columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample, as described here.
This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.
Contents
The basic formula for calculating PL is:
$$ PL = -10 * \log{P(Data | Genotype)} $$
where P(Data | Genotype)
is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of P(Data | Genotype)
is described in the genotyping section of the Haplotype Caller documentation.
Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.
The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.
Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).
Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by Haplotype Caller.
# Alleles Reference: A Read: T # Conditional probabilities calculated by HC P(AA | Data) = 0.000001 P(AT | Data) = 0.000100 P(TT | Data) = 0.010000
Calculate the raw PL values
We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:
| Genotype | A/A | A/T | T/T | |------------|------|------|-----| | Raw PL | -10 * log(0.000001) = 60 | -10 * log(0.000100) = 40 | -10 * log(0.010000) = 20 |
Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value. This is expected because, as described in the VCF FAQ, the PL is the likelihood of the genotype, which means (rather unintuitively if you’re not a stats buff) it is the probability that the genotype is not correct. So, low values mean a genotype is more likely, and high values means it’s less likely.
Normalize
At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to normalize the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:
| Genotype | A/A | A/T | T/T | |------------|------|------|-----| | Normalized PL | 60 - 20 = 40 | 40 - 20 = 20 | 20 - 20 = 0 |
We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.
Genotype quality
We actually formalize this assessment of genotype quality in the GQ annotation, as described also in the VCF FAQ.The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.
When you run HaplotypeCaller with -ERC GVCF
to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic <NON-REF>
allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).
The PL values for any possible genotype that includes the <NON-REF>
allele have to be calculated a little differently than what is explained above because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving <NON-REF>
. Instead, it uses base quality scores to model the genotype likelihoods.
Updated on 2018-05-11
From ryhui on 2019-02-26
Part 1 mentions P(Data | Genotype); shouldn’t it be P(Genotype | Data)?