Post date: Nov 29, 2019 8:38:50 PM
Mutect2 did not work.
We are now running bcftools to call variants on three different subsets of the data:
- the 100 individuals chosen for the panel of normals
- the Pando individuals identified as such after the first variant call on the whole dataset with read depth > 4
- Pando friends = everything that is not Pando, with no read depth filter
Next steps:
1- Filter all VCF files to remove spurious (non-real) SNPs.
2- Create the filter for heterozygosity.
Convert the genotype likelihoods from the Phred scale back to probabilities.
Each probability is between 0 and 1.
Sum the three genotype probabilities (the sum will be more than one) and normalize each probability by dividing it by this sum.
Then apply the filters:
If the probability of heteroZ is > 0.99, call it a heterozygote and code it as 1.
If the probability of heteroZ is < 0.01, call it a homozygote and code it as 0.
If the probability of heteroZ is between 0.01 and 0.99, assign "NA".
Then calculate the proportion of heteroZ for each allele and plot the number of heteroZ kept as a function of the cut.
3- Compare the mutations identified as "rare" (not true heteroZ) to the Pando Friends and the Panel of Normals. How many mutations are found in both, how many are unique?
4- Redo approach 2, but instead of dividing by the sum, divide by the prior: high if the mutation is not found in the other populations, low if it is (discuss this step again once 1-3 are completed).
5- Next steps, once we are happy with our variants: PCA, distance matrix to make a heat map, clusters, and a network approach to reconstruct the tree of the tree (the usual phylogenetic tree assumes that ancestors are dead, while our ancestors are alive. They would be seen as leaves when we want them to be connections).
6- Run the 4-gamete test for recombination. This is a control test, as we do not expect recombination.
7- Use an ABC model to estimate the age of the clone.
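
A minimal sketch of the heterozygosity filter from step 2, to make the arithmetic concrete. It assumes biallelic sites and that the three likelihoods are the VCF PL values in the order hom-ref, het, hom-alt. The function names (pl_to_probs, het_call, het_proportion, hets_kept_by_cut) are made up for illustration and are not part of bcftools or our pipeline, and reading the PL values out of the VCF is left out.

```python
def pl_to_probs(pl):
    """Convert three Phred-scaled likelihoods (hom-ref, het, hom-alt)
    to probabilities that sum to 1."""
    # Undo the Phred scaling: PL = -10 * log10(likelihood)
    raw = [10 ** (-p / 10) for p in pl]
    # The raw values do not sum to 1, so normalize by their sum
    total = sum(raw)
    return [r / total for r in raw]


def het_call(pl, hi=0.99, lo=0.01):
    """Return 1 (heterozygote), 0 (homozygote) or None (stands in for NA)."""
    p_het = pl_to_probs(pl)[1]
    if p_het > hi:
        return 1
    if p_het < lo:
        return 0
    return None


def het_proportion(calls):
    """Proportion of heterozygote calls among the non-NA calls."""
    kept = [c for c in calls if c is not None]
    return sum(kept) / len(kept) if kept else None


def hets_kept_by_cut(pl_list, cuts):
    """For each cut, count genotypes whose heterozygote probability exceeds it.
    These counts are the y-values for the 'heteroZ kept vs. cut' graph."""
    p_hets = [pl_to_probs(pl)[1] for pl in pl_list]
    return [sum(1 for p in p_hets if p > cut) for cut in cuts]


if __name__ == "__main__":
    # Hypothetical PL triplets, not real data
    pls = [[30, 0, 30], [0, 10, 30], [0, 30, 60]]
    calls = [het_call(pl) for pl in pls]
    print(calls)
    print(het_proportion(calls))
    print(hets_kept_by_cut(pls, [0.05, 0.5, 0.99]))
```

Here the NA genotypes are left out of the denominator when computing the proportion. That is one possible choice, and it should be revisited if most genotypes end up as NA.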