Post date: Jan 28, 2020 3:46:37 PM
Meeting with Zach over Skype. Below is the summary, along with answers to my questions.
- Drop mutations also found in PON and in friends from the set of SNPs in Pando: if they are commonly found they may not be somatic mutations, or they may sit at hypermutable sites, which would make it harder to reason about the genetic structure within Pando.
- Check the scale on the AC/AN graph: are there more blue lines than red lines? Zach was worried that the blue part was smaller than the red, when there should be the same number of SNPs in both.
- Thinking about triploidy: I was worried that the fact that Pando is mostly triploid affects the variant calling and genotype estimation algorithms. It may affect the genotype estimation algorithm, as it changes the probabilities of being heterozygous and homozygous. Redo the AC/AN graphs taking this into account. p becomes: AN + 1/2*AN (three copies of each chromosome instead of two).
- Next step is to transform the final filtered vcf file into a binary matrix of 1 and 0 (1 = has the mutation, 0 = does not have the mutation), then run a PCA on it.
How much of the variation is explained by PC1 and PC2? Do you have to go up to PC15 to explain a substantial percentage of the variation?
Relation to the geography of the clone: plot PC1 and PC2 on the map.
- Then clustering: PC scores, K-means
- Then tree-based thinking: hierarchical clustering
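A minimal sketch of the binary-matrix and PCA steps, to get a feel for the "variation explained" question. It assumes NumPy is available and uses made-up genotype calls; the helper `gt_to_binary` and the choice to score a missing call (`./.`) as 0 are my own assumptions, not something decided in the meeting.

```python
import numpy as np

def gt_to_binary(gt):
    """Return 1 if the call carries at least one alternate allele, else 0.
    Assumption: missing calls ('./.') are scored as 0."""
    alleles = gt.replace("|", "/").split("/")
    return int(any(a not in ("0", ".") for a in alleles))

# Hypothetical toy data: rows = samples, columns = SNPs
calls = [
    ["0/1", "0/0", "0/1", "0/0"],
    ["0/1", "0/0", "1/1", "0/0"],
    ["0/0", "0/1", "0/0", "0/1"],
    ["0/0", "1/1", "0/0", "./."],
    ["0/1", "0/1", "0/0", "0/0"],
]
X = np.array([[gt_to_binary(gt) for gt in row] for row in calls], dtype=float)

# PCA through the SVD of the column-centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Proportion of the total variance carried by each PC (sorted, largest first)
var_explained = S**2 / np.sum(S**2)

# PC scores: one row per sample, one column per PC
scores = U * S

print("variance explained per PC:", np.round(var_explained, 3))
print("PC1/PC2 scores per sample:")
print(np.round(scores[:, :2], 3))
```

The first two columns of `scores` are what would be plotted against the sampling coordinates, and `np.cumsum(var_explained)` shows how many PCs are needed to reach a given share of the variation. The same score matrix could then be passed to a clustering routine such as scikit-learn's `KMeans` or SciPy's `scipy.cluster.hierarchy.linkage`.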
Zach sent 2 papers to check for the Pando analysis.
Previous email exchanges:
I am trying to "shell" the 2014 paper on Lycaides, as it sounds like a wealth of information for the questions I would like to address in Pando. I have several questions as I read that I would love to discuss with you. I write them here, but there is no urgency in answering them. I will also have more as I advance. Thank you.
1 - Is what I plot here the allele frequency spectrum and the folded allele frequency spectrum? I used AC/AN as a quick and not super clean way of doing it. Would it be interesting to do it from the MLE of allele frequencies?
2- Fig 2b - I was surprised to see "allele counts" on the x-axis. Maybe it means I did not really understand. Could you have used frequency instead, based on the population size? Also, what is the purpose of the null model here?
3- Moran's I correlogram: is that what you meant by the Moran model, as one of the analyses I could do for genetic distance versus spatial distance?
4- Nei's genetic distance: I understood it as another genetic diversity estimator, used to compare different populations: is that why it cannot really be useful for the Pando project, where we compare individuals to individuals (not populations)?
5- The introduction of the paper posits a big genetic question, and then uses Lycaides as a model organism to answer this question. The organism is described in more detail in the methods. I like this way of framing things. Do you think it is compatible with the Pando project (where the organism is the "heart" of the study)? How much can we extend to a broader context, and is it even the purpose of the work? Do we need to wait to see what the data says before we can frame the "big question" we would like to address?
Here are my thoughts on your first set of questions:
- I think it makes sense to filter out the SNPs found also in the PON and friends.
- What you did for heterozygosity sounds right. Let me dig in some on the graph.
Zach
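The filtering step agreed on above could look something like this. It is only a sketch: SNPs are keyed here on a hypothetical `(chromosome, position)` tuple, and the toy positions are invented. Matching on the alternate allele as well may be preferable.

```python
def drop_shared_snps(pando, *other_sets):
    """Keep only the Pando SNPs whose key is absent from every other set."""
    shared = set().union(*other_sets)
    return [snp for snp in pando if snp not in shared]

# Hypothetical toy positions, keyed as (chromosome, position)
pando_snps = [("chr1", 101), ("chr1", 250), ("chr2", 77), ("chr3", 900)]
pon_snps = {("chr1", 250), ("chr5", 12)}
friends_snps = {("chr3", 900)}

kept = drop_shared_snps(pando_snps, pon_snps, friends_snps)
n_removed = len(pando_snps) - len(kept)

print(f"removed {n_removed} of {len(pando_snps)} SNPs; kept: {kept}")
```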
On Jan 17, 2020, at 12:08 PM, Pineau, Rozenn M <rpineau3@gatech.edu> wrote:
Hi Zach,
Thank you for your answer. I have a few more questions as I am a bit lost now.
I checked how many of the SNPs considered somatic mutations in Pando are also found in the other two populations we created (friends and PON). There are 421 such SNPs out of 9754 when I set the lower limit for somatic mutations at .5.
Does that mean we should not consider them in the analysis? Are they "more commonly happening" mutations?
I was trying to look at observed versus expected heterozygosity (2pq). I am a bit stuck. What I have taken as 2pq is 2*(AC/AN)*(1-AC/AN), the "expected" heterozygosity, calculated from the estimated allele frequency. Then I looked at the genotype calls (1/1, 0/1, etc.) in the vcf file, and counted the number of actually observed heterozygotes. I obtain the following graph, where He is the expected heterozygosity and Ho the observed one. It does not look quite right. Could you help me understand what I did wrong?
Once I have a set of SNPs we decide to work with, what are the next steps I should be taking? I am sorry to be asking so many questions. I think it would help me if we think of some steps I could take, so that I can read about how to go now.
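For reference, a small sketch of the He versus Ho comparison described in the email, with a ploidy argument to cover the triploidy concern from the meeting notes. The function names and toy genotypes are hypothetical. The expected value uses the binomial (random union of gametes) expectation 1 - p^n - q^n, which equals 2pq for diploids and 3pq for triploids. A clone carrying somatic mutations is not expected to follow these proportions, so this is a baseline for comparison rather than a prediction.

```python
def split_gt(gt):
    """Split a VCF genotype string such as '0/1' or '0|1' into its alleles."""
    return gt.replace("|", "/").split("/")

def alt_freq(genotypes):
    """AC/AN computed directly from the called alleles (missing calls skipped)."""
    alleles = [a for g in genotypes for a in split_gt(g) if a != "."]
    return sum(a != "0" for a in alleles) / len(alleles)

def expected_het(p, ploidy=2):
    """Expected heterozygote fraction under random union of gametes: 1 - p^n - q^n."""
    q = 1.0 - p
    return 1.0 - p**ploidy - q**ploidy

def observed_het(genotypes):
    """Fraction of called genotypes that carry more than one distinct allele."""
    called = [split_gt(g) for g in genotypes]
    called = [a for a in called if "." not in a]
    return sum(len(set(a)) > 1 for a in called) / len(called)

# Hypothetical calls for one SNP across five samples
genotypes = ["0/1", "0/1", "0/0", "1/1", "./."]

p = alt_freq(genotypes)             # AC/AN
he_diploid = expected_het(p, 2)     # equals 2pq
he_triploid = expected_het(p, 3)    # equals 3pq
ho = observed_het(genotypes)

print(f"p={p:.2f}  He(2n)={he_diploid:.2f}  He(3n)={he_triploid:.2f}  Ho={ho:.2f}")
```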