Post date: Jun 25, 2020 8:42:8 PM
Email from me:
Hi Zach,
I went back to the first analysis we did (finding Pando samples among all samples) when I realized I had "centered" the data but not "scaled" them when doing the PCA. I first thought it was not necessary to scale the data as all variables had the same unit.
I preferred to check, so I ran the PCA again. Good news, the message is the same. I would like to know what you think about this scaling part. Scaling adjusts all variables to the same variance, but it seems that here it is actually the variance between Pando and Friends that allows us to separate them. Does that make the scaling more important, or not necessary?
Thanks!
Rozenn.
Zach's answer:
Hi Rozenn,
As you note, when variables have different units you always want to scale them. When they have the same units, either way can be ok. It depends on whether you want each variable to contribute equally, or in proportion to its variability. In our case, not scaling means intermediate frequency SNPs/mutations are weighted more heavily than rare ones. Scaling makes them all count the same. Neither is right or wrong, but my general preference with genetic data is not to scale them.
Zach
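To illustrate the point in Zach's answer, here is a minimal sketch in Python with NumPy (not the code used for the actual analysis). It assumes genotypes are coded as 0/1/2 allele counts in a samples x SNPs matrix and uses simulated data. The function name `pca_scores` and the allele frequencies are hypothetical. Under that coding, a SNP with frequency p has variance of roughly 2p(1-p), so intermediate frequency SNPs vary more than rare ones and carry more weight when the PCA is centered but not scaled.

```python
import numpy as np

def pca_scores(genotypes, scale=False, n_components=2):
    """PCA scores for a samples x SNPs matrix (hypothetical helper).

    The data are always centered; they are scaled to unit variance
    only when scale=True.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                # centering
    if scale:
        sd = X.std(axis=0, ddof=1)
        keep = sd > 0                     # monomorphic SNPs cannot be scaled
        X = X[:, keep] / sd[keep]         # every SNP now has variance 1
    # PCA via singular value decomposition of the prepared matrix
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Simulated data (an assumption for illustration only):
# 200 samples, 4 SNPs with made-up allele frequencies.
rng = np.random.default_rng(0)
freqs = np.array([0.02, 0.05, 0.3, 0.5])
genotypes = rng.binomial(2, freqs, size=(200, freqs.size))

# Per-SNP variance: largest for the intermediate frequency SNPs,
# which is why they dominate an unscaled PCA.
variances = genotypes.var(axis=0, ddof=1)
print("per-SNP variance:", variances.round(3))

unscaled = pca_scores(genotypes, scale=False)  # weight proportional to variance
scaled = pca_scores(genotypes, scale=True)     # all SNPs count the same
```

Plotting the first two columns of `unscaled` and `scaled` side by side is one way to check, as in the email above, whether the message stays the same with and without scaling.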