Post date: Apr 22, 2020 1:18:58 PM
- Choosing a distance and a clustering method
Distance
- I choose Manhattan distance because it gives the most robust results across datasets.
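To make the distance concrete, here is a minimal sketch of the Manhattan distance between two samples, written in plain Python rather than the R used for the actual analysis. The sample vectors are hypothetical per-SNP values invented for illustration; they are not data from this project.

```python
def manhattan(a, b):
    """Manhattan distance: sum of absolute per-SNP differences between two samples."""
    if len(a) != len(b):
        raise ValueError("both samples must cover the same SNPs")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical per-SNP values for three samples (illustration only).
sample_a = [0.0, 1.0, 0.5, 0.0]
sample_b = [0.0, 0.0, 0.5, 1.0]
sample_c = [0.0, 1.0, 0.5, 0.25]

print(manhattan(sample_a, sample_b))  # 2.0
print(manhattan(sample_a, sample_c))  # 0.25
```

Each SNP contributes its absolute difference directly, so one large difference is not amplified the way it would be with a squared (Euclidean) distance.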
Clustering methods
Interesting: “One of the problems with Cluster Analysis is that different methods may produce different results – generally no accepted best method. Good News: If your data really has clear groups all methods will find them and give you similar results. Therefore it is best to try multiple algorithms and see what groups logically make sense” https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides-clusteranalysis.pdf
"ward.D" = Ward-type agglomeration applied to the dissimilarities as given; it corresponds to Ward's minimum variance criterion only if the input dissimilarities are already squared
"ward.D2" = Ward's minimum variance method; the dissimilarities are squared before the cluster update. At each merge it minimizes the increase in within-cluster variance, so this one might be more interesting to keep.
"single" = nearest-neighbour method: the distance between two clusters is defined as the minimum distance between an observation in one cluster and an observation in the other cluster
"complete" = furthest-neighbour method: the distance between two clusters is defined as the maximum distance between an observation in one cluster and an observation in the other cluster. At each step, the two clusters with the smallest such maximum distance are merged. I do not like this one because it may give importance to an unimportant difference at a single SNP.
"average" = distance between two clusters is defined as the mean distance between an observation in one cluster and an observation in the other cluster
"mcquitty" = when two clusters are joined, the distance from the new cluster to any other cluster is calculated as the average of the distances from the two joined clusters to that other cluster
"median" = like "centroid", but the centre of a merged cluster is taken as the midpoint between the centres of the two joined clusters, regardless of their sizes
"centroid" = distance between two clusters is defined as the distance between their centroids (group means)
Maybe the best is really to keep trying different clustering methods.
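As a rough illustration of "try several methods and compare", the sketch below runs a naive agglomerative clustering with three of the linkage rules described above ("single", "complete", "average") on a Manhattan distance and checks whether they agree. It is plain Python written for this note, not the R `hclust` code used in the analysis, and the data points are hypothetical. In R the equivalent would be to call `hclust` with different `method` values and compare the `cutree` results.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linkage_distance(c1, c2, points, method):
    """Distance between two clusters, each given as a list of point indices."""
    pair_d = [manhattan(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # nearest neighbour
        return min(pair_d)
    if method == "complete":    # furthest neighbour
        return max(pair_d)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError(f"unknown method: {method}")


def cluster(points, k, method):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage_distance(
                clusters[ab[0]], clusters[ab[1]], points, method
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    # Canonical ordering so partitions from different methods can be compared.
    return sorted(sorted(c) for c in clusters)


# Hypothetical data: two well-separated groups of three samples each.
points = [
    [0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.1],
    [1.0, 0.9, 1.0],
    [0.9, 1.0, 1.0],
    [1.0, 1.0, 0.9],
]

results = {m: cluster(points, 2, m) for m in ("single", "complete", "average")}
for method, partition in results.items():
    print(method, partition)

all_agree = len({str(p) for p in results.values()}) == 1
print("all methods agree:", all_agree)
```

With clearly separated groups like these, all three rules recover the same two clusters, which matches the quoted remark. On real data, disagreement between methods is a hint that the groups are less clear-cut.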
Checking the scripts:
We start off by eliminating SNPs that are found in each of the three groups (PONs, Pando, friends). Why is this important?
Possible answer:
- We check which mutations are shared between the two groups we created as comparisons, PON and friends.
- We get rid of these shared mutations (sequencing errors or hypermutable states that would perturb our signal).
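The filtering idea can be sketched with plain set operations. This is only an illustration in Python, not the actual script: the SNP identifiers are made up, and it assumes the shared PON/friends mutations are then removed from the Pando calls, which is one reading of the steps above.

```python
# Hypothetical SNP identifiers (chromosome:position) called in each group.
pon_snps = {"chr1:1050", "chr1:2200", "chr2:310", "chr3:77"}
friends_snps = {"chr1:2200", "chr2:310", "chr4:900"}
pando_snps = {"chr1:2200", "chr2:310", "chr3:77", "chr5:42"}

# Mutations present in both comparison groups: likely sequencing errors
# or hypermutable states rather than real signal.
shared = pon_snps & friends_snps

# Assumption: the shared mutations are dropped from the Pando calls.
pando_filtered = pando_snps - shared

print(sorted(shared))          # ['chr1:2200', 'chr2:310']
print(sorted(pando_filtered))  # ['chr3:77', 'chr5:42']
```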
What I have started to do:
I have found a mistake in the script I used to extract the hets probabilities, so I am restarting the analysis from the beginning.