(Dis)similarity & distance
Choosing the right measure
Your choice of (dis)similarity measure is likely to have a major impact on your results. Understanding how each measure treats your data, and which one is suitable, is an essential part of many analyses. This page discusses some of these measures. If you are unable to decide on a measure, consider using our (dis)similarity wizard to help you decide which sort of measure may be most appropriate.
(Dis)similarity, distance, and dependence measures are powerful tools for determining ecological association and resemblance. Choosing an appropriate measure is essential, as it will strongly affect how your data are treated during analysis and what kinds of interpretation are meaningful. Non-metric multidimensional scaling, principal coordinate analysis, and cluster analysis are examples of analyses that are strongly influenced by the choice of (dis)similarity measure. Note that while these measures may draw out certain types of relationships in your raw data, they may do so at the expense of other information present therein. Below, several key measures for assessing ecological resemblance are introduced. For a more complete overview, see chapter seven in Legendre & Legendre's Numerical Ecology (1998). For a critical view on the use of dissimilarity and distance, see Warton et al. (2012).
When choosing a distance measure, ensure that the measure reflects the ecological relationships you are concerned with. Further, some measures have mathematical properties that make them unsuitable for certain analyses. Similarly, certain analyses will only produce meaningful results when certain measures are used. If a measure listed below sounds suited to your data, use more detailed resources to learn about its properties and limitations before drawing any conclusions from analyses based upon it. The list below is not exhaustive, but aims to familiarise you with a set of commonly used measures and their uses.
Key terminology
The following terminology will be used in the measure descriptions.
Q mode similarity measures
As noted above, similarity measures (S) are never metric, so objects cannot be ordinated in a metric or Euclidean space on the basis of their similarities. Converting similarities to distances allows such ordination. This can be done simply by taking the one-complement (1 - S) or its square root, √(1 - S). A few common measures are described below. For an extensive overview, see Legendre and Legendre (1998).
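The conversion from similarity to dissimilarity can be sketched in a few lines of Python. The function name `similarity_to_distance` and its `sqrt` flag are hypothetical, chosen only for illustration, and the sketch assumes that S lies between 0 and 1.

```python
import math


def similarity_to_distance(s, sqrt=False):
    """Convert a similarity S in [0, 1] to a dissimilarity.

    Returns the one-complement 1 - S, or its square root when sqrt=True.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    d = 1.0 - s
    return math.sqrt(d) if sqrt else d


print(similarity_to_distance(0.75))             # 0.25
print(similarity_to_distance(0.75, sqrt=True))  # 0.5
```

Whether the plain one-complement or its square root is preferable depends on the coefficient and on the analysis that follows, so check the properties of your chosen measure before converting.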
Binary measures
Binary measures are appropriate for data sets where variables can only take the values "1" or "0", such as presence/absence data sets.
Simple-matching coefficient
This coefficient gives equal weight to both forms of match, double zeros and double ones, and is thus a symmetrical coefficient.
Jaccard coefficient
This coefficient excludes double zeros, giving equal weight to non-zero agreements ("1", "1") and disagreements ("1", "0" and "0", "1") when comparing two objects. Given a "sites x species" matrix, the Jaccard coefficient can be used to express species/OTU turnover.
Sørensen / Dice coefficient
This coefficient is similar to the Jaccard coefficient but gives double weight to non-zero agreements. This asserts that the co-occurrence or coincidence of variable states among objects is more informative or important than disagreements. It is based on the logic of the harmonic mean and is thus suitable for data sets with large-valued outliers. It may, however, increase the influence of small-valued outliers.
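As a rough illustration, the three binary coefficients can be computed from the four possible match counts for a pair of objects. The letters a (double ones), b and c (the two kinds of disagreement), and d (double zeros) are a notation assumed here for the sketch, and the two presence/absence vectors are made-up values rather than real data.

```python
def match_counts(x, y):
    """Count the four match states for two equal-length binary vectors.

    a: (1, 1) pairs, b: (1, 0) pairs, c: (0, 1) pairs, d: (0, 0) pairs.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d


def simple_matching(x, y):
    """Double ones and double zeros both count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Double zeros are excluded (undefined if both vectors are all zeros)."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c)


def sorensen(x, y):
    """Like Jaccard, but double ones receive double weight."""
    a, b, c, _ = match_counts(x, y)
    return 2 * a / (2 * a + b + c)


# Hypothetical presence/absence records for eight species at two sites.
site_1 = [1, 1, 0, 0, 1, 0, 0, 0]
site_2 = [1, 0, 0, 0, 1, 1, 0, 0]

print(match_counts(site_1, site_2))     # (2, 1, 1, 4)
print(simple_matching(site_1, site_2))  # 0.75
print(jaccard(site_1, site_2))          # 0.5
print(sorensen(site_1, site_2))
```

In this example the four shared absences raise the simple-matching value well above the Jaccard value, which shows how much the treatment of double zeros can matter for sparse data.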
Other binary measures are available which treat double-zero agreements, double-one agreements, and disagreements differently for a variety of reasons. Consider carefully if any special meaning is indicated by the different matching states of the binary variables in your data set and ensure that the measure chosen adequately reflects these.
Quantitative measures
Quantitative coefficients take into account values other than "0" and "1". Some quantitative measures lessen the effect of relatively large or small variable values in a data set to preserve overall interpretability. However, other measures are sensitive to large quantitative differences and perform better on transformed data.
Q mode dissimilarity and distance measures
There are three groups of dissimilarity measures: metric, semimetric, and nonmetric. See the "Key terminology" section of this page for definitions.
Metric distances
Semimetric measures
As described above, semimetric measures do not always satisfy the triangle inequality and hence cannot be fully relied upon to represent dissimilarities in a Euclidean space without appropriate transformation. That said, they often do behave metrically and can be used in principal coordinates analysis (following an adjustment for negative eigenvalues if necessary) and non-metric multidimensional scaling.
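A small numerical check can make the triangle inequality issue concrete. The sketch below uses the Bray-Curtis dissimilarity (Bray and Curtis 1957) as an assumed example of a semimetric measure; the three two-species "sites" are made-up values chosen to produce a violation.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum of |x_i - y_i| over sum of (x_i + y_i)."""
    numerator = sum(abs(i - j) for i, j in zip(x, y))
    denominator = sum(i + j for i, j in zip(x, y))
    return numerator / denominator


# Hypothetical abundances of two species at three sites.
site_a = [1, 0]
site_b = [0, 1]
site_c = [1, 1]

d_ab = bray_curtis(site_a, site_b)  # 2 / 2
d_ac = bray_curtis(site_a, site_c)  # 1 / 3
d_bc = bray_curtis(site_b, site_c)  # 1 / 3

# The triangle inequality requires d_ab <= d_ac + d_bc.
violates_triangle_inequality = d_ab > d_ac + d_bc
print(d_ab, d_ac + d_bc, violates_triangle_inequality)
```

Here the direct dissimilarity between sites A and B exceeds the sum of their dissimilarities to site C, so the three sites cannot be drawn as a triangle in Euclidean space without distortion.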
Nonmetric measures
As noted by Legendre and Legendre (1998), nonmetric dissimilarity measures, such as a binary coefficient proposed by Kulczynski (1928) which is the quotient of double presences and disagreements, may assume negative values. As negative dissimilarities are intuitively nonsensical, they are problematic for interpretation. In general, these should be avoided unless there is a very clear reason to use them.
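To see how a negative value can arise, the sketch below reads the description above literally as S = a / (b + c), where a is the number of double presences and b + c the number of disagreements. Taking the dissimilarity as the one-complement 1 - S is an assumption carried over from the similarity section, and the counts are made up for illustration.

```python
def kulczynski_quotient(a, b, c):
    """Quotient of double presences (a) and disagreements (b + c)."""
    if b + c == 0:
        raise ValueError("undefined when there are no disagreements")
    return a / (b + c)


# Hypothetical pair of sites: three shared species, one unshared species.
s = kulczynski_quotient(a=3, b=1, c=0)
d = 1 - s  # assumed one-complement conversion to a dissimilarity

print(s)  # 3.0
print(d)  # -2.0
```

Because the quotient has no upper bound, its one-complement falls below zero whenever double presences outnumber disagreements.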
R mode measures of dependence
R mode measures express the relationships between variables. With some exceptions, Q-mode measures are generally not useful or meaningful in R-mode analysis. See Legendre and Legendre (1998) and Ludwig and Reynolds (1988) for an explanation of what constitutes a permissible R-mode measure. Often, R-mode measures are referred to as dependence coefficients as they express how much the values of one variable can be said to depend on the states of another variable. Well-known correlation measures are examples of R mode measures.
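As an illustration of the R-mode view, the sketch below computes the Pearson product-moment correlation between variables (columns) rather than between objects (rows). The small "sites x species" table holds made-up values, and the helper name `pearson` is chosen only for this example.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((i - mean_x) * (j - mean_y) for i, j in zip(x, y))
    ss_x = sum((i - mean_x) ** 2 for i in x)
    ss_y = sum((j - mean_y) ** 2 for j in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical table: rows are sites (objects), columns are species (variables).
abundances = [
    [1, 2, 9],
    [2, 4, 7],
    [3, 6, 5],
    [4, 8, 3],
]

# R mode compares columns, so transpose the table first.
species = list(zip(*abundances))

r_12 = pearson(species[0], species[1])  # species 2 rises with species 1
r_13 = pearson(species[0], species[2])  # species 3 falls as species 1 rises
print(round(r_12, 6), round(r_13, 6))
```

A Q-mode analysis of the same table would instead compare its rows, which is why a measure suited to one mode is not automatically meaningful in the other.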
Implementations
R
vegdist() in the vegan package
dist() in the package
distance() or bcdist() in the ecodist package
daisy() in the cluster package can compute a Gower index for both quantitative and categorical variables
References
Bray JR, Curtis JT (1957). An ordination of upland forest communities of southern Wisconsin. Ecol Monogr. 27:325-349.
Legendre P, Legendre L. Numerical Ecology. 2nd ed. Amsterdam: Elsevier, 1998. ISBN 978-0444892508.
Gower JC (1971) A General Coefficient of Similarity and Some of Its Properties. Biometrics. 27(4):857-871
Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif. 3(1): 5-48.
Hellinger E (1909) Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J Reine Angew Math. 136:210–271.
Kulczynski S (1928) Die Pflanzenassoziationen der Pieninen. Bull Int Acad Pol Sci Lett Cl Sci Math Nat Ser B. Suppl. II (1927):57-203.
Lance GN, Williams WT (1966) Computer programs for hierarchical polythetic classification (“similarity analysis”). Comput J. 9:60-64.
Ludwig JA, Reynolds JF. Statistical ecology: A primer on methods and computing. New York: Wiley, 1988.
Mahalanobis PC (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India. 2(1):49–55.
Orlóci L (1967) An agglomerative method for classification of plant communities. J Ecol. 55:193-205.
Pearson K (1926) On the coefficient of racial likeness. Biometrika 18:105-117.
Penrose LS (1952) Distance, size and shape. Ann Eugen. 17(1):337-343.
Rao CR. The use of Hellinger distance in graphical displays of contingency table data. In: Multivariate Statistics and Matrices in Statistics: Proceedings of the 5th Tartu Conference, Tartu, Pühajärve, Estonia, 23-25 May 1994. Tiit EM, Kollo T, Niemi H (Eds.) Zeist: VSP BV, 1995. ISBN 90-6764-195-2.
Warton DI, Wright ST, Wang Y (2012). Distance-based Multivariate Analyses Confound Location and Dispersion Effects. Methods Ecol Evol. 3(1):89–101.